MAPREDUCE
Haiping Wang ctqlwhp1022@163.com
OUTLINE
MapReduce Framework
MapReduce implementation on Hadoop
Join algorithms using MapReduce
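[Figure: word-count example. Input lines such as "ah ah er", "ah if or", "or uh or" and "ah if" are passed to map(String inputkey, String inputvalue), which emits (word, 1) pairs such as ah:1, if:1, or:1. The pairs are grouped by key into (ah), (er), (if), (or), (uh) for the reducers.]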
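The word-count flow in the figure can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the names map_fn, reduce_fn and word_count are hypothetical, and the way the input words are split into lines is an assumption read off the figure.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in input_value.split():
        yield word, 1

def reduce_fn(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    # Shuffle step: group the mapper output by key.
    groups = defaultdict(list)
    for line_no, line in enumerate(lines):
        for word, one in map_fn(line_no, line):
            groups[word].append(one)
    # Reduce step: one call per distinct word, in sorted key order.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

# Input lines assumed from the figure.
counts = word_count(["ah ah er", "ah if or", "or uh or", "ah if"])
print(counts)
```

The grouping dictionary stands in for the shuffle phase; in a real cluster that step moves data between machines.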
[Figure: MapReduce on Hadoop. Components shown: JobTracker, TaskTracker, InputFormat, Record Reader, Mapper, Partitioner, Copy, Sorter, Reducer, Record Writer, OutputFormat.]
VLDB09,EDBT2010
[Figure: data flow from splits to mappers, from mappers to reducers, and from reducers to mergers.]
Connections
A(ma, ra), B(mb, rb), r mergers
Suppose ra = rb = r
Map->Reduce connections = ra*ma + rb*mb = r*(ma+mb)
Reduce->Merge: in the one-to-one case, connections = 2r
Matcher: compares tuples to see if they should be merged or not
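The connection counts above are simple arithmetic, sketched here for the case ra = rb = r. The function name connection_counts and the sample values are hypothetical.

```python
def connection_counts(ma, mb, r):
    """Connection counts for datasets A and B when ra = rb = r.

    ma, mb: number of mappers for A and B
    r:      number of reducers per dataset, also the number of mergers
    """
    map_to_reduce = r * ma + r * mb   # ra*ma + rb*mb = r*(ma+mb)
    reduce_to_merge = 2 * r           # one-to-one case: each merger reads one A and one B reducer
    return map_to_reduce, reduce_to_merge

# Hypothetical sizes: 4 mappers for A, 2 mappers for B, 3 reducers each.
m2r, r2m = connection_counts(ma=4, mb=2, r=3)
print(m2r, r2m)
```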
Conclusion
Uses multiple map-reduce jobs
The partitioner may cause a data skew problem
The values of ma, ra, mb, rb and r (and whether ra = rb) determine the number of connections
L, R and the join result are stored in DFS. Scans are used to access L and R. Each map or reduce task can optionally implement two additional functions: init() and close(). These functions are called before and after each map or reduce task, respectively.
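The init() and close() hooks can be pictured with a small Python sketch. It is an illustration under assumptions, not the framework's API: the class CountingMapTask and the driver run_task are hypothetical, and letting close() emit output is a design choice made here for the example.

```python
class CountingMapTask:
    """A map task with optional init() and close() hooks."""

    def init(self):
        # Runs once, before the first record of the task.
        self.seen = 0

    def map(self, key, value):
        self.seen += 1
        yield key, value

    def close(self):
        # Runs once, after the last record of the task.
        yield "records_seen", self.seen

def run_task(task, records):
    """Hypothetical driver: init(), then map() per record, then close()."""
    out = []
    task.init()
    for key, value in records:
        out.extend(task.map(key, value))
    out.extend(task.close())
    return out

result = run_task(CountingMapTask(), [("a", 1), ("b", 2)])
print(result)
```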
REPARTITION JOIN(HIVE)
input
L: Ratings.dat
1::1193::5::978300760
1::661::3::978302109
1::661::3::978301968
1::661::4::978300275
1::1193::5::97882429
R: movies.dat
661::James and the Giant
914::My Fair Lady..
1193::One Flew Over the
2355::Bugs Life, A
3408::Erin Brockovich
map
Pairs: (key, tagged record)
1193, L:1::1193::5::978300760
661, L:1::661::3::978302109
661, L:1::661::3::978301968
661, L:1::661::4::978300275
1193, L:1::1193::5::97882429
661, R:661::James and the Gla
914, R:914::My Fair Lady..
1193, R:1193::One Flew Over
2355, R:2355::Bugs Life, A
3408, R:3408::Erin Brockovi
shuffle
Group by join key
reduce
Buffers records into two sets according to the table tag, then computes the cross-product
output
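The repartition join above can be sketched in Python as a single-process simulation. The function names are hypothetical, the sample records are a subset of the slide's data with titles truncated as shown there, and the field positions (movie id second in Ratings.dat, first in movies.dat) are read off the example.

```python
from collections import defaultdict
from itertools import product

def map_ratings(line):
    """L side: tag each rating with 'L', keyed on the movie id (second field)."""
    yield line.split("::")[1], ("L", line)

def map_movies(line):
    """R side: tag each movie with 'R', keyed on the movie id (first field)."""
    yield line.split("::")[0], ("R", line)

def reduce_join(key, tagged_records):
    """Buffer records into two sets by table tag, then emit the cross-product."""
    left = [rec for tag, rec in tagged_records if tag == "L"]
    right = [rec for tag, rec in tagged_records if tag == "R"]
    for l_rec, r_rec in product(left, right):
        yield key, l_rec, r_rec

def repartition_join(ratings, movies):
    groups = defaultdict(list)          # shuffle: group by join key
    for line in ratings:
        for key, value in map_ratings(line):
            groups[key].append(value)
    for line in movies:
        for key, value in map_movies(line):
            groups[key].append(value)
    out = []
    for key in sorted(groups, key=int):
        out.extend(reduce_join(key, groups[key]))
    return out

ratings = ["1::1193::5::978300760", "1::661::3::978302109"]
movies = ["661::James and the Giant", "914::My Fair Lady..", "1193::One Flew Over the"]
joined = repartition_join(ratings, movies)
for row in joined:
    print(row)
```

Note that reduce_join holds both sets in memory for every key, which is exactly the weakness when a key has many records.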
The reducer may run out of memory when
The key cardinality is small
The data is highly skewed
Improvement
Output key is changed to a composite of the join key and the table tag
Hash code is computed from just the join-key part of the composite key
Records are grouped on just the join key
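The improvement can be sketched as follows. This is a simulation under assumptions: the tag values (0 for R, 1 for L) and the choice to sort R records first so that only R is buffered are not stated on the slide, and the names partition, improved_reduce and improved_join are hypothetical. Integer join keys are used so that the hash partitioning is deterministic.

```python
from itertools import groupby

R_TAG, L_TAG = 0, 1   # assumed tags; R sorts before L within a join key

def partition(composite_key, num_reducers):
    """Hash only the join-key part of the composite (join_key, tag) key."""
    join_key, _tag = composite_key
    return hash(join_key) % num_reducers

def improved_reduce(records):
    """records: list of ((join_key, tag), record) routed to one reducer."""
    records = sorted(records, key=lambda kv: kv[0])          # sort on the full composite key
    for join_key, group in groupby(records, key=lambda kv: kv[0][0]):  # group on join key only
        buffered_r = []
        for (_key, tag), rec in group:
            if tag == R_TAG:
                buffered_r.append(rec)       # only R records are buffered
            else:
                for r_rec in buffered_r:     # L records are streamed
                    yield join_key, rec, r_rec

def improved_join(left, right, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for key, rec in right:
        buckets[partition((key, R_TAG), num_reducers)].append(((key, R_TAG), rec))
    for key, rec in left:
        buckets[partition((key, L_TAG), num_reducers)].append(((key, L_TAG), rec))
    out = []
    for bucket in buckets:
        out.extend(improved_reduce(bucket))
    return out

left = [(661, "l1"), (661, "l2"), (1193, "l3")]
right = [(661, "r1"), (914, "r2"), (1193, "r3")]
result = improved_join(left, right)
print(sorted(result))
```

Because the tag is part of the sort key but not of the partition or grouping key, all records for a join key still meet at one reducer, with one table's records arriving ahead of the other's.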
The communication cost of a process is the size of the input to the process
This paper does not count the output size for a process
The output must be input to at least one other process
The final output is much smaller than its input
The total communication cost is the sum of the communication costs of all processes that constitute an algorithm
The elapsed communication cost is defined on the acyclic graph of processes
Consider a path through this graph, and sum the communication costs of the processes along that path
The maximum sum, over all paths, is the elapsed communication cost
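The two cost measures can be computed on a small process graph, as in this sketch. The process names, input sizes and edges are hypothetical; the functions total_cost and elapsed_cost simply follow the definitions above.

```python
from functools import lru_cache

def total_cost(cost):
    """Total communication cost: sum of the input sizes of all processes."""
    return sum(cost.values())

def elapsed_cost(cost, successors):
    """Elapsed communication cost: maximum path sum in the acyclic process graph."""
    @lru_cache(maxsize=None)
    def heaviest_path_from(p):
        tails = (heaviest_path_from(q) for q in successors.get(p, ()))
        return cost[p] + max(tails, default=0)
    return max(heaviest_path_from(p) for p in cost)

# Hypothetical graph: two map processes feeding two reduce processes.
cost = {"map1": 10, "map2": 20, "reduce1": 15, "reduce2": 5}
successors = {"map1": ("reduce1", "reduce2"), "map2": ("reduce1", "reduce2")}

total = total_cost(cost)
elapsed = elapsed_cost(cost, successors)
print(total, elapsed)
```

Here the heaviest path is map2 followed by reduce1, so the elapsed cost is smaller than the total cost whenever processes run in parallel.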
[Figure: two-way join R(A,B) ⋈ S(B,C). Tuples (a, b) of R and (b, c) of S are keyed on B, partitioned and sorted, and each key b yields b->(a, c) pairs.]
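[Figure: three-way join of R(A,B), S(B,C) and T(C,D) in one map-reduce job. Labels: Map, Reduce, Reduce input, Final output. Reduce processes form a grid indexed by (h(b), h(c)), each hash taking values 0 to 3. The example shows an S tuple with h(S.b) = 2 and h(S.c) = 1, an R tuple with h(R.b) = 2, and a T tuple with h(T.c) = 1. Each Reduce process computes the join of the tuples it receives.]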
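The routing in the figure can be sketched in Python. This is a single-process illustration under assumptions: a 4 x 4 grid of Reduce processes, integer attribute values so that hashing is deterministic, and no duplicate tuples. The names h and three_way_join are hypothetical. An S tuple goes to exactly one process, while R and T tuples are replicated to every process that could hold a matching S tuple.

```python
from collections import defaultdict

def h(value, buckets):
    """Hash an attribute value to a bucket number."""
    return hash(value) % buckets

def three_way_join(R, S, T, nb=4, nc=4):
    """Join R(A,B), S(B,C), T(C,D) on an nb x nc grid of Reduce processes."""
    cells = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for a, b in R:                       # replicate R to all cells with this h(b)
        for j in range(nc):
            cells[(h(b, nb), j)]["R"].append((a, b))
    for b, c in S:                       # S goes to the single cell (h(b), h(c))
        cells[(h(b, nb), h(c, nc))]["S"].append((b, c))
    for c, d in T:                       # replicate T to all cells with this h(c)
        for i in range(nb):
            cells[(i, h(c, nc))]["T"].append((c, d))

    out = []
    for cell in cells.values():          # each Reduce process joins what it received
        for b, c in cell["S"]:
            for a, b2 in cell["R"]:
                if b2 != b:
                    continue
                for c2, d in cell["T"]:
                    if c2 == c:
                        out.append((a, b, c, d))
    return out

R = [(1, 10), (2, 10), (3, 11)]
S = [(10, 20), (11, 21)]
T = [(20, 30), (21, 31), (22, 32)]
result = sorted(three_way_join(R, S, T))
print(result)
```

Each result tuple is produced only at the cell that holds its S tuple, so no duplicates arise from the replication of R and T.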
PROBLEM SOLVING
SPECIAL CASES
Star Joins
Chain Joins
CONCLUSION
Suitable only for equi-joins
Uses a single map-reduce job
Does not consider I/O (of intermediate <K,V> pairs) or CPU time