
JOIN ALGORITHMS USING MAPREDUCE
Haiping Wang ctqlwhp1022@163.com

OUTLINE
MapReduce Framework
MapReduce implementation on Hadoop
Join algorithms using MapReduce

MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS. IN OSDI, 2004

MAPREDUCE WORDCOUNT DIAGRAM


Input files (file1 ... file7), with their words:
ah ah er | ah | if or | or uh | or | ah if

map(String inputkey, String inputvalue):
ah:1 ah:1 er:1 | ah:1 | if:1 or:1 | or:1 uh:1 or:1 | ah:1 if:1

Shuffle (group by key):
ah:1,1,1,1  er:1  if:1,1  or:1,1,1  uh:1

reduce(String outputkey, Iterator intermediate_values):
(ah) (er) (if) (or) (uh)
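The word-count flow in the diagram can be sketched in a few lines of Python. This is a minimal single-process simulation, not Hadoop code: the `run_wordcount` driver and the assignment of word groups to file names are assumptions made for illustration, while `map_fn` and `reduce_fn` mirror the two signatures shown on the slide.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    """Map: input_key is a file name, input_value its text; emit (word, 1)."""
    for word in input_value.split():
        yield word, 1

def reduce_fn(output_key, intermediate_values):
    """Reduce: sum all the 1s collected for one word."""
    return output_key, sum(intermediate_values)

def run_wordcount(files):
    """Hypothetical driver: runs map, groups by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for name, text in files.items():
        for word, one in map_fn(name, text):
            groups[word].append(one)
    return dict(reduce_fn(word, ones) for word, ones in sorted(groups.items()))

# Word groups taken from the diagram; the file names are assumed.
files = {
    "file1": "ah ah er",
    "file2": "ah",
    "file3": "if or",
    "file4": "or uh",
    "file5": "or",
    "file6": "ah if",
}
counts = run_wordcount(files)
print(counts)
```

The grouped lists built inside `run_wordcount` correspond to the shuffle row of the diagram (for example, `ah` collects four 1s).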

MAPREDUCE IMPLEMENTATION ON HADOOP

[Diagram: a JobTracker coordinates TaskTrackers. Data flows through InputFormat, Record Reader, Mapper, Partitioner, Copy, Sorter, Reducer, Record Writer, OutputFormat.]

HADOOP MAPREDUCE FRAMEWORK ARCHITECTURE

JOIN ALGORITHMS USING MAPREDUCE


Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters (SIGMOD 2007)
Semi-join Computation on Distributed File Systems Using Map-Reduce-Merge Model (SAC 2010)
Optimizing Joins in a Map-Reduce Environment (VLDB 2009, EDBT 2010)
A Comparison of Join Algorithms for Log Processing in MapReduce (SIGMOD 2010)

MAP-REDUCE-MERGE: SIMPLIFIED RELATIONAL DATA PROCESSING ON LARGE CLUSTERS SIGMOD07

MAP-REDUCE-MERGE IMPLEMENTATIONS OF RELATIONAL JOIN ALGORITHMS


Sort-merge join
  Map: range partitioner, ordered buckets, each bucket assigned to a reducer
  Reduce: read the designated buckets from all mappers and merge them into a sorted set
  Merge: read sorted buckets from the two data sets and do a sort-merge join

Hash join
  Map: hash partitioner, hashed buckets, each bucket assigned to a reducer
  Reduce: read the designated buckets from all mappers and use a hash table to group and aggregate these records (same hash function as the mapper); does not need a sorter
  Merge: in-memory hash join

Block nested-loop join
  Map: the same as the hash join
  Reduce: the same as the hash join
  Merge: nested-loop join

EXAMPLE: HASH JOIN


[Diagram, bottom to top: splits -> mappers -> reducers -> mergers]

Mappers: use a hash partitioner
Reducers: read from every mapper for one designated partition
Mergers: read from two sets of reducer outputs that share the same hashing buckets; one is used as the build set and the other as the probe set
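The three stages of the hash-join example can be sketched as an in-memory simulation. This is only an illustration under stated assumptions: `NUM_BUCKETS`, the function names, and the sample records are hypothetical, and all stages run in one process rather than on separate mapper, reducer, and merger nodes.

```python
from collections import defaultdict

NUM_BUCKETS = 3  # assumed number of hash buckets (one reducer per bucket)

def partition(key):
    """Hash partitioner shared by both data sets."""
    return hash(key) % NUM_BUCKETS

def map_phase(records):
    """Mapper: route each (key, value) record to its hash bucket."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition(key)].append((key, value))
    return buckets

def reduce_phase(bucket_records):
    """Reducer: group one bucket's records in a hash table (no sorting)."""
    table = defaultdict(list)
    for key, value in bucket_records:
        table[key].append(value)
    return table

def merge_phase(build, probe):
    """Merger: in-memory hash join; probe the build table with each probe key."""
    out = []
    for key, probe_values in probe.items():
        for build_value in build.get(key, []):
            for probe_value in probe_values:
                out.append((key, build_value, probe_value))
    return out

def hash_join(a_records, b_records):
    a_buckets, b_buckets = map_phase(a_records), map_phase(b_records)
    result = []
    for bucket in range(NUM_BUCKETS):
        # Reducer outputs for the same bucket feed the same merger.
        build = reduce_phase(a_buckets.get(bucket, []))
        probe = reduce_phase(b_buckets.get(bucket, []))
        result.extend(merge_phase(build, probe))
    return sorted(result)

A = [(1, "a1"), (2, "a2"), (2, "a3")]
B = [(2, "b1"), (3, "b2")]
joined = hash_join(A, B)
print(joined)
```

Because both data sets use the same partitioner, matching keys always land in the same bucket, so each merger only needs the two reducer outputs for its own bucket.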

ANALYSIS AND CONCLUSION

Connections

A(ma, ra), B(mb, rb), r mergers; suppose ra = rb = r
Map->Reduce connections = ra*ma + rb*mb = r*(ma + mb)
Reduce->Merge: in the one-to-one case, connections = 2r
Matcher: compares tuples to see if they should be merged or not

Conclusion
Uses multiple map-reduce jobs
Partitioner may cause a data skew problem
The values of ma, ra, mb, rb, r (and whether ra = rb) determine the number of connections
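The connection counts from the Connections analysis can be checked with a small calculation. This is a sketch for the case ra = rb = r with one-to-one Reduce->Merge wiring; the function name and the sample values are made up for illustration.

```python
def connection_counts(ma, mb, r):
    """Connections when data sets A and B both use r reducers (ra = rb = r)."""
    map_to_reduce = r * ma + r * mb   # every mapper connects to every reducer of its data set
    reduce_to_merge = 2 * r           # one-to-one: each merger reads one A and one B reducer
    return map_to_reduce, reduce_to_merge

# Hypothetical sizes: 4 mappers for A, 6 mappers for B, 3 reducers each.
map_to_reduce, reduce_to_merge = connection_counts(ma=4, mb=6, r=3)
print(map_to_reduce, reduce_to_merge)
```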

SEMI-JOIN COMPUTATION STEPS AND WORKFLOW

Equi-join
Reduces communication costs and disk I/O costs
Insensitive to data skew?

A COMPARISON OF JOIN ALGORITHMS FOR LOG PROCESSING IN MAPREDUCE SIGMOD10

Equi-join between a log table L and a reference table R on a single column:

L ⋈ R on L.k = R.k, with |L| ≫ |R|

L, R, and the join result are stored in the DFS. Scans are used to access L and R. Each map or reduce task can optionally implement two additional functions, init() and close(), which are called before and after each map or reduce task, respectively.

REPARTITION JOIN (HIVE)

Input
L: ratings.dat
1::1193::5::978300760
1::661::3::978302109
1::661::3::978301968
1::661::4::978300275
1::1193::5::97882429

R: movies.dat
661::James and the Giant
914::My Fair Lady..
1193::One Flew Over the
2355::Bugs Life, A
3408::Erin Brockovich

Map
Pairs: (join key, tagged record)
1193, L:1::1193::5::978300760
661, L:1::661::3::978302109
661, L:1::661::3::978301968
661, L:1::661::4::978300275
1193, L:1::1193::5::97882429
661, R:661::James and the Gla
914, R:914::My Fair Lady..
1193, R:1193::One Flew Over
2355, R:2355::Bugs Life, A
3408, R:3408::Erin Brockovi

Shuffle
Group by join key
(661, [L:1::661::3::97], [R:661::James], [L:1::661::3::978], [L:1::661::4::97])
(2355, [R:2355::B])
(3408, [R:3408::Eri])

Reduce
Buffers records into two sets according to the table tag, then computes the cross-product
{(661::James)} X {(1::661::3::97), (1::661::3::97), (1::661::4::97)}

Output
(1,Ja..,3, ) (1,Ja..,3, ) (1,Ja..,4, )

Drawback: all records for a join key may have to be buffered
Out of memory when:
The key cardinality is small
The data is highly skewed

Improvement, by phase/function:
Map function: output key is changed to a composite of the join key and the table tag
Partitioning function: hashcode is computed from just the join key part of the composite key
Grouping function: records are grouped on just the join key
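The improved repartition join can be sketched as a single-process simulation. Several details here are assumptions rather than content from the slides: the `TAG_ORDER` sort that places R records ahead of L records within a join key (so that only the R side is buffered), the field positions of the join key, the function names, and the shortened movie titles.

```python
from itertools import groupby

TAG_ORDER = {"R": 0, "L": 1}  # assumed sort order: R records arrive before L records

def map_fn(tag, record):
    """Map: emit a composite key (join key, table tag) with the record."""
    fields = record.split("::")
    join_key = fields[1] if tag == "L" else fields[0]  # movie id column (assumed layout)
    return (join_key, tag), record

def partition(composite_key, num_reducers):
    """Partitioning: hash only the join-key part of the composite key."""
    join_key, _tag = composite_key
    return hash(join_key) % num_reducers

def repartition_join(l_records, r_records, num_reducers=2):
    pairs = [map_fn("L", rec) for rec in l_records]
    pairs += [map_fn("R", rec) for rec in r_records]
    partitions = [[] for _ in range(num_reducers)]
    for key, rec in pairs:
        partitions[partition(key, num_reducers)].append((key, rec))
    output = []
    for part in partitions:
        # Sort on the full composite key, but group on the join key only.
        part.sort(key=lambda kv: (kv[0][0], TAG_ORDER[kv[0][1]]))
        for _join_key, group in groupby(part, key=lambda kv: kv[0][0]):
            r_buffer = []  # only R records are buffered
            for (_, tag), rec in group:
                if tag == "R":
                    r_buffer.append(rec)
                else:
                    output.extend((r, rec) for r in r_buffer)  # stream L records
    return output

L = ["1::1193::5::978300760", "1::661::3::978302109",
     "1::661::3::978301968", "1::661::4::978300275"]
R = ["661::James", "914::My Fair Lady", "1193::One Flew Over"]
joined = sorted(repartition_join(L, R))
for pair in joined:
    print(pair)
```

With this ordering the reducer never holds the L records of a key in memory, which addresses the buffering drawback when L is large or skewed.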

THE COST MEASURE FOR MR ALGORITHMS

The communication cost of a process is the size of the input to the process.

This paper does not count the output size of a process, because:
The output must be input to at least one other process
The final output is much smaller than its input

The total communication cost is the sum of the communication costs of all processes that constitute an algorithm.

The elapsed communication cost is defined on the acyclic graph of processes:
Consider a path through this graph, and sum the communication costs of the processes along that path
The maximum sum, over all paths, is the elapsed communication cost
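Both cost measures can be computed directly from a process graph. The sketch below assumes the graph is given as a dictionary of per-process input sizes plus a list of directed edges; the function name, the process names, and the sizes are hypothetical.

```python
from functools import lru_cache

def communication_costs(input_size, edges):
    """Return (total, elapsed) communication cost for an acyclic process graph.

    input_size: dict mapping process name -> size of its input
    edges: list of (producer, consumer) pairs
    """
    total = sum(input_size.values())

    successors = {p: [] for p in input_size}
    for src, dst in edges:
        successors[src].append(dst)

    @lru_cache(maxsize=None)
    def heaviest_path_from(p):
        # Cost of p plus the heaviest path through any of its successors.
        return input_size[p] + max(
            (heaviest_path_from(q) for q in successors[p]), default=0
        )

    elapsed = max(heaviest_path_from(p) for p in input_size)
    return total, elapsed

# Hypothetical job: two map processes feeding two reduce processes.
sizes = {"map1": 100, "map2": 50, "reduce1": 80, "reduce2": 70}
edges = [("map1", "reduce1"), ("map1", "reduce2"),
         ("map2", "reduce1"), ("map2", "reduce2")]
total, elapsed = communication_costs(sizes, edges)
print(total, elapsed)
```

Here the heaviest path is map1 -> reduce1, so the elapsed cost is smaller than the total cost whenever processes run in parallel.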

2-WAY JOIN IN MAPREDUCE


R(A,B) ⋈ S(B,C)

Input: tables R with tuples (a, b) and S with tuples (b, c)

Map: each tuple is mapped to a key-value pair keyed on the join attribute b
R tuple (a, b): b -> (a, R)
S tuple (b, c): b -> (c, S)

Partition & sort:
Hash(b) -> (a, R)
Hash(b) -> (c, S)

Reduce input:
K = b0, V = (a0, R), (c0, S), (c1, S)
K = b1, V = (a1, R), (c2, S)

Reduce: b -> (a, c)

Final output:
A   B   C
a0  b0  c0
a0  b0  c1
a1  b1  c2

JOINING SEVERAL RELATIONS AT ONCE

R(A,B) ⋈ S(B,C) ⋈ T(C,D)

[Diagram: Input -> Map -> Reduce input -> Reduce -> Final output]

Let h be a hash function with range 1, 2, ..., m

S(b, c) -> (h(b), h(c))
R(a, b) -> (h(b), all)
T(c, d) -> (all, h(c))

Each Reduce process computes the join of the tuples it receives.

Number of Reduce processes: k = m^2; with m = 4, k = 4^2 = 16

[Diagram: the Reduce processes form an m x m grid indexed by h(b) and h(c), labelled 0 to 3 on each axis. An S tuple with h(S.b) = 2 and h(S.c) = 1 goes to a single cell; an R tuple with h(R.b) = 2 goes to every cell with h(b) = 2; a T tuple with h(T.c) = 1 goes to every cell with h(c) = 1.]
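The replication scheme above can be simulated in one process to confirm that each result tuple is produced exactly once. This is a sketch under assumptions: the hash function here has range 0 to m-1 (as on the diagram's grid rather than 1 to m), and the function names and sample relations are hypothetical.

```python
from itertools import product

def h(value, m):
    """Hash into 0..m-1 (the slide states range 1..m; the diagram labels 0..3)."""
    return hash(value) % m

def map_phase(R, S, T, m):
    """Send each tuple to the grid cells (h(b), h(c)) that may need it."""
    reducers = {cell: {"R": [], "S": [], "T": []}
                for cell in product(range(m), repeat=2)}
    for a, b in R:                      # R(a, b) -> (h(b), all)
        for j in range(m):
            reducers[(h(b, m), j)]["R"].append((a, b))
    for b, c in S:                      # S(b, c) -> (h(b), h(c))
        reducers[(h(b, m), h(c, m))]["S"].append((b, c))
    for c, d in T:                      # T(c, d) -> (all, h(c))
        for i in range(m):
            reducers[(i, h(c, m))]["T"].append((c, d))
    return reducers

def reduce_phase(inputs):
    """Join the tuples received by one Reduce process."""
    out = []
    for a, b in inputs["R"]:
        for b2, c in inputs["S"]:
            if b != b2:
                continue
            for c2, d in inputs["T"]:
                if c == c2:
                    out.append((a, b, c, d))
    return out

def three_way_join(R, S, T, m=4):
    reducers = map_phase(R, S, T, m)
    return sorted(t for inputs in reducers.values() for t in reduce_phase(inputs))

R = [("a0", "b0"), ("a1", "b1")]
S = [("b0", "c0"), ("b0", "c1"), ("b1", "c2")]
T = [("c0", "d0"), ("c2", "d1")]
result = three_way_join(R, S, T)
print(result)
```

Each S tuple lives in exactly one cell, so a joined tuple cannot be emitted by two different Reduce processes; the price is that every R and T tuple is copied m times.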

PROBLEM SOLVING

Problem solving using the method of Lagrange Multipliers

Take derivatives with respect to the three variables a, b, c

Multiply the three equations
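The equations for these steps did not survive on the slide. As a hedged illustration only, the following works through one standard instance of the method, assuming the three variables a, b, c are the shares of attributes A, B, C in the cyclic join R(A,B) ⋈ S(B,C) ⋈ T(A,C), with relation sizes r, s, t and k Reduce processes. Whether this is the exact example the slide showed is an assumption.

```latex
% Assumed setting: minimize the communication cost subject to abc = k
\min_{a,b,c}\; rc + sa + tb \quad \text{subject to} \quad abc = k

% Lagrangian
\mathcal{L} = rc + sa + tb - \lambda\,(abc - k)

% Derivatives with respect to the three variables a, b, c
\frac{\partial \mathcal{L}}{\partial a} = s - \lambda bc = 0, \qquad
\frac{\partial \mathcal{L}}{\partial b} = t - \lambda ac = 0, \qquad
\frac{\partial \mathcal{L}}{\partial c} = r - \lambda ab = 0

% Multiply each equation by its own variable and use abc = k
sa = \lambda k, \qquad tb = \lambda k, \qquad rc = \lambda k

% Multiply the three equations
rst \cdot abc = \lambda^{3} k^{3}
\;\Rightarrow\; \lambda = \sqrt[3]{\frac{rst}{k^{2}}}

% Optimal shares and resulting cost
a = \sqrt[3]{\frac{krt}{s^{2}}}, \qquad
b = \sqrt[3]{\frac{krs}{t^{2}}}, \qquad
c = \sqrt[3]{\frac{kst}{r^{2}}}, \qquad
rc + sa + tb = 3\sqrt[3]{krst}
```

At the optimum the three cost terms are equal, which is why multiplying the three equations isolates the multiplier so cleanly.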

SPECIAL CASES

Star Joins

Chain Joins

A chain join is a join of the form

CONCLUSION
Only suitable for equi-joins
Uses one map-reduce job
Does not consider I/O (for intermediate <K,V> pairs) or CPU time

Main contribution: use of the Lagrange multipliers method
