Franco-German University (Université franco-allemande) Program for Young Researchers 2011
CLOUD COMPUTING:
CHALLENGES AND OPPORTUNITIES
$600 to buy a disk drive that can store all of the world's music
5 billion mobile phones in use in 2010
30 billion pieces of content shared on Facebook every month
40% projected growth in global data per year
Source: Big Data: The next frontier for innovation, competition, and productivity (McKinsey)
Big Data
Data have swept into every industry and business function
important factor of production
exabytes of data stored by companies every year
much of modern economic activity could not take place without these data
Use of Big Data will become a key basis of competition and growth
companies failing to develop their analysis capabilities will fall behind
Source: Big Data: The next frontier for innovation, competition and productivity (McKinsey)
Trends
Claremont Report
Massive parallelization
Virtualization
Service-based computing
Re-architecting DBMS
Web-scale data management
Parallelization
Continuous optimization
Tight integration
Service-based everything
Programming Model
Combining structured and unstructured data
Media Convergence
Analytics / BI
Operational
Multi-tenancy
Overview
Introduction
Big Data Analytics
Map/Reduce/Merge
Introducing the Cloud
Stratosphere (PACT and Nephele)
Demo
(Thomas Bodner, Matthias Ringwald)
Map/Reduce Revisited
Map/Reduce Revisited
The data model
key/value pairs, e.g. (int, string)
map: (Km, Vm) → (Kr, Vr)*
reduce: (Kr, Vr*) → (Kr, Vr) — input and output keys come from the same domain
The framework
accepts a list of input key/value pairs
groups the map output values by key and hands each group to reduce
outputs the result pairs
Dataflow: (Km, Vm) → MAP(Km, Vm) → (Kr, Vr)* → framework groups values by key → (Kr, Vr*) → REDUCE(Kr, Vr*) → (Kr, Vr) → framework collects the result pairs (Kr, Vr)*
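The dataflow above can be illustrated with a minimal, framework-free word-count sketch in plain Java (an illustration only, not Hadoop code): map emits (Kr, Vr) pairs, the framework groups them by key, and reduce folds each group.

import java.util.*;

// Minimal single-JVM sketch of the map/group/reduce dataflow (not Hadoop).
public class WordCountSketch {
    public static void main(String[] args) {
        List<String> input = List.of("to be or not to be");

        // map phase: (Km, Vm) -> list of (Kr, Vr)
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input)
            for (String word : line.split(" "))
                mapped.add(Map.entry(word, 1));

        // framework: group all (Kr, Vr) pairs by key Kr
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // reduce phase: (Kr, Vr*) -> (Kr, Vr)
        grouped.forEach((word, ones) ->
            System.out.println(word + "\t" + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}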
Basic Operators
selection (σ)
projection (π)
set/bag union (∪)
set/bag difference (\ or −)
Cartesian product (×)
Derived Operators
join (⋈)
set/bag intersection (∩)
division (÷)
Further Operators
de-duplication
generalized projection (grouping and aggregation)
outer-joins and semi-joins
sort
Map/Reduce job:
map(key, tuple) {
  // keep only US records, group by the sale year
  int year = YEAR(tuple.date);
  if (tuple.area_code == "US")
    emit(year, { year => year, price => tuple.price });
}
reduce(key, tuples) {
  // sum up the prices of all tuples for one year
  double sum_price = 0;
  foreach (tuple in tuples) {
    sum_price += tuple.price;
  }
  emit(key, sum_price);
}
map(key, bosses_phonebook_entry) {
  // emit each phone number as the key; the value is irrelevant
  emit(bosses_phonebook_entry.number, "");
}
reduce(phone_number, tuples) {
  // each distinct phone number is emitted exactly once (de-duplication)
  emit(phone_number, "");
}
map(key, boss_listing_entry) {
  // tag each boss first name with marker "B"
  // (a second mapper over the other listing emits marker "E")
  emit(boss_listing_entry.first_name, "B");
}
reduce(first_name, markers) {
  // emit only names that occur in both inputs (intersection)
  if ("E" in markers and "B" in markers) {
    emit(first_name, "");
  }
}
Map/Reduce Revisited
JOINS IN MAP/REDUCE
Cost: transport = p · B(S), local join = ???
The Asymmetric Fragment-and-Replicate Join is a special case of the Symmetric Algorithm with m = p and n = 1.
The Asymmetric Fragment-and-Replicate Join is also called Broadcast Join.
Broadcast Join
Equi-Join: L(A,X) ⋈ R(X,C)
Idea
broadcast L to each node completely before the map phase begins
Mapper
only over R
step 1: read assigned input split of R into a hash-table (build phase)
step 2: scan local copy of L and find matching R tuples (probe)
step 3: emit each such pair
Alternatively read L into Hash-Table, then read R and probe
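A minimal single-JVM sketch of this idea, using the alternative ordering just mentioned (build a hash table on the broadcast copy of L, then probe with the local split of R). The record classes and field names are illustrative assumptions, not taken from the slides.

import java.util.*;

// Sketch of a broadcast (map-side) join for L(A,X) ⋈ R(X,C): L is assumed to be
// available on every node; the mapper builds a hash table on L and probes it with
// its local split of R.
public class BroadcastJoinSketch {
    record LTuple(String a, int x) {}
    record RTuple(int x, String c) {}

    public static void main(String[] args) {
        List<LTuple> broadcastL = List.of(new LTuple("a1", 1), new LTuple("a2", 2));
        List<RTuple> localSplitOfR = List.of(new RTuple(1, "c1"), new RTuple(3, "c3"));

        // build phase: hash table over the broadcast relation L, keyed by X
        Map<Integer, List<LTuple>> hash = new HashMap<>();
        for (LTuple l : broadcastL)
            hash.computeIfAbsent(l.x(), k -> new ArrayList<>()).add(l);

        // probe phase: scan the local split of R and emit each matching pair
        for (RTuple r : localSplitOfR)
            for (LTuple l : hash.getOrDefault(r.x(), List.of()))
                System.out.println(l.a() + ", " + l.x() + ", " + r.c());
    }
}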
Repartition Join
Equi-Join: L(A,X) ⋈ R(X,C)
Mapper (reads both L(A,X) and R(X,C))
emits each tuple with
the value of the actual join key X
an annotation identifying to which relation the tuple belongs (L or R)
tuples are repartitioned among the reducers by h(key) % n
Reduce
collect all L-tuples for the current L(i) block in a hash map
combine them with each R-tuple of the corresponding R(i) block
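A small sketch of the reduce side, assuming tuples arrive tagged with their origin relation as described above (the Tagged record and the payload strings are illustrative assumptions).

import java.util.*;

// Reduce side of a repartition join: per join key, L-tuples are collected in one
// list and combined with every R-tuple of the same key.
public class RepartitionJoinReduce {
    record Tagged(char relation, String payload) {}   // relation is 'L' or 'R'

    static void reduce(int joinKey, List<Tagged> tuples) {
        List<String> lTuples = new ArrayList<>();
        List<String> rTuples = new ArrayList<>();
        for (Tagged t : tuples)
            (t.relation() == 'L' ? lTuples : rTuples).add(t.payload());
        for (String l : lTuples)                 // combine every L-tuple ...
            for (String r : rTuples)             // ... with every R-tuple
                System.out.println(joinKey + ": " + l + " | " + r);
    }

    public static void main(String[] args) {
        reduce(1, List.of(new Tagged('L', "a1"), new Tagged('R', "c1"), new Tagged('R', "c2")));
    }
}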
Multi-dimensional partitioned join: F(C,X,Y) joined with D1 (on X) and with D2(B,Y) (on Y)
Fragment
D1 and D2 are partitioned independently
the partitions for F are defined as D1 × D2
Replicate
for F-tuple f the partition is uniquely defined as (hash(f.x), hash(f.y))
for D1-tuple d1 there is one degree of freedom (d1.y is undefined)
D1-tuples are thus replicated for each possible y value
symmetric for D2
Reduce
find and emit (f, d1, d2) pairs
depending on the input sorting, different join strategies are possible
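A small sketch of the replication logic described above, assuming an nx × ny reducer grid addressed by (hash(x), hash(y)); the grid size and tuple values are made-up illustrations.

import java.util.*;

// F-tuples go to exactly one grid cell; D1-tuples (y undefined) are replicated
// across all y-cells of their x-row, symmetrically for D2.
public class FragmentAndReplicate {
    static final int NX = 3, NY = 2;   // illustrative grid size

    static int cell(int xBucket, int yBucket) { return xBucket * NY + yBucket; }

    static List<Integer> partitionsForF(int x, int y) {
        return List.of(cell(Math.floorMod(x, NX), Math.floorMod(y, NY)));
    }

    static List<Integer> partitionsForD1(int x) {        // one degree of freedom: all y
        List<Integer> cells = new ArrayList<>();
        for (int yb = 0; yb < NY; yb++) cells.add(cell(Math.floorMod(x, NX), yb));
        return cells;
    }

    public static void main(String[] args) {
        System.out.println("F(x=7,y=4) -> " + partitionsForF(7, 4));
        System.out.println("D1(x=7)    -> " + partitionsForD1(7));
    }
}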
Joins in Hadoop
Figure: experimental comparison of execution time over the number of nodes and the join selectivity (Asym. = multi-dimensional partitioned join)
Comparison: Map/Reduce vs. Parallel DBMS
Programming Model — Map/Reduce: stating an algorithm (procedural: C/C++, Java, ...)
Scaling — Map/Reduce: 10 – 5,000 nodes; Parallel DBMS: 1 – 500 nodes
Fault Tolerance — Map/Reduce: good; Parallel DBMS: limited
Execution — Map/Reduce: materializes results between phases; Parallel DBMS: pipelines results between operators
(further criteria: Schema Support, Indexing, Optimization)
MAP-REDUCE-MERGE
Map-Reduce-Merge
Motivation
Map/Reduce does not directly support processing multiple related, heterogeneous datasets
difficulties and/or inefficiency when one must implement relational operators like joins
Map-Reduce-Merge
adds a merge phase whose goal is to efficiently merge data already partitioned and sorted (or hashed) by the map and reduce modules
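As a conceptual illustration only (not the API from the Map-Reduce-Merge paper), a merge step over two reduce outputs that are already sorted by key can be a single sorted-merge pass; unique keys per side are assumed for brevity.

import java.util.*;

// Conceptual merge phase: joins two key-sorted reduce outputs in one pass.
public class MergePhaseSketch {
    static void merge(List<Map.Entry<Integer, String>> left,
                      List<Map.Entry<Integer, String>> right) {
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = Integer.compare(left.get(i).getKey(), right.get(j).getKey());
            if (cmp == 0) {                       // matching keys: emit merged record
                System.out.println(left.get(i).getKey() + ": "
                        + left.get(i).getValue() + " + " + right.get(j).getValue());
                i++; j++;
            } else if (cmp < 0) i++; else j++;
        }
    }

    public static void main(String[] args) {
        merge(List.of(Map.entry(1, "emp"), Map.entry(3, "emp")),
              List.of(Map.entry(1, "dept"), Map.entry(2, "dept")));
    }
}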
Introducing
THE CLOUD
In the Cloud
STRATOSPHERE
Use-Cases
Scientific Data, Life Sciences, Linked Data, ...
StratoSphere — Above the Clouds*
Database-inspired approach: a query processor to analyze, aggregate, and query textual and (semi-)structured data
Runs on Infrastructure as a Service
* FOR 1306: DFG-funded collaborative project among TU Berlin, HU Berlin, and HPI Potsdam
Climate data use case: region of about 1100 km × 950 km at 2 km resolution, roughly 10 TB of data
Required operations: filter, aggregation (sliding window), join, multi-dimensional sliding-window operations, geospatial/temporal joins, handling of uncertainty
Further Use-Cases
Text Mining in the biosciences
Outline
Architecture of the Stratosphere System
The PACT Programming Model
The Nephele Execution Engine
Parallelizing PACT Programs
Architecture Overview
Layers: higher-level language / parallel programming model / execution engine
Hadoop stack: JAQL, Pig, Hive — Map/Reduce programming model — Hadoop
Dryad stack: Scope, DryadLINQ — Dryad
Stratosphere stack: (JAQL? Pig? Hive?) — PACT programming model — Nephele
Relational Databases vs. Map/Reduce
Map/Reduce is schema-free
Many semantics are hidden inside the user code (tricks required to push operations into map/reduce)
Single default way of parallelization
Stratosphere in a Nutshell
PACT Programming Model
Parallelization Contract (PACT)
Declarative definition of data parallelism
Centered around second-order functions
Generalization of map/reduce
Nephele
Dryad-style execution engine
Evaluates dataflow graphs in parallel
Data is read from distributed filesystem
Flexible engine for complex jobs
Stack: PACT programming model → PACT Compiler → Nephele
Overview
Parallelization Contracts (PACTs)
The Nephele Execution Engine
Compiling/Optimizing Programs
Related Work
Reduce contract (illustration): the input set of key/value pairs is split into independent subsets by key
Records with identical key must be processed together
A PACT: Data → Input Contract → first-order function (user code) → Output Contract → Data
Input Contract
Specifies dependencies between records
(a.k.a. "What must be processed together?")
Generalization of map/reduce
Logically: Abstracts a (set of) communication pattern(s)
For "reduce": repartition-by-key
For "match" : broadcast-one or repartition-by-key
Output Contract
Generic properties preserved or produced by the user code
key property, sort order, partitioning, etc.
Relevant to parallelization of succeeding functions
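To make the "match" input contract concrete, here is a conceptual plain-Java sketch of its logical semantics (not the Stratosphere API): the first-order user function is invoked once for every pair of records from the two inputs that share a key.

import java.util.*;

// Logical semantics of a dual-input "match" contract, framework-free.
public class MatchSemantics {
    interface TriConsumer<A, B, C> { void accept(A a, B b, C c); }

    static <K, V1, V2> void match(List<Map.Entry<K, V1>> in1,
                                  List<Map.Entry<K, V2>> in2,
                                  TriConsumer<K, V1, V2> userFunction) {
        // group the second input by key
        Map<K, List<V2>> byKey = new HashMap<>();
        for (Map.Entry<K, V2> e : in2)
            byKey.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        // call the user function once per matching pair
        for (Map.Entry<K, V1> e : in1)
            for (V2 v2 : byKey.getOrDefault(e.getKey(), Collections.emptyList()))
                userFunction.accept(e.getKey(), e.getValue(), v2);
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> l = List.of(Map.entry(1, "a"), Map.entry(2, "b"));
        List<Map.Entry<Integer, String>> r = List.of(Map.entry(1, "x"), Map.entry(1, "y"));
        match(l, r, (k, v1, v2) -> System.out.println(k + ": " + v1 + "/" + v2));
    }
}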
invoke():
  // build phase: read the second input into a hash table
  while (!input2.eof)
    KVPair p = input2.next();
    hash-table.put(p.key, p.value);
  // probe phase: stream the first input and call the user function on matches
  while (!input1.eof)
    KVPair p = input1.next();
    KVPair t = hash-table.get(p.key);
    if (t != null)
      KVPair[] result = UF.match(p.key, p.value, t.value);
      output.write(result);
end
Figure: a PACT program — user functions UF1 (map), UF2 (map), UF3 (match), UF4 (reduce), each wrapped in Nephele code for communication — is compiled into a Nephele DAG with vertices V1–V4 and then spanned into parallel subtasks, connected by in-memory and network channels.
NEPHELE EXECUTION ENGINE
Design goals
Stack: PACT Compiler → Nephele → Infrastructure-as-a-Service
Nephele Architecture
Standard master worker pattern
Workers can be allocated on demand
Figure: the Client submits jobs over the public network (Internet) to the Master; the Master controls Workers in a private/virtualized network inside the compute cloud (allocated via the Cloud Controller) and accesses persistent storage.
Nephele Job Graph (example):
Input 1 — Task: LineReaderTask.program, Input: s3://user:key@storage/input
Task 1 — Task: MyTask.program
Output 1 — Task: LineWriterTask.program, Output: s3://user:key@storage/outp
Explicit parallelization
Parallelization range (mpl) derived from the PACT
Wiring of subtasks derived from the PACT
Example: Input 1 (1) → Task 1 (2), each vertex annotated with an instance type (ID: 1, Type: m1.small)
Execution Stages (example)
Stage 0: Input 1 (1) and Task 1 (2), instance type m1.small (ID: 1)
Stage 1: Output 1 (1), instance type m1.large (ID: 2)
Channel Types
File channels: vertices must run on the same VM and must be in different stages
(Example graph as before: Stage 0 — Input 1 (1), Task 1 (2) on m1.small; Stage 1 — Output 1 (1) on m1.large)
Figure (evaluation): two runs plotted over time [minutes], tasks (a)–(h); per-node CPU utilization (USR, SYS, WAIT) and network traffic. One run shows poor resource utilization; the other benefits from automatic VM deallocation.
References
[WK09] Daniel Warneke, Odej Kao: Nephele: Efficient Parallel Data Processing in the Cloud. SC-MTAGS 2009
[BEH+10] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing. SoCC 2010: 119-130
[ABE+10] A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2): 1625-1628 (2010)
[AEH+11] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, et al.: MapReduce and PACT - Comparing Data Parallel Programming Models, to appear at BTW 2011
Ongoing Work
Overview
Introduction
Big Data Analytics
Map/Reduce/Merge
Introducing the Cloud
Stratosphere (PACT and Nephele)
Demo
(Thomas Bodner, Matthias Ringwald)
http://mediatedcultures.net/ksudigg/?p=120
Demo Screenshots
WEBLOG ANALYSIS QUERY
Demo Screenshots
ENUMERATING TRIANGLES FOR SOCIAL NETWORK MINING
APACHE MAHOUT
Sebastian Schelter
Scalability
runtime t is proportional to problem size P divided by resource size R (t ∝ P/R)
does not imply Hadoop or parallel, although the majority of implementations use Map/Reduce
Algorithms — Clustering
K-Means
Fuzzy K-Means
Canopy
Mean Shift
Dirichlet Process
Spectral Clustering
Algorithms — Classification
Logistic Regression (sequential but fast)
Naive Bayes / Complementary Naive Bayes
Random Forests
Algorithms — Collaborative Filtering
Neighborhood methods: item-based collaborative filtering
Latent factor models: matrix factorization using Alternating Least Squares
Algorithms — Dimensionality Reduction
Lanczos Algorithm
Stochastic SVD
Problem description
Pairwise row similarity computation
Computes the pairwise similarities of the rows (or columns) of a sparse matrix using a predefined similarity function
used for computing document similarities in large corpora
used to precompute item-item similarities for recommendations (Collaborative Filtering)
similarity function can be cosine, Pearson correlation, log-likelihood ratio, Jaccard coefficient, ...
Map/Reduce
Map/Reduce Step 1
compute similarity-specific row weights
transpose the matrix, thereby creating an inverted index
Map/Reduce Step 2
map out all pairs of co-occurring values
collect all co-occurring values per row pair, compute the similarity value
Map/Reduce Step 3
use secondary sort to keep only the k most similar rows
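For illustration, a framework-free sketch of the similarity function itself, assuming cosine similarity over sparse rows represented as column-to-value maps (the representation is an assumption, not Mahout's data model).

import java.util.*;

// Cosine similarity of two sparse rows; only co-occurring columns contribute to
// the dot product, which is what the Map/Reduce steps above collect per row pair.
public class RowSimilaritySketch {
    static double cosine(Map<Integer, Double> rowA, Map<Integer, Double> rowB) {
        double dot = 0, normA = 0, normB = 0;
        for (double v : rowA.values()) normA += v * v;
        for (double v : rowB.values()) normB += v * v;
        for (Map.Entry<Integer, Double> e : rowA.entrySet()) {
            Double other = rowB.get(e.getKey());        // co-occurring column
            if (other != null) dot += e.getValue() * other;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(cosine(Map.of(0, 1.0, 2, 2.0), Map.of(0, 1.0, 1, 3.0)));
    }
}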
PACT
Comparison
Equivalent implementations in Mahout and PACT
the problem maps relatively well to the Map/Reduce paradigm
insight: standard Map/Reduce code can be ported to Nephele/PACT with very little effort
output contracts and memory forwards offer hooks for performance improvements (unfortunately not applicable in this particular use case)
Problem description
K-Means
Simple iterative clustering algorithm
Mahout
Initialization
generate k random cluster centers from data points (optional)
put the centers into the distributed cache
Map
find the nearest cluster for each data point
emit (cluster id, data point)
Combine
partially aggregate the points assigned to each cluster
Reduce
compute the new centroid for each cluster
Repeat until convergence
Stratosphere Implementation
Source: www.stratosphere.eu
Code analysis
Comparison of the implementations
Problem description
Naive Bayes
Simple classification algorithm based on Bayes' theorem
M/R Overview
Classification
straightforward approach: simply reads the complete model into memory
classification is done in the mapper; the reducer only sums up statistics for the confusion matrix
Trainer
much higher complexity
needs to count documents, features, features per document, features per corpus
Mahout's implementation is optimized by exploiting Hadoop-specific features like secondary sort and reading results into memory from the cluster filesystem
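A framework-free sketch of the classification step under the usual log-space formulation (the model values and the unseen-feature penalty below are made-up assumptions, not Mahout's).

import java.util.*;

// Score each class as logPrior(c) + sum over document features of logLikelihood(c, f);
// the highest-scoring class wins.
public class NaiveBayesClassifySketch {
    static String classify(Map<String, Double> logPrior,
                           Map<String, Map<String, Double>> logLikelihood,
                           Map<String, Integer> docFeatureCounts) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cls : logPrior.keySet()) {
            double score = logPrior.get(cls);
            for (Map.Entry<String, Integer> f : docFeatureCounts.entrySet())
                score += f.getValue()
                        * logLikelihood.get(cls).getOrDefault(f.getKey(), -10.0); // unseen-feature penalty (assumption)
            if (score > bestScore) { bestScore = score; best = cls; }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> prior = Map.of("sports", Math.log(0.5), "politics", Math.log(0.5));
        Map<String, Map<String, Double>> lik = Map.of(
                "sports", Map.of("goal", Math.log(0.2), "vote", Math.log(0.01)),
                "politics", Map.of("goal", Math.log(0.01), "vote", Math.log(0.2)));
        System.out.println(classify(prior, lik, Map.of("goal", 2, "vote", 1)));
    }
}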
Figure: Mahout Naive Bayes trainer dataflow — a Feature Extractor feeds a TermDoc Counter, WordFreq Counter, Doc Counter, Feature Counter, and Vocab Counter; their outputs (termDocC, wordFreq, docC, featureC, vocabC) are combined by the Weight Summer, Tf-Idf Calculation, and Theta Normalizer into tfIdf and thetaNorm.
PACT implementation
looks even more complex, but PACTs can be combined in a much more fine-grained manner
since PACT offers the ability to use local memory forwards, more and higher-level functions such as Cross and Match can be used
fewer framework-specific tweaks are necessary for a performant implementation
the visualized execution plan is much more similar to the algorithmic formulation of computing several counts and combining them into a model in the end
subcalculations can be seen and unit-tested in isolation
Hot Path
Intermediate data volumes along the hot path: 7.4 GB, 14.8 GB, 5.89 GB, 5.89 GB, 3.53 GB, 84 kB, 8 kB, 5 kB
Thank You — Danke (German), Merci (French), Gracias (Spanish), Grazie (Italian), Obrigado (Brazilian Portuguese), and thanks in Hindi, Thai, Traditional and Simplified Chinese, Russian, Arabic, Tamil, Japanese, and Korean
Introduction
MapReduce paradigm is too low-level
Hive
Data warehouse infrastructure built on top of Hadoop,
providing:
Data Summarization
Ad hoc querying
Hive - Example
LOAD DATA INPATH '/data/visits' INTO TABLE visits;

INSERT OVERWRITE TABLE visitCounts
SELECT url, category, count(*)
FROM visits
GROUP BY url, category;

LOAD DATA INPATH '/data/urlInfo' INTO TABLE urlInfo;

INSERT OVERWRITE TABLE visitCounts
SELECT vc.*, ui.*
FROM visitCounts vc JOIN urlInfo ui ON (vc.url = ui.url);

INSERT OVERWRITE TABLE gCategories
SELECT category, count(*)
FROM visitCounts
GROUP BY category;

INSERT OVERWRITE TABLE topUrls
SELECT TRANSFORM (visitCounts) USING 'top10';
JAQL
Higher level query language for JSON documents
http://www.jaql.org/
JAQL - Example
registerFunction("top10", "de.tuberlin.cs.dima.jaqlextensions.top10");
$visits = hdfsRead("/data/visits");
$visitCounts =
  $visits
  -> group by $url = $.url
     into { $url, num: count($) };
$urlInfo = hdfsRead("/data/urlInfo");
$visitCounts =
  join $visitCounts, $urlInfo
  where $visitCounts.url == $urlInfo.url;
$gCategories =
  $visitCounts
  -> group by $category = $.category
     into { $category, num: count($) };
$topUrls = top10($gCategories);
hdfsWrite("/data/topUrls", $topUrls);
Pig
A platform for analyzing large data sets
Pig consists of two parts:
Pig Latin, the language: an interface between the declarative style of SQL and the low-level, procedural style of MapReduce
the execution environment, which compiles Pig Latin scripts into MapReduce jobs
http://hadoop.apache.org/pig/
Pig - Example
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
Literature
QUERY COPROCESSING ON GRAPHICS PROCESSORS
The GPU-based primitives and operators
exploit GPU hardware features such as high thread parallelism and the reduction of memory stalls through the fast local memory
are scalable to hundreds of processors because of their lock-free design and low synchronization cost through the use of local memory
Reduce
computes a value based on the input relation
implemented as a multi-pass algorithm utilizing the local memory optimization
logarithmic number of passes, constrained by the local memory size per multiprocessor
HADOOPDB
2. MapReduce — data analysis via parallel Map and Reduce jobs in a replicated cluster.
Parallel RDBMSs
Pros:
Usually very good and consistent performance.
Flexible and proven interface (SQL).
Cons:
Scaling is rather limited (10s of nodes).
Does not work well in heterogeneous clusters.
Not very fault-tolerant.
MapReduce
Pros:
Very fault-tolerant and automatic load-balancing.
Operates well in heterogeneous clusters.
Cons:
Writing map/reduce jobs is more complicated than writing SQL queries.
Performance depends largely on the skill of the programmer.
HadoopDB
Both approaches have their strengths and weaknesses.
Idea of HadoopDB: Combine them!
Traditional relational databases as data storage and data processing nodes.
MapReduce for Query Parallelization, Job Tracking, etc.
Automatic SQL to MapReduce to SQL (SMS) query rewriter (based on Hive).
HadoopDB overview
Figure: the User submits an SQL query; the SMS Planner consults the system catalog and produces a MapReduce job for the Master Node; Task Trackers on Node #1 … Node #n push SQL into their local Postgres databases, which store the replicated table data.
SELECT
YEAR(saleDate),
SUM(revenue)
FROM sales
GROUP BY YEAR(saleDate);
SMS Rewrite
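The rewritten plan itself is not reproduced in these notes. As an illustration only (an assumption about what SMS produces), the GROUP BY could be pushed into each node's PostgreSQL instance, leaving the MapReduce side to merely merge the per-node partial sums by year.

import java.util.*;

// Illustrative sketch: each node answers something like
//   SELECT YEAR(saleDate) AS y, SUM(revenue) AS partial FROM sales GROUP BY YEAR(saleDate)
// (per-node SQL is an assumption); the global result merges the partial sums per year.
public class SmsMergeSketch {
    static Map<Integer, Double> mergePartials(List<Map<Integer, Double>> perNodeResults) {
        Map<Integer, Double> global = new TreeMap<>();
        for (Map<Integer, Double> node : perNodeResults)
            node.forEach((year, partial) -> global.merge(year, partial, Double::sum));
        return global;
    }

    public static void main(String[] args) {
        System.out.println(mergePartials(List.of(
                Map.of(2009, 100.0, 2010, 50.0),
                Map.of(2010, 25.0, 2011, 75.0))));
    }
}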
Literature
Levels of Parallelism
Instruction-Level, Data, Task
Parallel Speedup
Data Parallelism
Different data can be processed independently
Each processor executes the same operations on its share of the input data.
Example: distributing loop iterations over multiple processors or CPU vector units (see the sketch below)
Task Parallelism
Tasks are distributed among the processors/nodes
Each processor executes a different thread/process.
Example: threaded programs.
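A small Java example of the data-parallel case referenced above: the same operation is applied to disjoint shares of the input by letting a parallel stream distribute loop iterations over the available cores.

import java.util.stream.IntStream;

// Data parallelism: identical operation (squaring) on disjoint parts of the input.
public class DataParallelLoop {
    public static void main(String[] args) {
        double[] input = new double[1_000_000];
        for (int i = 0; i < input.length; i++) input[i] = i;

        double[] output = new double[input.length];
        IntStream.range(0, input.length)
                 .parallel()                       // iterations split across worker threads
                 .forEach(i -> output[i] = input[i] * input[i]);

        System.out.println(output[1000]);          // 1000000.0
    }
}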
Pipeline Parallelism
Figure: plan Return → Sort → HS-Join → HS-Join → Scan(T1), Scan(T2), Scan(T3)
Step 1: two threads scan one base table each and build the hash tables for the joins.
Step 2: one thread scans the remaining table and probes the hash tables; a second thread starts the sort (sorting sub-lists, merging the first lists).
Step 3: one thread returns the result, business as usual.
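A minimal sketch of pipeline parallelism between two operators, using a blocking queue as the pipeline (the operators and the tuple type are illustrative assumptions).

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A producer thread "scans" tuples while a consumer thread aggregates them
// concurrently; the bounded queue is the pipeline between the two operators.
public class PipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> pipe = new ArrayBlockingQueue<>(1024);
        final int EOF = -1;

        Thread scan = new Thread(() -> {           // operator 1: scan
            try {
                for (int tuple = 0; tuple < 10_000; tuple++) pipe.put(tuple);
                pipe.put(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread aggregate = new Thread(() -> {      // operator 2: runs while the scan is still producing
            long sum = 0;
            try {
                for (int t = pipe.take(); t != EOF; t = pipe.take()) sum += t;
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            System.out.println("sum = " + sum);
        });

        scan.start(); aggregate.start();
        scan.join(); aggregate.join();
    }
}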
Pipeline Parallelism
Pipeline Parallelism is also called Inter-Operator Parallelism
Inter-Operator, because the parallelism is between the operators
Limited in its applicability: only possible if multiple pipelines are present that are not totally dependent on each other
Problem:
Data Parallelism
Pipeline Parallelism is not applicable to a large degree
Data Parallelism
Data divided into several sub-sets
Most operations don't need a complete view of the data
E.g. "Filter" looks only at a single tuple at a time
Data Partitioning
Round-robin, Hash, Range
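A minimal sketch of the three schemes for p nodes (the tuple index, key type, and range boundaries are illustrative assumptions).

// Three data partitioning schemes: round-robin, hash, and range.
public class PartitioningSchemes {
    static int roundRobin(long tupleIndex, int p)   { return (int) (tupleIndex % p); }

    static int hashPartition(Object key, int p)     { return Math.floorMod(key.hashCode(), p); }

    static int rangePartition(int key, int[] upperBounds) {  // one ascending bound per node
        for (int node = 0; node < upperBounds.length; node++)
            if (key <= upperBounds[node]) return node;
        return upperBounds.length - 1;
    }

    public static void main(String[] args) {
        System.out.println(roundRobin(7, 3));                            // 1
        System.out.println(hashPartition("key42", 3));
        System.out.println(rangePartition(55, new int[]{10, 50, 100}));  // 2
    }
}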
Shared Memory
Several CPUs share a single memory and disk (array)
Communication over a single common bus
Source: Garcia-Molina et al., Database Systems: The Complete Book, Second Edition
Shared Disk
Several nodes with multiple CPUs, each node has its private memory
Single attached disk (array): often NAS, SAN, etc.
Source: Garcia-Molina et al., Database Systems: The Complete Book, Second Edition
Shared Nothing
Each node has its own set of CPUs, memory, and disks attached
Data needs to be partitioned over the nodes
Data is exchanged through direct node-to-node communication
Source: Garcia-Molina et al., Database Systems: The Complete Book, Second Edition
Figure: the Client sends a query to the Coordinator; the Coordinator compiles the query and distributes it to the ClusterNodes; the ClusterNodes return partial results, and the Coordinator returns the final results to the Client.
Figure — parallel query plan: parallel instances each run Scan(T1 (part)) and an NL-Join with Fetch/IX-Scan over IX-T2.1 (part) / T2 (part), followed by Sort and Group Agg (pre-aggregation); at the point of data shipping, a Queue collects the sub-plan results, followed by Group Agg (final aggregation) and Return.
Parallel Operators
Ideally: Operate as much as possible on individual partitions of the data
Bring the operation to the data
No communication needed, ideal parallelism
Notation:
S — relation S
S[i, h] — partition i of relation S according to partitioning scheme h
B(S) — number of blocks of relation S
p — number of nodes
Parallel Selection
Selection can be parallelized very efficiently (embarrassingly parallel problem)
Each node performs the selection on its existing local partition.
Selection needs no context
Data can be partitioned in an arbitrary way
Cost per node: B(S)/p block accesses
Parallel Sorting
Range partitioning sort
Idea: partition relations R and S using the same partitioning scheme over the join key.
All values of R and S with the same join key end up at the same node!
All joins can be performed locally!
Partitioned join variants:
1. Co-Located Join — no re-partitioning is needed; cost: ??? (local join cost)
2. Directed Join
3. Repartition Join
Join
Cost: transport = p · B(S), local join = ???
The Asymmetric Fragment-and-Replicate Join is a special case of the Symmetric Algorithm with m = p and n = 1.
The Asymmetric Fragment-and-Replicate Join is also called Broadcast Join.
Shared Disk: does not scale infinitely; the bus and synchronization become the overhead
For updates: cache coherency problem
For reads: I/O bandwidth limits
Literature