
GERMAN-FRENCH SUMMER UNIVERSITY
FOR YOUNG RESEARCHERS 2011

CLOUD COMPUTING:
CHALLENGES AND OPPORTUNITIES

Big Data Analytics beyond Map/Reduce


17.7. 22.7. 2011
Prof. Dr. Volker Markl
TU Berlin

Shift Happens! Our Digital World!

Video courtesy of Michael Brodie, Chief Scientist, Verizon


Original "Shift Happens" video by K. Fisch and S. McLeod
Original focuses on shift in society, aimed at teachers education
Michael Brodie focuses on shift in/because of the digital world
7/25/2011

DIMA TU Berlin

Data Growth and Value


About data growth:

$600 to buy a disk drive that can store all of the world's music
5 billion mobile phones in use in 2010
30 billion pieces of content shared on Facebook every month
40% projected growth in global data per year

About the value of captured data:

250 billion Euro potential value to Europe's public sector administration
60% potential increase in retailers' operating margins possible with big data
140,000-190,000 more deep analytical talent positions needed

Source: Big Data: The next frontier for innovation, competition and productivity
(McKinsey)
7/25/2011

DIMA TU Berlin

Big Data
Data have swept into every industry and business function
an important factor of production
exabytes of data are stored by companies every year
much of modern economic activity could not take place without it

Big Data creates value in several ways


provides transparency
enables experimentation
brings about customization
and tailored products
supports human decisions
triggers new business models

Use of Big Data will become a key basis of competition and growth
companies failing to develop their analysis capabilities will fall behind
Source: Big Data: The next frontier for innovation, competition and productivity (McKinsey)
7/25/2011

DIMA TU Berlin

Big Data Analytics


Data volume keeps growing
Data Warehouse sizes of about 1PB are not uncommon!
Some businesses produce >1TB of new data per day!
Scientific scenarios are even larger (e.g. LHC experiment results in ~15PB / yr)

Some systems are required to support extreme throughput in transaction processing
  especially financial institutions

Analysis queries become more and more complex
  discovering statistical patterns is compute intensive
  may require multiple passes over the data

The performance of single computing cores or single machines is not increasing fast enough to cope with this development
7/25/2011

DIMA TU Berlin

Trends and Challenges

Trends

Claremont Report

Massive parallelization
Virtualization
Service-based computing

Re-architecting DBMS

Web-scale data management

Parallelization
Continuous optimization
Tight integration

Service-based everything
Programming Model
Combining structured and unstructured data
Media Convergence

Analytics / BI
Operational
Multi-tenancy

7/25/2011

DIMA TU Berlin

Overview
Introduction
Big Data Analytics
Map/Reduce/Merge
Introducing the Cloud
Stratosphere (PACT and Nephele)
Demo
(Thomas Bodner, Matthias Ringwald)

Mahout and Scalable Data Mining


(Sebastian Schelter)
7/25/2011

DIMA TU Berlin

Map/Reduce Revisited

BIG DATA ANALYTICS

7/25/2011

DIMA TU Berlin

Data Partitioning (I)


Partitioning the data means creating a set of disjoint subsets
Example: Sales data, every year gets its own partition

For shared-nothing, data must be partitioned across nodes


If it were replicated, it would effectively become a shared-disk with the local
disks acting like a cache (must be kept coherent)

Partitioning with certain characteristics has more advantages


Some queries can be limited to operate on certain sets only, if it is provable
that all relevant data (passing the predicates) is in that partition
Partitions can be simply dropped as a whole (data is rolled out) when it is no
longer needed (e.g. discard old sales)

7/25/2011

DIMA TU Berlin

Data Partitioning (II)


How to partition the data into disjoint sets?
Round robin: Each set gets a tuple in a round, all sets have guaranteed
equal amount of tuples, no apparent relationship between tuples in one
set.
Hash Partitioned: Define a set of partitioning columns. Generate a hash
value over those columns to decide the target set. All tuples with equal
values in the partitioning columns are in the same set.

Range Partitioned: Define a set of partitioning columns and split the domain of those columns into ranges. The range determines the target set. All tuples in one set fall into the same range. (A small sketch of the three schemes follows below.)
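
To make the three schemes concrete, here is a minimal Python sketch (an editor's illustration, not part of the original slides); the tuple layout (year, amount), the number of partitions, and the range boundaries are assumptions.

def round_robin_partition(tuples, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for i, t in enumerate(tuples):
        parts[i % num_partitions].append(t)   # each partition gets every p-th tuple
    return parts

def hash_partition(tuples, num_partitions, key=lambda t: t[0]):
    parts = [[] for _ in range(num_partitions)]
    for t in tuples:
        parts[hash(key(t)) % num_partitions].append(t)  # equal keys end up in the same partition
    return parts

def range_partition(tuples, boundaries, key=lambda t: t[0]):
    # boundaries = [2009, 2011] creates ranges (-inf,2009), [2009,2011), [2011,+inf)
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        idx = sum(1 for b in boundaries if key(t) >= b)
        parts[idx].append(t)
    return parts

sales = [(2008, 10.0), (2009, 20.0), (2011, 5.0), (2012, 7.5)]
print(hash_partition(sales, 2))
print(range_partition(sales, boundaries=[2009, 2011]))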

7/25/2011

DIMA TU Berlin

10

Map/Reduce Revisited
The data model
key/value pairs
e.g. (int, string)

Functional programming model with second-order functions

map:
  input: a key-value pair (Km, Vm)
  output: a list of key-value pairs (Kr, Vr)*

reduce:
  input: a key Kr and a list of values Vr*
  output: the key Kr and a single value Vr

The framework
  accepts a list of input pairs (Km, Vm)*
  outputs the result pairs (Kr, Vr)*
7/25/2011

DIMA TU Berlin

11

Data Flow in Map/Reduce


The framework receives the input list (Km, Vm)* and invokes MAP(Km, Vm) on each pair, each call producing a list (Kr, Vr)*.
The framework groups all emitted values by their key Kr into pairs (Kr, Vr*).
REDUCE(Kr, Vr*) is invoked once per group and emits (Kr, Vr) pairs.
The framework collects these into the final result list (Kr, Vr)*.
7/25/2011

DIMA TU Berlin

12

Map Reduce Illustrated (1)


Problem: Counting words in a parallel fashion

How many times different words appear in a set of files


juliet.txt: Romeo, Romeo, wherefore art thou Romeo?
benvolio.txt: What, art thou hurt?
Expected output: Romeo (3), art (2), thou (2), hurt (1), wherefore (1), what (1)

Solution: Map-Reduce Job


map(filename, line) {
foreach (word in line)
emit(word, 1);
}
reduce(word, numbers) {
int sum = 0;
foreach (value in numbers) {
sum += value;
}
emit(word, sum);
}
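
As a sanity check of the job above, here is a tiny single-process Python sketch (an editor's illustration, not Hadoop code) that mimics the framework: it runs the map function over each file, groups the emitted pairs by key, and applies the reduce function per group.

from collections import defaultdict

files = {
    "juliet.txt": "Romeo, Romeo, wherefore art thou Romeo?",
    "benvolio.txt": "What, art thou hurt?",
}

def map_fn(filename, line):
    for word in line.replace(",", " ").replace("?", " ").lower().split():
        yield (word, 1)                      # emit(word, 1)

def reduce_fn(word, numbers):
    yield (word, sum(numbers))               # emit(word, sum)

# "Framework": run all mappers, group values by key, run the reducers.
groups = defaultdict(list)
for name, text in files.items():
    for key, value in map_fn(name, text):
        groups[key].append(value)

result = [pair for key, values in groups.items() for pair in reduce_fn(key, values)]
print(sorted(result, key=lambda kv: -kv[1]))
# e.g. [('romeo', 3), ('art', 2), ('thou', 2), ...]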

7/25/2011

DIMA TU Berlin

13

Map Reduce Illustrated (2)

7/25/2011

DIMA TU Berlin

14

Data Analytics: Relational Algebra


Base Operators
  selection (σ)
  projection (π)
  set/bag union (∪)
  set/bag difference (\ or −)
  Cartesian product (×)

Derived Operators
  join (⋈)
  set/bag intersection (∩)
  division (÷)

Further Operators
  de-duplication
  generalized projection (grouping and aggregation)
  outer joins and semi joins
  sort

7/25/2011

DIMA TU Berlin

15

Relational Operators as Map/Reduce jobs


Selection / projection / aggregation
SQL Query:
SELECT year, SUM(price)
FROM sales
WHERE area_code = 'US'
GROUP BY year

Map/Reduce job:
map(key, tuple) {
  int year = YEAR(tuple.date);
  if (tuple.area_code == 'US')
    emit(year, { year => year, price => tuple.price });
}
reduce(key, tuples) {
  double sum_price = 0;
  foreach (tuple in tuples) {
    sum_price += tuple.price;
  }
  emit(key, sum_price);
}

7/25/2011

DIMA TU Berlin

16

Relational Operators as Map/Reduce jobs


Sorting
SQL Query:
SELECT *
FROM sales
ORDER BY year
Map/Reduce job:
map(key, tuple) {
  // range-partition by decade of the sales year
  emit(YEAR(tuple.date) DIV 10, tuple);
}
reduce(key, tuples) {
  // each reducer sorts its key range locally
  emit(key, sort(tuples));
}

7/25/2011

DIMA TU Berlin

17

Relational Operators as Map/Reduce jobs


UNION
SQL Query:
SELECT phone_number FROM employees
UNION
SELECT phone_number FROM bosses
Map/Reduce job needs two different mappers:
map(key, employees_phonebook_entry) {
  emit(employees_phonebook_entry.number, '');
}

map(key, bosses_phonebook_entry) {
  emit(bosses_phonebook_entry.number, '');
}
reduce(phone_number, tuples) {
  emit(phone_number, '');
}

7/25/2011

DIMA TU Berlin

18

Relational Operators as Map/Reduce jobs


INTERSECT
SQL Query:
SELECT first_name FROM employees
INTERSECT
SELECT first_name FROM bosses
Map/Reduce job needs two different mappers:
map(key, employee_listing_entry) {
  emit(employee_listing_entry.first_name, 'E');
}

map(key, boss_listing_entry) {
  emit(boss_listing_entry.first_name, 'B');
}
reduce(first_name, markers) {
  if ('E' in markers and 'B' in markers) {
    emit(first_name, '');
  }
}
7/25/2011

DIMA TU Berlin

19

The Petabyte Sort Benchmark


Benchmark to test the performance of distributed systems
Goal: sort one petabyte of 100-byte records
Implementation in Hadoop:
  a range partitioner splits the data into equal ranges (one for each participating node)
  the sort is basically the "range-partitioning sort" described in the parallel sorting part of these slides

7/25/2011

DIMA TU Berlin

20

Petabyte sorting benchmark

Per node: 2 quad-core Xeons @ 2.5 GHz, 4 SATA disks, 8 GB RAM (upgraded to 16 GB before the petabyte sort), 1 Gigabit Ethernet
Per rack: 40 nodes, 8 Gigabit Ethernet uplinks
7/25/2011

DIMA TU Berlin

21

Cluster Utilization during Sort

7/25/2011

DIMA TU Berlin

22

Map/Reduce Revisited

JOINS IN MAP/REDUCE

7/25/2011

DIMA TU Berlin

23

Symmetric Fragment-and-Replicate Join (II)

Nodes in the Cluster

7/25/2011

DIMA TU Berlin

24

Asymmetric Fragment-and-Replicate Join


We can do better if relation S is much smaller than R.
Idea: reuse the existing partitioning of R and replicate the whole relation S to each node.
Cost:
  p * B(S)   transport
  ???        local join

The Asymmetric Fragment-and-Replicate Join is a special case of the symmetric algorithm with m = p and n = 1.
The Asymmetric Fragment-and-Replicate Join is also called Broadcast Join.
7/25/2011

DIMA TU Berlin

25

Broadcast Join
Equi-Join: L(A,X) ⋈ R(X,C), assumption: |L| << |R|

Idea
  broadcast L completely to each node before the map phase begins
  via utilities like Hadoop's distributed cache, or mappers read L from the cluster filesystem at startup

Mapper (runs only over R)
  step 1: read the assigned input split of R into a hash table (build phase)
  step 2: scan the local copy of L and find matching R tuples (probe phase)
  step 3: emit each such pair
  alternatively, read L into the hash table, then read R and probe (see the sketch below)

No need for partition / sort / reduce processing
  the mapper outputs the final join result
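
A minimal single-process Python sketch of the broadcast (map-side) join idea; it follows the alternative variant mentioned above (hash the small broadcast relation L, probe with the local split of R). The relation contents and the split of R are assumptions for illustration.

L = [("a1", 1), ("a2", 2)]                 # L(A, X), the small relation, known to every mapper
R_splits = [                               # R(X, C), partitioned across mappers
    [(1, "c1"), (2, "c2")],
    [(2, "c3"), (3, "c4")],
]

def broadcast_join_mapper(r_split, l_broadcast):
    # build phase: hash the broadcast relation on the join key X
    table = {}
    for a, x in l_broadcast:
        table.setdefault(x, []).append(a)
    # probe phase: stream the local split of R and emit joined tuples
    for x, c in r_split:
        for a in table.get(x, []):
            yield (a, x, c)

result = [t for split in R_splits for t in broadcast_join_mapper(split, L)]
print(result)   # [('a1', 1, 'c1'), ('a2', 2, 'c2'), ('a2', 2, 'c3')]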

7/25/2011

DIMA TU Berlin

26

Repartition Join
Equi-Join: L(A,X) ⋈ R(X,C), assumption: |L| < |R|

Mapper (identical processing logic for L and R)
  emit each tuple once
  the intermediate key is a pair of
    the value of the actual join key X
    an annotation identifying which relation the tuple belongs to (L or R)

Partition and sort
  partition by the hash of the join key
  reducer input is ordered first on the join key, then on the relation name
  output: a sequence of L(i), R(i) blocks of tuples for ascending join key i

Reduce
  collect all L-tuples of the current L(i) block in a hash map
  combine them with each R-tuple of the corresponding R(i) block
(A single-process sketch of this dataflow follows below.)
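
A minimal single-process Python sketch of the repartition (reduce-side) join; for brevity it collapses the secondary sort on the relation tag into explicit L/R lists per join key. The relation contents are assumptions.

from collections import defaultdict

L = [("a1", 1), ("a2", 2)]            # L(A, X)
R = [(1, "c1"), (2, "c2"), (2, "c3")] # R(X, C)

def map_L(a, x):
    yield ((x, "L"), a)               # key = (join key, relation tag)

def map_R(x, c):
    yield ((x, "R"), c)

# "Framework": partition/group by the join key; the tag decides which side the value joins on.
groups = defaultdict(lambda: {"L": [], "R": []})
for a, x in L:
    for (x_key, tag), value in map_L(a, x):
        groups[x_key][tag].append(value)
for x, c in R:
    for (x_key, tag), value in map_R(x, c):
        groups[x_key][tag].append(value)

def reduce_fn(x, block):
    for a in block["L"]:              # L(i) block held in memory
        for c in block["R"]:          # R(i) block streamed against it
            yield (a, x, c)

print([t for x, block in groups.items() for t in reduce_fn(x, block)])
# [('a1', 1, 'c1'), ('a2', 2, 'c2'), ('a2', 2, 'c3')]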

7/25/2011

DIMA TU Berlin

27

Multi-Dimensional Partitioned Join


Equi-Join: D1(A,X) ⋈ F(C,X,Y) ⋈ D2(B,Y)
  star schema with fact table F and dimension tables Di

Fragment
  D1 and D2 are partitioned independently
  the partitions for F are defined as D1 x D2

Replicate
  for an F-tuple f, the partition is uniquely defined as (hash(f.x), hash(f.y))
  for a D1-tuple d1, there is one degree of freedom (d1.y is undefined)
  D1-tuples are thus replicated for each possible y partition
  symmetric for D2

Reduce
  find and emit (f, d1, d2) triples
  depending on the input sorting, different join strategies are possible
(A sketch of the partition assignment follows below.)
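
A small Python sketch of the partition assignment only, since the replication rule is the interesting part; the 2 x 2 reducer grid and the tuple layouts are assumptions by the editor.

NX, NY = 2, 2                                  # reducer grid: hash(x) x hash(y)

def partitions_for_fact(f):
    c, x, y = f
    return [(hash(x) % NX, hash(y) % NY)]      # exactly one target partition

def partitions_for_d1(d1):
    a, x = d1
    return [(hash(x) % NX, j) for j in range(NY)]   # replicate along the free y axis

def partitions_for_d2(d2):
    b, y = d2
    return [(i, hash(y) % NY) for i in range(NX)]   # replicate along the free x axis

print(partitions_for_fact(("c1", 7, 3)))       # one partition for the fact tuple
print(partitions_for_d1(("a1", 7)))            # one copy per y partition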

7/25/2011

DIMA TU Berlin

29

Joins in Hadoop

(Figure: execution time of the join strategies in Hadoop, plotted over the number of nodes and over the selectivity; "Asym." denotes the Multi-Dimensional Partitioned Join.)

7/25/2011

DIMA TU Berlin

31

Parallel DBMS vs. Map/Reduce


                             Parallel DBMS                          Map/Reduce
  Schema Support             yes                                    not built in
  Indexing                   yes                                    not built in
  Programming Model          stating what you want                  presenting an algorithm
                             (declarative: SQL)                     (procedural: C/C++, Java, ...)
  Optimization               yes                                    not built in
  Scaling                    1 - 500 nodes                          10 - 5000 nodes
  Fault Tolerance            limited                                good
  Execution                  pipelines results between operators    materializes results between phases

7/25/2011

DIMA TU Berlin

32

Simplified Relational Data Processing on Large Clusters

MAP-REDUCE-MERGE

7/25/2011

DIMA TU Berlin

33

Map-Reduce-Merge
Motivation
  Map/Reduce does not directly support processing multiple related, heterogeneous datasets
  difficulties and/or inefficiency when one must implement relational operators like joins

Map-Reduce-Merge
  adds a merge phase to the programming model
  Goal: efficiently merge data that is already partitioned and sorted (or hashed)

Map-Reduce-Merge workflows are comparable to RDBMS execution plans
  parallel join algorithms can be implemented more easily

Signatures:
  map:    (k1, v1)                  → [(k2, v2)]
  reduce: (k2, [v2])                → (k2, [v3])
  merge:  ((k2, [v3]), (k3, [v4]))  → [(k4, v5)]
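
A toy Python sketch of the additional merge phase (an editor's illustration, not the actual Map-Reduce-Merge API): two reduced, keyed outputs are merged pairwise on matching keys.

reduced_emp  = {1: ["alice", "bob"], 2: ["carol"]}   # (k2, [v3]) from one reduce
reduced_dept = {1: ["sales"], 3: ["hr"]}             # (k3, [v4]) from another reduce

def merge(left, right):
    # equi-merge on matching keys; other merge logics (outer, theta) are possible
    for k in sorted(set(left) & set(right)):
        for lv in left[k]:
            for rv in right[k]:
                yield (k, (lv, rv))                  # (k4, v5)

print(list(merge(reduced_emp, reduced_dept)))
# [(1, ('alice', 'sales')), (1, ('bob', 'sales'))]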

7/25/2011

DIMA TU Berlin

34

Introducing

THE CLOUD

7/25/2011

DIMA TU Berlin

35

In the Cloud

7/25/2011

DIMA TU Berlin

36

"The interesting thing about cloud computing is


that we've redefined cloud computing to include
everything that we already do.
I can't think of anything that isn't cloud computing
with all of these announcements.
The computer industry is the only industry
that is more fashion-driven than women's fashion.
Maybe I'm an idiot, but I have no idea what anyone is talking about.
What is it?
It's complete gibberish. It's insane.
When is this idiocy going to stop?

"We'll make cloud computing announcements.


I'm not going to fight this thing.
But I don't understand what we would do differently
in the light of cloud."
7/25/2011

DIMA TU Berlin

37

Steve Ballmer's Vision of Cloud Computing

7/25/2011

DIMA TU Berlin

38

What does Hadoop have to do with Cloud?


A few months back, Hamid Pirahesh and I were doing a roundtable with a customer of ours, on cloud and data. We got into a set of standard issues -- data security being the primary one -- but when the dialog turned to Hadoop, a person raised his hand and asked:

"What has Hadoop got to do with cloud?"

I responded, somewhat quickly perhaps, "Nothing specific, and I am willing to have a dialog with you on Hadoop in and out of the cloud context", but it got me thinking. Is there a relationship, or not?

7/25/2011

DIMA TU Berlin

39

Re-inventing the wheel - or not?

7/25/2011

DIMA TU Berlin

40

Parallel Analytics in the Cloud beyond Map/Reduce

STRATOSPHERE

7/25/2011

DIMA TU Berlin

41

The Stratosphere Project*


Explore the power of Cloud computing for complex information management applications

Use cases
  Scientific data
  Life sciences
  Linked data
  ...

Database-inspired approach
  analyze, aggregate, and query textual and (semi-)structured data

Research and prototype a web-scale data analytics infrastructure
  (StratoSphere - Above the Clouds: a query processor running on Infrastructure as a Service)

* FOR 1306: DFG-funded collaborative project among TU Berlin, HU Berlin and HPI Potsdam
7/25/2011

DIMA TU Berlin

42

Example: Climate Data Analysis

Climate model output (up to 200 parameters, ~10 TB; grids of e.g. 1100 km x 950 km at 2 km resolution), for example:
PS,1,1,0,Pa,surface pressure
T_2M,11,105,0,K,air temperature
TMAX_2M,15,105,2,K,2m maximum temperature
TMIN_2M,16,105,2,K,2m minimum temperature
U,33,110,0,ms-1,U-component of wind
V,34,110,0,ms-1,V-component of wind
QV_2M,51,105,0,kgkg-1,2m specific humidity
CLCT,71,1,0,1,total cloud cover

Analysis tasks on climate data sets
  Validate climate models
  Locate hot spots in climate models (monsoon, drought, flooding)
  Compare climate models based on different parameter settings

Necessary data processing operations
  Filter
  Aggregation (sliding window)
  Join
  Multi-dimensional sliding-window operations
  Geospatial/temporal joins
  Uncertainty handling

7/25/2011

DIMA TU Berlin

43

Further Use-Cases
Text Mining in the biosciences

Cleansing of linked open data

7/25/2011

DIMA TU Berlin

44

Outline
Architecture of the Stratosphere System
The PACT Programming Model
The Nephele Execution Engine
Parallelizing PACT Programs

7/25/2011

DIMA TU Berlin

45

Architecture Overview

Three comparable stacks, each consisting of a higher-level language, a parallel programming model, and an execution engine:

                              Hadoop stack        Dryad stack         Stratosphere stack
  Higher-level language       JAQL, Pig, Hive     Scope, DryadLINQ    JAQL? Pig? Hive?
  Parallel programming model  Map/Reduce          -                   PACT
  Execution engine            Hadoop              Dryad               Nephele

7/25/2011

DIMA TU Berlin

46

Data-Centric Parallel Programming


Map / Reduce
  schema free
  many semantics hidden inside the user code (tricks required to push operations into map/reduce)
  a single default way of parallelization

Relational Databases
  schema bound (relational model)
  well-defined properties and requirements for parallelization
  flexible and optimizable

GOAL: advance the M/R programming model


7/25/2011

DIMA TU Berlin

47

Stratosphere in a Nutshell
PACT Programming Model
Parallelization Contract (PACT)
Declarative definition of data parallelism
Centered around second-order functions
Generalization of map/reduce
Nephele
Dryad-style execution engine
Evaluates dataflow graphs in parallel
Data is read from distributed filesystem
Flexible engine for complex jobs

PACT Compiler

Nephele

Stratosphere = Nephele + PACT


Compiles PACT programs to Nephele dataflow graphs
Combines parallelization abstraction and flexible execution
Choice of execution strategies gives optimization potential
7/25/2011

DIMA TU Berlin

48

Overview
Parallelization Contracts (PACTs)
The Nephele Execution Engine
Compiling/Optimizing Programs
Related Work

7/25/2011

DIMA TU Berlin

49

Intuition for Parallelization Contracts


Map and reduce are second-order functions
  they call first-order functions (the user code)
  they provide the first-order functions with subsets of the input data

They define dependencies between the records that must be obeyed when splitting them into subsets
  i.e. the required partitioning properties

Map
  all records are independently processable

Reduce
  records with identical key must be processed together

7/25/2011

DIMA TU Berlin

50

Contracts beyond Map and Reduce


Cross
  two inputs
  each combination of records from the two inputs is built and is independently processable

Match
  two inputs; each combination of records with equal key from the two inputs is built
  each pair is independently processable

CoGroup
  multiple inputs
  records with identical key are grouped for each input
  the groups of all inputs with identical key are processed together
(A sketch of these contracts as plain functions follows below.)
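
A minimal single-process Python sketch of what the Cross, Match, and CoGroup contracts feed to the user code (an editor's illustration only; the real PACT API is Java-based and executes in parallel).

from collections import defaultdict
from itertools import product

# Each contract decides WHICH record combinations the user function sees.
def cross(first, second, user_fn):
    for a, b in product(first, second):          # every combination of records
        yield from user_fn(a, b)

def match(first, second, user_fn):
    index = defaultdict(list)
    for k, v in first:
        index[k].append(v)
    for k, v in second:                          # only combinations with equal key
        for w in index[k]:
            yield from user_fn(k, w, v)

def cogroup(first, second, user_fn):
    groups = defaultdict(lambda: ([], []))
    for k, v in first:
        groups[k][0].append(v)
    for k, v in second:
        groups[k][1].append(v)
    for k, (g1, g2) in groups.items():           # whole groups per key, per input
        yield from user_fn(k, g1, g2)

# Example user function for match: emit joined pairs.
print(list(match([(1, "a"), (2, "b")], [(1, "x"), (1, "y")],
                 lambda k, l, r: [(k, l, r)])))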

7/25/2011

DIMA TU Berlin

51

Parallelization Contracts (PACTs)


Second-order function that defines properties on the input and
output data of its associated first-order function

Data → Input Contract → first-order function (user code) → Output Contract → Data

Input Contract
Specifies dependencies between records
(a.k.a. "What must be processed together?")
Generalization of map/reduce
Logically: Abstracts a (set of) communication pattern(s)
For "reduce": repartition-by-key
For "match" : broadcast-one or repartition-by-key
Output Contract
Generic properties preserved or produced by the user code
key property, sort order, partitioning, etc.
Relevant to parallelization of succeeding functions
7/25/2011

DIMA TU Berlin

52

Optimizing PACT Programs


For certain PACTs, several distribution patterns exist that fulfill the contract
  the choice of the best one is up to the system

Created properties (like a partitioning) may be reused for later operators
  need a way to find out whether they still hold after the user code
  output contracts are a simple way to specify that
  example output contracts: Same-Key, Super-Key, Unique-Key

Using these properties, optimization across multiple PACTs is possible
  a simple System-R-style optimizer approach is possible

7/25/2011

DIMA TU Berlin

53

From PACT Programs to Data Flows


PACT code (user function, grouping):

function match(Key k, Tuple val1, Tuple val2) -> (Key, Tuple)
{
  Tuple res = val1.concat(val2);
  res.project(...);
  Key newKey = res.getColumn(1);
  return (newKey, res);
}

Nephele code (communication and driver logic around the user function):

invoke():
  while (!input2.eof)
    KVPair p = input2.next();
    hash-table.put(p.key, p.value);
  while (!input1.eof)
    KVPair p = input1.next();
    KVPair t = hash-table.get(p.key);
    if (t != null)
      KVPair[] result = UF.match(p.key, p.value, t.value);
      output.write(result);
  end

The PACT program (e.g. UF1 and UF2 as maps feeding UF3 as a match feeding UF4 as a reduce) is compiled into a Nephele DAG (vertices V1-V4 connected by in-memory and network channels), which is then spanned into the parallel data flow.

7/25/2011

DIMA TU Berlin


54

NEPHELE EXECUTION
ENGINE
7/25/2011

DIMA TU Berlin

55

Nephele Execution Engine


Executes Nephele schedules
compiled from PACT programs

Design goals

Exploit scalability/flexibility of clouds


Provide predictable performance
Efficient execution on 1000+ cores
Flexible fault tolerance mechanisms

PACT Compiler

Nephele

Inherently designed to run on top of an IaaS cloud
  heterogeneity through different types of VMs
  knows the cloud's pricing model

Infrastructure-as-a-Service
  VM allocation and de-allocation
  network topology inference

7/25/2011

DIMA TU Berlin

56

Nephele Architecture
Standard master worker pattern
Workers can be allocated on demand

(Figure: the client submits jobs over the public network (Internet) to the master, which manages the workers over a private/virtualized network; workers use persistent storage, and the master allocates and de-allocates them through the cloud controller of the compute cloud as the workload varies over time.)

7/25/2011

DIMA TU Berlin

57

Structure of a Nephele Schedule

Example schedule (DAG):
  Input 1  - Task: LineReaderTask.program, Input: s3://user:key@storage/input
  Task 1   - Task: MyTask.program
  Output 1 - Task: LineWriterTask.program, Output: s3://user:key@storage/outp

7/25/2011

Nephele Schedule is represented as DAG


Vertices represent tasks
Edges denote communication channels
Mandatory information for each vertex
Task program
Input/output data location (I/O vertices
only)
Optional information for each vertex
Number of subtasks (degree of parallelism)
Number of subtasks per virtual machine
Type of virtual machine (#CPU cores, RAM)
Channel types
Sharing virtual machines among tasks
DIMA TU Berlin

58

Internal Schedule Representation


Nephele schedule is converted into internal
representation
Output
1 1 (1)
Output
ID: 2
Type: m1.large

Task
1 1 (2)
Task

Explicit parallelization
Parallelization range (mpl) derived from PACT
Wiring of subtasks derived from PACT

Explicit assignment to virtual machines


Specified by ID and type
Type refers to hardware profile

ID: 1
Type: m1.small

Input
Input
1 1 (1)

7/25/2011

DIMA TU Berlin

59

Execution Stages

Issues with on-demand allocation:
  When to allocate virtual machines?
  When to deallocate virtual machines?
  No guarantee of resource availability!

Stages ensure three properties:
  VMs of the upcoming stage are available
  all workers are set up and ready
  data of previous stages is stored in a persistent manner

(Figure: a two-stage schedule; Input 1 and Task 1 run on an m1.small VM in Stage 0, Output 1 on an m1.large VM in Stage 1.)

7/25/2011

DIMA TU Berlin

60

Channel Types

Network channels (pipeline)
  vertices must be in the same stage

In-memory channels (pipeline)
  vertices must run on the same VM
  vertices must be in the same stage

File channels
  vertices must run on the same VM
  vertices must be in different stages

(Figure: the two-stage example schedule annotated with channel types.)

7/25/2011

DIMA TU Berlin

61

Some Evaluation (1/2)


Demonstrates benefits of dynamic resource allocation
Challenge: Sort and Aggregate
Sort 100 GB of integer numbers (from GraySort benchmark)
Aggregate TOP 20% of these numbers (exact result!)
First execution as map/reduce jobs with Hadoop
Three map/reduce jobs on 6 VMs (each with 8 CPU cores, 24 GB
RAM)
TeraSort code used for sorting
Custom code for aggregation
Second execution as map/reduce jobs with Nephele
A map/reduce compatibility layer allows running Hadoop M/R programs
Nephele controls resource allocation
Idea: Adapt allocated resources to required processing power
7/25/2011

DIMA TU Berlin

62

First Evaluation (2/2)

(Figures: average instance utilization [%] broken down into USR/SYS/WAIT, and average network traffic among instances [MBit/s], plotted over time [minutes] for the M/R jobs on Hadoop and for the M/R jobs on Nephele. The Hadoop run shows poor resource utilization, while the Nephele run deallocates VMs automatically, adapting the allocated resources to the required processing power.)

7/25/2011

DIMA TU Berlin

63

References
[WK09] Daniel Warneke, Odej Kao: Nephele: efficient parallel data processing in the cloud. SC-MTAGS 2009
[BEH+10] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. SoCC 2010: 119-130
[ABE+10] A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2): 1625-1628 (2010)
[AEH+11] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, et al.: MapReduce and PACT - Comparing Data Parallel Programming Models, to appear at BTW 2011
7/25/2011

DIMA TU Berlin

64

Ongoing Work

Adaptive Fault-Tolerance (Odej Kao)


Robust Query Optimization (Volker Markl)
Parallelization of the PACT Programming Model (Volker Markl)
Continuous Re-Optimization (Johann-Christoph Freytag)
Validating Climate Simulations with Stratosphere (Volker Markl)
Text Analysis with Stratosphere (Ulf Leser)
Data Cleansing with Stratosphere (Felix Naumann)

JAQL on Stratosphere: Student Project at TUB

Open Source Release: Nephele + PACT (TUB, HPI, HU)

7/25/2011

DIMA TU Berlin

65

Overview
Introduction
Big Data Analytics
Map/Reduce/Merge
Introducing the Cloud
Stratosphere (PACT and Nephele)
Demo
(Thomas Bodner, Matthias Ringwald)

Mahout and Scalable Data Mining


(Sebastian Schelter)
7/25/2011

DIMA TU Berlin

66

The Information Revolution

http://mediatedcultures.net/ksudigg/?p=120
7/25/2011

DIMA TU Berlin

67

Demo Screenshots

WEBLOG ANALYSIS
QUERY
7/25/2011

DIMA TU Berlin

74

Weblog Query and Plan

SELECT r.url, r.rank, r.avg_duration


FROM Documents d
JOIN
Rankings r
ON r.url = d.url
WHERE CONTAINS(d.text, [keywords])
AND r.rank > [rank]
AND NOT EXISTS
(SELECT * FROM Visits v
WHERE v.url = d.url
AND v.date < [date]);

7/25/2011

DIMA TU Berlin

75

Weblog Query Job Preview

7/25/2011

DIMA TU Berlin

76

Weblog Query Optimized Plan

7/25/2011

DIMA TU Berlin

77

Weblog Query Nephele Schedule in Execution

7/25/2011

DIMA TU Berlin

78

Demo Screenshots

ENUMERATING TRIANGLES
FOR SOCIAL NETWORK
MINING
7/25/2011

DIMA TU Berlin

79

Enumerating Triangles Graph and Job

7/25/2011

DIMA TU Berlin

80

Enumerating Triangles Job Preview

7/25/2011

DIMA TU Berlin

81

Enumerating Triangles Optimized Plan

7/25/2011

DIMA TU Berlin

82

Enumerating Triangles Nephele Schedule in


Execution

7/25/2011

DIMA TU Berlin

83

Scalable data mining

APACHE MAHOUT
Sebastian Schelter

7/25/2011

DIMA TU Berlin

85

Apache Mahout: Overview


What is Apache Mahout?
An Apache Software Foundation project aiming to create scalable
machine learning libraries under the Apache License
focus on scalability, not a competitor for R or Weka
in use at Adobe, Amazon, AOL, Foursquare, Mendeley, Twitter, Yahoo

Scalability
  time is proportional to problem size divided by resource size: t ∝ P / R
  does not imply Hadoop or parallel execution, although the majority of implementations use Map/Reduce

7/25/2011

DIMA TU Berlin

86

Apache Mahout: Clustering


Clustering
Unsupervised learning: assign a set of data points into subsets (called
clusters) so that points in the same cluster are similar in some sense

Algorithms

K-Means
Fuzzy K-Means
Canopy
Mean Shift
Dirichlet Process
Spectral Clustering

7/25/2011

DIMA TU Berlin

87

Apache Mahout: Classification


Classification
supervised learning: learn a decision function that predicts labels y on
data points x given a set of training samples {(x,y)}

Algorithms
Logistic Regression (sequential but fast)
Naive Bayes / Complementary Naive Bayes
Random Forests

7/25/2011

DIMA TU Berlin

88

Apache Mahout: Collaborative Filtering


Collaborative Filtering
approach to recommendation mining: given a user's preferences for
items, guess which other items would be highly preferred

Algorithms
Neighborhood methods: Itembased Collaborative Filtering
Latent factor models: matrix factorization using Alternating Least
Squares

7/25/2011

DIMA TU Berlin

89

Apache Mahout: Singular Value Decomposition

Singular Value Decomposition


matrix decomposition technique used to create an optimal low-rank
approximation of a matrix
used for dimensional reduction, unsupervised feature selection, Latent
Semantic Indexing

Algorithms
Lanczos Algorithm
Stochastic SVD

7/25/2011

DIMA TU Berlin

90

Comparing implementations of data mining algorithms in


Hadoop/Mahout and Nephele/PACT

SCALABLE DATA MINING

7/25/2011

DIMA TU Berlin

92

Problem description
Pairwise row similarity computation
Computes the pairwise similarities of the rows (or
columns) of a sparse matrix using a predefined
similarity function
used for computing document similarities in large corpora
used to precompute item-item similarities for recommendations (Collaborative Filtering)
the similarity function can be cosine, Pearson correlation, log-likelihood ratio, Jaccard coefficient, ...

7/25/2011

DIMA TU Berlin

93

Map/Reduce
Map/Reduce Step 1
  compute similarity-specific row weights
  transpose the matrix, thereby creating an inverted index

Map/Reduce Step 2
  map out all pairs of co-occurring values
  collect all co-occurring values per row pair, compute the similarity value

Map/Reduce Step 3
  use secondary sort to keep only the k most similar rows
(see the sketch below)
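
A compact single-process Python sketch of the same idea (cosine similarity via an inverted index of the columns), mirroring the three steps above; the sparse matrix and the choice of cosine are assumptions by the editor, and the top-k pruning of step 3 is omitted.

from collections import defaultdict
from math import sqrt

rows = {                       # sparse matrix: row id -> {column: value}
    "r1": {0: 1.0, 2: 2.0},
    "r2": {0: 2.0, 1: 1.0},
    "r3": {2: 4.0},
}

# Step 1: row weights (norms) and the transposed matrix (inverted index).
norms = {r: sqrt(sum(v * v for v in cols.values())) for r, cols in rows.items()}
inverted = defaultdict(list)   # column -> [(row, value)]
for r, cols in rows.items():
    for c, v in cols.items():
        inverted[c].append((r, v))

# Step 2: emit co-occurring value pairs per column, sum dot products per row pair.
dots = defaultdict(float)
for c, entries in inverted.items():
    for i, (r1, v1) in enumerate(entries):
        for r2, v2 in entries[i + 1:]:
            dots[tuple(sorted((r1, r2)))] += v1 * v2

# Step 3: normalize to cosine similarity (keeping only the top k is omitted here).
similarities = {pair: d / (norms[pair[0]] * norms[pair[1]]) for pair, d in dots.items()}
print(similarities)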

PACT

7/25/2011

DIMA TU Berlin

94

Comparison
Equivalent implementations in Mahout and PACT
problem maps relatively well to the Map/Reduce paradigm
insight: standard Map/Reduce code can be ported to Nephele/PACT
with very little effort
output contracts and memory forwards offer hooks for performance improvements (unfortunately not applicable in this particular use case)

7/25/2011

DIMA TU Berlin

95

Problem description

K-Means
Simple iterative clustering algorithm

uses a predefined number of clusters (k)


start with a random selection of cluster centers
assign points to nearest cluster
recompute cluster centers, iterate until convergence

7/25/2011

DIMA TU Berlin

96

Mahout
Initialization
  generate k random cluster centers from the data points (optional)
  put the centers into the distributed cache

Map
  find the nearest cluster for each data point
  emit (cluster id, data point)

Combine
  partially aggregate the assigned points per cluster

Reduce
  compute the new centroid for each cluster

Repeat until convergence, then
  output the converged cluster centers (or the centers after n iterations)
  optionally output the clustered data points
(A single-process sketch of one iteration follows below.)
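
A single-process Python sketch of one such iteration (assignment plus centroid recomputation); the data points, k = 2, and the fixed iteration count are assumptions by the editor.

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5), (0.5, 1.5)]
centers = [(1.0, 1.0), (8.0, 8.0)]          # initial centers (normally chosen at random)

def nearest(p, centers):
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))

def kmeans_iteration(points, centers):
    # "map": assign each point to its nearest center;
    # "combine/reduce": sum and count per cluster, then recompute the centroids.
    sums = [[0.0] * len(centers[0]) for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        c = nearest(p, centers)
        counts[c] += 1
        sums[c] = [s + x for s, x in zip(sums[c], p)]
    return [tuple(s / counts[i] for s in sums[i]) if counts[i] else centers[i]
            for i in range(len(centers))]

for _ in range(10):                          # iterate a fixed number of rounds
    centers = kmeans_iteration(points, centers)
print(centers)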

7/25/2011

DIMA TU Berlin

97

Stratosphere Implementation

7/25/2011

DIMA TU Berlin

Source: www.stratosphere.eu
98

Code analysis
Comparison of the implementations
  the actual execution plans in the underlying distributed systems are nearly equivalent
  the Stratosphere implementation is more intuitive and closer to the mathematical formulation of the algorithm

7/25/2011

DIMA TU Berlin

99

Problem description

Naive Bayes
  a simple classification algorithm based on Bayes' theorem

General Naive Bayes
  assumes feature independence
  often gives good results even if this assumption does not hold

Mahout's version of Naive Bayes
  a specialized approach for document classification
  based on the tf-idf weight metric

7/25/2011

DIMA TU Berlin

100

M/R Overview

Classification
straight-forward approach, simply reads complete model into memory
classification is done in the mapper, reducer only sums up statistics for
confusion matrix

Trainer
  much higher complexity
  needs to count documents, features, features per document, features per corpus
  Mahout's implementation is optimized by exploiting Hadoop-specific features like secondary sort and reading results into memory from the cluster filesystem

7/25/2011

DIMA TU Berlin

101

M/R Trainer Overview


Train Data

Feature
Extractor
TermDoc
Counter

termDocC

wordFreq

Weight Summer

Tf-Idf
Calculation

Tf-Idf

tfIdf

WordFr.
Counter
Doc
Counter
Feature
Counter

kj

kj

Theta
Normalizer

docC

featureC

Theta N.

Vocab
Counter
vocabC

thetaNorm
7/25/2011

DIMA TU Berlin

102

Pact Trainer Overview

PACT implementation
  looks even more complex, but PACTs can be combined in a much more fine-grained manner
  since PACT offers the ability to use local memory forwards, more and higher-level second-order functions such as Cross and Match can be used
  fewer framework-specific tweaks are necessary for a performant implementation
  the visualized execution plan is much more similar to the algorithmic formulation of computing several counts and finally combining them into a model
  sub-calculations can be inspected and unit-tested in isolation

7/25/2011

DIMA TU Berlin

103

PACT Trainer Overview

7/25/2011

DIMA TU Berlin

104

Hot Path

(Figure: data volumes along the hot path of the trainer job: 7.4 GB, 14.8 GB, 5.89 GB, 5.89 GB, 3.53 GB, 84 kB, 8 kB, 5 kB.)

7/25/2011

DIMA TU Berlin

105

Pact Trainer Overview

Future work: the PACT implementation can still be tuned by
  sampling the input data
  more flexible memory management in Stratosphere
  employing the context concept of PACTs for simpler distribution of computed parameters

7/25/2011

DIMA TU Berlin

106

Thank You
(Danke - German, Merci - French, Gracias - Spanish, Grazie - Italian, Obrigado - Brazilian Portuguese, and thanks in Hindi, Thai, Traditional and Simplified Chinese, Russian, Arabic, Tamil, Japanese, and Korean)
7/25/2011

DIMA TU Berlin

107

Programming in a more abstract way

PARALLEL DATA FLOW LANGUAGES
7/25/2011

DIMA TU Berlin

108

Introduction
MapReduce paradigm is too low-level

Only two declarative primitives (map + reduce)


Extremely rigid (one input, two-stage data flow)
Custom code for e.g.: projection and filtering
Code is difficult to reuse and maintain
Impedes Optimizations

Combination of high-level declarative querying and low-level


programming with MapReduce
Dataflow Programming Languages
Hive
JAQL
Pig

7/25/2011

DIMA TU Berlin

109

Hive
Data warehouse infrastructure built on top of Hadoop,
providing:
Data Summarization
Ad hoc querying

Simple query language: Hive QL (based on SQL)


Extendable via custom mappers and reducers
Subproject of Hadoop
No Hive format
http://hadoop.apache.org/hive/

7/25/2011

DIMA TU Berlin

110

Hive - Example
LOAD DATA INPATH '/data/visits' INTO TABLE visits;
INSERT OVERWRITE TABLE visitCounts
SELECT url, category, count(*)
FROM visits
GROUP BY url, category;

LOAD DATA INPATH '/data/urlInfo' INTO TABLE urlInfo;
INSERT OVERWRITE TABLE visitCounts
SELECT vc.*, ui.*
FROM visitCounts vc JOIN urlInfo ui ON (vc.url = ui.url);

INSERT OVERWRITE TABLE gCategories
SELECT category, count(*)
FROM visitCounts
GROUP BY category;

INSERT OVERWRITE TABLE topUrls
SELECT TRANSFORM (visitCounts) USING 'top10';
7/25/2011

DIMA TU Berlin

111

JAQL
Higher level query language for JSON documents

Developed at IBM's Almaden research center


Supports several operations known from SQL

Grouping, Joining, Sorting

Built-in support for

Loops, Conditionals, Recursion

Custom Java methods extend JAQL


JAQL scripts are compiled to MapReduce jobs
Various I/O

Local FS, HDFS, Hbase, Custom I/O adapters

http://www.jaql.org/
7/25/2011

DIMA TU Berlin

112

JAQL - Example
registerFunction("top10", "de.tuberlin.cs.dima.jaqlextensions.top10");
$visits = hdfsRead("/data/visits");
$visitCounts =
  $visits
  -> group by $url = $.url
     into { $url, num: count($) };
$urlInfo = hdfsRead("/data/urlInfo");
$visitCounts =
  join $visitCounts, $urlInfo
  where $visitCounts.url == $urlInfo.url;
$gCategories =
  $visitCounts
  -> group by $category = $.category
     into { $category, num: count($) };
$topUrls = top10($gCategories);
hdfsWrite("/data/topUrls", $topUrls);
7/25/2011

DIMA TU Berlin

113

Pig
A platform for analyzing large data sets
Pig consists of two parts:

PigLatin: A Data Processing Language


Pig Infrastructure: An Evaluator for PigLatin programs
Pig compiles Pig Latin into physical plans
Plans are to be executed over Hadoop

Interface between the declarative style of SQL and the low-level, procedural style of MapReduce

http://hadoop.apache.org/pig/

7/25/2011

DIMA TU Berlin

114

Pig - Example
visits      = load '/data/visits' as (user, url, time);

visitCounts = foreach visits generate url, count(visits);

urlInfo     = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;

topUrls     = foreach gCategories generate top(visitCounts, 10);

store topUrls into '/data/topUrls';

Example taken from the "Pig Latin: A Not-So-Foreign Language For Data Processing" talk, SIGMOD 2008

Example taken from:


Pig Latin: A Not-So-Foreign Language For Data Processing Talk, Sigmod 2008

7/25/2011

DIMA TU Berlin

115

Literature

C. Olston, et al. (2008). `Pig Latin: a not-so-foreign language for data


processing'. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international
conference on Management of data, pp. 1099-1110, New York, NY, USA. ACM.
Apache Pig http://wiki.apache.org/pig/FrontPage
Hive - A Warehousing Solution Over a Map-Reduce Framework. Thusoo,
Ashish; Sarma, Joydeep Sen; Jain, Namit; Shao, Zheng; Chakka, Prasad; Anthony,
Suresh; Liu, Hao; Wyckoff, Pete; Murthy, Raghotham
Apache Hive http://wiki.apache.org/hadoop/Hive
Towards a Scalable Enterprise Content Analytics Platform. Kevin S. Beyer, Vuk
Ercegovac, Rajasekar Krishnamurthy, Sriram Raghavan, Jun Rao, Frederick Reiss,
Eugene J. Shekita, David E. Simmen, Sandeep Tata, Shivakumar Vaithyanathan,
Huaiyu Zhu. IEEE Data Eng. Bull. (32): 28-35 (2009)
JAQL http://code.google.com/p/jaql/wiki/

7/25/2011

DIMA TU Berlin

116

QUERY COPROCESSING ON
GRAPHICS PROCESSORS
7/25/2011

DIMA TU Berlin

117

Query Coprocessing on GPUs


Graphics Processors (GPUs) have recently emerged as powerful
coprocessors for general purpose computation
10x computational power compared to the CPU
5x memory bandwith compared to the CPU

Parallel primitives available for query processing that

7/25/2011

provide exploitation of GPU hardware features such as high thread parallelism and
reduction of memory stalls through the fast local memory
are scalable to hundreds of processors because of their lock-free design and low
synchronization cost through the use of local memory

DIMA TU Berlin

118

Query Coprocessing on GPUs


Map
given an array of data tuples and a function, a map applies the function to every tuple
uses multiple thread groups to scan the relation with each thread group being
responsible for a segment of the relation
the access pattern of the threads in each thread group is designed to exploit the
coalesced memory access feature on the GPU

Scatter and Gather


Scatter: perform indexed writes to a relation (e.g. hashing) defined by a location array
Gather: perform indexed reads from a relation also defined by a location array
can be implemented using the multipass optimization scheme to improve their temporal
locality

7/25/2011

DIMA TU Berlin

119

Query Coprocessing on GPUs


Prefix scan
applies a binary operator to the input relation
example: prefix sum, an important operation in parallel databases

Reduce
computes a value based on the input relation
implemented as multipass algorithm by utilizing local memory optimization
logarithmic number of passes constrained by local memory size per multiprocessor
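
To pin down what these primitives compute, here is a tiny sequential Python sketch of their semantics (an editor's illustration; the GPU versions are massively parallel and exploit local memory and coalesced accesses, none of which is modeled here).

def map_prim(relation, fn):
    return [fn(t) for t in relation]

def scatter(relation, locations):
    out = [None] * len(relation)
    for i, t in enumerate(relation):
        out[locations[i]] = t          # indexed write (e.g. hashing tuples to buckets)
    return out

def gather(relation, locations):
    return [relation[i] for i in locations]   # indexed read

def prefix_scan(relation, op=lambda a, b: a + b, init=0):
    out, acc = [], init                # exclusive prefix scan (e.g. prefix sum)
    for t in relation:
        out.append(acc)
        acc = op(acc, t)
    return out

def reduce_prim(relation, op=lambda a, b: a + b, init=0):
    acc = init                         # single value computed over the whole input
    for t in relation:
        acc = op(acc, t)
    return acc

print(prefix_scan([3, 1, 4, 1, 5]))    # [0, 3, 4, 8, 9]
print(reduce_prim([3, 1, 4, 1, 5]))    # 14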

7/25/2011

DIMA TU Berlin

120

An Architectural Hybrid of MapReduce and DBMS

HADOOP DB

7/25/2011

DIMA TU Berlin

121

Parallel Data Processing Architectures


Two major architectures:
  1. Parallel databases: standard relational databases in a (usually) shared-nothing cluster
  2. MapReduce: data analysis via parallel map and reduce jobs in a replicated cluster

Both approaches have their pros and cons.

7/25/2011

DIMA TU Berlin

122

Parallel RDBMs
Pros:
Usually very good and consistent performance.
Flexible and proven interface (SQL).

Cons:
Scaling is rather limited (10s of nodes).
Does not work well in heterogeneous clusters.
Not very Fault-Tolerant.

7/25/2011

DIMA TU Berlin

123

MapReduce
Pros:
Very fault-tolerant and automatic load-balancing.
Operates well in heterogeneous clusters.

Cons:
Writing map/reduce jobs is more complicated than writing SQL queries.
Performance depends largely on the skill of the programmer.

7/25/2011

DIMA TU Berlin

124

HadoopDB
Both approaches have their strengths and weaknesses.
Idea of HadoopDB: Combine them!
Traditional relational databases as data storage and data processing nodes.
MapReduce for Query Parallelization, Job Tracking, etc.
Automatic SQL to MapReduce to SQL (SMS) query rewriter (based on Hive).

Pushing as many operations as possible into the database layer improves data access performance.
MapReduce improves fault tolerance and offers solid cluster management.

7/25/2011

DIMA TU Berlin

125

HadoopDB overview

(Figure: HadoopDB architecture. The user submits a SQL query to the SMS planner on the master node; using the system catalog, the planner produces a MapReduce job for the MapReduce job tracker, which drives the task trackers on nodes #1..#n. Each task tracker pushes SQL into a local Postgres database that stores the replicated table data.)

7/25/2011

DIMA TU Berlin
126

HadoopDB Sample query

SELECT YEAR(saleDate), SUM(revenue)
FROM sales
GROUP BY YEAR(saleDate);

(SMS rewrite: the rewritten MapReduce/SQL plan was shown as a figure.)

7/25/2011

DIMA TU Berlin

127

Experimental Findings (I)


Compared with: native Hadoop (Hive), Vertica, commercial
row-oriented DB.
Experiments performed on 10/50/100-node Amazon EC2 clusters.
Used Benchmark: A. Pavlo et al: A Comparison of Approaches
to Large Scale Data Analysis, SIGMOD, 2009

7/25/2011

DIMA TU Berlin

128

Experimental Findings (II)


In the absence of failures, HadoopDB is usually slower than parallel DBMSs.
HadoopDB is consistently faster than Hadoop, but takes ~10 times longer to load data.
HadoopDB's performance degrades significantly less than Vertica's in the case of node failures.
HadoopDB is not as susceptible to single slow nodes as Vertica.

7/25/2011

DIMA TU Berlin

129

Literature

A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, A. Silberschatz:


HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies
for Analytical Workloads. PVLDB 2(1): 922-933 (2009)

7/25/2011

DIMA TU Berlin

130

Basics of Parallel Processing


Parallel Speedup
Amdahl's Law

Levels of Parallelism
Instruction-Level, Data, Task

Modes of Query Parallelism


Inter-Query / Intra-Query
Pipeline (Inter Operator) / Data (Intra Operator)

Parallel Database Operations

7/25/2011

DIMA TU Berlin

143

Parallel Speedup
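
The two "Parallel Speedup" slides were figures; as a reminder, the standard definitions they build on are the following (an editor's addition, not the slides' own formulas):

\[ S(p) = \frac{T(1)}{T(p)} \qquad \text{(speedup on } p \text{ processors)} \]
\[ S(p) \le \frac{1}{(1-f) + \frac{f}{p}} \qquad \text{(Amdahl's law, } f = \text{parallelizable fraction of the work)} \]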

7/25/2011

DIMA TU Berlin

144

Parallel Speedup

7/25/2011

DIMA TU Berlin

145

Levels of Parallelism on Hardware


Instruction-level Parallelism
Single instructions are automatically processed in parallel
Example: Modern CPUs with multiple pipelines and instruction units.

Data Parallelism
Different Data can be processed independently
Each processor executes the same operations on its share of the input data.
Example: distributing loop iterations over multiple processors, or a CPU's vector units

Task Parallelism
Tasks are distributed among the processors/nodes
Each processor executes a different thread/process.
Example: Threaded programs.

7/25/2011

DIMA TU Berlin

146

Modes of Query Parallelism


Inter Query Parallelism (multiple concurrent queries)
Necessary for efficient resource utilization: While one query waits
(e.g. for I/O), another one executes
Requires concurrency control (locking mechanisms) to guarantee transactional
properties (the "I" in ACID)
Important for highly transactional scenarios (OLTP)

Intra-Query Parallelism (parallel processing of a single query)


I/O Parallelism: Concurrent reading from multiple disks
Hidden: Hardware RAID, Transparent: Spanned tablespaces
Intra Operator Parallelism: Multiple threads work on the same operator.
Example: Parallel Sort
Inter Operator Parallelism: Multiple pipelined parts of the plan run in parallel
Important for complex analytical tasks (OLAP)

7/25/2011

DIMA TU Berlin

147

Pipeline Parallelism

(Figure: a query plan with scans of T1, T2, and T3 feeding two hash joins and a sort, with a final return operator.)

Step 1: Two threads scan one base table each and build the hash tables for the joins.
Step 2: One thread scans the third table and probes the hash tables; a second thread starts the sort (sorting sub-lists, merging the first lists).
Step 3: One thread returns the result, business as usual.

7/25/2011

DIMA TU Berlin
148

Pipeline Parallelism
Pipeline Parallelism, also called Inter Operator Parallelism
Inter Operator, because the parallelism is between the operators

Execute multiple pipelines simultaneously
  limited applicability: only works if multiple pipelines are present and they are not totally dependent on each other

Problems:
  high synchronization overhead
  mostly limited to a low degree of parallelism (not too many pipelines per query)
  only suited for shared-memory architectures

7/25/2011

DIMA TU Berlin

149

Data Parallelism
Pipeline parallelism is applicable only to a limited degree, hence: Data Parallelism
  data is divided into several subsets
  most operations don't need a complete view of the data
    e.g. a filter looks only at a single tuple at a time
  subsets can be processed independently and hence in parallel
  the degree of parallelism is as high as the number of possible subsets
    for a filter: as high as the number of tuples

Some operations possibly need a view of larger portions of the data
  e.g. a grouping/aggregation operation needs all tuples with the same grouping key
  Are they all in the same set? Can we guarantee that?
  Different operators need different sets!

7/25/2011

DIMA TU Berlin

150

Basics of Parallel Query Processing


Levels of Resource Sharing
Shared-Memory, Shared-Disk, Shared-Nothing

Data Partitioning
Round-robin, Hash, Range

Parallel Operators and Costs

Tuple-at-a-time (i.e. Selection)


Sorting
Projection, Grouping, Aggregation
Join

7/25/2011

DIMA TU Berlin

151

Parallel Architectures (I)

Shared Memory
Several CPUs share a single memory and disk (array)
Communication over a single common bus

Source:
Garcia-Molina et al.,
Database Systems
The Complete Book.
Second Edition
7/25/2011

DIMA TU Berlin

152

Parallel Architectures (II)

Shared Disk
Several nodes with multiple CPUs, each node has its private memory
Single attached disk (array): Often NAS, SAN, etc

Source:
Garcia-Molina et al.,
Database Systems
The Complete Book.
Second Edition
7/25/2011

DIMA TU Berlin

153

Parallel Architectures (III)

Shared Nothing
Each node has it own set of CPUs, memory and disks attached
Data needs to be partitioned over the nodes
Data is exchanged through direct node-to-node communication

Source:
Garcia-Molina et al.,
Database Systems
The Complete Book.
Second Edition
7/25/2011

DIMA TU Berlin

154

Data Partitioning (I)


Partitioning the data means creating a set of disjoint subsets
Example: Sales data, every year gets its own partition

For shared-nothing, data must be partitioned across nodes


If it were replicated, it would effectively become a shared-disk with the local
disks acting like a cache (must be kept coherent)

Partitioning with certain characteristics has more advantages


Some queries can be limited to operate on certain sets only, if it is provable
that all relevant data (passing the predicates) is in that partition
Partitions can be simply dropped as a whole (data is rolled out) when it is no
longer needed (e.g. discard old sales)

7/25/2011

DIMA TU Berlin

155

Data Partitioning (II)


How to partition the data into disjoint sets?
Round robin: Each set gets a tuple in a round, all sets have guaranteed
equal amount of tuples, no apparent relationship between tuples in one
set.
Hash Partitioned: Define a set of partitioning columns. Generate a hash
value over those columns to decide the target set. All tuples with equal
values in the partitioning columns are in the same set.

Range Partitioned: Define a set of partitioning columns and split the domain of those columns into ranges. The range determines the target set. All tuples in one set fall into the same range.

7/25/2011

DIMA TU Berlin

156

Data Parallelism Example


Client send a SQL query to one
of the cluster nodes

Node becomes the


"coordinator"

Coordinator compiles
the query

ClusterNode

Client

Parsing, Checking, Optimization


Parallelization

Query
Final
Results

Sends partial plans to the other


cluster nodes that describes
their tasks

Coordinator

Partial
Results

ClusterNode

ClusterNode

Coordinator also executes the


partial plan on his part of the data

Collects partial results and


finalizes them (see next slide)

7/25/2011

DIMA TU Berlin

157

Data Parallelism Example


For shared-nothing & shared-disk

Multiple instances of a sub-plan are


executed on different computers
The instances operate on different
splits or partitions of the data

At some points, results from the subplans are collected


For more complex queries, results
are not collected but re-distributed,
for further parallel processing

Return

Point of data
shipping
PreAggregation

Group
Agg

Final Aggregation
Sub-plan result
collection

Queue
Group
Agg

Sort
NL-Join

Parallel
Instances
Fetch
T2 (part)

7/25/2011

DIMA TU Berlin

Scan

IX-Scan

T1 (part)

IX-T2.1 (part)

158

Parallel Operators
Ideally: Operate as much as possible on individual partitions of the data
Bring the operation to the data
No communication needed, ideal parallelism

Easy for simple "per-tuple" operators


Scan, IX-Scan, Fetch, Filter

Problematic: some operators need the whole picture
  e.g. sorts and aggregations can only be pre-processed in parallel and need a final step on a single node
    unless they occur in a correlated subplan known to contain only tuples from one partition
  e.g. joins need matching tuples: either organize the inputs accordingly, or join on the coordinator after collecting the partial results (not parallel any more!)

7/25/2011

DIMA TU Berlin

159

Notations and Assumptions

S          Relation S
S[i, h]    Partition i of relation S according to partitioning scheme h
B(S)       Number of blocks of relation S
p          Number of nodes

We assume a shared-nothing architecture
  most commercial database vendors use shared-nothing approaches

Network transfer is at least as expensive as disk access
  in some cost models it is still far more expensive
  today, network bandwidth ≈ disk bandwidth
  but the network is shared; switches and routers in particular have a throughput limit

Partitioning schemes (hash/range) produce partitions of roughly equal size.

7/25/2011

DIMA TU Berlin

160

Parallel Selection
Selection can be parallelized very efficiently (an embarrassingly parallel problem)
  each node performs the selection on its existing local partition
  selection needs no context, so the data can be partitioned in an arbitrary way
  the partial results are unioned afterwards

Cost: B(S) / p

7/25/2011

DIMA TU Berlin

161

Parallel Projection, Grouping, Aggregation

7/25/2011

DIMA TU Berlin

162

Parallel Sorting
Range-partitioning sort (partition by range, then sort)
  range-partition the relation according to the sort columns
  sort the single partitions locally (e.g. by TPMMS)
  Cost: B(S) partitioning + B(S) transfer + B(S)/p local sorting
  Problem: how to find a uniform range-partitioning scheme?
  The result is already partitioned in the cluster.

Parallel external sort-merge (sort locally, then merge)
  reuse an existing data partitioning
  partitions are sorted locally (e.g. by TPMMS)
  sorted partitions need to be merged
    e.g. one node merges two partitions at a time until the whole relation is sorted
  Cost: B(S)/p local sorting + log2(p)*B(S)/2 transfer + log2(p)*B(S) local merging
  The result sits on one machine.
(A sketch of the range-partitioning variant follows below.)
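
A minimal Python sketch of the range-partitioning sort (partition by key range, sort each partition locally, read the partitions in order); the range boundaries and the sequential execution are assumptions by the editor.

def range_partitioning_sort(tuples, boundaries, key=lambda t: t):
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:                       # "transfer": ship each tuple to its range partition
        idx = sum(1 for b in boundaries if key(t) >= b)
        parts[idx].append(t)
    for p in parts:                        # local sort on each node (here: one after another)
        p.sort(key=key)
    return parts                           # globally sorted when read partition by partition

data = [17, 3, 42, 8, 23, 15, 4]
print(range_partitioning_sort(data, boundaries=[10, 20]))
# [[3, 4, 8], [15, 17], [23, 42]]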

7/25/2011

DIMA TU Berlin

163

Parallel Equi-Joins (I)


A special class of Joins that are suited for parallelization are Natural- and
Equi-Joins.
For Equi-Joins we only look at tuple pairs that share the same join key.

Idea: Partition relations R and S using the same partitioning scheme over
the join key.
All values of R and S with the same join key end up at the same node!
All joins can be performed locally!

Actual implementation depends on how the relations are partitioned:


Co-Located Join
Directed Join
Re-Partitioning Join

7/25/2011

DIMA TU Berlin

164

Parallel Equi-Joins (II)


1. Both R and S are already partitioned over the join key (and with the same partitioning scheme): Co-Located Join
     No re-partitioning is needed!
     Cost: ??? local join cost

2. Only one relation is partitioned over the join key: Directed Join
     Re-partition the other relation with the same partitioning scheme.
     Cost (assuming R is already partitioned): B(S) partitioning + B(S) transfer + ??? local join cost

3. No relation is partitioned over the join key: Repartition Join
     Re-partition both relations over the join key.
     Cost: B(S)+B(R) partitioning + B(S)+B(R) transfer + ??? local join cost

7/25/2011

DIMA TU Berlin

165

Symmetric Fragment-and-Replicate Join

Join

7/25/2011

DIMA TU Berlin

166

Symmetric Fragment-and-Replicate Join (II)

Nodes in the Cluster

7/25/2011

DIMA TU Berlin

167

Asymmetric Fragment-and-Replicate Join


We can do better if relation S is much smaller than R.
Idea: reuse the existing partitioning of R and replicate the whole relation S to each node.
Cost:
  p * B(S)   transport
  ???        local join

The Asymmetric Fragment-and-Replicate Join is a special case of the symmetric algorithm with m = p and n = 1.
The Asymmetric Fragment-and-Replicate Join is also called Broadcast Join.
7/25/2011

DIMA TU Berlin

168

Limits in Parallel Databases


Database clusters tend to scale up to 64 or 128 nodes
  beyond that, the speedup curve flattens
  communication overhead eats the speedup gained by each additional node
  hard limit example: 1000 nodes for DB2

Shared disk: does not scale infinitely, the bus and synchronization become the bottleneck
  for updates: cache coherency problem
  for reads: I/O bandwidth limits

Shared nothing: cannot easily compensate the loss of a node
  in large clusters, failures and outages are the common case
  loss of a node means loss of data!
  unless the data is replicated
  but replicated data must be kept consistent, which has a high overhead

7/25/2011

DIMA TU Berlin

169

Literature

S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine GRACE.

D. A. Schneider and D. J. DeWitt. A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment. SIGMOD Conference, 1989.

D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine.

J. W. Stamos and H. C. Young. A Symmetric Fragment and Replicate Algorithm for Distributed Joins. IEEE Trans. Parallel Distrib. Syst., 1993.

7/25/2011

DIMA TU Berlin

170

Side-Note: What about updates/transactions?


OLTP style applications that are beyond relational databases'
capabilities exist as well
Some applications still require fast and efficient lookup and
retrieval of small amounts of data

Web index access, mail accounts, warehouse updates for resellers

Addressed by key/value-pair-based storage systems (e.g. Google BigTable and Megastore)
  data can only be accessed through a key
  only an additional filter on columns and timestamps can be applied

Some applications do still need updates and certain guarantees about them
  no hard transactions, especially no multi-record transactions!
  eventual consistency model (Amazon Dynamo)

These techniques require a lecture of their own

7/25/2011

DIMA TU Berlin

171

