
GERMAN-FRENCH SUMMER UNIVERSITY
FOR YOUNG RESEARCHERS 2011

CLOUD COMPUTING:
CHALLENGES AND OPPORTUNITIES

Big Data Analytics beyond Map/Reduce


17.7. 22.7. 2011
Prof. Dr. Volker Markl
TU Berlin

Shift Happens! Our Digital World!

Video courtesy of Michael Brodie, Chief Scientist, Verizon


Original "Shift Happens" video by K. Fisch and S. McLeod
Original focuses on shift in society, aimed at teachers education
Michael Brodie focuses on shift in/because of the digital world
7/25/2011

DIMA TU Berlin

Data Growth and Value


About data growth:

$600 to buy a disk drive that can store all of the world's music
5 billion mobile phones in use in 2010
30 billion pieces of content shared on Facebook every month
40% projected growth in global data per year

About the value of captured data:

250 billion Euro potential value to Europe's public sector administration
60% potential increase in retailers' operating margins possible with big data
140,000-190,000 more deep analytical talent positions needed

Source: Big Data: The next frontier for innovation, competition and productivity
(McKinsey)
7/25/2011

DIMA TU Berlin

Big Data
Data have swept into every industry and business function
an important factor of production
exabytes of data are stored by companies every year
much of modern economic activity could not take place without it

Big Data creates value in several ways


provides transparency
enables experimentation
brings about customization
and tailored products
supports human decisions
triggers new business models

Use of Big Data will become a key basis of competition and growth
companies failing to develop their analysis capabilities will fall behind
Source: Big Data: The next frontier for innovation, competition and productivity (McKinsey)
7/25/2011

DIMA TU Berlin

Big Data Analytics


Data volume keeps growing
Data Warehouse sizes of about 1PB are not uncommon!
Some businesses produce >1TB of new data per day!
Scientific scenarios are even larger (e.g. LHC experiment results in ~15PB / yr)

Some systems are required to support extreme throughput in transaction processing
  especially financial institutions

Analysis queries become more and more complex
  discovering statistical patterns is compute intensive
  may require multiple passes over the data

The performance of single computing cores or single machines is not increasing fast enough to cope with this development
7/25/2011

DIMA TU Berlin

Trends and Challenges

Trends

Claremont Report

Massive parallelization
Virtualization
Service-based computing

Re-architecting DBMS

Web-scale data management

Parallelization
Continuous optimization
Tight integration

Service-based everything
Programming Model
Combining structured and unstructured data
Media Convergence

Analytics / BI
Operational
Multi-tenancy

7/25/2011

DIMA TU Berlin

Overview
Introduction
Big Data Analytics
Map/Reduce/Merge
Introducing the Cloud
Stratosphere (PACT and Nephele)
Demo
(Thomas Bodner, Matthias Ringwald)

Mahout and Scalable Data Mining


(Sebastian Schelter)
7/25/2011

DIMA TU Berlin

Map/Reduce Revisited

BIG DATA ANALYTICS

7/25/2011

DIMA TU Berlin

Data Partitioning (I)


Partitioning the data means creating a set of disjoint subsets
Example: Sales data, every year gets its own partition

For shared-nothing, data must be partitioned across nodes


If it were replicated, it would effectively become a shared-disk with the local
disks acting like a cache (must be kept coherent)

Partitioning with certain characteristics has more advantages


Some queries can be limited to operate on certain sets only, if it is provable
that all relevant data (passing the predicates) is in that partition
Partitions can be simply dropped as a whole (data is rolled out) when it is no
longer needed (e.g. discard old sales)

7/25/2011

DIMA TU Berlin

Data Partitioning (II)


How to partition the data into disjoint sets?
Round robin: Each set gets a tuple in a round, all sets have guaranteed
equal amount of tuples, no apparent relationship between tuples in one
set.
Hash Partitioned: Define a set of partitioning columns. Generate a hash
value over those columns to decide the target set. All tuples with equal
values in the partitioning columns are in the same set.

Range Partitioned: Define a set of partitioning columns and split the domain of those columns into ranges. The range determines the target set. All tuples in one set fall into the same range. (A small sketch of the three schemes follows below.)
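
To make the three schemes concrete, here is a minimal Python sketch (an editor's illustration, not part of the original slides); the tuple layout (year, amount), the number of partitions, and the range boundaries are assumptions.

def round_robin_partition(tuples, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for i, t in enumerate(tuples):
        parts[i % num_partitions].append(t)   # each partition gets every p-th tuple
    return parts

def hash_partition(tuples, num_partitions, key=lambda t: t[0]):
    parts = [[] for _ in range(num_partitions)]
    for t in tuples:
        parts[hash(key(t)) % num_partitions].append(t)  # equal keys end up in the same partition
    return parts

def range_partition(tuples, boundaries, key=lambda t: t[0]):
    # boundaries = [2009, 2011] creates ranges (-inf,2009), [2009,2011), [2011,+inf)
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        idx = sum(1 for b in boundaries if key(t) >= b)
        parts[idx].append(t)
    return parts

sales = [(2008, 10.0), (2009, 20.0), (2011, 5.0), (2012, 7.5)]
print(hash_partition(sales, 2))
print(range_partition(sales, boundaries=[2009, 2011]))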

7/25/2011

DIMA TU Berlin

10

Map/Reduce Revisited
The data model
key/value pairs
e.g. (int, string)

Functional programming model with second-order functions

map:
  input: a key-value pair (Km, Vm)
  output: a list of key-value pairs (Kr, Vr)*

reduce:
  input: a key Kr and a list of values Vr*
  output: the key Kr and a single value Vr

The framework
  accepts a list of input pairs (Km, Vm)*
  outputs the result pairs (Kr, Vr)*
7/25/2011

DIMA TU Berlin

11

Data Flow in Map/Reduce


The framework receives the input list (Km, Vm)* and invokes MAP(Km, Vm) on each pair, each call producing a list (Kr, Vr)*.
The framework groups all emitted values by their key Kr into pairs (Kr, Vr*).
REDUCE(Kr, Vr*) is invoked once per group and emits (Kr, Vr) pairs.
The framework collects these into the final result list (Kr, Vr)*.
7/25/2011

DIMA TU Berlin

12

Map Reduce Illustrated (1)


Problem: Counting words in a parallel fashion

How many times different words appear in a set of files


juliet.txt: Romeo, Romeo, wherefore art thou Romeo?
benvolio.txt: What, art thou hurt?
Expected output: Romeo (3), art (2), thou (2), hurt (1), wherefore (1), what (1)

Solution: Map-Reduce Job


map(filename, line) {
foreach (word in line)
emit(word, 1);
}
reduce(word, numbers) {
int sum = 0;
foreach (value in numbers) {
sum += value;
}
emit(word, sum);
}
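
As a sanity check of the job above, here is a tiny single-process Python sketch (an editor's illustration, not Hadoop code) that mimics the framework: it runs the map function over each file, groups the emitted pairs by key, and applies the reduce function per group.

from collections import defaultdict

files = {
    "juliet.txt": "Romeo, Romeo, wherefore art thou Romeo?",
    "benvolio.txt": "What, art thou hurt?",
}

def map_fn(filename, line):
    for word in line.replace(",", " ").replace("?", " ").lower().split():
        yield (word, 1)                      # emit(word, 1)

def reduce_fn(word, numbers):
    yield (word, sum(numbers))               # emit(word, sum)

# "Framework": run all mappers, group values by key, run the reducers.
groups = defaultdict(list)
for name, text in files.items():
    for key, value in map_fn(name, text):
        groups[key].append(value)

result = [pair for key, values in groups.items() for pair in reduce_fn(key, values)]
print(sorted(result, key=lambda kv: -kv[1]))
# e.g. [('romeo', 3), ('art', 2), ('thou', 2), ...]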

7/25/2011

DIMA TU Berlin

13

Map Reduce Illustrated (2)

7/25/2011

DIMA TU Berlin

14

Data Analytics: Relational Algebra


Base Operators
  selection (σ)
  projection (π)
  set/bag union (∪)
  set/bag difference (\ or −)
  Cartesian product (×)

Derived Operators
  join (⋈)
  set/bag intersection (∩)
  division (÷)

Further Operators
  de-duplication
  generalized projection (grouping and aggregation)
  outer joins and semi joins
  sort

7/25/2011

DIMA TU Berlin

15

Relational Operators as Map/Reduce jobs


Selection / projection / aggregation
SQL Query:
SELECT year, SUM(price)
FROM sales
WHERE area_code = 'US'
GROUP BY year

Map/Reduce job:
map(key, tuple) {
  int year = YEAR(tuple.date);
  if (tuple.area_code == 'US')
    emit(year, { year => year, price => tuple.price });
}
reduce(key, tuples) {
  double sum_price = 0;
  foreach (tuple in tuples) {
    sum_price += tuple.price;
  }
  emit(key, sum_price);
}

7/25/2011

DIMA TU Berlin

16

Relational Operators as Map/Reduce jobs


Sorting
SQL Query:
SELECT *
FROM sales
ORDER BY year
Map/Reduce job:
map(key, tuple) {
  // range-partition by decade of the sales year
  emit(YEAR(tuple.date) DIV 10, tuple);
}
reduce(key, tuples) {
  // each reducer sorts its key range locally
  emit(key, sort(tuples));
}

7/25/2011

DIMA TU Berlin

17

Relational Operators as Map/Reduce jobs


UNION
SQL Query:
SELECT phone_number FROM employees
UNION
SELECT phone_number FROM bosses
Map/Reduce job needs two different mappers:
map(key, employees_phonebook_entry) {
  emit(employees_phonebook_entry.number, '');
}

map(key, bosses_phonebook_entry) {
  emit(bosses_phonebook_entry.number, '');
}
reduce(phone_number, tuples) {
  emit(phone_number, '');
}

7/25/2011

DIMA TU Berlin

18

Relational Operators as Map/Reduce jobs


INTERSECT
SQL Query:
SELECT first_name FROM employees
INTERSECT
SELECT first_name FROM bosses
Map/Reduce job needs two different mappers:
map(key, employee_listing_entry) {
  emit(employee_listing_entry.first_name, 'E');
}

map(key, boss_listing_entry) {
  emit(boss_listing_entry.first_name, 'B');
}
reduce(first_name, markers) {
  if ('E' in markers and 'B' in markers) {
    emit(first_name, '');
  }
}
7/25/2011

DIMA TU Berlin

19

The Petabyte Sort Benchmark


Benchmark to test the performance of distributed systems
Goal: sort one petabyte of 100-byte records
Implementation in Hadoop:
  a range partitioner splits the data into equal ranges (one for each participating node)
  the sort is basically the "range-partitioning sort" described in the parallel sorting part of these slides

7/25/2011

DIMA TU Berlin

20

Petabyte sorting benchmark

Per node: 2 quad-core Xeons @ 2.5 GHz, 4 SATA disks, 8 GB RAM (upgraded to 16 GB before the petabyte sort), 1 Gigabit Ethernet
Per rack: 40 nodes, 8 Gigabit Ethernet uplinks
7/25/2011

DIMA TU Berlin

21

Cluster Utilization during Sort

7/25/2011

DIMA TU Berlin

22

Map/Reduce Revisited

JOINS IN MAP/REDUCE

7/25/2011

DIMA TU Berlin

23

Symmetric Fragment-and-Replicate Join (II)

Nodes in the Cluster

7/25/2011

DIMA TU Berlin

24

Asymmetric Fragment-and-Replicate Join


We can do better if relation S is much smaller than R.
Idea: reuse the existing partitioning of R and replicate the whole relation S to each node.
Cost:
  p * B(S)   transport
  ???        local join

The Asymmetric Fragment-and-Replicate Join is a special case of the symmetric algorithm with m = p and n = 1.
The Asymmetric Fragment-and-Replicate Join is also called Broadcast Join.
7/25/2011

DIMA TU Berlin

25

Broadcast Join
Equi-Join: L(A,X) ⋈ R(X,C), assumption: |L| << |R|

Idea
  broadcast L completely to each node before the map phase begins
  via utilities like Hadoop's distributed cache, or mappers read L from the cluster filesystem at startup

Mapper (runs only over R)
  step 1: read the assigned input split of R into a hash table (build phase)
  step 2: scan the local copy of L and find matching R tuples (probe phase)
  step 3: emit each such pair
  alternatively, read L into the hash table, then read R and probe (see the sketch below)

No need for partition / sort / reduce processing
  the mapper outputs the final join result
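
A minimal single-process Python sketch of the broadcast (map-side) join idea; it follows the alternative variant mentioned above (hash the small broadcast relation L, probe with the local split of R). The relation contents and the split of R are assumptions for illustration.

L = [("a1", 1), ("a2", 2)]                 # L(A, X), the small relation, known to every mapper
R_splits = [                               # R(X, C), partitioned across mappers
    [(1, "c1"), (2, "c2")],
    [(2, "c3"), (3, "c4")],
]

def broadcast_join_mapper(r_split, l_broadcast):
    # build phase: hash the broadcast relation on the join key X
    table = {}
    for a, x in l_broadcast:
        table.setdefault(x, []).append(a)
    # probe phase: stream the local split of R and emit joined tuples
    for x, c in r_split:
        for a in table.get(x, []):
            yield (a, x, c)

result = [t for split in R_splits for t in broadcast_join_mapper(split, L)]
print(result)   # [('a1', 1, 'c1'), ('a2', 2, 'c2'), ('a2', 2, 'c3')]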

7/25/2011

DIMA TU Berlin

26

Repartition Join
Equi-Join: L(A,X) ⋈ R(X,C), assumption: |L| < |R|

Mapper (identical processing logic for L and R)
  emit each tuple once
  the intermediate key is a pair of
    the value of the actual join key X
    an annotation identifying which relation the tuple belongs to (L or R)

Partition and sort
  partition by the hash of the join key
  reducer input is ordered first on the join key, then on the relation name
  output: a sequence of L(i), R(i) blocks of tuples for ascending join key i

Reduce
  collect all L-tuples of the current L(i) block in a hash map
  combine them with each R-tuple of the corresponding R(i) block
(A single-process sketch of this dataflow follows below.)
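
A minimal single-process Python sketch of the repartition (reduce-side) join; for brevity it collapses the secondary sort on the relation tag into explicit L/R lists per join key. The relation contents are assumptions.

from collections import defaultdict

L = [("a1", 1), ("a2", 2)]            # L(A, X)
R = [(1, "c1"), (2, "c2"), (2, "c3")] # R(X, C)

def map_L(a, x):
    yield ((x, "L"), a)               # key = (join key, relation tag)

def map_R(x, c):
    yield ((x, "R"), c)

# "Framework": partition/group by the join key; the tag decides which side the value joins on.
groups = defaultdict(lambda: {"L": [], "R": []})
for a, x in L:
    for (x_key, tag), value in map_L(a, x):
        groups[x_key][tag].append(value)
for x, c in R:
    for (x_key, tag), value in map_R(x, c):
        groups[x_key][tag].append(value)

def reduce_fn(x, block):
    for a in block["L"]:              # L(i) block held in memory
        for c in block["R"]:          # R(i) block streamed against it
            yield (a, x, c)

print([t for x, block in groups.items() for t in reduce_fn(x, block)])
# [('a1', 1, 'c1'), ('a2', 2, 'c2'), ('a2', 2, 'c3')]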

7/25/2011

DIMA TU Berlin

27

Multi-Dimensional Partitioned Join


Equi-Join: D1(A,X) ⋈ F(C,X,Y) ⋈ D2(B,Y)
  star schema with fact table F and dimension tables Di

Fragment
  D1 and D2 are partitioned independently
  the partitions for F are defined as D1 x D2

Replicate
  for an F-tuple f, the partition is uniquely defined as (hash(f.x), hash(f.y))
  for a D1-tuple d1, there is one degree of freedom (d1.y is undefined)
  D1-tuples are thus replicated for each possible y partition
  symmetric for D2

Reduce
  find and emit (f, d1, d2) triples
  depending on the input sorting, different join strategies are possible
(A sketch of the partition assignment follows below.)
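
A small Python sketch of the partition assignment only, since the replication rule is the interesting part; the 2 x 2 reducer grid and the tuple layouts are assumptions by the editor.

NX, NY = 2, 2                                  # reducer grid: hash(x) x hash(y)

def partitions_for_fact(f):
    c, x, y = f
    return [(hash(x) % NX, hash(y) % NY)]      # exactly one target partition

def partitions_for_d1(d1):
    a, x = d1
    return [(hash(x) % NX, j) for j in range(NY)]   # replicate along the free y axis

def partitions_for_d2(d2):
    b, y = d2
    return [(i, hash(y) % NY) for i in range(NX)]   # replicate along the free x axis

print(partitions_for_fact(("c1", 7, 3)))       # one partition for the fact tuple
print(partitions_for_d1(("a1", 7)))            # one copy per y partition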

7/25/2011

DIMA TU Berlin

29

Joins in Hadoop

(Figure: execution time of the join strategies in Hadoop, plotted over the number of nodes and over the selectivity; "Asym." denotes the Multi-Dimensional Partitioned Join.)

7/25/2011

DIMA TU Berlin

31

Parallel DBMS vs. Map/Reduce


                             Parallel DBMS                          Map/Reduce
  Schema Support             yes                                    not built in
  Indexing                   yes                                    not built in
  Programming Model          stating what you want                  presenting an algorithm
                             (declarative: SQL)                     (procedural: C/C++, Java, ...)
  Optimization               yes                                    not built in
  Scaling                    1 - 500 nodes                          10 - 5000 nodes
  Fault Tolerance            limited                                good
  Execution                  pipelines results between operators    materializes results between phases

7/25/2011

DIMA TU Berlin

32

Simplified Relational Data Processing on Large Clusters

MAP-REDUCE-MERGE

7/25/2011

DIMA TU Berlin

33

Map-Reduce-Merge
Motivation
  Map/Reduce does not directly support processing multiple related, heterogeneous datasets
  difficulties and/or inefficiency when one must implement relational operators like joins

Map-Reduce-Merge
  adds a merge phase to the programming model
  Goal: efficiently merge data that is already partitioned and sorted (or hashed)

Map-Reduce-Merge workflows are comparable to RDBMS execution plans
  parallel join algorithms can be implemented more easily

Signatures:
  map:    (k1, v1)                  → [(k2, v2)]
  reduce: (k2, [v2])                → (k2, [v3])
  merge:  ((k2, [v3]), (k3, [v4]))  → [(k4, v5)]
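
A toy Python sketch of the additional merge phase (an editor's illustration, not the actual Map-Reduce-Merge API): two reduced, keyed outputs are merged pairwise on matching keys.

reduced_emp  = {1: ["alice", "bob"], 2: ["carol"]}   # (k2, [v3]) from one reduce
reduced_dept = {1: ["sales"], 3: ["hr"]}             # (k3, [v4]) from another reduce

def merge(left, right):
    # equi-merge on matching keys; other merge logics (outer, theta) are possible
    for k in sorted(set(left) & set(right)):
        for lv in left[k]:
            for rv in right[k]:
                yield (k, (lv, rv))                  # (k4, v5)

print(list(merge(reduced_emp, reduced_dept)))
# [(1, ('alice', 'sales')), (1, ('bob', 'sales'))]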

7/25/2011

DIMA TU Berlin

34

Introducing

THE CLOUD

7/25/2011

DIMA TU Berlin

35

In the Cloud

7/25/2011

DIMA TU Berlin

36

"The interesting thing about cloud computing is


that we've redefined cloud computing to include
everything that we already do.
I can't think of anything that isn't cloud computing
with all of these announcements.
The computer industry is the only industry
that is more fashion-driven than women's fashion.
Maybe I'm an idiot, but I have no idea what anyone is talking about.
What is it?
It's complete gibberish. It's insane.
When is this idiocy going to stop?

"We'll make cloud computing announcements.


I'm not going to fight this thing.
But I don't understand what we would do differently
in the light of cloud."
7/25/2011

DIMA TU Berlin

37

Steve Ballmer's Vision of Cloud Computing

7/25/2011

DIMA TU Berlin

38

What does Hadoop have to do with Cloud?


A few months back, Hamid Pirahesh and I were doing a roundtable with a customer of ours, on cloud and data. We got into a set of standard issues -- data security being the primary one -- but when the dialog turned to Hadoop, a person raised his hand and asked:

"What has Hadoop got to do with cloud?"

I responded, somewhat quickly perhaps, "Nothing specific, and I am willing to have a dialog with you on Hadoop in and out of the cloud context", but it got me thinking. Is there a relationship, or not?

7/25/2011

DIMA TU Berlin

39

Re-inventing the wheel - or not?

7/25/2011

DIMA TU Berlin

40

Parallel Analytics in the Cloud beyond Map/Reduce

STRATOSPHERE

7/25/2011

DIMA TU Berlin

41

The Stratosphere Project*


Explore the power of Cloud computing for complex information management applications

Use cases
  Scientific data
  Life sciences
  Linked data
  ...

Database-inspired approach
  analyze, aggregate, and query textual and (semi-)structured data

Research and prototype a web-scale data analytics infrastructure
  (StratoSphere - Above the Clouds: a query processor running on Infrastructure as a Service)

* FOR 1306: DFG-funded collaborative project among TU Berlin, HU Berlin and HPI Potsdam
7/25/2011

DIMA TU Berlin

42

Example: Climate Data Analysis

Climate model output (up to 200 parameters, ~10 TB; grids of e.g. 1100 km x 950 km at 2 km resolution), for example:
PS,1,1,0,Pa,surface pressure
T_2M,11,105,0,K,air temperature
TMAX_2M,15,105,2,K,2m maximum temperature
TMIN_2M,16,105,2,K,2m minimum temperature
U,33,110,0,ms-1,U-component of wind
V,34,110,0,ms-1,V-component of wind
QV_2M,51,105,0,kgkg-1,2m specific humidity
CLCT,71,1,0,1,total cloud cover

Analysis tasks on climate data sets
  Validate climate models
  Locate hot spots in climate models (monsoon, drought, flooding)
  Compare climate models based on different parameter settings

Necessary data processing operations
  Filter
  Aggregation (sliding window)
  Join
  Multi-dimensional sliding-window operations
  Geospatial/temporal joins
  Uncertainty handling

7/25/2011

DIMA TU Berlin

43

Further Use-Cases
Text Mining in the biosciences

Cleansing of linked open data

7/25/2011

DIMA TU Berlin

44

Outline
Architecture of the Stratosphere System
The PACT Programming Model
The Nephele Execution Engine
Parallelizing PACT Programs

7/25/2011

DIMA TU Berlin

45

Architecture Overview

Three comparable stacks, each consisting of a higher-level language, a parallel programming model, and an execution engine:

                              Hadoop stack        Dryad stack         Stratosphere stack
  Higher-level language       JAQL, Pig, Hive     Scope, DryadLINQ    JAQL? Pig? Hive?
  Parallel programming model  Map/Reduce          -                   PACT
  Execution engine            Hadoop              Dryad               Nephele

7/25/2011

DIMA TU Berlin

46

Data-Centric Parallel Programming


Map / Reduce
  schema free
  many semantics hidden inside the user code (tricks required to push operations into map/reduce)
  a single default way of parallelization

Relational Databases
  schema bound (relational model)
  well-defined properties and requirements for parallelization
  flexible and optimizable

GOAL: advance the M/R programming model


7/25/2011

DIMA TU Berlin

47

Stratosphere in a Nutshell
PACT Programming Model
Parallelization Contract (PACT)
Declarative definition of data parallelism
Centered around second-order functions
Generalization of map/reduce
Nephele
Dryad-style execution engine
Evaluates dataflow graphs in parallel
Data is read from distributed filesystem
Flexible engine for complex jobs

PACT Compiler

Nephele

Stratosphere = Nephele + PACT


Compiles PACT programs to Nephele dataflow graphs
Combines parallelization abstraction and flexible execution
Choice of execution strategies gives optimization potential
7/25/2011

DIMA TU Berlin

48

Overview
Parallelization Contracts (PACTs)
The Nephele Execution Engine
Compiling/Optimizing Programs
Related Work

7/25/2011

DIMA TU Berlin

49

Intuition for Parallelization Contracts


Map and reduce are second-order functions
  they call first-order functions (the user code)
  they provide the first-order functions with subsets of the input data

They define dependencies between the records that must be obeyed when splitting them into subsets
  i.e. the required partitioning properties

Map
  all records are independently processable

Reduce
  records with identical key must be processed together

7/25/2011

DIMA TU Berlin

50

Contracts beyond Map and Reduce


Cross
  two inputs
  each combination of records from the two inputs is built and is independently processable

Match
  two inputs; each combination of records with equal key from the two inputs is built
  each pair is independently processable

CoGroup
  multiple inputs
  records with identical key are grouped for each input
  the groups of all inputs with identical key are processed together
(A sketch of these contracts as plain functions follows below.)
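
A minimal single-process Python sketch of what the Cross, Match, and CoGroup contracts feed to the user code (an editor's illustration only; the real PACT API is Java-based and executes in parallel).

from collections import defaultdict
from itertools import product

# Each contract decides WHICH record combinations the user function sees.
def cross(first, second, user_fn):
    for a, b in product(first, second):          # every combination of records
        yield from user_fn(a, b)

def match(first, second, user_fn):
    index = defaultdict(list)
    for k, v in first:
        index[k].append(v)
    for k, v in second:                          # only combinations with equal key
        for w in index[k]:
            yield from user_fn(k, w, v)

def cogroup(first, second, user_fn):
    groups = defaultdict(lambda: ([], []))
    for k, v in first:
        groups[k][0].append(v)
    for k, v in second:
        groups[k][1].append(v)
    for k, (g1, g2) in groups.items():           # whole groups per key, per input
        yield from user_fn(k, g1, g2)

# Example user function for match: emit joined pairs.
print(list(match([(1, "a"), (2, "b")], [(1, "x"), (1, "y")],
                 lambda k, l, r: [(k, l, r)])))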

7/25/2011

DIMA TU Berlin

51

Parallelization Contracts (PACTs)


Second-order function that defines properties on the input and
output data of its associated first-order function

Data → Input Contract → first-order function (user code) → Output Contract → Data

Input Contract
Specifies dependencies between records
(a.k.a. "What must be processed together?")
Generalization of map/reduce
Logically: Abstracts a (set of) communication pattern(s)
For "reduce": repartition-by-key
For "match" : broadcast-one or repartition-by-key
Output Contract
Generic properties preserved or produced by the user code
key property, sort order, partitioning, etc.
Relevant to parallelization of succeeding functions
7/25/2011

DIMA TU Berlin

52

Optimizing PACT Programs


For certain PACTs, several distribution patterns exist that fulfill the contract
  the choice of the best one is up to the system

Created properties (like a partitioning) may be reused for later operators
  need a way to find out whether they still hold after the user code
  output contracts are a simple way to specify that
  example output contracts: Same-Key, Super-Key, Unique-Key

Using these properties, optimization across multiple PACTs is possible
  a simple System-R-style optimizer approach is possible

7/25/2011

DIMA TU Berlin

53

From PACT Programs to Data Flows


PACT code (user function, grouping):

function match(Key k, Tuple val1, Tuple val2) -> (Key, Tuple)
{
  Tuple res = val1.concat(val2);
  res.project(...);
  Key newKey = res.getColumn(1);
  return (newKey, res);
}

Nephele code (communication and driver logic around the user function):

invoke():
  while (!input2.eof)
    KVPair p = input2.next();
    hash-table.put(p.key, p.value);
  while (!input1.eof)
    KVPair p = input1.next();
    KVPair t = hash-table.get(p.key);
    if (t != null)
      KVPair[] result = UF.match(p.key, p.value, t.value);
      output.write(result);
  end

The PACT program (e.g. UF1 and UF2 as maps feeding UF3 as a match feeding UF4 as a reduce) is compiled into a Nephele DAG (vertices V1-V4 connected by in-memory and network channels), which is then spanned into the parallel data flow.

7/25/2011

DIMA TU Berlin


54

NEPHELE EXECUTION
ENGINE
7/25/2011

DIMA TU Berlin

55

Nephele Execution Engine


Executes Nephele schedules
compiled from PACT programs

Design goals

Exploit scalability/flexibility of clouds


Provide predictable performance
Efficient execution on 1000+ cores
Flexible fault tolerance mechanisms

PACT Compiler

Nephele

Inherently designed to run on top of an IaaS cloud
  heterogeneity through different types of VMs
  knows the cloud's pricing model

Infrastructure-as-a-Service
  VM allocation and de-allocation
  network topology inference

7/25/2011

DIMA TU Berlin

56

Nephele Architecture
Standard master worker pattern
Workers can be allocated on demand

(Figure: the client submits jobs over the public network (Internet) to the master, which manages the workers over a private/virtualized network; workers use persistent storage, and the master allocates and de-allocates them through the cloud controller of the compute cloud as the workload varies over time.)

7/25/2011

DIMA TU Berlin

57

Structure of a Nephele Schedule

Example schedule (DAG):
  Input 1  - Task: LineReaderTask.program, Input: s3://user:key@storage/input
  Task 1   - Task: MyTask.program
  Output 1 - Task: LineWriterTask.program, Output: s3://user:key@storage/outp

7/25/2011

Nephele Schedule is represented as DAG


Vertices represent tasks
Edges denote communication channels
Mandatory information for each vertex
Task program
Input/output data location (I/O vertices
only)
Optional information for each vertex
Number of subtasks (degree of parallelism)
Number of subtasks per virtual machine
Type of virtual machine (#CPU cores, RAM)
Channel types
Sharing virtual machines among tasks
DIMA TU Berlin

58

Internal Schedule Representation


Nephele schedule is converted into internal
representation
Output
1 1 (1)
Output
ID: 2
Type: m1.large

Task
1 1 (2)
Task

Explicit parallelization
Parallelization range (mpl) derived from PACT
Wiring of subtasks derived from PACT

Explicit assignment to virtual machines


Specified by ID and type
Type refers to hardware profile

ID: 1
Type: m1.small

Input
Input
1 1 (1)

7/25/2011

DIMA TU Berlin

59

Execution Stages

Issues with on-demand allocation:
  When to allocate virtual machines?
  When to deallocate virtual machines?
  No guarantee of resource availability!

Stages ensure three properties:
  VMs of the upcoming stage are available
  all workers are set up and ready
  data of previous stages is stored in a persistent manner

(Figure: a two-stage schedule; Input 1 and Task 1 run on an m1.small VM in Stage 0, Output 1 on an m1.large VM in Stage 1.)

7/25/2011

DIMA TU Berlin

60

Channel Types

Network channels (pipeline)
  vertices must be in the same stage

In-memory channels (pipeline)
  vertices must run on the same VM
  vertices must be in the same stage

File channels
  vertices must run on the same VM
  vertices must be in different stages

(Figure: the two-stage example schedule annotated with channel types.)

7/25/2011

DIMA TU Berlin

61

Some Evaluation (1/2)


Demonstrates benefits of dynamic resource allocation
Challenge: Sort and Aggregate
Sort 100 GB of integer numbers (from GraySort benchmark)
Aggregate TOP 20% of these numbers (exact result!)
First execution as map/reduce jobs with Hadoop
Three map/reduce jobs on 6 VMs (each with 8 CPU cores, 24 GB
RAM)
TeraSort code used for sorting
Custom code for aggregation
Second execution as map/reduce jobs with Nephele
A map/reduce compatibility layer allows running Hadoop M/R programs
Nephele controls resource allocation
Idea: Adapt allocated resources to required processing power
7/25/2011

DIMA TU Berlin

62

First Evaluation (2/2)

(Figures: average instance utilization [%] broken down into USR/SYS/WAIT, and average network traffic among instances [MBit/s], plotted over time [minutes] for the M/R jobs on Hadoop and for the M/R jobs on Nephele. The Hadoop run shows poor resource utilization, while the Nephele run deallocates VMs automatically, adapting the allocated resources to the required processing power.)

7/25/2011

DIMA TU Berlin

63

References
[WK09] Daniel Warneke, Odej Kao: Nephele: efficient parallel data processing in the cloud. SC-MTAGS 2009
[BEH+10] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, D. Warneke: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. SoCC 2010: 119-130
[ABE+10] A. Alexandrov, D. Battré, S. Ewen, M. Heimel, F. Hueske, O. Kao, V. Markl, E. Nijkamp, D. Warneke: Massively Parallel Data Analysis with PACTs on Nephele. PVLDB 3(2): 1625-1628 (2010)
[AEH+11] A. Alexandrov, S. Ewen, M. Heimel, F. Hueske, et al.: MapReduce and PACT - Comparing Data Parallel Programming Models, to appear at BTW 2011
7/25/2011

DIMA TU Berlin

64

Ongoing Work

Adaptive Fault-Tolerance (Odej Kao)


Robust Query Optimization (Volker Markl)
Parallelization of the PACT Programming Model (Volker Markl)
Continuous Re-Optimization (Johann-Christoph Freytag)
Validating Climate Simulations with Stratosphere (Volker Markl)
Text Analysis with Stratosphere (Ulf Leser)
Data Cleansing with Stratosphere (Felix Naumann)

JAQL on Stratosphere: Student Project at TUB

Open Source Release: Nephele + PACT (TUB, HPI, HU)

7/25/2011

DIMA TU Berlin

65

Overview
Introduction
Big Data Analytics
Map/Reduce/Merge
Introducing the Cloud
Stratosphere (PACT and Nephele)
Demo
(Thomas Bodner, Matthias Ringwald)

Mahout and Scalable Data Mining


(Sebastian Schelter)
7/25/2011

DIMA TU Berlin

66

The Information Revolution

http://mediatedcultures.net/ksudigg/?p=120
7/25/2011

DIMA TU Berlin

67

Demo Screenshots

WEBLOG ANALYSIS
QUERY
7/25/2011

DIMA TU Berlin

74

Weblog Query and Plan

SELECT r.url, r.rank, r.avg_duration


FROM Documents d
JOIN
Rankings r
ON r.url = d.url
WHERE CONTAINS(d.text, [keywords])
AND r.rank > [rank]
AND NOT EXISTS
(SELECT * FROM Visits v
WHERE v.url = d.url
AND v.date < [date]);

7/25/2011

DIMA TU Berlin

75

Weblog Query Job Preview

7/25/2011

DIMA TU Berlin

76

Weblog Query Optimized Plan

7/25/2011

DIMA TU Berlin

77

Weblog Query Nephele Schedule in Execution

7/25/2011

DIMA TU Berlin

78

Demo Screenshots

ENUMERATING TRIANGLES
FOR SOCIAL NETWORK
MINING
7/25/2011

DIMA TU Berlin

79

Enumerating Triangles Graph and Job

7/25/2011

DIMA TU Berlin

80

Enumerating Triangles Job Preview

7/25/2011

DIMA TU Berlin

81

Enumerating Triangles Optimized Plan

7/25/2011

DIMA TU Berlin

82

Enumerating Triangles Nephele Schedule in


Execution

7/25/2011

DIMA TU Berlin

83

Scalable data mining

APACHE MAHOUT
Sebastian Schelter

7/25/2011

DIMA TU Berlin

85

Apache Mahout: Overview


What is Apache Mahout?
An Apache Software Foundation project aiming to create scalable
machine learning libraries under the Apache License
focus on scalability, not a competitor for R or Weka
in use at Adobe, Amazon, AOL, Foursquare, Mendeley, Twitter, Yahoo

Scalability
  time is proportional to problem size divided by resource size: t ∝ P / R
  does not imply Hadoop or parallel execution, although the majority of implementations use Map/Reduce

7/25/2011

DIMA TU Berlin

86

Apache Mahout: Clustering


Clustering
Unsupervised learning: assign a set of data points into subsets (called
clusters) so that points in the same cluster are similar in some sense

Algorithms

K-Means
Fuzzy K-Means
Canopy
Mean Shift
Dirichlet Process
Spectral Clustering

7/25/2011

DIMA TU Berlin

87

Apache Mahout: Classification


Classification
supervised learning: learn a decision function that predicts labels y on
data points x given a set of training samples {(x,y)}

Algorithms
Logistic Regression (sequential but fast)
Naive Bayes / Complementary Naive Bayes
Random Forests

7/25/2011

DIMA TU Berlin

88

Apache Mahout: Collaborative Filtering


Collaborative Filtering
approach to recommendation mining: given a user's preferences for
items, guess which other items would be highly preferred

Algorithms
Neighborhood methods: Itembased Collaborative Filtering
Latent factor models: matrix factorization using Alternating Least
Squares

7/25/2011

DIMA TU Berlin

89

Apache Mahout: Singular Value Decomposition

Singular Value Decomposition


matrix decomposition technique used to create an optimal low-rank
approximation of a matrix
used for dimensional reduction, unsupervised feature selection, Latent
Semantic Indexing

Algorithms
Lanczos Algorithm
Stochastic SVD

7/25/2011

DIMA TU Berlin

90

Comparing implementations of data mining algorithms in


Hadoop/Mahout and Nephele/PACT

SCALABLE DATA MINING

7/25/2011

DIMA TU Berlin

92

Problem description
Pairwise row similarity computation
Computes the pairwise similarities of the rows (or
columns) of a sparse matrix using a predefined
similarity function
used for computing document similarities in large corpora
used to precompute item-item similarities for recommendations (Collaborative Filtering)
the similarity function can be cosine, Pearson correlation, log-likelihood ratio, Jaccard coefficient, ...

7/25/2011

DIMA TU Berlin

93

Map/Reduce
Map/Reduce Step 1
  compute similarity-specific row weights
  transpose the matrix, thereby creating an inverted index

Map/Reduce Step 2
  map out all pairs of co-occurring values
  collect all co-occurring values per row pair, compute the similarity value

Map/Reduce Step 3
  use secondary sort to keep only the k most similar rows
(see the sketch below)
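
A compact single-process Python sketch of the same idea (cosine similarity via an inverted index of the columns), mirroring the three steps above; the sparse matrix and the choice of cosine are assumptions by the editor, and the top-k pruning of step 3 is omitted.

from collections import defaultdict
from math import sqrt

rows = {                       # sparse matrix: row id -> {column: value}
    "r1": {0: 1.0, 2: 2.0},
    "r2": {0: 2.0, 1: 1.0},
    "r3": {2: 4.0},
}

# Step 1: row weights (norms) and the transposed matrix (inverted index).
norms = {r: sqrt(sum(v * v for v in cols.values())) for r, cols in rows.items()}
inverted = defaultdict(list)   # column -> [(row, value)]
for r, cols in rows.items():
    for c, v in cols.items():
        inverted[c].append((r, v))

# Step 2: emit co-occurring value pairs per column, sum dot products per row pair.
dots = defaultdict(float)
for c, entries in inverted.items():
    for i, (r1, v1) in enumerate(entries):
        for r2, v2 in entries[i + 1:]:
            dots[tuple(sorted((r1, r2)))] += v1 * v2

# Step 3: normalize to cosine similarity (keeping only the top k is omitted here).
similarities = {pair: d / (norms[pair[0]] * norms[pair[1]]) for pair, d in dots.items()}
print(similarities)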

PACT

7/25/2011

DIMA TU Berlin

94

Comparison
Equivalent implementations in Mahout and PACT
problem maps relatively well to the Map/Reduce paradigm
insight: standard Map/Reduce code can be ported to Nephele/PACT
with very little effort
output contracts and memory forwards offer hooks for performance improvements (unfortunately not applicable in this particular use case)

7/25/2011

DIMA TU Berlin

95

Problem description

K-Means
Simple iterative clustering algorithm

uses a predefined number of clusters (k)


start with a random selection of cluster centers
assign points to nearest cluster
recompute cluster centers, iterate until convergence

7/25/2011

DIMA TU Berlin

96

Mahout
Initialization
  generate k random cluster centers from the data points (optional)
  put the centers into the distributed cache

Map
  find the nearest cluster for each data point
  emit (cluster id, data point)

Combine
  partially aggregate the assigned points per cluster

Reduce
  compute the new centroid for each cluster

Repeat until convergence, then
  output the converged cluster centers (or the centers after n iterations)
  optionally output the clustered data points
(A single-process sketch of one iteration follows below.)
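
A single-process Python sketch of one such iteration (assignment plus centroid recomputation); the data points, k = 2, and the fixed iteration count are assumptions by the editor.

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5), (0.5, 1.5)]
centers = [(1.0, 1.0), (8.0, 8.0)]          # initial centers (normally chosen at random)

def nearest(p, centers):
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))

def kmeans_iteration(points, centers):
    # "map": assign each point to its nearest center;
    # "combine/reduce": sum and count per cluster, then recompute the centroids.
    sums = [[0.0] * len(centers[0]) for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        c = nearest(p, centers)
        counts[c] += 1
        sums[c] = [s + x for s, x in zip(sums[c], p)]
    return [tuple(s / counts[i] for s in sums[i]) if counts[i] else centers[i]
            for i in range(len(centers))]

for _ in range(10):                          # iterate a fixed number of rounds
    centers = kmeans_iteration(points, centers)
print(centers)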

7/25/2011

DIMA TU Berlin

97

Stratosphere Implementation

7/25/2011

DIMA TU Berlin

Source: www.stratosphere.eu
98

Code analysis
Comparison of the implementations
  the actual execution plans in the underlying distributed systems are nearly equivalent
  the Stratosphere implementation is more intuitive and closer to the mathematical formulation of the algorithm

7/25/2011

DIMA TU Berlin

99

Problem description

Naive Bayes
  a simple classification algorithm based on Bayes' theorem

General Naive Bayes
  assumes feature independence
  often gives good results even if this assumption does not hold

Mahout's version of Naive Bayes
  a specialized approach for document classification
  based on the tf-idf weight metric

7/25/2011

DIMA TU Berlin

100

M/R Overview

Classification
straight-forward approach, simply reads complete model into memory
classification is done in the mapper, reducer only sums up statistics for
confusion matrix

Trainer
  much higher complexity
  needs to count documents, features, features per document, features per corpus
  Mahout's implementation is optimized by exploiting Hadoop-specific features like secondary sort and reading results into memory from the cluster filesystem

7/25/2011

DIMA TU Berlin

101

M/R Trainer Overview


Train Data

Feature
Extractor
TermDoc
Counter

termDocC

wordFreq

Weight Summer

Tf-Idf
Calculation

Tf-Idf

tfIdf

WordFr.
Counter
Doc
Counter
Feature
Counter

kj

kj

Theta
Normalizer

docC

featureC

Theta N.

Vocab
Counter
vocabC

thetaNorm
7/25/2011

DIMA TU Berlin

102

Pact Trainer Overview

PACT implementation
  looks even more complex, but PACTs can be combined in a much more fine-grained manner
  since PACT offers the ability to use local memory forwards, more and higher-level second-order functions such as Cross and Match can be used
  fewer framework-specific tweaks are necessary for a performant implementation
  the visualized execution plan is much more similar to the algorithmic formulation of computing several counts and finally combining them into a model
  sub-calculations can be inspected and unit-tested in isolation

7/25/2011

DIMA TU Berlin

103

PACT Trainer Overview

7/25/2011

DIMA TU Berlin

104

Hot Path

(Figure: data volumes along the hot path of the trainer job: 7.4 GB, 14.8 GB, 5.89 GB, 5.89 GB, 3.53 GB, 84 kB, 8 kB, 5 kB.)

7/25/2011

DIMA TU Berlin

105

Pact Trainer Overview

Future work: the PACT implementation can still be tuned by
  sampling the input data
  more flexible memory management in Stratosphere
  employing the context concept of PACTs for simpler distribution of computed parameters

7/25/2011

DIMA TU Berlin

106

Thank You
(Danke - German, Merci - French, Gracias - Spanish, Grazie - Italian, Obrigado - Brazilian Portuguese, and thanks in Hindi, Thai, Traditional and Simplified Chinese, Russian, Arabic, Tamil, Japanese, and Korean)
7/25/2011

DIMA TU Berlin

107

Programming in a more abstract way

PARALLEL DATA FLOW LANGUAGES
7/25/2011

DIMA TU Berlin

108

Introduction
MapReduce paradigm is too low-level

Only two declarative primitives (map + reduce)


Extremely rigid (one input, two-stage data flow)
Custom code for e.g.: projection and filtering
Code is difficult to reuse and maintain
Impedes Optimizations

Combination of high-level declarative querying and low-level


programming with MapReduce
Dataflow Programming Languages
Hive
JAQL
Pig

7/25/2011

DIMA TU Berlin

109

Hive
Data warehouse infrastructure built on top of Hadoop,
providing:
Data Summarization
Ad hoc querying

Simple query language: Hive QL (based on SQL)


Extendable via custom mappers and reducers
Subproject of Hadoop
No Hive format
http://hadoop.apache.org/hive/

7/25/2011

DIMA TU Berlin

110

Hive - Example
LOAD DATA INPATH '/data/visits' INTO TABLE visits;
INSERT OVERWRITE TABLE visitCounts
SELECT url, category, count(*)
FROM visits
GROUP BY url, category;

LOAD DATA INPATH '/data/urlInfo' INTO TABLE urlInfo;
INSERT OVERWRITE TABLE visitCounts
SELECT vc.*, ui.*
FROM visitCounts vc JOIN urlInfo ui ON (vc.url = ui.url);

INSERT OVERWRITE TABLE gCategories
SELECT category, count(*)
FROM visitCounts
GROUP BY category;

INSERT OVERWRITE TABLE topUrls
SELECT TRANSFORM (visitCounts) USING 'top10';
7/25/2011

DIMA TU Berlin

111

JAQL
Higher level query language for JSON documents

Developed at IBM's Almaden research center


Supports several operations known from SQL

Grouping, Joining, Sorting

Built-in support for

Loops, Conditionals, Recursion

Custom Java methods extend JAQL


JAQL scripts are compiled to MapReduce jobs
Various I/O

Local FS, HDFS, Hbase, Custom I/O adapters

http://www.jaql.org/
7/25/2011

DIMA TU Berlin

112

JAQL - Example
registerFunction("top10", "de.tuberlin.cs.dima.jaqlextensions.top10");
$visits = hdfsRead("/data/visits");
$visitCounts =
  $visits
  -> group by $url = $.url
     into { $url, num: count($) };
$urlInfo = hdfsRead("/data/urlInfo");
$visitCounts =
  join $visitCounts, $urlInfo
  where $visitCounts.url == $urlInfo.url;
$gCategories =
  $visitCounts
  -> group by $category = $.category
     into { $category, num: count($) };
$topUrls = top10($gCategories);
hdfsWrite("/data/topUrls", $topUrls);
7/25/2011

DIMA TU Berlin

113

Pig
A platform for analyzing large data sets
Pig consists of two parts:

PigLatin: A Data Processing Language


Pig Infrastructure: An Evaluator for PigLatin programs
Pig compiles Pig Latin into physical plans
Plans are to be executed over Hadoop

Interface between the declarative style of SQL and the low-level, procedural style of MapReduce

http://hadoop.apache.org/pig/

7/25/2011

DIMA TU Berlin

114

Pig - Example
visits      = load '/data/visits' as (user, url, time);

visitCounts = foreach visits generate url, count(visits);

urlInfo     = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;

topUrls     = foreach gCategories generate top(visitCounts, 10);

store topUrls into '/data/topUrls';

Example taken from the "Pig Latin: A Not-So-Foreign Language For Data Processing" talk, SIGMOD 2008

Example taken from:


Pig Latin: A Not-So-Foreign Language For Data Processing Talk, Sigmod 2008

7/25/2011

DIMA TU Berlin

115

Literature

C. Olston, et al. (2008). `Pig Latin: a not-so-foreign language for data


processing'. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international
conference on Management of data, pp. 1099-1110, New York, NY, USA. ACM.
Apache Pig http://wiki.apache.org/pig/FrontPage
Hive - A Warehousing Solution Over a Map-Reduce Framework. Thusoo,
Ashish; Sarma, Joydeep Sen; Jain, Namit; Shao, Zheng; Chakka, Prasad; Anthony,
Suresh; Liu, Hao; Wyckoff, Pete; Murthy, Raghotham
Apache Hive http://wiki.apache.org/hadoop/Hive
Towards a Scalable Enterprise Content Analytics Platform. Kevin S. Beyer, Vuk
Ercegovac, Rajasekar Krishnamurthy, Sriram Raghavan, Jun Rao, Frederick Reiss,
Eugene J. Shekita, David E. Simmen, Sandeep Tata, Shivakumar Vaithyanathan,
Huaiyu Zhu. IEEE Data Eng. Bull. (32): 28-35 (2009)
JAQL http://code.google.com/p/jaql/wiki/

7/25/2011

DIMA TU Berlin

116

QUERY COPROCESSING ON
GRAPHICS PROCESSORS
7/25/2011

DIMA TU Berlin

117

Query Coprocessing on GPUs


Graphics Processors (GPUs) have recently emerged as powerful
coprocessors for general purpose computation
10x computational power compared to the CPU
5x memory bandwith compared to the CPU

Parallel primitives available for query processing that

7/25/2011

provide exploitation of GPU hardware features such as high thread parallelism and
reduction of memory stalls through the fast local memory
are scalable to hundreds of processors because of their lock-free design and low
synchronization cost through the use of local memory

DIMA TU Berlin

118

Query Coprocessing on GPUs


Map
given an array of data tuples and a function, a map applies the function to every tuple
uses multiple thread groups to scan the relation with each thread group being
responsible for a segment of the relation
the access pattern of the threads in each thread group is designed to exploit the
coalesced memory access feature on the GPU

Scatter and Gather


Scatter: perform indexed writes to a relation (e.g. hashing) defined by a location array
Gather: perform indexed reads from a relation also defined by a location array
can be implemented using the multipass optimization scheme to improve their temporal
locality

7/25/2011

DIMA TU Berlin

119

Query Coprocessing on GPUs


Prefix scan
applies a binary operator to the input relation
example: prefix sum, an important operation in parallel databases

Reduce
computes a value based on the input relation
implemented as multipass algorithm by utilizing local memory optimization
logarithmic number of passes constrained by local memory size per multiprocessor
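
To pin down what these primitives compute, here is a tiny sequential Python sketch of their semantics (an editor's illustration; the GPU versions are massively parallel and exploit local memory and coalesced accesses, none of which is modeled here).

def map_prim(relation, fn):
    return [fn(t) for t in relation]

def scatter(relation, locations):
    out = [None] * len(relation)
    for i, t in enumerate(relation):
        out[locations[i]] = t          # indexed write (e.g. hashing tuples to buckets)
    return out

def gather(relation, locations):
    return [relation[i] for i in locations]   # indexed read

def prefix_scan(relation, op=lambda a, b: a + b, init=0):
    out, acc = [], init                # exclusive prefix scan (e.g. prefix sum)
    for t in relation:
        out.append(acc)
        acc = op(acc, t)
    return out

def reduce_prim(relation, op=lambda a, b: a + b, init=0):
    acc = init                         # single value computed over the whole input
    for t in relation:
        acc = op(acc, t)
    return acc

print(prefix_scan([3, 1, 4, 1, 5]))    # [0, 3, 4, 8, 9]
print(reduce_prim([3, 1, 4, 1, 5]))    # 14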

7/25/2011

DIMA TU Berlin

120

An Architectural Hybrid of MapReduce and DBMS

HADOOP DB

7/25/2011

DIMA TU Berlin

121

Parallel Data Processing Architectures


Two major architectures:
  1. Parallel databases: standard relational databases in a (usually) shared-nothing cluster
  2. MapReduce: data analysis via parallel map and reduce jobs in a replicated cluster

Both approaches have their pros and cons.

7/25/2011

DIMA TU Berlin

122

Parallel RDBMs
Pros:
Usually very good and consistent performance.
Flexible and proven interface (SQL).

Cons:
Scaling is rather limited (10s of nodes).
Does not work well in heterogeneous clusters.
Not very Fault-Tolerant.

7/25/2011

DIMA TU Berlin

123

MapReduce
Pros:
Very fault-tolerant and automatic load-balancing.
Operates well in heterogeneous clusters.

Cons:
Writing map/reduce jobs is more complicated than writing SQL queries.
Performance depends largely on the skill of the programmer.

7/25/2011

DIMA TU Berlin

124

HadoopDB
Both approaches have their strengths and weaknesses.
Idea of HadoopDB: Combine them!
Traditional relational databases as data storage and data processing nodes.
MapReduce for Query Parallelization, Job Tracking, etc.
Automatic SQL to MapReduce to SQL (SMS) query rewriter (based on Hive).

Pushing as many operations as possible into the database layer improves data access performance.
MapReduce improves fault tolerance and offers solid cluster management.

7/25/2011

DIMA TU Berlin

125

HadoopDB overview

(Figure: HadoopDB architecture. The user submits a SQL query to the SMS planner on the master node; using the system catalog, the planner produces a MapReduce job for the MapReduce job tracker, which drives the task trackers on nodes #1..#n. Each task tracker pushes SQL into a local Postgres database that stores the replicated table data.)

7/25/2011

DIMA TU Berlin
126

HadoopDB Sample query

SELECT YEAR(saleDate), SUM(revenue)
FROM sales
GROUP BY YEAR(saleDate);

(SMS rewrite: the rewritten MapReduce/SQL plan was shown as a figure.)

7/25/2011

DIMA TU Berlin

127

Experimental Findings (I)


Compared with: native Hadoop (Hive), Vertica, commercial
row-oriented DB.
Experiments performed on 10/50/100-node Amazon EC2 clusters.
Used Benchmark: A. Pavlo et al: A Comparison of Approaches
to Large Scale Data Analysis, SIGMOD, 2009

7/25/2011

DIMA TU Berlin

128

Experimental Findings (II)


In the absence of failures, HadoopDB is usually slower than parallel DBMSs.
HadoopDB is consistently faster than Hadoop, but takes ~10 times longer to load data.
HadoopDB's performance degrades significantly less than Vertica's in the case of node failures.
HadoopDB is not as susceptible to single slow nodes as Vertica.

7/25/2011

DIMA TU Berlin

129

Literature

A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, A. Silberschatz:


HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies
for Analytical Workloads. PVLDB 2(1): 922-933 (2009)

7/25/2011

DIMA TU Berlin

130

Basics of Parallel Processing


Parallel Speedup
Amdahl's Law

Levels of Parallelism
Instruction-Level, Data, Task

Modes of Query Parallelism


Inter-Query / Intra-Query
Pipeline (Inter Operator) / Data (Intra Operator)

Parallel Database Operations

7/25/2011

DIMA TU Berlin

143

Parallel Speedup
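
The two "Parallel Speedup" slides were figures; as a reminder, the standard definitions they build on are the following (an editor's addition, not the slides' own formulas):

\[ S(p) = \frac{T(1)}{T(p)} \qquad \text{(speedup on } p \text{ processors)} \]
\[ S(p) \le \frac{1}{(1-f) + \frac{f}{p}} \qquad \text{(Amdahl's law, } f = \text{parallelizable fraction of the work)} \]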

7/25/2011

DIMA TU Berlin

144

Parallel Speedup

7/25/2011

DIMA TU Berlin

145

Levels of Parallelism on Hardware


Instruction-level Parallelism
Single instructions are automatically processed in parallel
Example: Modern CPUs with multiple pipelines and instruction units.

Data Parallelism
Different Data can be processed independently
Each processor executes the same operations on its share of the input data.
Example: distributing loop iterations over multiple processors, or a CPU's vector units

Task Parallelism
Tasks are distributed among the processors/nodes
Each processor executes a different thread/process.
Example: Threaded programs.

7/25/2011

DIMA TU Berlin

146

Modes of Query Parallelism


Inter Query Parallelism (multiple concurrent queries)
Necessary for efficient resource utilization: While one query waits
(e.g. for I/O), another one executes
Requires concurrency control (locking mechanisms) to guarantee transactional
properties (the "I" in ACID)
Important for highly transactional scenarios (OLTP)

Intra-Query Parallelism (parallel processing of a single query)


I/O Parallelism: Concurrent reading from multiple disks
Hidden: Hardware RAID, Transparent: Spanned tablespaces
Intra Operator Parallelism: Multiple threads work on the same operator.
Example: Parallel Sort
Inter Operator Parallelism: Multiple pipelined parts of the plan run in parallel
Important for complex analytical tasks (OLAP)

7/25/2011

DIMA TU Berlin

147

Pipeline Parallelism

(Figure: a query plan with scans of T1, T2, and T3 feeding two hash joins and a sort, with a final return operator.)

Step 1: Two threads scan one base table each and build the hash tables for the joins.
Step 2: One thread scans the third table and probes the hash tables; a second thread starts the sort (sorting sub-lists, merging the first lists).
Step 3: One thread returns the result, business as usual.

7/25/2011

DIMA TU Berlin
148

Pipeline Parallelism
Pipeline Parallelism, also called Inter Operator Parallelism
Inter Operator, because the parallelism is between the operators

Execute multiple pipelines simultaneously
  limited applicability: only works if multiple pipelines are present and they are not totally dependent on each other

Problems:
  high synchronization overhead
  mostly limited to a low degree of parallelism (not too many pipelines per query)
  only suited for shared-memory architectures

7/25/2011

DIMA TU Berlin

149

Data Parallelism
Pipeline parallelism is applicable only to a limited degree, hence: Data Parallelism
  data is divided into several subsets
  most operations don't need a complete view of the data
    e.g. a filter looks only at a single tuple at a time
  subsets can be processed independently and hence in parallel
  the degree of parallelism is as high as the number of possible subsets
    for a filter: as high as the number of tuples

Some operations possibly need a view of larger portions of the data
  e.g. a grouping/aggregation operation needs all tuples with the same grouping key
  Are they all in the same set? Can we guarantee that?
  Different operators need different sets!

7/25/2011

DIMA TU Berlin

150

Basics of Parallel Query Processing


Levels of Resource Sharing
Shared-Memory, Shared-Disk, Shared-Nothing

Data Partitioning
Round-robin, Hash, Range

Parallel Operators and Costs

Tuple-at-a-time (i.e. Selection)


Sorting
Projection, Grouping, Aggregation
Join

7/25/2011

DIMA TU Berlin

151

Parallel Architectures (I)

Shared Memory
Several CPUs share a single memory and disk (array)
Communication over a single common bus

Source:
Garcia-Molina et al.,
Database Systems
The Complete Book.
Second Edition
7/25/2011

DIMA TU Berlin

152

Parallel Architectures (II)

Shared Disk
Several nodes with multiple CPUs, each node has its private memory
Single attached disk (array): Often NAS, SAN, etc

Source:
Garcia-Molina et al.,
Database Systems
The Complete Book.
Second Edition
7/25/2011

DIMA TU Berlin

153

Parallel Architectures (III)

Shared Nothing
Each node has it own set of CPUs, memory and disks attached
Data needs to be partitioned over the nodes
Data is exchanged through direct node-to-node communication

Source:
Garcia-Molina et al.,
Database Systems
The Complete Book.
Second Edition
7/25/2011

DIMA TU Berlin

154

Data Partitioning (I)


Partitioning the data means creating a set of disjoint subsets
Example: Sales data, every year gets its own partition

For shared-nothing, data must be partitioned across nodes


If it were replicated, it would effectively become a shared-disk with the local
disks acting like a cache (must be kept coherent)

Partitioning with certain characteristics has more advantages


Some queries can be limited to operate on certain sets only, if it is provable
that all relevant data (passing the predicates) is in that partition
Partitions can be simply dropped as a whole (data is rolled out) when it is no
longer needed (e.g. discard old sales)

7/25/2011

DIMA TU Berlin

155

Data Partitioning (II)


How to partition the data into disjoint sets?
Round robin: Each set gets a tuple in a round, all sets have guaranteed
equal amount of tuples, no apparent relationship between tuples in one
set.
Hash Partitioned: Define a set of partitioning columns. Generate a hash
value over those columns to decide the target set. All tuples with equal
values in the partitioning columns are in the same set.

Range Partitioned: Define a set of partitioning columns and split the domain of those columns into ranges. The range determines the target set. All tuples in one set fall into the same range.

7/25/2011

DIMA TU Berlin

156

Data Parallelism Example


Client send a SQL query to one
of the cluster nodes

Node becomes the


"coordinator"

Coordinator compiles
the query

ClusterNode

Client

Parsing, Checking, Optimization


Parallelization

Query
Final
Results

Sends partial plans to the other


cluster nodes that describes
their tasks

Coordinator

Partial
Results

ClusterNode

ClusterNode

Coordinator also executes the


partial plan on his part of the data

Collects partial results and


finalizes them (see next slide)

7/25/2011

DIMA TU Berlin

157

Data Parallelism Example


For shared-nothing & shared-disk

Multiple instances of a sub-plan are


executed on different computers
The instances operate on different
splits or partitions of the data

At some points, results from the subplans are collected


For more complex queries, results
are not collected but re-distributed,
for further parallel processing

Return

Point of data
shipping
PreAggregation

Group
Agg

Final Aggregation
Sub-plan result
collection

Queue
Group
Agg

Sort
NL-Join

Parallel
Instances
Fetch
T2 (part)

7/25/2011

DIMA TU Berlin

Scan

IX-Scan

T1 (part)

IX-T2.1 (part)

158

Parallel Operators
Ideally: Operate as much as possible on individual partitions of the data
Bring the operation to the data
No communication needed, ideal parallelism

Easy for simple "per-tuple" operators


Scan, IX-Scan, Fetch, Filter

Problematic: some operators need the whole picture
  e.g. sorts and aggregations can only be pre-processed in parallel and need a final step on a single node
    unless they occur in a correlated subplan known to contain only tuples from one partition
  e.g. joins need matching tuples: either organize the inputs accordingly, or join on the coordinator after collecting the partial results (not parallel any more!)

7/25/2011

DIMA TU Berlin

159

Notations and Assumptions

S          Relation S
S[i, h]    Partition i of relation S according to partitioning scheme h
B(S)       Number of blocks of relation S
p          Number of nodes

We assume a shared-nothing architecture
  most commercial database vendors use shared-nothing approaches

Network transfer is at least as expensive as disk access
  in some cost models it is still far more expensive
  today, network bandwidth ≈ disk bandwidth
  but the network is shared; switches and routers in particular have a throughput limit

Partitioning schemes (hash/range) produce partitions of roughly equal size.

7/25/2011

DIMA TU Berlin

160

Parallel Selection
Selection can be parallelized very efficiently (an embarrassingly parallel problem)
  each node performs the selection on its existing local partition
  selection needs no context, so the data can be partitioned in an arbitrary way
  the partial results are unioned afterwards

Cost: B(S) / p

7/25/2011

DIMA TU Berlin

161

Parallel Projection, Grouping, Aggregation

7/25/2011

DIMA TU Berlin

162

Parallel Sorting
Range-partitioning sort (partition by range, then sort)
  range-partition the relation according to the sort columns
  sort the single partitions locally (e.g. by TPMMS)
  Cost: B(S) partitioning + B(S) transfer + B(S)/p local sorting
  Problem: how to find a uniform range-partitioning scheme?
  The result is already partitioned in the cluster.

Parallel external sort-merge (sort locally, then merge)
  reuse an existing data partitioning
  partitions are sorted locally (e.g. by TPMMS)
  sorted partitions need to be merged
    e.g. one node merges two partitions at a time until the whole relation is sorted
  Cost: B(S)/p local sorting + log2(p)*B(S)/2 transfer + log2(p)*B(S) local merging
  The result sits on one machine.
(A sketch of the range-partitioning variant follows below.)
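
A minimal Python sketch of the range-partitioning sort (partition by key range, sort each partition locally, read the partitions in order); the range boundaries and the sequential execution are assumptions by the editor.

def range_partitioning_sort(tuples, boundaries, key=lambda t: t):
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:                       # "transfer": ship each tuple to its range partition
        idx = sum(1 for b in boundaries if key(t) >= b)
        parts[idx].append(t)
    for p in parts:                        # local sort on each node (here: one after another)
        p.sort(key=key)
    return parts                           # globally sorted when read partition by partition

data = [17, 3, 42, 8, 23, 15, 4]
print(range_partitioning_sort(data, boundaries=[10, 20]))
# [[3, 4, 8], [15, 17], [23, 42]]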

7/25/2011

DIMA TU Berlin

163

Parallel Equi-Joins (I)


A special class of Joins that are suited for parallelization are Natural- and
Equi-Joins.
For Equi-Joins we only look at tuple pairs that share the same join key.

Idea: Partition relations R and S using the same partitioning scheme over
the join key.
All values of R and S with the same join key end up at the same node!
All joins can be performed locally!

Actual implementation depends on how the relations are partitioned:


Co-Located Join
Directed Join
Re-Partitioning Join

7/25/2011

DIMA TU Berlin

164

Parallel Equi-Joins (II)


1. Both R and S are already partitioned over the join key (and with the same partitioning scheme): Co-Located Join
     No re-partitioning is needed!
     Cost: ??? local join cost

2. Only one relation is partitioned over the join key: Directed Join
     Re-partition the other relation with the same partitioning scheme.
     Cost (assuming R is already partitioned): B(S) partitioning + B(S) transfer + ??? local join cost

3. No relation is partitioned over the join key: Repartition Join
     Re-partition both relations over the join key.
     Cost: B(S)+B(R) partitioning + B(S)+B(R) transfer + ??? local join cost

7/25/2011

DIMA TU Berlin

165

Symmetric Fragment-and-Replicate Join

Join

7/25/2011

DIMA TU Berlin

166

Symmetric Fragment-and-Replicate Join (II)

Nodes in the Cluster

7/25/2011

DIMA TU Berlin

167

Asymmetric Fragment-and-Replicate Join


We can do better if relation S is much smaller than R.
Idea: reuse the existing partitioning of R and replicate the whole relation S to each node.
Cost:
  p * B(S)   transport
  ???        local join

The Asymmetric Fragment-and-Replicate Join is a special case of the symmetric algorithm with m = p and n = 1.
The Asymmetric Fragment-and-Replicate Join is also called Broadcast Join.
7/25/2011

DIMA TU Berlin

168

Limits in Parallel Databases


Database clusters tend to scale up to 64 or 128 nodes
  beyond that, the speedup curve flattens
  communication overhead eats the speedup gained by each additional node
  hard limit example: 1000 nodes for DB2

Shared disk: does not scale infinitely, the bus and synchronization become the bottleneck
  for updates: cache coherency problem
  for reads: I/O bandwidth limits

Shared nothing: cannot easily compensate the loss of a node
  in large clusters, failures and outages are the common case
  loss of a node means loss of data!
  unless the data is replicated
  but replicated data must be kept consistent, which has a high overhead

7/25/2011

DIMA TU Berlin

169

Literature

S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine GRACE.

D. A. Schneider and D. J. DeWitt. A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment. SIGMOD Conference, 1989.

D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine.

J. W. Stamos and H. C. Young. A Symmetric Fragment and Replicate Algorithm for Distributed Joins. IEEE Trans. Parallel Distrib. Syst., 1993.

7/25/2011

DIMA TU Berlin

170

Side-Note: What about updates/transactions?


OLTP style applications that are beyond relational databases'
capabilities exist as well
Some applications still require fast and efficient lookup and
retrieval of small amounts of data

Web index access, mail accounts, warehouse updates for resellers

Addressed by key/value-pair-based storage systems (e.g. Google BigTable and Megastore)
  data can only be accessed through a key
  only an additional filter on columns and timestamps can be applied

Some applications do still need updates and certain guarantees about them
  no hard transactions, especially no multi-record transactions!
  eventual consistency model (Amazon Dynamo)

These techniques require a lecture of their own

7/25/2011

DIMA TU Berlin

171

