
CHAPTER

18

Parallel Databases
This chapter is suitable for an advanced course, but can also be used for independent study projects by students of a first course. The chapter covers several aspects of the design of parallel database systems: partitioning of data, parallelization of individual relational operations, and parallelization of relational expressions. The chapter also briefly covers some systems issues, such as cache coherency and failure resiliency.
The most important applications of parallel databases today are for warehousing and analyzing large amounts of data. Therefore partitioning of data and parallel query processing are covered in significant detail. Query optimization is also of importance, for the same reason. However, parallel query optimization is still not a fully solved problem; exhaustive search, as is used for sequential query optimization, is too expensive in a parallel system, forcing the use of heuristics.
The description of parallel query processing algorithms is based on the shared-nothing model. Students may be asked to study how the algorithms can be improved if shared-memory machines are used instead.

18.9 For each of the three partitioning techniques, namely round-robin, hash partitioning, and range partitioning, give an example of a query for which that partitioning technique would provide the fastest response.
Answer:

Round-robin partitioning:
When relations are large and queries read entire relations, round-robin gives good speed-up and fast response time.

Hash partitioning:
For point queries on the partitioning attributes, this gives the fastest response, as each disk can process a different query simultaneously. If the hash partitioning is uniform, entire relation scans can be performed efficiently.

Range partitioning:
For range queries on the partitioning attributes, which access a few tuples, range partitioning gives the fastest response.
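The three techniques in this answer can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names, the use of Python's built-in `hash`, and the convention that a partitioning vector `[v0, v1, ...]` sends keys below `v0` to disk 0 are all assumptions made for the example.

```python
from bisect import bisect_right


def round_robin_partition(tuples, n):
    """Send the i-th tuple to disk i mod n."""
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)
    return parts


def hash_partition(tuples, n, key):
    """Send each tuple to the disk chosen by hashing its partitioning attribute."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(key(t)) % n].append(t)
    return parts


def range_partition(tuples, vector, key):
    """vector is sorted; key < vector[0] goes to disk 0, vector[0] <= key < vector[1] to disk 1, ..."""
    parts = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        parts[bisect_right(vector, key(t))].append(t)
    return parts


def disks_for_range_query(vector, lo, hi):
    """Disks that must be read to answer lo <= key <= hi under range partitioning."""
    return list(range(bisect_right(vector, lo), bisect_right(vector, hi) + 1))
```

With a vector of `[4, 8]`, a range query for keys between 5 and 6 touches only disk 1, whereas under round-robin or hash partitioning every disk would have to be scanned for the same query.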
18.10 What factors could result in skew when a relation is partitioned on one of its attributes by:

a. Hash partitioning?

b. Range partitioning?

In each case, what can be done to reduce the skew?


Answer:

a. Hash partitioning:
Too many records with the same value for the hashing attribute, or a poorly chosen hash function without the properties of randomness and uniformity, can result in a skewed partition. To improve the situation, we should experiment with better hashing functions for that relation.

b. Range partitioning:
Non-uniform distribution of values for the partitioning attribute (including duplicate values for the partitioning attribute) which are not taken into account by a bad partitioning vector is the main reason for skewed partitions. Sorting the relation on the partitioning attribute and then dividing it into n ranges with equal number of tuples per range will give a good partitioning vector with very low skew.
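The sort-then-divide idea from the range-partitioning answer can be sketched as follows. This is an illustrative sketch under stated assumptions: `balanced_partition_vector` and `partition_sizes` are hypothetical helper names, and the vector convention (keys below the first cut point go to partition 0) is chosen for the example.

```python
from bisect import bisect_right
from collections import Counter


def balanced_partition_vector(values, n):
    """Sort the attribute values and pick n-1 cut points at equal-count positions."""
    ordered = sorted(values)
    size = len(ordered)
    return [ordered[(i * size) // n] for i in range(1, n)]


def partition_sizes(values, vector):
    """Count how many values land in each range defined by the vector."""
    counts = Counter(bisect_right(vector, v) for v in values)
    return [counts.get(i, 0) for i in range(len(vector) + 1)]


# Skewed data: a few small values and a cluster of large ones.
values = [1, 1, 2, 2] + list(range(100, 108))
good_vector = balanced_partition_vector(values, 3)   # cut points taken from the data
naive_vector = [36, 72]                              # equal-width cut points
```

On this data the equal-width vector leaves one partition empty and puts eight of the twelve values in another, while the sorted vector yields four values per partition. A single value that occurs very often can still overload one partition, since all its duplicates must go to the same range.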

18.11 Give an example of a join that is not a simple equi-join for which partitioned parallelism can be used. What attributes should be used for partitioning?
Answer: We give two examples of such joins.
a. $r \bowtie_{(r.A = s.B) \wedge (r.A < s.C)} s$
Here we have an equi-join condition which can be executed first, and the extra conditions can be checked independently on each tuple in the join result. Partitioned parallelism is useful to execute the equi-join.

b. $r \bowtie_{(r.A \geq \lfloor s.B/20 \rfloor \cdot 20) \wedge (r.A < (\lfloor s.B/20 \rfloor + 1) \cdot 20)} s$
This is a query in which an r tuple and an s tuple join with each other if they fall into the same range of values. Hence partitioned parallelism applies naturally to this scenario, even though the join is not an equi-join.

For both the queries, r should be partitioned on attribute A and s on attribute B. For the second query, the partitioning of s should actually be done on $\lfloor s.B/20 \rfloor \cdot 20$.
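The second example, where tuples join when they fall into the same range of width 20, can be sketched as a partitioned join. This is a simplified illustration: relations are reduced to lists of their join-attribute values (r to its A values, s to its B values), and the helper names and the choice to partition r on the same range start as s are assumptions made for the example.

```python
def bucket(x, width=20):
    """Start of the width-sized range containing x, i.e. floor(x / width) * width."""
    return (x // width) * width


def partition_by_bucket(values, n, width=20):
    """Send every value of one range to the same processor."""
    parts = [[] for _ in range(n)]
    for v in values:
        parts[(bucket(v, width) // width) % n].append(v)
    return parts


def band_condition(a, b, width=20):
    """Join condition: a >= floor(b/width)*width and a < (floor(b/width)+1)*width."""
    low = bucket(b, width)
    return low <= a < low + width


def partitioned_band_join(r_A, s_B, n, width=20):
    r_parts = partition_by_bucket(r_A, n, width)
    s_parts = partition_by_bucket(s_B, n, width)
    result = []
    # Each (r_i, s_i) pair is independent and could run on its own processor.
    for r_i, s_i in zip(r_parts, s_parts):
        result.extend((a, b) for a in r_i for b in s_i if band_condition(a, b, width))
    return result
```

Because matching tuples always share a range start, no pair is lost by joining each partition locally; the result equals that of a nested-loop join over the whole relations.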

18.12 Describe a good way to parallelize each of the following:

a. The difference operation

b. Aggregation by the count operation

c. Aggregation by the count distinct operation

d. Aggregation by the avg operation

e. Left outer join, if the join condition involves only equality

f. Left outer join, if the join condition involves comparisons other than equality

g. Full outer join, if the join condition involves comparisons other than equality
Answer:

a. We can parallelize the difference operation by partitioning the relations on all the attributes, and then computing differences locally at each processor. As in aggregation, the cost of transferring tuples during partitioning can be reduced by partially computing differences at each processor, before partitioning.

b. Let us refer to the group-by attribute as attribute A, and the attribute on which the aggregation function operates as attribute B. count is performed just like sum (mentioned in the book) except that a count of the number of values of attribute B for each value of attribute A is transferred to the correct destination processor, instead of a sum. After partitioning, the partial counts from all the processors are added up locally at each processor to get the final result.

c. For this, partial counts cannot be computed locally before partitioning. Each processor instead transfers all unique B values for each A value to the correct destination processor. After partitioning, each processor locally counts the number of unique tuples for each value of A, and then outputs the final result.

d. This can again be implemented like sum, except that for each value of A, a sum of the B values as well as a count of the number of tuples in the group is transferred during partitioning. Then each processor outputs its local result, by dividing the total sum by the total number of tuples for each A value assigned to its partition.

e. This can be performed just like partitioned natural join. After partitioning, each processor computes the left outer join locally using any of the strategies of Chapter 12.

f. The left outer join can be computed using an extension of the Fragment-and-Replicate scheme to compute non equi-joins. Consider the left outer join of r and s. The relations are partitioned, and $r \bowtie s$ is computed at each site. We also collect tuples from r that did not match any tuples from s; call the set of these dangling tuples at site i $d_i$. After the above step is done at each site, for each fragment of r, we take the intersection of the $d_i$'s from every processor in which the fragment of r was replicated. The intersections give the real set of dangling tuples; these tuples are padded with nulls and added to the result. The intersections themselves, followed by addition of padded tuples to the result, can be done in parallel by partitioning.

g. The algorithm is basically the same as above, except that when combining results, the processing of dangling tuples must be done for both relations.
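Parts b, c, and d above differ only in what each processor ships before the final combination step. The following sketch makes that concrete under simplifying assumptions: each fragment is a list of (A, B) pairs held by one processor, destinations are chosen with Python's built-in `hash`, B contains no nulls, and all function names are hypothetical.

```python
from collections import defaultdict


def local_partials(fragment):
    """Per-processor pre-aggregation: A -> [partial sum of B, partial count]."""
    acc = defaultdict(lambda: [0, 0])
    for a, b in fragment:
        acc[a][0] += b
        acc[a][1] += 1
    return acc


def parallel_count_avg(fragments, n):
    """count and avg: ship one (sum, count) pair per group, then combine."""
    dest = [defaultdict(lambda: [0, 0]) for _ in range(n)]
    for frag in fragments:
        for a, (s, c) in local_partials(frag).items():
            slot = dest[hash(a) % n][a]   # destination processor for group a
            slot[0] += s
            slot[1] += c
    # Each destination finishes the groups assigned to it.
    return {a: {"count": c, "avg": s / c}
            for part in dest for a, (s, c) in part.items()}


def parallel_count_distinct(fragments, n):
    """count distinct: partial counts are useless, so ship unique (A, B) pairs."""
    dest = [defaultdict(set) for _ in range(n)]
    for frag in fragments:
        for a, b in set(frag):            # local duplicate elimination only
            dest[hash(a) % n][a].add(b)
    return {a: len(bs) for part in dest for a, bs in part.items()}
```

For count and avg the data shipped per group is constant in size, while for count distinct it grows with the number of distinct B values, which is why the latter benefits less from local pre-aggregation.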

18.13 Describe the benefits and drawbacks of pipelined parallelism.

Answer:

Benefits: No need to write intermediate relations to disk only to read them back immediately.

Drawbacks:

a. Cannot take advantage of high degrees of parallelism, as typical queries do not have a large number of operations.
b. Not possible to pipeline operators which need to look at all the input before producing any output.
c. Since each operation executes on a single processor, the most expensive ones take a long time to finish. Thus speed-up will be low despite the use of parallelism.
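The structure of a pipeline can be sketched with one thread per operator and in-memory queues between them. This is a structural illustration only: the `stage` and `run_pipeline` names are hypothetical, and Python threads will not show real CPU speed-up here because of the global interpreter lock.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the tuple stream


def stage(func, inbox, outbox):
    """One pipelined operator: consume tuples as they arrive, emit results at once."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        for out in func(item):
            outbox.put(out)


def run_pipeline(source, funcs):
    """Chain the operators in funcs; intermediate results never touch disk."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for item in source:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

The sketch also shows the drawbacks listed in the answer: the degree of parallelism equals the number of operators, each operator has to work tuple by tuple (a sort could not be written as such a `func`), and the slowest stage bounds the throughput of the whole chain.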
18.14 Suppose you wish to handle a workload consisting of a large number of small transactions by using shared-nothing parallelism.

a. Is intraquery parallelism required in such a situation? If not, why, and what form of parallelism is appropriate?

b. What form of skew would be of significance with such a workload?

c. Suppose most transactions accessed one account record, which includes an account_type attribute, and an associated account_type_master record, which provides information about the account type. How would you partition and/or replicate data to speed up transactions? You may assume that the account_type_master relation is rarely updated.

Answer:

a. Intraquery parallelism is probably not appropriate for this situation. Since each individual transaction is small, the overhead of parallelizing each query may exceed the potential benefits. Interquery parallelism would be a better choice, allowing many transactions to run in parallel.

b. Partition skew can be a performance issue in this type of system, especially with the use of shared-nothing parallelism. A load imbalance amongst the processors of the distributed system can significantly reduce the speedup gained by parallel execution. For example, if all transactions happen to involve only the data in a single partition, the processors not associated with that partition will not be used at all.

c. Since account_type_master is rarely updated, it can be replicated in entirety across all nodes. If the account relation is updated frequently and accesses are well-distributed, it should be partitioned across nodes.
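The layout suggested in part c can be sketched as follows: account is hash-partitioned on its key, while every node holds a full copy of account_type_master, so a transaction is served entirely by one node. The `Node` class, the function names, the sample data, and the use of Python's built-in `hash` are assumptions made for this illustration.

```python
class Node:
    def __init__(self, type_master):
        self.accounts = {}                    # this node's partition of account
        self.type_master = dict(type_master)  # full local replica of the master relation


def build_cluster(accounts, type_master, n):
    """Partition account across n nodes and replicate account_type_master on each."""
    nodes = [Node(type_master) for _ in range(n)]
    for acct_no, acct_type in accounts:
        nodes[hash(acct_no) % n].accounts[acct_no] = acct_type
    return nodes


def lookup(nodes, acct_no):
    """Serve a transaction at the single node that owns the account record."""
    node_id = hash(acct_no) % len(nodes)
    node = nodes[node_id]
    acct_type = node.accounts[acct_no]
    return node_id, acct_type, node.type_master[acct_type]
```

The cost of this design is that an update to account_type_master must be applied at every node, which is acceptable only because the relation is assumed to change rarely.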

18.15 The attribute on which a relation is partitioned can have a significant impact on the cost of a query.

a. Given a workload of SQL queries on a single relation, what attributes would be candidates for partitioning?

b. How would you choose between the alternative partitioning techniques, based on the workload?

c. Is it possible to partition a relation on more than one attribute? Explain your answer.

Answer:

a. The candidate attributes would be:
i. Attributes on which one or more queries has a selection condition. The corresponding selection condition can then be evaluated at a single processor, instead of being evaluated at all processors.
ii. Attributes involved in join conditions. If such an attribute is used for partitioning, it is possible to perform the join without repartitioning the relation. This effect is particularly beneficial for very large relations, for which repartitioning can be very expensive.
iii. Attributes involved in group-by clauses; similar to joins, it is possible to perform aggregation without repartitioning the corresponding relation.

b. A cost-based approach works best in choosing between alternatives. In this approach, candidate partitioning choices are generated, and for each candidate the cost of executing all the queries/updates in a workload is estimated. The choice leading to the least cost is picked.
One issue is that the number of candidate choices is generally very large. Algorithms and heuristics designed to limit the number of candidates for which costs need to be estimated are widely used in practice.
Another issue is that the workload may have a very large number of queries/updates. Techniques to reduce this number include the following: (a) combining repeated occurrences of a query that only differ in constants, replacing them by one parametrized query along with a count of the number of occurrences, and (b) dropping queries which are very cheap anyway, or not likely to be affected by the partitioning choice.

c. It is possible to partition a relation on more than one attribute, in two ways. One is to involve multiple attributes in a single composite partitioning key. The other way is to keep more than one copy of the relation, partitioned in different ways. The latter approach increases update costs, but can speed up some queries significantly.

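The cost-based selection described in the answer to 18.15 (b) can be sketched as a small search over candidate partitioning attributes. Everything here is an assumption made for illustration: queries are modeled as hypothetical (kind, attribute, constant) triples, and the toy cost model simply counts how many processors a query touches, which is far cruder than a real optimizer's estimates.

```python
from collections import Counter


def collapse_workload(queries):
    """Merge queries that differ only in constants into (kind, attribute) plus a count."""
    return Counter((kind, attr) for kind, attr, _constant in queries)


def estimated_cost(kind, attr, partition_attr, n):
    """Toy cost model: processors touched by one query under a given partitioning."""
    if kind == "select" and attr == partition_attr:
        return 1      # selection on the partitioning attribute goes to one processor
    return n          # anything else is evaluated at all n processors


def choose_partition_attribute(queries, candidates, n):
    """Estimate the total workload cost per candidate and pick the cheapest."""
    workload = collapse_workload(queries)
    totals = {
        cand: sum(count * estimated_cost(kind, attr, cand, n)
                  for (kind, attr), count in workload.items())
        for cand in candidates
    }
    return min(totals, key=totals.get), totals
```

Collapsing the workload first means the cost model is called once per distinct query shape rather than once per query, which is the point of technique (a) in the answer.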