Parallel Algorithm

Parallel Algorithms
Computation Models
Goal of computation model is to provide a
realistic representation of the costs of
programming.
Model provides algorithm designers and
programmers a measure of algorithm
complexity which helps them decide what is
good (i.e. performance-efficient)
Goal for Modeling
We want to develop computational models which
accurately represent the cost and performance
of programs
If model is poor, optimum in model may not
coincide with optimum observed in practice

[Figure: "Model" and "Real World" panels comparing choices A and B, with the optimum marked in each]
Models of Computation
What's a model good for?
Provides a way to think about computers.
Influences design of:
Architectures
Languages
Algorithms
Provides a way of estimating how well a
program will perform.
Cost in model should be roughly same as cost of
executing program

The Random Access Machine Model
RAM model of serial computers:
Memory is a sequence of words, each
capable of containing an integer.
Each memory access takes one unit of time
Basic operations (add, multiply, compare)
take one unit time.
Instructions are not modifiable
Read-only input tape, write-only output tape

Has RAM influenced our thinking?
Language design:
No way to designate registers, cache, DRAM.
Most convenient disk access is as streams.
How do you express atomic read/modify/write?
Machine & system design:
It's not very easy to modify code.
Systems pretend instructions are executed in-order.
Performance Analysis:
Primary measures are operations/sec (MFlop/sec, MHz, ...)
What's the difference between Quicksort and Heapsort?
What about parallel computers?
RAM model is generally considered a very
successful bridging model between
programmer and hardware.
Since RAM is so successful, let's generalize
it for parallel computers ...
PRAM [Parallel Random Access Machine]
PRAM composed of:
P processors, each with its own unmodifiable program.
A single shared memory composed of a sequence of
words, each capable of containing an arbitrary
integer.
a read-only input tape.
a write-only output tape.
PRAM model is a synchronous, MIMD, shared
address space parallel computer.
(Introduced by Fortune and Wyllie, 1978)
PRAM model of computation
p processors, each with local memory
Synchronous operation
Shared memory reads and writes
Each processor has unique id in range 1-p
Shared memory
Characteristics
At each unit of time, a processor is either
active or idle (depending on id)
All processors execute same program
At each time step, all processors execute
same instruction on different data (data-
parallel)
Focuses on concurrency only
Variants of PRAM model
                   Exclusive Write    Concurrent Write
Exclusive Read     EREW               ERCW
Concurrent Read    CREW               CRCW
More PRAM taxonomy
Different protocols can be used for reading
and writing shared memory.
EREW - exclusive read, exclusive write
A program isn't allowed to have two processors access
the same memory location at the same time.
CREW - concurrent read, exclusive write
CRCW - concurrent read, concurrent write
Needs protocol for arbitrating write conflicts
CROW - concurrent read, owner write
Each memory location has an official owner
PRAM can emulate a message-passing machine
by partitioning memory into private memories.
Sub-variants of CRCW
Common CRCW
CW iff all processors writing same value
Arbitrary CRCW
Arbitrary value of write set stored
Priority CRCW
Value of min-index processor stored
Combining CRCW
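The write-conflict rules above can be made concrete with a small simulation. This is only a sketch: the function name `resolve_writes` and the `(processor_id, value)` request format are assumptions made for illustration, and Combining CRCW is left out because the slide gives no rule for it.

```python
def resolve_writes(requests, policy):
    """Resolve one step of concurrent writes to a single memory cell.

    requests: list of (processor_id, value) pairs aimed at the same cell.
    policy:   "common", "arbitrary", or "priority".
    Returns the value that ends up stored in the cell.
    """
    if not requests:
        raise ValueError("no write requests")
    if policy == "common":
        # Concurrent write is legal only if every writer agrees on the value.
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("Common CRCW: writers disagree")
        return values.pop()
    if policy == "arbitrary":
        # Any requested value is a legal outcome; this sketch keeps the last.
        return requests[-1][1]
    if policy == "priority":
        # The processor with the smallest index wins.
        return min(requests, key=lambda r: r[0])[1]
    raise ValueError(f"unknown policy: {policy}")
```

An algorithm written for Arbitrary CRCW must be correct whichever writer wins, so it should not rely on the "last request" choice made here.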
Why study PRAM algorithms?
Well-developed body of literature on design
and analysis of such algorithms
Baseline model of concurrency
Explicit model
Specify operations at each step
Scheduling of operations on processors
Robust design paradigm
Work-Time paradigm
Higher-level abstraction for PRAM algorithms
WT algorithm = (finite) sequence of time steps
with arbitrary number of operations at each step
Two complexity measures
Step complexity T(n)
Work complexity W(n)

WT algorithm is work-efficient if W(n) = O(T_S(n)),
where T_S(n) is the running time of the optimal sequential algorithm
Designing PRAM algorithms
Balanced trees
Pointer jumping
Euler tours
Divide and conquer
Symmetry breaking
. . .
Balanced trees
Key idea: Build balanced binary tree on input
data, sweep tree up and down
Tree not a data structure, often a control
structure (e.g., recursion)
Alg : Sum
Given: Sequence a of n = 2^k elements
Given: Binary associative operator +
Compute: S = a_1 + ... + a_n
WT description of sum
integer B[1..n]
forall i in 1 : n do
    B[i] := a_i
enddo
for h = 1 to k do
    forall i in 1 : n/2^h do
        B[i] := B[2i-1] + B[2i]
    enddo
enddo
S := B[1]
Points to note about WT pgm
Global program: no references to processor
id
Contains both serial and concurrent
operations
Semantics of forall
Order of additions different from
sequential order: associativity critical
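A minimal sequential simulation of the WT sum, assuming 0-based indexing (so B[i] := B[2i-1] + B[2i] becomes B[i] = B[2i] + B[2i+1]). The name `wt_sum` and the returned step count are illustrative additions. Each `forall` is modelled by building the whole right-hand side before assigning, which mimics the read-then-write semantics of a synchronous step.

```python
import operator


def wt_sum(a, op=operator.add):
    """Balanced-tree reduction; returns (S, number of WT time steps)."""
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    B = list(a)          # step 1: forall i, B[i] := a_i
    steps = 1
    while n > 1:         # k = lg n rounds, one forall per round
        n //= 2
        # All reads of the old B happen before any write to B.
        B[:n] = [op(B[2 * i], B[2 * i + 1]) for i in range(n)]
        steps += 1
    steps += 1           # final step: S := B[1]
    return B[0], steps
```

With string concatenation as the operator, `wt_sum(list("abcd"))` yields `"abcd"`: the operator has to be associative, but it does not have to be commutative, because operands stay in left-to-right order.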
Analysis of scan operation
Algorithm is correct
O(lg n) steps, O(n) work
EREW model
Two variants
Inclusive: as discussed
Exclusive: s_1 = I (identity), s_k = x_1 + ... + x_{k-1}
If n not power of 2, pad to next power
Complexity measures of Sum
Recall definitions of step complexity T(n) and work complexity W(n)
Concurrent execution reduces number of steps
T(n) = 1 + k + 1 = O(lg n)
W(n) = n + \sum_{h=1}^{k} n/2^h + 1 = O(n)
How to do prefix sum ?
Input: Sequence x of n = 2^k elements, binary associative operator +
Output: Sequence s of n = 2^k elements, with s_k = x_1 + ... + x_k

Example:
x = [1, 4, 3, 5, 6, 7, 0, 1]
s = [1, 5, 8, 13, 19, 26, 26, 27]
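One way to get the prefix sums in O(lg n) steps and O(n) work is the balanced-tree pattern: combine adjacent pairs, recurse on the half-length sequence, then fill in the remaining positions. The sketch below is a sequential simulation of that idea; the function name `prefix_sums` is hypothetical and the slides do not spell out this exact formulation.

```python
import operator


def prefix_sums(x, op=operator.add):
    """Inclusive scan of a power-of-two-length sequence, balanced-tree style."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    if n == 1:
        return [x[0]]
    # Up-sweep: one forall step combining adjacent pairs.
    pairs = [op(x[2 * i], x[2 * i + 1]) for i in range(n // 2)]
    # Recurse on n/2 elements; z[j] is the prefix sum ending at x[2j+1].
    z = prefix_sums(pairs, op)
    # Down-sweep: one forall step; every position is filled independently.
    s = [None] * n
    for i in range(n):
        if i == 0:
            s[i] = x[0]
        elif i % 2 == 1:
            s[i] = z[i // 2]
        else:
            s[i] = op(z[i // 2 - 1], x[i])
    return s
```

Each recursion level does two constant-depth forall steps on a sequence half as long, so the work is n + n/2 + n/4 + ... = O(n) and the depth is O(lg n).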
List Ranking
List ranking problem
Given a singly linked list L with n objects, for each node,
compute the distance to the end of the list
If d denotes the distance:
node.d = 0                  if node.next = nil
node.d = node.next.d + 1    otherwise
Serial algorithm: O(n)
Parallel algorithm
Assign one processor for each node
Assume there are as many processors as list objects
For each node i, perform
1. i.d = i.d + i.next.d
2. i.next = i.next.next // pointer jumping
List Ranking - Pointer Jumping
List_ranking(L)
1. for each node i, in parallel do
2.     if i.next = nil then i.d = 0
3.     else i.d = 1
4. while exists a node i such that i.next != nil do
5.     for each node i, in parallel do
6.         if i.next != nil then
7.             i.d = i.d + i.next.d    // i updates i itself
8.             i.next = i.next.next
Analysis
After one pointer-jumping step, the list is transformed into two (interleaved)
lists
After the next step, four (interleaved) lists
Each pointer-jumping step doubles the number of lists and halves their
length
After ⌈log n⌉ steps, all lists contain only one node
Total time: O(log n)
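The pointer-jumping loop can be simulated sequentially as a sketch, assuming the list is given as an array `next_` of successor indices with `None` at the tail (a representation chosen here for illustration). Every round builds new `d` and `nxt` arrays from the old ones, which stands in for the PRAM's synchronous read-then-write step.

```python
def list_ranking(next_):
    """Return (d, rounds): d[i] is the distance from node i to the list end."""
    n = len(next_)
    nxt = list(next_)
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    rounds = 0
    while any(p is not None for p in nxt):
        old_d, old_nxt = d, nxt
        # One synchronous step: all processors read old values, then write.
        d = [old_d[i] if old_nxt[i] is None else old_d[i] + old_d[old_nxt[i]]
             for i in range(n)]
        nxt = [None if old_nxt[i] is None else old_nxt[old_nxt[i]]
               for i in range(n)]
        rounds += 1
    return d, rounds
```

For the five-node chain 0 -> 1 -> 2 -> 3 -> 4, the call `list_ranking([1, 2, 3, 4, None])` returns the ranks `[4, 3, 2, 1, 0]` after 3 rounds.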
List Ranking - Example
List Ranking - Discussion
Synchronization is important
In step 8 (i.next = i.next.next), all processors must read the right-hand
side before any processor writes the left-hand side
The list ranking algorithm is EREW
If we assume in step 7 (i.d = i.d + i.next.d) all processors read i.d and
then read i.next.d
If j.next = i, i and j do not read i.d concurrently
Work performance
performs O(n log n) work since n processors in O(log n) time
Work efficient
A PRAM algorithm is work-efficient w.r.t. another algorithm if the work of
the two algorithms is within a constant factor
Is the list ranking algorithm work-efficient w.r.t. the serial algorithm?
No, because O(n log n) versus O(n)
Speedup
S = n / log n
Parallel Prefix on a List
Prefix computation
Input: <x_1, x_2, ..., x_n>, a binary, associative operator ⊗
Output: <y_1, y_2, ..., y_n>
Prefix computation: y_k = x_1 ⊗ x_2 ⊗ ... ⊗ x_k
Example
if x_k = 1 for k = 1..n and ⊗ = +
Then y_k = k, for k = 1..n
Serial algorithm: O(n)
Notation
[i, j] = x_i ⊗ x_{i+1} ⊗ ... ⊗ x_j
[k, k] = x_k
[i, k] ⊗ [k+1, j] = [i, j]
Idea: perform prefix computation on a linked list so that
each node k contains [k, k] = x_k initially
finally each node k contains [1, k] = y_k
Parallel Prefix on a List (2)
List_prefix(L, X)
// L: list, X: <x_1, x_2, ..., x_n>
1. for each node i, in parallel do
2.     i.y = x_i
3. while exists a node i such that i.next != nil do
4.     for each node i, in parallel do
5.         if i.next != nil then
6.             i.next.y = i.y ⊗ i.next.y    // i updates its successor
7.             i.next = i.next.next

Analysis
Initially the k-th node has [k,k] as its y-value and points to the (k+1)-th node
At the first iteration,
the k-th node fetches [k+1,k+1] from its successor,
performs [k,k] ⊗ [k+1,k+1], resulting in [k,k+1], and
updates its successor
At the second iteration,
the k-th node fetches [k+1,k+2] from its successor,
performs [k-1,k] ⊗ [k+1,k+2], resulting in [k-1,k+2], and
updates its successor
Running time: O(log n)
After ⌈log n⌉ iterations, all lists contain only one node
Work performed: O(n log n)
Speedup
S = n / log n
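A sequential sketch of the list prefix computation, assuming the same array-of-successors representation as a hypothetical `next_` argument and an arbitrary associative operator `op`. Each node updates its successor, and since every node has at most one predecessor, no two processors write the same cell in a round.

```python
import operator


def list_prefix(x, next_, op=operator.add):
    """Return y where y[i] combines all values from the list head up to node i."""
    n = len(x)
    y = list(x)
    nxt = list(next_)
    while any(p is not None for p in nxt):
        old_y, old_nxt = y, nxt
        y = list(old_y)
        for i in range(n):
            j = old_nxt[i]
            if j is not None:
                # Node i updates its successor j using values from the old round.
                y[j] = op(old_y[i], old_y[j])
        nxt = [None if old_nxt[i] is None else old_nxt[old_nxt[i]]
               for i in range(n)]
    return y
```

With `x = [1, 1, 1, 1, 1]` on the chain `[1, 2, 3, 4, None]` this gives `[1, 2, 3, 4, 5]`, the y_k = k example from the earlier slide.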

Pointer jumping
Fast parallel processing of linked data
structures (lists, trees)
Convention: Draw trees with edges directed
from children to parents
Example: Finding the roots of forest
represented as parent array P
P[i] = j if and only if (i, j) is a forest edge
P[i] = i if and only if i is a root
Algorithm (Roots of forest)
forall i in 1:n do
S[i] := P[i]
while S[i] != S[S[i]] do
S[i] := S[S[i]]
endwhile
enddo
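The roots-of-forest algorithm can be simulated as below. This is a sketch with two assumptions: nodes are numbered from 0 rather than 1, and the per-processor `while` loops are replaced by one global loop that stops when no S[i] changes, which is one possible answer to the termination-detection question.

```python
def find_roots(P):
    """P[i] is the parent of i, with P[i] == i at a root. Returns (S, rounds)."""
    S = list(P)
    n = len(S)
    rounds = 0
    while any(S[i] != S[S[i]] for i in range(n)):
        # One synchronous pointer-jumping step over all nodes.
        S = [S[S[i]] for i in range(n)]
        rounds += 1
    return S, rounds
```

For a path 4 -> 3 -> 2 -> 1 -> 0 plus a separate tree 6 -> 5, `find_roots([0, 0, 1, 2, 3, 5, 5])` reaches `[0, 0, 0, 0, 0, 5, 5]` in 2 rounds, matching lg h for h = 4.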
[Figures: initial state of forest; after one iteration; after another iteration]
Concurrent Read Finding Roots
Analysis of pointer jumping
Termination detection?
At each step, tree distance between i and
S[i] doubles unless S[i] is a root
CREW model
Correctness by induction on h
O(lg h) steps, O(n lg h) work
T_S(n) = O(n)
Not work-efficient unless h constant
Concurrent Read Finding Roots
This is a CREW algorithm
Suppose Exclusive-Read is used, what will be the running time?
Initially only one node i has root information
First iteration: Another node reads from the node i
In total two nodes are filled up
Second iteration: Another two nodes can read from the two
nodes
In total four nodes are filled up
k-th iteration: 2^k nodes are filled up
If there are n nodes, k = log n
So Find_root with Exclusive-Read takes O(log n).
O(log log n) (CREW, when the tree height is O(log n)) vs. O(log n) (EREW)
Euler tours
Technique for fast optimal processing of
tree data
Euler circuit of directed graph: directed
cycle that traverses each edge exactly once
Represent (rooted) tree by Euler circuit of
its directed version
Trees = balanced parentheses
( ( ( ) ( ) ) ( ) ( ( ) ( ) ( ) ) )
Key property: The parenthesis subsequence
corresponding to a subtree is balanced.
Computing the Depth
Problem definition
Given a binary tree with n nodes, compute the depth of
each node
Serial algorithm takes O(n) time
A simple parallel algorithm
Starting from root, compute the depths level by level
Still O(n) because the height of the tree could be as high
as n
Euler tour algorithm
Uses parallel prefix computation
Computing the Depth (2)
Euler tour: A cycle that traverses each edge exactly once in a
graph
It is applied to a directed version of the tree
Regard each undirected edge as two directed edges
Any directed version of a tree has an Euler tour, obtained by traversing the
tree in DFS order, forming a linked list.
Employ 3*n processors
Each node i has fields i.parent, i.left, i.right
Each node i has three processors, i.A, i.B, and i.C.
Three processors in each node of the tree are linked as follows
i.A = i.left.A      if i.left != nil
      i.B           if i.left = nil
i.B = i.right.A     if i.right != nil
      i.C           if i.right = nil
i.C = i.parent.B    if i is the left child
      i.parent.C    if i is the right child
      nil           if i.parent = nil
Algorithm
Construct the Euler tour for the tree: O(1) time
Assign 1 to all A processors, 0 to B processors, -1 to C
processors
Perform a parallel prefix computation
The depth of each node resides in its C processor
O(log n) time (more precisely O(log 3n), since the list has 3n elements)
EREW because no concurrent read or write
Speedup
S = n/log n
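The linking rules and the 1 / 0 / -1 weights can be checked with a small sketch. The array-based tree representation (`left`, `right`, `parent` as lists of indices or `None`) is an assumption for illustration, and the prefix sums are accumulated by walking the list serially for brevity; on a PRAM that walk would be replaced by the parallel prefix computation on a list.

```python
def euler_tour_depths(left, right, parent, root):
    """Return depth[i] for every node of a binary tree via its Euler tour."""
    n = len(parent)
    A, B, C = 0, 1, 2
    succ, weight = {}, {}
    for i in range(n):  # O(1) per node; all nodes are independent
        succ[(i, A)] = (left[i], A) if left[i] is not None else (i, B)
        succ[(i, B)] = (right[i], A) if right[i] is not None else (i, C)
        if parent[i] is None:
            succ[(i, C)] = None
        elif left[parent[i]] == i:
            succ[(i, C)] = (parent[i], B)
        else:
            succ[(i, C)] = (parent[i], C)
        weight[(i, A)], weight[(i, B)], weight[(i, C)] = 1, 0, -1
    depth = [None] * n
    total = 0
    node = (root, A)
    while node is not None:  # serial stand-in for the parallel prefix step
        total += weight[node]
        if node[1] == C:
            depth[node[0]] = total  # prefix sum at the C processor
        node = succ[node]
    return depth
```

For a root 0 with children 1 and 2, where node 1 has a left child 3, the call `euler_tour_depths([1, 3, None, None], [2, None, None, None], [None, 0, 0, 1], 0)` returns `[0, 1, 1, 2]`.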
Broadcasting on a PRAM
Broadcast can be done on CREW PRAM in
O(1) steps:
Broadcaster sends value to shared memory
Processors read from shared memory

Requires lg(P) steps on EREW PRAM.
[Figure: broadcaster B writes to shared memory M, which processors P read]
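The lg(P) bound on an EREW PRAM comes from doubling the number of copies each step. A minimal sketch, assuming one shared cell per processor (the `cells` array and the function name `erew_broadcast` are illustrative): in each step, processor i reads only cell i - filled, so no cell is read by two processors.

```python
def erew_broadcast(value, p):
    """Copy value into p cells by doubling; returns (cells, steps)."""
    cells = [None] * p
    cells[0] = value          # broadcaster writes its value
    filled = 1
    steps = 0
    while filled < p:
        # Processor i reads cell i - filled: every read targets a distinct
        # cell, and reads and writes touch disjoint ranges.
        for i in range(filled, min(2 * filled, p)):
            cells[i] = cells[i - filled]
        filled = min(2 * filled, p)
        steps += 1
    return cells, steps
```

For p = 8 the loop finishes in 3 steps. On a CREW PRAM the same result takes a single step, because all processors may read `cells[0]` at once.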
Concurrent Write Finding Max
Finding max problem
Given an array of n elements, find the
maximum(s)
sequential algorithm is O(n)
Data structure for parallel algorithm
Array A[1..n]
Array m[1..n]. m[i] is true if A[i] is the
maximum
Use n^2 processors
Fast_max(A, n)
1. for i = 1 to n do, in parallel
2.     m[i] = true    // A[i] is potentially maximum
3. for i = 1 to n, j = 1 to n do, in parallel
4.     if A[i] < A[j] then
5.         m[i] = false
6. for i = 1 to n do, in parallel
7.     if m[i] = true then max = A[i]
8. return max
Time complexity: O(1)
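A sequential simulation of Fast_max, as a sketch: the nested loops stand in for the n^2 processors, each of which performs one comparison. Returning the flag array `m` alongside the maximum is an addition made here so the intermediate state can be inspected.

```python
def fast_max(A):
    """Simulate the n^2-processor CRCW maximum; returns (max, m)."""
    n = len(A)
    m = [True] * n                  # steps 1-2: every A[i] is a candidate
    for i in range(n):              # steps 3-5: one (i, j) pair per processor
        for j in range(n):
            if A[i] < A[j]:
                m[i] = False        # concurrent writers all store False
    result = None
    for i in range(n):              # steps 6-7: surviving candidates write
        if m[i]:
            result = A[i]
    return result, m
```

For `[3, 9, 2, 9]` the flags are `[False, True, False, True]`: both copies of the maximum survive, and they write the same value in step 7, so the Common CRCW rule is enough.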
Concurrent Write Finding Max
Concurrent-write
In steps 4 and 5, processors with A[i] < A[j] write the same value false
into the same location m[i]
This actually implements m[i] = (A[i] >= A[1]) ∧ ... ∧ (A[i] >= A[n])
Is this work efficient?
No, n^2 processors in O(1) time
O(n^2) work vs. O(n) for the sequential algorithm
What is the time complexity for the Exclusive-write?
Initially elements think that they might be the maximum
First iteration: For n/2 pairs, compare.
n/2 elements might be the maximum.
Second iteration: n/4 elements might be the maximum.
(log n)-th iteration: one element is the maximum.
So Fast_max with Exclusive-write takes O(log n).
O(1) (CRCW) vs. O(log n) (EREW)
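The exclusive-write variant described above is a pairwise tournament. A sketch under the assumption that an unpaired element simply advances to the next round (the slides only describe the power-of-two case); `tournament_max` is a hypothetical name.

```python
def tournament_max(A):
    """Halve the candidate set each round; returns (max, rounds)."""
    cand = list(A)
    rounds = 0
    while len(cand) > 1:
        # One parallel step: each pair is compared by its own processor
        # and the winner is written to a distinct cell.
        nxt = [max(cand[i], cand[i + 1]) for i in range(0, len(cand) - 1, 2)]
        if len(cand) % 2:
            nxt.append(cand[-1])    # unpaired element advances unchanged
        cand = nxt
        rounds += 1
    return cand[0], rounds
```

With 8 elements this takes 3 rounds and n - 1 = 7 comparisons in total, so unlike Fast_max it does only O(n) work.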
Simulating CRCW with EREW
CRCW algorithms are faster than EREW algorithms
How much faster?
Theorem
A p-processor CRCW algorithm can be no more than O(log p) times
faster than the best p-processor EREW algorithm
Proof by simulating CRCW steps with EREW steps
Assumption: A parallel sorting takes O(log n) time with n processors
When CRCW processor p_i writes a datum x_i into a location l_i, EREW processor p_i
writes the pair (l_i, x_i) into a separate location A[i]
Note EREW write is exclusive, while CRCW may be concurrent
Sort A by l_i
O(log p) time by assumption
Compare adjacent elements in A
For each group of elements with the same location, only one processor, say the
first, writes x_i into the global memory location l_i.
Note this is also exclusive.
Total time complexity: O(log p)
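The simulation of one concurrent-write step can be sketched as follows. Python's built-in `sorted` stands in for the assumed O(log p) parallel sort, and ties are broken by processor index, which is a choice made here and has the side effect that the lowest-indexed writer wins, as in Priority CRCW.

```python
def simulate_concurrent_write(memory, writes):
    """Apply one CRCW write step using only exclusive writes.

    writes[i] = (l_i, x_i): processor i wants to store x_i at location l_i.
    """
    A = list(writes)                      # processor i writes its pair to A[i]
    # Sort processor indices by target location (parallel sort on a PRAM).
    order = sorted(range(len(A)), key=lambda i: (A[i][0], i))
    for pos, i in enumerate(order):       # one processor per sorted position
        loc, val = A[i]
        # Compare with the left neighbour: only the first of each group writes.
        if pos == 0 or A[order[pos - 1]][0] != loc:
            memory[loc] = val
    return memory
```

For example, with four writers `[(2, "a"), (0, "b"), (2, "c"), (0, "d")]` on a 4-cell memory of zeros, the result is `["b", 0, "a", 0]`: each contested location receives exactly one exclusive write.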
Simulating CRCW with EREW (2)
CRCW versus EREW - Discussion
CRCW
Hardware implementations are expensive
Used infrequently
Easier to program, runs faster, more powerful.
Implemented hardware is slower than that of EREW
In reality one cannot find maximum in O(1) time
EREW
Programming model is too restrictive
Cannot implement powerful algorithms
