
Futures, Scheduling, and Work

Distribution

Companion slides for


The Art of Multiprocessor
Programming
by Maurice Herlihy & Nir Shavit
How to write Parallel Apps?
• How to
– split a program into parallel parts
– in an effective way
– with sensible thread management

Art of Multiprocessor Programming © Herlihy–Shavit 2007
Matrix Multiplication

C = A · B
Matrix Multiplication

c_ij = Σ_{k=0}^{N-1} a_ik · b_kj
Matrix Multiplication
class Worker extends Thread {
  int row, col;
  Worker(int row, int col) {
    this.row = row; this.col = col;
  }
  public void run() {
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}
(The matrices a, b, c and the dimension n are shared fields, elided on the slide.)
Matrix Multiplication
class Worker extends Thread {   // each Worker is a thread
  int row, col;
  Worker(int row, int col) {
    this.row = row; this.col = col;
  }
  public void run() {
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}
Matrix Multiplication
class Worker extends Thread {
  int row, col;                 // which matrix entry to compute
  Worker(int row, int col) {
    this.row = row; this.col = col;
  }
  public void run() {
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}
Matrix Multiplication
class Worker extends Thread {
  int row, col;
  Worker(int row, int col) {
    this.row = row; this.col = col;
  }
  public void run() {           // actual computation
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}
Matrix Multiplication
void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row,col);
  for (int row …)
    for (int col …)
      worker[row][col].start();
  for (int row …)
    for (int col …)
      worker[row][col].join();
}
Matrix Multiplication
void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)                      // create n×n threads
    for (int col …)
      worker[row][col] = new Worker(row,col);
  for (int row …)
    for (int col …)
      worker[row][col].start();
  for (int row …)
    for (int col …)
      worker[row][col].join();
}
Matrix Multiplication
void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row,col);
  for (int row …)                      // start them
    for (int col …)
      worker[row][col].start();
  for (int row …)
    for (int col …)
      worker[row][col].join();
}
Matrix Multiplication
void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row,col);
  for (int row …)                      // start them
    for (int col …)
      worker[row][col].start();
  for (int row …)                      // wait for them to finish
    for (int col …)
      worker[row][col].join();
}
Matrix Multiplication
void multiply() {
  Worker[][] worker = new Worker[n][n];
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row,col);
  for (int row …)
    for (int col …)
      worker[row][col].start();
  for (int row …)
    for (int col …)
      worker[row][col].join();
}
What’s wrong with this picture?
Thread Overhead
• Threads require resources
– Memory for stacks
– Setup, teardown
• Scheduler overhead
• Worse for short-lived threads
Thread Pools
• More sensible to keep a pool of long-lived threads
• Threads assigned short-lived tasks
– Run the task
– Rejoin pool
– Wait for next assignment
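The pool idea can be sketched with java.util.concurrent. Everything here beyond the pool itself — the chunked-sum task, the class and method names — is illustrative, not from the slides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class PoolDemo {
    // Sum an array with one short-lived task per chunk, all executed by a
    // small pool of long-lived threads instead of one thread per chunk.
    static long parallelSum(int[] a, int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        int chunk = (a.length + nThreads - 1) / nThreads;
        List<Future<Long>> parts = new ArrayList<>();
        for (int start = 0; start < a.length; start += chunk) {
            final int lo = start, hi = Math.min(start + chunk, a.length);
            parts.add(pool.submit(() -> {              // task: sum one chunk
                long s = 0;
                for (int i = lo; i < hi; i++) s += a[i];
                return s;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get(); // wait for each task
        pool.shutdown();                               // threads rejoin, pool retires
        return total;
    }
}
```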
Thread Pool = Abstraction
• Insulate programmer from platform
– Big machine, big pool
– And vice-versa
• Portable code
– Runs well on any platform
– No need to mix algorithm/platform concerns
ExecutorService Interface
• In java.util.concurrent
– Task = Runnable object
• If no result value expected
• Calls run() method.
– Task = Callable<T> object
• If result value of type T expected
• Calls T call() method.

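The two task flavors can be sketched side by side; the 6 * 7 task and the class and method names are illustrative:

```java
import java.util.concurrent.*;

public class SubmitDemo {
    // Callable<T>: a result is expected, so submit returns Future<T> and
    // get() yields the value.  Runnable: no result, so get() on its
    // Future<?> just waits for completion and returns null.
    static int demo() throws Exception {
        ExecutorService exec = Executors.newCachedThreadPool();
        Callable<Integer> task = () -> 6 * 7;   // T call() returns a value
        Future<Integer> future = exec.submit(task);
        int value = future.get();               // blocks until available
        Runnable chore = () -> { /* side effects only */ };
        Future<?> done = exec.submit(chore);    // run() returns nothing
        done.get();                             // blocks until complete
        exec.shutdown();
        return value;
    }
}
```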
Future<T>
Callable<T> task = …;

Future<T> future = executor.submit(task);

T value = future.get();

Future<T>
Callable<T> task = …;

Future<T> future = executor.submit(task);

T value = future.get();

Submitting a Callable<T> task


returns a Future<T> object

Future<T>
Callable<T> task = …;

Future<T> future = executor.submit(task);

T value = future.get();

The Future’s get() method blocks


until the value is available

Future<?>
Runnable task = …;

Future<?> future = executor.submit(task);

future.get();

Future<?>
Runnable task = …;

Future<?> future = executor.submit(task);

future.get();

Submitting a Runnable task


returns a Future<?> object

Future<?>
Runnable task = …;

Future<?> future = executor.submit(task);

future.get();

The Future’s get() method blocks


until the computation is complete

Note
• ExecutorService submissions
– Like New England traffic signs
– Are purely advisory in nature
• The executor
– Like the New England driver
– Is free to ignore any such advice
– And could execute tasks sequentially …
Matrix Addition

[ C00 C01 ]   [ A00+B00  A01+B01 ]
[ C10 C11 ] = [ A10+B10  A11+B11 ]
Matrix Addition

[ C00 C01 ]   [ A00+B00  A01+B01 ]
[ C10 C11 ] = [ A10+B10  A11+B11 ]

4 parallel additions
Matrix Addition Task
class AddTask implements Runnable {
  Matrix a, b; // multiply this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case
    } else {
      (partition a, b into half-size matrices aij and bij)
      Future<?> f00 = exec.submit(add(a00,b00));

      Future<?> f11 = exec.submit(add(a11,b11));
      f00.get(); …; f11.get();
    }
  }
}
Matrix Addition Task
class AddTask implements Runnable {
  Matrix a, b; // multiply this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case: add directly
    } else {
      (partition a, b into half-size matrices aij and bij)
      Future<?> f00 = exec.submit(add(a00,b00));

      Future<?> f11 = exec.submit(add(a11,b11));
      f00.get(); …; f11.get();
    }
  }
}
Matrix Addition Task
class AddTask implements Runnable {
  Matrix a, b; // multiply this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case
    } else {
      (partition a, b into half-size matrices aij and bij)   // constant-time operation
      Future<?> f00 = exec.submit(add(a00,b00));

      Future<?> f11 = exec.submit(add(a11,b11));
      f00.get(); …; f11.get();
    }
  }
}
Matrix Addition Task
class AddTask implements Runnable {
  Matrix a, b; // multiply this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case
    } else {
      (partition a, b into half-size matrices aij and bij)
      Future<?> f00 = exec.submit(add(a00,b00));   // submit 4 tasks

      Future<?> f11 = exec.submit(add(a11,b11));
      f00.get(); …; f11.get();
    }
  }
}
Matrix Addition Task
class AddTask implements Runnable {
  Matrix a, b; // multiply this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case
    } else {
      (partition a, b into half-size matrices aij and bij)
      Future<?> f00 = exec.submit(add(a00,b00));

      Future<?> f11 = exec.submit(add(a11,b11));
      f00.get(); …; f11.get();     // let them finish
    }
  }
}
Dependencies
• Matrix example is not typical
• Tasks are independent
– Don’t need results of one task …
– To complete another
• Often tasks are not independent

Fibonacci

F(n) = 1                      if n = 0 or 1
F(n) = F(n-1) + F(n-2)        otherwise

• Note
– Potential parallelism
– Dependencies
Disclaimer
• This Fibonacci implementation is
– Egregiously inefficient
• So don’t deploy it!
– But illustrates our point
• How to deal with dependencies
• Exercise:
– Make this implementation efficient!

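For the exercise, one standard sequential fix is to iterate rather than recompute subproblems. This is just a baseline sketch (class and method names are mine), keeping the slides' convention that the base cases return 1:

```java
public class Fib {
    // O(n) time, O(1) space: keep only the last two values.
    static int fib(int n) {
        int a = 1, b = 1;                 // F(1), F(2)
        for (int i = 3; i <= n; i++) {
            int next = a + b;
            a = b;
            b = next;
        }
        return b;
    }
}
```

A parallel version with memoization is the real exercise; this only removes the exponential recomputation.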
Multithreaded Fibonacci
class FibTask implements Callable<Integer> {
  static ExecutorService exec =
    Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) {
    arg = n;
  }
  public Integer call() {
    if (arg > 2) {
      Future<Integer> left = exec.submit(new FibTask(arg-1));
      Future<Integer> right = exec.submit(new FibTask(arg-2));
      return left.get() + right.get();
    } else {
      return 1;
    }
  }
}
Multithreaded Fibonacci
class FibTask implements Callable<Integer> {
  static ExecutorService exec =
    Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) {
    arg = n;
  }
  public Integer call() {
    if (arg > 2) {
      Future<Integer> left = exec.submit(new FibTask(arg-1));   // parallel calls
      Future<Integer> right = exec.submit(new FibTask(arg-2));
      return left.get() + right.get();
    } else {
      return 1;
    }
  }
}
Multithreaded Fibonacci
class FibTask implements Callable<Integer> {
  static ExecutorService exec =
    Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) {
    arg = n;
  }
  public Integer call() {
    if (arg > 2) {
      Future<Integer> left = exec.submit(new FibTask(arg-1));
      Future<Integer> right = exec.submit(new FibTask(arg-2));
      return left.get() + right.get();   // pick up & combine results
    } else {
      return 1;
    }
  }
}
Dynamic Behavior
• Multithreaded program is
– A directed acyclic graph (DAG)
– That unfolds dynamically
• Each node is
– A single unit of work

Fib DAG

[Diagram: the DAG for fib(4). fib(4) submits fib(3) and fib(2); fib(3) submits fib(2) and fib(1); each fib(2) submits two fib(1) calls. Edges are labeled submit (downward) and get (upward).]
Arrows Reflect Dependencies

[Same fib(4) DAG: the submit/get arrows are the dependencies.]
How Parallel is That?
• Define work:
– Total time on one processor
• Define critical-path length:
– Longest dependency path
– Can’t beat that!

Fib Work

[The fib(4) DAG again, with every node counted as one unit of work.]
Fib Work

[The fib(4) DAG with its nodes numbered 1 through 17.]

Work is 17
Fib Critical Path

[The fib(4) DAG.]
Fib Critical Path

[The fib(4) DAG with the nodes on the longest dependency path numbered 1 through 8.]

Critical path length is 8
Notation Watch
• TP = time on P processors
• T1 = work (time on 1 processor)
• T∞ = critical path length (time on ∞ processors)
Simple Bounds
• TP ≥ T1/P
– In one step, can’t do more than P work
• TP ≥ T∞
– Can’t beat infinite resources

More Notation Watch
• Speedup on P processors
– Ratio T1/TP
– How much faster with P processors
• Linear speedup
– T1/TP = Θ(P)
• Max speedup (average parallelism)
– T1/T∞

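The two simple bounds combine into a lower bound on TP, and T1/T∞ caps the achievable speedup. A tiny helper (class and method names are mine) makes the arithmetic concrete:

```java
public class Bounds {
    // T_P >= max(T_1 / P, T_inf): in one step you can't do more than
    // P units of work, and you can't beat the critical path.
    static double tpLowerBound(double t1, double tinf, int p) {
        return Math.max(t1 / p, tinf);
    }

    // Max speedup (average parallelism): T_1 / T_inf.
    static double maxSpeedup(double t1, double tinf) {
        return t1 / tinf;
    }
}
```

For example, with T1 = 100 and T∞ = 10, four processors cannot finish faster than 25 steps, and no number of processors beats 10 steps, so speedup is capped at 10.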
Matrix Addition

[ C00 C01 ]   [ A00+B00  A01+B01 ]
[ C10 C11 ] = [ A10+B10  A11+B11 ]
Matrix Addition

[ C00 C01 ]   [ A00+B00  A01+B01 ]
[ C10 C11 ] = [ A10+B10  A11+B11 ]

4 parallel additions
Addition
• Let AP(n) be running time
– For n x n matrix
– on P processors
• For example
– A1(n) is work
– A∞(n) is critical path length

Addition
• Work is

A1(n) = 4 A1(n/2) + Θ(1)

(4 spawned additions, plus Θ(1) to partition, synch, etc.)
Addition
• Work is

A1(n) = 4 A1(n/2) + Θ(1)
      = Θ(n²)

Same as double-loop summation
Addition
• Critical path length is

A∞(n) = A∞(n/2) + Θ(1)

(the spawned additions run in parallel, plus Θ(1) to partition, synch, etc.)
Addition
• Critical path length is

A∞(n) = A∞(n/2) + Θ(1)
      = Θ(log n)
Matrix Multiplication Redux

C = A · B
Matrix Multiplication Redux

[ C11 C12 ]   [ A11 A12 ]   [ B11 B12 ]
[ C21 C22 ] = [ A21 A22 ] · [ B21 B22 ]
First Phase …

[ C11 C12 ]   [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
[ C21 C22 ] = [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

8 multiplications
Second Phase …

[ C11 C12 ]   [ A11·B11 + A12·B21   A11·B12 + A12·B22 ]
[ C21 C22 ] = [ A21·B11 + A22·B21   A21·B12 + A22·B22 ]

4 additions
Multiplication
• Work is

M1(n) = 8 M1(n/2) + A1(n)

(8 parallel multiplications, plus the final addition A1(n))
Multiplication
• Work is

M1(n) = 8 M1(n/2) + Θ(n²)
      = Θ(n³)

Same as serial triple-nested loop
Multiplication
• Critical path length is

M∞(n) = M∞(n/2) + A∞(n)

(half-size multiplications in parallel, then the final addition)
Multiplication
• Critical path length is

M∞(n) = M∞(n/2) + A∞(n)
      = M∞(n/2) + Θ(log n)
      = Θ(log² n)
Parallelism
• M1(n)/M∞(n) = Θ(n³/log² n)
• To multiply two 1000 × 1000 matrices
– 1000³/10² = 10⁷
• Much more than the number of processors on any real machine
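The 10⁷ figure follows directly from the work/critical-path model; a quick check (class and method names are mine):

```java
public class MatMulParallelism {
    // Parallelism of divide-and-conquer multiply:
    // work Theta(n^3) over critical path Theta((log2 n)^2).
    static double parallelism(int n) {
        double log2n = Math.log(n) / Math.log(2);
        return (double) n * n * n / (log2n * log2n);
    }
}
```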
Shared-Memory
Multiprocessors
• Parallel applications
– Do not have direct access to HW
processors
• Mix of other jobs
– All run together
– Come & go dynamically

Ideal Scheduling Hierarchy

Tasks

User-level scheduler

Processors

Realistic Scheduling Hierarchy

Tasks

User-level scheduler

Threads

Kernel-level scheduler

Processors

For Example
• Initially,
– All P processors available for application
• Serial computation
– Takes over one processor
– Leaving P-1 for us
– Waits for I/O
– We get that processor back ….

Speedup
• Map threads onto P processors
• Cannot get P-fold speedup
– What if the kernel doesn’t cooperate?
• Can try for speedup proportional to
– the time-averaged number of processors the kernel gives us
Scheduling Hierarchy
• User-level scheduler
– Tells kernel which threads are ready
• Kernel-level scheduler
– Synchronous (for analysis, not correctness!)
– Picks pi threads to schedule at step i
– Processor average over T steps is:

P_A = (1/T) · Σ_{i=1..T} p_i
Greed is Good
• Greedy scheduler
– Schedules as much as it can
– At each time step
• Optimal schedule is greedy (why?)
• But not every greedy schedule is
optimal

Theorem
• Greedy scheduler ensures that

T ≤ T1/PA + T∞(P-1)/PA

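The bound can be sanity-checked by simulating a greedy schedule on a small DAG. Here PA = P, since every processor is always available; the diamond DAG and all names below are mine, not from the slides:

```java
import java.util.*;

public class GreedySim {
    // Greedy schedule: at each step, run up to p ready nodes. Returns T.
    static int schedule(int n, int[][] edges, int p) {
        int[] pending = new int[n];                  // unexecuted predecessors
        List<List<Integer>> succ = new ArrayList<>();
        for (int i = 0; i < n; i++) succ.add(new ArrayList<>());
        for (int[] e : edges) { succ.get(e[0]).add(e[1]); pending[e[1]]++; }
        Deque<Integer> ready = new ArrayDeque<>();
        for (int i = 0; i < n; i++) if (pending[i] == 0) ready.add(i);
        int done = 0, steps = 0;
        while (done < n) {
            steps++;
            List<Integer> batch = new ArrayList<>();
            for (int k = 0; k < p && !ready.isEmpty(); k++) batch.add(ready.poll());
            for (int v : batch) {
                done++;
                for (int w : succ.get(v)) if (--pending[w] == 0) ready.add(w);
            }
        }
        return steps;
    }

    // Critical path T_inf in nodes, assuming every edge goes from a
    // lower to a higher node index.
    static int criticalPath(int n, int[][] edges) {
        int[] len = new int[n];
        Arrays.fill(len, 1);
        List<List<Integer>> succ = new ArrayList<>();
        for (int i = 0; i < n; i++) succ.add(new ArrayList<>());
        for (int[] e : edges) succ.get(e[0]).add(e[1]);
        int best = 1;
        for (int v = 0; v < n; v++)
            for (int w : succ.get(v)) {
                len[w] = Math.max(len[w], len[v] + 1);
                best = Math.max(best, len[w]);
            }
        return best;
    }
}
```

For the diamond DAG 0→{1,2}→3 with P = 2: T1 = 4, T∞ = 3, the greedy schedule takes 3 steps, and the theorem's bound gives 4/2 + 3·(2−1)/2 = 3.5.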
Deconstructing

T ≤ T1/PA + T∞(P-1)/PA

Deconstructing

T ≤ T1/PA + T∞(P-1)/PA

T is the actual time
Deconstructing

T ≤ T1/PA + T∞(P-1)/PA

T1/PA: work divided by processor average
Deconstructing

T ≤ T1/PA + T∞(P-1)/PA

T∞(P-1)/PA: cannot do better than critical path length
Deconstructing

T ≤ T1/PA + T∞(P-1)/PA

The higher the average PA, the better it is …
Proof Strategy

P_A = (1/T) · Σ_{i=1..T} p_i

so

T = (1/P_A) · Σ_{i=1..T} p_i      ← bound this!
Put Tokens in Buckets
• work bucket: processor found work
• idle bucket: processor available but couldn’t find work
At the end ….

Total #tokens (work + idle) = Σ_{i=1..T} p_i
At the end ….

The work bucket holds T1 tokens
Must Show

The idle bucket holds ≤ T∞(P-1) tokens
Idle Steps
• An idle step is one where there is at
least one idle processor
• Only time idle tokens are generated
• Focus on idle steps

Every Move You Make …
• Scheduler is greedy
• At least one node ready
• Number of idle threads in one idle step
– At most p_i − 1 ≤ P − 1
• How many idle steps?
Unexecuted sub-DAG

[The fib(4) DAG, with submit/get edges; the not-yet-executed nodes form a sub-DAG.]
Unexecuted sub-DAG

[The fib(4) DAG with the longest path in the unexecuted sub-DAG highlighted.]
Unexecuted sub-DAG

[The fib(4) DAG; the last node on the highlighted path is ready to execute.]
Every Step You Take …
• Consider longest path in unexecuted sub-DAG at step i
• At least one node in path ready
• Length of path shrinks by at least one at each step
• Initially, path is T∞
• So there are at most T∞ idle steps
Counting Tokens
• At most P-1 idle threads per step
• At most T∞ steps
• So idle bucket contains at most
– T∞(P-1) tokens
• Both buckets contain
– T1 + T∞(P-1) tokens

Recapitulating

T = (1/P_A) · Σ_{i=1..T} p_i

Σ_{i=1..T} p_i ≤ T1 + T∞(P − 1)

so

T ≤ (T1 + T∞(P − 1)) / P_A
Turns Out
• This bound is within a factor of 2 of optimal
• Computing the actual optimal schedule is NP-complete
Work Distribution
Work Dealing

The Problem with
Work Dealing
Work Stealing

Lock-Free Work Stealing
• Each thread has a pool of ready work
• Remove work without synchronizing
• If you run out of work, steal someone else’s
• Choose victim at random
Local Work Pools

Each work pool is a Double-Ended Queue

Work DEQueue¹

[Each pool is a queue of work items; pushBottom and popBottom operate at the bottom end.]

¹ Double-Ended Queue
Obtain Work
• Obtain work: call popBottom
• Run thread until it blocks or terminates
New Work
• Unblock node, or
• Spawn node
• Either way: pushBottom
Whatcha Gonna do When the Well Runs Dry?

[The local DEQueue is empty.]
Steal Work from Others
Pick random guy’s DEQueue
Steal this Thread!
popTop

Thread DEQueue
• Methods
– pushBottom, popBottom: never happen concurrently (owner only)
– popTop
Thread DEQueue
• Methods
– pushBottom, popBottom: these are most common — make them fast (minimize use of CAS)
– popTop
Ideal
• Wait-Free
• Linearizable
• Constant time

Fortune Cookie: “It is better to be young,


rich and beautiful, than old, poor, and ugly”

Compromise
• Method popTop may fail if
– a concurrent popTop succeeds, or
– a concurrent popBottom takes the last work

Blame the victim!
Dreaded ABA Problem

[Animation: a thief reads top and prepares a CAS; meanwhile the owner pops tasks and pushes new ones, returning top to the same index it had before; the thief’s CAS then succeeds — Uh-Oh … — even though the task it read is no longer the one at top.]
Fix for Dreaded ABA

[Pair top with a stamp that is incremented on every change: ⟨stamp, top⟩; bottom is unchanged.]
Bounded DEQueue
public class BDEQueue {
  AtomicStampedReference<Integer> top;
  volatile int bottom;
  Runnable[] tasks;

}
Bounded DEQueue
public class BDEQueue {
  AtomicStampedReference<Integer> top;   // index & stamp (synchronized)
  volatile int bottom;
  Runnable[] tasks;

}
Bounded DEQueue
public class BDEQueue {
  AtomicStampedReference<Integer> top;
  volatile int bottom;   // index of bottom task: no need to synchronize,
                         // but the effect of a write must be seen,
                         // so we need a memory barrier (volatile)
  Runnable[] tasks;

}
Bounded DEQueue
public class BDEQueue {
  AtomicStampedReference<Integer> top;
  volatile int bottom;
  Runnable[] tasks;      // array holding tasks

}
pushBottom()
public class BDEQueue {

  void pushBottom(Runnable r) {
    tasks[bottom] = r;
    bottom++;
  }

}
pushBottom()
public class BDEQueue {

  void pushBottom(Runnable r) {
    tasks[bottom] = r;   // bottom is the index to store the new task
    bottom++;
  }

}
pushBottom()
public class BDEQueue {

  void pushBottom(Runnable r) {
    tasks[bottom] = r;
    bottom++;            // adjust the bottom index
  }

}
Steal Work
public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}
Steal Work
public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;   // read top (value & stamp)
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}
Steal Work
public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;   // compute new value & stamp
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}
Steal Work
public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)   // quit if queue is empty
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;
}
Steal Work
public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))   // try to steal the task
    return r;
  return null;
}
Steal Work
public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom <= oldTop)
    return null;
  Runnable r = tasks[oldTop];
  if (top.CAS(oldTop, newTop, oldStamp, newStamp))
    return r;
  return null;   // give up if conflict occurs
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;   // make sure queue is non-empty
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;
  bottom--;                       // prepare to grab bottom task
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;          // read top, & prepare new values
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;   // top & bottom 1 or more apart: no conflict
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;
  if (bottom == oldTop) {          // at most one item left
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;
  if (bottom == oldTop) {
    bottom = 0;                    // try to steal the last item; in any case
                                   // reset bottom, because the DEQueue will be
                                   // empty even if unsuccessful (why?)
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);
  return null;
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;                    // I win the CAS
  }
  top.set(newTop, newStamp);
  return null;
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }                                // I lose the CAS: a thief must have won…
  top.set(newTop, newStamp);
  return null;
}
Take Work
Runnable popBottom() {
  if (bottom == 0) return null;
  bottom--;
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop) return r;
  if (bottom == oldTop) {
    bottom = 0;
    if (top.CAS(oldTop, newTop, oldStamp, newStamp))
      return r;
  }
  top.set(newTop, newStamp);       // failed to get last item: must still reset top
  return null;
}
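Assembling the slide fragments gives one runnable sketch of the bounded DEQueue. It uses the real AtomicStampedReference method name compareAndSet for the slides' CAS shorthand; the constructor and capacity handling are mine, and pushBottom does not check for overflow:

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class BDEQueue {
    // <stamp, top> pair defeats ABA; bottom is owner-only, volatile so
    // thieves see its latest value.
    final AtomicStampedReference<Integer> top =
        new AtomicStampedReference<>(0, 0);
    volatile int bottom = 0;
    final Runnable[] tasks;

    public BDEQueue(int capacity) { tasks = new Runnable[capacity]; }

    void pushBottom(Runnable r) {   // owner only
        tasks[bottom] = r;
        bottom++;
    }

    Runnable popTop() {             // thieves
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = oldTop + 1;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom <= oldTop) return null;          // empty
        Runnable r = tasks[oldTop];
        if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
            return r;
        return null;                                // conflict: give up
    }

    Runnable popBottom() {          // owner only
        if (bottom == 0) return null;
        bottom--;
        Runnable r = tasks[bottom];
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = 0;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom > oldTop) return r;              // no conflict possible
        if (bottom == oldTop) {                     // last item: race a thief
            bottom = 0;
            if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
                return r;
        }
        top.set(newTop, newStamp);                  // lost: still reset top
        return null;
    }
}
```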
Old English Proverb
• “May as well be hanged for stealing a
sheep as a goat”
• From which we conclude
– Stealing was punished severely
– Sheep were worth more than goats

Variations
• Stealing is expensive
– Pay CAS
– Only one thread taken
• What if
– Randomly balance loads?

Work Balancing

Example: queues of sizes 2 and 5 balance to
⌈(2+5)/2⌉ = 4 and ⌊(2+5)/2⌋ = 3.
Work-Balancing Thread

public void run() {
  int me = ThreadID.get();
  while (true) {
    Runnable task = queue[me].deq();             // keep running tasks
    if (task != null) task.run();
    int size = queue[me].size();
    if (random.nextInt(size+1) == size) {        // with probability 1/|queue|
      int victim = random.nextInt(queue.length); // choose a random victim
      int min = …, max = …;                      // lock queues in canonical order
      synchronized (queue[min]) {
        synchronized (queue[max]) {
          balance(queue[min], queue[max]);       // rebalance the two queues
        }}}}}
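The balance call above is left unspecified on the slides. One hypothetical way to fill it in (using java.util.Queue<Runnable> as a stand-in for the slides' queue class; the names Balancer and balance's body are assumptions) moves tasks from the larger queue to the smaller until their sizes differ by at most one:

```java
import java.util.ArrayDeque;  // used in the example below
import java.util.Queue;

class Balancer {
  // Move tasks from the larger queue to the smaller one until the sizes
  // differ by at most one: n items end up split ceil(n/2) / floor(n/2).
  static void balance(Queue<Runnable> q0, Queue<Runnable> q1) {
    Queue<Runnable> qMin = (q0.size() < q1.size()) ? q0 : q1;
    Queue<Runnable> qMax = (qMin == q0) ? q1 : q0;
    int diff = (qMax.size() - qMin.size()) / 2;
    while (diff-- > 0)
      qMin.add(qMax.remove());
  }
}
```

With sizes 2 and 5, one task moves, leaving sizes 3 and 4 (the ⌊7/2⌋ and ⌈7/2⌉ of the earlier example).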
Work Stealing & Balancing

• Clean separation between app & scheduling layer
• Works well when the number of processors fluctuates
• Works on “black-box” operating systems
TOM MARVOLO RIDDLE

(anagram: “I am Lord Voldemort”)
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.

• You are free:
  – to Share — to copy, distribute and transmit the work
  – to Remix — to adapt the work
• Under the following conditions:
  – Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
  – Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.