
Compiler techniques for exposing ILP

Instruction Level Parallelism


Potential overlap among instructions
Few possibilities in a basic block
    Blocks are small (6-7 instructions)
    Instructions are dependent

Goal: Exploit ILP across multiple basic blocks


Iterations of a loop
for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;

Basic Scheduling
for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;
Sequential MIPS Assembly Code
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Pipelined execution:
                              Cycle
Loop: LD    F0, 0(R1)         1
      stall                   2
      ADDD  F4, F0, F2        3
      stall                   4
      stall                   5
      SD    0(R1), F4         6
      SUBI  R1, R1, #8        7
      stall                   8
      BNEZ  R1, Loop          9
      stall                   10

Scheduled pipelined execution:
                              Cycle
Loop: LD    F0, 0(R1)         1
      SUBI  R1, R1, #8        2
      ADDD  F4, F0, F2        3
      stall                   4
      BNEZ  R1, Loop          5
      SD    8(R1), F4         6

Loop Unrolling
Pros:
    Larger basic block
    More scope for scheduling and eliminating dependencies
Cons:
    Increases code size
Comment: Often a precursor step for other optimizations

Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:
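The same unrolling can be sketched at the source level. This is a minimal illustration, not the compiler's actual output: the function names are hypothetical, and x is assumed to be indexed 1..n as in the slide's loop. The unrolled version keeps every decrement and exit test, mirroring the BEQZ ... Exit form, so it is valid for any n.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop from the slide: x[i] = x[i] + s for i = n down to 1. */
void add_scalar_rolled(double *x, size_t n, double s)
{
    assert(x != NULL);
    for (size_t i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Body replicated four times. Each copy keeps its own decrement and
   exit test (the BEQZ ... Exit branches), so any trip count works. */
void add_scalar_unrolled_naive(double *x, size_t n, double s)
{
    assert(x != NULL);
    size_t i = n;
    while (i > 0) {
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1;
    }
}
```

Written this way the basic block is not yet larger, because every copy still ends in a branch; the gain comes once the intermediate tests are removed.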

Loop Transformations
Instruction independence is the key requirement for the transformations
Example:
    Determine that it is legal to move SD after SUBI and BNEZ
    Determine that unrolling is useful (iterations are independent)
    Use different registers to avoid unnecessary constraints
    Eliminate extra tests and branches
    Determine that LD and SD can be interchanged
    Schedule the code, preserving the semantics of the code

1. Eliminating Name Dependences


Before:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F0, -8(R1)
      ADDD  F4, F0, F2
      SD    -8(R1), F4
      LD    F0, -16(R1)
      ADDD  F4, F0, F2
      SD    -16(R1), F4
      LD    F0, -24(R1)
      ADDD  F4, F0, F2
      SD    -24(R1), F4
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

After register renaming:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

2. Eliminating Control Dependences


Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:

Intermediate BEQZ are never taken: eliminate them!

3. Eliminating Data Dependences


Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Data dependencies among SUBI, LD, and SD force sequential execution of iterations

Compiler removes this dependency by:
    Computing intermediate R1 values
    Eliminating intermediate SUBI
    Changing final SUBI
Data flow analysis
    Can do on registers
    Cannot do easily on memory locations, e.g. does 100(R1) = 20(R2)?
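The three compiler steps listed above can be sketched in C with a pointer standing in for R1. This is an illustrative sketch under stated assumptions: the function names are hypothetical, x is indexed 1..n, and n is assumed to be a multiple of 4 so that the intermediate exit tests are already gone.

```c
#include <assert.h>
#include <stddef.h>

/* Before: every copy of the body decrements the pointer (one SUBI each),
   so each load and store depends on the decrement just before it. */
void add_scalar_stepwise(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    double *p = x + n;                 /* plays the role of R1 */
    while (p != x) {
        p[0] = p[0] + s; p = p - 1;
        p[0] = p[0] + s; p = p - 1;
        p[0] = p[0] + s; p = p - 1;
        p[0] = p[0] + s; p = p - 1;
    }
}

/* After: the intermediate pointer values are folded into constant
   offsets (0, -1, -2, -3 elements, i.e. 0, -8, -16, -24 bytes) and a
   single decrement by 4 elements (32 bytes) remains. */
void add_scalar_folded(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    double *p = x + n;
    while (p != x) {
        p[0]  = p[0]  + s;
        p[-1] = p[-1] + s;
        p[-2] = p[-2] + s;
        p[-3] = p[-3] + s;
        p = p - 4;
    }
}
```

In the folded version the four statements no longer depend on one another through the pointer, which is what frees them for reordering.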

4. Alleviating Data Dependencies


Unrolled loop:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

Scheduled Unrolled loop:


Loop: LD    F0, 0(R1)
      LD    F6, -8(R1)
      LD    F10, -16(R1)
      LD    F14, -24(R1)
      ADDD  F4, F0, F2
      ADDD  F8, F6, F2
      ADDD  F12, F10, F2
      ADDD  F16, F14, F2
      SD    0(R1), F4
      SD    -8(R1), F8
      SUBI  R1, R1, #32
      SD    16(R1), F12
      BNEZ  R1, Loop
      SD    8(R1), F16
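A source-level analogue of the scheduled unrolled loop is sketched below. It is only an illustration: the function name is hypothetical, the local variables are named after the F registers purely for readability, and n is assumed to be a multiple of 4. A real compiler is free to reorder these statements again.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled by four with distinct temporaries (the renamed registers),
   grouped as all loads, then all adds, then all stores, so that no
   add immediately follows the load it depends on. */
void add_scalar_scheduled(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i = i - 4) {
        double f0  = x[i];             /* four independent loads */
        double f6  = x[i - 1];
        double f10 = x[i - 2];
        double f14 = x[i - 3];

        double f4  = f0  + s;          /* four independent adds */
        double f8  = f6  + s;
        double f12 = f10 + s;
        double f16 = f14 + s;

        x[i]     = f4;                 /* four stores */
        x[i - 1] = f8;
        x[i - 2] = f12;
        x[i - 3] = f16;
    }
}
```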

Some General Comments


Dependences are a property of programs
Actual hazards are a property of the pipeline
Techniques to avoid dependence limitations
    Maintain dependences but avoid hazards
        Code scheduling
            hardware
            software
    Eliminate dependences by code transformations
        Complex
        Compiler-based

Loop-level Parallelism
Primary focus of dependence analysis
Determine all dependences and find cycles

for (i=1; i<=100; i=i+1) {
    x[i] = y[i] + z[i];
    w[i] = x[i] + v[i];
}

for (i=1; i<=100; i=i+1) {
    x[i+1] = x[i] + z[i];
}

for (i=1; i<=100; i=i+1) {
    x[i] = x[i] + y[i];
    y[i+1] = w[i] + z[i];
}

The last loop, transformed so that the dependence on y is no longer loop-carried:

x[1] = x[1] + y[1];
for (i=1; i<=99; i=i+1) {
    y[i+1] = w[i] + z[i];
    x[i+1] = x[i+1] + y[i+1];
}
y[101] = w[100] + z[100];
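One way to convince yourself that the restructured loop computes the same values is to run both forms on the same data. The sketch below does that; the function names and the macro N are hypothetical, and the arrays are assumed to hold at least N + 2 elements so that y[N + 1] is a valid location.

```c
#include <assert.h>
#include <stddef.h>

#define N 100

/* Original form: y[i+1] written in iteration i is read in iteration
   i+1, so the dependence is carried across iterations. */
void loop_original(double *x, double *y, const double *w, const double *z)
{
    assert(x != NULL && y != NULL && w != NULL && z != NULL);
    for (int i = 1; i <= N; i = i + 1) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Transformed form: the first x update and the last y update are
   peeled off, and inside the loop y[i+1] is produced and consumed
   within the same iteration. */
void loop_transformed(double *x, double *y, const double *w, const double *z)
{
    assert(x != NULL && y != NULL && w != NULL && z != NULL);
    x[1] = x[1] + y[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[N + 1] = w[N] + z[N];
}
```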

Dependence Analysis Algorithms


Assume array indexes are affine (ai + b)
GCD test:
    For two affine array indexes ai+b and ci+d: if a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
    Example: x[8*i] = x[4*i + 2] + 3
        a = 8, b = 0, c = 4, d = 2
        (2-0)/GCD(8,4) = 2/4, which is not an integer, so no dependence is possible

General graph cycle determination is NP
a, b, c, and d may not be known at compile time
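The GCD test is short enough to write out directly. The following is a minimal sketch with hypothetical function names; it assumes a, b, c, and d are known constants and that at least one of a and c is nonzero. It ignores loop bounds, so a result of 1 only means a dependence may exist.

```c
#include <assert.h>
#include <stdlib.h>

/* Greatest common divisor by Euclid's algorithm on absolute values. */
static long gcd_long(long p, long q)
{
    p = labs(p);
    q = labs(q);
    while (q != 0) {
        long t = p % q;
        p = q;
        q = t;
    }
    return p;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d].
   Returns 0 when GCD(c, a) does not divide (d - b), which rules out
   a dependence; returns 1 when a dependence cannot be ruled out. */
int gcd_test_may_depend(long a, long b, long c, long d)
{
    assert(a != 0 || c != 0);
    long g = gcd_long(c, a);
    return (d - b) % g == 0;
}
```

For the slide's example, gcd_test_may_depend(8, 0, 4, 2) returns 0, since GCD(8,4) = 4 does not divide 2.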

Software Pipelining
[Figure: iterations 0 to 3 of the original loop overlapped in time, showing the start-up code, the software-pipelined iteration, and the finish-up code]

Example
Iteration i          Iteration i+1        Iteration i+2
LD   F0, 0(R1)       LD   F0, 0(R1)       LD   F0, 0(R1)
ADDD F4, F0, F2      ADDD F4, F0, F2      ADDD F4, F0, F2
SD   0(R1), F4       SD   0(R1), F4       SD   0(R1), F4

Original loop:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Software-pipelined loop:
Loop: SD    16(R1), F4
      ADDD  F4, F0, F2
      LD    F0, 0(R1)
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
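The software-pipelined loop also needs start-up and finish-up code, which the assembly listing omits. A C sketch of the whole thing is given below. It is an illustration under assumptions: the function name is hypothetical, x is indexed 1..n, and n is assumed to be at least 2. The x[i + 2] store corresponds to the 16(R1) offset, that is, two 8-byte elements ahead of the current load.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined x[i] = x[i] + s. Each steady-state iteration
   stores element i+2, adds for element i+1, and loads element i. */
void add_scalar_swp(double *x, size_t n, double s)
{
    assert(n >= 2);

    /* Start-up: fill the pipeline. */
    double f0 = x[n];          /* load for element n */
    double f4 = f0 + s;        /* add for element n */
    f0 = x[n - 1];             /* load for element n-1 */

    /* Steady state: one store, one add, one load per iteration,
       each belonging to a different original iteration. */
    for (size_t i = n - 2; i > 0; i = i - 1) {
        x[i + 2] = f4;         /* SD   16(R1), F4  */
        f4 = f0 + s;           /* ADDD F4, F0, F2  */
        f0 = x[i];             /* LD   F0, 0(R1)   */
    }

    /* Finish-up: drain the pipeline. */
    x[2] = f4;                 /* store for element 2 */
    x[1] = f0 + s;             /* add and store for element 1 */
}
```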

Trace (global-code) Scheduling


Find ILP across conditional branches
Two-step process
    Trace selection
        Find a trace (sequence of basic blocks)
        Use loop unrolling to generate long traces
        Use static branch prediction for other conditional branches
    Trace compaction
        Squeeze the trace into a small number of wide instructions
        Preserve data and control dependences

Trace Selection
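[Flowchart: A[I] = A[I] + B[I], then the test A[I] = 0?. One path executes B[I] = ..., the false (F) path executes X, and both join at C[I] = ...]

      LW    R4, 0(R1)
      LW    R5, 0(R2)
      ADD   R4, R4, R5
      SW    0(R1), R4
      BNEZ  R4, else
      ....
      SW    0(R2), . . .
      J     join
Else: ....
      X
Join: ....
      SW    0(R3), . . .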

Summary of Compiler Techniques


Try to avoid dependence stalls
Loop unrolling
    Reduce loop overhead
Software pipelining
    Reduce single body dependence stalls
Trace scheduling
    Reduce impact of other branches
Compilers use a mix of the three
All techniques depend on prediction accuracy

Food for thought: Analyze this


Analyze this for different values of X and Y
    To evaluate different branch prediction schemes
    For compiler scheduling purposes

      add   r1, r0, 1000     # all numbers in decimal
      add   r2, r0, a        # Base address of array a
loop:
      andi  r10, r1, X
      beqz  r10, even
      lw    r11, 0(r2)
      addi  r11, r11, 1
      sw    0(r2), r11
even:
      addi  r2, r2, 4
      subi  r1, r1, Y
      bnez  r1, loop
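One way to start the analysis is to simulate only the control flow and count how often the beqz branch is taken for a given X and Y. The sketch below does this; the struct, the function name, and the iteration cap are all hypothetical additions. The cap exists because some values of Y never bring r1 to exactly zero. The array accesses are left out, since they do not affect either branch.

```c
#include <assert.h>

struct branch_stats {
    long iterations;   /* times the loop body ran */
    long beqz_taken;   /* times "beqz r10, even" was taken */
};

/* Simulates the loop's two branches for mask X and step Y, starting
   from r1 = 1000. Stops when r1 reaches zero or after max_iterations,
   whichever comes first. */
struct branch_stats simulate_branches(long long x_mask, long long y_step,
                                      long max_iterations)
{
    assert(y_step > 0 && max_iterations > 0);
    struct branch_stats st = {0, 0};
    long long r1 = 1000;
    do {
        if ((r1 & x_mask) == 0)        /* andi r10, r1, X ; beqz r10, even */
            st.beqz_taken = st.beqz_taken + 1;
        r1 = r1 - y_step;              /* subi r1, r1, Y */
        st.iterations = st.iterations + 1;
    } while (r1 != 0 && st.iterations < max_iterations);   /* bnez r1, loop */
    return st;
}
```

Calling it with a few (X, Y) pairs and comparing the taken counts gives a first feel for how predictable the inner branch is in each case.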
