Basic Scheduling
for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;
Sequential MIPS Assembly Code
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Pipelined execution:
                           Cycle
Loop: LD    F0, 0(R1)      1
      stall                2
      ADDD  F4, F0, F2     3
      stall                4
      stall                5
      SD    0(R1), F4      6
      SUBI  R1, R1, #8     7
      stall                8
      BNEZ  R1, Loop       9
      stall                10
Scheduled pipelined execution:
                           Cycle
Loop: LD    F0, 0(R1)      1
      SUBI  R1, R1, #8     2
      ADDD  F4, F0, F2     3
      stall                4
      BNEZ  R1, Loop       5
      SD    8(R1), F4      6
Loop Unrolling
Pros:
Larger basic block
More scope for scheduling and eliminating dependencies
Cons:
Increases code size

Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:
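The same unrolling can be sketched at the source level. The following is a minimal illustration in C, not the slide author's code: the function names are hypothetical, and the guard for n == 0 is an added assumption (the assembly above assumes at least one iteration). The exit test is kept after every copy of the body, mirroring the BEQZ ... Exit instructions.

```c
#include <stddef.h>

/* Rolled loop from the slide: x[i] = x[i] + s for i = n down to 1.
   x must hold at least n + 1 elements because indexing starts at 1. */
void add_scalar_rolled(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Body replicated four times; every copy keeps its own exit test,
   so any iteration count n is handled correctly. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    size_t i = n;
    if (i == 0)
        return; /* assumption: guard added for the empty case */
    for (;;) {
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
        x[i] = x[i] + s; i = i - 1; if (i == 0) break;
    }
}
```

Both functions should leave the array in the same state. The unrolled form only pays off once the intermediate tests and index updates are removed, which requires knowing something about the iteration count.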
Loop Transformations
Instruction independence is the key requirement for these transformations.
Example:
Determine that it is legal to move SD after SUBI and BNEZ
Determine that unrolling is useful (iterations are independent)
Use different registers to avoid unnecessary constraints
Eliminate extra tests and branches
Determine that LD and SD can be interchanged
Schedule the code, preserving the semantics of the code
Register Renaming
      ADDD  F4, F0, F2
      SD    -24(R1), F4
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop
Exit:
Compiler removes this dependency by:
Computing intermediate R1 values
Eliminating intermediate SUBI
Changing final SUBI
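A source-level sketch of the result, assuming the iteration count is a multiple of four so that no clean-up loop is needed. The function name and the temporaries t0 to t3 are illustrative: the temporaries stand in for the renamed registers, the offsets i - 1, i - 2, i - 3 correspond to the precomputed intermediate R1 values (-8, -16, -24), and the single i -= 4 corresponds to the final SUBI R1, R1, #32.

```c
#include <assert.h>
#include <stddef.h>

/* Unrolled by four with one index update per pass.
   x must hold at least n + 1 elements (indexing starts at 1). */
void add_scalar_unrolled4_merged(double *x, size_t n, double s) {
    assert(n % 4 == 0); /* assumption: no clean-up loop is provided */
    for (size_t i = n; i > 0; i -= 4) {
        /* Four independent loads into distinct temporaries. */
        double t0 = x[i];
        double t1 = x[i - 1];
        double t2 = x[i - 2];
        double t3 = x[i - 3];
        /* Four independent adds; none waits on another's result. */
        t0 = t0 + s;
        t1 = t1 + s;
        t2 = t2 + s;
        t3 = t3 + s;
        /* Four stores using offsets from the single index i. */
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }
}
```

Because each element has its own temporary, the loads, adds, and stores can be grouped freely, which is what gives the scheduler room to hide latencies.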
Data flow analysis
Can do on registers
Cannot do easily on memory locations: does 100(R1) = 20(R2)?
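The memory case is hard because whether two references touch the same location depends on run-time register values. A small sketch in C, with hypothetical function names: same_address restates the slide's 100(R1) versus 20(R2) question, and store_then_load shows that a load may be moved above a store only if the two pointers are known to differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when 100(R1) and 20(R2) name the same address,
   that is, when R1 + 100 == R2 + 20. */
bool same_address(uintptr_t r1, uintptr_t r2) {
    return r1 + 100 == r2 + 20;
}

/* Store through p, then load through q. If p == q the load must see
   the stored value, so the two accesses cannot be reordered unless
   the compiler can prove p != q. */
int store_then_load(int *p, int *q) {
    *p = 1;
    return *q + 1;
}
```

Calling store_then_load with two distinct variables and then with the same variable twice gives different results, which is exactly the information a static analysis often lacks.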
Loop-level Parallelism
Primary focus of dependence analysis
Determine all dependences and find cycles

for (i=1; i<=100; i=i+1) {
    x[i] = y[i] + z[i];
    w[i] = x[i] + v[i];
}

for (i=1; i<=100; i=i+1) {
    x[i+1] = x[i] + z[i];
}

for (i=1; i<=100; i=i+1) {
    x[i] = x[i] + y[i];
    y[i+1] = w[i] + z[i];
}
The third loop has a loop-carried dependence on y but no cycle, so it can be rewritten as:

x[1] = x[1] + y[1];
for (i=1; i<=99; i=i+1) {
    y[i+1] = w[i] + z[i];
    x[i+1] = x[i+1] + y[i+1];
}
y[101] = w[100] + z[100];
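One way to gain confidence in such a rewrite is to run both forms on the same data and compare. A minimal sketch, assuming arrays of N + 2 doubles so that indices 1 to N + 1 are valid; the function names and the macro N are illustrative.

```c
#define N 100

/* Original form: x[i] reads y[i], which the previous iteration wrote. */
void loop_original(double *x, double *y, const double *w, const double *z) {
    for (int i = 1; i <= N; i = i + 1) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Transformed form: the y value is produced and consumed inside the
   same iteration, so no dependence crosses iterations. */
void loop_transformed(double *x, double *y, const double *w, const double *z) {
    x[1] = x[1] + y[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[N + 1] = w[N] + z[N];
}
```

The first statement and the last statement are peeled out of the loop, which is why the loop bound drops from 100 to 99.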
General graph cycle determination is NP
a, b, c, and d may not be known at compile time
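The slide does not define a, b, c, and d. Assuming they are the coefficients of affine index expressions, with a store to x[a*i + b] and a load from x[c*i + d], the standard GCD test gives a cheap conservative check when the values are known: a dependence is possible only if gcd(a, c) divides d - b. The sketch below is an illustration under that assumption, with hypothetical function names; it ignores loop bounds, so a result of true means only "may depend".

```c
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor of |m| and |n| (Euclid's algorithm). */
static long gcd_long(long m, long n) {
    m = labs(m);
    n = labs(n);
    while (n != 0) {
        long t = m % n;
        m = n;
        n = t;
    }
    return m;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d].
   Returns false only when no dependence can exist. */
bool gcd_test_may_depend(long a, long b, long c, long d) {
    long g = gcd_long(a, c);
    if (g == 0)
        return b == d; /* both indices are constants */
    return (d - b) % g == 0;
}
```

For example, a store to x[2*i + 3] and a load from x[2*i] gives gcd(2, 2) = 2, which does not divide -3, so the two references can never overlap.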
Software Pipelining
[Figure: iterations 0 to 3 overlapped in the software-pipelined loop, with start-up and finish-up code]
Example
Iteration i:    LD F0, 0(R1)
Iteration i+1:  LD F0, 0(R1)
Iteration i+2:  LD F0, 0(R1)

Loop: LD F0, 0(R1)

Loop: SD
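The idea can be sketched at the source level for the x[i] = x[i] + s loop. This is an illustration, not the slide's code: the function name, the variable names, and the requirement n >= 2 are assumptions. Each pass of the steady-state loop stores the result for element i, adds for element i - 1, and loads element i - 2, so the three operations in one pass belong to three different original iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined x[i] = x[i] + s for i = n down to 1.
   x must hold at least n + 1 elements (indexing starts at 1). */
void add_scalar_swp(double *x, size_t n, double s) {
    assert(n >= 2); /* assumption: start-up and finish-up need two elements */

    /* Start-up: fill the pipeline. */
    double loaded = x[n];       /* load for element n     */
    double summed = loaded + s; /* add for element n      */
    loaded = x[n - 1];          /* load for element n - 1 */

    /* Steady state: store, add, and load for three different elements. */
    for (size_t i = n; i > 2; i--) {
        x[i] = summed;          /* store for element i    */
        summed = loaded + s;    /* add for element i - 1  */
        loaded = x[i - 2];      /* load for element i - 2 */
    }

    /* Finish-up: drain the pipeline. */
    x[2] = summed;
    summed = loaded + s;
    x[1] = summed;
}
```

Within one pass the store, add, and load no longer depend on each other, which is how the technique removes dependence stalls without increasing the size of the loop body.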
Trace compaction
Squeeze the trace into a small number of wide instructions
Preserve data and control dependences
Trace Selection
[Figure: flowchart. A[I] = A[I] + B[I]; test "A[I] = 0?" (F branch labelled); X; C[I] = ; the branches meet at the join]

      LW
      LW
      ADD   R4, R4, R5
      SW    0(R1), R4
      ....
      SW    0(R2), . . .
      J     join
Else: ....
      X
Join: ....
      SW    0(R3), . . .
Software pipelining
Reduce single body dependence stalls
Trace scheduling
Reduce impact of other branches
      add   r1, r0, 1000    # all numbers in decimal
      add   r2, r0, a       # Base address of array a
loop:
      andi  r10, r1, X
      beqz  r10, even
      lw    r11, 0(r2)
      addi  r11, r11, 1
      sw    0(r2), r11
even:
      addi  r2, r2, 4
      subi  r1, r1, Y
      bnez  r1, loop