
Basic Compiler Techniques for Exposing ILP

Instruction-Level Parallelism
- Potential overlap among instructions
- Few possibilities in a basic block
  - Blocks are small (6-7 instructions)
  - Instructions are dependent
- Goal: exploit ILP across multiple basic blocks
  - Iterations of a loop: for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;

Basic Pipeline Scheduling and Loop Unrolling
- Idea: find sequences of unrelated instructions (no hazards) that can be overlapped in the pipeline to exploit ILP
- A dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the latency of the source instruction, to avoid a stall

Basic Scheduling

for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;


Pipelined execution:

Loop: LD   F0, 0(R1)     1
      stall              2
      ADDD F4, F0, F2    3
      stall              4
      stall              5
      SD   0(R1), F4     6
      SUBI R1, R1, #8    7
      stall              8
      BNEZ R1, Loop      9
      stall              10

Sequential MIPS Assembly Code


Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop

Scheduled pipelined execution:

Loop: LD   F0, 0(R1)     1
      SUBI R1, R1, #8    2
      ADDD F4, F0, F2    3
      stall              4
      BNEZ R1, Loop      5
      SD   8(R1), F4     6

Loop Unrolling
- Pros: larger basic block; more scope for scheduling and eliminating dependences
- Cons: increases code size
- Comment: often a precursor step for other optimizations
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, #8
      BNEZ R1, Loop
Exit:
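The same transformation can be sketched in C. This is an illustrative hand-unrolling, not code from the slides; like the assembly above, `add_scalar_unrolled` assumes the element count is a multiple of 4.

```c
#include <stddef.h>

/* Scalar loop: x[i] = x[i] + s, counting down as in the slides. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i--)
        x[i - 1] = x[i - 1] + s;
}

/* Unrolled 4x: one loop-ending branch and one index update per
   four elements, assuming n is a multiple of 4. */
void add_scalar_unrolled(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i -= 4) {
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
        x[i - 4] = x[i - 4] + s;
    }
}
```

The unrolled body gives the scheduler four independent load/add/store chains to interleave.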

Loop Transformations
- Instruction independence is the key requirement for the transformations
- Example:
  - Determine that it is legal to move SD after SUBI and BNEZ
  - Determine that unrolling is useful (iterations are independent)
  - Use different registers to avoid unnecessary constraints
  - Eliminate extra tests and branches
  - Determine that LD and SD can be interchanged
  - Schedule the code, preserving the semantics of the code

Eliminating Name Dependences

Unrolled loop reusing the same registers (name dependences on F0 and F4):

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, #32
      BNEZ R1, Loop

After register renaming (each iteration gets its own registers):

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, #32
      BNEZ R1, Loop

Eliminating Control Dependences

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, #8
      BEQZ R1, Exit
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, #8
      BNEZ R1, Loop
Exit:
Eliminating Data Dependences

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, #8
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, #8
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, #8
      BNEZ R1, Loop

- The data dependences through R1 (SUBI, LD, SD) force sequential execution of the iterations
- The compiler removes this dependence by:
  - Computing the intermediate R1 values as fixed offsets
  - Eliminating the intermediate SUBIs
  - Changing the final SUBI (to #32)
- Data flow analysis:
  - Can be done on registers
  - Cannot easily be done on memory locations: is 100(R1) the same location as 20(R2)?
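The merged-update idea can be sketched in C. This is an illustrative sketch, not the slides' code; both versions assume n is a multiple of 4, and `p` stands in for R1.

```c
/* Four pointer updates per block, like the four SUBIs: each access
   depends on the previous update, serializing the iterations. */
void add_dependent(double *x, long n, double s) {
    double *p = x + n;
    while (p != x) {
        p -= 1; *p += s;
        p -= 1; *p += s;
        p -= 1; *p += s;
        p -= 1; *p += s;
    }
}

/* Intermediate updates folded into fixed offsets from one p; a
   single merged decrement per block breaks the serial chain. */
void add_merged(double *x, long n, double s) {
    double *p = x + n;
    while (p != x) {
        p[-1] += s;
        p[-2] += s;
        p[-3] += s;
        p[-4] += s;
        p -= 4;   /* the one remaining "SUBI" */
    }
}
```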

Alleviating Data Dependencies

Unrolled loop:

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, #32
      BNEZ R1, Loop

Scheduled unrolled loop:

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, #32
      SD   16(R1), F12
      BNEZ R1, Loop
      SD   8(R1), F16

Some General Comments
- Dependences are a property of programs; actual hazards are a property of the pipeline
- Techniques to avoid dependence limitations:
  - Maintain the dependences but avoid hazards: code scheduling (hardware, software)
  - Eliminate dependences by code transformations: complex, compiler-based

Static Branch Prediction: Using Compiler Technology
- How to statically predict branches? Examination of program behavior
- Always predict taken (on average, 67% of branches are taken)
  - Misprediction rate varies widely (9% to 59%)
- Predict backward branches taken, forward branches untaken
  - Misprediction rate can reach 60% to 70%
- Profile-based predictor: use profile information collected from earlier runs
  - Simplest is a single prediction bit per branch; easily extends to use more bits
  - Definite win for some regular applications
- Useful for:
  - Scheduling instructions when the branch delays are exposed by the architecture (either delayed or canceling branches)
  - Assisting dynamic predictors (IA-64 architecture in Section 4.7)
  - Determining which code paths are more frequent, a key step in code scheduling
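The profile-based scheme can be sketched as follows. This is an illustrative sketch, not from the slides: one bit per branch, set to the direction the branch took most often during a profiling run, then applied unchanged at run time.

```c
/* Majority-vote prediction bit from a profiling run:
   1 = predict taken, 0 = predict not taken. */
int predict_taken(int taken_count, int total_count) {
    return 2 * taken_count >= total_count;
}

/* Mispredictions when a fixed static prediction is applied to an
   actual outcome stream (1 = taken, 0 = not taken). */
int mispredictions(const int *outcomes, int n, int prediction) {
    int misses = 0;
    for (int i = 0; i < n; i++)
        if (outcomes[i] != prediction)
            misses++;
    return misses;
}
```

A branch taken 4 times out of 5 in the profile gets a "taken" bit and mispredicts once on the same stream.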

Static Multiple Issue: VLIW Overview
- VLIW (very long instruction word): issue a fixed number of instructions, formatted as
  - One large instruction comprising independent MIPS instructions, or
  - A fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction
- Also known as EPIC: explicitly parallel instruction computers
- Rely on the compiler to:
  - Minimize the potential hazard stalls
  - Actually format the instructions in a potential issue packet so that the HW need not check explicitly for dependences; the compiler ensures that dependences within the issue packet are not present, or indicates when a dependence may occur

- A VLIW uses multiple, independent functional units
- A VLIW packages multiple independent operations into one very long instruction
  - The burden of choosing and packaging independent operations falls on the compiler
  - The HW that makes these issue decisions in a superscalar is unneeded
  - This advantage increases as the maximum issue rate grows
- Here we consider a VLIW processor whose instructions contain 5 operations: 1 integer (or branch), 2 FP, and 2 memory references
  - Depends on the available FUs and the frequency of operations
- VLIW depends on enough parallelism to keep the FUs busy
  - Loop unrolling and then code scheduling
  - The compiler may need to do local scheduling and global scheduling
  - Techniques to enhance LS and GS will be mentioned later
  - For now, assume we have a technique to generate long, straight-line code sequences

Loop Unrolling in VLIW

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.29 clocks per iteration
- 23 ops in 9 clocks: an average of 2.5 ops per clock; 60% of the FUs are used
- Note: need more registers in VLIW

VLIW Problems: Technical
- Increase in code size
  - Ambitious loop unrolling

  - Whenever instructions are not full, the unused FUs translate to wasted bits in the instruction encoding
  - An instruction may need to be left completely empty if no operation can be scheduled
  - Clever encoding or compress/decompress

VLIW Problems: Logistical
- Synchronous vs. independent FUs
  - Early VLIW: all FUs must be kept synchronized
    - A stall in any FU pipeline may cause the entire processor to stall
  - Recent VLIW: FUs operate more independently
    - The compiler is used to avoid hazards at issue time
    - Hardware checks allow for unsynchronized execution once instructions are issued
- Binary code compatibility
  - The code sequence makes use of both the instruction set definition and the detailed pipeline structure (FUs and latencies)
  - Migration between successive implementations, or between implementations, needs recompilation
  - Solutions:
    - Object-code translation or emulation
    - Temper the strictness of the approach so that binary compatibility is still feasible (IA-64 in Section 4.7)

Advantages of Superscalar over VLIW
- Old codes still run
  - Like those tools you have that came as binaries
  - HW detects whether the instruction pair is a legal dual-issue pair; if not, they are run sequentially
- Little impact on code density
  - Don't need to fill all of the "can't issue here" slots with NOPs
- Compiler issues are very similar
  - Still need to do instruction scheduling anyway
  - The dynamic issue hardware is there, so the compiler does not have to be too conservative

Advanced Compiler Support for Exposing and Exploiting ILP


Overview
- Discuss compiler technology for increasing the amount of parallelism that we can exploit in a program
  - Detecting and enhancing loop-level parallelism
  - Finding and eliminating dependent computations
  - Software pipelining: symbolic loop unrolling
  - Global code scheduling

Detect and Enhance LLP

- Loop-level parallelism: analyzed at or near the source level; most ILP analysis is done once instructions have been generated
- Loop-level analysis
  - Determine what dependences exist among the operands in a loop across the iterations of that loop
  - Determine whether data accesses in later iterations are dependent on data values produced in earlier iterations
    - Loop-carried dependence (LCD) vs. loop-level parallel
    - An LCD forces successive loop iterations to execute in series
- Finding loop-level parallelism involves recognizing structures such as loops, array references, and induction-variable computations; the compiler can do this analysis more easily at or near the source level

Example 1

for (i=1; i <= 100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

Assume A, B, and C are distinct, non-overlapping arrays.

- S1 uses a value computed by S1 in an earlier iteration (A[i]): loop-carried dependence
- S2 uses a value computed by S2 in an earlier iteration (B[i]): loop-carried dependence
- S2 uses a value computed by S1 in the same iteration (A[i+1]): not loop-carried
  - Multiple iterations can execute in parallel, as long as the dependent statements in an iteration are kept in order

Example 2

for (i=1; i <= 100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

- The existence of a loop-carried dependence does not prevent parallelism
- S1 uses a value computed by S2 in an earlier iteration (B[i+1]): loop-carried dependence
- The dependence is not circular
  - Neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
- A loop is parallel if it can be written without a cycle in the dependences
  - Absence of a cycle gives a partial ordering on the statements

Example 2 (Cont.)
- Transform the code in the previous slide to conform to the partial ordering and expose the parallelism:

A[1] = A[1] + B[1];
for (i=1; i <= 99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];

- No longer loop-carried: iterations of the loop may be overlapped, provided the statements in each iteration are kept in order
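The two versions can be checked directly in C. This is an illustrative check, not from the slides; N is scaled down from 100 and the array contents are arbitrary.

```c
#define N 6   /* stands in for the slides' 100 */

/* Loop as written: S1 then S2, with the B[i+1] -> B[i] loop-carried
   dependence from S2 into the next iteration's S1. */
void original_loop2(double A[], double B[], const double C[], const double D[]) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];       /* S1 */
        B[i + 1] = C[i] + D[i];   /* S2 */
    }
}

/* Transformed to the partial order: first S1 and last S2 peeled,
   S2 now feeds S1 within the same iteration. */
void transformed_loop2(double A[], double B[], const double C[], const double D[]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}
```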

Detect and Eliminate Dependences
- The trick is to find and remove dependences
  - Simple for single-variable accesses
  - More complex for pointers, array references, etc.
- Things get easier if
  - Dependences are non-cyclic
  - There are no loop-carried dependences
  - The recurrence dependence distance is large (more ILP can be exploited)
    - Recurrence: a variable is defined based on the value of that variable in an earlier iteration, e.g. Y[i] = Y[i-5] + Y[i] (dependence distance = 5)
  - Array index calculation is consistent
    - Affine array indices: a*i + b, where a and b are constants
    - GCD test (pp. 324) to determine whether two affine functions can have the same value for different indices within the loop bounds
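The GCD test itself fits in a few lines. This is a sketch of the test's arithmetic, not the slides' code: a write to x[a*i + b] and a read of x[c*j + d] can refer to the same element only if gcd(a, c) divides d - b. The test ignores the loop bounds, so it is conservative: "may depend", not "must depend".

```c
int gcd(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a < 0 ? -a : a;
}

/* GCD dependence test for x[a*i + b] (write) vs x[c*j + d] (read). */
int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(a, c);
    if (g == 0)               /* both strides zero: same element iff b == d */
        return b == d;
    return (d - b) % g == 0;
}
```

For x[2*i + 3] vs x[2*i], gcd(2, 2) = 2 does not divide -3, so the references are independent.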

Detect and Eliminate Dependences: Difficulties
- Barriers to analysis
  - References via pointers rather than predictable array indices
  - Array indexing that is indirect through another array: x[y[i]] (non-affine), common in sparse array accesses
  - False dependence: for some input values a dependence may exist; run-time checks must be used to determine the dependent case
- In general an NP-hard problem
  - Specific cases can be handled precisely
  - Current problem: a lot of special cases that don't apply often; a good general heuristic is the holy grail
- Points-to analysis: analyzing programs with pointers (pp. 326-327)

Eliminating Dependent Computations: Techniques

- Eliminate or reduce a dependent computation by back substitution, within a basic block and within a loop
- Algebraic simplification + copy propagation (eliminate operations that copy values within a basic block):

DADDUI R1, R2, #4
DADDUI R1, R1, #4

becomes

DADDUI R1, R2, #8

  - Reduces the multiple increments of array indices during loop unrolling and moves increments across memory addresses, as in Section 4.1
- Tree-height reduction: increase the code parallelism

DADDUI R1, R2, R3
DADDUI R4, R1, R6
DADDUI R8, R4, R7

becomes

DADDUI R1, R2, R3
DADDUI R4, R6, R7
DADDUI R8, R1, R4

Eliminating Dependent Computations: Techniques (Cont.)
- Most compilers require that optimizations relying on associativity (e.g. tree-height reduction) be explicitly enabled
  - Integer/FP arithmetic (range and precision) may lead to rounding errors
- Optimization related to recurrences
  - Recurrence: an expression whose value on one iteration is given by a function that depends on previous iterations
  - When a loop with a recurrence is unrolled, we may be able to algebraically optimize the unrolled loop, so that the recurrence need only be evaluated once per unrolled iteration
    - sum = sum + x
    - sum = sum + x1 + x2 + x3 + x4 + x5 (5 dependent operations)
    - sum = ((sum + x1) + (x2 + x3)) + (x4 + x5) (3 dependent operations)

An Example to Eliminate False Dependences
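The reassociation above can be rendered in C. This is an integer sketch (not the slides' code) so that the two forms are exactly equal; with floating point, as noted above, rounding may differ, which is why compilers gate this behind an explicit flag.

```c
/* Chained form: 5 dependent additions on the critical path. */
int sum_chain(int sum, const int x[5]) {
    return ((((sum + x[0]) + x[1]) + x[2]) + x[3]) + x[4];
}

/* Tree form: only 3 dependent additions on the critical path;
   x[1]+x[2] and x[3]+x[4] can execute in parallel. */
int sum_tree(int sum, const int x[5]) {
    return ((sum + x[0]) + (x[1] + x[2])) + (x[3] + x[4]);
}
```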

for (i=1; i <= 100; i=i+1) {
    Y[i] = X[i] / C;   /* S1 */
    X[i] = X[i] + C;   /* S2 */
    Z[i] = Y[i] + C;   /* S3 */
    Y[i] = C - Y[i];   /* S4 */
}

Renamed to remove the false dependences (T renames Y, X1 renames X):

for (i=1; i <= 100; i=i+1) {
    T[i] = X[i] / C;    /* S1 */
    X1[i] = X[i] + C;   /* S2 */
    Z[i] = T[i] + C;    /* S3 */
    Y[i] = C - T[i];    /* S4 */
}
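A quick C check that the renaming preserves the results: after both loops, Y and Z agree, and X1 in the renamed version holds what X holds in the original. M and the array values are illustrative, not from the slides.

```c
#define M 4   /* stands in for the slides' 100 */

/* Original loop: antidependence S1->S2 on X, output dependence S1->S4
   on Y, antidependence S3->S4 on Y. */
void loop_orig(double X[], double Y[], double Z[], double c) {
    for (int i = 0; i < M; i++) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3: reads the S1 value of Y */
        Y[i] = c - Y[i];   /* S4 */
    }
}

/* Renamed loop: T renames the S1 value of Y, X1 renames the new X,
   so the four statements no longer share storage. */
void loop_renamed(const double X[], double X1[], double Y[],
                  double Z[], double T[], double c) {
    for (int i = 0; i < M; i++) {
        T[i]  = X[i] / c;  /* S1 */
        X1[i] = X[i] + c;  /* S2 */
        Z[i]  = T[i] + c;  /* S3 */
        Y[i]  = c - T[i];  /* S4 */
    }
}
```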

Software Pipelining
- Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations
- Software pipelining (symbolic loop unrolling): reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop
- The idea is to separate the dependences in the original loop body
- Register management can be tricky, but the idea is to turn the code into a single loop body
- In practice, both unrolling and software pipelining will be necessary, due to register limitations

Software Pipelining (Cont.)

[Figure: iterations 0 through 3 overlap in time; a software-pipelined iteration draws one instruction from each, with start-up code before the loop and finish-up code after it.]

Example

Original loop body (iterations i, i+1, i+2 each execute this):

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, #8
      BNEZ R1, Loop

Software-pipelined loop, taking the SD from iteration i, the ADDD from iteration i+1, and the LD from iteration i+2:

Loop: SD   16(R1), F4
      ADDD F4, F0, F2
      LD   F0, 0(R1)
      SUBI R1, R1, #8
      BNEZ R1, Loop

Software Pipelining Example

Before: unrolled 3 times

 1 L.D    F0, 0(R1)
 2 ADD.D  F4, F0, F2
 3 S.D    F4, 0(R1)
 4 L.D    F6, -8(R1)
 5 ADD.D  F8, F6, F2
 6 S.D    F8, -8(R1)
 7 L.D    F10, -16(R1)
 8 ADD.D  F12, F10, F2
 9 S.D    F12, -16(R1)
10 DADDUI R1, R1, #-24
11 BNE    R1, R2, LOOP

After: software pipelined

 1 S.D    F4, 16(R1)   ; stores M[i]
 2 ADD.D  F4, F0, F2   ; adds to M[i-1]
 3 L.D    F0, 0(R1)    ; loads M[i-2]
 4 DADDUI R1, R1, #-8
 5 BNE    R1, R2, LOOP

Start-up and finish-up code are ignored.
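The same restructuring can be sketched in C with the start-up and finish-up code written out. This is an illustrative sketch, not the book's code: the kernel's store, add, and load come from three different iterations of x[i] += s.

```c
void add_sw_pipelined(double *x, long n, double s) {
    if (n < 2) {                   /* too short to pipeline: plain loop */
        for (long i = 0; i < n; i++)
            x[i] += s;
        return;
    }
    double loaded = x[0];          /* start-up: load for iteration 0 */
    double added  = loaded + s;    /* start-up: add  for iteration 0 */
    loaded = x[1];                 /* start-up: load for iteration 1 */
    for (long i = 2; i < n; i++) { /* kernel: three iterations in flight */
        x[i - 2] = added;          /* store for iteration i-2 */
        added = loaded + s;        /* add   for iteration i-1 */
        loaded = x[i];             /* load  for iteration i   */
    }
    x[n - 2] = added;              /* finish-up: drain the pipe */
    x[n - 1] = loaded + s;
}
```

Within the kernel, the store, add, and load have no dependence on one another, so they can issue together.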

- 5 cycles per result, assuming that DADDUI is scheduled before ADD.D, and L.D (with an adjusted offset) is placed in the branch delay slot

Software Pipelining (Symbolic Loop Unrolling)
- Maximizes the result-use distance
- Less code space than unrolling
- Fill and drain the pipe only once per loop, vs. once per unrolled iteration in loop unrolling

[Figure: over time, the software pipeline keeps overlapped loop iterations in flight continuously, while the unrolled loop repeatedly fills and drains.]

Global Code Scheduling

- Aim: compact a code fragment with internal control structure into the shortest possible sequence that preserves the data and control dependences
- Reduce the effect of control dependences arising from conditional non-loop branches by moving code
  - Loop branches can be handled by unrolling
- Requires estimates of the relative frequency of different paths (Then/Else)
- Global code scheduling example (next slide)
  - Suppose the shaded part (Then) is more frequently executed than the Else part
- In general, extremely complex

Global Code Scheduling Example

LD    R4, 0(R1)       ; load A
LD    R5, 0(R2)       ; load B
DADDU R4, R4, R5      ; Add to A
SD    R4, 0(R1)       ; store A
BNEZ  R4, elsepart    ; Test A
SD    ..., 0(R2)      ; Store B (Then part)
J     join            ; jump over else
elsepart: ...         ; code for X
join: ...             ; after if
SW    ..., 0(R3)      ; store C[i]

Try to move the assignments of B and C before the test of A:
- Move B
  - If an instruction in X depends on the original value of B, moving B before BNEZ will change the data dependence
  - Make a shadow copy of B before the IF and use that shadow copy in X? Complex to implement
  - Slows down the program if the trace selected is not optimal, and requires additional instructions to execute
- Move C
  - To the Then part: a copy of C must be put in the Else part
  - Before the BNEZ: if C can be moved there, the copy of C in the Else part can be eliminated

Move the assignment of B before BNEZ:

LD    R4, 0(R1)       ; load A
LD    R5, 0(R2)       ; load B
DADDU R4, R4, R5      ; Add to A
SD    R4, 0(R1)       ; store A
SD    ..., 0(R2)      ; Store B
BNEZ  R4, elsepart    ; Test A
J     join            ; jump over else
elsepart:             ; code for X
LD    R5, 0(R2)
DADDU R7, R5, R6
join:                 ; after if
SW    ..., 0(R3)      ; store C[i]

Factors in Moving B
The compiler will consider the following factors:
- Relative execution frequencies of Then and Else
- Cost of executing the computation and assignment to B above the branch
  - Are there any empty instruction issue slots and stalls above the branch?
  - How will the movement of B change the execution time of Then?
- Is B the best code fragment that can be moved? How about C or others?
- The cost of the compensation code that may be necessary for Else

Trace Scheduling
- Useful for processors with a large number of issues per clock, where conditional or predicated execution (Section 4.5) is inappropriate or unsupported, and where loop unrolling may not be sufficient by itself to uncover enough ILP
- A way to organize the global code motion process, so as to simplify code scheduling by incurring the costs of possible code motion on the less frequent paths
- Best used where profile information indicates significant differences in frequency between different paths, and where the profile information is highly indicative of program behavior independent of the input
- Parallelism across conditional branches other than loop branches
  - Looking for the critical path across conditional branches
- Two steps:
  - Trace selection: find a likely sequence of multiple basic blocks (a trace) forming a long sequence of straight-line code
  - Trace compaction: squeeze the trace into a few VLIW instructions; move the trace before the branch decision
- Need bookkeeping code in case the prediction is wrong
  - The compiler undoes a bad guess (discards values in registers)
  - Subtle compiler bugs mean a wrong answer rather than merely poor performance; there are no hardware interlocks

Trace Scheduling (Cont.)


- Trace scheduling simplifies the decisions concerning global code motion
  - Branches are viewed as jumps into or out of the selected trace (the most probable path)
  - Additional bookkeeping code will often be needed at entry and exit points
  - Works if the trace is so much more probable than the alternatives that the cost of the bookkeeping code need not be a deciding factor
- Trace scheduling is good for scientific code
  - Intensive loops and accurate profile data
  - But it is unclear whether this approach is suitable for programs that are less simply characterized and less loop-intensive

Superblocks
- Drawback of trace scheduling: entries and exits into the middle of the trace cause significant complications
  - Compensation code, and it is hard to assess its cost
- Superblocks: similar to a trace, but
  - A single entry point, though multiple exits are allowed
  - In a loop that has a single loop exit based on a count, the resulting superblocks have only one exit
  - Use tail duplication to create a separate block corresponding to the portion of the trace after the entry

Some Things to Notice
- SW pipelining, loop unrolling, trace scheduling, superblocks
  - Useful when branch behavior is fairly predictable at compile time
  - Not totally independent techniques
  - All try to avoid dependence-induced stalls
- Primary focus
  - Unrolling: reduce the loop overhead of index modification and branch
  - SW pipelining: reduce single-body dependence stalls
  - Trace scheduling/superblocks: reduce the impact of branch stalls
- Most advanced compilers attempt all of them
  - The result is a hybrid which blurs the differences
  - Lots of special-case analysis changes the hybrid mix
- All tend to fail if branch prediction is unreliable

Hardware Support for Exposing More Parallelism at Compile Time
- Conditional or predicated instructions
  - Can be used to eliminate branches: control dependence is converted into data dependence
  - Useful in hardware-intensive as well as software-intensive approaches to ILP
- Compiler speculation with hardware support
  - Support for preserving exception behavior
  - Support for reordering loads and stores

Predicated Instructions

- if (C) {S}: if condition C is true, statement S executes; predication replaces the branch with the predicated form C: S, so the branch is eliminated
- Conditional MOVE is the simplest form of predicated instruction:

BNEZ  R4, +2
MOV   R2, R1

becomes

CMOVZ R2, R1, R4

Another Example: A = abs(B)

if (B < 0) A = -B; else A = B;

- Can be written as two conditional moves, or one unconditional move and one conditional move
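The abs example can be sketched branch-free in C in the spirit of conditional moves. This is an illustration of the idea, not IA-64 or MIPS code: one unconditional move plus one move that takes effect only when the predicate (B < 0) holds.

```c
int abs_cmov(int b) {
    int a = b;                      /* unconditional move: A = B */
    int take = -(b < 0);            /* predicate as an all-ones/all-zeros mask */
    a = (take & -b) | (~take & a);  /* "conditional move": A = -B if B < 0 */
    return a;
}
```

The control dependence on B < 0 has become a data dependence on the mask, which is exactly the conversion the slide describes.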

Full Predication
- Simplest case: only conditional move
  - Useful for short sequences only
  - For large code blocks, many conditional moves may be required: inefficient
- Full predication: all instructions can be conditional
  - Large code blocks may be converted
  - An entire loop body may become free of branches
- Multiple branches per clock
  - Very likely with a high-issue-rate processor
  - Complex to handle control dependence among branches: difficult to predict, update tables, etc.
  - Reducing branches per clock (if not eliminating them) is useful
  - Removing a branch that is harder to predict increases the potential gain

Example: A 2-Issue Machine

First instruction slot      Second instruction slot
LW   R1, 40(R2)             ADD R3, R4, R5
BEQZ R10, L                 ADD R6, R3, R7
LW   R8, 0(R10)
LW   R9, 0(R8)

Here is a code sequence for a two-issue superscalar that can issue a combination of one memory reference and one ALU operation, or a branch by itself, each cycle. This sequence wastes a memory operation slot in the second cycle and will incur a data dependence stall if the branch is not taken, since the second LW after the branch depends on the prior load. Show how the code can be improved using a predicated form of LW.

First instruction slot      Second instruction slot
LW   R1, 40(R2)             ADD R3, R4, R5
LWC  R8, 0(R10), R10        ADD R6, R3, R7
BEQZ R10, L
LW   R9, 0(R8)

Example: A 2-Issue Machine (Cont.)
- One issue slot eliminated
- One stall cycle eliminated (dependence between the loads)
- No improvement if the branch condition is false
- The entire code (if short) after the branch may be moved up

Exceptions and Predicated Instructions
- A predicated instruction must not generate an exception if the predicate is false
  - LW R8, 0(R10) may generate a protection exception if R10 contains 0
- When the predicate is true, the exception behavior should be as usual
  - LW R8, 0(R10) may still cause a legal and resumable exception (e.g. a page fault) if R10 is not 0
- When to annul a predicated instruction?
  - Early, during issue: may lead to a pipeline stall due to data dependence
  - Late, just before writing results: FU resources are consumed, a negative impact on performance

Limitations of Predicated Instructions
- Resources are wasted when instructions are annulled
  - Except when the slots taken by these instructions would have been idle anyway
- Useful if predicates can be evaluated early; otherwise stalls for data hazards may result
- Usefulness is limited when control flow is more complex than a simple if-then-else
  - e.g. moving an instruction across 2 branches requires 2 predicates: large overheads if this is not supported
- Speed penalty: higher cycle count or slower clock

Compiler Speculation with Hardware Support

- Compiler speculation: prediction of a branch from program structure / profile data
- Purpose: moving an instruction before this branch
  - Improve scheduling or issue rate
- Compared with predicated instructions:
  - The latter may not always remove the control dependence
  - Here the instruction may be moved even before the condition evaluation
- What is required to speculate ambitiously?
  - Find instructions that can be moved without affecting the data flow; use register renaming if that helps
  - Ignore exceptions in the speculated instruction until you know for sure it executes
  - Interchange load-store or store-store pairs: speculate that there are no address conflicts
- Hardware support is needed for the 2nd and 3rd

Hardware Support for Preserving Exception Behavior

1. Ignore exceptions
   - Behavior preserved for correct programs only; may be acceptable only in a "fast mode"
2. Check instructions
   - The speculated instruction doesn't raise exceptions; check instructions see whether an exception should occur
3. Poison bits attached to the result register
   - Set if the speculated instruction causes an exception
   - Cause a fault if a non-speculative instruction reads that register
4. Reorder buffer
   - Results are buffered and exceptions delayed until the instruction is no longer speculative

Exception types
- Program errors: the program needs to be terminated; results are not well defined (e.g. memory protection error)
- Normal events: the program is resumed after handling the event (e.g. page fault)

Speculative instructions and exception types
- Normal events
  - Can be handled for speculative instructions in the same way as for normal instructions
  - Harmless, but resources are consumed
- Program errors
  - An instruction should not cause program termination until it is found to be no longer speculative
- Ignoring exceptions
  - Resumable exceptions: handle normally, as and when the exception occurs
  - Terminating exceptions: don't terminate; return an undefined value
    - Speculation incorrect: a wrong program is allowed to continue and produce wrong results
    - Speculation correct: the result will get ignored anyway
  - Instructions may be marked as speculative or normal
    - Helpful, but not necessary; errors in normal instructions can terminate the program

Example
if (A == 0) A = B; else A = A + 4;
Assume A is at 0(R3) and B is at 0(R2).

Original (no speculation):

LW   R1, 0(R3)
BNEZ R1, L1
LW   R1, 0(R2)
J    L2
L1: ADDI R1, R1, #4
L2: SW   R1, 0(R3)

Speculated (the LW of B moved above the branch):

LW   R1, 0(R3)
LW   R14, 0(R2)
BEQZ R1, L3
ADDI R14, R1, #4
L3: SW R14, 0(R3)

With check instructions:

LW     R1, 0(R3)
sLW    R14, 0(R2)
BNEZ   R1, L1
SPECCK 0(R2)
J      L2
L1: ADDI R14, R1, #4
L2: SW   R14, 0(R3)

- Exception behavior is preserved exactly; the then-block reappears
- Note: sLW is speculative and does not terminate the program

Use poison bits

- Poison bits for registers, speculative bits for instructions
- The poison bit of the destination is set if a speculative instruction encounters a terminating exception
- When an instruction reads a register whose poison bit is on:
  - Speculative instruction: the poison bit of its destination is set
  - Normal instruction: a fault occurs
- Stores are never speculative
- Saving and restoring poison bits on a context switch requires special instructions

Code with poison bits:

LW   R1, 0(R3)
sLW  R14, 0(R2)
BEQZ R1, L3
ADDI R14, R1, #4
L3: SW R14, 0(R3)

The sLW instruction sets the poison bit of R14 if R2 contains 0.

Use reorder buffer
- Reorder buffer as in a superscalar processor; instructions are marked as speculative
- Remember how many branches (usually not more than 1) the instruction moved across, and what branch action the compiler assumed
- Alternative: the original location is marked by a sentinel, which indicates that the results can be committed

Memory reference speculation
- Move a load up across a store
  - No problem if the absence of an address clash can be checked statically
  - Otherwise, mark the instruction as speculative; it saves the address
  - The address is examined on subsequent stores; a conflict means the speculation failed
  - A special instruction is kept at the original location of the load; it can take care of the reload when speculation fails, and may require a fix-up sequence as well

HW Versus SW Speculation Mechanisms

- To speculate extensively, we must be able to disambiguate memory references
  - Easy for HW (Tomasulo)
- HW speculation works better when control flow is unpredictable, and when HW branch prediction is superior to SW branch prediction done at compile time
  - Misprediction rates of 16% (SW) vs. 10% (HW) for 4 major integer SPEC92 programs
- HW speculation maintains a completely precise exception model for speculated instructions
- HW speculation does not require the compensation or bookkeeping code needed by ambitious SW speculation
- HW speculation with dynamic scheduling does not require different code sequences to achieve good performance on different implementations of an architecture
- HW speculation requires complex and additional HW resources
- Some designers have tried to combine the dynamic and compiler-based approaches to achieve the best of each

IA-64 Architecture (Think Intel Itanium)

- Also known as EPIC (Explicitly Parallel Instruction Computing): a new kind of superscalar computer

Superpipelined and Superscalar Machines
- Superpipelined machine: overlaps pipe stages; relies on stages being able to begin operations before the last is complete
- Superscalar machine: employs multiple independent pipelines to execute multiple independent instructions in parallel
  - Particularly common instructions (arithmetic, load/store, conditional branch) can be executed independently

Why a New Architecture Direction?
- Processor designers' obvious choices for the use of an increasing number of transistors on chip and extra speed:
  - Bigger caches: diminishing returns
  - Increasing the degree of superscaling by adding more execution units: complexity wall (more logic, improved branch prediction needed, more renaming registers, more complicated dependencies)
  - Multiple processors: a challenge to use them effectively in general computing
  - Longer pipelines: greater penalty for misprediction

IA-64: Background
- Explicitly Parallel Instruction Computing (EPIC), jointly developed by Intel and Hewlett-Packard (HP)
- New 64-bit architecture
  - Not an extension of the x86 series
  - Not an adaptation of the HP 64-bit RISC architecture
- Aims to exploit increasing chip transistors and increasing speeds
- Utilizes systematic parallelism; a departure from the superscalar trend
- Note: became the architecture of the Intel Itanium

Basic Concepts for IA-64
- Instruction-level parallelism EXPLICIT in the machine instructions, rather than determined at run time by the processor
- Long or very long instruction words (LIW/VLIW): fetch bigger chunks, already preprocessed
- Predicated execution: marking groups of instructions for a late decision on execution
- Control speculation: go ahead and fetch and decode instructions, but keep track of them so the decision to issue them, or not, can be practically made later
- Data speculation (or speculative loading): go ahead and load data early so it is ready when needed, with a practical way to recover if the speculation proved wrong
- Software pipelining: multiple iterations of a loop can be executed in parallel

General Organization

Predicate Registers
- Used as flags for instructions that may or may not be executed
- A set of instructions is assigned a predicate register when it is uncertain whether the instruction sequence will actually be executed (think branch)
- Only instructions with a predicate value of true are executed
- When it is known that an instruction is going to be executed, its predicate is set; all instructions with that predicate true can now be completed
- Instructions with a false predicate become candidates for cleanup

Predication

Speculative Loading

IA-64 Key Hardware Features
- Large number of registers
  - The IA-64 instruction format assumes 256 registers:
    - 128 64-bit integer, logical, and general-purpose registers
    - 128 82-bit floating point and graphics registers
  - 64 predicated execution registers (to support a high degree of parallelism)
- Multiple execution units
  - Probably pipelined; 8 or more?

IA-64 Register Set

Relationship between Instruction Type & Execution Unit

IA-64 Execution Units
- I-Unit: integer arithmetic, shift-and-add, logical, compare, integer multimedia ops
- M-Unit: load and store between register and memory, plus some integer ALU operations
- B-Unit: branch instructions
- F-Unit: floating point instructions

Instruction Format Diagram

Instruction Format
- 128-bit bundles
  - Can fetch one or more bundles at a time
  - A bundle holds three instructions plus a template
  - Instructions are usually 41 bits long and have associated predicated execution registers
  - The template contains info on which instructions can be executed in parallel
    - Not confined to a single bundle: e.g. a stream of 8 instructions may be executed in parallel
  - The compiler will have reordered instructions to form contiguous bundles
  - Can mix dependent and independent instructions in the same bundle

Field Encoding & Instr Set Mapping

Assembly Language Format

[qp] mnemonic [.comp] dest = srcs ;;  // comment

- qp: predicate register
  - 1 at execution: execute and commit the result to hardware
  - 0: the result is discarded
- mnemonic: name of the instruction
- comp: one or more instruction completers used to qualify the mnemonic
- dest: one or more destination operands
- srcs: one or more source operands
- ;; marks instruction-group stops (when appropriate)
  - A sequence without read-after-write or write-after-write dependences
  - Does not need hardware register dependency checks
- //: a comment follows

Assembly Example: Register Dependency

ld8 r1 = [r5] ;;   // first group
add r3 = r1, r4    // second group

- The second instruction depends on the value in r1, which is changed by the first instruction; they cannot be in the same group for parallel execution
- Note: ;; ends the group of instructions that can be executed in parallel

Assembly Example: Multiple Register Dependencies

ld8 r1 = [r5]       // first group
sub r6 = r8, r9 ;;  // first group
add r3 = r1, r4     // second group
st8 [r6] = r12      // second group

The last instruction stores to the memory location whose address is in r6, which is established in the second instruction.

Assembly Example: Predicated Code

Consider the following program with branches:

if (a && b)
    j = j + 1;
else if (c)
    k = k + 1;
else
    k = k - 1;
i = i + 1;

Pentium code:

        cmp a, 0
        je  L1
        cmp b, 0
        je  L1
        add j, 1
        jmp L3
L1:     cmp c, 0
        je  L2
        add k, 1
        jmp L3
L2:     sub k, 1
L3:     add i, 1

IA-64 code:

        cmp.eq p1, p2 = 0, a ;;
(p2)    cmp.eq p1, p3 = 0, b
(p3)    add j = 1, j
(p1)    cmp.ne p4, p5 = 0, c
(p4)    add k = 1, k
(p5)    add k = -1, k
        add i = 1, i

Example of Predication

Data Speculation
- Load data from memory before it is needed
- What might go wrong?
  - The load is moved before a store that might alter the memory location
  - Need a subsequent check on the value

Assembly Example: Data Speculation

Consider the following program:

(p1) br some_label      // cycle 0
     ld8 r1 = [r5] ;;   // cycle 1
     add r1 = r1, r3    // cycle 3

With a speculative load, the ld8 is hoisted above the branch and a check is left in its place:

     ld8.s r1 = [r5] ;; // cycle -2
     // other instructions
(p1) br some_label      // cycle 0
     chk.s r1, recovery // cycle 0
     add r2 = r1, r3    // cycle 0
