RHK.S96 1
ADDD F4,F0,F2
ADDD F8,F6,F2
ADDD F12,F10,F2
ADDD F16,F14,F2
ADDD F20,F18,F2
Unrolled 5 times to avoid delays (+1 due to SS, i.e. superscalar issue): 12 clocks, or 2.4 clocks per iteration
Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration
Need more registers in VLIW
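The per-iteration figures quoted for the two unrolled schedules are just total clocks divided by the number of original iterations covered. A minimal sketch of that arithmetic follows; the function name `clocks_per_iteration` is a hypothetical helper, not something from the slides.

```python
def clocks_per_iteration(total_clocks: int, iterations: int) -> float:
    """Average clock cycles spent per original (un-unrolled) loop iteration."""
    return total_clocks / iterations


# Superscalar schedule from the slides: 5 iterations in 12 clocks.
superscalar = clocks_per_iteration(12, 5)
# VLIW schedule from the slides: 7 results in 9 clocks.
vliw = clocks_per_iteration(9, 7)

print(f"superscalar: {superscalar:.1f} clocks per iteration")
print(f"VLIW:        {vliw:.1f} clocks per iteration")
```

Rounded to one decimal place this gives 2.4 and 1.3, matching the slides.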
Difficulties in building HW
Duplicate FUs to get parallel execution
Increase ports to Register File
  VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg
Increase ports to memory
Decoding SS and impact on clock rate, pipeline depth
Review: Summary
Branch Prediction
Branch History Table: 2 bits for loop accuracy
Correlation: Recently executed branches correlated with next branch
Branch Target Buffer: include branch address & prediction
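To make the "2 bits for loop accuracy" point concrete, here is a minimal sketch of a branch history table built from 2-bit saturating counters. This is one common 2-bit scheme, not necessarily the exact state machine the lecture has in mind. The class name, the table size of 1024 entries, the initial counter value, and the 4-byte instruction assumption in the index function are all illustrative assumptions.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters.

    Counter values 0-1 predict not taken, 2-3 predict taken.
    """

    def __init__(self, entries: int = 1024) -> None:
        self.entries = entries            # assumption: table size
        self.counters = [0] * entries     # assumption: start at strongly not taken

    def _index(self, pc: int) -> int:
        # Assumption: 4-byte instructions, so drop the low 2 address bits.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def mispredictions(outcomes, pc: int = 0x400) -> int:
    """Count wrong predictions for one branch over a sequence of outcomes."""
    bht = TwoBitPredictor()
    wrong = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            wrong += 1
        bht.update(pc, taken)
    return wrong


# A loop-closing branch: taken 9 times, then not taken; the loop runs 3 times.
loop_branch = ([True] * 9 + [False]) * 3
print(mispredictions(loop_branch))
```

After the counter warms up, the only misprediction per execution of the loop is the exit. A 1-bit scheme would also mispredict the first iteration on re-entry, which is the reason for the second bit.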
SW Pipelining
Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead
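As an illustration of symbolic loop unrolling, the sketch below rewrites a loop that adds a scalar to each array element so that each pass of the new loop body mixes work from three different original iterations: the store for iteration i, the add for iteration i+1, and the load for iteration i+2. The function names and the choice of example loop are assumptions made for illustration; Python only models the reordering, not the timing.

```python
def add_scalar_plain(x, s):
    """Original loop: load, add, store for each element in turn."""
    out = list(x)
    for i in range(len(out)):
        v = out[i]       # load
        v = v + s        # add
        out[i] = v       # store
    return out


def add_scalar_sw_pipelined(x, s):
    """Same computation with the loop body software pipelined."""
    out = list(x)
    n = len(out)
    if n < 2:
        return add_scalar_plain(x, s)   # too short to pipeline

    # Start-up code: fill the software pipeline.
    loaded = out[0]            # load for iteration 0
    added = loaded + s         # add for iteration 0
    loaded = out[1]            # load for iteration 1

    # Steady-state body: one stage from each of three iterations.
    for i in range(n - 2):
        out[i] = added         # store for iteration i
        added = loaded + s     # add for iteration i + 1
        loaded = out[i + 2]    # load for iteration i + 2

    # Finish-up code: drain the software pipeline.
    out[n - 2] = added         # store for iteration n - 2
    out[n - 1] = loaded + s    # add and store for iteration n - 1
    return out


print(add_scalar_sw_pipelined([1, 2, 3, 4, 5], 10))
```

The steady-state body is no longer than the original body, which is the "little code expansion" point: only the short start-up and finish-up sequences are added.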
Trace Scheduling
Parallelism across IF branches vs. LOOP branches
Two steps:
Trace Selection: find a likely sequence of basic blocks (trace) forming a (statically predicted) long sequence of straight-line code
Trace Compaction: squeeze the trace into few VLIW instructions
Need bookkeeping code in case prediction is wrong
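The trace selection step can be sketched as a greedy walk over a control-flow graph that always follows the statically most likely successor. This is only an illustrative forward-only sketch, not a claim about the heuristic any particular compiler uses. The graph encoding, the probabilities, and the name `select_trace` are hypothetical.

```python
def select_trace(cfg, entry):
    """Follow the most likely successor from `entry` until the path ends or loops.

    cfg maps a basic-block name to a list of (successor, probability) edges.
    """
    trace = [entry]
    seen = {entry}
    block = entry
    while cfg.get(block):
        successor, _prob = max(cfg[block], key=lambda edge: edge[1])
        if successor in seen:      # do not follow a back edge into the trace
            break
        trace.append(successor)
        seen.add(successor)
        block = successor
    return trace


# Hypothetical IF-THEN-ELSE: block A branches to B (likely) or C (unlikely),
# and both paths rejoin at D.
cfg = {
    "A": [("B", 0.9), ("C", 0.1)],
    "B": [("D", 1.0)],
    "C": [("D", 1.0)],
    "D": [],
}

trace = select_trace(cfg, "A")
off_trace = [block for block in cfg if block not in trace]
print(trace)       # blocks handed to trace compaction
print(off_trace)   # blocks reached only when the prediction is wrong
```

Blocks left off the trace are where bookkeeping code is needed: if compaction moves an instruction across the branch in A, the off-trace path through C must compensate for it.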
Reorder buffer can be operand source
Once operand commits, result is found in register
3 fields: instr. type, destination, value
Use reorder buffer number instead of reservation station
Instructions commit in order
As a result, it's easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure 4.34, page 311; surviving labels: FP Op Queue, FP Regs, Res Stations, FP Adder]
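The in-order commit and easy-undo properties of a reorder buffer can be sketched with a small queue model. The three fields follow the slide (instruction type, destination, value); the `done` flag, the class and method names, and the register-file dictionary are assumptions added for illustration, and timing is not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    instr_type: str                   # e.g. "alu", "store", "branch"
    destination: Optional[str]        # register name, or None
    value: Optional[float] = None
    done: bool = False                # assumption: set when execution finishes


class ReorderBuffer:
    def __init__(self) -> None:
        self.entries = deque()        # oldest instruction at the left
        self.registers = {}           # committed (architectural) state

    def issue(self, instr_type: str, destination: Optional[str] = None) -> RobEntry:
        entry = RobEntry(instr_type, destination)
        self.entries.append(entry)
        return entry

    def complete(self, entry: RobEntry, value: Optional[float]) -> None:
        entry.value = value
        entry.done = True

    def commit(self) -> int:
        """Retire finished entries from the head only, i.e. in program order."""
        committed = 0
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            if head.destination is not None:
                self.registers[head.destination] = head.value
            committed += 1
        return committed

    def flush_after(self, entry: RobEntry) -> None:
        """Discard every entry younger than `entry`, e.g. after a misprediction.

        If `entry` is no longer in the buffer, everything is discarded.
        """
        while self.entries and self.entries[-1] is not entry:
            self.entries.pop()


rob = ReorderBuffer()
a = rob.issue("alu", "F0")
br = rob.issue("branch")
c = rob.issue("alu", "F4")            # speculative: issued past the branch

rob.complete(c, 3.0)                  # finishes out of order
print(rob.commit())                   # nothing retires: the head is not done

rob.flush_after(br)                   # branch mispredicted: drop c
rob.complete(a, 1.0)
rob.complete(br, None)
print(rob.commit(), rob.registers)
```

Because the speculative result for F4 only ever lived in the buffer, undoing it is just a matter of dropping the entry; the register file was never touched.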
Limits to ILP
Conflicting studies of amount of parallelism available in late 1980s and early 1990s. Different assumptions about:
Benchmarks (vectorized Fortran FP vs. integer C programs)
Hardware sophistication
Compiler sophistication
Limits to ILP
Initial HW Model here; MIPS compilers
1. Register renaming: infinite virtual registers and all WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided addresses not equal
1 cycle latency for all instructions
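Under this idealized model the only thing that delays an instruction is a true (RAW) dependence on an earlier result, so the available ILP of a trace is its instruction count divided by the length of its longest dependence chain. The sketch below computes that for a register-only trace. The trace encoding, the function name `ideal_ilp`, and the example instructions are hypothetical, and memory dependences are ignored, which assumes no store feeds a later load in the trace.

```python
def ideal_ilp(trace):
    """Instructions per cycle with unlimited issue and 1-cycle latency.

    trace: list of (dest, sources) in program order; dest may be None.
    Only RAW dependences are tracked; WAW and WAR hazards are assumed to be
    removed by renaming, and all branches are assumed perfectly predicted.
    """
    ready = {}        # register -> cycle in which its newest value is produced
    last_cycle = 0
    for dest, sources in trace:
        # Execute one cycle after the latest producer of any source.
        cycle = 1 + max((ready.get(src, 0) for src in sources), default=0)
        if dest is not None:
            ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle if trace else 0.0


# Two independent load/add/store chains that reuse F0.
trace = [
    ("F0", ["R1"]),         # LD   F0,0(R1)
    ("F4", ["F0", "F2"]),   # ADDD F4,F0,F2
    (None, ["F4", "R1"]),   # SD   0(R1),F4
    ("F0", ["R2"]),         # LD   F0,0(R2)   reuse of F0 is hidden by renaming
    ("F8", ["F0", "F2"]),   # ADDD F8,F0,F2
    (None, ["F8", "R2"]),   # SD   0(R2),F8
]

print(ideal_ilp(trace))
```

Six instructions finish in three cycles here, an ILP of 2.0; without renaming, the second load's write to F0 would have to wait for the first ADDD to read it.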
[Figure: chart over Programs; only the values 118.7 and 17.9 survive extraction]
Change from an infinite window of instructions to examine to a window of 2000, with maximum issue of 64 instructions per clock cycle
[Figure: bar chart by program (gcc, espresso, li, fpppp, doduc, tomcatv), one bar per branch predictor: Perfect, Selective predictor, Standard 2-bit, Static, None. Bar values are not recoverable from the extraction.]
[Figure: bar chart by program (gcc, espresso, li, fpppp, doduc, tomcatv), one bar per legend entry: Infinite, 256, 128, 64, 32, None. Bar values are not recoverable from the extraction.]
Change: 2000-instruction window, 64-instruction issue, 8K 2-level prediction, 256 renaming registers
[Figure: bar chart by program (gcc, espresso, li, fpppp, doduc, tomcatv), one bar per legend entry: Perfect, Global/stack Perfect, Inspection, None. Bar values are not recoverable from the extraction.]
Perfect disambiguation (HW), 1K Selective Prediction, 16-entry return, 64 registers, issue as many as window
[Figure: bar chart; surviving legend entries: 64, 32, 16. Bar values are not recoverable from the extraction.]
[Figure: SPECMarks by Benchmark: espresso, eqntott, doduc, wave5, compress, tomcatv, hydro2d, nasa, gcc, sc, swm256, mdljdp2, mdljsp2, su2cor, alvinn, fpppp, spice, ora, ear, li. Values are not recoverable from the extraction.]