
Chapter 3: Pipelining

---

1. Instruction fetch (IF)
-------------------------
IR <- Mem[PC];
NPC <- PC + 4;

2. Instruction decode / register fetch (ID)
-------------------------------------------
A <- Regs[IR6..10];
B <- Regs[IR11..15];
Imm <- ((IR16)^16##IR16..31);

3. Execution / effective address (EX)
-------------------------------------
Memory reference:
ALUOutput <- A + Imm;
Register-Register ALU instruction:
ALUOutput <- A func B;
Register-immediate ALU instruction:
ALUOutput <- A op Imm;
Branch:
ALUOutput <- NPC + Imm;
Cond <- (A op 0);

4. Memory access / branch completion (MEM)
------------------------------------------
Memory reference:
LMD <- Mem[ALUOutput] or
Mem[ALUOutput] <- B;
Branch:
if (Cond) PC <- ALUOutput else PC <- NPC;

5. Write-back (WB)
------------------
Register-register ALU:
Regs[IR16..20] <- ALUOutput;
Register-immediate ALU:
Regs[IR11..15] <- ALUOutput;
Load instruction:
Regs[IR11..15] <- LMD;
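
The bit ranges in the steps above follow the DLX numbering, where bit 0 is the most significant bit of the 32-bit instruction word. A minimal Python sketch of the ID-step field extraction and sign extension under that numbering; the helper names `field` and `sign_extend16` and the example word are made up for illustration:

```python
def field(ir: int, first: int, last: int) -> int:
    """Return bits first..last of a 32-bit word, with bit 0 as the MSB."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def sign_extend16(ir: int) -> int:
    """Imm <- (IR16)^16 ## IR16..31: copy bit 16 into the upper 16 bits."""
    low = field(ir, 16, 31)
    return low - 0x10000 if low & 0x8000 else low


# Hypothetical I-type word: opcode field 0x08, rs1 = 1, rd = 2, immediate = -4
ir = (0x08 << 26) | (1 << 21) | (2 << 16) | 0xFFFC

a_index = field(ir, 6, 10)    # register number read into A
b_index = field(ir, 11, 15)   # register number read into B
imm = sign_extend16(ir)       # signed immediate
```

Expressing every field through one helper keeps the shift arithmetic in a single place, which is where MSB-first numbering is easiest to get wrong.
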

- the use of different caches for instruction and data eliminates the
conflict for a single memory that would arise between instruction
fetch and data memory access for two instructions;
- imbalance among pipe stages reduces performance since the clock can
run no faster than the slowest pipeline stage;
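
A rough way to quantify the cost of imbalance: take the clock period as the slowest stage delay plus a fixed latch overhead. The stage delays (in ns) and the function name below are invented for illustration; only the "slowest stage sets the clock" rule comes from the notes:

```python
def pipelined_speedup(stage_delays_ns, latch_overhead_ns=0.0):
    """Speedup over an unpipelined design whose cycle is the sum of all stages.

    Assumes an ideal pipeline with no stalls (CPI stays at 1).
    """
    unpipelined_cycle = sum(stage_delays_ns)
    pipelined_cycle = max(stage_delays_ns) + latch_overhead_ns
    return unpipelined_cycle / pipelined_cycle


# Same total work (50 ns), split evenly and unevenly across five stages
balanced = pipelined_speedup([10, 10, 10, 10, 10])
unbalanced = pipelined_speedup([6, 8, 16, 12, 8])
```

With these made-up numbers the balanced split reaches the ideal speedup of 5, while the uneven split is held back by its 16 ns stage.
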
- hazards: structural hazards, data hazards, and control hazards;
- data hazards:
  - read after write (RAW);
  - write after write (WAW);
  - write after read (WAR);
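
The three cases can be told apart by comparing the registers two instructions read and write. A small sketch, assuming each instruction has at most one destination register; the `Instr` type and `data_hazards` function are hypothetical names. It reports dependences that could become hazards; whether a given pipeline actually stalls on them depends on its design:

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


def data_hazards(first: Instr, second: Instr) -> Set[str]:
    """Classify dependences where `second` follows `first` in program order."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")      # second reads what first writes
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")      # both write the same register
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")      # second writes what first reads
    return found


add = Instr("R1", ("R2", "R3"))   # ADD R1, R2, R3
sub = Instr("R4", ("R1", "R5"))   # SUB R4, R1, R5
raw_only = data_hazards(add, sub)
```
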
- the process of letting an instruction move from the ID into the EX
stage of the DLX pipeline is called instruction issue. For the DLX
integer pipeline, all the data hazards can be checked during the ID
phase. If a data hazard exists, the instruction is stalled BEFORE
it is issued;
- when a hazard is detected, we need only change the control portion
of the ID/EX pipeline register to all 0s, which happens to be a no-op;
in addition, we simply recirculate the contents of the IF/ID
registers to hold the stalled instruction;
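
The stall mechanism described above (zero the ID/EX control bits, hold IF/ID) can be sketched for one case, a load followed immediately by a use of its result. This is a simplified model, not the DLX control logic: it assumes forwarding handles every other RAW case, and the `IFID`, `IDEX` and `decode_step` names and the control-bit values are invented:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class IFID:
    instr: str               # instruction text, for illustration only
    srcs: Tuple[str, ...]    # source registers found during decode


@dataclass
class IDEX:
    control: int                  # control bits; all 0s behaves as a no-op
    is_load: bool = False
    dest: Optional[str] = None


def decode_step(if_id: IFID, id_ex: IDEX, decoded: IDEX) -> Tuple[IDEX, bool]:
    """One ID step. Returns (next ID/EX contents, whether IF/ID is held)."""
    if id_ex.is_load and id_ex.dest in if_id.srcs:
        # Hazard: send a bubble into EX and recirculate IF/ID.
        return IDEX(control=0), True
    # No hazard: the instruction in IF/ID issues.
    return decoded, False


lw_in_ex = IDEX(control=0b101, is_load=True, dest="R1")   # LW R1, 0(R2)
add_in_id = IFID("ADD R4, R1, R5", ("R1", "R5"))
add_decoded = IDEX(control=0b011, dest="R4")

bubble, hold = decode_step(add_in_id, lw_in_ex, add_decoded)        # stall cycle
issued, hold_again = decode_step(add_in_id, bubble, add_decoded)    # next cycle
```

In the second call the bubble has replaced the load in ID/EX, so the held ADD issues without a further stall.
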
- the simplest method of dealing with branches is to stall the
pipeline as soon as we detect the branch until we reach the MEM
stage, which determines the new PC;
- branch stall: originally 2 cycles (3 if IF needs to be repeated):
    IF    ID    EX    MEM   WB
          IF    stall stall IF    ID
- the number of clock cycles in a branch stall can be reduced by two
steps:
  - find out whether the branch is taken or not taken earlier in
    the pipeline;
  - compute the taken PC earlier;
- both of these steps should be taken as early as possible in the
pipeline;
- with a separate adder to calculate the branch address in ID and a
branch decision also made during ID, there is only a one-clock-cycle
stall on branches:
    IF    ID    EX    MEM   WB
          IF    -     -     -     -
                IF    ID    EX    MEM   WB
- in some machines, branch hazards are even more expensive in clock
cycles than in our example, since the time to evaluate the branch
condition and compute the destination can be even longer;
- large, deeply pipelined machines often have branch penalties of six
or seven clock cycles;
- 67% of the conditional branches are taken on average;
- 85% of the backwards conditional branches are taken on average
(usually because of loops);
- predict-not-taken, or predict-untaken strategy:
  - untaken branch:
      branch  IF    ID    EX    MEM   WB
                    IF    ID    EX    MEM   WB
  - taken branch:
      branch  IF    ID    EX    MEM   WB
                    IF    idle  idle  idle
                          IF    ID    EX    MEM   WB
- because in DLX both the target address and the condition are
evaluated in the same cycle (ID), there is no advantage in using a
predict-taken approach;
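
The effect of these schemes on performance can be estimated as CPI = base CPI + branch frequency x average branch penalty. The sketch below is a back-of-the-envelope model: the 20% branch frequency is an assumed value, all branches are treated as conditional, and other stalls are ignored. Only the 67% taken fraction and the penalties of 3 and 1 cycles come from the notes:

```python
def pipeline_cpi(branch_freq: float, penalty_cycles: float,
                 base_cpi: float = 1.0) -> float:
    """CPI when the only stalls come from branches."""
    return base_cpi + branch_freq * penalty_cycles


BRANCH_FREQ = 0.20      # assumed for illustration; not from the notes
TAKEN_FRACTION = 0.67   # fraction of conditional branches taken (from the notes)

# Stall until MEM resolves the branch, with IF repeated: 3 cycles per branch
cpi_stall_to_mem = pipeline_cpi(BRANCH_FREQ, 3)              # about 1.60

# Branch resolved in ID, always stall: 1 cycle per branch
cpi_stall_in_id = pipeline_cpi(BRANCH_FREQ, 1)               # about 1.20

# Predict-not-taken with resolution in ID: only taken branches pay 1 cycle
cpi_predict_not_taken = pipeline_cpi(BRANCH_FREQ, 1 * TAKEN_FRACTION)
```
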
- another scheme in use in some machines is to use delayed branches,
filling the branch delay slots with instructions that execute whether
or not the branch is taken;
- the limitations on delayed-branch scheduling arise from (1) the
restrictions on the instructions that are scheduled into the delay
slots and (2) our ability to predict at compile time whether a
branch is likely to be taken or not;
- to improve the ability of the compiler to fill branch delay slots,
most machines with conditional branches have introduced a cancelling
or nullifying branch. In a cancelling branch, the instruction
includes the direction that the branch was predicted. When the branch
behaves as predicted, the instruction in the slot is simply executed
as it would normally be. If not, the instruction is turned into a
no-op;
- the advantage of cancelling branches is that they eliminate the
requirements on the instruction placed in the delay slot;
- two ways to statically predict branches: by examination of the
program behavior and by the use of profile information collected
from earlier runs of the program;
- exceptions can be:
  + synchronous or asynchronous:
    - synchronous: occurs at the same place every time the
      program is executed with the same data and memory
      allocation;
    - asynchronous: with the exception of hardware
      malfunctions, asynchronous events are caused by
      devices external to the processor and memory;
  + user requested vs. coerced:
    - user requested: the user task directly asks for it;
    - coerced: caused by some hardware event that is not
      under the control of the user program;
  + user maskable vs. user nonmaskable;
  + within versus between instructions:
    - within: occurs in the middle of executing an
      instruction; usually synchronous;
    - between: events do not prevent instruction
      completion;
  + resume vs. terminate:
    - terminate: the program's execution always stops after
      the exception;
    - resume: the program continues after exception handling;
- if a pipeline provides the ability for the machine to handle the
exception, save the state, and restart without affecting the
execution of the program, the pipeline or machine is said to be
restartable (most machines today are restartable);
- when an exception occurs, the pipeline control can take the
following steps to save the pipeline state safely:
  - force a trap instruction into the pipeline on the next IF;
  - until the trap is taken, turn off all writes for the
    faulting instruction and for all instructions that follow
    in the pipeline (can be done by placing zeros into the
    pipeline latches);
  - after the exception-handling routine in the operating system
    receives control, it immediately saves the PC of the
    faulting instruction. This value will be used to return from
    the exception later;
- when delayed branches are used, it is no longer possible to
re-create the state of the machine with a single PC because the
instructions in the pipeline may not be sequentially related. So, we
need to save and restore as many PCs as the length of the branch delay
plus one;
- any machine with demand paging or IEEE arithmetic trap handlers must
make its exceptions precise;
- exceptions in pipeline stages:
  - IF: page fault on instruction fetch, misaligned access,
    protection violation;
  - ID: undefined or illegal opcode;
  - EX: arithmetic exception;
  - MEM: page fault on data fetch, misaligned access, protection
    violation;
  - WB: none;
- when pipelining a given instruction set is difficult, designers
usually pipeline the microinstruction execution: a microinstruction
is a simple instruction used in sequences to implement a more complex
instruction set. Microinstructions are simple and look a lot like a
RISC instruction set;
- latency of a functional unit: number of intervening clock cycles
between an instruction that produces a result and an instruction
that uses the result; usually equal to the number of stages after EX
in which the instruction produces its result;
- initiation or repeat interval: number of cycles that must elapse
between issuing two operations of a given type;
- Functional unit       Latency   Initiation interval
  ---------------       -------   -------------------
  integer ALU           0         1
  data memory (loads)   1         1
  FP add                3         1
  FP multiply           6         1
  FP divide             14        15 (not pipelined)
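
The two columns answer different questions: latency gives the stall seen by a dependent instruction, and the initiation interval gives how soon the same unit can accept another operation. A sketch using the values from the table above, assuming full forwarding so that latency is the only source of RAW stalls; the function names are hypothetical:

```python
# (latency, initiation interval) per functional unit, from the table above
UNITS = {
    "integer ALU": (0, 1),
    "data memory (loads)": (1, 1),
    "FP add": (3, 1),
    "FP multiply": (6, 1),
    "FP divide": (14, 15),
}


def raw_stalls(producer: str, distance: int = 1) -> int:
    """Stall cycles for a dependent instruction `distance` slots after the producer.

    distance=1 means the consumer is the very next instruction.
    """
    latency, _ = UNITS[producer]
    return max(0, latency - (distance - 1))


def earliest_reissue(unit: str, issued_at: int) -> int:
    """First cycle at which another operation of the same type can issue."""
    _, interval = UNITS[unit]
    return issued_at + interval


stalls_after_fp_add = raw_stalls("FP add")                   # consumer right behind
stalls_with_gap = raw_stalls("FP multiply", distance=4)      # 3 instructions between
next_divide = earliest_reissue("FP divide", issued_at=0)
```
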

- checks made in ID before an instruction can issue: structural hazards,
RAW data hazard, and WAW data hazard;

MIPS R4000
----------
- deeper pipeline allows it to achieve higher clock rates (8 stages);
- in addition to substantially increasing the amount of forwarding
required, this longer latency pipeline increases both the load and
branch delays:
  - load: 2 cycles;
  - branch: 3 cycles;
- the MIPS architecture has a single-cycle delayed branch; the R4000
uses a predict-not-taken strategy for the remaining two cycles;
  - untaken branches: 1 delay slot;
  - taken branches: 1 delay slot + 2 stalls;
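
As a rough estimate of what this costs per branch, the stall cycles beyond the delay slot can be averaged over taken and untaken branches. The sketch assumes the delay slot is always filled with useful work, and reuses the 67% taken figure quoted earlier in these notes, which was not measured on the R4000; the function names are made up:

```python
def r4000_branch_stalls(taken: bool) -> int:
    """Stall cycles after the single delay slot, under predict-not-taken."""
    return 2 if taken else 0


def average_branch_stalls(taken_fraction: float) -> float:
    """Expected stall cycles per branch for a given fraction of taken branches."""
    return (taken_fraction * r4000_branch_stalls(True)
            + (1 - taken_fraction) * r4000_branch_stalls(False))


avg_stalls = average_branch_stalls(0.67)   # about 1.34 cycles per branch
```
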
- four major causes of pipeline stalls in the R4000:
  - load stalls (moderate);
  - branch stalls (high);
  - FP result stalls (high);
  - FP structural stalls (low);
- the frequency of the last two types of stall shows that reducing the
latency of FP operations should be the first target, rather than
more pipelining or replication of the functional units;
- limited parallelism in the instruction stream means that increasing
the number of pipeline stages, called the pipeline depth, will
eventually increase the CPI, due to dependences that require stalls;
- in addition, clock skew and latch overhead combine to limit the
decrease in clock period obtained by further pipelining (experiments
show a decrease in performance for more than 8 stages);
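
The two limits interact: deeper pipelines shorten the clock period but raise the CPI, so time per instruction has a minimum at some depth. The toy model below only illustrates that shape. Its formulas and parameters (logic delay, overhead, stall growth per stage) are assumptions chosen for the example, and it does not reproduce the experiments mentioned above:

```python
def time_per_instruction(depth: int,
                         logic_delay_ns: float = 20.0,
                         overhead_ns: float = 2.0,
                         stall_per_stage: float = 0.2) -> float:
    """Toy model: CPI grows linearly with depth, clock period shrinks as 1/depth.

    overhead_ns stands in for clock skew plus latch overhead per stage.
    """
    cpi = 1.0 + stall_per_stage * (depth - 1)
    clock_period_ns = logic_delay_ns / depth + overhead_ns
    return cpi * clock_period_ns


depths = range(1, 21)
best_depth = min(depths, key=time_per_instruction)
```

With these invented parameters the minimum falls at a moderate depth; past it, the extra stalls outweigh the shrinking clock period.
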
