
COMPUTER ARCHITECTURE AND ORGANIZATION

LECTURE TOPICS

Pipelining
Latency
Throughput
Instruction Pipelining in Detail
Pipeline Stages
Pipeline Hazards
• Structural Hazards
• Data Hazards
• Control Hazards
Pipelining
Instruction Branch
Branch Alternatives
WHAT IS A PIPELINE?

A conduit of pipe, especially one used for the conveyance of water, gas, or petroleum products.

A serial arrangement of processors or a serial arrangement of registers within a processor. Each processor or register performs part of a task and passes results to the next processor; several parts of different tasks can be performed at the same time.
LATENCY

Time from initiation of an operation until its results are available.

Example:
• It takes 8 minutes from the time you enter to get your food served in a fast food place.
THROUGHPUT

Rate at which something happens or gets done.

Example:
• 2 people per minute get served per counter in the fast food place.
LATENCY VS. THROUGHPUT

L: 8 minutes until food is served
T: 2 people per minute

L: 8 minutes until food is served
T: 6 people per minute
WHAT IS PIPELINING?

An implementation technique whereby multiple instructions are overlapped in execution.

Key to making fast CPUs today.

As an instruction goes through a phase in the instruction cycle, the next instruction goes through an earlier phase.
WHAT IS PIPELINING?

Four friends (A, B, C, D) each have one load of clothes to wash, dry, and fold.
“Washer” takes 30 minutes
“Dryer” takes 30 minutes
“Folder” takes 30 minutes
“Stasher” takes 30 minutes to put clothes into drawers
SEQUENTIAL LAUNDRY

[Figure: task order A, B, C, D against time starting at 6 PM in 30-minute steps; each load finishes all four steps before the next load starts.]

• Sequential laundry takes 8 hours for 4 loads.
• What if they implement pipelining?
PIPELINED LAUNDRY

[Figure: task order A, B, C, D against time starting at 6 PM in 30-minute steps; each load starts 30 minutes after the previous one.]

Start the work ASAP!

Pipelined laundry takes 3.5 hours for 4 loads!
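
The two laundry totals can be checked with a short calculation. This is a minimal sketch that assumes every step takes the same 30 minutes and that a new load may start as soon as the washer is free; the function names are illustrative only.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load needs `stages` steps; every later load finishes
    one step after the load before it."""
    return (stages + loads - 1) * stage_minutes


if __name__ == "__main__":
    seq = sequential_minutes(loads=4, stages=4, stage_minutes=30)
    pipe = pipelined_minutes(loads=4, stages=4, stage_minutes=30)
    print(f"sequential: {seq / 60} hours")  # 8.0 hours
    print(f"pipelined:  {pipe / 60} hours")  # 3.5 hours
```

Note that each individual load still takes 2 hours from washer to drawer; only the total for all four loads shrinks.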
HARDWARE REQUIREMENTS

Extra incrementer to update the PC more often (instead of the ALU).

A separate MDR for loads (memory to CPU) and stores (CPU to memory).

High memory bandwidth to accommodate more data to and from memory.
PIPELINING LESSONS 1

[Figure: pipelined laundry timeline, task order A, B, C, D, 6 PM to 9 PM in 30-minute steps.]

Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload

Multiple tasks operating simultaneously using different resources

Potential speedup = Number of pipe stages
PIPELINING LESSONS 2

[Figure: pipelined laundry timeline, task order A, B, C, D, 6 PM to 9 PM in 30-minute steps.]

Pipeline rate limited by slowest pipeline stage

Unbalanced lengths of pipe stages reduce speedup

Time to “fill” pipeline and time to “drain” it reduces speedup

Stall for Dependencies
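
The effect of fill/drain time and of an unbalanced stage can be illustrated numerically. The sketch below is an assumption-laden model, not part of the lecture: the pipeline is clocked at the slowest stage, there are no stalls, and `pipelined_time` and `speedup` are hypothetical helper names.

```python
def pipelined_time(stage_times, tasks):
    """Total time when every stage is clocked at the slowest stage's time."""
    cycle = max(stage_times)
    return (len(stage_times) + tasks - 1) * cycle


def speedup(stage_times, tasks):
    """Unpipelined time (sum of stage times per task) over pipelined time."""
    unpipelined = tasks * sum(stage_times)
    return unpipelined / pipelined_time(stage_times, tasks)


if __name__ == "__main__":
    balanced = [30, 30, 30, 30]
    unbalanced = [20, 40, 30, 30]  # same total work, one slow stage
    # Few tasks: fill and drain dominate, speedup is far below 4.
    print(round(speedup(balanced, 4), 2))
    # Many tasks: speedup approaches the number of stages.
    print(round(speedup(balanced, 1000), 2))
    # The slow 40-minute stage caps the rate even with many tasks.
    print(round(speedup(unbalanced, 1000), 2))
```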


RECAP: INSTRUCTION CYCLE

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load    F        D        O        E        S

• Fetch - fetch the instruction from the memory
• Decode - decode the instruction
• Operand Fetch - get the necessary operands
• Execute - execute the instruction
• Store - store the result to the appropriate location


REVIEW: VISUALIZING PIPELINING

[Figure: four instructions in instruction order against time in clock cycles 1 to 7; each passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous instruction.]
TIMING OF PIPELINE

NON-PIPELINED

F D O E S F D O E S
time →

PIPELINED

F D O E S
  F D O E S
    F D O E S
      F D O E S
time →

F - Fetch
D - Decode
O - Operand Fetch
E - Execute
S - Result Store
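
The pipelined timing chart can be generated for any number of instructions. This is a small sketch under the assumption of an ideal pipeline (one instruction enters per cycle, no stalls); `stage_in_cycle` and `diagram` are illustrative names.

```python
STAGES = "FDOES"  # Fetch, Decode, Operand Fetch, Execute, Result Store


def stage_in_cycle(instr, cycle, stages=STAGES):
    """Stage occupied by instruction `instr` (0-based) in `cycle` (1-based),
    or '-' if the instruction is not in the pipeline during that cycle."""
    position = cycle - 1 - instr
    return stages[position] if 0 <= position < len(stages) else "-"


def diagram(n_instr, stages=STAGES):
    """One text row per instruction, one column per clock cycle."""
    total_cycles = len(stages) + n_instr - 1
    return [
        " ".join(stage_in_cycle(i, c, stages) for c in range(1, total_cycles + 1))
        for i in range(n_instr)
    ]


if __name__ == "__main__":
    for row in diagram(4):
        print(row)
```

Reading a column top to bottom shows which stages are busy in that cycle; in cycle 5 all five stages would be in use if a fifth instruction were present.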
THREE, FOUR & FIVE STAGE RISC PIPELINE

RISC II (three stages):
Fetch Instruction | Decode Inst., Select regs. | Execute Inst., Store result

SPARC MB86900, IBM 801 (four stages):
Fetch Instruction | Decode Inst., Select regs. | Execute Inst. | Store result

MIPS, Intel 486 (five stages):
Fetch Instruction | Decode Inst. | Select regs. | Execute Inst. | Store result
THREE, FOUR & FIVE STAGE RISC PIPELINE

Three stages:
cc       1 2 3 4 5 6 7
stage 1  a b c d e f g
stage 2  - a b c d e f
stage 3  - - a b c d e

Four stages:
cc       1 2 3 4 5 6 7
stage 1  a b c d e f g
stage 2  - a b c d e f
stage 3  - - a b c d e
stage 4  - - - a b c d

Five stages:
cc       1 2 3 4 5 6 7
stage 1  a b c d e f g
stage 2  - a b c d e f
stage 3  - - a b c d e
stage 4  - - - a b c d
stage 5  - - - - a b c
OTHER NUMBER OF STAGES

• INTEL
  – Pentium I: 7 stages
  – Pentium II/III: 12 stages
  – Pentium 4: 22 stages
PIPELINE PERFORMANCE

• In principle, a greater number of stages provides better performance. However:
• It increases the overhead in moving information between stages.
• The complexity of the CPU grows.
• It is difficult to keep a large pipeline at maximum rate because of pipeline hazards.
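
The overhead point can be made concrete with a simple model. The numbers and the model itself are assumptions for illustration: the logic of one instruction is split evenly over the stages, and each stage pays a fixed overhead for passing results to the next stage. The function names are hypothetical.

```python
def cycle_time(total_logic_ns, stages, overhead_ns):
    """Clock period: an even share of the logic plus a fixed per-stage overhead."""
    return total_logic_ns / stages + overhead_ns


def ideal_speedup(total_logic_ns, stages, overhead_ns):
    """Speedup over an unpipelined design, ignoring hazards entirely."""
    return total_logic_ns / cycle_time(total_logic_ns, stages, overhead_ns)


if __name__ == "__main__":
    # Assumed values: 10 ns of logic per instruction, 0.5 ns overhead per stage.
    for n in (5, 10, 20, 40, 1000):
        print(n, round(ideal_speedup(10.0, n, 0.5), 2))
```

Under these assumptions the speedup can never exceed 10 / 0.5 = 20 however many stages are added, and hazards reduce it further.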
PIPELINE HAZARDS

Structural Hazards
• Occur when a certain resource is requested by more than one instruction at the same time.
• Solutions:
  – Duplicate certain resources to avoid structural hazards.
  – Extend the clashing cycle by stopping the whole of the pipeline until both memory accesses are finished (stalling).
EXAMPLE: ONE MEMORY PORT / STRUCTURAL HAZARD

[Figure: Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order against clock cycles 1 to 7, each passing through Ifetch, Reg, ALU, DMem, Reg; the conflict on the single memory port is marked “Structural Hazard”.]
RESOLVING STRUCTURAL HAZARDS

Definition: attempt to use same hardware for two different things at the same time

Solution 1: Wait
• must detect the hazard
• must have mechanism to stall

Solution 2: Throw more hardware at the problem
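
Detecting the one-memory-port hazard amounts to checking whether two instructions need memory in the same cycle. The sketch below assumes a five-stage pipeline (IF, ID, EX, MEM, WB) with one instruction issued per cycle, where every instruction uses memory in IF and only loads and stores use it in MEM; the opcode labels and `memory_conflicts` are illustrative.

```python
MEM_STAGE_OFFSET = 3  # MEM is three cycles after IF in the assumed 5-stage pipeline


def memory_conflicts(program):
    """Map each cycle with more than one memory user to its users.

    `program` is a list of opcode labels in issue order; instruction i
    (0-based) is fetched in cycle i + 1.
    """
    users = {}
    for i, op in enumerate(program):
        fetch_cycle = i + 1
        users.setdefault(fetch_cycle, []).append((i, "IF"))
        if op in ("load", "store"):
            users.setdefault(fetch_cycle + MEM_STAGE_OFFSET, []).append((i, "MEM"))
    return {cycle: u for cycle, u in users.items() if len(u) > 1}


if __name__ == "__main__":
    program = ["load", "alu", "alu", "alu", "alu"]
    # The load's data access and the fetch of the fourth instruction collide.
    print(memory_conflicts(program))
```

With a stall mechanism, the later instruction's fetch would be delayed by one cycle; with separate instruction and data memories, the conflict set would be empty by construction.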
DETECTING AND RESOLVING STRUCTURAL HAZARD

[Figure: Load, Instr 1, Instr 2, a Stall row of bubbles, then Instr 3, against clock cycles 1 to 7; the fetch of Instr 3 is delayed by the stall.]
ELIMINATING STRUCTURAL HAZARDS AT DESIGN TIME

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; labels include Next PC, Next SEQ PC, Adder, Instr Cache, Reg File (RS1, RS2, RD), Sign Extend, Imm, Zero?, ALU, Data Cache, WB Data, and several MUXes; data path and control path are marked.]
ROLE OF INSTRUCTION SET DESIGN IN STRUCTURAL HAZARD RESOLUTION

Simple to determine the sequence of resources used by an instruction
• opcode tells it all

Uniformity in the resource usage

Compare MIPS to IA32?

MIPS approach => all instructions flow through same 5-stage pipeline
PIPELINE HAZARDS

Data Hazards
• Occur when an instruction depends on the result of a previous instruction that has not yet completed.
• Could be avoided by using a technique called “forwarding” or “bypassing”.
DATA HAZARDS

[Figure: instruction order against time in clock cycles, with stages IF, ID/RF, EX, MEM, WB; every instruction after the add reads r1.]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
THREE GENERIC DATA HAZARDS

Read After Write (RAW)
Instr J tries to read operand before Instr I writes it

I: add r1,r2,r3
J: sub r4,r1,r3

Caused by a “Data Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
THREE GENERIC DATA HAZARDS

Write After Read (WAR)
Instr J writes operand before Instr I reads it

I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

Can’t happen in MIPS 5 stage pipeline because:
• All instructions take 5 stages, and
• Reads are always in stage 2, and
• Writes are always in stage 5
THREE GENERIC DATA HAZARDS

Write After Write (WAW)
Instr J writes operand before Instr I writes it.

I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.

Can’t happen in MIPS 5 stage pipeline because:
• All instructions take 5 stages, and
• Writes are always in stage 5

Will see WAR and WAW in later more complicated pipes
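
The three cases can be told apart by comparing the registers two instructions read and write. The sketch below is illustrative only: instructions are modelled as `(dest, src1, src2)` tuples, `classify` is a hypothetical helper, and it reports dependences (potential hazards); whether one becomes an actual hazard depends on the pipeline, as noted above for WAR and WAW in the MIPS 5 stage pipeline.

```python
def classify(instr_i, instr_j):
    """Dependences of the later instruction J on the earlier instruction I.

    Each instruction is a (dest, src1, src2) tuple of register names.
    """
    i_dest, *i_srcs = instr_i
    j_dest, *j_srcs = instr_j
    found = []
    if i_dest in j_srcs:
        found.append("RAW")  # J reads what I writes
    if j_dest in i_srcs:
        found.append("WAR")  # J writes what I reads
    if i_dest == j_dest:
        found.append("WAW")  # J writes what I writes
    return found


if __name__ == "__main__":
    # I: add r1,r2,r3   J: sub r4,r1,r3
    print(classify(("r1", "r2", "r3"), ("r4", "r1", "r3")))  # ['RAW']
    # I: sub r4,r1,r3   J: add r1,r2,r3
    print(classify(("r4", "r1", "r3"), ("r1", "r2", "r3")))  # ['WAR']
    # I: sub r1,r4,r3   J: add r1,r2,r3
    print(classify(("r1", "r4", "r3"), ("r1", "r2", "r3")))  # ['WAW']
```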
FORWARDING TO AVOID DATA HAZARD

[Figure: instruction order against time in clock cycles; each instruction passes through Ifetch, Reg, ALU, DMem, Reg, and the result of the add is forwarded to the later instructions that read r1.]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
HW CHANGE FOR FORWARDING

[Figure: forwarding datapath; labels: NextPC, Registers, ID/EX, ALU, EX/MEM, Data Memory, MEM/WR, Immediate, and three muxes.]
DATA HAZARD EVEN WITH FORWARDING

[Figure: instruction order against time in clock cycles; each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
RESOLVING THIS LOAD HAZARD

Adding hardware? ... not
Detection?
Compilation techniques?
What is the cost of load delays?
RESOLVING THE LOAD DATA HAZARD

[Figure: instruction order against time in clock cycles; a bubble is inserted after the lw, delaying sub and the instructions that follow it.]

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9

How is this different from the instruction issue stall?
SOFTWARE SCHEDULING TO AVOID LOAD HAZARDS

Try producing fast code for
    a = b + c;
    d = e – f;
assuming a, b, c, d, e, and f in memory.

Slow code:          Fast code:
LW  Rb,b            LW  Rb,b
LW  Rc,c            LW  Rc,c
ADD Ra,Rb,Rc        LW  Re,e
SW  a,Ra            ADD Ra,Rb,Rc
LW  Re,e            LW  Rf,f
LW  Rf,f            SW  a,Ra
SUB Rd,Re,Rf        SUB Rd,Re,Rf
SW  d,Rd            SW  d,Rd
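
The benefit of the reordering can be counted. The sketch below assumes forwarding with a one-cycle load delay, so a stall occurs only when the instruction immediately after a load reads the loaded register; the tuple encoding `(opcode, dest, sources)` and `load_use_stalls` are illustrative.

```python
def load_use_stalls(program):
    """Count stalls where an instruction reads a register loaded by the
    instruction directly before it (assumed one-cycle load delay)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "LW" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", []), ("LW", "Rc", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]

FAST = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
    ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]

if __name__ == "__main__":
    print("slow code stalls:", load_use_stalls(SLOW))  # 2
    print("fast code stalls:", load_use_stalls(FAST))  # 0
```

Both sequences contain the same eight instructions; the fast version only moves independent loads and the store between each load and its first use.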
INSTRUCTION SET CONNECTION

What is exposed about this organizational hazard in the instruction set?

k cycle delay?
• bad, CPI is not part of ISA

k instruction slot delay
• load should not be followed by use of the value in the next k instructions

Nothing, but code can reduce run-time delays

MIPS did the transformation in the assembler
PIPELINE HAZARDS

Control Hazards
• produced by branch instructions
• decision made by partially executed instruction affects currently loading instruction
CONTROL HAZARD ON BRANCHES: THREE STAGE STALL

[Figure: instruction order against time in clock cycles; each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]

10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
EXAMPLE: BRANCH STALL IMPACT

If 30% of instructions are branches, a 3-cycle stall is significant

Two part solution:
• Determine branch taken or not sooner, AND
• Compute taken branch address earlier

MIPS branch tests if register = 0 or ≠ 0

MIPS Solution:
• Move Zero test to ID/RF stage
• Adder to calculate new PC in ID/RF stage
• 1 clock cycle penalty for branch versus 3
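
The impact of the branch stall can be put into numbers. This sketch assumes an ideal CPI of 1 apart from branch stalls and that every branch pays the full penalty; `cpi_with_branch_stalls` is an illustrative name.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles, base_cpi=1.0):
    """Average cycles per instruction when each branch adds `stall_cycles`."""
    return base_cpi + branch_fraction * stall_cycles


if __name__ == "__main__":
    three_cycle = cpi_with_branch_stalls(0.30, 3)  # about 1.9
    one_cycle = cpi_with_branch_stalls(0.30, 1)    # about 1.3
    print(f"3-cycle stall: CPI = {three_cycle:.2f}")
    print(f"1-cycle stall: CPI = {one_cycle:.2f}")
    print(f"gain from resolving branches earlier: {three_cycle / one_cycle:.2f}x")
```

Under these assumptions a 3-cycle stall nearly doubles the CPI, which is why the branch decision and target computation are moved into the ID/RF stage.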
PIPELINED MIPS DATAPATH

[Figure: five-stage datapath: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; labels include Next PC, Next SEQ PC, two Adders, Zero?, Reg File (RS1, RS2, RD), Sign Extend, Imm, ALU, Memory, WB Data, and several MUXes.]

• Data stationary control
  – local decode for each instruction phase / pipeline stage
FOUR BRANCH HAZARD ALTERNATIVES

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken
• Execute successor instructions in sequence
• “Squash” instructions in pipeline if branch actually taken
• Advantage of late pipeline state update
• 47% MIPS branches not taken on average
• PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken
• 53% MIPS branches taken on average
• But haven’t calculated branch target address in MIPS
  – MIPS still incurs 1 cycle branch penalty
  – Other machines: branch target known before outcome
FOUR BRANCH HAZARD ALTERNATIVES

#4: Delayed Branch
• Define branch to take place AFTER a following instruction

    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n      (branch delay of length n)
    ........
    branch target if taken

• 1 slot delay allows proper decision and branch target address in 5 stage pipeline
• MIPS uses this
DELAYED BRANCH

Where to get instructions to fill branch delay slot?
• Before branch instruction
• From the target address: only valuable when branch taken
• From fall through: only valuable when branch not taken
• Canceling branches allow more slots to be filled

Compiler effectiveness for single branch delay slot:
• Fills about 60% of branch delay slots
• About 80% of instructions executed in branch delay slots useful in computation
• About 50% (60% x 80%) of slots usefully filled

Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
RECALL: SPEED UP EQUATION FOR PIPELINING
EXAMPLE: EVALUATING BRANCH ALTERNATIVES

Assume: Conditional & Unconditional = 14%, 65% change PC

Scheduling scheme    Branch penalty    CPI     Speedup v. stall
Stall pipeline       3                 1.42    1.0
Predict taken        1                 1.14    1.26
Predict not taken    1                 1.09    1.29
Delayed branch       0.5               1.07    1.31
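
The CPI column can be reproduced from the stated assumptions. This is a sketch, assuming a base CPI of 1, CPI = 1 + branch frequency × penalty, and that under predict-not-taken only the 65% of branches that change the PC pay the penalty; the names below are illustrative. Speedups computed from unrounded CPIs may differ from the tabulated values in the second decimal place.

```python
BRANCH_FREQ = 0.14     # conditional and unconditional branches
TAKEN_FRACTION = 0.65  # branches that change the PC


def cpi(penalty, fraction_paying=1.0):
    """Base CPI of 1 plus the average branch penalty per instruction."""
    return 1.0 + BRANCH_FREQ * fraction_paying * penalty


SCHEMES = {
    "Stall pipeline": cpi(3),
    "Predict taken": cpi(1),
    "Predict not taken": cpi(1, TAKEN_FRACTION),
    "Delayed branch": cpi(0.5),
}

if __name__ == "__main__":
    stall_cpi = SCHEMES["Stall pipeline"]
    for name, value in SCHEMES.items():
        print(f"{name:18s} CPI = {value:.2f}  speedup v. stall = {stall_cpi / value:.2f}")
```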
END OF LECTURE…
