Computer Architecture and Organization

COMPUTER ARCHITECTURE AND ORGANIZATION
LECTURE TOPICS
Pipelining
Latency
Throughput
Instruction Pipelining in Detail
Pipeline Stages
Pipeline Hazards
 • Structural Hazards
 • Data Hazards
 • Control Hazards
Pipelining
Instruction Branch
Branch Alternatives
WHAT IS A PIPELINE?
• A conduit of pipe, especially one used for the conveyance of water, gas, or petroleum products.
• A serial arrangement of processors or a serial arrangement of registers within a processor. Each processor or register performs part of a task and passes results to the next processor; several parts of different tasks can be performed at the same time.
LATENCY
Time from initiation of an operation until its results are available.
Example:
• It takes 8 minutes from the time you enter to get your food served in a fast food place.
THROUGHPUT
Rate at which something happens or gets done.
Example:
• 2 people per minute get served per counter in the fast food place.
LATENCY VS. THROUGHPUT
[Figure: three counters side by side]
Each counter: L: 8 minutes until food is served, T: 2 people per minute
All three counters: L: 8 minutes until food is served, T: 6 people per minute
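The fast-food numbers can be checked with a few lines of Python. This is a minimal sketch: the function name `combined` is hypothetical, and it assumes the figures describe three identical counters, each with an 8-minute latency and a throughput of 2 people per minute.

```python
def combined(counters, latency_min, rate_per_min):
    """Return (latency, throughput) for several identical counters working in parallel."""
    # Adding counters does not make any single order faster...
    latency = latency_min
    # ...but the people-per-minute rates of the counters add up.
    throughput = counters * rate_per_min
    return latency, throughput

print(combined(1, 8, 2))  # one counter
print(combined(3, 8, 2))  # three counters
```

Latency stays at 8 minutes however many counters are open; only throughput scales.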
WHAT IS PIPELINING?
An implementation technique whereby multiple instructions are overlapped in execution.
Key to making fast CPUs today.
As an instruction goes through a phase in the instruction cycle, the next instruction goes through an earlier phase.
WHAT IS PIPELINING?
Four friends (A, B, C, D) each have one load of clothes to wash, dry, and fold
“Washer” takes 30 minutes
“Dryer” takes 30 minutes
“Folder” takes 30 minutes
“Stasher” takes 30 minutes to put clothes into drawers
SEQUENTIAL LAUNDRY
[Figure: tasks A, B, C, D in task order on a time axis starting at 6 PM; each load takes four 30-minute steps, one load after another]
• Sequential laundry takes 8 hours for 4 loads.
• What if they implement pipelining?
PIPELINED LAUNDRY
[Figure: tasks A, B, C, D in task order on a time axis starting at 6 PM; each load starts 30 minutes after the previous one, seven 30-minute steps in total]
Start the work ASAP!
Pipelined laundry takes 3.5 hours for 4 loads!
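The two laundry totals follow from simple counting. The sketch below is illustrative only; the function and constant names are assumptions, and every stage is taken to last exactly 30 minutes as stated above.

```python
STAGE_MIN = 30   # washer, dryer, folder, stasher each take 30 minutes
STAGES = 4
LOADS = 4

def sequential_minutes(loads, stages, stage_min):
    # Each load runs through all stages before the next load starts.
    return loads * stages * stage_min

def pipelined_minutes(loads, stages, stage_min):
    # The first load needs all stages; each later load finishes
    # one stage time after the load in front of it.
    return (stages + loads - 1) * stage_min

print(sequential_minutes(LOADS, STAGES, STAGE_MIN) / 60)  # hours, sequential
print(pipelined_minutes(LOADS, STAGES, STAGE_MIN) / 60)   # hours, pipelined
```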
HARDWARE REQUIREMENTS
• Extra incrementer to update the PC more often (instead of the ALU).
• A separate MDR for loads (memory to CPU) and stores (CPU to memory).
• High memory bandwidth to accommodate more data to and from memory.
PIPELINING LESSONS 1
[Figure: pipelined laundry, tasks A to D in task order, 6 PM to 9:30 PM in seven 30-minute steps]
• Pipelining doesn’t help latency of single task, it helps throughput of entire workload
• Multiple tasks operating simultaneously using different resources
• Potential speedup = Number pipe stages
PIPELINING LESSONS 2
[Figure: pipelined laundry, tasks A to D in task order, 6 PM to 9:30 PM in seven 30-minute steps]
• Pipeline rate limited by slowest pipeline stage
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” pipeline and time to “drain” it reduce speedup
• Stall for Dependencies
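These lessons can be made concrete with a small model. The sketch below assumes a clocked pipeline in which every stage advances at the pace of the slowest stage; the names and the unbalanced stage times are hypothetical values chosen for illustration.

```python
def pipelined_time(n_tasks, stage_times):
    # A clocked pipeline advances at the pace of its slowest stage.
    cycle = max(stage_times)
    # Fill and drain cost (stages - 1) extra cycles on top of n_tasks.
    return (len(stage_times) + n_tasks - 1) * cycle

def speedup(n_tasks, stage_times):
    sequential = n_tasks * sum(stage_times)
    return sequential / pipelined_time(n_tasks, stage_times)

balanced = [30, 30, 30, 30]
unbalanced = [30, 40, 30, 20]   # hypothetical: same total work, uneven stages

print(speedup(4, balanced))       # fill and drain keep this well below 4
print(speedup(1000, balanced))    # approaches the number of stages
print(speedup(1000, unbalanced))  # held back by the 40-minute stage
```

With only four tasks the fill and drain time dominates; with many tasks the balanced pipeline approaches a speedup equal to the number of stages, while the unbalanced one cannot.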

RECAP: INSTRUCTION CYCLE
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load    F        D        O        E        S
• Fetch - fetch the instruction from the memory
• Decode - decode the instruction
• Operand Fetch - get the necessary operands
• Execute - execute the instruction
• Store - store the result to the appropriate location

REVIEW: VISUALIZING PIPELINING
Time (clock cycles): Cycle 1 to Cycle 7
[Figure: four instructions in instruction order, each passing through Ifetch, Reg, ALU, DMem, Reg; each instruction starts one cycle after the previous one]
TIMING OF PIPELINE
NON-PIPELINED
F D O E S F D O E S
time
PIPELINED
F D O E S
  F D O E S
    F D O E S
      F D O E S
time
F - Fetch
D - Decode
O - Operand Fetch
E - Execute
S - Result Store
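The staggered timing can be generated programmatically. The following is a minimal sketch; `pipeline_rows` is a hypothetical helper, and it assumes one instruction enters the pipeline per cycle with no stalls.

```python
STAGES = "FDOES"  # Fetch, Decode, Operand Fetch, Execute, Store

def pipeline_rows(n_instr, stages=STAGES):
    """Return one text row per instruction, shifted one cycle per instruction."""
    total_cycles = len(stages) + n_instr - 1
    rows = []
    for i in range(n_instr):
        cells = ["."] * total_cycles      # "." marks a cycle with no activity
        for s, name in enumerate(stages):
            cells[i + s] = name           # instruction i is in stage s at cycle i + s
        rows.append(" ".join(cells))
    return rows

for row in pipeline_rows(4):
    print(row)

print("pipelined cycles:", len(STAGES) + 4 - 1)
print("non-pipelined cycles:", len(STAGES) * 4)
```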
THREE, FOUR & FIVE STAGE RISC PIPELINE
RISC II (three stages):
  Fetch Instruction | Decode Inst., Select regs. | Execute Inst., Store result
SPARC MB86900, IBM 801 (four stages):
  Fetch Instruction | Decode Inst., Select regs. | Execute Inst. | Store result
MIPS, Intel 486 (five stages):
  Fetch Instruction | Decode Inst. | Select regs. | Execute Inst. | Store result
THREE, FOUR & FIVE STAGE RISC PIPELINE
Three stages:
cc     1 2 3 4 5 6 7
stage
1      a b c d e f g
2      - a b c d e f
3      - - a b c d e
Four stages:
cc     1 2 3 4 5 6 7
stage
1      a b c d e f g
2      - a b c d e f
3      - - a b c d e
4      - - - a b c d
Five stages:
cc     1 2 3 4 5 6 7
stage
1      a b c d e f g
2      - a b c d e f
3      - - a b c d e
4      - - - a b c d
5      - - - - a b c
OTHER NUMBER OF STAGES
• INTEL
  – Pentium I: 7 stages
  – Pentium II/III: 12 stages
  – Pentium 4: 22 stages
PIPELINE PERFORMANCE
• In principle, a greater number of stages provides better performance. However:
  – It increases the overhead in moving information between stages.
  – The complexity of the CPU grows.
  – It is difficult to keep a large pipeline at maximum rate because of pipeline hazards.
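The effect of stage overhead can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the 10 ns of total logic delay, the 1 ns of register overhead per stage, and the perfectly even split of logic across stages. Hazards are ignored.

```python
def speedup(stages, logic_ns=10.0, overhead_ns=1.0):
    """Speedup over an unpipelined design whose cycle time is logic_ns."""
    # Each stage does 1/stages of the logic plus a fixed register overhead.
    cycle_ns = logic_ns / stages + overhead_ns
    return logic_ns / cycle_ns

for k in (5, 10, 20, 40):
    print(k, round(speedup(k), 2))
```

In this model the speedup keeps rising with the stage count but never reaches `logic_ns / overhead_ns`, so each additional stage buys less than the one before it.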
PIPELINE HAZARDS
Structural Hazards
• Occur when a certain resource is requested by more than one instruction at the same time
• Solutions:
  – Duplicate certain resources to avoid structural hazards.
  – Extend the clashing cycle by stopping the whole of the pipeline until both memory accesses are finished (stalling).
EXAMPLE: ONE MEMORY PORT / STRUCTURAL HAZARD
Time (clock cycles)
[Figure: Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order, each passing through Ifetch, Reg, ALU, DMem, Reg; with one memory port, the DMem access of the Load and the Ifetch of Instr 3 fall in the same cycle, marked "Structural Hazard"]
RESOLVING STRUCTURAL HAZARDS
Definition:attempt to use same hardware for

two different things at the same time
Solution 1: Wait
 must detect the hazard
 must have mechanism to
stall
Solution 2: Throw more hardware at the
problem
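Detecting the single-memory-port conflict comes down to comparing which cycle each instruction needs memory in. The sketch below is illustrative: it assumes a five-stage IF, ID, EX, MEM, WB pipeline, one instruction issued per cycle, and only `lw`/`sw` touching data memory. The function name and the instruction mix are hypothetical.

```python
IF_STAGE, MEM_STAGE = 0, 3   # stage positions in IF, ID, EX, MEM, WB

def memory_port_conflicts(program):
    """program: opcode names in issue order, one issued per cycle.
    Returns (memory_instr_index, fetching_instr_index) pairs that
    need the single memory port in the same cycle."""
    conflicts = []
    for i, op in enumerate(program):
        if op in ("lw", "sw"):
            cycle = i + MEM_STAGE      # cycle in which instruction i accesses data memory
            j = cycle - IF_STAGE       # instruction that would be fetched in that cycle
            if j < len(program):
                conflicts.append((i, j))
    return conflicts

program = ["lw", "add", "sub", "and", "or"]   # hypothetical instruction mix
print(memory_port_conflicts(program))
```

A stall mechanism would delay the fetch named in each pair by one cycle; a second memory port or separate instruction and data memories would remove the conflict altogether.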
DETECTING AND RESOLVING STRUCTURAL HAZARD
Time (clock cycles)
[Figure: the same instruction sequence with a stall inserted; a row of bubbles takes the place of the conflicting instruction fetch, and the following instruction is fetched one cycle later]
ELIMINATING STRUCTURAL HAZARDS AT DESIGN TIME
[Figure: pipelined datapath with a separate Instr Cache and Data Cache. Labels: Next PC, Next SEQ PC, Adder, 4, MUX, Reg File (RS1, RS2, RD), Sign Extend, Imm, ALU, Zero?, Address, WB Data, and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Data path and control path are both shown.]
ROLE OF INSTRUCTION SET DESIGN IN STRUCTURAL HAZARD RESOLUTION
Simple to determine the sequence of resources used by an instruction
• opcode tells it all
Uniformity in the resource usage
Compare MIPS to IA32?
MIPS approach => all instructions flow through same 5-stage pipeline
PIPELINE HAZARDS
Data Hazards
• Occur when an instruction depends on the result of a previous instruction that has not yet completed.
• Can often be avoided by using a technique called “forwarding” or “bypassing”.
DATA HAZARDS
Time (clock cycles); stages: IF, ID/RF, EX, MEM, WB
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Figure: the five instructions in instruction order, each passing through Ifetch, Reg, ALU, DMem, Reg and starting one cycle after the previous one; every instruction after the add reads r1, which the add writes in its WB stage]
THREE GENERIC DATA HAZARDS
Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a “Data Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
THREE GENERIC DATA HAZARDS
Write After Read (WAR)
InstrJ writes operand before InstrI reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
Can’t happen in MIPS 5 stage pipeline because:
• All instructions take 5 stages, and
• Reads are always in stage 2, and
• Writes are always in stage 5
THREE GENERIC DATA HAZARDS
Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
Can’t happen in MIPS 5 stage pipeline because:
• All instructions take 5 stages, and
• Writes are always in stage 5
Will see WAR and WAW in later more complicated pipes
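The three cases differ only in which register of the earlier instruction overlaps with which register of the later one. A minimal sketch of that check follows; the helper names `parse` and `dependences` are hypothetical, and it handles only three-register instructions of the form used in the examples above. It reports dependences between two instructions; whether a dependence becomes a hazard depends on the pipeline.

```python
def parse(instr):
    """'add r1,r2,r3' -> ('r1', ['r2', 'r3']): destination and source registers."""
    _op, regs = instr.split(None, 1)
    dest, *srcs = [r.strip() for r in regs.split(",")]
    return dest, srcs

def dependences(instr_i, instr_j):
    """Name the dependences from the earlier instr_i to the later instr_j."""
    dest_i, srcs_i = parse(instr_i)
    dest_j, srcs_j = parse(instr_j)
    found = []
    if dest_i in srcs_j:
        found.append("RAW")   # J reads what I writes
    if dest_j in srcs_i:
        found.append("WAR")   # J writes what I reads
    if dest_i == dest_j:
        found.append("WAW")   # J writes what I writes
    return found

print(dependences("add r1,r2,r3", "sub r4,r1,r3"))
print(dependences("sub r4,r1,r3", "add r1,r2,r3"))
print(dependences("sub r1,r4,r3", "add r1,r2,r3"))
```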
FORWARDING TO AVOID DATA HAZARD
Time (clock cycles)
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Figure: the five instructions in instruction order, each passing through Ifetch, Reg, ALU, DMem, Reg; the result of the add is forwarded to the later instructions that use r1]
HW CHANGE FOR FORWARDING
[Figure: datapath with forwarding paths. Labels: NextPC, Registers, Immediate, ALU, Data Memory, three muxes, and the pipeline registers ID/EX, EX/MEM, MEM/WR]
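DATA HAZARD EVEN WITH FORWARDING
Time (clock cycles)
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
[Figure: the four instructions in instruction order, each passing through Ifetch, Reg, ALU, DMem, Reg; the lw produces r1 only at the end of its DMem stage, but the sub needs it at the start of its ALU stage in that same cycle]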
RESOLVING THIS LOAD
HAZARD
Adding hardware? ... not
Detection?
Compilation techniques?
What is the cost of load delays?
RESOLVING THE LOAD DATA HAZARD
Time (clock cycles)
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
[Figure: the four instructions in instruction order; a bubble is inserted into the sub, the and, and the or, so each is delayed by one cycle after the lw]
How is this different from the instruction issue stall?
SOFTWARE SCHEDULING TO AVOID LOAD HAZARDS
Try producing fast code for
    a = b + c;
    d = e – f;
assuming a, b, c, d, e, and f in memory.
Slow code:          Fast code:
LW  Rb,b            LW  Rb,b
LW  Rc,c            LW  Rc,c
ADD Ra,Rb,Rc        LW  Re,e
SW  a,Ra            ADD Ra,Rb,Rc
LW  Re,e            LW  Rf,f
LW  Rf,f            SW  a,Ra
SUB Rd,Re,Rf        SUB Rd,Re,Rf
SW  d,Rd            SW  d,Rd
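The difference between the two sequences can be counted. The sketch below assumes a pipeline with forwarding in which a load followed immediately by a use of the loaded register costs one stall cycle, and nothing else stalls. The helper names are hypothetical, and the sixth instruction of the fast sequence is taken to be `SW a,Ra`.

```python
def parse(line):
    """Return (opcode, destination register or None, source registers)."""
    op, operands = line.split(None, 1)
    args = [a.strip() for a in operands.split(",")]
    if op == "LW":
        return op, args[0], []        # LW Rx,var writes Rx
    if op == "SW":
        return op, None, [args[1]]    # SW var,Rx reads Rx
    return op, args[0], args[1:]      # ADD/SUB Rd,Rs,Rt

def load_use_stalls(code):
    """Count instructions that use a register loaded by the instruction just before."""
    stalls = 0
    prev = None
    for line in code:
        op, dest, srcs = parse(line)
        if prev is not None and prev[0] == "LW" and prev[1] in srcs:
            stalls += 1
        prev = (op, dest, srcs)
    return stalls

slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

print(load_use_stalls(slow), load_use_stalls(fast))
```

Both sequences contain the same eight instructions; the reordering simply places an independent instruction between each load and the first use of its result.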
INSTRUCTION SET CONNECTION
What is exposed about this organizational hazard in the instruction set?
k cycle delay?
• bad, CPI is not part of ISA
k instruction slot delay
• load should not be followed by use of the value in the next k instructions
Nothing, but code can reduce run-time delays
MIPS did the transformation in the assembler
PIPELINE HAZARDS
Control Hazards
• produced by branch instructions
• decision made by partially executed instruction affects currently loading instruction
CONTROL HAZARD ON BRANCHES: THREE STAGE STALL
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
[Figure: the five instructions, each passing through Ifetch, Reg, ALU, DMem, Reg; the three instructions after the beq (14, 18, 22) enter the pipeline before the branch target at 36 is fetched]
EXAMPLE: BRANCH STALL IMPACT
If 30% of instructions are branches, a 3-cycle stall is significant
Two part solution:
• Determine branch taken or not sooner, AND
• Compute taken branch address earlier
MIPS branch tests if register = 0 or ≠ 0
MIPS Solution:
• Move Zero test to ID/RF stage
• Adder to calculate new PC in ID/RF stage
• 1 clock cycle penalty for branch versus 3
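A quick calculation shows why the stall matters. The sketch below assumes an ideal CPI of 1 when there are no stalls and that every branch pays the full penalty; the function name is hypothetical.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles, base_cpi=1.0):
    # Every branch adds stall_cycles on top of the ideal CPI.
    return base_cpi + branch_fraction * stall_cycles

three_cycle = cpi_with_branch_stalls(0.30, 3)  # branch resolved late in the pipeline
one_cycle = cpi_with_branch_stalls(0.30, 1)    # zero test and adder moved to ID/RF

print(round(three_cycle, 2), round(one_cycle, 2))
```

Under these assumptions the 3-cycle stall nearly doubles the CPI, while the 1-cycle penalty raises it by about a third.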
PIPELINED MIPS DATAPATH
Stages: Instruction Fetch | Instr. Decode, Reg. Fetch | Execute, Addr. Calc | Memory Access | Write Back
[Figure: pipelined datapath with the Zero? test and a second Adder in the decode stage. Labels: Next PC, Next SEQ PC, Adder, 4, MUX, Reg File (RS1, RS2, RD), Sign Extend, Imm, ALU, Address, Memory, Data Memory, WB Data, and the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
• Data stationary control
  – local decode for each instruction phase / pipeline
FOUR BRANCH HAZARD ALTERNATIVES
• #1: Stall until branch direction is clear
• #2: Predict Branch Not Taken
  – Execute successor instructions in sequence
  – “Squash” instructions in pipeline if branch actually taken
  – Advantage of late pipeline state update
  – 47% MIPS branches not taken on average
  – PC+4 already calculated, so use it to get next instruction
• #3: Predict Branch Taken
  – 53% MIPS branches taken on average
  – But haven’t calculated branch target address in MIPS
    • MIPS still incurs 1 cycle branch penalty
    • Other machines: branch target known before outcome
FOUR BRANCH HAZARD ALTERNATIVES
#4: Delayed Branch
• Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n      (branch delay of length n)
    ........
    branch target if taken
• 1 slot delay allows proper decision and branch target address in 5 stage pipeline
• MIPS uses this
DELAYED BRANCH
Where to get instructions to fill branch delay slot?
• Before branch instruction
• From the target address: only valuable when branch taken
• From fall through: only valuable when branch not taken
• Canceling branches allow more slots to be filled
Compiler effectiveness for single branch delay slot:
• Fills about 60% of branch delay slots
• About 80% of instructions executed in branch delay slots useful in computation
• About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
RECALL: SPEED UP EQUATION FOR PIPELINING
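A commonly used form of the speedup equation, stated here under the assumptions that the ideal CPI is 1 and that branch stalls are the only source of stall cycles, is:

```latex
\text{CPI} = 1 + \text{Branch frequency} \times \text{Branch penalty}
\qquad
\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}
```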
EXAMPLE: EVALUATING BRANCH ALTERNATIVES
Assume: Conditional & Unconditional = 14%, 65% change PC

Scheduling scheme     Branch penalty   CPI    speedup v. stall
Stall pipeline        3                1.42   1.0
Predict taken         1                1.14   1.26
Predict not taken     1                1.09   1.29
Delayed branch        0.5              1.07   1.31
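The CPI column can be reproduced from the stated assumptions. This sketch assumes an ideal CPI of 1 and, as one reading of the numbers, that predict-not-taken pays its penalty only on the 65% of branches that change the PC. The names used are hypothetical.

```python
BRANCH_FRACTION = 0.14   # conditional and unconditional branches
TAKEN_FRACTION = 0.65    # share of branches that change the PC

def cpi(penalty, only_when_taken=False):
    # Ideal CPI of 1 plus the average branch penalty per instruction.
    share = TAKEN_FRACTION if only_when_taken else 1.0
    return 1.0 + BRANCH_FRACTION * share * penalty

schemes = {
    "Stall pipeline": cpi(3),
    "Predict taken": cpi(1),
    "Predict not taken": cpi(1, only_when_taken=True),
    "Delayed branch": cpi(0.5),
}

stall_cpi = schemes["Stall pipeline"]
for name, value in schemes.items():
    print(f"{name:18s} CPI {value:.2f}  ratio to stall CPI {stall_cpi / value:.2f}")
```

The CPI values match the table. The direct CPI ratios come out near 1.25, 1.30, and 1.33, slightly different from the tabulated 1.26, 1.29, and 1.31; the gap is small and probably reflects rounding of intermediate figures.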
END OF LECTURE…
