RISC
Instruction Set Implementation Alternatives
== using MIPS as example ==
TU/e 5kk73
Henk Corporaal Bart Mesman
Topics
MIPS ISA: Instruction Set Architecture
MIPS single cycle implementation
MIPS multi-cycle implementation
MIPS pipelined implementation
Pipeline hazards
Recap of RISC principles
Other architectures
Based on the book: ch2-4 (4th ed)
Many slides; I'll go quick and skip some
Arithmetic
Control flow
MIPS arithmetic
Most instructions have 3 operands
Operand order is fixed (destination first)
Example:
C code: A = B + C
MIPS code: add $s0, $s1, $s2
($s0, $s1 and $s2 are associated with variables by compiler)
MIPS arithmetic
C code: A = B + C + D;
        E = F - A;
MIPS code: add $t0, $s1, $s2
           add $s0, $t0, $s3
           sub $s4, $s5, $s0
Operands must be registers, only 32 registers provided Design Principle: smaller is faster. Why?
Arithmetic instruction operands must be registers, only 32 registers provided Compiler associates variables with registers What about programs with lots of variables ?
CPU
register file
Memory
IO
H.Corporaal EmbProcArch 5kk73 6
Register allocation
Compiler tries to keep as many variables in registers as possible
Some variables cannot be allocated to registers:
large arrays (too few registers)
aliased variables (variables accessible through pointers in C)
dynamically allocated variables (heap, stack)
Memory Organization
Viewed as a large, single-dimension array, with an address
A memory address is an index into the array
"Byte addressing" means that successive addresses are one byte apart
Address   Contents
0         8 bits of data
1         8 bits of data
2         8 bits of data
3         8 bits of data
4         8 bits of data
5         8 bits of data
6         8 bits of data
...
Memory Organization
Bytes are nice, but most data items use larger "words"
For MIPS, a word is 32 bits or 4 bytes.
Address   Contents
0         32 bits of data
4         32 bits of data
8         32 bits of data
12        32 bits of data
...
2^32 bytes with byte addresses from 0 to 2^32-1
2^30 words with byte addresses 0, 4, 8, ... 2^32-4
Words are aligned
What are the least 2 significant bits of a word address?
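The alignment question can be checked with a few lines of Python. This is only a sketch: the helper names `word_address` and `is_word_aligned` are made up for illustration and are not part of the slides.

```python
def word_address(word_index: int) -> int:
    """Byte address of the word with the given index (4 bytes per word)."""
    return word_index * 4


def is_word_aligned(byte_address: int) -> bool:
    """A word address is aligned when its two least significant bits are 0."""
    return byte_address & 0b11 == 0


# Byte addresses of the first four words: 0, 4, 8, 12.
first_words = [word_address(i) for i in range(4)]

# Byte address of the last of the 2^30 words.
last_word = word_address(2**30 - 1)
```

Every address produced by `word_address` is a multiple of 4, so its two low bits are always 00.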
C code:
A[8] = h + A[8];
MIPS code:
lw $t0, 32($s3)
add $t0, $s2, $t0
sw $t0, 32($s3)
Store word operation has no destination (reg) operand
Remember arithmetic operands are registers, not memory!
void swap(int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}
swap: muli
      add
      lw
      lw
      sw
      sw
      jr
Explanation:
index k: $5
base address of v: $4
address of v[k] is $4 + 4*$5
Machine Language
Instructions, like registers and words of data, are also 32 bits long
Example: add $t0, $s1, $s2
Registers have numbers: $t0=8, $s1=17, $s2=18
op       rs       rt       rd       shamt    funct
000000   10001    10010    01000    00000    100000
6 bits   5 bits   5 bits   5 bits   5 bits   6 bits
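As a rough illustration of the R-type layout, the fields can be packed into a 32-bit word as follows. The function name `encode_r_type` is hypothetical; the register numbers follow the register table later in these slides ($t0-$t7 are 8-15, $s0-$s7 are 16-23).

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int, op: int = 0) -> int:
    """Pack the six R-type fields (6+5+5+5+5+6 bits) into one 32-bit word."""
    fields = ((op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6))
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


# add $t0, $s1, $s2: rs = $s1 = 17, rt = $s2 = 18, rd = $t0 = 8, funct = 100000
word = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=0b100000)
bits = format(word, "032b")
```

Splitting `bits` at positions 6, 11, 16, 21 and 26 gives back the six fields shown in the table above.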
Machine Language
What would the regularity principle have us do?
New principle: Good design demands a compromise
I-type for data transfer instructions
(the other format was R-type, for register)
op    rs    rt    16 bit number
35    18    9     32
Control
alter the control flow, i.e., change the "next" instruction to be executed
MIPS conditional branch instructions:
bne $t0, $t1, Label
beq $t0, $t1, Label
Example:
if (i==j) h = i + j;
       bne $s0, $s1, Label
       add $s3, $s0, $s1
Label: ....
Control
Example:
if (i!=j) h=i+j;
else h=i-j;
      beq $s4, $s5, Lab1
      add $s3, $s4, $s5
      j Lab2
Lab1: sub $s3, $s4, $s5
Lab2: ...
So far:
Instruction          Meaning
add $s1,$s2,$s3      $s1 = $s2 + $s3
sub $s1,$s2,$s3      $s1 = $s2 - $s3
lw $s1,100($s2)      $s1 = Memory[$s2+100]
sw $s1,100($s2)      Memory[$s2+100] = $s1
bne $s4,$s5,L        Next instr. is at Label if $s4 != $s5
beq $s4,$s5,L        Next instr. is at Label if $s4 = $s5
j Label              Next instr. is at Label
Formats:
R    op    rs    rt    rd    shamt    funct
I    op    rs    rt    16 bit address
J    op    26 bit address
Control Flow
Can use this instruction (slt, set on less than) to build "blt $s1, $s2, Label"
can now build general control structures
Note that the assembler needs a register to do this, use conventions for registers
Constants
Small constants are used quite frequently (50% of operands)
e.g., A = A + 5; B = B + 1; C = C - 18;
Solutions? Why not?
put 'typical constants' in memory and load them
create hard-wired registers (like $zero) for constants like one
MIPS Instructions:
addi $29, $29, 4
slti $8, $18, 10
andi $29, $29, 6
ori  $29, $29, 4
We'd like to be able to load a 32 bit constant into a register
Must use two instructions; new "load upper immediate" instruction
lui $t0, 1010101010101010
$t0 after lui: 1010101010101010 0000000000000000 (lower half filled with zeros)
Then must get the lower order bits right, i.e.,
ori $t0, $t0, 1010101010101010
      1010101010101010 0000000000000000
ori   0000000000000000 1010101010101010
      ---------------------------------
      1010101010101010 1010101010101010
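The two-instruction sequence can be mimicked in Python to see how the halves combine. This is a sketch only: `lui`, `ori` and `load_constant` are illustrative helpers that model the register value, not the instruction encoding.

```python
def lui(imm16: int) -> int:
    """Load upper immediate: place the 16-bit value in the upper half, zeros below."""
    return (imm16 & 0xFFFF) << 16


def ori(reg_value: int, imm16: int) -> int:
    """OR immediate: the 16-bit immediate is zero-extended before the OR."""
    return (reg_value | (imm16 & 0xFFFF)) & 0xFFFFFFFF


def load_constant(value: int) -> int:
    """Build a 32-bit constant with lui followed by ori."""
    upper = (value >> 16) & 0xFFFF
    lower = value & 0xFFFF
    return ori(lui(upper), lower)


after_lui = lui(0b1010101010101010)
after_ori = ori(after_lui, 0b1010101010101010)
```

Because `lui` leaves the lower 16 bits at zero, the OR simply drops the lower half into place.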
Instructions:
bne $t4,$t5,Label    Next instruction is at Label if $t4 != $t5
beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
j Label              Next instruction is at Label
Formats:
I    op    rs    rt    16 bit address
J    op    26 bit address
Addresses are not 32 bits
How do we handle this with load and store instructions?
Instructions:
bne $t4,$t5,Label    Next instruction is at Label if $t4 != $t5
beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
Formats:
I    op    rs    rt    16 bit address
use Instruction Address Register (PC = program counter)
most branches are local (principle of locality)
To summarize:
MIPS assembly language
Category            Instruction             Example              Meaning                                  Comments
Arithmetic          add                     add $s1, $s2, $s3    $s1 = $s2 + $s3                          Three operands; data in registers
                    subtract                sub $s1, $s2, $s3    $s1 = $s2 - $s3                          Three operands; data in registers
                    add immediate           addi $s1, $s2, 100   $s1 = $s2 + 100                          Used to add constants
Data transfer       load word               lw $s1, 100($s2)     $s1 = Memory[$s2 + 100]                  Word from memory to register
                    store word              sw $s1, 100($s2)     Memory[$s2 + 100] = $s1                  Word from register to memory
                    load byte               lb $s1, 100($s2)     $s1 = Memory[$s2 + 100]                  Byte from memory to register
                    store byte              sb $s1, 100($s2)     Memory[$s2 + 100] = $s1                  Byte from register to memory
                    load upper immediate    lui $s1, 100         $s1 = 100 * 2^16                         Loads constant in upper 16 bits
Conditional branch  branch on equal         beq $s1, $s2, 25     if ($s1 == $s2) go to PC + 4 + 100
                    branch on not equal     bne $s1, $s2, 25     if ($s1 != $s2) go to PC + 4 + 100
                    set on less than        slt $s1, $s2, $s3    if ($s2 < $s3) $s1 = 1; else $s1 = 0
                    set less than immediate slti $s1, $s2, 100   if ($s2 < 100) $s1 = 1; else $s1 = 0
Unconditional jump  jump                    j 2500               go to 10000                              Jump to target address
                    jump register           jr $ra               go to $ra                                For switch, procedure return
                    jump and link           jal 2500             $ra = PC + 4; go to 10000                For procedure call
[Figure: MIPS addressing modes, showing operands in a register, in memory (byte, halfword or word), and word addresses formed from the PC]
MIPS Datapath
Building a datapath
FSM or Microprogramming
Generic Implementation:
use the program counter (PC) to supply instruction address
get the instruction from memory
read registers
use the instruction to decide exactly what to do
All instructions use the ALU after reading the registers
Why?
[Figure: abstract view of the implementation with PC, Instruction, Register # and ALU]
elements that operate on data values (combinational)
elements that contain state (sequential)
State Elements
Truth table:
R   S   Q
0   0   Q
0   1   1
1   0   0
1   1   ?
Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
Change of state (value) is based on the clock
Latches: state changes whenever the inputs change and the clock is asserted
Flip-flop: state changes only on a clock edge (edge-triggered methodology)
A clocking methodology defines when signals can be read and written
wouldn't want to read a signal at the same time it was being written
D-latch
Two inputs:
the data value to be stored (D)
the clock signal (C) indicating when to read & store D
Two outputs:
the value of the internal state (Q) and its complement
D flip-flop
Our Implementation
read contents of some state elements,
send values through some combinational logic,
write results to one or more state elements
[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]
Register File
[Figure: register file with two read ports (Read reg. #1, Read reg. #2; a mux selects the read data from registers 0 ... n-1, n) and one write port (Write reg. # goes through an n-to-1 decoder; Register data is written into D flip-flops clocked by C)]
All of the logic is combinational
We wait for everything to settle down, and the right thing to be done
ALU might not produce right answer right away
we use write signals along with clock to determine when to write
Control
Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Information comes from the 32 bits of the instruction
Instruction Format:
op       rs       rt       rd       shamt    funct
000000   10001    10010    01000    00000    100000
Control 2 (input: instruction register bits 31-26) produces the 2-bit ALUop:
00: lw, sw
01: beq
10: add, sub, and, or, slt
Control 1 (inputs: ALUop and Funct. bits 5-0) produces the 3-bit ALUcontrol for the ALU:
000: and
001: or
010: add
110: sub
111: set on less than
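The two-level decoding can be written down as a small lookup. This is a sketch under assumptions: the function name `alu_control` is made up, and the funct codes for sub, and, or and slt are the standard MIPS encodings, which the slides do not list (only add's 100000 appears there). It also assumes lw/sw use the ALU to add (address computation) and beq uses it to subtract (comparison).

```python
# Standard MIPS funct codes mapped to the 3-bit ALUcontrol values from the slide.
FUNCT_TO_ALU_CONTROL = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # set on less than
}


def alu_control(alu_op: int, funct: int = 0) -> int:
    """Derive the 3-bit ALU control value from the 2-bit ALUop and the funct field."""
    if alu_op == 0b00:   # lw, sw: add base register and offset
        return 0b010
    if alu_op == 0b01:   # beq: subtract and test for zero
        return 0b110
    if alu_op == 0b10:   # arithmetic: the funct field decides
        return FUNCT_TO_ALU_CONTROL[funct]
    raise ValueError(f"unused ALUop value {alu_op:02b}")
```

For lw and sw the funct argument is ignored, which corresponds to don't-care entries in a truth table for this logic.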
[Figure: single-cycle datapath with ALU control; instruction fields [25-21], [20-16], [15-11], [15-0] and [5-0] feed the register file, the sign extend unit and the ALU control]
ALU Control1
What should the ALU do with this instruction
example: lw $1, 100($2)
op    rs    rt    16 bit offset
35    2     1     100
ALU Control1
given instruction type
00 = lw, sw
01 = beq
10 = arithmetic
function code for arithmetic
[Truth table: inputs are ALUOp and funct bits F5-F0; F5 is X (don't care) in all rows]
ALU Control1
[Figure: ALU control logic with inputs F3, F2, F1 and F0 of the funct field F(5-0)]
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
Determine these control signals directly from the opcodes:
R-format: 0
lw: 35
sw: 43
beq: 4
Control 2
[Figure: control logic with decoded inputs R-format, lw, sw, beq and outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]
memory (2ns), ALU and adders (2ns), register file access (1ns)
[Figure: single-cycle datapath with control signals PCSrc, RegWrite, RegDst, MemRead and ALUOp]
Memory (2ns), ALU & adders (2ns), reg. file access (1ns)
Fixed length clock: longest instruction is the lw which requires 8 ns
Variable clock length (not realistic, just as exercise):
R-type: 6 ns
lw: 8 ns
sw: 7 ns
beq: 5 ns
j: 2 ns
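The per-instruction times follow from adding up the delays of the units each instruction class uses. The sketch below assumes the usual breakdown (instruction fetch, register read, ALU, data memory, register write); the table `UNITS_USED` and the unit names are illustrative, not taken from the slides.

```python
# Unit delays in ns, as given on the slide.
DELAY_NS = {
    "instruction memory": 2,
    "register read": 1,
    "ALU": 2,
    "data memory": 2,
    "register write": 1,
}

# Assumed list of units on the path of each instruction class.
UNITS_USED = {
    "R-type": ["instruction memory", "register read", "ALU", "register write"],
    "lw": ["instruction memory", "register read", "ALU", "data memory", "register write"],
    "sw": ["instruction memory", "register read", "ALU", "data memory"],
    "beq": ["instruction memory", "register read", "ALU"],
    "j": ["instruction memory"],
}


def instruction_time_ns(kind: str) -> int:
    """Total delay of one instruction when its units are used back to back."""
    return sum(DELAY_NS[unit] for unit in UNITS_USED[kind])


times = {kind: instruction_time_ns(kind) for kind in UNITS_USED}
fixed_clock_ns = max(times.values())  # the single-cycle clock must fit the slowest instruction
```

With a fixed clock every instruction pays for the lw path, which is the motivation for the multicycle approach that follows.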
what if we had a more complicated instruction like floating point?
wasteful of area: NO sharing of hardware resources
One Solution:
use a smaller cycle time
have different instructions take different numbers of cycles
a multicycle datapath:
[Figure: multicycle datapath with internal registers IR and MDR]
Multicycle Approach
ALU used to compute address and to increment PC
Memory used for instruction and data
Add registers after every major functional unit
Our control signals will not be determined solely by instruction
a set of states and
next state function (determined by current state and the input)
output function (determined by current state and possibly input)
[Figure: the current state and the inputs feed the next-state function, which produces the next state on the clock, and the output function, which produces the outputs]
Multicycle Approach
balance the amount of work to be done
restrict each cycle to use only one major functional unit
store values for use in later cycles (easiest thing to do)
introduce additional internal registers
Notice: we distinguish
processor state: programmer visible registers
internal state: programmer invisible registers (like IR, MDR, A, B, and ALUout)
Multicycle Approach
[Figure: multicycle datapath with PC, memory, instruction register, memory data register, registers, A, B, ALU, ALUOut, sign extend and shift left 2]
Multicycle Approach
Tclock > max (ALU delay, Memory access, Regfile access)
See book for complete picture
Instruction Fetch
Instruction Decode and Register Fetch
Execution, Memory Address Computation, or Branch Completion
Memory Access or R-type instruction completion
Write-back step

Use PC to get instruction and put it in the Instruction Register
Increment the PC by 4 and put the result back in the PC
Can be described succinctly using RTL "Register-Transfer Language"
IR = Memory[PC];
PC = PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Read registers rs and rt in case we need them
Compute the branch address in case the instruction is a branch
Previous two actions are done optimistically!!
RTL:
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);
We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
ALU is performing one of four functions, based on instruction type
Memory Reference:
ALUOut = A + sign-extend(IR[15-0]);
R-type:
ALUOut = A op B;
Branch:
if (A == B) PC = ALUOut;
Jump:
PC = PC[31-28] || (IR[25-0]<<2);
The write actually takes place at the end of the cycle on the edge
Write-back step
Reg[IR[20-16]]= MDR;
What about all the other instructions?
Step name
Instruction fetch
Instruction decode/register fetch
Execution, address computation, branch/jump completion
Memory access or R-type completion
Memory read completion
Execution step, R-type: ALUOut = A op B
Execution step, jumps: PC = PC[31-28] || (IR[25-0]<<2)
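The jump RTL concatenates the upper 4 bits of the PC with the 26-bit address field shifted left by 2. A minimal sketch in Python, with a hypothetical helper name `jump_target`; `pc` stands for the already incremented PC, since the PC was updated in the fetch step.

```python
def jump_target(pc: int, ir: int) -> int:
    """PC[31-28] || (IR[25-0] << 2): keep the top 4 PC bits, append the word address."""
    upper = pc & 0xF0000000                # PC[31-28]
    lower = (ir & 0x03FFFFFF) << 2         # IR[25-0] shifted to a byte address (28 bits)
    return upper | lower


# "j 2500" with the PC in the lowest 256 MB region: opcode 2 in bits 31-26, address field 2500
ir_j_2500 = (2 << 26) | 2500
target = jump_target(0x00000044, ir_j_2500)
```

This matches the summary table entry where `j 2500` means "go to 10000".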
Simple Questions
How many cycles will it take to execute this code?
    lw $t2, 0($t3)
    lw $t3, 4($t3)
    beq $t2, $t3, L1
    add $t5, $t2, $t3
    sw $t5, 8($t3)
L1: ...
What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?
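One way to approach the first question is to add up the steps each instruction needs. The cycle counts below are an assumption derived from the five-step table (lw uses all five steps, sw and R-type four, beq three); the names `CYCLES` and `total_cycles` are illustrative.

```python
# Assumed cycles per instruction on the multicycle datapath.
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}


def total_cycles(opcodes: list[str]) -> int:
    """Sum the cycle counts of the instructions that are actually executed."""
    return sum(CYCLES[op] for op in opcodes)


# If the branch is not taken, all five instructions execute.
not_taken = total_cycles(["lw", "lw", "beq", "add", "sw"])

# If the branch is taken, add and sw are skipped.
taken = total_cycles(["lw", "lw", "beq"])
```

The result depends on the branch outcome, which in turn depends on the data loaded by the two lw instructions.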
Use the information we have accumulated to specify a finite state machine (FSM)
State 0, Instruction fetch: MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
State 1, Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, Memory access (load): MemRead, IorD = 1
State 4, Write-back step: RegDst = 0, RegWrite, MemtoReg = 1
State 5, Memory access (store): MemWrite, IorD = 1
State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
State 9, Jump completion (Op = 'J'): PCWrite, PCSource = 10
Implementation:
[Figure: control logic block with inputs Op5-Op0 (opcode) and S3-S0 (state register), and outputs PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst and next state NS3-NS0]
PLA Implementation
(see book)
[Figure: PLA with inputs opcode (Op5 ...) and current state (... S2 S1 S0); outputs are the datapath control signals PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst and the next state NS3-NS0]
If I picked a horizontal or vertical line could you explain it?
What type of FSM is used? Mealy or Moore?
Pipelined implementation
Pipelining
Improve performance by increasing instruction throughput
[Figure: execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Without pipelining each instruction (Instruction fetch, Reg, ALU, Data access, Reg) takes 8 ns and a new one starts every 8 ns. With pipelining a new instruction starts every 2 ns.]
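The timing in the figure can be reproduced with two small formulas. This sketch assumes five pipeline stages of 2 ns each (the slowest stage sets the stage time); the function names are illustrative.

```python
def nonpipelined_ns(n_instructions: int, instruction_ns: int = 8) -> int:
    """Each instruction runs to completion before the next one starts."""
    return n_instructions * instruction_ns


def pipelined_ns(n_instructions: int, stages: int = 5, stage_ns: int = 2) -> int:
    """The first instruction needs all stages; each further one finishes one stage time later."""
    return (stages + n_instructions - 1) * stage_ns


three_lw_nonpipelined = nonpipelined_ns(3)
three_lw_pipelined = pipelined_ns(3)
```

For long instruction sequences the ratio approaches 8 ns / 2 ns = 4, not 5, because the pipelined stage time is set by the slowest stage.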
Pipelining
Pipelining
What makes it easy:
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
What makes it hard?
structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction
We'll build a simple pipeline and look at these issues
We'll talk about modern processors and what really makes it hard:
[Figure: datapath with PC, registers, sign extend and multiplexors]
Pipelined Datapath
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?
Corrected Datapath
[Figure: corrected pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]
[Figure: pipeline diagram over clock cycles CC 1 to CC 6; each instruction passes through IM, Reg, ALU, DM, Reg]
how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
use this representation to help understand datapaths
Pipeline Control
[Figure: pipelined datapath with control signals PCSrc, RegWrite, Branch, MemRead, RegDst, ALUOp and ALU control]
Pipeline control
Instruction Fetch and PC Increment
Instruction Decode / Register Fetch
Execution
Memory Stage
Write Back
a fancy control center telling everyone what to do?
should we use a finite state machine?
Pipeline Control
              Execution/Address Calculation     Memory access             Write-back
              stage control lines               stage control lines       stage control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format      1       1       0       0         0       0        0         1         0
lw            0       0       0       1         0       1        0         1         1
sw            X       0       0       1         0       0        1         0         X
beq           X       0       1       0         1       0        0         0         X
[Figure: the control values (EX, M, WB) travel with the instruction through the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]
[Figure: pipelined datapath with control signals RegWrite, MemWrite, ALUSrc, Branch, MemRead, RegDst, ALUOp and MemtoReg]
Structural: same resource is needed multiple times in the same cycle
Data: data dependencies limit pipelining
Control: next executed instruction may not be the next specified instruction
Structural hazards
Examples:
Two accesses to a single ported memory
Two operations need the same function unit at the same time
Two operations need the same function unit in successive cycles, but the unit is not pipelined
Solutions:
stalling
add more hardware
[Figure: pipeline diagram; successive instructions each pass through IF, ID, EX, MEM, WB, starting one cycle apart]
Data hazards
Data dependencies:
Hardware solution:
Data dependences
Three types: RaW, WaR and WaW
add r1, r2, 5    ; r1 := r2+5
sub r4, r1, r3   ; RaW of r1

add r1, r2, 5
sub r2, r4, 1    ; WaR of r2

add r1, r2, 5
sub r1, r1, 1    ; WaW of r1

st r1, 5(r2)     ; M[r2+5] := r1
ld r5, 0(r4)     ; RaW if 5+r2 = 0+r4
WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom!
Problems for your compiler and Pentium!
use register renaming to solve this!
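The three register dependence types can be detected by comparing destination and source registers of two instructions in program order. The sketch below covers registers only (not the memory dependence of the st/ld pair); the `Instr` type and the `dependences` function are illustrative names.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str          # register written
    srcs: tuple        # registers read


def dependences(first: Instr, second: Instr) -> set:
    """Return the dependence types of `second` on the earlier instruction `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RaW")   # second reads what first writes
    if second.dest in first.srcs:
        found.add("WaR")   # second overwrites what first still reads
    if first.dest == second.dest:
        found.add("WaW")   # both write the same register
    return found


add = Instr(dest="r1", srcs=("r2",))             # add r1, r2, 5
raw = dependences(add, Instr("r4", ("r1", "r3")))  # sub r4, r1, r3
war = dependences(add, Instr("r2", ("r4",)))       # sub r2, r4, 1
waw = dependences(add, Instr("r1", ("r1",)))       # sub r1, r1, 1
```

Note that the third pair carries a RaW dependence in addition to the WaW one, because the sub also reads r1.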
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below; the later instructions read $2, which is written by the sub]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Forwarding
Use temporary results, don't wait for them to be written
[Figure: pipeline diagram with forwarding for and $12, $2, $5; or $13, $6, $2; add $14, $2, $2, tracking the value of register $2 and the values in EX/MEM and MEM/WB]
Forwarding hardware
ALU forwarding circuitry principle:
Note: there are two options
buf - ALU - bypass mux - buf
buf - bypass mux - ALU - buf
Forwarding
[Figure: pipelined datapath with forwarding unit; it compares Rs and Rt of the instruction in EX with EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the ForwardA and ForwardB multiplexors at the ALU inputs]
Forwarding check
Check for matching register-ids:
For each source-id of operation in the EX-stage check if there is a matching pending dest-id
Example:
if (EX/MEM.RegWrite)
and (EX/MEM.RegisterRd != 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs)
then ForwardA = 10
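The check for the first ALU operand might look as follows in Python. The EX/MEM case is the one given above; the MEM/WB case and its encoding 01 are an assumption (the usual textbook convention), and the function name `forward_a` is made up for this sketch.

```python
def forward_a(ex_mem_reg_write: bool, ex_mem_rd: int,
              mem_wb_reg_write: bool, mem_wb_rd: int,
              id_ex_rs: int) -> int:
    """Select the source of the first ALU operand (2-bit mux control)."""
    # Most recent result first: value still in the EX/MEM pipeline register.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return 0b10
    # Assumed second case: value in the MEM/WB pipeline register.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return 0b01
    # No pending write to this register: use the register file value.
    return 0b00
```

Register 0 is excluded because $zero always reads as 0, so a pending "write" to it must not be forwarded. The same check with ID/EX.RegisterRt would drive ForwardB.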
[Figure: pipeline diagram over CC 1 to CC 9 (IM, Reg, ALU, DM, Reg) for the sequence below]
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
Stalling
We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram over CC 1 to CC 10 for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7, with a bubble inserted after the lw]
In CC4 the ALU is not used, Reg and IM are redone
[Figure: pipelined datapath with forwarding unit and PCWrite control; register numbers compared include IF/ID.RegisterRs, IF/ID.RegisterRt, ID/EX.RegisterRt, EX/MEM.RegisterRd and MEM/WB.RegisterRd]
Have compiler guarantee that no hazards occur
Example: where do we insert the NOPs?
sub $2, $1, $3
nop
nop
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
nop
sw $15, 100($2)
Problem: this really slows us down!
Control hazards
branch
jump
call (jump and link)
return
(exception/interrupt and rti / return from interrupt)
Branch example
[Figure: pipeline diagram over CC 1 to CC 9 for the sequence below; the number before each instruction is its address]
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
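The branch target in this example follows from PC-relative addressing: the 16-bit offset counts words relative to the instruction after the branch. A small sketch, with illustrative helper names:

```python
def sign_extend16(imm: int) -> int:
    """Interpret a 16-bit field as a signed two's complement number."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(branch_address: int, imm16: int) -> int:
    """Target = (address of branch + 4) + 4 * signed word offset."""
    return branch_address + 4 + 4 * sign_extend16(imm16)


# beq $1, $3, 7 at address 40
target = branch_target(40, 7)
```

This is why the instruction executed after a taken branch in the example is the lw at address 72.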
Branching
Squash pipeline:
When we decide to branch, other instructions are in the pipeline!
We are predicting "branch not taken"
[Figure: clock-cycle diagram of a branch to L and the instructions fetched after it (IF, ID, EX)]
Branch speedup
Earlier address computation
Earlier condition calculation
Put both in the ID pipeline stage:
adder
comparator
[Figure: clock-cycle diagram and pipelined datapath with forwarding unit, sign extend and the branch logic in the ID stage]
Exception support
Types of exceptions:
Overflow
I/O device request
Operating system call
Undefined instruction
Hardware malfunction
Page fault
Precise exception:
finish previous instructions (which are still in the pipeline)
flush excepting and following instructions, redo them after handling the exception(s)
Exceptions
Changes needed for handling overflow exception of an operation in EX stage (see book for details):
Extend PC input mux with extra entry with fixed address
Add EPC register recording the ID/EX stage PC
E.g., in case of overflow exception insert 3 bubbles; flush the following stages:
IF/ID stage
ID/EX stage
EX/MEM stage
Scheduling, why?
Let's look at the execution time:
Texecution = Ncycles x Tcycle = Ninstructions x CPI x Tcycle
Scheduling may reduce Texecution
Reduce CPI (cycles per instruction)
  early scheduling of long latency operations
  avoid pipeline stalls due to structural, data and control hazards
  allow Nissue > 1 and therefore CPI < 1
Reduce Ninstructions
  compact many operations into each instruction (VLIW)
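The execution time formula is easy to try out with numbers. The values below (1000 instructions, 2 ns cycle, CPI of 1.5 versus 1.0) are made up purely for illustration.

```python
def execution_time_ns(n_instructions: int, cpi: float, t_cycle_ns: float) -> float:
    """Texecution = Ninstructions x CPI x Tcycle."""
    return n_instructions * cpi * t_cycle_ns


# Hypothetical program: stalls raise the CPI from 1.0 to 1.5.
with_stalls = execution_time_ns(1000, cpi=1.5, t_cycle_ns=2.0)
without_stalls = execution_time_ns(1000, cpi=1.0, t_cycle_ns=2.0)
```

With the instruction count and cycle time fixed, any stall cycles a scheduler removes translate directly into a shorter execution time.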
Code:
a = b + c
d = e - f
Unscheduled code:
Lw R1,b
Lw R2,c
Add R3,R1,R2     interlock
Sw a,R3
Lw R1,e
Lw R2,f
Sub R4,R1,R2     interlock
Sw d,R4
Scheduled code:
Lw R1,b
Lw R2,c
Lw R5,e          extra reg. needed!
Add R3,R1,R2
Lw R2,f
Sw a,R3
Sub R4,R5,R2
Sw d,R4
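The effect of scheduling can be checked by counting load-use interlocks. This sketch assumes a pipeline with forwarding in which only an instruction directly after a load that reads the loaded register stalls (one cycle); the tuple layout and the function name `load_use_stalls` are illustrative.

```python
def load_use_stalls(program: list) -> int:
    """Count instructions that read a register loaded by the instruction just before them."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, cur_srcs = current
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


# Each entry: (operation, destination register or None, source registers)
unscheduled = [
    ("lw", "R1", ()), ("lw", "R2", ()), ("add", "R3", ("R1", "R2")), ("sw", None, ("R3",)),
    ("lw", "R1", ()), ("lw", "R2", ()), ("sub", "R4", ("R1", "R2")), ("sw", None, ("R4",)),
]
scheduled = [
    ("lw", "R1", ()), ("lw", "R2", ()), ("lw", "R5", ()), ("add", "R3", ("R1", "R2")),
    ("lw", "R2", ()), ("sw", None, ("R3",)), ("sub", "R4", ("R5", "R2")), ("sw", None, ("R4",)),
]
```

Under this model the unscheduled code stalls twice, at the two places marked "interlock", and the scheduled code not at all, at the price of the extra register R5.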
Modern processors tend to have large branch penalty, Pbranch, due to:
Early computation of new PC
Early determination of condition
Visible branch delay slots filled by compiler (MIPS)
Branch prediction
Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro95]
Remove branches: if-conversion
  Conditional instructions: CMOVE, cond skip next
  Guarding all instructions: TriMedia
the next instruction after a branch is always executed
rely on compiler to fill the slot with something useful
[Figure: code example of filling the branch delay slot; the branch target is L: op 3]
Summary
Modern processors are (deeply) pipelined, to reduce Tcycle and aim at CPI = 1
Hazards increase CPI
Several software and hardware measures are taken to avoid or reduce hazards
Not discussed, but important developments:
Multi-issue further reduces CPI
Branch prediction to avoid high branch penalties
Dynamic scheduling
In all cases: a scheduling compiler needed
Recap of MIPS
load-store architecture enables pipelining
uniform (no distinction between e.g. address and data registers)
know directly where the following instruction starts
Limited number of instruction formats
Memory alignment restrictions
......
Based on quantitative analysis
"the famous MIPS one percent rule": don't even think about it when it's not used more than one percent
Register space
32 integer (and 32 floating point) registers of 32-bit
Name       Register number   Usage
$zero      0                 the constant value 0
$v0-$v1    2-3               values for results and expression evaluation
$a0-$a3    4-7               arguments
$t0-$t7    8-15              temporaries
$s0-$s7    16-23             saved (by callee)
$t8-$t9    24-25             more temporaries
$gp        28                global pointer
$sp        29                stack pointer
$fp        30                frame pointer
$ra        31                return address
Addressing
[Figure: MIPS addressing modes, showing operands in a register, in memory (byte, halfword or word), and word addresses formed from the PC]
Instruction format
R    op    rs    rt    rd    shamt    funct
I    op    rs    rt    16 bit address
J    op    26 bit address
Meaning
$s1 = $s2 + $s3
$s2 = $s3 + 4
$s1 = Memory[$s2+100]
if $s4<>$s5 goto L
goto Label
Pipelining
All integer instructions fit into the following pipeline
time ->
IF   ID   EX   MEM
     IF   ID   EX
          IF   ID
               IF
Accumulator architecture
one operand (in register or memory), accumulator almost always implicitly used
Stack
zero operand: all operands implicit (on TOS)
Register (load-store)
three operands, all in registers
loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode)
Register-Memory
two operands, one in memory
Memory-Memory
three operands, may be all in memory
Accumulator architecture
[Figure: accumulator architecture with Memory, address, latch and Accumulator]
Stack architecture
[Figure: stack architecture with latches, top of stack, ALU, stack pointer and Memory]
push c     (stack: c, b)
add        (stack: b+c)
pop a
Accumulator    Register-Memory    Memory-Memory    Register (load-store)
Load A         Load r1,A          Add C,B,A        Load r1,A
Add B          Add r1,B                            Load r2,B
Store C        Store C,r1                          Add r3,r1,r2
                                                   Store C,r3