MIPS Details

Embedded Processor Architecture
RISC
Instruction Set Implementation Alternatives
== using MIPS as example ==
TU/e 5kk73
Henk Corporaal Bart Mesman
Topics

MIPS ISA: Instruction Set Architecture
MIPS single cycle implementation
MIPS multi-cycle implementation
MIPS pipelined implementation
Pipeline hazards
Recap of RISC principles
Other architectures
Based on the book: ch2-4 (4th ed)
Many slides; I'll go quick and skip some
H.Corporaal EmbProcArch 5kk73
Main Types of Instructions
Arithmetic

Integer Floating Point
Memory access instructions
Load & Store
Control flow
Jump Conditional Branch Call & Return
MIPS arithmetic

Most instructions have 3 operands
Operand order is fixed (destination first)
Example:
C code: A = B + C
MIPS code: add $s0, $s1, $s2
($s0, $s1 and $s2 are associated with variables by the compiler)
MIPS arithmetic
C code:
A = B + C + D;
E = F - A;
MIPS code:
add $t0, $s1, $s2
add $s0, $t0, $s3
sub $s4, $s5, $s0

Operands must be registers; only 32 registers are provided
Design Principle: smaller is faster. Why?
Registers vs. Memory
Arithmetic instruction operands must be registers; only 32 registers are provided
The compiler associates variables with registers
What about programs with lots of variables?
[Figure: CPU with its register file, connected to Memory and IO]
Register allocation
Compiler tries to keep as many variables in registers as possible
Some variables cannot be allocated to registers:
large arrays (too few registers)
aliased variables (variables accessible through pointers in C)
dynamically allocated variables (heap, stack)
Compiler may run out of registers => spilling
Memory Organization
Viewed as a large, single-dimension array, with an address
A memory address is an index into the array
"Byte addressing" means that successive addresses are one byte apart
[Figure: memory drawn as a column of byte addresses 0, 1, 2, 3, 4, 5, 6, ..., each holding 8 bits of data]
Memory Organization

Bytes are nice, but most data items use larger "words"
For MIPS, a word is 32 bits or 4 bytes.
[Figure: memory drawn as a column of word addresses 0, 4, 8, 12, ..., each holding 32 bits of data]
Registers hold 32 bits of data
2^32 bytes with byte addresses from 0 to 2^32-1
2^30 words with byte addresses 0, 4, 8, ... 2^32-4
Memory layout: Alignment

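[Figure: 32-bit-wide memory (bit positions 31, 23, 15, 7, 0) with word addresses 0, 4, 8, 12, 16, 20, 24; one of the drawn words is marked "this word is aligned; the others are not!"]
Words are aligned
What are the 2 least significant bits of a word address?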
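The alignment rule can be checked numerically. The following is a minimal sketch in Python; the helper name is_word_aligned is made up for illustration. It assumes 4-byte words, as stated on the previous slide, so an aligned word address is a multiple of 4.

```python
def is_word_aligned(byte_address):
    """Return True when the two least significant address bits are both 0."""
    return byte_address & 0b11 == 0

# Scan the first 16 byte addresses and keep the word-aligned ones.
aligned = [address for address in range(16) if is_word_aligned(address)]
print(aligned)
```

The mask 0b11 isolates exactly the two low-order bits the slide asks about.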
Instructions: load and store

Example:
C code: A[8] = h + A[8];
MIPS code:
lw $t0, 32($s3)
add $t0, $s2, $t0
sw $t0, 32($s3)

Store word operation has no destination (reg) operand
Remember arithmetic operands are registers, not memory!
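The offset 32 in the lw/sw example comes from byte addressing: element 8 of a word array lies 8 * 4 bytes past the base address held in $s3. A small sketch of that arithmetic, with hypothetical helper names and an arbitrary base address chosen only for illustration:

```python
WORD_SIZE = 4  # bytes per MIPS word

def element_offset(index):
    """Byte offset of A[index] from the base address of a word array."""
    return index * WORD_SIZE

def element_address(base_address, index):
    """Byte address of A[index] given the array's base address."""
    return base_address + element_offset(index)

print(element_offset(8))           # offset used in lw $t0, 32($s3)
print(element_address(1000, 8))    # base address 1000 is an arbitrary example
```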
Let's translate some C-code
Can we figure out the code?
void swap(int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

swap: muli $2, $5, 4
      add  $2, $4, $2
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31
Explanation:
index k: $5
base address of v: $4
address of v[k] is $4 + 4*$5
Machine Language
Instructions, like registers and words of data, are also 32 bits long

Example: add $t0, $s1, $s2
Registers have numbers: $t0=8, $s1=17, $s2=18
Instruction Format:

op      rs      rt      rd      shamt   funct
000000  10001   10010   01000   00000   100000
6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
Can you guess what the field names stand for?
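The field layout can be reproduced by shifting each field to its bit position. This is a sketch in Python, not part of the original slides; the function name encode_r_type is an assumption. It uses the field values shown above, where the rd field 01000 is register 8 ($t0).

```python
def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack the six R-format fields into one 32-bit instruction word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $t0, $s1, $s2: rs = $s1 (17), rt = $s2 (18), rd = $t0 (8), funct = 100000
word = encode_r_type(rs=17, rt=18, rd=8, shamt=0, funct=0b100000)
print(format(word, "032b"))
print(hex(word))
```

Printing the word in binary gives the six fields of the slide written back to back.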
Machine Language
Consider the load-word and store-word instructions,

What would the regularity principle have us do?
New principle: Good design demands a compromise
Introduce a new type of instruction format
I-type for data transfer instructions
other format was R-type for register

Example: lw $t0, 32($s2)
op   rs   rt   16 bit number
35   18   8    32
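The I-type example can be packed the same way as an R-type word. A minimal sketch, assuming $t0 is register 8 and $s2 is register 18 as in the register conventions; encode_i_type is a hypothetical helper name.

```python
def encode_i_type(op, rs, rt, immediate):
    """Pack the four I-format fields; the immediate is truncated to 16 bits."""
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)

# lw $t0, 32($s2): op = 35, rs = $s2 (18), rt = $t0 (8), offset = 32
word = encode_i_type(op=35, rs=18, rt=8, immediate=32)
print(hex(word))
```

Masking with 0xFFFF also lets a negative offset be stored in two's complement form in the 16-bit field.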
Stored Program Concept

[Figure: a CPU connected to memory. Memory holds the OS, Program 1 (code, global data, stack, heap, with unused space) and Program 2, with unused space in between]
Control
Decision making instructions

alter the control flow, i.e., change the "next" instruction to be executed
MIPS conditional branch instructions:
bne $t0, $t1, Label
beq $t0, $t1, Label
Example:
if (i==j) h = i + j;

bne $s0, $s1, Label
add $s3, $s0, $s1
Label: ....
Control
MIPS unconditional branch instructions:
j label
Example:
if (i!=j) h=i+j; else h=i-j;

beq $s4, $s5, Lab1
add $s3, $s4, $s5
j Lab2
Lab1: sub $s3, $s4, $s5
Lab2: ...
Can you build a simple for loop?
So far:
Instruction          Meaning
add $s1,$s2,$s3      $s1 = $s2 + $s3
sub $s1,$s2,$s3      $s1 = $s2 - $s3
lw $s1,100($s2)      $s1 = Memory[$s2+100]
sw $s1,100($s2)      Memory[$s2+100] = $s1
bne $s4,$s5,L        Next instr. is at Label if $s4 != $s5
beq $s4,$s5,L        Next instr. is at Label if $s4 = $s5
j Label              Next instr. is at Label
Formats:
R   op | rs | rt | rd | shamt | funct
I   op | rs | rt | 16 bit address
J   op | 26 bit address
Control Flow

We have: beq, bne, what about Branch-if-less-than?
New instruction:
slt $t0, $s1, $s2
meaning: if $s1 < $s2 then $t0 = 1 else $t0 = 0
Can use this instruction to build "blt $s1, $s2, Label"
can now build general control structures
Note that the assembler needs a register to do this, use conventions for registers
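How an assembler might build blt out of slt and bne can be sketched as a text expansion. This is illustrative Python, not an actual assembler; expand_blt is a made-up name, and the use of $at (register 1, conventionally reserved for the assembler) as the scratch register is an assumption that follows the usual MIPS convention.

```python
def expand_blt(rs, rt, label, scratch="$at"):
    """Expand the pseudoinstruction 'blt rs, rt, label' into real instructions."""
    return [
        "slt {}, {}, {}".format(scratch, rs, rt),     # scratch = 1 if rs < rt
        "bne {}, $zero, {}".format(scratch, label),   # branch when scratch != 0
    ]

for line in expand_blt("$s1", "$s2", "Label"):
    print(line)
```

The expansion costs two real instructions, which matters when counting instructions for performance.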
MIPS compiler/assembler Conventions

Name      Register number  Usage
$zero     0                the constant value 0
$v0-$v1   2-3              values for results and expression evaluation
$a0-$a3   4-7              arguments
$t0-$t7   8-15             temporaries
$s0-$s7   16-23            saved (by callee)
$t8-$t9   24-25            more temporaries
$gp       28               global pointer
$sp       29               stack pointer
$fp       30               frame pointer
$ra       31               return address
Constants
Small constants are used quite frequently (50% of operands)
e.g., A = A + 5; B = B + 1; C = C - 18;
Solutions? Why not?

put 'typical constants' in memory and load them
create hard-wired registers (like $zero) for constants like one
MIPS Instructions:
addi $29, $29, 4
slti $8, $18, 10
andi $29, $29, 6
ori  $29, $29, 4
How about larger constants?

We'd like to be able to load a 32 bit constant into a register
Must use two instructions; new "load upper immediate" instruction
lui $t0, 1010101010101010
$t0 = 1010101010101010 0000000000000000   (lower half filled with zeros)
Then must get the lower order bits right, i.e.,
ori $t0, $t0, 1010101010101010

      1010101010101010 0000000000000000
ori   0000000000000000 1010101010101010
      ---------------------------------
      1010101010101010 1010101010101010
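The two-instruction sequence can be mimicked on plain integers. A sketch under the assumption that registers are 32 bits wide; lui, ori and load_constant here are Python stand-ins for the instructions, not a simulator.

```python
def lui(immediate):
    """Place a 16-bit immediate in the upper half; the lower half becomes zero."""
    return (immediate & 0xFFFF) << 16

def ori(register_value, immediate):
    """OR a zero-extended 16-bit immediate into a register value."""
    return register_value | (immediate & 0xFFFF)

def load_constant(value):
    """Build a 32-bit constant with lui followed by ori."""
    upper = (value >> 16) & 0xFFFF
    lower = value & 0xFFFF
    t0 = lui(upper)
    return ori(t0, lower)

pattern = 0b1010101010101010
result = load_constant((pattern << 16) | pattern)
print(format(result, "032b"))
```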
Assembly Language vs. Machine Language
Assembly provides convenient symbolic representation
much easier than writing down numbers
e.g., destination first
Machine language is the underlying reality
e.g., destination is no longer first
Assembly can provide 'pseudoinstructions'
e.g., move $t0, $t1 exists only in Assembly
would be implemented using add $t0,$t1,$zero
When considering performance you should count real instructions
Addresses in Branches and Jumps
Instructions:
bne $t4,$t5,Label    Next instruction is at Label if $t4 != $t5
beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
j Label              Next instruction is at Label
Formats:
I   op | rs | rt | 16 bit address
J   op | 26 bit address
Addresses are not 32 bits
How do we handle this with load and store instructions?
What's the next address?
Instructions:
bne $t4,$t5,Label    Next instruction is at Label if $t4 != $t5
beq $t4,$t5,Label    Next instruction is at Label if $t4 = $t5
Formats:
I   op | rs | rt | 16 bit address
Could specify a register (like lw and sw) and add it to address

use Instruction Address Register (PC = program counter)
most branches are local (principle of locality)
Jump instructions just use high order bits of PC
address boundaries of 256 MB
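Both address calculations can be written out as arithmetic on the PC. The sketch below assumes the conventions used elsewhere in these slides: the branch offset counts words relative to PC + 4, and a jump keeps the upper 4 bits of the incremented PC. The function names are made up.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def branch_target(pc, immediate):
    """PC-relative: PC + 4 plus the word offset shifted left by 2."""
    return (pc + 4 + (sign_extend16(immediate) << 2)) & 0xFFFFFFFF

def jump_target(pc, address26):
    """Pseudodirect: upper 4 bits of PC + 4, then the 26-bit field shifted left by 2."""
    return ((pc + 4) & 0xF0000000) | ((address26 & 0x03FFFFFF) << 2)

print(branch_target(40, 7))                 # a beq at address 40 with offset 7
print(hex(jump_target(0x10000000, 2500)))   # j 2500 from an arbitrary example PC
```

The 26-bit field shifted by 2 covers 28 bits, which is where the 256 MB boundary comes from.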
To summarize:
MIPS assembly language
Category | Instruction | Example | Meaning | Comments
Arithmetic | add | add $s1, $s2, $s3 | $s1 = $s2 + $s3 | Three operands; data in registers
Arithmetic | subtract | sub $s1, $s2, $s3 | $s1 = $s2 - $s3 | Three operands; data in registers
Arithmetic | add immediate | addi $s1, $s2, 100 | $s1 = $s2 + 100 | Used to add constants
Data transfer | load word | lw $s1, 100($s2) | $s1 = Memory[$s2 + 100] | Word from memory to register
Data transfer | store word | sw $s1, 100($s2) | Memory[$s2 + 100] = $s1 | Word from register to memory
Data transfer | load byte | lb $s1, 100($s2) | $s1 = Memory[$s2 + 100] | Byte from memory to register
Data transfer | store byte | sb $s1, 100($s2) | Memory[$s2 + 100] = $s1 | Byte from register to memory
Data transfer | load upper immediate | lui $s1, 100 | $s1 = 100 * 2^16 | Loads constant in upper 16 bits
Conditional branch | branch on equal | beq $s1, $s2, 25 | if ($s1 == $s2) go to PC + 4 + 100 | Equal test; PC-relative branch
Conditional branch | branch on not equal | bne $s1, $s2, 25 | if ($s1 != $s2) go to PC + 4 + 100 | Not equal test; PC-relative
Conditional branch | set on less than | slt $s1, $s2, $s3 | if ($s2 < $s3) $s1 = 1; else $s1 = 0 | Compare less than; for beq, bne
Conditional branch | set less than immediate | slti $s1, $s2, 100 | if ($s2 < 100) $s1 = 1; else $s1 = 0 | Compare less than constant
Unconditional jump | jump | j 2500 | go to 10000 | Jump to target address
Unconditional jump | jump register | jr $ra | go to $ra | For switch, procedure return
Unconditional jump | jump and link | jal 2500 | $ra = PC + 4; go to 10000 | For procedure call
MIPS (3+2) addressing modes overview

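1. Immediate addressing: op | rs | rt | Immediate (the operand is the immediate field)
2. Register addressing: op | rs | rt | rd | ... | funct (the operand is a register)
3. Base addressing: op | rs | rt | Address (register + address selects a byte, halfword or word in memory)
4. PC-relative addressing: op | rs | rt | Address (PC + address selects a word in memory)
5. Pseudodirect addressing: op | Address (address combined with the PC selects a word in memory)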
MIPS Datapath
Building a datapath
support a subset of the MIPS-I instruction-set
A single cycle processor datapath
all instruction actions in one (long) cycle
A multi-cycle processor datapath
each instruction takes multiple (shorter) cycles
For details see book (ch 5):
Datapath and Control

Datapath: Registers & Memories, Multiplexors, Buses, ALUs
Control: FSM or Microprogramming
The Processor: Datapath & Control
Simplified MIPS implementation to contain only:

memory-reference instructions: lw, sw
arithmetic-logical instructions: add, sub, and, or, slt
control flow instructions: beq, j
Generic Implementation:

use the program counter (PC) to supply instruction address
get the instruction from memory
read registers
use the instruction to decide exactly what to do
All instructions use the ALU after reading the registers
Why? memory-reference? arithmetic? control flow?
More Implementation Details
Abstract / Simplified View:

[Figure: the PC supplies the address to the instruction memory; the instruction supplies register numbers to the register file; register data feeds the ALU; the ALU result addresses the data memory, whose data goes back to the register file]
Two types of functional units:

elements that operate on data values (combinational)
elements that contain state (sequential)
State Elements

Unclocked vs. Clocked
Clocks used in synchronous logic
when should an element that contains state be updated?

[Figure: clock waveform marking the rising edge, the falling edge and the cycle time]
An unclocked state element
The set-reset (SR) latch
output depends on present inputs and also on past inputs
[Figure: SR latch circuit with inputs R and S and output Q, annotated "state change"]
Truth table:
R  S  Q
0  0  Q
0  1  1
1  0  0
1  1  ?
Latches and Flip-flops
Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
Change of state (value) is based on the clock

Latches: state changes whenever the inputs change and the clock is asserted
Flip-flop: state changes only on a clock edge (edge-triggered methodology)
A clocking methodology defines when signals can be read and written
wouldn't want to read a signal at the same time it was being written
D-latch
Two inputs:

the data value to be stored (D)
the clock signal (C) indicating when to read & store D
Two outputs:
the value of the internal state (Q) and its complement
[Figure: D latch circuit with inputs D and C and outputs Q and _Q]
D flip-flop
Output changes only on the clock edge

[Figure: D flip-flop built from two D latches in series, with data input D, clock C, and outputs Q and _Q]
Our Implementation

An edge triggered methodology
Typical execution:
read contents of some state elements,
send values through some combinational logic,
write results to one or more state elements
[Figure: State element 1 -> Combinational logic -> State element 2, all within one clock cycle]
Register File
3-ported: one write, two read ports
[Figure: register file with inputs Read reg. #1, Read reg. #2, Write reg. #, Write data and the Write control signal, and outputs Read data 1 and Read data 2]
Register file: read ports

Register file built using D flip-flops
[Figure: implementation of the read ports. Register 0, Register 1, ..., Register n-1, Register n feed two multiplexors; Read register number 1 selects Read data 1 and Read register number 2 selects Read data 2]

Register file: write port
Note: we still use the real clock to determine when to write

[Figure: implementation of the write port. The Register number drives an n-to-1 decoder; each decoder output, combined with the Write signal, drives the clock input C of one of Register 0 .. Register n; Register data drives every D input]
Building the Datapath
Use multiplexors to stitch them together

[Figure: single-cycle datapath. PC, instruction memory, register file, sign extend (16 to 32 bits), shift left 2, two adders, ALU with Zero output, data memory and three multiplexors; control signals PCSrc, RegWrite, ALUSrc, ALU operation (3 bits), MemWrite, MemRead and MemtoReg]
Our Simple Control Structure

All of the logic is combinational
We wait for everything to settle down, and the right thing to be done

ALU might not produce right answer right away
we use write signals along with clock to determine when to write
Cycle time determined by length of the longest path

[Figure: State element 1 -> Combinational logic -> State element 2, all within one clock cycle]
We are ignoring some details like setup and hold times!

Control

Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Information comes from the 32 bits of the instruction
Example: add $8, $17, $18

Instruction Format:
op      rs      rt      rd      shamt   funct
000000  10001   10010   01000   00000   100000
ALU's operation based on instruction type and function code
Control: 2 level implementation

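[Figure: the instruction register feeds two control blocks. The Opcode (bits 31-26) goes to Control 2, which produces the 2-bit ALUop; ALUop and the Funct. field (bits 5-0) go to Control 1, which produces the 3-bit ALUcontrol for the ALU]
ALUop (2 bits):
00: lw, sw
01: beq
10: add, sub, and, or, slt
ALUcontrol (3 bits):
000: and
001: or
010: add
110: sub
111: set on less than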
Datapath with Control

[Figure: single-cycle datapath with control. The Control unit decodes Instruction [31-26] and produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite. Instruction [25-21] and [20-16] select the read registers, a multiplexor chooses Instruction [20-16] or [15-11] as write register, Instruction [15-0] is sign-extended from 16 to 32 bits, and Instruction [5-0] feeds the ALU control]
ALU Control1
What should the ALU do with this instruction
example: lw $1, 100($2)
op   rs   rt   16 bit offset
35   2    1    100
ALU control input
000  AND
001  OR
010  add
110  subtract
111  set-on-less-than
Why is the code for subtract 110 and not 011?
ALU Control1
Must describe hardware to compute 3-bit ALU control input
given instruction type
00 = lw, sw
01 = beq
10 = arithmetic
function code for arithmetic
ALUOp: ALU Operation class, computed from instruction type
Describe it using a truth table (can turn into gates):

inputs                                     outputs
ALUOp1 ALUOp0   F5 F4 F3 F2 F1 F0          Operation
0      0        X  X  X  X  X  X           010
X      1        X  X  X  X  X  X           110
1      X        X  X  0  0  0  0           010
1      X        X  X  0  0  1  0           110
1      X        X  X  0  1  0  0           000
1      X        X  X  0  1  0  1           001
1      X        X  X  1  0  1  0           111
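The truth table translates directly into a small decoding function. The sketch below is one possible software rendering of it in Python; the name alu_control and the dictionary layout are illustrative choices, and only the rows of the table are covered.

```python
# F3..F0 of the function code -> 3-bit ALU operation, taken from the truth table.
FUNCT_TO_OPERATION = {
    0b0000: 0b010,  # add
    0b0010: 0b110,  # subtract
    0b0100: 0b000,  # and
    0b0101: 0b001,  # or
    0b1010: 0b111,  # set on less than
}

def alu_control(alu_op, funct=0):
    """Compute the 3-bit ALU control input from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:          # lw, sw: the ALU adds to form the address
        return 0b010
    if alu_op & 0b01:           # ALUOp0 = 1 (beq): the ALU subtracts
        return 0b110
    # ALUOp1 = 1: arithmetic, decided by the low four bits of the function code
    return FUNCT_TO_OPERATION[funct & 0b1111]

print(format(alu_control(0b10, 0b101010), "03b"))   # funct of slt
```

F5 and F4 are don't-cares in the table, which is why only the low four funct bits are looked up.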
ALU Control1
Simple combinational logic (truth tables)

[Figure: ALU control block built from gates. Inputs: ALUOp1, ALUOp0 and the funct bits F3, F2, F1, F0 of F(5-0). Outputs: Operation2, Operation1, Operation0]
Deriving Control2 signals

Input: 6-bit opcode
9 control (output) signals
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
Determine these control signals directly from the opcodes:
R-format: 0
lw: 35
sw: 43
beq: 4
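Because the control signals depend only on the opcode, the table can be modelled as a lookup keyed by bits 31-26 of the instruction. A sketch in Python with made-up names (CONTROL, main_control); don't-care entries are represented by None, and the two ALUOp bits are merged into one 2-bit value.

```python
X = None  # don't care

# opcode -> control signals, copied from the table above
CONTROL = {
    0:  dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1, MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),  # R-format
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1, MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),  # lw
    43: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0, MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),  # sw
    4:  dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0, MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),  # beq
}

def main_control(instruction_word):
    """Look up the control signals from bits 31-26 of a 32-bit instruction."""
    opcode = (instruction_word >> 26) & 0x3F
    return CONTROL[opcode]

print(main_control(35 << 26))   # any lw instruction
```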
Control 2
PLA example implementation
[Figure: PLA with inputs Op5, Op4, Op3, Op2, Op1, Op0; product terms R-format, lw, sw, beq; outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]
Single Cycle Implementation
Calculate cycle time assuming negligible delays except:
memory (2ns), ALU and adders (2ns), register file access (1ns)
[Figure: complete single-cycle datapath: PC, instruction memory, register file, sign extend, shift left 2, adders, ALU, ALU control and data memory, with control signals PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemWrite, MemRead and MemtoReg]
Single Cycle Implementation
Memory (2ns), ALU & adders (2ns), reg. file access (1ns)
Fixed length clock: longest instruction is the lw which requires 8 ns
Variable clock length (not realistic, just as exercise):
R-instr: 6 ns
Load: 8 ns
Store: 7 ns
Branch: 5 ns
Jump: 2 ns
Average depends on instruction mix
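The per-class times follow from adding the delays of the units each instruction class uses. The sketch below reproduces them in Python. The unit sequences in PATH are my reading of which units lie on each path, and the instruction mix is purely hypothetical, included only to show how an average would be computed.

```python
DELAY_NS = {"memory": 2, "alu": 2, "regfile": 1}

# Units used in sequence by each instruction class (an assumption that
# reproduces the times listed on the slide).
PATH = {
    "R-instr": ["memory", "regfile", "alu", "regfile"],
    "Load":    ["memory", "regfile", "alu", "memory", "regfile"],
    "Store":   ["memory", "regfile", "alu", "memory"],
    "Branch":  ["memory", "regfile", "alu"],
    "Jump":    ["memory"],
}

def latency_ns(kind):
    """Sum of the unit delays along the path of one instruction class."""
    return sum(DELAY_NS[unit] for unit in PATH[kind])

def average_ns(mix):
    """Weighted average latency for a mix given as {class: fraction}."""
    return sum(fraction * latency_ns(kind) for kind, fraction in mix.items())

# Hypothetical mix; not taken from the slides.
mix = {"R-instr": 0.44, "Load": 0.24, "Store": 0.12, "Branch": 0.18, "Jump": 0.02}
print({kind: latency_ns(kind) for kind in PATH})
print(average_ns(mix))
```

With a fixed clock every instruction pays the 8 ns of the load, so the gap between that and the weighted average is the time a single-cycle design wastes.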
Where we are headed
Single Cycle Problems:

what if we had a more complicated instruction like floating point?
wasteful of area: NO Sharing of Hardware resources
One Solution:
use a smaller cycle time
have different instructions take different numbers of cycles
a multicycle datapath:
[Figure: multicycle datapath. PC, one memory for instructions and data, instruction register (IR), memory data register (MDR), register file, registers A and B, ALU and ALUOut]
Multicycle Approach
We will be reusing functional units

ALU used to compute address and to increment PC
Memory used for instruction and data
Add registers after every major functional unit
Our control signals will not be determined solely by instruction
e.g., what should the ALU do for a subtract instruction?
We'll use a finite state machine (FSM) or microcode for control
Review: finite state machines
Finite state machines:

a set of states and
next state function (determined by current state and the input)
output function (determined by current state and possibly input)
[Figure: the Current state and the Inputs feed the Next-state function, which produces the Next state, stored on the Clock; the Output function produces the Outputs]
We'll use a Moore machine (output based only on current state)
Multicycle Approach
Break up the instructions into steps, each step takes a cycle

balance the amount of work to be done
restrict each cycle to use only one major functional unit
At the end of a cycle
store values for use in later cycles (easiest thing to do)
introduce additional internal registers
Notice: we distinguish
processor state: programmer visible registers
internal state: programmer invisible registers (like IR, MDR, A, B, and ALUout)
Multicycle Approach
[Figure: multicycle datapath. PC, a single memory (Address, MemData, Write data), instruction register with fields Instruction [25-21], [20-16], [15-11] and [15-0], memory data register, register file, registers A and B, sign extend (16 to 32 bits), shift left 2, the constant 4, ALU with Zero output, ALUOut, and multiplexors on the memory address, the write register, the write data and both ALU inputs]
Multicycle Approach
Note that previous picture does not include:
branch support
jump support
Control lines and logic
Tclock > max (ALU delay, Memory access, Regfile access)
See book for complete picture
Five Execution Steps
1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type instruction completion
5. Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

Step 1: Instruction Fetch

Use PC to get instruction and put it in the Instruction Register
Increment the PC by 4 and put the result back in the PC
Can be described succinctly using RTL "Register-Transfer Language"
IR = Memory[PC];
PC = PC + 4;
Can we figure out the values of the control signals?
What is the advantage of updating the PC now?
Step 2: Instruction Decode and Register Fetch

Read registers rs and rt in case we need them
Compute the branch address in case the instruction is a branch
Previous two actions are done optimistically!!
RTL:
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC + (sign-extend(IR[15-0]) << 2);
We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
Step 3 (instruction dependent)
ALU is performing one of four functions, based on instruction type
Memory Reference: ALUOut = A + sign-extend(IR[15-0]);
R-type: ALUOut = A op B;
Branch: if (A==B) PC = ALUOut;
Jump: PC = PC[31-28] || (IR[25-0]<<2)
Step 4 (R-type or Memory-access)
Loads and stores access memory
MDR = Memory[ALUOut]; or
Memory[ALUOut] = B;
R-type instructions finish
Reg[IR[15-11]] = ALUOut;
The write actually takes place at the end of the cycle on the edge
Write-back step
Memory read completion step
Reg[IR[20-16]]= MDR;
What about all the other instructions?
Summary execution steps

Steps taken to execute any instruction class
Step name | Action for R-type instructions | Action for memory-reference instructions | Action for branches | Action for jumps
Instruction fetch | all classes: IR = Memory[PC]; PC = PC + 4
Instruction decode/register fetch | all classes: A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Execution, address computation, branch/jump completion | ALUOut = A op B | ALUOut = A + sign-extend(IR[15-0]) | if (A == B) then PC = ALUOut | PC = PC[31-28] || (IR[25-0] << 2)
Memory access or R-type completion | Reg[IR[15-11]] = ALUOut | Load: MDR = Memory[ALUOut] or Store: Memory[ALUOut] = B | |
Memory read completion | | Load: Reg[IR[20-16]] = MDR | |
Simple Questions
How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, L1    #assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
L1: ...
What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?
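Questions like these can be answered by laying the instructions end to end, since the multicycle design starts an instruction only after the previous one has finished. A sketch in Python, assuming the step counts from the summary of execution steps (5 for a load, 4 for a store or R-type, 3 for a branch or jump); the names CYCLES and schedule are made up.

```python
# Number of steps (cycles) each instruction class takes in the multicycle design.
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3, "j": 3}

program = ["lw", "lw", "beq", "add", "sw"]   # the code above, branch not taken

def schedule(instructions):
    """Return (mnemonic, first_cycle, last_cycle) per instruction, counting from 1."""
    result, cycle = [], 1
    for op in instructions:
        length = CYCLES[op]
        result.append((op, cycle, cycle + length - 1))
        cycle += length
    return result

for entry in schedule(program):
    print(entry)
total = schedule(program)[-1][2]
print(total)
```

The cycle ranges show which instruction owns any given cycle; its step number is the offset within that range.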
Implementing the Control
Value of control signals is dependent upon:

what instruction is being executed
which step is being performed
Use the information we have accumulated to specify a finite state machine (FSM)

specify the finite state machine graphically, or
use microprogramming
Implementation can be derived from specification
Graphical Specification of FSM

[Figure: state diagram of the control FSM, beginning at Start. States and their outputs:]
State 0, Instruction fetch: MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
State 1, Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, Memory access (Op = 'LW'): MemRead, IorD = 1
State 4, Write-back step: RegDst = 0, RegWrite, MemtoReg = 1
State 5, Memory access (store): MemWrite, IorD = 1
State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
State 9, Jump completion (Op = 'J'): PCWrite, PCSource = 10
How many state bits will we need?
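The next-state function of such an FSM can be sketched as a table. In the Python sketch below the state numbers 0 to 9 follow the labels that survive in the diagram, while the transitions between states are my reconstruction from the five execution steps and should be treated as an assumption; NEXT_STATE, next_state and trace are illustrative names.

```python
import math

# current state -> {opcode class: next state}; "*" means independent of the opcode.
NEXT_STATE = {
    0: {"*": 1},                                        # instruction fetch
    1: {"lw": 2, "sw": 2, "R": 6, "beq": 8, "j": 9},    # decode / register fetch
    2: {"lw": 3, "sw": 5},                              # memory address computation
    3: {"*": 4},                                        # memory access (load)
    4: {"*": 0},                                        # write-back step
    5: {"*": 0},                                        # memory access (store)
    6: {"*": 7},                                        # execution
    7: {"*": 0},                                        # R-type completion
    8: {"*": 0},                                        # branch completion
    9: {"*": 0},                                        # jump completion
}

def next_state(state, op):
    """Moore-style next-state lookup from the current state and the opcode class."""
    row = NEXT_STATE[state]
    return row.get("*", row.get(op))

def trace(op):
    """States visited while executing one instruction of the given class."""
    states, state = [0], next_state(0, op)
    while state != 0:
        states.append(state)
        state = next_state(state, op)
    return states

state_bits = math.ceil(math.log2(len(NEXT_STATE)))
print(trace("lw"), trace("R"), trace("beq"), state_bits)
```

The length of each trace equals the cycle count of that instruction class, and the number of states determines the width of the state register.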
Finite State Machine for Control

Implementation:
[Figure: a block of combinational control logic. Inputs: Op5 .. Op0 from the instruction register opcode field and S3 .. S0 from the state register. Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next state NS3 .. NS0, which is fed back to the state register]
PLA Implementation
(see book)
[Figure: PLA with inputs Op5 .. Op0 (opcode) and S3 .. S0 (current state). Outputs for datapath control: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst. Outputs for the next state: NS3, NS2, NS1, NS0]
If I picked a horizontal or vertical line could you explain it?
What type of FSM is used? Mealy or Moore?
Pipelined implementation

Pipelining
Pipelined datapath
Pipelined control
Hazards:

Structural
Data
Control
Exceptions
Scheduling
For details see the book (chapter 6):
Pipelining
Improve performance by increasing instruction throughput
[Figure: three loads, lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), drawn against time in ns, in program execution order. Top: without pipelining each instruction goes through Instruction fetch, Reg, ALU, Data access, Reg, and a new instruction starts every 8 ns (time axis 2 .. 18). Bottom: with pipelining every stage takes 2 ns and a new instruction starts every 2 ns (time axis 2 .. 14)]
Pipelining
Ideal speedup = number of stages
Do we achieve this?
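The numbers from the three-load figure can be turned into a small calculation. The sketch below assumes what the figure shows: 8 ns per instruction without pipelining, and five stages of 2 ns each with pipelining. The function names are made up.

```python
def unpipelined_ns(count, instruction_ns=8):
    """Total time when each instruction runs to completion before the next starts."""
    return count * instruction_ns

def pipelined_ns(count, stages=5, stage_ns=2):
    """The first instruction needs all stages; every later one adds one stage time."""
    return (stages + count - 1) * stage_ns

def speedup(count):
    return unpipelined_ns(count) / pipelined_ns(count)

print(unpipelined_ns(3), pipelined_ns(3))   # the three loads of the figure
print(speedup(3), speedup(1000000))
```

For long instruction streams the ratio approaches 8 ns / 2 ns = 4 rather than 5: the stages are not perfectly balanced, because the 1 ns register stages are stretched to the 2 ns clock.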
Pipelining
What makes it easy

all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
What makes it hard?

structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction
We'll build a simple pipeline and look at these issues
We'll talk about modern processors and what really makes it hard:

exception handling
trying to improve performance with out-of-order execution, etc.
Basic idea: start from single cycle impl.

What do we need to add to actually split the datapath into stages?
IF: Instruction fetch
ID: Instruction decode/ register file read
EX: Execute/ address calculation
MEM: Memory access
WB: Write back
[Figure: the single-cycle datapath divided into these five stages]
Pipelined Datapath
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages]
Corrected Datapath
[Figure: corrected pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]
Graphically Representing Pipelines

[Figure: pipeline diagram with time in clock cycles CC 1 .. CC 6 horizontally and program execution order vertically. lw $10, 20($1) uses IM, Reg, ALU, DM, Reg in CC 1-5; sub $11, $2, $3 uses the same stages in CC 2-6]
Can help with answering questions like:

how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
use this representation to help understand datapaths
Pipeline Control
[Figure: pipelined datapath with the control signals drawn in: PCSrc, RegWrite, Branch, MemWrite, ALUSrc, MemtoReg, MemRead, RegDst and ALUOp. Instruction [15-0] is sign-extended and feeds the ALU control; Instruction [20-16] and [15-11] feed the RegDst multiplexor]
Pipeline control
We have 5 stages. What needs to be controlled in each stage?

Instruction Fetch and PC Increment
Instruction Decode / Register Fetch
Execution
Memory Stage
Write Back
How would control be handled in an automobile plant?

a fancy control center telling everyone what to do?
should we use a finite state machine?
Pipeline Control
             Execution/Address Calculation      Memory access                Write-back
             stage control lines                stage control lines          stage control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc     Branch  MemRead  MemWrite    RegWrite  MemtoReg
R-format     1       1       0       0          0       0        0           1         0
lw           0       0       0       1          0       1        0           1         1
sw           X       0       0       1          0       0        1           0         X
beq          X       0       1       0          1       0        0           0         X
Pass control signals along just like the data:
(compare single cycle control!)
[Figure: Control decodes the instruction and writes the EX, M and WB control fields into ID/EX; M and WB are passed on to EX/MEM, and WB is passed on to MEM/WB]
Datapath with Control

[Figure: pipelined datapath with control. The Control unit fills the WB, M and EX fields of ID/EX, which travel through EX/MEM and MEM/WB; the signals PCSrc, RegWrite, MemWrite, ALUSrc, Branch, MemRead, RegDst, ALUOp and MemtoReg are taken from the stage where they are used]
Hazards: problems due to pipelining

Hazard types:
Structural: same resource is needed multiple times in the same cycle
Data: data dependencies limit pipelining
Control: next executed instruction may not be the next specified instruction
Structural hazards
Examples:
Two accesses to a single ported memory
Two operations need the same function unit at the same time
Two operations need the same function unit in successive cycles, but the unit is not pipelined
Solutions:
stalling
add more hardware
Structural hazards on MIPS

Q: Do we have structural hazards on our simple MIPS pipeline?
time ->
IF  ID  EX  MEM WB
    IF  ID  EX  MEM WB
        IF  ID  EX  MEM WB
            IF  ID  EX  MEM WB
                IF  ID  EX  MEM WB
Data hazards
Data dependencies:

RaW (read-after-write)
WaW (write-after-write)
WaR (write-after-read)
Hardware solution:

Forwarding / Bypassing
Detection logic
Stalling
Software solution: Scheduling
Data dependences
Three types: RaW, WaR and WaW
add r1, r2, 5    ; r1 := r2+5
sub r4, r1, r3   ; RaW of r1

add r1, r2, 5
sub r2, r4, 1    ; WaR of r2

add r1, r2, 5
sub r1, r1, 1    ; WaW of r1

st r1, 5(r2)     ; M[r2+5] := r1
ld r5, 0(r4)     ; RaW if 5+r2 = 0+r4
WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom!
Problems for your compiler and Pentium!
use register renaming to solve this!
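The three register dependence types can be detected by comparing destination and source registers of two instructions in program order. A minimal sketch in Python; the tuple representation (destination, sources) and the function name are assumptions, and memory dependences such as the st/ld pair are deliberately not modelled.

```python
def dependences(first, second):
    """Each instruction is (dest, sources). Return the set of register
    dependence types of `second` on the earlier instruction `first`."""
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 is not None and dest1 in sources2:
        found.add("RaW")   # second reads what first writes
    if dest2 is not None and dest2 in sources1:
        found.add("WaR")   # second overwrites what first reads
    if dest1 is not None and dest1 == dest2:
        found.add("WaW")   # both write the same register
    return found

add = ("r1", ["r2"])                                  # add r1, r2, 5
print(dependences(add, ("r4", ["r1", "r3"])))         # sub r4, r1, r3
print(dependences(add, ("r2", ["r4"])))               # sub r2, r4, 1
print(dependences(add, ("r1", ["r1"])))               # sub r1, r1, 1
```

Note that the pair add r1, r2, 5 / sub r1, r1, 1 carries a RaW dependence on r1 in addition to the WaW one marked on the slide.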
RaW on MIPS pipeline

[Figure: pipeline diagram over CC 1 .. CC 9, in program execution order, for
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
each passing through IM, Reg, ALU, DM, Reg. Value of register $2: 10 in CC 1-4, 10/20 in CC 5, 20 in CC 6-9]
Forwarding
Use temporary results, don't wait for them to be written
register file forwarding to handle read/write to same register
ALU forwarding

[Figure: pipeline diagram over CC 1 .. CC 9 for sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2), with forwarding paths drawn in.
Value of register $2: 10 in CC 1-4, 10/20 in CC 5, 20 in CC 6-9
Value of EX/MEM: 20 in CC 4, X otherwise
Value of MEM/WB: 20 in CC 5, X otherwise]
What if this $2 was $13?
Forwarding hardware
ALU forwarding circuitry principle:
[Figure: operands go from the register file to the ALU and the result goes back to the register file, with bypass paths back to the ALU inputs]
Note: there are two options
buf - ALU - bypass mux - buf
buf - bypass mux - ALU - buf
Forwarding
[Figure: pipelined datapath with a Forwarding unit. The unit receives the Rs and Rt register numbers (IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd carried along the pipeline) together with EX/MEM.RegisterRd and MEM/WB.RegisterRd, and drives the ForwardA and ForwardB multiplexors at the ALU inputs]
Forwarding check

Check for matching register-ids:
For each source-id of operation in the EX-stage check if there is a matching pending dest-id
Example:
if (EX/MEM.RegWrite) and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs) then ForwardA = 10
Q. How many comparators do we need?
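The check can be written as a function that returns the multiplexor select for one ALU source. In the sketch below the EX/MEM condition and the value 10 come from the example above; the MEM/WB condition and the encodings 01 and 00 are assumptions added to make the sketch complete, and the dictionary keys are simply borrowed from the signal names.

```python
def forward_select(source_reg, ex_mem, mem_wb):
    """Return the 2-bit forwarding select for one ALU source register.
    ex_mem and mem_wb are dicts with the keys "RegWrite" and "RegisterRd"."""
    if (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
            and ex_mem["RegisterRd"] == source_reg):
        return 0b10   # take the ALU result waiting in EX/MEM
    if (mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0
            and mem_wb["RegisterRd"] == source_reg):
        return 0b01   # assumed encoding: take the value waiting in MEM/WB
    return 0b00       # assumed encoding: use the register file value

# and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM
sub_in_mem = {"RegWrite": True, "RegisterRd": 2}
nothing = {"RegWrite": False, "RegisterRd": 0}
forward_a = forward_select(2, sub_in_mem, nothing)   # Rs = $2
forward_b = forward_select(5, sub_in_mem, nothing)   # Rt = $5
print(format(forward_a, "02b"), format(forward_b, "02b"))
```

Checking EX/MEM first means the most recent result wins when both pipeline registers hold a value for the same register. The test against register 0 keeps a write to $zero from ever being forwarded.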

Can't always forward
Load word can still cause a hazard:

an instruction tries to read register r following a load to the same r
Need a hazard detection unit to stall the load instruction
[Figure: pipeline diagram over CC 1 .. CC 9, in program execution order, for
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
each passing through IM, Reg, ALU, DM, Reg]
Stalling
We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram over CC 1 .. CC 10 for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7. The lw proceeds normally; and $4, $2, $5 repeats its Reg stage and or $8, $2, $6 repeats its IM stage, which inserts a bubble; add and slt follow one cycle later]
In CC4 the ALU is not used, Reg and IM are redone
Hazard Detection Unit
[Figure: pipelined datapath with a hazard detection unit next to the forwarding unit. Signals at the hazard detection unit: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, PCWrite, IF/IDWrite, and a mux that replaces the control signals by 0. The forwarding unit still uses EX/MEM.RegisterRd and MEM/WB.RegisterRd.]

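The figure shows only the signals of the unit. A minimal Python sketch of the check it could perform follows; the condition is the usual textbook load-use test and is an assumption here, as are the names must_stall, control_signals and ControlMuxZero.

```python
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID reads the register that a load in EX will write."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


def control_signals(stall):
    # On a stall, PC and IF/ID keep their contents and the control mux selects 0,
    # which sends a bubble down the pipeline. "ControlMuxZero" is a made-up name.
    return {"PCWrite": not stall, "IF/IDWrite": not stall, "ControlMuxZero": stall}


# lw $2, 20($1) is in EX (Rt = 2); and $4, $2, $5 is in ID (Rs = 2, Rt = 5).
stall = must_stall(True, 2, 2, 5)
signals = control_signals(stall)
print(stall, signals)
```

A non-load instruction in EX never triggers the stall in this model, because its result can be forwarded from EX/MEM in time.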
Software only solution?
Have compiler guarantee that no hazards occur
Example: where do we insert the NOPs?
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $13, 100($2)
With NOPs inserted:
  sub $2, $1, $3
  nop
  nop
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  nop
  sw  $13, 100($2)
Problem: this really slows us down!

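The NOP placement of this example can be reproduced with a short Python sketch. It assumes a pipeline without forwarding in which the register file is written in the first half of a cycle and read in the second half, so a consumer must sit at least three slots after its producer. The function insert_nops and the tuple layout (text, destination, sources) are assumptions made for this illustration.

```python
def insert_nops(program, min_distance=3):
    """program: list of (text, dest, sources). Returns the instruction texts with nops added."""
    out = []
    last_write = {}  # register -> position in `out` of its most recent writer
    for text, dest, sources in program:
        for reg in sources:
            if reg in last_write:
                # Pad until this instruction is far enough behind the writer.
                while len(out) - last_write[reg] < min_distance:
                    out.append("nop")
        if dest is not None:
            last_write[dest] = len(out)
        out.append(text)
    return out


program = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2"]),
    ("sw $13, 100($2)", None, ["$13", "$2"]),
]
scheduled = insert_nops(program)
print("\n".join(scheduled))
```

Three NOPs for five useful instructions illustrates the slowdown the slide complains about.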
Control hazards
Control operations may change the sequential flow of instructions
  branch
  jump
  call (jump and link)
  return
  (exception/interrupt and rti / return from interrupt)

Control hazard: Branch
Branch actions:
  Compute new address
  Determine condition
  Perform the actual branch (if taken): PC := new address

Branch example
[Figure: pipeline diagram over CC1 to CC9; each instruction is drawn as IM, Reg, DM, Reg, staggered by one clock cycle.]
Program execution order (in instructions):
  40 beq $1, $3, 7
  44 and $12, $2, $5
  48 or  $13, $6, $2
  52 add $14, $2, $2
  72 lw  $4, 50($7)

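The target address 72 in this example follows from the beq offset. A small Python sketch of the arithmetic, assuming the standard MIPS rule that the offset counts words relative to the instruction after the branch (the function name branch_target is made up for this illustration):

```python
def branch_target(pc, offset_words):
    """Target of a taken MIPS beq/bne: next instruction address plus offset * 4."""
    return pc + 4 + (offset_words << 2)


target = branch_target(40, 7)  # beq $1, $3, 7 at address 40
print(target)  # → 72
```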
Branching
Squash pipeline:
  When we decide to branch, other instructions are in the pipeline!
  We are predicting branch not taken
    need to add hardware for flushing instructions if we are wrong

Branch with predict not taken
Clock cycles ->
Branch L             IF  ID  EX  MEM WB
Predict not taken        IF  ID  EX  MEM WB
                             IF  ID  EX  MEM WB
                                 IF  ID  EX  MEM WB
L:                                   IF  ID  EX  MEM WB

Branch speedup
Earlier address computation
Earlier condition calculation
Put both in the ID pipeline stage
  adder
  comparator
Clock cycles ->
Branch L             IF  ID  EX  MEM WB
Predict not taken        IF  ID  EX  MEM WB
L:                           IF  ID  EX  MEM WB

Improved branching / flushing IF/ID
[Figure: pipelined datapath with the branch hardware moved forward: sign extend, shift left 2 and the extra muxes sit before the ID/EX register, and an IF.Flush signal clears IF/ID. The hazard detection unit and forwarding unit are unchanged.]

Exception support
Types of exceptions:
  Overflow
  I/O device request
  Operating system call
  Undefined instruction
  Hardware malfunction
  Page fault
Precise exception:
  finish previous instructions (which are still in the pipeline)
  flush excepting and following instructions, redo them after handling the exception(s)

Exceptions
Changes needed for handling overflow exception of an operation in EX stage (see book for details):
  Extend PC input mux with extra entry with fixed address
  Add EPC register recording the ID/EX stage PC
    this is the address of the next instruction!
  Cause register recording exception type
  E.g., in case of overflow exception insert 3 bubbles; flush the following stages:
    IF/ID stage
    ID/EX stage
    EX/MEM stage

Scheduling, why?
Let's look at the execution time:
  Texecution = Ncycles x Tcycle
             = Ninstructions x CPI x Tcycle
Scheduling may reduce Texecution
  Reduce CPI (cycles per instruction)
    early scheduling of long latency operations
    avoid pipeline stalls due to structural, data and control hazards
    allow Nissue > 1 and therefore CPI < 1
  Reduce Ninstructions
    compact many operations into each instruction (VLIW)

Scheduling data hazards: example 1
Try and avoid RaW stalls (in this case load interlocks)!
E.g., reorder these instructions:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
Reordered:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t0, 4($t1)
  sw $t2, 0($t1)

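The effect of the reordering can be counted with a simple model. The Python sketch below assumes one stall cycle whenever an instruction reads a register loaded by the instruction directly before it, and treats the data register of sw as such a read (as a basic hazard detection unit would). The function load_use_stalls and the tuple layout are assumptions made for this illustration.

```python
def load_use_stalls(program):
    """program: list of (op, dest, sources). Counts loads directly followed by a reader."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


original = [
    ("lw", "$t0", ["$t1"]),
    ("lw", "$t2", ["$t1"]),
    ("sw", None, ["$t2", "$t1"]),
    ("sw", None, ["$t0", "$t1"]),
]
reordered = [
    ("lw", "$t0", ["$t1"]),
    ("lw", "$t2", ["$t1"]),
    ("sw", None, ["$t0", "$t1"]),
    ("sw", None, ["$t2", "$t1"]),
]
print(load_use_stalls(original), load_use_stalls(reordered))  # → 1 0
```

Swapping the two stores puts one independent instruction between the second load and its use, which removes the interlock in this model.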
Scheduling data hazards: example 2
Avoiding RaW stalls:
  Reordering instructions for following program (by you or the compiler)
Code:
  a = b + c
  d = e - f
Unscheduled code:
  Lw  R1,b
  Lw  R2,c
  Add R3,R1,R2      interlock
  Sw  a,R3
  Lw  R1,e
  Lw  R2,f
  Sub R4,R1,R2      interlock
  Sw  d,R4
Scheduled code:
  Lw  R1,b
  Lw  R2,c
  Lw  R5,e          extra reg. needed!
  Add R3,R1,R2
  Lw  R2,f
  Sw  a,R3
  Sub R4,R5,R2
  Sw  d,R4

Scheduling control hazards
  Texecution = Ninstructions x CPI x Tcycle
  CPI = CPIideal + fbranch x Pbranch
  Pbranch = Ndelayslots x miss_rate
Modern processors tend to have large branch penalty, Pbranch, due to:
  many pipeline stages
  multi-issue
Note that penalties have larger effect when CPIideal is low

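The formulas can be evaluated directly. In the Python sketch below the function names and all numbers (branch frequency 0.2, miss rate 0.5, 3 versus 1 delay slots) are made-up values for illustration, not measurements from the slides.

```python
def cpi_with_branches(cpi_ideal, f_branch, n_delay_slots, miss_rate):
    """CPI = CPIideal + fbranch x Pbranch, with Pbranch = Ndelayslots x miss_rate."""
    p_branch = n_delay_slots * miss_rate
    return cpi_ideal + f_branch * p_branch


def execution_time(n_instructions, cpi, t_cycle):
    """Texecution = Ninstructions x CPI x Tcycle."""
    return n_instructions * cpi * t_cycle


# Hypothetical machine: branch resolved late (3 slots) versus early (1 slot).
cpi_late = cpi_with_branches(1.0, 0.2, 3, 0.5)
cpi_early = cpi_with_branches(1.0, 0.2, 1, 0.5)
print(round(cpi_late, 3), round(cpi_early, 3))

# The same penalty weighs more when CPIideal is low (e.g. a multi-issue machine).
slowdown_single = cpi_with_branches(1.0, 0.2, 3, 0.5) / 1.0
slowdown_multi = cpi_with_branches(0.25, 0.2, 3, 0.5) / 0.25
print(round(slowdown_single, 2), round(slowdown_multi, 2))
```

With these assumed numbers the branch penalty adds 30% to the execution time at CPIideal = 1 but more than doubles it at CPIideal = 0.25.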
Scheduling control hazards
What can we do about control hazards and CPI penalty?
  Keep penalty Pbranch low:
    Early computation of new PC
    Early determination of condition
    Visible branch delay slots filled by compiler (MIPS)
  Branch prediction
  Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro95]
  Remove branches: if-conversion
    Conditional instructions: CMOVE, cond skip next
    Guarding all instructions: TriMedia

Branch delay slot
Add a branch delay slot:
  the next instruction after a branch is always executed
  rely on compiler to fill the slot with something useful
Is this a good idea?
  let's look how it works

Branch delay slot scheduling
Q. What to put in the delay slot?
        op 1
        beq r1,r2, L
        .............
        op 2              'fall-through'
        .............
L:      op 3              branch target
        .............

Summary
Modern processors are (deeply) pipelined, to reduce Tcycle and aim at CPI = 1
Hazards increase CPI
Several software and hardware measures are taken to avoid or reduce hazards
Not discussed, but important developments:
  Multi-issue further reduces CPI
  Branch prediction to avoid high branch penalties
  Dynamic scheduling
In all cases: a scheduling compiler needed

Recap of MIPS
  RISC architecture
  Register space
  Addressing
  Instruction format
  Pipelining

Why RISC? Keep it simple
RISC characteristics:
  Reduced number of instructions
  Limited addressing modes
    load-store architecture enables pipelining
  Large register set
    uniform (no distinction between e.g. address and data registers)
  Limited number of instruction sizes (preferably one)
    know directly where the following instruction starts
  Limited number of instruction formats
  Memory alignment restrictions
  ......
  Based on quantitative analysis
    "the famous MIPS one percent rule": don't even think about it when it's not used more than one percent

Register space
32 integer (and 32 floating point) registers of 32-bit
  Name      Register number   Usage
  $zero     0                 the constant value 0
  $v0-$v1   2-3               values for results and expression evaluation
  $a0-$a3   4-7               arguments
  $t0-$t7   8-15              temporaries
  $s0-$s7   16-23             saved (by callee)
  $t8-$t9   24-25             more temporaries
  $gp       28                global pointer
  $sp       29                stack pointer
  $fp       30                frame pointer
  $ra       31                return address

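Addressing
1. Immediate addressing
     op | rs | rt | Immediate
2. Register addressing
     op | rs | rt | rd | ... | funct       operand: Register (in Registers)
3. Base addressing
     op | rs | rt | Address                operand: Byte, Halfword or Word in Memory, located via Register and Address
4. PC-relative addressing
     op | rs | rt | Address                operand: Word in Memory, located via PC and Address
5. Pseudodirect addressing
     op | Address                          operand: Word in Memory, located via PC and Address
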
Instruction format
  R   op | rs | rt | rd | shamt | funct
  I   op | rs | rt | 16 bit address
  J   op | 26 bit address
Example instructions
  Instruction           Meaning
  add $s1,$s2,$s3       $s1 = $s2 + $s3
  addi $s2,$s3,4        $s2 = $s3 + 4
  lw $s1,100($s2)       $s1 = Memory[$s2+100]
  bne $s4,$s5,L         if $s4<>$s5 goto L
  j Label               goto Label

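To make the R and I formats concrete, the following Python sketch packs two of the example instructions into 32-bit words. The helper names encode_r and encode_i are made up for this illustration; the field widths (6, 5, 5, 5, 5, 6 bits), the opcode 35 for lw and the funct value 32 for add are the standard MIPS values, and the register numbers come from the register table ($s1 = 17, $s2 = 18, $s3 = 19).

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """R format: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm):
    """I format: op(6) rs(5) rt(5) immediate/address(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


S1, S2, S3 = 17, 18, 19

add_word = encode_r(0, S2, S3, S1, 0, 32)  # add $s1,$s2,$s3: rs=$s2, rt=$s3, rd=$s1
lw_word = encode_i(35, S2, S1, 100)        # lw $s1,100($s2): rs=$s2 (base), rt=$s1
print(f"{add_word:08x} {lw_word:08x}")     # → 02538820 8e510064
```

Note that the destination comes first in assembly but sits in the rd field (R format) or rt field (I format) of the machine word.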
Pipelining
All integer instructions fit into the following pipeline
time ->
  IF  ID  EX  MEM WB
      IF  ID  EX  MEM WB
          IF  ID  EX  MEM WB
              IF  ID  EX  MEM WB
                  IF  ID  EX  MEM WB

Other architecture styles
Accumulator architecture
  one operand (in register or memory), accumulator almost always implicitly used
Stack
  zero operand: all operands implicit (on TOS)
Register (load store)
  three operands, all in registers
  loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode)
Register-Memory
  two operands, one in memory
Memory-Memory
  three operands, may be all in memory
(there are more varieties / combinations)

Accumulator architecture
[Figure: datapath with latches, Accumulator, ALU, registers, and Memory reached via "address".]
Example code: a = b+c;
  load b;      // accumulator is implicit operand
  add c;
  store a;

Stack architecture
[Figure: datapath with latches, top of stack, ALU, stack pt, and Memory.]
Example code: a = b+c;
  push b;
  push c;
  add;
  pop a;
Stack contents (top first):
  after push b:   b
  after push c:   c b
  after add:      b+c

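The example can be traced with a tiny interpreter. This Python sketch is illustrative only: the function run_stack_program, the dictionary used as memory and the values b = 2 and c = 3 are assumptions, not part of the slides.

```python
def run_stack_program(program, memory):
    """Execute push/add/pop instructions on an operand stack; returns the memory dict."""
    stack = []
    for instr in program:
        op, *args = instr.split()
        if op == "push":
            stack.append(memory[args[0]])   # push a memory operand
        elif op == "add":
            right = stack.pop()             # both operands are implicit (top of stack)
            left = stack.pop()
            stack.append(left + right)
        elif op == "pop":
            memory[args[0]] = stack.pop()   # store the top of stack to memory
        else:
            raise ValueError(f"unknown operation: {op}")
    return memory


memory = run_stack_program(["push b", "push c", "add", "pop a"], {"b": 2, "c": 3})
print(memory["a"])  # → 5
```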
Other architecture styles
Let's look at the code for C = A + B
  Stack          Accumulator    Register-Memory   Memory-Memory   Register (load-store)
  Architecture   Architecture
  Push A         Load A         Load r1,A         Add C,B,A       Load r1,A
  Push B         Add B          Add r1,B                          Load r2,B
  Add            Store C        Store C,r1                        Add r3,r1,r2
  Pop C                                                           Store C,r3
Q: What are the advantages / disadvantages of load-store (RISC) architecture?


