Krste Asanovic
Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152
1/29/2009
CS152-Spring09
Styles of ISA
Accumulator, Stack, GPR, CISC, RISC, VLIW, Vector
Styles of Implementation
Microcoded
Unpipelined single cycle
Hardwired in-order pipeline
Out-of-order pipeline with speculative execution and register renaming
Software interpreter
Binary translator
Just-in-Time compiler
General-purpose register machines provide greater efficiency with better compiler technology (or assembly coding)
Compilers can explicitly manage fastest level of memory hierarchy (registers)
Microcoding was a straightforward methodical way to implement machines with low gate count
[Figure: datapath diagram; surviving labels: rd, rt, rs, ALU control, RegWrt, enReg, enALU, data]
restrict the next-state encoding: Next, Dispatch on opcode, Wait for memory, ...
encode control signals (vertical microcode)
MIPS Controller V2
[Figure: controller block diagram; labels: Opcode, ext, µPC (state), address, Control ROM, data, Jump Logic]

PCSrc = Case JumpTypes
  next      =>  µPC+1
  spin      =>  if (busy) then µPC else µPC+1
  fetch     =>  absolute
  dispatch  =>  op-group
  feqz      =>  if (zero) then absolute else µPC+1
  fnez      =>  if (zero) then µPC+1 else absolute
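The jump-type cases can be read as a small next-state function. The sketch below is only an illustration of that case statement, not the course's implementation: the function name `next_upc`, the name `upc` for the controller's state register, and the keyword arguments are all assumptions made for this example.

```python
def next_upc(jump_type, upc, *, busy=False, zero=False, absolute=0, op_group=0):
    """Return the next value of the controller state register.

    `absolute` stands for the absolute target taken from the control ROM and
    `op_group` for the dispatch target derived from the opcode; both are
    placeholders chosen for this sketch.
    """
    if jump_type == "next":
        return upc + 1
    if jump_type == "spin":
        # Stay in the same state while memory is busy.
        return upc if busy else upc + 1
    if jump_type == "fetch":
        return absolute
    if jump_type == "dispatch":
        return op_group
    if jump_type == "feqz":
        return absolute if zero else upc + 1
    if jump_type == "fnez":
        return upc + 1 if zero else absolute
    raise ValueError(f"unknown jump type: {jump_type}")


# Example: a state that spins on a busy memory, then advances.
state = 5
state = next_upc("spin", state, busy=True)   # stays in state 5
state = next_upc("spin", state, busy=False)  # advances to state 6
```

Note that feqz and fnez differ only in which outcome of the zero test falls through to the next state.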
Branches: MIPS-Controller-2
State BEQZ0 BEQZ1 BEQZ2 BEQZ3 BEQZ4 BNEZ0 BNEZ1 BNEZ2 BNEZ3 BNEZ4 Control points A Reg[rs] next-state next fnez
Jumps: MIPS-Controller-2
State    Control points            next-state
J0       A ← PC                    next
J1       B ← IR                    next
J2       PC ← JumpTarg(A,B)        fetch
JR0      A ← Reg[rs]               next
JR1      PC ← A                    fetch
JAL0     A ← PC                    next
JAL1     Reg[31] ← A               next
JAL2     B ← IR                    next
JAL3     PC ← JumpTarg(A,B)        fetch
JALR0    A ← PC                    next
JALR1    B ← Reg[rs]               next
JALR2    Reg[31] ← A               next
JALR3    PC ← B                    fetch
rd ← M[(rs)] op (rt)               Reg-Memory-src ALU op
M[(rd)] ← (rs) op (rt)             Reg-Memory-dst ALU op
M[(rd)] ← M[(rs)] op M[(rt)]       Mem-Mem ALU op
Complex instructions usually do not require datapath modifications in a microprogrammed implementation -- only extra space for the control program
Implementing these instructions using a hardwired controller is difficult without datapath modifications
Performance Issues
Microprogrammed control => multiple cycles per instruction
Cycle time? $t_C > \max(t_{reg\text{-}reg}, t_{ALU}, t_{\mu ROM})$
Suppose $10 \cdot t_{\mu ROM} < t_{RAM}$
Good performance, relative to a single-cycle hardwired implementation, can be achieved even with a CPI of 10
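A rough numerical reading of this argument, under stated assumptions: the control ROM access is taken to set the microcoded cycle time, and a single-cycle hardwired design is taken to need a cycle at least as long as one main-memory (RAM) access. The delay values and names below are hypothetical, picked only so that ten ROM accesses take less time than one RAM access.

```python
def microcoded_instruction_time(cpi, t_cycle):
    """Time for one instruction on a microcoded machine: CPI x cycle time."""
    return cpi * t_cycle


# Hypothetical delays in arbitrary time units (not taken from the lecture).
t_rom = 1.0    # control ROM access, assumed to set the microcoded cycle time
t_ram = 12.0   # main memory access, chosen so that 10 * t_rom < t_ram

micro_time = microcoded_instruction_time(cpi=10, t_cycle=t_rom)

# Assumed lower bound for a single-cycle hardwired design: its one cycle
# must cover at least one RAM access.
single_cycle_lower_bound = t_ram

print(micro_time, single_cycle_lower_bound)
```

With these made-up numbers the microcoded machine spends 10 time units per instruction, still below the 12-unit floor assumed for the single-cycle design.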
Nanocoding
Exploits recurring control signal patterns in µcode, e.g.,
  ALU0:  A ← Reg[rs] ...
  ALUi0: A ← Reg[rs] ...
[Figure: two-level control store; labels: µPC (state), µaddress, µcode ROM, µcode next-state, nanoaddress, nanoinstruction ROM, data]
MC68000 had 17-bit µcode containing either 10-bit µjump or 9-bit nanoinstruction pointer
Nanoinstructions were 68 bits wide, decoded to give 196 control signals
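The storage benefit of the two-level scheme can be estimated with a few lines of arithmetic. This is a sketch under stated assumptions: the widths (17 and 68 bits) come from the slide, but the entry counts `n_micro` and `n_nano` are hypothetical values chosen for illustration, not MC68000 figures.

```python
def control_store_bits(n_micro, n_nano, micro_width=17, nano_width=68):
    """Compare a flat control store with a two-level (nanocoded) one.

    flat:      every microinstruction stores the full wide control word
    two_level: narrow microinstructions point into a smaller ROM of
               distinct wide nanoinstructions
    """
    flat = n_micro * nano_width
    two_level = n_micro * micro_width + n_nano * nano_width
    return flat, two_level


# Hypothetical counts: 600 microinstructions sharing 300 distinct patterns.
flat, two_level = control_store_bits(n_micro=600, n_nano=300)
print(flat, two_level)
```

The saving depends entirely on how often control-signal patterns recur: if every microinstruction needed its own nanoinstruction, the two-level store would be larger than the flat one.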
Microcode Emulation
IBM initially miscalculated the importance of software compatibility with earlier models when introducing the 360 series
Honeywell stole some IBM 1401 customers by offering translation software ("Liberator") for the Honeywell H200 series machine
IBM retaliated with optional additional microcode for the 360 series that could emulate the IBM 1401 ISA, later extended for the IBM 7000 series
- one popular program on the 1401 was a 650 simulator, so some customers ran many 650 programs on emulated 1401s (650 simulated on 1401 emulated on 360)
User-WCS failed
Little or no programming tools support
Difficult to fit software into small space
Microcode control tailored to original ISA, less useful for others
Large WCS part of processor state - expensive context switches
Protection difficult if user can change microcode
Virtual memory required restartable microcode
With the advent of VLSI technology, assumptions about ROM & RAM speed became invalid
Better compilers made complex instructions less important
Use of numerous micro-architectural innovations, e.g., pipelining, caches and buffers, made multiple-cycle execution of reg-reg instructions unattractive
Microcode plays an assisting role in most modern micros (AMD Athlon, Intel Core 2 Duo, IBM PowerPC)
Most instructions are executed directly, i.e., with hard-wired control
Infrequently-used and/or complicated instructions invoke the microcode engine
Patchable microcode is common for post-fabrication bug fixes, e.g. Intel Pentiums load microcode patches at bootup
opcode (6) | i (3) | j (3) | k (3)          Ri ← (Rj) op (Rk)
Only Load and Store instructions refer to memory!
opcode (6) | i (3) | j (3) | disp (18)      Ri ← M[(Rj) + disp]
Touching address registers 1 to 5 initiates a load, 6 to 7 initiates a store
- very useful for vector operations
loop:
Decoupling setting of address register (Ar) from retrieving value from data register (Xr) simplifies providing multiple outstanding memory accesses
Software can schedule load of address register before use of value
Can interleave independent instructions in between
Instructions per program depends on source code, compiler technology, and ISA
Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture           CPI
Microcoded                  >1
Single-cycle unpipelined    1
Pipelined                   1
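The three factors multiply to give time per program. The sketch below works that product through for the three microarchitecture styles; the function name, the instruction count, and the cycle times are hypothetical illustration values, and only the CPI column (>1, 1, 1) reflects the slide.

```python
def execution_time(instructions, cpi, cycle_time):
    """Time per program = instructions/program x cycles/instruction x time/cycle."""
    return instructions * cpi * cycle_time


# Hypothetical program of 1000 instructions; cycle times in arbitrary units.
n = 1000
microcoded = execution_time(n, cpi=5, cycle_time=2)      # CPI > 1
single_cycle = execution_time(n, cpi=1, cycle_time=10)   # CPI = 1, long cycle assumed
pipelined = execution_time(n, cpi=1, cycle_time=2)       # CPI = 1, short cycle assumed

print(microcoded, single_cycle, pipelined)
```

The point of the exercise is that a change improving one factor only helps if it does not cost more in another.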
CS152 Administrivia
Check web site for new calendar, quiz dates should not change
Feb 17 and Mar 17 lecture in 320 Soda
All other lectures in 306 Soda (here)
Hardware Elements
Combinational circuits
Mux, Decoder, ALU, ...
[Figure: Mux (inputs A0, A1, ..., An-1; Sel of lg(n) bits), Decoder (input A of lg(n) bits; outputs O0, O1, ..., On-1), ALU (inputs A, B; outputs Result, Comp?)]
ALU OpSelect:
- Add, Sub, ...
- And, Or, Xor, Not, ...
- GT, LT, EQ, Zero, ...
Register Files
[Figure: a register built from flip-flops (ff) with inputs D0, D1, D2, ..., Dn-1, outputs Q0, Q1, Q2, ..., Qn-1, and shared En and Clk; register file read ports rd1, rd2 (ReadData1, ReadData2)]
[Figure: register file implementation; labels: reg 0, reg 1, ..., reg 31, clk, we, wd (32 bits), rs1 (5 bits), rs2 (5 bits), rd1 and rd2 (32 bits)]
MAGIC RAM
Reads and writes are always completed in one cycle
- a Read can be done any time (i.e. combinational)
- a Write is performed at the rising clock edge if it is enabled
  the write address and data must be stable at the clock edge
Later in the course we will present a more realistic model of memory
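The idealized memory described above can be modeled in a few lines. This is a behavioral sketch only, assuming a word-addressed array; the class name `MagicRAM` and the method names `read` and `clock_edge` are invented for this example.

```python
class MagicRAM:
    """Behavioral model: combinational read, write on the rising clock edge."""

    def __init__(self, size):
        self.mem = [0] * size

    def read(self, addr):
        # A read can happen at any time and sees the current contents.
        return self.mem[addr]

    def clock_edge(self, we, addr, wdata):
        # Called once per rising edge; the write happens only if enabled.
        # addr and wdata are assumed stable when this is called.
        if we:
            self.mem[addr] = wdata


ram = MagicRAM(16)
ram.clock_edge(we=True, addr=3, wdata=99)   # enabled write
ram.clock_edge(we=False, addr=3, wdata=7)   # write disabled, contents unchanged
value = ram.read(3)
```

The register file in the earlier figure follows the same pattern, with two read ports instead of one.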
Implementing MIPS:
Single-cycle per instruction datapath & control logic (Should be review of CS61C)
Data types
Instruction Execution
Execution of an instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back
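[Figure: datapath with PC (clk), Inst. Memory (addr, inst), GPRs addressed by inst<25:21>, inst<20:16> and inst<15:11>, ALU, and ALU Control driven by inst<5:0>; control signals OpCode and RegWrite]
Instruction format: opcode = 0 (6 bits, 31:26) | rs (5 bits, 25:21) | rt (5 bits, 20:16) | rd (5 bits, 15:11) | 0 (5 bits, 10:6) | func (6 bits, 5:0)
rd ← (rs) func (rt)
RegWrite Timing?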
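[Figure: datapath with PC (clk), Inst. Memory (addr, inst), GPRs addressed by inst<25:21> and inst<20:16>, Imm Ext fed by inst<15:0>, ALU, and ALU Control fed by inst<31:26>; control signals OpCode and ExtSel]
Instruction format: opcode (6 bits, 31:26) | rs (5 bits, 25:21) | rt (5 bits, 20:16) | immediate (16 bits, 15:0)
rt ← (rs) op immediate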
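The bit positions in the two instruction formats shown so far (register-register and register-immediate) translate directly into shifts and masks. The sketch below is an illustration only; the function name `decode` and the dictionary keys are choices made for this example, and it extracts every field regardless of format, leaving it to the caller to use the ones that apply.

```python
def decode(inst):
    """Split a 32-bit MIPS instruction word into its fields."""
    return {
        "opcode": (inst >> 26) & 0x3F,  # inst<31:26>
        "rs": (inst >> 21) & 0x1F,      # inst<25:21>
        "rt": (inst >> 16) & 0x1F,      # inst<20:16>
        "rd": (inst >> 11) & 0x1F,      # inst<15:11>, register-register only
        "func": inst & 0x3F,            # inst<5:0>,   register-register only
        "imm": inst & 0xFFFF,           # inst<15:0>,  register-immediate only
    }


# Register-register word: opcode 0, rs=1, rt=2, rd=3, func=0x21.
r_word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x21
# Register-immediate word: opcode 8, rs=4, rt=5, immediate=0x1234.
i_word = (8 << 26) | (4 << 21) | (5 << 16) | 0x1234

r_fields = decode(r_word)
i_fields = decode(i_word)
```

Because rs and rt sit in the same bit positions in both formats, the register file read addresses need no format-dependent selection; only the destination register and the second ALU operand do.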
[Figure: merged datapath with PC (clk), Inst. Memory (addr, inst), GPRs, Imm Ext, ALU, ALU Control; control signals OpCode and ExtSel]
Introduce muxes
Formats to be merged:
opcode = 0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)
opcode (6) | rs (5) | rt (5) | immediate
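[Figure: merged datapath with muxes; PC (clk), Inst. Memory (addr), GPRs, Imm Ext, ALU, ALU Control; control signals OpCode, RegDst (rt / rd), ExtSel, OpSel]
Formats supported:
opcode = 0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)
opcode (6) | rs (5) | rt (5) | immediate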
[Figure: datapath with Data Memory (we, addr, wdata, rdata) added after the ALU; the ALU combines base and disp; components PC (clk), Inst. Memory (addr, inst), GPRs, Imm Ext, ALU (z), ALU Control; control signals OpCode, RegDst, ExtSel, OpSel, BSrc]
Instruction format: opcode (6 bits, 31:26) | rs (5 bits, 25:21) | rt (5 bits, 20:16) | displacement (16 bits, 15:0)
rs is the base register
rt is the destination of a Load or the source for a Store
PC-relative branches add offset×4 to PC+4 to calculate the target address (offset is in words): ±128 KB range
Absolute jumps append target×4 to PC<31:28> to calculate the target address: 256 MB range
jump-&-link stores PC+4 into the link register (R31)
All Control Transfers are delayed by 1 instruction
- we will worry about the branch delay slot later
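The two target calculations can be written out as a short sketch. It follows the wording above (branch: sign-extended 16-bit word offset scaled by 4 and added to PC+4; jump: 26-bit target scaled by 4 and placed under PC<31:28>) and ignores the delay slot. The function names are assumptions made for this example.

```python
def sign_extend16(x):
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def branch_target(pc, offset16):
    """PC-relative branch: (PC + 4) + 4 * signed word offset, wrapped to 32 bits."""
    return (pc + 4 + 4 * sign_extend16(offset16)) & 0xFFFFFFFF


def jump_target(pc, target26):
    """Absolute jump: PC<31:28> concatenated with the 26-bit target times 4."""
    return (pc & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)


forward = branch_target(0x1000, 3)          # three words past PC+4
backward = branch_target(0x1000, 0xFFFF)    # offset -1: one word before PC+4
jump = jump_target(0x10001000, 0x100)       # stays within the same 256 MB region
```

The ranges follow from the field widths: a signed 16-bit word offset covers about 2^15 words in each direction (128 KB), and a 26-bit word target covers 2^28 bytes (256 MB).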
[Figure: complete single-cycle datapath; components PC (clk), Inst. Memory (addr, inst), GPRs (clk), Imm Ext, ALU (z), ALU Control, Data Memory (we, addr, wdata, rdata); control signals RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc, and the zero? condition]
combinational logic
Decode Map
ExtSel ( sExt16 , uExt16 , High16 )
Func Op Op + + 0? 0? * * * *
no no no no yes no no no no no no
rd rt rt rt * * * * R31 * R31
pc+4 pc+4 pc+4 pc+4 pc+4 br pc+4 jabs jabs rind rind
We will assume clock period is sufficiently long for all of the following steps to be completed:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time

$t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$
At the rising edge of the following clock, the PC, the register file and the memory are updated
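As a quick check of the single-cycle constraint, the clock period must exceed the sum of the five step delays. The sketch below assumes made-up delay values in arbitrary units; only the form of the sum comes from the slide, and the function name is chosen for this example.

```python
def single_cycle_min_period(t_ifetch, t_rfetch, t_alu, t_dmem, t_rwb):
    """Lower bound on the clock period when all five steps share one cycle."""
    return t_ifetch + t_rfetch + t_alu + t_dmem + t_rwb


# Hypothetical delays (arbitrary units, not measured values).
bound = single_cycle_min_period(t_ifetch=2, t_rfetch=1, t_alu=2, t_dmem=2, t_rwb=1)
print(bound)
```

Every instruction pays this full period, including instructions that skip the data fetch step.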
An Ideal Pipeline
stage 1 -> stage 2 -> stage 3 -> stage 4

All objects go through the same stages
No sharing of resources between any two stages
Propagation delay through all pipeline stages is equal
The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?
Pipelined MIPS
To pipeline MIPS:
First build MIPS without pipelining with CPI=1
Next, add pipeline registers to reduce cycle time while maintaining CPI=1
Pipelined Datapath
[Figure: pipelined datapath; components PC, 0x4 Add, Inst. Memory (addr, rdata), IR, GPRs (we, rs1, rs2, rd1, ws, wd, rd2), Imm Ext, ALU, Data Memory (we, addr, wdata, rdata)]
Phases: fetch | decode & Reg-fetch | execute | memory | write-back
Clock period can be reduced by dividing the execution of an instruction into multiple cycles
$t_C > \max\{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\}$ ($= t_{DM}$ probably)
However, CPI will increase unless instructions are pipelined
Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance
Alternative Pipelining
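[Figure: the same datapath divided into fewer phases; components PC, 0x4 Add, Inst. Memory (addr, rdata), IR, GPRs (we, rs1, rs2, rd1, ws, wd, rd2), Imm Ext, ALU, Data Memory (we, addr, wdata, rdata); phase labels shown: fetch phase, execute phase]
$t_C > \max\{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\}$ becomes $t_C > \max\{t_{IM}, t_{RF} + t_{ALU}, t_{DM} + t_{RW}\}$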
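The effect of merging stages on the clock period can be checked numerically. In the sketch below the stage delays are hypothetical values chosen to show both outcomes: one merge that costs nothing because the slowest stage still dominates, and one that lengthens the critical path. Only the max-over-stages rule comes from the slides; the function and variable names are assumptions for this example.

```python
def min_period(stage_delays):
    """The slowest stage sets the lower bound on the clock period."""
    return max(stage_delays)


# Hypothetical delays in arbitrary units.
t_im, t_rf, t_alu, t_dm, t_rw = 10, 4, 5, 10, 1

five_stage = min_period([t_im, t_rf, t_alu, t_dm, t_rw])
# Merge register fetch with ALU only: 4 + 5 = 9 stays under the memory delay.
four_stage = min_period([t_im, t_rf + t_alu, t_dm, t_rw])
# Also merge data memory with write-back: 10 + 1 = 11 becomes the critical path.
three_stage = min_period([t_im, t_rf + t_alu, t_dm + t_rw])

print(five_stage, four_stage, three_stage)
```

With these numbers the first merge is free, while the second trades a slightly longer cycle for one fewer pipeline register.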
Summary
Microcoding became less attractive as gap between RAM and ROM speeds reduced
Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew
Iron Law explains architecture design space
Trade instruction/program, cycles/instruction, and time/cycle
MIPS ISA will be used in class and problems, SPARC in lab (two very similar ISAs)
Acknowledgements
These slides contain material developed and copyright by:
Arvind (MIT)
Krste Asanovic (MIT/UCB)
Joel Emer (Intel/MIT)
James Hoe (CMU)
John Kubiatowicz (UCB)
David Patterson (UCB)

MIT material derived from course 6.823
UCB material derived from course CS252