L03 Ciscrisc

CS 152 Computer Architecture and Engineering Lecture 3 - From CISC to RISC
Krste Asanovic
Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs152
Instruction Set Architecture (ISA) versus Implementation

ISA is the hardware/software interface
- Defines set of programmer-visible state
- Defines instruction format (bit encoding) and instruction semantics
- Examples: MIPS, x86, IBM 360, JVM
Many possible implementations of one ISA

- 360 implementations: model 30 (c. 1964), z990 (c. 2004)
- x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon, Transmeta Crusoe, SoftPC
- MIPS implementations: R2000, R4000, R10000, ...
- JVM: HotSpot, PicoJava, ARM Jazelle, ...
1/29/2009
CS152-Spring09
Styles of ISA
- Accumulator
- Stack
- GPR
- CISC
- RISC
- VLIW
- Vector
Boundaries are fuzzy, and hybrids are common

- E.g., 8086/87 is hybrid accumulator-GPR-stack ISA
- Many ISAs have added vector extensions
Styles of Implementation
- Microcoded
- Unpipelined single cycle
- Hardwired in-order pipeline
- Out-of-order pipeline with speculative execution and register renaming
- Software interpreter
- Binary translator
- Just-in-Time compiler
Last Time in Lecture 2

Stack machines popular to simplify High-Level Language (HLL) implementation
Algol-68 & Burroughs B5000, Forth machines, Occam & Transputers, Java VMs & Java Interpreters
General-purpose register machines provide greater efficiency with better compiler technology (or assembly coding)
Compilers can explicitly manage fastest level of memory hierarchy (registers)
Microcoding was a straightforward methodical way to implement machines with low gate count
A Bus-based Datapath for MIPS

[Figure: bus-based datapath. The IR (Opcode), the A and B registers feeding the ALU, a register file of 32 GPRs + PC, the memory address register MA, and Memory all share one 32-bit bus. Control signals shown: ldIR, OpSel, ldA, ldB, ldMA, RegSel (rd, rt, rs, 32(PC), 31(Link)), RegWrt, enReg, ExtSel, enImm, enALU, MemWrt, enMem. Status signals: zero?, busy.]

Microinstruction: register to register transfer (17 control signals)

MA ← PC means RegSel = PC; enReg = yes; ldMA = yes
B ← Reg[rt] means RegSel = rt; enReg = yes; ldB = yes
MIPS Microcontroller: first attempt

[Figure: a µPC (state) register, s bits wide, together with the Opcode (6 bits), zero?, and Busy (memory) inputs, forms the address of the µProgram ROM. The ROM data supplies the next state and the Control Signals (17). Latching the inputs may cause a one-cycle delay.]

ROM size? = 2^(opcode+status+s) words
Word size? = control+s bits
How big is "s"?
Reducing Control Store Size

Control store has to be fast ⇒ expensive
Reduce the ROM height (= address bits)
- reduce inputs by extra external logic: each input bit doubles the size of the control store
- reduce states by grouping opcodes: find common sequences of actions
- condense input status bits: combine all exceptions into one, i.e., exception/no-exception
Reduce the ROM width
- restrict the next-state encoding: Next, Dispatch on opcode, Wait for memory, ...
- encode control signals (vertical µcode)
MIPS Controller V2
[Figure: MIPS Controller V2. The Opcode is encoded into an op-group; this input encoding reduces ROM height. Jump logic, driven by zero, busy, and JumpType, sets µPCSrc to choose the next µPC (state) from µPC+1, absolute, or op-group. The µPC addresses the Control ROM, whose data supplies the Control Signals (17); next-state encoding reduces ROM width.]

JumpType = next | spin | fetch | dispatch | feqz | fnez
Jump Logic
µPCSrc = Case µJumpTypes
  next      ⇒ µPC+1
  spin      ⇒ if (busy) then µPC else µPC+1
  fetch     ⇒ absolute
  dispatch  ⇒ op-group
  feqz      ⇒ if (zero) then absolute else µPC+1
  fnez      ⇒ if (zero) then µPC+1 else absolute
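The jump logic case table can be sketched as a small function. This is only an illustration: the function name `next_upc` and the use of strings for jump types are assumptions, not part of the slides.

```python
def next_upc(jump_type, upc, absolute, op_group, zero, busy):
    """Return the next micro-PC for one jump type (illustrative sketch).

    upc      : current micro-PC
    absolute : absolute target held in the microinstruction
    op_group : dispatch target derived from the opcode
    zero     : status of the zero? signal
    busy     : status of the memory busy signal
    """
    if jump_type == "next":
        return upc + 1
    if jump_type == "spin":
        # Stay in the same state while memory is busy.
        return upc if busy else upc + 1
    if jump_type == "fetch":
        return absolute
    if jump_type == "dispatch":
        return op_group
    if jump_type == "feqz":
        return absolute if zero else upc + 1
    if jump_type == "fnez":
        return upc + 1 if zero else absolute
    raise ValueError(f"unknown jump type: {jump_type}")
```

For example, a `spin` state at micro-PC 5 stays at 5 while `busy` is true and advances to 6 once it clears.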
Instruction Fetch & ALU:MIPS-Controller-2

State    Control points            next-state
fetch0   MA ← PC                   next
fetch1   IR ← Memory               spin
fetch2   A ← PC                    next
fetch3   PC ← A + 4                dispatch
...
ALU0     A ← Reg[rs]               next
ALU1     B ← Reg[rt]               next
ALU2     Reg[rd] ← func(A,B)       fetch
ALUi0    A ← Reg[rs]               next
ALUi1    B ← sExt16(Imm)           next
ALUi2    Reg[rd] ← Op(A,B)         fetch
Load & Store: MIPS-Controller-2

State    Control points            next-state
LW0      A ← Reg[rs]               next
LW1      B ← sExt16(Imm)           next
LW2      MA ← A+B                  next
LW3      Reg[rt] ← Memory          spin
LW4                                fetch
SW0      A ← Reg[rs]               next
SW1      B ← sExt16(Imm)           next
SW2      MA ← A+B                  next
SW3      Memory ← Reg[rt]          spin
SW4                                fetch
Branches: MIPS-Controller-2
State    Control points            next-state
BEQZ0    A ← Reg[rs]               next
BEQZ1                              fnez
BEQZ2    A ← PC                    next
BEQZ3    B ← sExt16(Imm<<2)        next
BEQZ4    PC ← A+B                  fetch
BNEZ0    A ← Reg[rs]               next
BNEZ1                              feqz
BNEZ2    A ← PC                    next
BNEZ3    B ← sExt16(Imm<<2)        next
BNEZ4    PC ← A+B                  fetch
Jumps: MIPS-Controller-2
State    Control points            next-state
J0       A ← PC                    next
J1       B ← IR                    next
J2       PC ← JumpTarg(A,B)        fetch
JR0      A ← Reg[rs]               next
JR1      PC ← A                    fetch
JAL0     A ← PC                    next
JAL1     Reg[31] ← A               next
JAL2     B ← IR                    next
JAL3     PC ← JumpTarg(A,B)        fetch
JALR0    A ← PC                    next
JALR1    B ← Reg[rs]               next
JALR2    Reg[31] ← A               next
JALR3    PC ← B                    fetch
Implementing Complex Instructions

[Figure: the bus-based MIPS datapath shown earlier, unchanged.]

rd ← M[(rs)] op (rt)              Reg-Memory-src ALU op
M[(rd)] ← (rs) op (rt)            Reg-Memory-dst ALU op
M[(rd)] ← M[(rs)] op M[(rt)]      Mem-Mem ALU op
Mem-Mem ALU Instructions:

MIPS-Controller-2

Mem-Mem ALU op: M[(rd)] ← M[(rs)] op M[(rt)]

State     Control points           next-state
ALUMM0    MA ← Reg[rs]             next
ALUMM1    A ← Memory               spin
ALUMM2    MA ← Reg[rt]             next
ALUMM3    B ← Memory               spin
ALUMM4    MA ← Reg[rd]             next
ALUMM5    Memory ← func(A,B)       spin
ALUMM6                             fetch

Complex instructions usually do not require datapath modifications in a microprogrammed implementation -- only extra space for the control program.
Implementing these instructions using a hardwired controller is difficult without datapath modifications.
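To get a feel for how many cycles such a microroutine costs, the next-state columns of the fetch and Mem-Mem tables can be walked in a few lines. This is a rough sketch under one simplifying assumption that is not in the slides: every memory access keeps `busy` asserted for the same fixed number of extra cycles (`mem_busy_cycles`, a hypothetical parameter).

```python
# Next-state columns copied from the fetch and Mem-Mem ALU microroutines.
FETCH = ["next", "spin", "next", "dispatch"]
ALUMM = ["next", "spin", "next", "spin", "next", "spin", "fetch"]


def count_cycles(next_states, mem_busy_cycles):
    """Count cycles for a straight-line microroutine.

    Each state takes one cycle; a 'spin' state is repeated for
    mem_busy_cycles additional cycles while memory is busy.
    """
    cycles = 0
    for ns in next_states:
        cycles += 1
        if ns == "spin":
            cycles += mem_busy_cycles
    return cycles


def instruction_cycles(routine, mem_busy_cycles):
    """Cycles for instruction fetch plus the instruction's own microroutine."""
    return count_cycles(FETCH, mem_busy_cycles) + count_cycles(routine, mem_busy_cycles)
```

With an ideal memory (`mem_busy_cycles = 0`) the Mem-Mem ALU op takes 4 + 7 = 11 cycles including fetch; each extra busy cycle adds four more, one per memory access.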
Performance Issues
Microprogrammed control ⇒ multiple cycles per instruction
Cycle time? t_C > max(t_reg-reg, t_ALU, t_µROM)
Suppose 10 * t_µROM < t_RAM
Good performance, relative to a single-cycle hardwired implementation, can be achieved even with a CPI of 10
Horizontal vs Vertical Code

[Figure: axes "Bits per µInstruction" and "# µInstructions".]

Horizontal µcode has wider µinstructions
- Multiple parallel operations per µinstruction
- Fewer steps per macroinstruction
- Sparser encoding ⇒ more bits
Vertical µcode has narrower µinstructions
- Typically a single datapath operation per µinstruction (separate µinstruction for branches)
- More steps per macroinstruction
- More compact ⇒ fewer bits
Nanocoding
- Tries to combine best of horizontal and vertical µcode
Nanocoding
Exploits recurring control signal patterns in µcode, e.g.,
  ALU0    A ← Reg[rs]
  ...
  ALUi0   A ← Reg[rs]
  ...

[Figure: the µPC (state) addresses the µcode ROM; its data gives the µcode next-state and a nanoaddress, which indexes the nanoinstruction ROM that produces the control data.]

- MC68000 had 17-bit µcode containing either 10-bit µjump or 9-bit nanoinstruction pointer
- Nanoinstructions were 68 bits wide, decoded to give 196 control signals
Microprogramming in IBM 360

                         M30      M40      M50      M65
Datapath width (bits)    8        16       32       64
µinst width (bits)       50       52       85       87
µcode size (K µinsts)    4        4        2.75     2.75
µstore technology        CCROS    TCROS    BCROS    BCROS
µstore cycle (ns)        750      625      500      200
memory cycle (ns)        1500     2500     2000     750
Rental fee ($K/month)    4        7        15       35

Only the fastest models (75 and 95) were hardwired
Microcode Emulation
- IBM initially miscalculated the importance of software compatibility with earlier models when introducing the 360 series
- Honeywell stole some IBM 1401 customers by offering translation software ("Liberator") for Honeywell H200 series machine
- IBM retaliated with optional additional microcode for 360 series that could emulate IBM 1401 ISA, later extended for IBM 7000 series
  - one popular program on 1401 was a 650 simulator, so some customers ran many 650 programs on emulated 1401s (650 simulated on 1401 emulated on 360)
Microprogramming thrived in the Seventies

- Significantly faster ROMs than magnetic core memory or DRAMs were available
- For complex instruction sets (CISC), datapath and controller were cheaper and simpler
- New instructions, e.g., floating point, could be supported without datapath modifications
- Fixing bugs in the controller was easier
- ISA compatibility across various models could be achieved easily and cheaply
Except for the cheapest and fastest machines, all computers were microprogrammed
Writable Control Store (WCS)

Implement control store in RAM not ROM
- MOS SRAM memories now became almost as fast as control store (core memories/DRAMs were 2-10x slower)
- Bug-free microprograms difficult to write
User-WCS provided as option on several minicomputers
- Allowed users to change microcode for each processor
User-WCS failed
- Little or no programming tools support
- Difficult to fit software into small space
- Microcode control tailored to original ISA, less useful for others
- Large WCS part of processor state: expensive context switches
- Protection difficult if user can change microcode
- Virtual memory required restartable microcode
Microprogramming: early Eighties

Evolution bred more complex micro-machines
- Ever more complex CISC ISAs led to need for subroutine and call stacks in µcode
- Need for fixing bugs in control programs was in conflict with read-only nature of µROM ⇒ WCS (B1700, QMachine, Intel i432, ...)
With the advent of VLSI technology, assumptions about ROM & RAM speed became invalid
Better compilers made complex instructions less important
Use of numerous micro-architectural innovations, e.g., pipelining, caches and buffers, made multiple-cycle execution of reg-reg instructions unattractive
Microprogramming in Modern Usage

Microprogramming is far from extinct
Played a crucial role in micros of the Eighties
- DEC uVAX, Motorola 68K series, Intel 386 and 486
Microcode plays an assisting role in most modern micros (AMD Athlon, Intel Core 2 Duo, IBM PowerPC)
- Most instructions are executed directly, i.e., with hard-wired control
- Infrequently-used and/or complicated instructions invoke the microcode engine
Patchable microcode common for post-fabrication bug fixes, e.g. Intel Pentiums load µcode patches at bootup
From CISC to RISC

Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
- Can change contents of fast instruction memory to fit what application needs right now
Use simple ISA to enable hardwired pipelined implementation
- Most compiled code only used a few of the available CISC instructions
- Simpler encoding allowed pipelined implementations
Further benefit with integration
- In early 80s, could finally fit 32-bit datapath + small caches on a single chip
- No chip crossings in common case allows faster operation
CDC 6600 Seymour Cray, 1964

- A fast pipelined machine with 60-bit words
- Ten functional units
  - Floating Point: adder, multiplier, divider
  - Integer: adder, multiplier
  - ...
- Hardwired control (no microcoding)
- Dynamic scheduling of instructions using a scoreboard
- Ten Peripheral Processors for Input/Output
  - a fast time-shared 12-bit integer ALU
- Very fast clock, 10MHz
- Novel freon-based technology for cooling
CDC 6600: Datapath

[Figure: CDC 6600 datapath. Central Memory (128K words, 32 banks, 1 µs cycle) connects to the Operand Regs (8 x 60-bit; operand, result), the Address Regs (8 x 18-bit; operand addr, result addr), and the Index Regs (8 x 18-bit). The registers feed 10 Functional Units; instructions pass through the Inst. Stack (8 x 60-bit) to the IR.]
CDC 6600: A Load/Store Architecture

Separate instructions to manipulate three types of reg.
- 8 60-bit data registers (X)
- 8 18-bit address registers (A)
- 8 18-bit index registers (B)

All arithmetic and logic instructions are reg-to-reg
  opcode (6) | i (3) | j (3) | k (3)          Ri ← (Rj) op (Rk)

Only Load and Store instructions refer to memory!
  opcode (6) | i (3) | j (3) | disp (18)      Ri ← M[(Rj) + disp]

Touching address registers 1 to 5 initiates a load, 6 to 7 initiates a store
- very useful for vector operations
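The two instruction layouts can be illustrated with a small pack/unpack sketch. The field widths (6/3/3/3 and 6/3/3/18) come from the slide; placing the opcode in the most significant bits and the helper names `pack_short`, `unpack_short`, and `pack_long` are assumptions made only for this example.

```python
def _check(value, width):
    """Raise if value does not fit in an unsigned field of the given width."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"value {value} does not fit in {width} bits")


def pack_short(opcode, i, j, k):
    """Pack a 15-bit reg-to-reg instruction: opcode(6) i(3) j(3) k(3)."""
    for value, width in ((opcode, 6), (i, 3), (j, 3), (k, 3)):
        _check(value, width)
    return (opcode << 9) | (i << 6) | (j << 3) | k


def unpack_short(word):
    """Return (opcode, i, j, k) from a 15-bit instruction word."""
    return ((word >> 9) & 0x3F, (word >> 6) & 0x7, (word >> 3) & 0x7, word & 0x7)


def pack_long(opcode, i, j, disp):
    """Pack a 30-bit load/store-style instruction: opcode(6) i(3) j(3) disp(18)."""
    for value, width in ((opcode, 6), (i, 3), (j, 3), (disp, 18)):
        _check(value, width)
    return (opcode << 24) | (i << 21) | (j << 18) | disp
```

The three-bit i, j, k fields are what limit each register class to eight registers.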
CDC6600: Vector Addition
      B0 ← -n
loop: JZE B0, exit
      A0 ← B0 + a0      load X0
      A1 ← B0 + b0      load X1
      X6 ← X0 + X1
      A6 ← B0 + c0      store X6
      B0 ← B0 + 1
      jump loop

Ai = address register
Bi = index register
Xi = data register
CDC6600 ISA designed to simplify high-performance implementation

Use of three-address, register-register ALU instructions simplifies pipelined implementation
- No implicit dependencies between inputs and outputs
Decoupling setting of address register (Ar) from retrieving value from data register (Xr) simplifies providing multiple outstanding memory accesses
- Software can schedule load of address register before use of value
- Can interleave independent instructions in between
CDC6600 has multiple parallel but unpipelined functional units
- E.g., 2 separate multipliers
Follow-on machine CDC7600 used pipelined functional units
- Foreshadows later RISC designs
Iron Law of Processor Performance

Time / Program = (Instructions / Program) * (Cycles / Instruction) * (Time / Cycle)

- Instructions per program depends on source code, compiler technology, and ISA
- Cycles per instruction (CPI) depends upon the ISA and the microarchitecture
- Time per cycle depends upon the microarchitecture and the base technology

Microarchitecture            CPI    cycle time
Microcoded                   >1     short
Single-cycle unpipelined     1      long
Pipelined                    1      short

[Slide annotation on the table: "this lecture"]
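A quick numeric illustration of the Iron Law follows. All numbers are hypothetical and chosen only to mirror the qualitative table (CPI > 1 with a short cycle, CPI = 1 with a long cycle, CPI = 1 with a short cycle); they are not measurements of any real machine.

```python
def exec_time_ns(instructions, cpi, cycle_time_ns):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_ns


# Hypothetical design points for the same 1,000,000-instruction program.
PROGRAM_INSTRUCTIONS = 1_000_000
designs = {
    "microcoded":   {"cpi": 10, "cycle_time_ns": 10},   # CPI > 1, short cycle
    "single-cycle": {"cpi": 1,  "cycle_time_ns": 120},  # CPI = 1, long cycle
    "pipelined":    {"cpi": 1,  "cycle_time_ns": 30},   # CPI = 1, short cycle
}

times = {
    name: exec_time_ns(PROGRAM_INSTRUCTIONS, d["cpi"], d["cycle_time_ns"])
    for name, d in designs.items()
}

for name, t in times.items():
    print(f"{name:13s} {t / 1e6:.0f} ms")
```

With these made-up numbers the microcoded design beats the single-cycle one despite a CPI of 10, because its cycle is more than ten times shorter, and the pipelined design beats both.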
CS152 Administrivia
- Check web site for new calendar; quiz dates should not change
  - Feb 17 and Mar 17 lecture in 320 Soda
  - All other lectures in 306 Soda (here)
- PS1 and Lab 1 available now or tomorrow
  - PS 1 / Lab 1 due Tuesday February 10
- Section tomorrow (Friday 1/30) 12-1pm 258 Dwinelle
  - Covers lab 1 details
- Quiz 1 on Thursday Feb 12
Hardware Elements
Combinational circuits
- Mux, Decoder, ALU, ...

[Figure: a Mux with inputs A0, A1, ..., An-1 and a lg(n)-bit Sel; a Decoder with a lg(n)-bit input A and outputs O0, O1, ..., On-1; an ALU with inputs A and B, outputs Result and Comp?, and an OpSelect input choosing among Add, Sub, ...; And, Or, Xor, Not, ...; GT, LT, EQ, Zero, ...]

Synchronous state elements
- Flipflop, Register, Register file, SRAM, DRAM

[Figure: a flip-flop (ff) with inputs D, En, Clk and output Q, with a timing diagram of Clk, En, D, Q.]

Edge-triggered: Data is sampled at the rising edge
Register Files
[Figure: a register built from flip-flops (ff) that share En and Clk, with inputs D0, D1, D2, ..., Dn-1 and outputs Q0, Q1, Q2, ..., Qn-1.]

[Figure: Register file 2R+1W. Inputs: Clock, WE (we), ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), WriteData (wd). Outputs: ReadData1 (rd1), ReadData2 (rd2).]

Reads are combinational
Register File Implementation

[Figure: register file implementation with 32 registers (reg 0, reg 1, ..., reg 31), clk, we, 5-bit ws, rs1, and rs2, and 32-bit wd, rd1, and rd2.]

Register files with a large number of ports are difficult to design
- Almost all MIPS instructions have exactly 2 register source operands
- Intel's Itanium GPR file has 128 registers with 8 read ports and 4 write ports!
A Simple Memory Model

[Figure: MAGIC RAM with inputs WriteEnable, Clock, Address, WriteData and output ReadData.]

Reads and writes are always completed in one cycle
- a Read can be done any time (i.e. combinational)
- a Write is performed at the rising clock edge if it is enabled
  ⇒ the write address and data must be stable at the clock edge
Later in the course we will present a more realistic model of memory
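The idealized memory model can be mimicked in software as follows. This is a behavioral sketch only; the class name `MagicRAM` and the method `clock_edge` are assumptions for illustration.

```python
class MagicRAM:
    """Idealized memory: combinational reads, writes at the rising clock edge."""

    def __init__(self, size_words):
        self.mem = [0] * size_words

    def read(self, address):
        # Combinational read: available at any time, no clock involved.
        return self.mem[address]

    def clock_edge(self, write_enable, address, write_data):
        # A write takes effect only at the rising clock edge, and only if
        # enabled; address and data must already be stable at that point.
        if write_enable:
            self.mem[address] = write_data


ram = MagicRAM(16)
ram.clock_edge(write_enable=True, address=3, write_data=42)
value_after_write = ram.read(3)
ram.clock_edge(write_enable=False, address=3, write_data=7)  # ignored: not enabled
value_after_disabled_write = ram.read(3)
```

A write with `write_enable` deasserted leaves the stored word unchanged, which is why the second read still sees the first value.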
Implementing MIPS:
Single-cycle per instruction datapath & control logic (Should be review of CS61C)
The MIPS ISA

Processor State
- 32 32-bit GPRs, R0 always contains a 0
- 32 single precision FPRs, may also be viewed as 16 double precision FPRs
- FP status register, used for FP compares & exceptions
- PC, the program counter
- some other special registers
Data types
- 8-bit byte, 16-bit half word
- 32-bit word for integers
- 32-bit word for single precision floating point
- 64-bit word for double precision floating point
Load/Store style instruction set
- data addressing modes: immediate & indexed
- branch addressing modes: PC relative & register indirect
- Byte addressable memory: big endian mode
All instructions are 32 bits
Instruction Execution
Execution of an instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back
and the computation of the address of the next instruction
Datapath: Reg-Reg ALU Instructions

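[Figure: datapath for reg-reg ALU instructions. PC addresses Inst. Memory and an adder computes PC + 0x4. inst<25:21> and inst<20:16> drive the GPR read selects rs1 and rs2, inst<15:11> drives ws, and inst<5:0> drives ALU Control, which sets the ALU OpCode. The ALU result returns to wd; RegWrite drives we.]

Instruction format:
  field    0 (opcode)   rs       rt       rd       0        func
  bits     31:26        25:21    20:16    15:11    10:6     5:0
  width    6            5        5        5        5        6

rd ← (rs) func (rt)

RegWrite Timing?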
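The field extraction that the datapath wires perform (inst<25:21>, inst<20:16>, inst<15:11>, inst<5:0>) can be written out as shifts and masks. A minimal sketch follows; the function name `decode_rtype` and the sample field values are illustrative assumptions.

```python
def decode_rtype(inst):
    """Split a 32-bit reg-reg instruction word into its fields."""
    return {
        "opcode": (inst >> 26) & 0x3F,  # inst<31:26>
        "rs":     (inst >> 21) & 0x1F,  # inst<25:21>
        "rt":     (inst >> 16) & 0x1F,  # inst<20:16>
        "rd":     (inst >> 11) & 0x1F,  # inst<15:11>
        "func":   inst & 0x3F,          # inst<5:0>
    }


# Build a sample word from arbitrary field values: rs=1, rt=2, rd=3, func=0x20.
sample = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x20
fields = decode_rtype(sample)
```

In hardware none of this costs logic: each field is just a fixed slice of the instruction register's wires.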
Datapath: Reg-Imm ALU Instructions

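[Figure: datapath for reg-imm ALU instructions. inst<25:21> drives the GPR read select, inst<20:16> drives the write select, inst<15:0> feeds Imm Ext (controlled by ExtSel) into the ALU, and inst<31:26> drives ALU Control, which sets the ALU OpCode. RegWrite enables the GPR write.]

Instruction format:
  field    opcode    rs       rt       immediate
  bits     31:26     25:21    20:16    15:0
  width    6         5        5        16

rt ← (rs) op immediate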
Conflicts in Merging Datapath

[Figure: the reg-reg and reg-imm datapaths overlaid. The instruction fields inst<25:21>, inst<20:16>, inst<15:11>, inst<15:0>, inst<31:26>, and inst<5:0> compete for the same GPR, Imm Ext, and ALU Control inputs.]

Introduce muxes

  opcode 0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)      rd ← (rs) func (rt)
  opcode (6)   | rs (5) | rt (5) | immediate (16)                 rt ← (rs) op immediate
Datapath for ALU Instructions

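[Figure: merged datapath for ALU instructions. A RegDst mux selects rt or rd as the GPR write select, and a BSrc mux selects Reg or Imm as the ALU's second input. ALU Control takes <31:26> and <5:0> and produces OpSel; ExtSel controls Imm Ext; RegWrite enables the GPR write.]

RegDst = rt / rd
BSrc = Reg / Imm

  opcode 0 (6) | rs (5) | rt (5) | rd (5) | 0 (5) | func (6)      rd ← (rs) func (rt)
  opcode (6)   | rs (5) | rt (5) | immediate (16)                 rt ← (rs) op immediate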
Datapath for Memory Instructions

Should program and data memory be separate?
Harvard style: separate (Aiken and Mark 1 influence)
- read-only program memory
- read/write data memory
- Note: Somehow there must be a way to load the program memory
Princeton style: the same (von Neumann's influence)
- single read/write memory for program and data
- Note: A Load or Store instruction requires accessing the memory more than once during its execution
Load/Store Instructions:Harvard Datapath

[Figure: Harvard datapath for loads and stores. The ALU adds base (from the GPRs) and disp (from Imm Ext) to form the Data Memory addr; wdata comes from the GPRs and rdata returns through the WBSrc mux (ALU / Mem). Control signals: RegWrite, MemWrite, WBSrc, RegDst, ExtSel, OpSel, BSrc.]

Instruction format:
  field    opcode    rs       rt       displacement
  bits     31:26     25:21    20:16    15:0
  width    6         5        5        16

addressing mode: (rs) + displacement
rs is the base register
rt is the destination of a Load or the source for a Store
MIPS Control Instructions

Conditional (on GPR) PC-relative branch
  opcode (6) | rs (5) | (5) | offset (16)        BEQZ, BNEZ
Unconditional register-indirect jumps
  opcode (6) | rs (5) | (5) | (16)               JR, JALR
Unconditional absolute jumps
  opcode (6) | target (26)                       J, JAL

- PC-relative branches add offset×4 to PC+4 to calculate the target address (offset is in words): ±128 KB range
- Absolute jumps append target×4 to PC<31:28> to calculate the target address: 256 MB range
- jump-&-link stores PC+4 into the link register (R31)
- All Control Transfers are delayed by 1 instruction
  - we will worry about the branch delay slot later
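The two target-address rules can be checked with a few lines of arithmetic. This is a sketch of the rules as stated on the slide (offset scaled by 4 and added to PC+4; target scaled by 4 and appended to PC<31:28>); the function names are assumptions, and the branch delay slot is ignored.

```python
def sext16(value):
    """Sign-extend a 16-bit field to a Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    """PC-relative branch: (PC + 4) + offset * 4, with offset counted in words."""
    return (pc + 4 + (sext16(offset16) << 2)) & 0xFFFFFFFF


def jump_target(pc, target26):
    """Absolute jump: PC<31:28> concatenated with target * 4."""
    return (pc & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)


# Reach of each form: a signed 16-bit word offset and a 26-bit word target.
BRANCH_REACH_BYTES = (1 << 15) * 4   # 128 KB in each direction
JUMP_REGION_BYTES = (1 << 26) * 4    # 256 MB region
```

An offset of -1 (0xFFFF) therefore branches back to the branch instruction itself, since (PC + 4) - 4 = PC.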
Conditional Branches (BEQZ, BNEZ)

[Figure: datapath extended for conditional branches. A second adder computes the branch target br, and the PCSrc mux selects br or pc+4 as the next PC. The zero? test feeds the control logic. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc.]
Register-Indirect Jumps (JR)

[Figure: datapath extended for JR. The PCSrc mux now selects among br, rind, and pc+4. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc, zero?.]
Register-Indirect Jump-&-Link (JALR)

[Figure: datapath extended for JALR. PCSrc selects among br, rind, and pc+4, and the constant 31 is added as a write-register choice so the link register can be written. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc, zero?.]
Absolute Jumps (J, JAL)

[Figure: datapath extended for J and JAL. The PCSrc mux now selects among br, rind, jabs, and pc+4. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc, zero?.]
Harvard-Style Datapath for MIPS

[Figure: complete Harvard-style datapath for MIPS: PC, Inst. Memory, GPRs, Imm Ext, ALU, and Data Memory, with the PCSrc mux selecting among br, rind, jabs, and pc+4. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc, zero?.]
Hardwired Control is pure Combinational Logic

[Figure: a combinational logic block with inputs op code and zero? and outputs ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite, PCSrc.]
ALU Control & Immediate Extension

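[Figure: ALU control and immediate extension. Inst<5:0> (Func), Inst<31:26> (Opcode), the fixed operations + and 0?, and ALUop feed the logic that produces OpSel; a Decode Map produces ExtSel.]

OpSel (Func, Op, +, 0?)
ExtSel (sExt16, uExt16, High16)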
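The three ExtSel choices can be sketched as a function on the 16-bit immediate. The slide only names the modes; reading High16 as "place the immediate in the upper 16 bits" is an assumption, as is the function name `imm_ext`.

```python
def imm_ext(imm16, ext_sel):
    """Extend a 16-bit immediate to 32 bits according to ExtSel."""
    imm16 &= 0xFFFF
    if ext_sel == "sExt16":
        # Sign extension: replicate bit 15 into bits 31:16.
        return (imm16 | 0xFFFF0000) if imm16 & 0x8000 else imm16
    if ext_sel == "uExt16":
        # Zero extension: upper 16 bits are 0.
        return imm16
    if ext_sel == "High16":
        # Assumed meaning: immediate goes to bits 31:16, low half is 0.
        return imm16 << 16
    raise ValueError(f"unknown ExtSel: {ext_sel}")
```

For the immediate 0x8000 the three modes give 0xFFFF8000, 0x00008000, and 0x80000000 respectively.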
Hardwired Control Table

Opcode      ExtSel    BSrc   OpSel   MemW   RegW   WBSrc   RegDst   PCSrc
ALU         *         Reg    Func    no     yes    ALU     rd       pc+4
ALUi        sExt16    Imm    Op      no     yes    ALU     rt       pc+4
ALUiu       uExt16    Imm    Op      no     yes    ALU     rt       pc+4
LW          sExt16    Imm    +       no     yes    Mem     rt       pc+4
SW          sExt16    Imm    +       yes    no     *       *        pc+4
BEQZ z=0    sExt16    *      0?      no     no     *       *        br
BEQZ z=1    sExt16    *      0?      no     no     *       *        pc+4
J           *         *      *       no     no     *       *        jabs
JAL         *         *      *       no     yes    PC      R31      jabs
JR          *         *      *       no     no     *       *        rind
JALR        *         *      *       no     yes    PC      R31      rind

BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs
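Because hardwired control is purely combinational, the table can be modelled as a lookup with no state. The sketch below transcribes the rows into a dictionary; the names `CONTROL`, `FIELDS`, and `control_signals` are illustrative assumptions, and `"*"` stands for don't-care as in the table.

```python
FIELDS = ("ExtSel", "BSrc", "OpSel", "MemW", "RegW", "WBSrc", "RegDst", "PCSrc")

# One row per opcode class, transcribed from the hardwired control table.
CONTROL = {
    "ALU":      ("*",      "Reg", "Func", "no",  "yes", "ALU", "rd",  "pc+4"),
    "ALUi":     ("sExt16", "Imm", "Op",   "no",  "yes", "ALU", "rt",  "pc+4"),
    "ALUiu":    ("uExt16", "Imm", "Op",   "no",  "yes", "ALU", "rt",  "pc+4"),
    "LW":       ("sExt16", "Imm", "+",    "no",  "yes", "Mem", "rt",  "pc+4"),
    "SW":       ("sExt16", "Imm", "+",    "yes", "no",  "*",   "*",   "pc+4"),
    "BEQZ z=0": ("sExt16", "*",   "0?",   "no",  "no",  "*",   "*",   "br"),
    "BEQZ z=1": ("sExt16", "*",   "0?",   "no",  "no",  "*",   "*",   "pc+4"),
    "J":        ("*",      "*",   "*",    "no",  "no",  "*",   "*",   "jabs"),
    "JAL":      ("*",      "*",   "*",    "no",  "yes", "PC",  "R31", "jabs"),
    "JR":       ("*",      "*",   "*",    "no",  "no",  "*",   "*",   "rind"),
    "JALR":     ("*",      "*",   "*",    "no",  "yes", "PC",  "R31", "rind"),
}


def control_signals(row):
    """Return the control signals for one table row as a field -> value dict."""
    return dict(zip(FIELDS, CONTROL[row]))
```

Looking up `control_signals("LW")` shows the load writing the register file from memory (WBSrc = Mem, RegDst = rt), while SW is the only row that asserts MemW.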
Single-Cycle Hardwired Control:

Harvard architecture

We will assume clock period is sufficiently long for all of the following steps to be "completed":
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time

⇒ t_C > t_IFetch + t_RFetch + t_ALU + t_DMem + t_RWB

At the rising edge of the following clock, the PC, the register file and the memory are updated
An Ideal Pipeline
[Figure: stage 1 → stage 2 → stage 3 → stage 4]

- All objects go through the same stages
- No sharing of resources between any two stages
- Propagation delay through all pipeline stages is equal
- The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?
Pipelined MIPS
To pipeline MIPS:
- First build MIPS without pipelining with CPI=1
- Next, add pipeline registers to reduce cycle time while maintaining CPI=1
Pipelined Datapath
[Figure: pipelined datapath (PC and 0x4 adder, Inst. Memory, IR, GPRs and Imm Ext, ALU, Data Memory) divided into fetch phase, decode & Reg-fetch phase, execute phase, memory phase, and write-back phase.]

Clock period can be reduced by dividing the execution of an instruction into multiple cycles
t_C > max{t_IM, t_RF, t_ALU, t_DM, t_RW} (= t_DM probably)
However, CPI will increase unless instructions are pipelined
How to divide the datapath into stages

Suppose memory is significantly slower than other stages. In particular, suppose
  t_IM  = 10 units
  t_DM  = 10 units
  t_ALU = 5 units
  t_RF  = 1 unit
  t_RW  = 1 unit
Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance
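The effect of merging stages can be checked directly with the delays given above. This sketch treats the cycle time as the delay of the slowest stage and ignores pipeline register overhead; the particular groupings tried here are illustrative choices.

```python
# Unit delays from the slide.
T = {"IM": 10, "DM": 10, "ALU": 5, "RF": 1, "RW": 1}


def cycle_time(stages):
    """Clock period lower bound: the slowest stage, where a stage's delay is
    the sum of the delays of the units grouped into it."""
    return max(sum(T[unit] for unit in stage) for stage in stages)


single_cycle = [["IM", "RF", "ALU", "DM", "RW"]]
five_stage = [["IM"], ["RF"], ["ALU"], ["DM"], ["RW"]]
merge_rf_alu = [["IM"], ["RF", "ALU"], ["DM"], ["RW"]]
merge_dm_rw = [["IM"], ["RF"], ["ALU"], ["DM", "RW"]]

results = {
    "single-cycle": cycle_time(single_cycle),
    "five-stage": cycle_time(five_stage),
    "RF+ALU merged": cycle_time(merge_rf_alu),
    "DM+RW merged": cycle_time(merge_dm_rw),
}
```

Merging register fetch with the ALU leaves the clock at 10 units because that stage is still faster than memory, whereas merging write-back into the memory stage stretches the clock from 10 to 11 units.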
Alternative Pipelining
[Figure: the pipelined datapath with its fetch phase, decode & Reg-fetch phase, execute phase, memory phase, and write-back phase.]

Write-back stage takes much less time than other stages. Suppose we combined it with the memory phase:
t_C > max{t_IM, t_RF, t_ALU, t_DM + t_RW} = t_DM + t_RW
⇒ increase the critical path by 10%

With the decode & Reg-fetch and execute phases also combined:
t_C > max{t_IM, t_RF + t_ALU, t_DM + t_RW} = t_DM + t_RW
Summary
- Microcoding became less attractive as gap between RAM and ROM speeds reduced
- Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew
- Iron Law explains architecture design space
  - Trade instructions/program, cycles/instruction, and time/cycle
- Load-Store RISC ISAs designed for efficient pipelined implementations
  - Very similar to vertical microcode
  - Inspired by earlier Cray machines
- MIPS ISA will be used in class and problems, SPARC in lab (two very similar ISAs)
Acknowledgements
These slides contain material developed and copyright by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
MIT material derived from course 6.823
UCB material derived from course CS252
