CIT 3007
Semester 2 – 2008/9
Computer Architecture
REFERENCES
David A. Patterson & John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann.
Quizzes = 15%
On-going (done in tutorials)
Research/Presentations = 20%
Final exam = 50%
Total = 100%
UNIT 1 – Introduction to Computer
Architecture
Overview and history
The cost factor
Memory hierarchy
Hardware/Software interface
Computer abstractions and technology
Hierarchical approach to understanding & designing a complex
system
Software
Hardware
Computer organization
Processor
Control
Datapath
Memory
Input & output
Common themes:
Design / structure
Art
System
Tool for programmer and application
Interface
Computer Architecture
refers to those attributes of the system
that are visible to a programmer -- those
attributes that have a direct impact on the
execution of a program, including
Instruction sets
Data representations
Addressing
I/O
Architecture vs Organisation
Organisation
Synonymous with “architecture” in many uses and
textbooks
Covers details that are transparent to the programmer, e.g.
Multiplexers
Circuit Delays
Reference: Chapter 4 – Hennessy & Patterson
Circuit delay is an important aspect of circuit design: it determines the speed at which the circuit can operate.
Example
A PC has a clock frequency of 800 MHz. That means that every 1.25 ns (the period T = 1/frequency) the PC performs a computation step.
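A quick check of that number, as a small Python sketch (the 800 MHz figure is the one from the example above):

    freq_hz = 800e6            # 800 MHz clock
    period_s = 1 / freq_hz     # T = 1 / frequency
    print(period_s * 1e9)      # -> 1.25 (ns)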
Circuit Delays
The logic circuit
Circuits that perform the same function can
vary significantly in their speeds. A good
example is an adder circuit.
Example
In a ripple-carry adder the carry signal has to travel (ripple) through all the stages of the adder, so the final Sum and Carry bits are valid only after a considerable delay.
n-bit Carry Ripple Adders
The propagation delay in each full adder to
produce the carry is equal to two gate delays
= 2t
Since the generation of the sum requires the
propagation of the carry from the lowest
position to the highest position, the total
propagation delay of the adder is
approximately:
Total Propagation delay = 2nt
Additional delay may result from other delays
associated with interconnections.
n-bit Carry Ripple Adders
For a 4-bit adder (inputs X3..X0 and Y3..Y0, sum outputs S3..S0):
Propagation delay = 2nt = 8t, or 8 gate delays
[Figure: 4-bit ripple-carry adder with sum outputs S3 S2 S1 S0]
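The 2nt rule is easy to tabulate; a small Python sketch (the gate delay t is left as a parameter):

    def ripple_carry_delay(n, t=1):
        # each full adder needs two gate delays (2t) to produce its carry,
        # and the carry must ripple through all n stages
        return 2 * n * t

    print(ripple_carry_delay(4))    # 8 gate delays for a 4-bit adder
    print(ripple_carry_delay(32))   # 64 gate delays for a 32-bit adder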
Carry Look-Ahead Adders
designed as a faster way to add two binary
numbers thus reducing computation time
solve problem of ripple adder by calculating
the carry signals in advance, based on the
input signals
work by creating Propagate and Generate
signals (P and G) for each bit position, based
on whether a carry is propagated through
from a less significant bit position
Carry Look-Ahead Adders
a carry signal will be generated in two cases:
when both bits Ai and Bi are 1, or
when one of the two bits is 1 and the incoming carry Ci is 1
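These two cases are the standard generate/propagate terms: Gi = Ai AND Bi, Pi = Ai XOR Bi, Ci+1 = Gi OR (Pi AND Ci), Si = Pi XOR Ci. A small Python sketch of a 4-bit slice; note that real hardware expands all carry equations so they are evaluated in parallel, whereas the loop below computes them one after another just to keep the sketch short:

    def cla_4bit(a_bits, b_bits, c0=0):
        # a_bits, b_bits: lists of 4 bits, index 0 = least significant
        g = [a & b for a, b in zip(a_bits, b_bits)]   # generate:  Gi = Ai AND Bi
        p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate: Pi = Ai XOR Bi
        c = [c0]
        for i in range(4):
            c.append(g[i] | (p[i] & c[i]))            # Ci+1 = Gi OR (Pi AND Ci)
        s = [p[i] ^ c[i] for i in range(4)]           # sum: Si = Pi XOR Ci
        return s, c[4]

    # 6 + 3 = 9: sum bits (LSB first) [1, 0, 0, 1], carry-out 0
    print(cla_4bit([0, 1, 1, 0], [1, 1, 0, 0]))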
Propagation Delay
What is the maximum propagation delay of this ALU?
The propagation delay will depend on:
1. The propagation delay of the individual gates.
2. The propagation delay of the multiplexers.
3. The propagation delay of the one-bit adder.
[Figure: 1-bit ALU]
Propagation Delay
The propagation delay of the multiplexer is 19ns:
OR Gate - 7.0ns
Propagation Delay
The propagation delay of the ALU:
Unit 2 – Processor Performance
Objective:
After the completion of this unit the student should be
able to determine processor performance:
a. Performance metrics and evaluating computer designs
b. Measuring performance.
Clock rates
Clock time
Execution time
Cycles per instruction (CPI)
c. Evaluating and comparing performance
Amdahl’s Law
MIPS
MFLOPS
Benchmarks
Bandwidth
Computer Performance
Reference: Chapter 2 – Hennessy & Patterson
Throughput
The amount of digital data per time unit that
is delivered over a physical or logical link.
Computer Performance
System resource
Any physical or virtual component of limited
availability within a computer system.
High availability
A system design protocol and associated
implementation that ensures a certain
absolute degree of operational continuity
during a given measurement period.
Computer Performance
Metrics (measurement) include:
CPU time = (IC x CPI) / clock rate, where IC = instruction count
Evaluating Performance
Definition of time (CPU time):
CPI = CPU clock cycles / IC
Therefore,
CPU time = IC x CPI x clock cycle time  (units: seconds / program)
For example
Op       F      CPI   CPI x F   % Time
ALU      50%    1     0.5       23
Load     20%    5     1.0       45
Store    10%    3     0.3       14
Branch   20%    2     0.4       18
Total    100%         2.2       100
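The table above can be reproduced directly from the definitions; a small Python sketch (the instruction count and the 2 ns clock at the end are made-up values, just to show the CPU-time formula in use):

    mix = {                # op: (frequency, CPI)
        "ALU":    (0.50, 1),
        "Load":   (0.20, 5),
        "Store":  (0.10, 3),
        "Branch": (0.20, 2),
    }

    avg_cpi = sum(f * cpi for f, cpi in mix.values())        # 2.2
    for op, (f, cpi) in mix.items():
        print(op, round(100 * f * cpi / avg_cpi), "% of time")

    ic, cycle_time = 1_000_000, 2e-9        # assumed IC and 2 ns clock
    print(ic * avg_cpi * cycle_time)        # IC x CPI x cycle time = 0.0044 s (4.4 ms)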
Evaluating Performance
Estimating Performance Improvements
speedup = (Execution time for entire task without using the enhancement) / (Execution time for entire task using the enhancement when possible)
Evaluating Performance
Amdahl’s Law
Execution time_new = Execution time_old x [(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup_overall = Execution time_old / Execution time_new
                = 1 / [(1 – Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Evaluating Performance
Example:
You have a system that contains a special
processor for doing floating-point operations.
You have determined that 50% of your
computations can use the floating-point
processor. The speedup of the floating-point
processor is 15. What is the
overall speedup?
Evaluating Performance
Overall speedup achieved by using the
floating-point processor.
Overall speedup = 1 / [(1 – 0.5) + 0.5/15]
                = 1 / (0.5 + 0.033)
                ≈ 1.875
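Amdahl's Law is straightforward to code up; a minimal Python sketch, applied to the floating-point example above:

    def amdahl_speedup(fraction_enhanced, speedup_enhanced):
        # Speedup_overall = 1 / ((1 - F) + F / S)
        return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

    print(amdahl_speedup(0.5, 15))    # -> 1.875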
Evaluating Performance
MIPS
Defined as millions of instructions per second
Problems:
» Rating of a machine is based on its
instruction set -- how do you compare
machines with very different instruction sets?
Apples and Oranges
Evaluating Performance
MIPS
rating can vary on a single computer based on the program being executed
Example: assume a 20 ns clock (50 MHz), an instruction mix of ALU 43% (CPI 1), loads 21% (CPI 2), stores 12% (CPI 2) and branches 24% (CPI 2), and an optimizing compiler that eliminates 50% of all ALU operations.
Evaluating Performance
MIPS
Solution (NOT Optimized):
Avg. CPI = 0.43 x 1 + 0.21 x 2 + 0.12 x 2 + 0.24 x 2 = 1.57
MIPS = 50 MHz / (1.57 x 10^6) = (50 x 10^6) / (1.57 x 10^6) = 31.8
Evaluating Performance
Optimized:
Avg. CPI = (0.43/2 x 1 + 0.21 x 2 + 0.12 x 2 + 0.24 x 2) / (1 – 0.43/2) = 1.73
MIPS = 50 MHz / (1.73 x 10^6) = (50 x 10^6) / (1.73 x 10^6) = 28.9
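Both MIPS figures follow from the same two formulas; a small Python sketch of the calculation above:

    def mips_rating(clock_hz, cpi):
        # MIPS = clock rate / (CPI x 10^6)
        return clock_hz / (cpi * 1e6)

    clock_hz = 50e6                                       # 20 ns clock -> 50 MHz
    mix = [(0.43, 1), (0.21, 2), (0.12, 2), (0.24, 2)]    # ALU, load, store, branch

    cpi = sum(f * c for f, c in mix)                      # 1.57
    print(cpi, mips_rating(clock_hz, cpi))                # ~31.8 MIPS

    # optimized: half of the ALU operations removed
    opt_cpi = (0.43 / 2 + 0.21 * 2 + 0.12 * 2 + 0.24 * 2) / (1 - 0.43 / 2)
    print(opt_cpi, mips_rating(clock_hz, opt_cpi))        # ~1.73, ~28.9 MIPS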
Evaluating Performance
Benchmarks
– programs specially chosen to accurately
measure performance
– Synthetic Benchmarks
– Toy Benchmarks
– Kernels
– Real Programs
Evaluating Performance
Synthetic Benchmarks
» Programs that attempt to match average
frequency of operations and operands over a
large program base
» Don't do any real work
» Whetstone -- based on 1970's Algol programs
» Dhrystone -- Based on a composite of HLL
statements for 1980's -- targeted to test CPU and
compiler performance
» Designer can get around these benchmarks
Evaluating Performance
Toy Benchmarks
» 10-100 lines of easily produced code with known
computational result
Input
Receive data from peripheral
Send data to computer
Input/Output Connection
Receive control signals from computer
Send control signals to peripherals
e.g. spin disk
Receive addresses from computer
e.g. port number to identify peripheral
Send interrupt signals (control)
Facilitated by buses
Input/Output Systems
Example of I/O Devices
Magnetic disks
Networks
Buses
Magnetic Disks
orders of magnitude slower than main memory,
but are cheaper and have more capacity
Note 1: disk manufacturers usually denote GB as 10^9 bytes, whereas computer quantities often are powers of 2, i.e., a GB is 2^30 bytes.
Note 2: there is usually a trade-off between speed and capacity.
Note 3: there is a difference between the internal and the formatted transfer rate. The internal rate is measured between the platter and the read/write head; the formatted rate is what remains after the signals pass through the electronics (cabling loss, interference, retransmissions, checksums, etc.).
Disk Specifications
[Table: specifications of the Barracuda 180, Cheetah 36 and Cheetah X15 drives]
Disk heads
read or alter the magnetism (bits) passing under them. The heads are attached to an arm that moves them across the platter surface.
Sectors
segments of the track circle, separated by non-magnetic gaps. The gaps are often used to identify the beginning of a sector.
Cylinders
corresponding tracks on the different platters are said to form a cylinder.
[Figure: platter surface with heads, sectors and cylinders]
Input/Output Systems
Magnetic disks - Internal performance measures
Seek (access) time
- the time a program or device takes to locate a particular piece of data
- access time is often longer than the seek time because it includes a brief latency period
Spindle speed
- the speed at which the shaft in the middle of the disk drive rotates
Input/Output Systems
Magnetic disks - Internal performance measures
Rotational latency (rotational delay)
- the amount of time it takes for the desired sector
of a disk to rotate under the read-write heads of
the disk drive
- average rotational latency for a disk is half the
amount of time it takes for the disk to make one
revolution.
- typically applied to rotating storage devices, but
not to tape drives.
Disk Capacity
The size of the disk is dependent on
the number of platters
whether the platters use one or both sides
number of tracks per surface
(average) number of sectors per track
number of bytes per sector
Note 1: the tracks on the edge of the platter are longer than the tracks close to the spindle. Today, most disks are zoned, i.e., the outer tracks have more sectors than the inner tracks.
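Multiplying those factors together gives the (unformatted) capacity; a small Python sketch with made-up drive parameters:

    def disk_capacity(platters, sides, tracks_per_surface,
                      sectors_per_track, bytes_per_sector):
        # capacity = platters x sides x tracks/surface x sectors/track x bytes/sector
        return (platters * sides * tracks_per_surface *
                sectors_per_track * bytes_per_sector)

    # hypothetical drive: 4 platters, both sides, 50,000 tracks, 500 sectors, 512 B
    print(disk_capacity(4, 2, 50_000, 500, 512) / 1e9)   # ~102.4 "manufacturer" GB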
Disk access time = seek time + rotational delay + transfer time + other delays
Disk Access Time
Disk read/write latency has four components
Seek delay (tseek): head seeks to right track
Average of ~5 ms – 15 ms (less in practice, because most seeks are shorter than average)
Rotational delay (trotation): right sector rotates under head
On average: time to go halfway around disk
Based on rotation speed (RPM)
10,000 to 15,000 RPMs
~3ms
Transfer time (ttransfer): data actually being transferred
Fast for small blocks
Controller delay (tcontroller): controller overhead (on either
side)
Fast (no moving parts)
Disk Access Time: Seek Time
Seek time is the time to position the head
the head requires a minimum amount of time to start and stop moving
some time is used for actually moving the head –
roughly proportional to the number of cylinders traveled
“Typical” average:
10 ms → 40 ms
7.4 ms (Barracuda 180)
5.7 ms (Cheetah 36)
3.6 ms (Cheetah X15)
Disk Access Time: Rotational Delay
Time for the disk platters to rotate so the first of the required sectors is under the disk head
Average delay is 1/2 revolution
"Typical" average:
8.33 ms (3,600 RPM)
5.56 ms (5,400 RPM)
4.17 ms (7,200 RPM)
3.00 ms (10,000 RPM)
2.00 ms (15,000 RPM)
[Figure: platter showing the current head position and the requested block]
Disk Access Time: Transfer Time
Time for data to be read by the disk head, i.e., time it takes
the sectors of the requested block to rotate past the head
Transfer time = block size / transfer rate (equivalently, the fraction of a track occupied by the block x the time per revolution)
Note: to increase overall throughput, one should read as much as possible contiguously on disk
Disk Throughput
Example:
for each operation we have
- average seek
- average rotational delay
- transfer time
- no gaps, etc.
Cheetah X15:   4 KB blocks → 0.71 MB/s,   64 KB blocks → 11.42 MB/s
Barracuda 180: 4 KB blocks → 0.35 MB/s,   64 KB blocks → 5.53 MB/s
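These figures are dominated by the per-operation seek and rotational delays; a rough Python sketch using the average seek time and rotational delay quoted earlier (the transfer time itself is ignored here, so the results come out slightly above the quoted numbers):

    def disk_throughput(block_bytes, t_seek, t_rotation, t_transfer=0.0):
        # every block pays a full seek + rotational delay before it is transferred
        return block_bytes / (t_seek + t_rotation + t_transfer)

    # Cheetah X15: 3.6 ms average seek, 15,000 RPM -> ~2.0 ms rotational delay
    print(disk_throughput(4 * 1024, 3.6e-3, 2.0e-3) / 1e6)    # ~0.73 MB/s
    print(disk_throughput(64 * 1024, 3.6e-3, 2.0e-3) / 1e6)   # ~11.7 MB/s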
Disk Access Time: Other Delays
There are several other factors which
might introduce additional delays:
CPU time to issue and process I/O
contention for controller
contention for bus
contention for memory
verifying block correctness with checksums
(retransmissions)
waiting in scheduling queue
Input/Output Systems
Rotational Latency Example
Calculate the time taken to read a 4KB page assuming…
128 sectors/track, 512 B/sector, 6000 RPM,
10 ms tseek , 1 ms tcontroller
Solution
6000 RPM => 100 R/s => 10 ms/R
=> trotation = 10 ms / 2 = 5 ms
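The remaining terms of the example can be filled in the same way; a sketch in Python that finishes the calculation (the transfer time uses the 128 sectors/track and 512 B/sector figures given above):

    page_bytes        = 4 * 1024
    sector_bytes      = 512
    sectors_per_track = 128
    rpm               = 6000

    t_seek       = 10e-3
    t_controller = 1e-3
    t_revolution = 60 / rpm                    # 10 ms per revolution
    t_rotation   = t_revolution / 2            # 5 ms average rotational delay

    sectors    = page_bytes / sector_bytes                       # 8 sectors
    t_transfer = (sectors / sectors_per_track) * t_revolution    # 0.625 ms

    total = t_seek + t_rotation + t_transfer + t_controller
    print(total * 1e3, "ms")                   # 16.625 ms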
Buses
Sets of wires
Types of buses
Processor-memory buses
I/O buses
Backplane buses – connect I/O devices to processor and memory
Major advantages of bus organization are versatility and low cost.
Major disadvantage is that insufficient bandwidth may create a communication bottleneck, limiting the maximum I/O throughput.
Performance factors
Physical size (latency & bandwidth)
Number and type of connected devices (taps)
Bus Interconnection Scheme
Address Bus
contains the source or destination address
of the data on the data bus
e.g. CPU needs to read an instruction (data) from
a given location in memory
Transfer type
• serial or parallel
Bus Types
Dedicated
Separate data & address lines
Multiplexed
Shared lines
Address valid or data valid control line
Advantage - fewer lines
Disadvantages
More complex control
Degradation of performance
Bus Arbitration
Ensuring only one device uses the bus at
a time – avoiding collisions
Choosing a master among multiple
requests
Try to implement priority and fairness (no device "starves")
Uses a master-slave mechanism
Two main schemes:
• centralised
• distributed
Bus Arbitration
Components
Bus master: component that can initiate a bus
request
Bus typically has several masters
Processor, but I/O devices can also be
masters
Daisy-chain: devices connect to bus in priority
order
High-priority devices intercept/deny requests
by low-priority ones
Simple, but slow and can’t ensure fairness
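The daisy-chain behaviour can be mimicked in a few lines; a toy Python sketch (device names are made up) showing why low-priority devices can starve:

    def daisy_chain_grant(requests):
        # requests is ordered from highest to lowest priority;
        # the grant travels down the chain and the first requester keeps it
        for device, wants_bus in requests:
            if wants_bus:
                return device
        return None    # nobody asked; the bus stays idle

    print(daisy_chain_grant([("disk", False), ("network", True), ("printer", True)]))
    # -> "network": the printer never gets the bus while higher-priority devices ask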
Bus Arbitration
New trend: Point-to-point busses
• Pro: No arbitration, no “master”, fast,
simple, source synchronous
• Con: need lots of wires or requires high
per-wire bandwidth
Bus Arbitration
Centralised
a single hardware device controls bus access
(the bus controller or bus arbiter)
May be part of CPU or separate
Distributed
any module (except passive devices like
memory) can become the bus master e.g. CPU
and DMA controller
Access control logic is on all modules
Modules work together to control bus
Bus Timing
Co-ordination of events on bus
Synchronous – events are controlled by a
clock
Asynchronous – timing is handled by well-
defined specifications, i.e., a response is
delivered within a specified time after a
request
Synchronous Bus Timing
Events determined by clock signals
Control Bus includes clock line
A single 1-0 cycle is a bus cycle
All devices can read clock line
Usually sync on leading/rising edge
Usually a single cycle for an event
Analogy – Orchestra conductor with baton
Usually stricter in terms of its timing
requirements
Synchronous Bus Timing
Asynchronous Timing
Interrupt-driven I/O
Overcomes CPU waiting
No repeated CPU checking of device
I/O module interrupts when ready
[Figure: components of a disk access – controller overhead, average seek, rotational latency, data transfer]
Example
300 MIPS CPU, 100 MB/s I/O bus
50K OS instructions + 100K user instructions per I/O operation
SCSI-2 controllers (20 MB/s): each accommodates up to 7 disks
5 MB/s disks with tseek + trotation = 10 ms, 64 KB reads
Determine:
What is the maximum sustainable I/O rate?
How many SCSI-2 controllers and disks does it require?
Assuming random reads
Designing I/O
Designing an I/O System for Bandwidth
First: determine I/O rates of components we can’t change
CPU: (300M instructions/s) / (150K instructions/IO) = 2000 IO/s
I/O bus: (100MB/s) / (64KB/IO) = 1562 IO/s
Peak I/O rate determined by bus: 1562 IO/s
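The rest of the design can be sketched the same way; a Python sketch that continues with the disk and controller counts (rounding up at each step – the exact answer depends on whether 1 MB is taken as 10^6 bytes, as in the 1562 IO/s figure above):

    import math

    io_size     = 64e3       # bytes per I/O (64 KB)
    max_io_rate = 1562.5     # IO/s, limited by the I/O bus

    # each random read: 10 ms seek + rotation, then 64 KB at 5 MB/s
    t_disk       = 10e-3 + io_size / 5e6          # 22.8 ms per I/O
    ios_per_disk = 1 / t_disk                     # ~43.9 IO/s
    disks        = math.ceil(max_io_rate / ios_per_disk)          # 36 disks

    # each SCSI-2 controller: 20 MB/s, at most 7 disks
    ios_per_ctrl = 1 / (io_size / 20e6)           # ~312.5 IO/s
    controllers  = max(math.ceil(max_io_rate / ios_per_ctrl),
                       math.ceil(disks / 7))      # 6 controllers
    print(disks, controllers)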
An implementation of a CPU
[Figure: the basic datapath building blocks – PC and adder, instruction memory, register file (two read registers, a write register, and RegWrite), ALU with ALU control and Zero output, data memory (MemRead/MemWrite), and the 16-to-32-bit sign-extension unit.
a. Registers  b. ALU
a. Data memory unit  b. Sign-extension unit]
Building the Datapath
Use multiplexors to stitch them together
[Figure: the complete single-cycle datapath – the building blocks above stitched together with multiplexors, controlled by PCSrc, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite and the 3-bit ALU operation.]
Datapath
Two implementations (demonstrated with the MIPS instruction set)
Single cycle implementation
uses a single clock cycle for every instruction.
(Start execution with one clock edge, and complete on the next
edge)
Advantage: One clock cycle per instruction
Disadvantage: long cycle time
Multicycle implementation
instructions use multiple clock cycles.
Advantage: shorter clock cycle (set by the longest stage, not the longest instruction), so typically shorter execution time
Disadvantage: More complex control
Single-Cycle Implementation
[Figure: the single-cycle datapath with its control signals, including PCSrc, RegWrite and ALUOp, and the shift-left-2 / adder path that computes the branch target.]
Single-Cycle Datapath
Cycle time = Σ(stages)
Execution Time = IC * CPI * Cycle Time
Processor design (datapath and control) will
determine:
Clock cycle time
Clock cycles per instruction
All combinational logic must stabilize within one
clock cycle.
All state elements will be written exactly once at
the end of the clock.
CPI = 1
Recall: MIPS Instruction Format
Single-Cycle Datapath: R-format
• Format: opcode r1, r2, r3
[Figure: the part of the datapath used by R-format instructions – the register file (Read Reg 1, Read Reg 2, Write Register, Write Data) feeding the ALU under a 3-bit ALU operation control, with the ALU result written back to the register file.]
Single-Cycle Datapath: Load/Store
Decode: determine action to take (set up inputs for ALU, RAM, etc.)
Cycle time
= longest stage + pipeline overhead
Execution time
= cycle time * (no. of stages + IC – 1)
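Putting the single-cycle and pipelined formulas side by side; a small Python sketch with assumed stage latencies (the 200/100 ns figures and the 20 ns pipeline overhead are illustrative only):

    def single_cycle_time(ic, stage_times):
        # single cycle: the clock must cover all stages, CPI = 1
        return ic * sum(stage_times)

    def pipelined_time(ic, stage_times, overhead=0.0):
        # cycle = longest stage + overhead; time = cycle * (stages + IC - 1)
        cycle = max(stage_times) + overhead
        return cycle * (len(stage_times) + ic - 1)

    stages = [200, 100, 200, 200, 100]       # IF, ID, EX, MEM, WB in ns (assumed)
    ic = 1_000_000
    print(single_cycle_time(ic, stages) / 1e6)            # 800 ms
    print(pipelined_time(ic, stages, overhead=20) / 1e6)  # ~220 ms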
Pipelining
changes the relative timing of instructions by
overlapping their execution
introduces hazards
                   1   2   3   4   5   6   7   8   9
ADD R1, R2, R3     IF  ID  EX  MEM WB
SUB R4, R5, R1         IF  ID  EX  MEM WB
AND R6, R1, R7             IF  ID  EX  MEM WB
OR  R8, R1, R9                 IF  ID  EX  MEM WB
XOR R10, R1, R11                   IF  ID  EX  MEM WB
Pipelining
All the instructions after the ADD use the
result of the ADD instruction (in R1).
Note: Only branch hazards and RAW data hazards are possible in MIPS
pipeline
Pipeline Hazards
Data hazards
when reads and writes of data occur in a different
order in the pipeline than in the program code –
results in data dependencies
WAW
is a situation in which two writes occur out of
order - when there is no read in between
Pipeline Hazards
Data hazards
RAW
occurs when, in the code as written, one
instruction reads a location after an earlier
instruction writes new data to it, but in the pipeline
the write occurs after the read (so the instruction
doing the read gets stale data).
Pipeline Hazards
Example (Data hazard)
add $1,  $2,  $3    IF ID EX MEM WB
add $4,  $5,  $6        IF ID EX MEM WB
add $7,  $8,  $9            IF ID EX MEM WB
add $10, $11, $12               IF ID EX MEM WB
add $13, $14, $1                    IF ID EX MEM WB   (data arrives early; OK)
add $15, $16, $7                        IF ID EX MEM WB   (data arrives on time; OK)
add $17, $18, $13                           IF ID EX MEM WB   (uh, oh)
add $19, $20, $17                               IF ID EX MEM WB   (uh, oh again)
Pipeline Hazards
Branch/Control hazards
occur when a decision needs to be made, but the information needed to make the decision is not available yet
JMP LOOP
…
LOOP: ADD R1, R2, R3
Structural hazards
occur when a single piece of hardware is used in more
than one stage of the pipeline, so it's possible for two
instructions to need it at the same time
Resolving Hazards
Four possible techniques
1. Stall
Can resolve any type of hazard
Detect the hazard
Freeze the pipeline up to the dependent stage
until the hazard is resolved - simply make the
later instruction wait until the hazard resolves
itself
Undesirable because it slows down the
machine, but may be necessary.
Resolving Hazards
2. Bypass/Forward
detects condition
If the data is available somewhere, but is just not
where we want it, create extra data paths to
``forward'' the data to where it is needed
no need to stall
eliminates stalls for single-cycle operations
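The condition the forwarding hardware checks is just a register-number comparison; a simplified Python sketch (ignoring the distinction between the EX/MEM and MEM/WB pipeline registers):

    def needs_forwarding(producer_dest, consumer_srcs):
        # RAW dependence: an earlier instruction's destination register is one of
        # the later instruction's source registers (register 0 never needs it)
        return producer_dest != 0 and producer_dest in consumer_srcs

    # ADD R1, R2, R3 followed by SUB R4, R5, R1 -> forward the ALU result
    print(needs_forwarding(producer_dest=1, consumer_srcs=(5, 1)))   # True
    # ADD R1, R2, R3 followed by AND R6, R2, R7 -> no dependence
    print(needs_forwarding(producer_dest=1, consumer_srcs=(2, 7)))   # False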
2 types of locality:
Temporal locality (locality in time) - If an item is
accessed it will be accessed again soon.
Spatial locality (locality in space) - If an item is accessed, items close by will tend to be accessed soon.
The Principle of Locality
is exploited by implementing a memory hierarchy composed of multiple levels of memory with different sizes and speeds. Faster, more expensive memory (around $100–250 per MByte, typically SRAM) is used closer to the CPU, while DRAMs (25–100 ns, $3–8 per MByte) are used as main memory.
[Figure: the memory hierarchy – the fastest memory, closest to the CPU, is the smallest and has the highest cost per bit.]
Hits and Misses
[Figure: the memory hierarchy seen from the processor – Level 1 down to Level n, with distance from the CPU (and access time) increasing at each level.]
Direct Mapped Cache: The simplest way to allocate the cache to the
system memory is to determine how many cache lines there are and just
chop the system memory into the same number of chunks. Then each
chunk gets the use of one cache line. This is called direct mapping. So if
we have 64 MB of main memory addresses, each cache line would be
shared by 4,096 memory addresses (64 M divided by 16 K).
Fully Associative Cache: Instead of hard-allocating cache lines to
particular memory locations, it is possible to design the cache so that any
line can store the contents of any memory location. This is called fully
associative mapping.
N-Way Set Associative Cache: "N" here is a number, typically 2, 4, 8 etc.
This is a compromise between the direct mapped and fully associative
designs. In this case the cache is broken into sets where each set contains
"N" cache lines, let's say 4. Then, each memory address is assigned a set,
and can be cached in any one of those 4 locations within the set that it is
assigned to. In other words, within each set the cache is associative, and
thus the name.
Tags
Contain the upper bits of the address not used to index the cache. A cache location is mapped to several memory locations, so each entry also includes a valid bit.
How do we know which memory address a slot reflects? – the tag field (together with the valid bit).
For the sake of simplicity, we consider word addresses. For a 4-slot cache with one-word blocks:
address = tag x 4 + slot
slot = address AND 3
tag = address >> 2, or floor(address / 4)
[Figure 1: a 1024-entry direct-mapped cache – the 32-bit address (bit positions 31..0) is split into a 20-bit tag, a 10-bit index (entries 0–1023) and a 2-bit byte offset; each entry holds a valid bit, the tag and 32 bits of data.]
[Figure 2: a small example mapping word addresses onto a 4-slot cache.]
Cache Basics
The valid bit has the value ‘0’ to say that it is empty or the value ‘1’ to say that it is full
caches are empty when the machine powers up
For the sake of simplicity the addresses on the figure are word addresses, and block
contains only one word
Caches are also usually emptied when the computer switches from running one
program to running another program (this is to prevent programs from accessing each
other's virtual memory spaces)
The tag field contains rubbish when the valid bit is 0
Otherwise, the tag field tells us which of the possible memory addresses the slot
currently reflects
If memory has an m-bit byte address, and the cache has a capacity of 2^n blocks (of 2^k words each, k > 0), the tag contains the m – n – k – 2 most significant address bits (tag = address >> (n + k + 2), i.e., the address shifted n + k + 2 places to the right; the extra 2 bits are the byte offset within a word, since the m-bit address is a byte address).
The slot is the address of the cache block, and it is (memory_address >> (k + 2)) mod 2^n, or equivalently (memory_address >> (k + 2)) AND (2^n – 1).
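The same bit-slicing is easy to express in code; a small Python sketch using the simpler word-address convention from the figures above (set k = 0 for one-word blocks):

    def split_word_address(addr, n, k=0):
        # direct-mapped cache with 2**n blocks of 2**k words, word addresses
        block_offset = addr & ((1 << k) - 1)
        slot         = (addr >> k) & ((1 << n) - 1)   # (addr >> k) mod 2**n
        tag          = addr >> (n + k)                # remaining upper bits
        return tag, slot, block_offset

    # 4-slot cache with one-word blocks (n = 2, k = 0):
    print(split_word_address(13, n=2))    # (3, 1, 0): tag 3, slot 1, i.e. 13 = 3*4 + 1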
Example
We consider word addresses again.
DRAM contents (word address → data):
addr:  0   1   2   3   4   5   6   7   8   9   10  11
data:  00  0A  20  20  C9  F0  44  10  FF  00  A0  80
[Figure: the 12-word DRAM alongside an 8-word direct-mapped cache (slots 0–7), each slot holding a valid bit, a tag and the data.]
Example: (cont.)
Read address 10 (1010 in binary) → slot # 010 → miss, so fetch
Read address 8 (1000 in binary) → slot # 000 → miss, so fetch
Read address 10 (1010 in binary) → slot # 010, tag 1 → hit
For the 8-word cache:
DRAM address = tag x 8 + slot
slot = address mod 8
tag = floor(address / 8) = address >> 3
Example: (cont.)
Read address 8 (1000 in binary) → slot index 000 → hit
Read address 2 (0010 in binary) → slot index 010 → tag mismatch (miss)

Cache contents after the reads above:
Slot   V   Tag   Data
000    Y   1     FF
001    N
010    Y   1     A0
011    N
100    N
101    N
110    N
111    N
The 3 Cs
Cache misses can be divided into 3
categories:
Compulsory misses: Misses that are caused
because the block was never in the cache.
These are also called cold-start misses.
Increasing the block size reduces compulsory
misses. Too large a block size can cause
capacity misses and increases the miss
penalty.
The 3 Cs
Capacity misses: Misses that occur because the cache cannot hold all the blocks the program needs, so blocks are replaced and later fetched again. Increasing the cache size reduces capacity misses.
The 3 Cs
Conflict misses: Occur in direct-mapped and set-associative caches. Multiple blocks compete for the same set. Increasing associativity reduces conflict misses, but too high an associativity increases access time.
Memory System Support for Caches
Cache misses read data from main memory
which is constructed from DRAMs. Although it is
hard to reduce the latency to the first word it is
possible to increase the bandwidth from memory
to cache.