
Computer Architecture
CIT 3007
Semester 2 – 2008/9
Computer Architecture
REFERENCES
 David A. Patterson & John L. Hennessy, Computer Organization and
Design: The Hardware/Software Interface, Morgan Kaufmann.
 Andrew S. Tanenbaum, Structured Computer Organisation, 3rd Edition,
Prentice Hall, 1990.
 William Stallings, Computer Organization and Architecture, 5th
Edition, Prentice Hall, 2000.
 J. L. Hennessy & D. A. Patterson, Computer Architecture: A
Quantitative Approach, Morgan Kaufmann, 1990. Chapters 5, 6, 7, 8, 10.
Computer Architecture
ASSESSMENT PROCEDURES
 Test = 15%
 Quizzes = 15% (on-going, done in tutorials)
 Research/Presentations = 20%
 Final exam = 50%
 Total = 100%
UNIT 1 – Introduction to Computer
Architecture
 Overview and history
 The cost factor
 Memory hierarchy
 Hardware/Software interface
 Computer abstractions and technology
 Hierarchical approach to understanding & designing a complex
system
 Software
 Hardware
 Computer organization
 Processor
 Control
 Datapath
 Memory
 Input & output

Note: These topics were covered in Computer Organisation & Assembly and
as such are being revised.
Computer Architecture
Definition
 Baer: “The design of the integrated system
which provides a useful tool to the
programmer”
 Hayes: “The study of the structure, behavior
and design of computers”
 Abd-Alla: “The design of the system
specification at a general or subsystem level”
Computer Architecture
Definition
 Foster: “The art of designing a machine
that will be a pleasure to work with”

 Hennessy and Patterson: “The interface


between the hardware and the lowest
level software”
Computer Architecture

Common themes:
 Design / structure
 Art
 System
 Tool for programmer and application
 Interface
Computer Architecture
 refers to those attributes of the system
that are visible to a programmer -- those
attributes that have a direct impact on the
execution of a program, including
 Instruction sets
 Data representations
 Addressing

 I/O
Architecture vs Organisation
Organisation
 Synonymous with “architecture” in many uses and
textbooks
 Transparent to the programmer

 For this module, we will use it to mean the underlying
implementation of the architecture
 An architecture can have a number of organizational
implementations:
a. Control signals
b. Technologies
c. Device implementations
Computer Architecture
 design of computer systems
 often divided into instruction set architecture and
microarchitecture
 the conceptual design and fundamental operational structure of a
computer system (in computer engineering)
Computer Architecture
 a blueprint and functional description of
requirements (especially speeds and
interconnections) and design
implementations for the various parts of a
computer — focusing largely on the way
by which the central processing unit (CPU)
performs internally and accesses
addresses in memory
Computer Architecture
 may also be defined as the science and art
of selecting and interconnecting hardware
components to create computers that meet
functional, performance and cost goals

 Most common goal is to maximize the


performance of a computer system
Computer Architecture
A Brief History
 1939 - HARVARD MARK I
 1943 - COLOSSUS
 1943 – Work begins on the ENIAC
 first true general purpose computer
 1945 – Von Neumann writes “First Draft of
a Report on the EDVAC”
UNIT 1 – Computer Circuits & Arithmetic
Objective: To determine performance of
computer circuits by calculating gate count
and gate delays for:
Ripple adders

Carry lookahead adders

Multiplexers
Circuit Delays
Reference: Chapter 4 – Hennessey & Patterson
 Delay is an important aspect of circuit design because it governs
the speed at which the circuit operates.
 It determines the maximum frequency at which a digital circuit can
work.

Example
A PC has a clock frequency of 800 MHz. That means that every 1.25 ns
(i.e. period T = 1/frequency) the PC will perform a computation.
Circuit Delays
The logic circuit
Circuits that perform the same function can
vary significantly in their speeds. A good
example is an adder circuit.

The overall speed of a digital system can be
measured on an oscilloscope by comparing
the input to the output signals. However,
during the design phase, the circuit has not yet
been fabricated and therefore, cannot be
measured. In that case it is possible to
determine the delay of circuits by doing
a Timing Simulation. The advantage of a
simulation is that one can also determine the
delay of internal nodes of a circuit.
Circuit Delays
 What is the longest we will have to wait for
our ALU to compute a result?

 Affects the performance of a processor

 Influenced by the amount and type of


gates used in the design

 Effects are clearly seen in adders and


multiplexers
Circuit Delays
 Factors
1. propagation delay (gate delay) due to the internal structure of
the gates, including contamination delay
2. the loading of the output buffers due to fanout and net delays
3. the logic circuit itself
Circuit Delays
Propagation delay
 Occurs because when the input signal of a gate changes, the output
signal does not change instantaneously
 the time difference between the change of the input and output
signals
 value varies from gate to gate and from logic family to family
 caused by parasitic capacitors inside the gates and the physical
limitations of the devices used to build these gates
 In general, the more you are willing to pay for a device (or chip),
the faster it will be.
Circuit Delays
Contamination Delay
 the minimum time for a change at the
input of a combinational circuit to affect
the output.
 Normally occurs in circuit with flip-flops
Circuit Delays
Fanout & Net delay
 Caused by the capacitors associated with the loads seen by a gate,
which are the result of the wiring (net delays) between gates (e.g. a
long metal line connecting two gates on a chip) and the input
capacitance of the gates
[Figure: (a) parasitic interconnection capacitors and fanout of a
gate; (b) hydraulic equivalent.]
n-bit Carry Ripple Adders
 So called because the result of an addition of
two bits depends on the carry generated by the
addition of the previous two bits.
 sum of the most significant bit is only available
after the carry signal has rippled through the
adder from the least significant stage to the
most significant stage.
 built by connecting in series n full adders.
n-bit Carry Ripple Adders
 The output of a full adder at position j is given by:
Sj = Xj ⊕ Yj ⊕ Cj
Cj+1 = Xj.Yj + Xj.Cj + Yj.Cj
 The carry signal must travel (ripple) through all the stages of the
adder, so the final Sum and Carry bits are only valid after a
considerable delay.
n-bit Carry Ripple Adders
 The propagation delay in each full adder to
produce the carry is equal to two gate delays
= 2t
 Since the generation of the sum requires the
propagation of the carry from the lowest
position to the highest position, the total
propagation delay of the adder is
approximately:
Total Propagation delay = 2nt
 Additional delay may result from other delays
associated with interconnections.
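
A small sketch (Python; not part of the original notes) that tallies the
worst-case delay of an n-bit ripple adder, assuming 2t per full-adder
carry stage as stated above:

def ripple_adder_delay(n, t=1):
    """Worst-case carry path: 2 gate delays per full-adder stage."""
    return 2 * n * t

for n in (4, 8, 16, 32):
    print(f"{n:>2}-bit ripple adder: {ripple_adder_delay(n)}t")
# 4-bit -> 8t, 8-bit -> 16t, 16-bit -> 32t, 32-bit -> 64t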
n-bit Carry Ripple Adders

[Figure: ripple-carry adder, illustrating the delay of the carry bit.]

4-bit Carry Ripple Adders
 Adds two 4-bit numbers X = X3 X2 X1 X0 and Y = Y3 Y2 Y1 Y0,
producing the sum S = S3 S2 S1 S0 and the carry-out C4 from the most
significant position (j = 3), with carry-in C0 = 0.
 Propagation delay = 2nt = 8t, or 8 gate delays.
[Figure: four full adders connected in series; each stage's carry-out
feeds the next stage's carry-in, from C0 = 0 up to C4.]
Carry Look-Ahead Adders
 designed as a faster way to add two binary
numbers thus reducing computation time
 solve problem of ripple adder by calculating
the carry signals in advance, based on the
input signals
 work by creating Propagate and Generate
signals (P and G) for each bit position, based
on whether a carry is propagated through
from a less significant bit position
Carry Look-Ahead Adders
 a carry signal will be generated in two
cases:
 when both bits Ai and Bi are 1, or
 when one of the two bits is 1 and

the carry-in (carry of the previous


stage) is 1.
Carry Look-Ahead Adders
Therefore:
COUT = Ci+1 = Ai.Bi + (Ai ⊕ Bi).Ci
By defining the carry generate function Gi = Ai.Bi and the carry
propagate function Pi = Ai ⊕ Bi, Ci+1 can be written as:
COUT = Ci+1 = Gi + Pi.Ci
To eliminate the carry ripple, the term Ci is recursively expanded
and, by multiplying out, we obtain a 2-level AND-OR expression for
each Ci+1.
Carry Look-Ahead Adders
• For a 4-bit carry look-ahead adder the expanded expressions for all
carry bits are given by:
C1 = G0 + P0.C0
C2 = G1 + P1.C1 = G1 + P1.G0 + P1.P0.C0
C3 = G2 + P2.G1 + P2.P1.G0 + P2.P1.P0.C0
C4 = G3 + P3.G2 + P3.P2.G1 + P3.P2.P1.G0 + P3.P2.P1.P0.C0
• The additional circuits needed to realize these expressions are
usually referred to as the carry look-ahead logic.
• Using carry look-ahead logic, all carry bits are available after
three gate delays, regardless of the size of the adder.
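
The expanded expressions are easy to check in software. A hedged sketch
(Python; the names g, p, c are illustrative, not from the lecture):

def cla_carries(a_bits, b_bits, c0=0):
    """a_bits/b_bits: lists of 4 bits, index 0 = least significant."""
    g = [a & b for a, b in zip(a_bits, b_bits)]   # Gi = Ai.Bi
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # Pi = Ai xor Bi
    c = [c0]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))            # Ci+1 = Gi + Pi.Ci
    return c                                       # [C0 .. C4]

# 11 + 6 = 17: the loop above ripples conceptually, but in hardware each
# Ci+1 flattens to the 2-level AND-OR expression shown on the slide, so
# all carries arrive after ~3 gate delays.
print(cla_carries([1, 1, 0, 1], [0, 1, 1, 0]))  # [0, 0, 1, 1, 1]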
Carry Look-Ahead Adders
Four 4-bit ALUs using carry look-ahead to form a 16-bit adder

N.B. The carries come from the carry look-ahead unit and not from the
ALUs.
Carry Look-Ahead Adders
One possible implementation of a full adder uses nine gates.

Assuming gate delays:
AND gates = 2t
OR gate = 3t
NOT gate = 1t

The gate delay for:
C1 = ?
S1 = ?
Propagation Delay
What is the maximum propagation delay of this ALU?

The propagation
delay will depend on :
1. The propagation
delay of the
individual gates.
2. The propagation
delay of the
multiplexers.
3. The propagation
delay of the One
Bit Adder.
1-bit ALU
Propagation Delay
The propagation delay of the multiplexer is 19 ns:
NOT gates – 4.5 ns
AND gates – 7.5 ns
OR gate – 7.0 ns
(4.5 + 7.5 + 7.0 = 19 ns along the critical path)
Propagation Delay
The propagation delay of the ALU:
Unit 2 – Processor Performance
Objective:
After the completion of this unit the student should be
able to determine processor performance:
a. Performance metrics and evaluating computer designs
b. Measuring performance.
 Clock rates
 Clock time
 Execution time
 Cycles per instruction (CPI)
c. Evaluating and comparing performance
 Amdahl’s Law
 MIPS
 MFLOPS
 Benchmarks
 Bandwidth
Computer Performance
Reference: Chapter 2 – Hennessey & Patterson

If the automobile industry had followed the same evolution as
computers, it is speculated that today a Rolls-Royce would cost less
than $100, with the power of an ocean liner and an efficiency of more
than 200 miles per gallon.
Source: Measuring Computer Performance: A Practitioner's Guide, David J. Lilja

 How well is the computer doing the work it is supposed to do?
Computer Performance
 the amount of useful work accomplished by a
computer system compared to the time and
resources used.

 may involve one or more of the following:


 Short response time for a given piece of work
 High throughput (rate of processing work)
 Low utilization of computing resource(s)
 High availability of the computing system or
application
Computer Performance
Response time
The time a generic system or functional unit
takes to react to a given input.

Throughput
The amount of digital data per time unit that
is delivered over a physical or logical link.
Computer Performance
System resource
Any physical or virtual component of limited
availability within a computer system.

High availability
A system design protocol and associated
implementation that ensures a certain
absolute degree of operational continuity
during a given measurement period.
Computer Performance
Metrics (measurements) include:
 Availability, response time, capacity, latency, bandwidth,
throughput, relative efficiency, scalability, speed-up
 used in the relative comparison of one system to another, or the
same system before/after changes
Computer Performance
 common ways of gauging a system's value
in terms of its cost and its performance

 Observation: An increase in a machine's


performance is viewed in one of two
(competing) ways:
» Reduced response time to an individual
job
» Increase in overall throughput
Comparing Performance
CPU Time = execution time
         = instruction count x CPI x clock cycle time
         = (instruction count x CPI) / clock rate

NB: clock cycle time = 1 / clock rate
CPI = cycles per instruction

Performance ~ 1 / Execution Time

CPU Performance(A) / CPU Performance(B)
    = Execution Time(B) / Execution Time(A)
Comparing Performance
Example
System A takes 12 seconds to execute a
program. System B takes 20 seconds to
execute the same program. System A is
20/12 = 1.67x faster than System B for
this application. (67% performance
improvement)
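
A quick check of this example (Python sketch; the helper name is
illustrative, not course code):

def relative_performance(time_a, time_b):
    """Performance(A)/Performance(B) = ExecutionTime(B)/ExecutionTime(A)."""
    return time_b / time_a

ratio = relative_performance(12, 20)
print(f"System A is {ratio:.2f}x faster than System B")  # 1.67x
print(f"({ratio - 1:.0%} performance improvement)")      # 67%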
Evaluating Performance
Definition of time
 Vary based on what is being measured

 Include response time, CPU execution


time (CPU time), user CPU time, system
CPU execution time
Evaluating Performance
Definition of time
Response time :-
total time to compute a task, including time
spent executing on the CPU, accessing
disk and memory, waiting for I/O and
other processes, and operating system
overhead
Evaluating Performance
Definition of time
CPU execution time (CPU time):
 total time a CPU spends computing a task,
excluding time for I/O and other processes

 equally dependent upon clock cycle (or


rate), clock cycles per instruction, and
instruction count.
 A 10% improvement in any one of them leads to a 10%
improvement in CPU time.
Evaluating Performance
Definition of time
CPU execution time (CPU time)
= CPU clock cycles x clock cycle time
= CPU clock cycles / clock rate
= IC x CPI x clock cycle time
= (IC x CPI) / clock rate,     where IC = instruction count
Evaluating Performance
Definition of time (CPU time):
where CPI = CPU clock cycles / IC
Therefore,
CPU time = seconds / program
         = (instructions / program) x (clock cycles / instruction)
           x (seconds / clock cycle)
Evaluating Performance
Definition of time
User CPU time:-
Total time the CPU spends in the program

System CPU execution time:-
Total time the operating system spends executing tasks for the
program
Evaluating Performance
Computing CPI
 The CPI is the average number of cycles per instruction.
 If, for each instruction type, we know its frequency and the number
of cycles needed to execute it, we can compute the overall CPI as:
 CPI = Σ (CPIi x Fi)

For example
Op      F      CPI   CPI x F   % Time
ALU     50%    1     0.5       23
Load    20%    5     1.0       45
Store   10%    3     0.3       14
Branch  20%    2     0.4       18
Total   100%         2.2       100
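
The table can be reproduced with a short sketch (Python; names are
illustrative):

mix = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}

overall_cpi = sum(f * c for f, c in mix.values())
print(f"Overall CPI = {overall_cpi:.1f}")                 # 2.2
for op, (f, c) in mix.items():
    print(f"{op:<7} {f * c / overall_cpi:4.0%} of execution time")
# ALU 23%, Load 45%, Store 14%, Branch 18%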
Evaluating Performance
Estimating Performance Improvements

Assume a processor currently requires 10 seconds to execute a program
and performance improves by 50 percent per year.

By what factor does processor performance improve in 5 years?
(1 + 0.5)^5 = 7.59

How long will it take the processor to execute the program after 5 years?
ExTime(new) = 10 / 7.59 = 1.32 seconds
Evaluating Performance
Methods used to evaluate performance:
1. Memory Bandwidth
2. Amdahl’s Law
3. MIPS (millions of instructions per second)
4. MFLOPS (millions of floating-point
operations per second)
Evaluating Performance
Memory bandwidth
 the maximum rate in bits per second at
which information can be transferred to or
from main memory
 Imposes a basic limit on the processing
power of a system
 Weakness is that it is not related in any
way to actual program execution
 Not one of the current "in" figures of merit
Evaluating Performance
Amdahl’s Law
Founded on the basis that machines are
designed to run programs, therefore
improved performance is a total system
process

Amdahl’s Law states:


“The performance improvement to be gained
from using some faster mode of execution
is limited by the fraction of time the faster
mode can be used.”
Evaluating Performance
Amdahl’s Law
 Defines speedup:

speedup = (Execution time for entire task without using the enhancement)
        / (Execution time for entire task using the enhancement when possible)
Evaluating Performance
Amdahl’s Law
Execution time(new)
= Execution time(old) x [(1 – Fraction(enhanced))
                         + Fraction(enhanced) / Speedup(enhanced)]

The overall speedup is the ratio of the execution times:

Speedup(overall) = Execution time(old) / Execution time(new)
                 = 1 / [(1 – Fraction(enhanced))
                        + Fraction(enhanced) / Speedup(enhanced)]
Evaluating Performance
Example:
You have a system that contains a special
processor for doing floating-point operations.
You have determined that 50% of your
computations can use the floating-point
processor. The speedup of the floating-
point processor is 15. What is the
overall speedup?
Evaluating Performance
Overall speedup achieved by using the floating-point processor:

Overall speedup = 1 / [(1 - 0.5) + 0.5/15]
                = 1 / (0.5 + 0.033)
                = 1.876
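
A sketch of the formula applied to this example (Python; illustrative
helper, not course code):

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup per Amdahl's Law, as given above."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(f"{amdahl_speedup(0.5, 15):.3f}")
# 1.875 exactly; the slide shows 1.876 because it rounds 0.0333 to 0.033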
Evaluating Performance
MIPS
 Defined as millions of instructions per second

MIPS = instruction count / (execution time x 10^6)
     = clock rate / (CPI x 10^6)

 In general, faster machines will have higher MIPS ratings and appear
to have better performance
Evaluating Performance
MIPS
 Advantage in use: easy to "understand,"
easy to market systems with this measure of
performance

 Problems:
» Rating of a machine is based on its
instruction set -- how do you compare
machines with very different instruction sets?
Apples and Oranges
Evaluating Performance
MIPS
 rating can vary on a single computer based
on program being executed

 can vary inversely to performance! --


increase in performance with a decrease in
MIPS rating
Evaluating Performance
MIPS
Example: (Impact of optimizing compiler)
Assume the following program makeup:

Op      Frequency   Cycles
ALU     43%         1
Load    21%         2
Store   12%         2
Branch  24%         2

Assume a 20 ns clock; the optimizing compiler eliminates 50% of all
ALU operations.
Evaluating Performance
MIPS
Solution (NOT optimized):
Avg. CPI = 0.43 x 1 + 0.21 x 2 + 0.12 x 2 + 0.24 x 2
         = 1.57

MIPS = 50 MHz / (1.57 x 10^6)
     = (50 x 10^6) / (1.57 x 10^6)
     = 31.8
Evaluating Performance

Optimized:
Avg. CPI = (0.43/2 x 1 + 0.21 x 2 + 0.12 x 2 + 0.24 x 2) / (1 - 0.43/2)
         = 1.73

MIPS = 50 MHz / (1.73 x 10^6)
     = (50 x 10^6) / (1.73 x 10^6)
     = 28.9
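
Both cases can be reproduced with a short sketch (Python; the
instruction mix is the one given in the example, 20 ns clock = 50 MHz):

mix = {"ALU": (0.43, 1), "Load": (0.21, 2), "Store": (0.12, 2), "Branch": (0.24, 2)}

def mips_rating(mix, clock_hz):
    total_f = sum(f for f, _ in mix.values())
    cpi = sum(f * c for f, c in mix.values()) / total_f
    return clock_hz / (cpi * 1e6), cpi

print("not optimized: MIPS = %.1f (CPI = %.2f)" % mips_rating(mix, 50e6))
mix["ALU"] = (0.43 / 2, 1)   # compiler removes half of all ALU operations
print("optimized:     MIPS = %.1f (CPI = %.2f)" % mips_rating(mix, 50e6))
# ~31.8 and ~29.0 (the slide gets 28.9 by rounding CPI to 1.73 first);
# the MIPS rating drops even though the optimized program runs faster.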
Evaluating Performance
Benchmarks
– programs specially chosen to accurately
measure performance
– Synthetic Benchmarks
– Toy Benchmarks
– Kernels
– Real Programs
Evaluating Performance
Synthetic Benchmarks
» Programs that attempt to match average
frequency of operations and operands over a
large program base
» Don't do any real work
» Whetstone -- based on 1970's Algol programs
» Dhrystone -- Based on a composite of HLL
statements for 1980's -- targeted to test CPU and
compiler performance
» Designer can get around these benchmarks
Evaluating Performance
Toy Benchmarks
» 10-100 lines of easily produced code with known computational result
» Small, easy to write (compiler may not be produced yet) and evaluate
» Hennessy and Patterson assert their best use is for "beginning
programming assignments" since the benchmarks do not reflect a
computer's performance for normal-sized applications
Evaluating Performance
Kernels
» Small, key pieces of real programs

» Livermore loops and Linpack are good


examples

» Tend to focus on a specific aspect of the


overall performance rather than the entire
system
Evaluating Performance
Real Programs
» Actual applications with real I/O
» Best measure of a machine's total capability
» Often, the hardest to judge based on limited
access to machine -- have not bought it yet!
» Typical suite would include compilers, word
processors, math applications, etc.
» SPEC (System Performance Evaluation
Cooperative) is a good example of workstation
benchmark (started by HP, Sun, DEC, and
MIPS)
Evaluating Performance
Other considerations
– COST!!!
» Design cost
» Purchase cost
» Components: component, direct, indirect
– Compatibility
– S/W availability
– Maintenance
Unit 3 – Input/Output Systems
Objective:
After completion of this unit the student should be able
 Differentiate between types of input/output devices (particularly
buses).
 System buses
 Hard disk drives
 Calculate input/output performance measures:
Bus design
 synchronous vs asynchronous buses
 Bandwidth vs Latency
Hard disk drive design
 Disk access time
 Seek time
 Rotational delay
 Transfer time
 Control overhead
Polling
 Discuss the correlation between system performance and I/O
devices
Input/Output Systems
References
 Stallings – Chapters 3 & 7
 Hennessy and Patterson – Chapters 7 & 8
Input/Output Systems
Challenges:
 Characteristics of I/O systems are driven by technology
 e.g. the properties of disks affect how they should be connected
to the processor, as well as how the operating system interacts
with them.
 Designers of I/O systems must consider issues
such as expandability and resilience in the face of
failure in addition to performance.
Input/Output Systems
Challenges:
 Performance issues pertaining to I/O systems are more diverse,
 e.g. some devices have access latency as their performance
measure, while for others throughput is crucial.

 Performance depends on many aspects of the system, such as: the
device characteristics, the connection between the devices and the
rest of the system, the memory hierarchy, and the operating system.
Input/Output Systems
Challenges:
 Wide variety of peripherals
 Delivering different amounts of data at different
speeds, in different formats

 All slower than CPU and RAM

 Need I/O modules (refer to notes from semester


1)
Input/Output Systems
Throughput:
How much data can be moved
through the system in a certain
time?
How many I/O operations can be
done per unit time
Response time; total elapsed time
to accomplish a particular task.
Input/Output Systems
Response Time
 will depend heavily on bandwidth if I/O requests are extremely large
 if accesses are small, the I/O system with the lowest latency per
access will deliver the best response time
 is the key performance characteristic on single-user machines such
as workstations and personal computers
 in general, is minimized by handling a request as early as possible,
while greater throughput can be achieved if we try to handle related
requests together
Input/Output Systems
Latency
 in general, is the period of time that one
component in a system is spinning its
wheels waiting for another component.
 wasted time.
 For example, in accessing data on a disk,
latency is defined as the time it takes to position
the proper sector under the read/write head.
Input/Output Connection
 Similar to memory from computer’s viewpoint
 Output
 Receive data from computer
 Send data to peripheral

 Input
 Receive data from peripheral
 Send data to computer
Input/Output Connection
 Receive control signals from computer
 Send control signals to peripherals
 e.g. spin disk
 Receive addresses from computer
 e.g. port number to identify peripheral
 Send interrupt signals (control)
 Facilitated by buses
Input/Output Systems
Example of I/O Devices
 Magnetic disks

 Networks

 Buses
Magnetic Disks
 orders of magnitude slower than main memory,
but are cheaper and have more capacity

 used to make a system persistent and to manage huge amounts of
information, but
 ...there is a large speed mismatch compared to main memory (this
gap will increase according to Moore's law)
 ...disk I/O is often the main performance bottleneck
 ...so we need to minimize the number of disk accesses
Magnetic Disks
 Moore's Law – the number of transistors
that can be inexpensively placed on an
integrated circuit is increasing
exponentially, doubling approximately
every two years
Magnetic disks
 Hard disks and floppy disks
 Characteristics
 consists of a collection of platters (2-20), each
of which has two recordable disk surfaces
 The stack of platters is rotated at 3600 to 5400
RPM
 Each disk surface is divided into concentric
circles, called tracks (500-2000 per surface)
 Each track is in turn divided into sectors (32-
128). This is also the smallest unit that can be
read or written.
 A cylinder refers to all the tracks under the disk
arms at a given point on all surfaces.
Magnetic Disk Specifications
 Disk technology develops “fast”
 Some existing (Seagate) disks today:

Note 1: disk manufacturers usually denote GB as 10^9 bytes, whereas
computer quantities are often powers of 2, i.e., GB is 2^30 bytes.
Note 2: there is usually a trade-off between speed and capacity.
Note 3: there is a difference between internal and formatted transfer
rate. Internal is the raw rate at the platter; formatted is what
remains after the signals have passed through the electronics (cabling
loss, interference, retransmissions, checksums, etc.).
Disk Specifications
                                Barracuda 180   Cheetah 36   Cheetah X15
Capacity (GB)                   181.6           36.4         36.7
Spindle speed (RPM)             7,200           10,000       15,000
# cylinders                     24,247          9,772        18,479
average seek time (ms)          7.4             5.7          3.6
min (track-to-track) seek (ms)  0.8             0.6          0.3
max (full stroke) seek (ms)     16              12           7
average latency (ms)            4.17            3            2
internal transfer rate (Mbps)   282 – 508       520 – 682    522 – 709
disk buffer cache               16 MB           4 MB         8 MB
Mechanics of Magnetic Disks
Platters: circular platters covered with magnetic material to provide
nonvolatile storage of bits.
Spindle: the shaft around which the platters rotate.
Tracks: concentric circles on a single platter.
Sectors: segments of the track circle, separated by non-magnetic gaps.
The gaps are often used to identify the beginning of a sector.
Cylinders: corresponding tracks on the different platters are said to
form a cylinder.
Disk heads: read or alter the magnetism (bits) passing under them. The
heads are attached to an arm enabling them to move across the platter
surface.
Input/Output Systems
Magnetic disks - Internal performance measures
 Seek (access) time
- time a program or device takes to locate a particular piece of data
- access time is often longer than seek time because it includes a
brief latency period

 Spindle speed
- the speed at which the shaft in the middle of the disk drive rotates
Input/Output Systems
Magnetic disks - Internal performance measures
 Rotational latency (rotational delay)
- the amount of time it takes for the desired sector
of a disk to rotate under the read-write heads of
the disk drive
- average rotational latency for a disk is half the
amount of time it takes for the disk to make one
revolution.
- typically applied to rotating storage devices, but
not to tape drives.
Disk Capacity
 The size of the disk is dependent on
 the number of platters
 whether the platters use one or both sides
 number of tracks per surface
 (average) number of sectors per track
 number of bytes per sector

 Example (Cheetah X15):
 4 platters using both sides: 8 surfaces
 18497 tracks per surface
 617 sectors per track (average)
 512 bytes per sector
 Total capacity = 8 x 18497 x 617 x 512 ≈ 4.6 x 10^10 bytes = 42.8 GB
 Formatted capacity = 36.7 GB

Note 1: the tracks on the edge of the platter are larger than the
tracks close to the spindle. Today, most disks are zoned, i.e., the
outer tracks have more sectors than the inner tracks.
Note 2: there is a difference between formatted and total capacity.
Some of the capacity is used for storing checksums, spare tracks,
gaps, etc.
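
A sketch of the capacity calculation (Python; variable names are
illustrative):

surfaces = 4 * 2                  # 4 platters, both sides used
tracks_per_surface = 18497
sectors_per_track = 617           # average
bytes_per_sector = 512

total = surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector
print(f"Total capacity = {total:.2e} bytes (= {total / 2**30:.1f} GiB)")
# ~4.67e10 bytes (~43.5 GiB; the slide rounds to 42.8 GB). Formatted
# capacity (36.7 GB) is lower: checksums, spare tracks, gaps, etc.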
Disk Access Time

Disk access time = Seek time
                 + Rotational delay
                 + Transfer time
                 + Other delays

[Figure: disk platter, head, and arm servicing the request "I want
block X"; the block ends up in memory.]
Disk Access Time
Disk read/write latency has four components:
Seek delay (t_seek): head seeks to the right track
 Average of ~5 ms - 15 ms
 Less in practice because of shorter seeks
Rotational delay (t_rotation): right sector rotates under head
 On average: time to go halfway around the disk
 Based on rotation speed (RPM)
 10,000 to 15,000 RPM
 ~3 ms
Transfer time (t_transfer): data actually being transferred
 Fast for small blocks
Controller delay (t_controller): controller overhead (on either side)
 Fast (no moving parts)
Disk Access Time: Seek Time
 Seek time is the time to position the head
 the head requires a minimum amount of time to start and stop moving
 some time is used for actually moving the head –
roughly proportional to the number of cylinders traveled

“Typical” average:
10 ms → 40 ms
7.4 ms (Barracuda 180)
5.7 ms (Cheetah 36)
3.6 ms (Cheetah X15)
Disk Access Time: Rotational Delay
 Time for the disk platters to rotate so the first of the required
sectors is under the disk head
 Average delay is 1/2 revolution

“Typical” average:
8.33 ms (3,600 RPM)
5.56 ms (5,400 RPM)
4.17 ms (7,200 RPM)
3.00 ms (10,000 RPM)
2.00 ms (15,000 RPM)
Disk Access Time: Transfer Time
 Time for data to be read by the disk head, i.e., the time it takes
the sectors of the requested block to rotate past the head

 Transfer rate = amount of data per track / time per rotation
 Transfer time = amount of data in the block / transfer rate
Disk Access Time: Transfer Time
Example 1
If a disk has 250 KB per track and operates at 10,000 RPM, we can
read from the disk at 40.69 MB/s.
 Access time is dependent on data density and rotation speed
 If we have to change tracks, time must also be added for moving
the head
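
A sketch of Example 1 (Python; not part of the original notes):

track_kb = 250                 # KB of data per track
rpm = 10_000

time_per_rotation = 60 / rpm                  # 6 ms
rate_kb_per_s = track_kb / time_per_rotation
print(f"Read rate = {rate_kb_per_s / 1024:.2f} MB/s")  # 40.69 MB/s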
Disk Throughput
 How much data can we retrieve per second?
 Throughput = data size / transfer time (including all delays)

Note: to increase overall throughput, one should read as much as
possible contiguously on disk.
Disk Throughput
Example:
for each operation we have
- average seek
- average rotational delay
- transfer time
- no gaps, etc.

 Cheetah X15
4 KB blocks → 0.71 MB/s
64 KB blocks → 11.42 MB/s

 Barracuda 180
4 KB blocks → 0.35 MB/s
64 KB blocks → 5.53 MB/s
Disk Access Time: Other Delays
 There are several other factors which
might introduce additional delays:
 CPU time to issue and process I/O
 contention for controller
 contention for bus
 contention for memory
 verifying block correctness with checksums
(retransmissions)
 waiting in scheduling queue
Input/Output Systems
Rotational Latency Example
Calculate the time taken to read a 4 KB page, assuming:
128 sectors/track, 512 B/sector, 6000 RPM,
10 ms t_seek, 1 ms t_controller

Solution
6000 RPM => 100 R/s => 10 ms/R
=> t_rotation = 10 ms / 2 = 5 ms

4 KB page => 8 sectors
=> t_transfer = 10 ms x 8/128 = 0.6 ms

t_disk = t_seek + t_rotation + t_transfer + t_controller
       = 10 + 5 + 0.6 + 1 = 16.6 ms
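
The same steps as a short sketch (Python; variable names are
illustrative):

rpm = 6000
sectors_per_track = 128
page_sectors = 4096 // 512            # a 4 KB page spans 8 sectors
t_seek, t_controller = 10.0, 1.0      # ms, as given above

ms_per_rev = 60_000 / rpm             # 10 ms per revolution
t_rotation = ms_per_rev / 2           # average: half a revolution
t_transfer = ms_per_rev * page_sectors / sectors_per_track

t_disk = t_seek + t_rotation + t_transfer + t_controller
print(f"t_disk = {t_disk:.1f} ms")    # 16.6 ms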
Some Complicating Issues
 There are several complicating factors:
 the "other delays" described earlier, like consumed CPU time,
resource contention, etc.
 zoned disks, i.e., outer tracks are longer and therefore usually
have more sectors than inner tracks
 checksums are also stored with each of the sectors

Note 1: transfer rates are higher on outer tracks.
Note 2: gaps between sectors.
Note 3: the checksum is read for each track and used to validate the track.
Note 4: the checksum is usually calculated using Reed-Solomon
interleaved with CRC.
Note 5: for older drives the checksum is 16 bytes.
Note 6: SCSI disks may be changed by the user to have other sector sizes.
What is a Bus?
 A communication pathway connecting two
or more devices
 shared between the devices attached to it
 information is usually broadcast to all
devices (not just intended recipient)
 consists of multiple communication lines
transmitting information (a 0 or 1) in
parallel
What is a Bus?
 width is important in determining performance
 systems can contain multiple buses
 often grouped
A number of channels in one bus
e.g. 32 bit data bus is 32 separate single bit channels
 power lines may not be shown
 Classified as data, address, control or power
lines
What do buses look like?

 Parallel lines on circuit boards (often


yellow)
 Ribbon cables
 Strip connectors on mother boards
 e.g. PCI

 Sets of wires
Buses
 Types of buses
 Processor-memory buses
 I/O buses
 Backplane buses – connect I/O devices to processor and memory
 Major advantages of bus organization are versatility and low cost.
 Major disadvantage is that insufficient bandwidth may create a
communication bottleneck, limiting the maximum I/O throughput.
 Performance factors
 Physical size (latency & bandwidth)
 Number and type of connected devices (taps)
Bus Interconnection Scheme
Address Bus
 contains the source or destination address
of the data on the data bus
e.g. CPU needs to read an instruction (data) from
a given location in memory

 Bus width determines maximum memory


capacity of system
e.g. 8080 has 16 bit address bus giving
64k address space
Data Bus
 Used for moving data between modules
 Remember that there is no difference between “data”
and “instruction” at this level

 Width is a key determinant of performance


 8, 16, 32, 64 bit
Control Bus
 set of lines used to control use of the data
and address lines by the attached devices
 includes
 Memory read/write signal
 Interrupt acknowledgement
 Interrupt request
 Clock signals
 I/O Read or Write
 Transfer ACK
 Bus request
 Bus grant
 Reset
Bus Design Issues
 Type
• dedicated or multiplexed
 Arbitration
• centralised or distributed
 Timing
• synchronous or asynchronous
 Width

 Transfer type
• serial or parallel
Bus Types
 Dedicated
 Separate data & address lines
 Multiplexed
 Shared lines
 Address valid or data valid control line
 Advantage - fewer lines
 Disadvantages
 More complex control
 Degradation of performance
Bus Arbitration
 Ensuring only one device uses the bus at
a time – avoiding collisions
 Choosing a master among multiple
requests
 Try to implement priority and fairness (no
device “starves”)
 Uses a master-slave mechanism
 Two main schemes:
• centralised
• distributed
Bus Arbitration
Components
 Bus master: component that can initiate a bus
request
 Bus typically has several masters
 Processor, but I/O devices can also be
masters
 Daisy-chain: devices connect to bus in priority
order
 High-priority devices intercept/deny requests
by low-priority ones
 Simple, but slow and can’t ensure fairness
Bus Arbitration
New trend: Point-to-point busses
• Pro: No arbitration, no “master”, fast,
simple, source synchronous
• Con: need lots of wires or requires high
per-wire bandwidth
Bus Arbitration
Centralised
 single hardware device controls bus access
(bus controller or bus arbiter)
 May be part of CPU or separate

Distributed
 any module (except passive devices like
memory) can become the bus master e.g. CPU
and DMA controller
 Access control logic is on all modules
 Modules work together to control bus
Bus Timing
 Co-ordination of events on bus
 Synchronous – events are controlled by a
clock
 Asynchronous – timing is handled by well-
defined specifications, i.e., a response is
delivered within a specified time after a
request
Synchronous Bus Timing
 Events determined by clock signals
 Control Bus includes clock line
 A single 1-0 cycle is a bus cycle
 All devices can read clock line
 Usually sync on leading/rising edge
 Usually a single cycle for an event
 Analogy – Orchestra conductor with baton
 Usually stricter in terms of its timing
requirements
Synchronous Bus Timing
Asynchronous Timing

 Devices must have certain tolerances to


provide responses to signal stimuli
 More flexible allowing slower devices to
communicate on same bus with faster
devices.
 Performance of faster devices, however, is
limited to speed of bus
Asynchronous Timing – Read
Asynchronous Timing – Write
Bus Width
 The wider the bus, the better the data transfer rate or the wider
the addressable memory space
 Serial "width" is determined by length/duration of the frame
Single Bus Problems
 Lots of devices on one bus leads to:
 Propagation delays
 Long data paths mean that co-ordination of bus
use can adversely affect performance
 If aggregate data transfer approaches bus capacity

 Most systems use multiple buses to


overcome these problems
Bus Transfer
 Historically, parallel has been used for high
speed peripherals (e.g., SCSI, parallel port zip
drives rather than serial port). High speed serial,
however, has begun to replace this need

 Serial communication also used to be restricted


to point-to-point communications. Now there's
an increasing prevalence of multipoint
I/O Interfaces
 How does I/O actually happen?
 How does CPU give commands to I/O devices?
 How do I/O devices execute data transfers?
 How does CPU know when I/O devices are
done?
Note: To answer the above questions, please refer to previous notes
or read up on:
- programmed / memory-mapped I/O
- interrupt-driven I/O
- direct memory access (DMA)
I/O Interfaces
Recap:
Programmed I/O
 CPU has direct control over I/O
 CPU waits for I/O module to complete operation
 Wastes CPU time

Interrupt-driven I/O
 Overcomes CPU waiting
 No repeated CPU checking of device
 I/O module interrupts when ready

Direct memory access (DMA)


 CPU gives DMA controller the required access information
 CPU carries on with other work
 DMA controller deals with transfer
 DMA controller sends interrupt when finished
I/O Interface - Polling
 process of constantly testing a port to see if data
is available
(CPU waits in a short loop, testing the I/O port's
status value until the I/O is ready to accept more
data, and then the CPU can transfer more data
to the I/O)

 inherently inefficient – CPU remains idle while


waiting for I/O

 Solution - provide an interrupt (an external


hardware event) that causes the CPU to
interrupt the current instruction sequence and
call a special interrupt service routine (ISR).
I/O Interface - Polling

 Computing processor’s overhead when:


 Polling a mouse
 Polling a disk
 Doing an interrupt driven I/O with a disk
 Using a DMA controller
Polling a Mouse
 Assume
 Polling frequency of 30 Hz
 30 requests/s
 400 clocks needed to perform a poll
 30 x 400 = 12,000 clocks/s
 500 MHz processor
 12,000 / (500 x 10^6) = 0.0024% of time spent polling

 Conclusion: An acceptable overhead
Polling a Hard Disk
Assume:
 Hard disk is active (reads or writes data) all the time
 Hard disk generates 4 Mbyte/s
 A 16-byte part of a block is transferred at a time
 (disk controller has a 16-byte buffer)
 4 x 1024 x 1024 / 16 = 256 Kpolls/s needed
 otherwise some data could be lost
 400 clocks needed to perform a poll and a 16-byte data transfer,
if data is available
 256 x 1024 x 400 clocks/s = 104,857,600 clocks/s
 500 MHz processor (1 Hz = 1 clock/s)
 Approx. 21% of time spent polling
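
Both polling overheads as a sketch (Python; names are illustrative):

cpu_hz = 500e6
clocks_per_poll = 400

mouse_polls = 30                         # 30 Hz polling rate
disk_polls = 4 * 1024 * 1024 / 16        # 4 MB/s moved 16 bytes at a time

for device, polls in (("mouse", mouse_polls), ("disk", disk_polls)):
    overhead = polls * clocks_per_poll / cpu_hz
    print(f"{device}: {overhead:.4%} of CPU time")
# mouse: ~0.0024% (acceptable)   disk: ~20.97% (far too much)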
Block transfer time

controller overhead   average seek   rotational latency   data transfer
2 ms                  17.8 ms        4.2 ms               1 ms

In this example, data transfer time represents just 4% of the total
I/O operation time.
Hard Disk: Interrupts With No DMA
 Assume
 To read a block from disk takes an average of 25 ms
 Interrupts occur during data transfer (1 ms) only
 Interrupts occur during 4% of the time (1 out of 25 ms), and only
when the disk controller has 16 bytes to transfer
 Disk transfers 4 MByte/s, 16 bytes at a time
 So, when the disk transfers data, the processor has to service
256K interrupts/s,
 but during a whole I/O operation, only 0.04 x 256K interrupts/s
(~11K interrupts/s) on average
Hard Disk: Interrupts With No DMA
 Assume
 500 clocks needed to handle an interrupt and a transfer of 16
bytes of data
 500 MHz CPU
 Processor is busy servicing the hard disk during the bare data
transfer:
 256 x 1024 x 500 clocks/s = 131,072,000 clocks/s (26%)
 But during a whole I/O operation, the processor is busy servicing
the hard disk only:
 0.04 x 256 x 1024 x 500 clocks/s = 5,242,880 clocks/s
 The processor services interrupts about 1.05% of the time

(Note the improvement over polling)
Interrupt Driven Hard Disk With DMA
 Assume
 a 4 Kbyte block is transferred in every I/O
 25 ms needed to complete a disk I/O => 40 I/Os per second
 1,000 clocks needed to set up a disk I/O
 500 clocks needed to handle the interrupt when the transfer completes
 500 MHz CPU
 0 processor clocks needed during the actual data transfer
 1,500 clocks needed to service a disk I/O operation
 A disk operation lasts 25 ms on average, which is
0.025 x 500 x 10^6 clocks = 12.5 x 10^6 clocks
 1,500 clocks / (0.025 s x 500M clocks/s) gives 0.012% processor
overhead during one disk I/O operation
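
The DMA overhead as a sketch (Python; names are illustrative):

cpu_hz = 500e6
setup_clocks = 1_000        # set up the disk I/O
interrupt_clocks = 500      # handle the completion interrupt
io_seconds = 0.025          # one disk I/O lasts 25 ms

overhead = (setup_clocks + interrupt_clocks) / (io_seconds * cpu_hz)
print(f"CPU overhead per disk I/O: {overhead:.3%}")   # 0.012%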
Polling, interrupt-driven & DMA I/Os
[Figure: timeline of one disk I/O (controller overhead, average seek
time, rotational delay, data transfer), comparing the CPU overhead of
polling, interrupt-driven I/O, and DMA-driven I/O.]

Designing I/O
Designing an I/O System for Bandwidth
 Find bandwidths of individual components
 Configure the components you can change to match the bandwidth of
the bottleneck component you can't

Example
300 MIPS CPU, 100 MB/s I/O bus
50K OS instructions + 100K user instructions per I/O operation
SCSI-2 controllers (20 MB/s): each accommodates up to 7 disks
5 MB/s disks with t_seek + t_rotation = 10 ms, 64 KB reads

Determine:
 What is the maximum sustainable I/O rate?
 How many SCSI-2 controllers and disks does it require?
 Assuming random reads
Designing I/O
Designing an I/O System for Bandwidth
First: determine I/O rates of components we can't change
 CPU: (300M instructions/s) / (150K instructions/IO) = 2000 IO/s
 I/O bus: (100 MB/s) / (64 KB/IO) = 1562 IO/s
 Peak I/O rate determined by bus: 1562 IO/s

Second: configure remaining components to match the rate
 Disk: 1 / [10 ms/IO + (64 KB/IO) / (5 MB/s)] = 43.9 IO/s
 How many disks?
(1562 IO/s) / (43.9 IO/s) = 36 disks
 How many controllers?
(43.9 IO/s) x (64 KB/IO) = 2.74 MB/s per disk
(20 MB/s) / (2.74 MB/s) = 7.2 disks per controller
(36 disks) / (7 disks/SCSI-2) = 6 SCSI-2 controllers
 Caveat: real I/O systems are modeled with simulation
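
The sizing arithmetic as a sketch (Python, decimal units as on the
slide; names are illustrative):

import math

cpu_rate  = 300e6 / (50e3 + 100e3)       # 2000 IO/s
bus_rate  = 100e6 / 64e3                 # 1562.5 IO/s <- the bottleneck
disk_rate = 1 / (10e-3 + 64e3 / 5e6)     # ~43.9 IO/s per disk

n_disks = math.ceil(min(cpu_rate, bus_rate) / disk_rate)    # 36
per_ctrl = min(7, math.floor(20e6 / (disk_rate * 64e3)))    # 7
n_ctrls = math.ceil(n_disks / per_ctrl)                     # 6
print(f"{n_disks} disks on {n_ctrls} SCSI-2 controllers")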
Designing I/O
Designing an I/O System for Latency
E.g., database system that may require maximum or average
latency

 Latencies are actually harder to deal with than bandwidths


 Unloaded system: few concurrent IO transactions
- Latency is easy to calculate

 Loaded system: many concurrent IO transactions


• Contention can lead to queuing
• Latencies can rise dramatically
• Queuing theory can help if transactions obey
fixed distribution
• Otherwise simulation is needed
UNIT 4 – Instruction Set Architecture
Objective:
After the completion of this unit the student
should be able to describe qualitatively and
mathematically the following:
 Load-store architectures
 Example: Implementation of integer addition using load-
store approach (RISC architectures) vs. memory-to-
memory approach (CISC architectures)
 Survey of implementation of I/O
 Hardware I/O instructions (Intel 80x86)
 System calls (RISC architectures)
UNIT 5 – Datapath & Control
Objective:
After completion of this unit the student should be able to:
 Differentiate between different datapaths
 Single cycle vs multiple cycle datapaths
 Calculate improvement in performance due to:
 Single cycle processors
 Multiple cycle processors
 Calculate improvement in performance due to:
 Pipelined processors
 Data hazards vs control hazards
 Data Forwarding
 Memory stalls
 Explain concept of microprogramming
 Horizontal vs vertical microcodes
Datapath & Control
References: Chpts. 3 & 5 – Hennessy & Patterson; Chpts. 12 & 18 – Stallings

 An implementation of a CPU

 Must be considered with the processor design


– sequencing and execution of instructions
 the individual components that are necessary:
 the use of a clock
 registers and memory.
The Performance Perspective
Performance of a machine is determined by:
 Instruction count (determined by compiler & ISA)
 Clock cycle time
 Clock cycles per instruction

Processor design (datapath and control) will


determine:
 Clock cycle time
 Clock cycles per instruction
How to Design a Processor: step-by-step
1. Analyze instruction set => datapath requirements
 the meaning of each instruction is given by the
register transfers
 datapath must include storage element for ISA
registers
 datapath must support each register transfer
2. Select set of datapath components and establish
clocking methodology
How to Design a Processor: step-by-step
3. Assemble datapath meeting the requirements

4. Analyze the implementation of each instruction to determine the
setting of the control points that effect the register transfer.

5. Assemble the control logic


Datapath & Control
Datapath
 The interconnection of functional units that make
up the processor, such as ALUs, decoders, and
multiplexers that perform data processing
operations
 A component of most processors
 Provides connections for moving bits between
memory, registers and the ALU
 Works in conjunction with the control unit
Datapath
 Include the functional units we need for each instruction

[Figure: datapath building blocks – a. instruction memory, b. program
counter, c. adder; a. register file (read/write register ports,
RegWrite), b. ALU (ALU control, Zero, ALU result); a. data memory unit
(MemWrite, MemRead, address, read/write data), b. sign-extension unit
(16 → 32 bits).]
Building the Datapath
 Use multiplexors to stitch them together

[Figure: single-cycle datapath combining PC, instruction memory,
register file, ALU, data memory, and sign-extension unit via
multiplexors; control lines shown include PCSrc, ALUSrc, MemtoReg,
RegWrite, MemWrite, MemRead, and the 3-bit ALU operation.]
Datapath
Two implementations (demonstrated with MIPS
instruction)
 Single cycle implementation
 uses a single clock cycle for every instruction.
(Start execution with one clock edge, and complete on the next
edge)
 Advantage: One clock cycle per instruction
 Disadvantage: long cycle time

 Multicycle implementation
 instructions use multiple clock cycles.
 Advantage: Shorter execution time
 Disadvantage: More complex control
Single-Cycle Implementation
[Figure: single-cycle MIPS datapath annotated with control signals
(PCSrc, RegWrite, MemWrite, MemtoReg, ALUSrc, RegDst, MemRead, ALUOp)
and instruction fields (bits 31–0, 25–21, 20–16, 15–11, 15–0, 5–0).]
Single-Cycle Datapath
 Cycle time = Σ(stages)
 Execution Time = IC * CPI * Cycle Time
 Processor design (datapath and control) will
determine:
 Clock cycle time
 Clock cycles per instruction
 All combinational logic must stabilize within one
clock cycle.
 All state elements will be written exactly once at
the end of the clock.
 CPI = 1
Recall: MIPS Instruction Format
Single-Cycle Datapath: R-format
• Format: opcode r1, r2, r3

[Figure: R-format datapath – the instruction's register numbers drive
the register file's Read Reg 1 / Read Reg 2 / Write Register ports;
Read Data 1 and Read Data 2 feed the ALU, whose Result is written back
(RegWrite asserted); ALU op selects the operation, Zero is the test
output.]
Single-Cycle Datapath: Load/Store

[Figure: load/store datapath – fetch, decode, execute stages.]

Single-Cycle Datapath: Branch

[Figure: branch datapath – fetch, decode, execute stages.]
Multicycle Approach
 Used to improve performance in real computer systems
 Divides instruction execution into multiple clock cycles.
 Datapath resources can be reused on each cycle, saving resources
(in single cycle datapath, all operations must occur in parallel).
 ALU used to compute address and to increment PC
 Memory used for instruction and data

 Control signals not determined solely by instruction
 e.g., what should the ALU do for a "subtract" instruction?
Multicycle Approach
 Shorter instructions can use fewer clock
cycles
 total execution time is reduced.
Multicycle Approach
 Cycle time = time for longest stage
 Execution time = cycle time * IC * CPI
 Advantages
 The work required by the typical instructions can be
divided over approximately equal, smaller elementary
operations.

 The clock cycle can be set to the longest elementary


operation.
Multicycle Datapath
Multicycle Datapath: R-format
Step 1: Fetch instr. // Store in IR // Compute PC + 4
Step 2: Decode instruction: opcode, rd, rs, rt, funct fields
Data fetch: Apply rs, rt to Register File
Data Read into A,B buffer registers (ALUin)
Step 3: ALU operation (ALUsrcA, ALUsrcB, ALUop)
ALU output goes into ALUout register

Step 4: ALUout register contents written to Register File write input
        Register number in rd written (Assert: RegWrite, RegDst)

CPI for R-format = 4 cycles
Multicycle Datapath: Store Word (sw)
Step 1: Fetch instr. // Store in IR // Compute PC + 4
Step 2: Decode instruction: opcode, rs, rt, offset fields
        Data fetch: Apply rs to Register File => base address
        read into A buffer register (Base)
        SignExt offset field into B buffer register
Step 3: ALU operation (ALUsrcB, ALUop) => Base + Offset
        ALU output goes into ALUout register
Step 4: ALUout register contents applied as Memory Address
        Assert: MemWrite

CPI for Store = 4 cycles
Multicycle DP: Load Word (lw)
Step 1: Fetch instr. // Store in IR // Compute PC + 4
Step 2: Decode instruction: opcode, rs, rt, offset fields
        Data fetch: Apply rs to Register File => base address
        read into A buffer register (Base)
        SignExt offset field into B buffer register
Step 3: ALU operation (ALUsrcB, ALUop) => Base + Offset
        ALU output goes into ALUout register
Step 4: ALUout register contents applied as Memory Address
        Assert: MemRead
Step 5: Memory Data Out routed to Register File write input
        Register number from rt written (Assert: RegWrite, MemtoReg)

CPI for Load = 5 cycles
Multicycle Datapath: Branch
Step 1: Fetch instr. // Store in IR // Compute PC + 4
Step 2: Decode instruction: opcode, rs, rt, offset fields
        Data fetch: Apply rs, rt to Register File
        BTA calc: SignExt, Shift offset field into B buffer register
        ALU composes PC, offset => BTA
Step 3: ALU operation (ALUsrcA, ALUsrcB, ALUop) = compare
        ALU output present at Zero causes Control to select BTA or PC+4

CPI for Conditional Branch = 3 cycles
Multicycle DP: Jump
Step 1: Fetch instr. // Store in IR // Compute PC + 4
Step 2: Decode instruction: opcode, address fields
        JTA calc: Shift address field left 2 => bits 27-0
        Concatenate with PC [bits 31-28] => JTA
Step 3: PC replaced by the Jump Target Address (JTA)
        PCsource = 10, PCWrite asserted

CPI for Jump = 3 cycles
Datapath & Control
Control
 A circuit that generates the control signals needed by
the datapath
 Dedicated to regulating the interaction between the
datapath and memory
 Checks datapath signals in order to decide what to
do.
 A collection of signals that enable/disable the
inputs/outputs of the various components.
 May be likened to the brain, and the datapath as the
body – the datapath does only what the brain tells it
to do.
Control Unit Layout
Instruction Cycle - overview
 Six phases of the complete Instruction Cycle

 Fetch: load IR with instruction from memory

 Decode: determine action to take (set up inputs for ALU, RAM, etc.)

 Evaluate address: compute memory address of operands, if any

 Fetch operands: read operands from memory or registers

 Execute: carry out instruction

 Store results: write result to destination (register or memory)


Instruction Pipelining
 An implementation technique where
multiple instructions are overlapped in
execution

 The computer pipeline is divided in stages.


 Each stage completes a part of an instruction
in parallel.
 The stages are connected one to the next to
form a pipe - instructions enter at one end,
progress through the stages, and exit at the
other end.
Instruction Cycle State Diagram
Next: Pipelining
Instruction Pipelining
Chapter 6 – Hennessy & Patterson
 Pipelining does not decrease the time for
individual instruction execution

 Increases instruction throughput.

The throughput of the instruction pipeline is


determined by how often an instruction exits the
pipeline.
Instruction Pipelining
 A “good” design goal of any system is to
have all of its components performing
useful work all of the time – high efficiency
 Following the instruction cycle in a
sequential fashion does not permit this
level of efficiency
 Compare the instruction cycle to an
automobile assembly line
Instruction Pipelining
 For n iterations of the task, the execution
times will be:
 With no pipelining: nk time units
 With pipelining: k + (n-1) time units

 Speedup of a k-stage pipeline is thus:


S = nk / [k+(n-1)] ==> k (for large n)
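
A sketch of the speedup formula (Python; not part of the original
notes), showing S approaching k for large n:

def pipeline_speedup(n, k):
    """Speedup of a k-stage pipeline: nk / (k + n - 1)."""
    return (n * k) / (k + n - 1)

for n in (5, 100, 10_000):
    print(f"n = {n:>6}: S = {pipeline_speedup(n, 5):.2f}")
# 2.78, 4.81, 5.00 -> S approaches k (here 5) as n grows large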
Pipelining Stages
 Fetch instruction
 Decode instruction
 Calculate operands
 Fetch operands
 Execute instructions
 Write result

 Overlap these operations


Instruction Pipelining
 Perform all tasks concurrently, but on
different (sequential) instructions
 The result is temporal parallelism
 Result is the instruction pipeline

 An ideal pipeline divides a task into k


independent sequential subtasks
 Each subtask requires 1 time unit to complete
 The task itself then requires k time units to complete
Two Stage Instruction Pipeline
Timing of Pipeline
Branch in a Pipeline
Pipelining

 Cycle time
= longest stage + pipeline overhead

 Execution time
= cycle time * (no. of stages + IC – 1)
Pipelining
 changes the relative timing of instructions by overlapping their
execution
 introduces hazards

                   1    2    3    4    5    6    7    8    9
ADD R1, R2, R3     IF   ID   EX   MEM  WB
SUB R4, R5, R1          IF   ID   EX   MEM  WB
AND R6, R1, R7               IF   ID   EX   MEM  WB
OR  R8, R1, R9                    IF   ID   EX   MEM  WB
XOR R10, R1, R11                       IF   ID   EX   MEM  WB
Pipelining
 All the instructions after the ADD use the
result of the ADD instruction (in R1).

 The ADD instruction writes the value of R1


in the WB stage and the SUB instruction
reads the value during ID stage resulting
in a data hazard and unless precautions
are taken to prevent it, the SUB instruction
will read the wrong value and try to use it.
Pipelining
 For the AND instruction, the write of R1
does not complete until the end of cycle 5.
Thus, the AND instruction that reads the
registers during cycle 4 will receive the
wrong result.
Pipelining
 The OR instruction can be made to operate without incurring a
hazard by a simple implementation technique: perform register file
reads in the second half of the cycle, and writes in the first half.
Because both the WB of ADD and the ID of OR are performed in cycle 5,
the write to the register file by ADD happens in the first half of
the cycle, and the read of the registers by OR happens in the second
half.
Pipelining

 The XOR instruction operates properly, because its register read
occurs in cycle 6, after the register write by ADD.
Pipeline Hazards
 situations in which a correct program
ceases to work correctly due to
implementing the processor with a pipeline

 three fundamental types of hazard:


 data hazards
 branch hazards
 structural hazards

Note: Only branch hazards and RAW data hazards are possible in MIPS
pipeline
Pipeline Hazards
Data hazards
 when reads and writes of data occur in a different order in the
pipeline than in the program code – the result of data dependencies
 an instruction uses the result of a previous instruction (RAW):

ADD R1, R2, R3        SW R1, 3(R2)
ADD R4, R1, R5   or   LW R3, 3(R2)

 Write After Read (WAR), Write After Write (WAW), and Read After
Write (RAW) hazards
Pipeline Hazards
Data hazards
WAR
 the reverse of a RAW: in the code a write occurs after a read, but
the pipeline causes the write to happen first

WAW
 is a situation in which two writes occur out of
order - when there is no read in between
Pipeline Hazards
Data hazards
RAW
 occurs when, in the code as written, one
instruction reads a location after an earlier
instruction writes new data to it, but in the pipeline
the write occurs after the read (so the instruction
doing the read gets stale data).
Pipeline Hazards
Example (Data hazard)
add $1, $2, $3 _ _ _ _ _
add $4, $5, $6 _ _ _ _ _
add $7, $8, $9 _ _ _ _ _
add $10, $11, $12 _ _ _ _ _
add $13, $14, $1 _ _ _ _ _ (data arrives early; OK)
add $15, $16, $7 _ _ _ _ _ (data arrives on time; OK)
add $17, $18, $13 _ _ _ _ _ (uh, oh)
add $19, $20, $17 _ _ _ _ _ (uh, oh again)
Pipeline Hazards
Branch/Control hazards
 occurs when a decision needs to be made, but the
information needed to make the decision is not available
yet
JMP LOOP

LOOP: ADD R1, R2, R3
Structural hazards
 occur when a single piece of hardware is used in more
than one stage of the pipeline, so it's possible for two
instructions to need it at the same time
Resolving Hazards
 Four possible techniques
1. Stall
 Can resolve any type of hazard
 Detect the hazard
 Freeze the pipeline up to the dependent stage
until the hazard is resolved - simply make the
later instruction wait until the hazard resolves
itself
 Undesirable because it slows down the
machine, but may be necessary.
Resolving Hazards
2. Bypass/Forward
 detects the condition
 If the data is available somewhere, but is just not where we want
it, create extra data paths to "forward" the data to where it is
needed
 no need to stall
 eliminates stalls for single-cycle operations
- reduces the longest stall to N-1 cycles for N-cycle operations
 best solution, since it doesn't slow the machine down and doesn't
change the semantics of the instruction set.
Resolving Hazards
3. Document/punt - define instruction
sequences that are forbidden, or change
the semantics of instructions, to account
for the hazard - worst solution, both
because it results in obscure conditions
on permissible instruction sequences,
and (more importantly) because it ties
the instruction set to a particular pipeline
implementation.
Resolving Hazards
4. Add hardware
 most appropriate to structural hazards; if
a piece of hardware has to be used twice
in an instruction, see if there is a way to
duplicate the hardware.
Microprogramming
 A.k.a. writing microcode
 method that can be employed to implement
machine instructions in a CPU relatively
easily, often using less hardware than with
other methods.
 a set of very detailed and rudimentary lowest-
level routines which controls and sequences
the actions needed to execute particular
instructions, sometimes also to decode
(interpret) them.
Microprogramming
 a machine instruction implemented by a series
of microinstructions - loosely comparable to
how an interpreter implements a high-level
language statement using a series of machine
instructions
 Microprograms are carefully designed and
optimized for the fastest possible execution,
since a slow microprogram would yield a slow
machine instruction which would in turn cause
all programs using that instruction to be slow.
Microprogramming
Microcode
 the element of a microprogram
 normally written by the CPU engineer during
the design phase
 generally not meant to be visible or
changeable by a normal programmer, not
even an assembly programmer
 can be dramatically changed with a new
microarchitecture generation
Microprogramming
Microcode
 often used to let one microarchitecture emulate
another, usually more powerful, architecture.
 used as a synonym for firmware, whether or not
it actually implements the microprogramming of
a processor. Even simple firmware, such as the
one used in a hard drive, is sometimes described
as microcode.
Microprogramming
Microcode
 each microinstruction in a microprogram provides
the bits which control the functional elements
that internally compose a CPU - control
therefore becomes a specialized form of a
computer program
 thus transforms a complex electronic design
challenge (the control of a CPU) into a less-
complex programming challenge
 can be characterized as horizontal or vertical.
Horizontal Microcode
 each microinstruction directly controls
CPU elements
 typically contained in a fairly wide control
store; it is not uncommon for each word to
be 56 bits or more.
 on each tick of a sequencer clock a
microcode word is read, decoded, and
used to control the functional elements
which make up the CPU.
 a horizontal microprogram word comprises
fairly tightly defined groups of bits.
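To make this concrete, here is a minimal sketch with an invented field layout (real horizontal control words are wider - 56 bits or more, as noted above): each group of bits in the word drives a control point directly, with no decoding step.

# Minimal sketch of a horizontal microword. The field layout below is
# hypothetical, purely for illustration; each field drives a control
# point of the datapath directly.

FIELDS = {                   # name: (bit position, width)
    "alu_op":    (0, 4),     # ALU operation select
    "reg_src_a": (4, 5),     # register file read port A
    "reg_src_b": (9, 5),     # register file read port B
    "reg_dest":  (14, 5),    # register file write port
    "reg_write": (19, 1),    # register file write enable
    "mem_read":  (20, 1),
    "mem_write": (21, 1),
    "next_addr": (22, 10),   # address of the next microword
}

def control_signals(microword):
    """Slice a microword into the control signals it asserts."""
    return {name: (microword >> pos) & ((1 << width) - 1)
            for name, (pos, width) in FIELDS.items()}

# A microword asserting: alu_op=2 (say, ADD), A=r2, B=r3, dest=r1, write.
word = 2 | (2 << 4) | (3 << 9) | (1 << 14) | (1 << 19) | (5 << 22)
print(control_signals(word))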
Vertical Microcode
 each microinstruction requires subsequent
decoding before it can control CPU elements
 in vertical microcode, each microinstruction is
encoded -- that is, the bit fields may pass
through intermediate combinatory logic which in
turn generates the actual control signals for
internal CPU elements (ALU, registers, etc.) --
in contrast to horizontal microcode, where the
bit fields themselves directly produce the
control signals
 requires shorter instruction lengths and less storage
Vertical Microcode
 requires more time to decode, resulting in
a slower CPU clock.
 may be just the assembly language of a
simple conventional computer that is
emulating a more complex computer.
 As transistors became cheaper, horizontal
microcode came to dominate the design of
CPUs using microcode, with vertical
microcode no longer being used.
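For contrast with the horizontal sketch above, a vertical microword is short and encoded, so it must pass through a decoding step before the control signals emerge; in this minimal sketch (again with an invented encoding) a lookup table stands in for the intermediate combinatory logic.

# Minimal sketch of a vertical microword: an opcode field is decoded
# into the full set of control signals, trading decode time for a much
# narrower control store.

DECODE = {   # opcode -> control-signal template (hypothetical)
    0: {"alu_op": 2, "reg_write": 1, "mem_read": 0, "mem_write": 0},  # ADD
    1: {"alu_op": 6, "reg_write": 1, "mem_read": 0, "mem_write": 0},  # SUB
    2: {"alu_op": 2, "reg_write": 1, "mem_read": 1, "mem_write": 0},  # LOAD
    3: {"alu_op": 2, "reg_write": 0, "mem_read": 0, "mem_write": 1},  # STORE
}

def decode_vertical(microword):
    """9-bit word: opcode in bits 0-3, an operand register in bits 4-8."""
    opcode  = microword & 0xF
    operand = (microword >> 4) & 0x1F
    signals = dict(DECODE[opcode])   # the extra decoding step
    signals["reg"] = operand
    return signals

print(decode_vertical(2 | (7 << 4)))   # a LOAD involving register 7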
Unit 6 – Memory Systems
Overview of memory system technology and
architecture
 After the completion of this unit the student should
be able to describe qualitatively and mathematically
the following:
 6.1 Differentiate between different types of memory
found in a microcomputer, mainly:
 DRAM
 SRAM
 Virtual Memory
 Secondary Storage (Hard Disk Drive)
 6.2 Calculate: Average Memory Access Time (AMAT),
Miss Rate, and the bits required to map DRAM to cache
The Memory Hierarchy
References: Chpts 4 & 5 – Stallings; Chpt 7 – H&P
shekel.jct.ac.il/~citron/ca/ca-lec10/tsld017.htm

"Ideally one would desire an indefinitely large memory
capacity such that any particular word would be
immediately available … We are forced to recognize the
possibility of constructing a hierarchy of memories, each
of which has a greater capacity than the preceding but
which is less quickly accessible."
A.W. Burks, H.H. Goldstine, and J. von Neumann, 1946.
The Memory Hierarchy
 Used to bridge the gap that arises because access
to main memory is too slow for modern
microprocessors. On a 100 MHz (pretty slow)
microprocessor an addition takes 10 ns, while a
memory reference takes 60-110 ns (using DRAM)
or 25 ns (using EDO DRAM). On a 500 MHz
microprocessor an add takes 2 ns, but a memory
reference still takes at least 25 ns.
 main component is the cache
The Principle of Locality
 states that a program accesses a small part of
its address space at any instant of time
 2 types of locality:
 Temporal locality (locality in time) - If an item is
accessed, it will be accessed again soon.
 Spatial locality (locality in space) - If an item was
accessed, items close by will be accessed.
The Principle of Locality
 exploited by implementing a memory
hierarchy composed of multiple levels of
memory with different sizes and speeds
 SRAMs, which are faster (5-20 ns) and more
expensive ($100-250 per MByte), are used
closer to the CPU, while DRAMs (25-100 ns,
$3-8 per MByte) are used as main memory
[Figure: the hierarchy runs from the fastest,
smallest, highest cost-per-bit memory next to
the CPU down to the slowest, biggest, lowest
cost-per-bit memory]
Hits and Misses
 In the memory hierarchy an upper level
(closer to the CPU) holds a subset of any
lower level (farther away from the CPU).
 Although the memory hierarchy can have
multiple levels, data is transferred between
two adjacent levels at a time. The minimum
unit of data that can be present or not
present is called a block or line.
Hits and Misses
 If data requested by the CPU is in a block of
the upper level, this is called a hit; if it isn't,
and the lower level has to be accessed, it is
called a miss
 The hit rate or hit ratio is the fraction of
accesses found in the upper level. The miss
rate (1 - hit rate) is the fraction of accesses
not found in the upper level.
Hits & Misses
 Hits (data found in the cache) result in data
transfer at maximum speed.
 A cache miss (data not found in the cache)
results in a miss penalty - the processor
has to load the data from memory and copy
it into the cache
Hit Time and Miss Penalty
 The hit time is the time needed to access
the upper level, decide if the data is there
and get it to the CPU.
 The miss penalty is the time it takes to
replace a block in the upper level with the
block we need from the lower level and
get the data to the CPU.
Hit Time and Miss Penalty
 The hit time for upper levels is much
faster than the hit time for lower levels.
Thus a high hit ratio at the upper levels
gives an access time equal to that of the
highest (and fastest) level combined with
the size of the lowest (and slowest) level.
[Figure: levels in the memory hierarchy.
The CPU sits above Level 1, Level 2, …,
Level n; access time increases with
distance from the CPU, as does the size
of the memory at each level]
The Cache
 Let's look at a simple cache in which the
processor accesses a word at a time and
the block size is 1 word.
 The processor requests the word Xn. It
isn't in the cache, which results in a cache
miss. The word Xn is brought from
memory into the cache.
[Figure: the cache holding X1, X2, X3, X4,
Xn-1, Xn-2, … a. before the reference to Xn
and b. after the reference to Xn, with Xn
now present]
Cache Mapping
Please refer to notes for COA – sem 1 – AY 2007/8
 Direct Mapped Cache: The simplest way to allocate the cache to the
system memory is to determine how many cache lines there are and just
chop the system memory into the same number of chunks. Then each
chunk gets the use of one cache line. This is called direct mapping. So if
we have 64 MB of main memory addresses, each cache line would be
shared by 4,096 memory addresses (64 M divided by 16 K).
 Fully Associative Cache: Instead of hard-allocating cache lines to
particular memory locations, it is possible to design the cache so that any
line can store the contents of any memory location. This is called fully
associative mapping.
 N-Way Set Associative Cache: "N" here is a number, typically 2, 4, 8 etc.
This is a compromise between the direct mapped and fully associative
designs. In this case the cache is broken into sets where each set contains
"N" cache lines, let's say 4. Then, each memory address is assigned a set,
and can be cached in any one of those 4 locations within the set that it is
assigned to. In other words, within each set the cache is associative, and
thus the name.
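The address arithmetic behind these three schemes can be sketched as follows; the 16-byte line size and the 4-way associativity are assumptions chosen for illustration, since the text above fixes only the memory size and the line count.

# Minimal sketch of the three mapping policies for the example above:
# 64 MB of memory and 16 K cache lines (line size and associativity
# are assumed values).

LINE_SIZE = 16          # bytes per cache line (assumed)
NUM_LINES = 16 * 1024   # 16 K lines
WAYS      = 4           # associativity for the set-associative case

def direct_mapped_line(addr):
    """Direct mapped: each memory block may live in exactly one line."""
    return (addr // LINE_SIZE) % NUM_LINES

def set_associative_set(addr):
    """N-way: each block may live in any of WAYS lines of one set."""
    num_sets = NUM_LINES // WAYS
    return (addr // LINE_SIZE) % num_sets

# Fully associative: there is no index function at all - any block may
# occupy any line, so a lookup compares the tag against every line.

addr = 0x123456
print(direct_mapped_line(addr))    # the single line this address may use
print(set_associative_set(addr))   # the set of WAYS candidate lines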
Tags
 Used to check whether the data in the
cache corresponds to the data requested
from memory, since each cache location
is mapped to several memory locations
 Contain the upper bits of the address not
used to index the cache
 Include a valid bit to determine if the
cache entry holds valid information; if the
bit isn't set there can't be a match
[Figure: a direct mapped cache with 1024
entries (index 0 to 1023). The 32-bit address
(bit positions 31..0) is split into a 2-bit byte
offset (bits 1-0), a 10-bit index (bits 11-2)
and a 20-bit tag (bits 31-12). Each entry
holds a valid bit, a 20-bit tag and 32 bits of
data; Hit is asserted when the indexed entry
is valid and its tag equals the address tag]
Cache Size
 How many total bits are needed for a direct
mapped cache with 64KB of data in 1-word blocks?
64KB = 16K words = 2^14 words = 2^14 blocks. Each
block has 32 bits of data,
32-14-2 bits of tag, and a valid bit.
Thus the total cache size is:
2^14 * (32 + (32-14-2) + 1) = 2^14 * 49 bits ≈ 98KB.
 The size of the cache is 50% more than the size of
the data it holds (for this configuration).
Cache Size
 Assuming a 32-bit address, a direct mapped
cache of size 2^n words with 1-word (4-byte)
blocks will need a tag field of size 32 - (n+2),
because 2 bits are the byte offset and n bits are
used for the index.
 The total number of bits in a direct mapped
cache is
2^n * (32 + (32-n-2) + 1) = 2^n * (63-n).
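A quick sketch that evaluates this formula and reproduces the 64KB example above:

# Minimal sketch: total bits in a direct mapped cache of 2**n one-word
# blocks with a 32-bit address, i.e. 2**n * (63 - n) bits.

def direct_mapped_cache_bits(n):
    data  = 32               # one 32-bit word of data per block
    tag   = 32 - (n + 2)     # 2 bits of byte offset, n bits of index
    valid = 1
    return 2**n * (data + tag + valid)   # == 2**n * (63 - n)

n = 14                       # 2**14 words = 64 KB of data
total = direct_mapped_cache_bits(n)
print(total, "bits =", total / 8 / 1024, "KB")   # 802816 bits = 98.0 KB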
Cache Basics
 Data is moved (perhaps in 4-word
blocks) between DRAM and cache
 The cache has slots instead of
memory addresses
 This is a direct-mapped cache
 How does the cache know what
memory addresses it currently
mirrors?
[Figure: an 11-word DRAM (addresses
0-10) next to a 4-slot cache (slots 0-3);
each address maps to slot = address AND 3]
Cache Basics
 In this example the DRAM can hold 11 words but the cache can hold at
most 4 words at a time.
 If the processor wants the data at address x then it must access that data
through the cache.
 If the data is already in the cache then it can be given directly to the
processor.
 If the data is not in the cache then it must be copied in from the DRAM.
 Question: When you copy something from DRAM to cache, does it matter
where you put it?
 Answer: In this example, yes it does matter. The rule for this ‘direct mapped
cache’ is that data from memory address x is put in slot x AND 3.
 Now we have two problems:
 How do we know whether a cache slot is empty or not?
 If a cache slot is not empty, how do we know which of the several memory
locations is currently mapped to the slot?
 e.g. if slot 0 is not empty, does it contain a copy of the data from address 0,
address 4, or address 8?
Cache Basics
 How do we know when a slot is
occupied?
 the valid bit
 How do we know which memory
address a slot reflects?
 the tag field
 For the sake of simplicity, we
consider word addresses
[Figure: the DRAM and the 4-slot cache
again, each slot extended with a valid bit
and a tag. For this 4-slot cache:
address = tag*4 + slot
slot = address AND 3
tag = address >> 2 = floor(address/4)]
Cache Basics
 The valid bit has the value '0' to say that the slot is empty, or '1' to say that it is full
 caches are empty when the machine powers up
 For the sake of simplicity the addresses in the figures are word addresses, and a block
contains only one word
 Caches are also usually emptied when the computer switches from running one
program to running another (this is to prevent programs from accessing each
other's virtual memory spaces)
 The tag field contains rubbish when the valid bit is 0
 Otherwise, the tag field tells us which of the possible memory addresses the slot
currently reflects
 In general, if memory has an m-bit byte address and the cache has a capacity of
2^n blocks (of 2^k words each, k > 0), the tag contains the m - n - k - 2 most
significant address bits: tag = address >> (n + k + 2), where 2 of the low bits
select the byte within a word
 The slot is the address of the cache block:
slot = (address >> (k + 2)) mod 2^n = (address >> (k + 2)) AND (2^n - 1)
 Cache capacity is 2^(n+k+2) bytes
 address = (tag << (n + k + 2)) + (slot << (k + 2)) + byte offset within the block
Example: 8-slot Direct Mapped Cache
 Consider word address (11)10; this is (1011)2
 We consider word addresses again
[Figure: a 12-word DRAM (addresses 0-11)
next to an 8-word cache (slots 0-7, each with
a valid bit and a tag). DRAM contents by
address: 11: 80, 10: A0, 9: 00, 8: FF, 7: 10,
6: 44, 5: F0, 4: C9, 3: 20, 2: 20, 1: 0A, 0: 00]
Example: (cont.)
 Read address (10)10 = (1010)2
 slot # (010)2
 miss, so fetch
 Read address (8)10 = (1000)2
 slot # (000)2
 miss, so fetch
 Read address (10)10 = (1010)2 again
 slot # (010)2
 tag 1
 hit
For this 8-word cache:
DRAM address = tag*8 + slot
slot = address mod 8
tag = floor(address/8) = address >> 3
Example: (cont.)
 Read address (8)10 = (1000)2
 slot index 000
 hit (slot 000 is valid and holds tag 1)
 Read address (2)10 = (0010)2
 slot index 010
 tag mismatch (slot 010 holds tag 1, not 0) - miss

Cache state:
Slot   V   Tag   Data
000    Y   1     FF
001    N
010    Y   1     A0
011    N
100    N
101    N
110    N
111    N
The 3 Cs
 Cache misses can be divided into 3
categories:
 Compulsory misses: Misses that are caused
because the block was never in the cache.
These are also called cold-start misses.
Increasing the block size reduces compulsory
misses. Too large a block size can cause
capacity misses and increases the miss
penalty.
The 3 Cs
 Capacity misses: Misses that are caused
because the cache cannot contain all the
blocks needed by the program, so blocks
are discarded and later retrieved. They
occur even in a fully associative cache.
Increasing the cache size reduces
capacity misses.
The 3 Cs
 Conflict misses: Occur in direct mapped and
set associative caches. Multiple blocks
compete for the same set. Increasing
associativity reduces conflict misses. But a
too high associativity increases access time.
Memory System Support for Caches
 Cache misses read data from main memory,
which is constructed from DRAMs. Although it
is hard to reduce the latency to the first word,
it is possible to increase the bandwidth from
memory to cache.
 Defining the access time for main memory:
 1 clock cycle to send the address
 15 clock cycles to read a word from DRAM
 1 clock cycle to send a word of data
Memory System Support for Caches
 For a cache block of 4 words and a 1-word-wide
memory, the miss penalty is:
1 + 4*15 + 4*1 = 65 cycles.
 If we widen the memory and the busses between
memory and cache, we can reduce the miss penalty.
For a 2-word-wide memory the miss penalty is:
1 + 2*15 + 2*1 = 33 cycles.
For a 4-word-wide memory the miss penalty is:
1 + 1*15 + 1*1 = 17 cycles.
But we pay the cost of a wide memory and wide
busses.
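The same arithmetic as a small sketch: one cycle to send the address, then one DRAM access plus one bus transfer per memory-width chunk of the block.

# Minimal sketch of the miss-penalty calculation above.

def miss_penalty(block_words, memory_width_words,
                 addr_cycles=1, dram_cycles=15, bus_cycles=1):
    transfers = block_words // memory_width_words
    return addr_cycles + transfers * (dram_cycles + bus_cycles)

for width in (1, 2, 4):
    print(width, "word(s) wide:", miss_penalty(4, width), "cycles")
# 1 word(s) wide: 65 cycles
# 2 word(s) wide: 33 cycles
# 4 word(s) wide: 17 cycles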
Computing the Average Memory Access Time
 The average memory access time is:
AMAT = (hit rate * hit time) + (miss rate * miss penalty)
 Assuming the hit time is 2 cycles and the miss penalty is
17 cycles, what is the average access time given a 98%
hit rate?
0.98*2 + 0.02*17 = 2.3 cycles
 Even for a high hit ratio the average access time is
relatively high compared to R-type instructions. The
solution is to introduce another level of cache between
main memory and the CPU.
Computing the Average Memory Access Time
 All modern microprocessors have an on-chip
cache, called the L1 cache, and another larger
off-chip cache, called the L2 cache.
 Assume an L1 hit time of 1 cycle, an L2 hit time
of 5 cycles and an L2 miss penalty of 17 cycles.
Given an L1 hit rate of 98% and an L2 hit rate of
98%, the average memory access time is:
AMAT = hit_time x hit_rate + miss_rate x miss_penalty
= 0.98*1 + 0.02*(0.98*5 + 0.02*17)
≈ 1.08 cycles
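The two-level calculation as a tiny sketch, following the document's convention of weighting the hit time by the hit rate; the miss term of L1 is simply the AMAT of L2.

# Minimal sketch of the two-level AMAT computation above.

def amat(hit_time, hit_rate, miss_penalty):
    return hit_rate * hit_time + (1 - hit_rate) * miss_penalty

l2 = amat(hit_time=5, hit_rate=0.98, miss_penalty=17)   # 5.24 cycles
l1 = amat(hit_time=1, hit_rate=0.98, miss_penalty=l2)   # ~1.08 cycles
print(round(l2, 2), round(l1, 2))                       # 5.24 1.08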
Unit 7 – Parallel Processors
After the completion of this unit the student should be able
to describe qualitatively and mathematically the following:
 7.1 Differentiate between different types of parallel
processors
 Parallel processors
 SIMD computers--Single Instruction Stream, Multiple
Data Streams
 MIMD Computers--Multiple Instruction Streams,
Multiple Data Streams
 Memory Semantics
 Strict consistency vs sequential consistency vs
processor consistency
Unit 7 – Parallel Processors
 Processor topologies
 Number of links (degree)
 Diameter
 Fault tolerance
 Bisection bandwidth
 dimensionality
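As one concrete illustration of these metrics (an added example, not from the course notes), the closed-form values for an n-dimensional hypercube:

# Minimal sketch: the standard topology metrics of an n-dimensional
# hypercube with 2**n nodes.

def hypercube_metrics(n):
    nodes = 2**n
    return {
        "nodes": nodes,
        "degree": n,                    # links per node
        "diameter": n,                  # longest shortest path, in hops
        "bisection_width": nodes // 2,  # links cut when halving the net
        "total_links": n * nodes // 2,  # n links/node, each shared by 2
    }

print(hypercube_metrics(4))
# {'nodes': 16, 'degree': 4, 'diameter': 4, 'bisection_width': 8,
#  'total_links': 32}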
Parallel Processing Architecture
• Single instruction stream, single data stream (SISD)
• single processor
• single instruction stream
• data stored in single memory
• Single instruction stream, multiple data stream (SIMD)
• distributes processing over a large amount of
hardware
• operates concurrently on many different data
elements
• performs the same computation on all data
elements
Parallel Processing Architecture
• Multiple instruction stream, single data stream (MISD)
- multiple instruction streams
- a single data stream, stored in a single memory
- multiple processors
- never implemented
• Multiple instruction stream, multiple data stream (MIMD)
- distributes processing over a number of independent
processors
- shares resources, including main memory, among the
component processors
- each processor operates independently and
concurrently
- each processor runs its own program
Processor Topology
 Interconnection networks
 play a central role in determining the overall
performance of a multicomputer system. If the
network cannot provide adequate performance
for a particular application, nodes will frequently
be forced to wait for data to arrive.
 are a major factor differentiating modern
multiprocessor architectures.
Processor Topology
 Interconnection networks
 can be categorized according to a number of criteria,
such as topology, routing strategy and switching
technique.
 are normally classified as static or dynamic.
 are built up of switching elements; topology is the
pattern in which the individual switches are connected
to other elements, like processors, memories and
other switches.
 Main topologies are classified as direct and indirect
Processor Topology
 Direct topologies connect each switch directly to a
node, while in indirect topologies at least some of the
switches connect to other switches.
 Using switching technique as a criterion, one can
distinguish several classes:
 circuit switching, in which the entire path through the
network is reserved before a message is transferred,
 packet switching with virtual cut-through, in which a packet
is forwarded immediately after it determines an appropriate
switch output,
 wormhole routing, which relaxes the requirement that
blocked packets be completely buffered in a single switch,
as is typical for packet switching.
The End
CA – semester 2 – 2008/9