Computer Architecture

Chapter 1: Measuring & Understanding Performance

2nd Year, Computer Science

[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2009, MK]



Review: Some Basic Definitions

 Kilobyte – 2^10 or 1,024 bytes

 Megabyte – 2^20 or 1,048,576 bytes
 sometimes “rounded” to 10^6 or 1,000,000 bytes

 Gigabyte – 2^30 or 1,073,741,824 bytes
 sometimes rounded to 10^9 or 1,000,000,000 bytes

 Terabyte – 2^40 or 1,099,511,627,776 bytes
 sometimes rounded to 10^12 or 1,000,000,000,000 bytes

 Petabyte – 2^50 or 1,024 terabytes
 sometimes rounded to 10^15 or 1,000,000,000,000,000 bytes

 Exabyte – 2^60 or 1,024 petabytes
 sometimes rounded to 10^18 or 1,000,000,000,000,000,000 bytes
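A minimal C sketch (illustrative only) of how the gap between each binary prefix and its “rounded” decimal value grows with the exponent:

#include <stdio.h>
#include <math.h>

int main(void) {
    const char *prefix[] = {"Kilo", "Mega", "Giga", "Tera", "Peta", "Exa"};
    for (int i = 1; i <= 6; i++) {
        double binary  = pow(2.0, 10.0 * i);  /* 2^10, 2^20, ..., 2^60   */
        double decimal = pow(10.0, 3.0 * i);  /* 10^3, 10^6, ..., 10^18  */
        printf("%sbyte: %.1f%% larger than the decimal value\n",
               prefix[i - 1], 100.0 * (binary - decimal) / decimal);
    }
    return 0;
}

It prints 2.4% for Kilo, growing to about 15.3% for Exa.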



Classes of Computers
1. Desktop computers (e.g., PC, laptop)
- good performance
- single user
- low cost
- graphics display, keyboard, and mouse

2. Servers (e.g., workstations, web servers, file storage)
- larger programs
- simultaneous users
- networked
- security



Classes of Computers
3. Supercomputers (weather, oil exploration, protein structure)
- hundreds to thousands of processors
- terabytes of memory and petabytes of storage
- high-end scientific and engineering applications

4. Embedded computers (processors) (e.g., cell phones, cars, TVs,
digital cameras)
- A computer inside another device, used to run one
predetermined application



The Computer Revolution

 Progress in computer technology


 Moore’s Law

 Makes novel applications feasible


 Computers in automobiles
 Cell phones
 Human genome project
 World Wide Web
 Search Engines

 Computers are pervasive



The Processor Market
embedded growth >> desktop growth

 Where else are embedded processors found?


Underneath the Program
[Figure: layers of a computer system — applications software (written in a high-level language), systems software, hardware]

 System software
 Operating system – supervising program that interfaces the
user’s program with the hardware (e.g., Linux, MacOS,
Windows)
- Handles basic input and output operations
- Allocates storage and memory
- Schedules tasks & Provides for protected sharing among multiple
applications
 Compiler – translates programs written in a high-level language
(e.g., C, Java) into instructions that the hardware can execute
Below the Program, Cont’d
 High-level language program (in C)

swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

    ↓ one-to-many (C compiler)

 Assembly language program (for MIPS)

swap: sll  $2, $5, 2
      add  $2, $4, $2
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31

    ↓ one-to-one (assembler)

 Machine (object, binary) code (for MIPS)

000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
. . .
Advantages of Higher-Level Languages?
 Higher-level languages

 Allow the programmer to think in a more natural language


 Improve programmer productivity
 Improve program maintainability
 Allow programs to be independent of the computer on which
they are developed
 Emergence of optimizing compilers that produce very efficient
assembly code optimized for the target machine

 As a result, very little programming is done today at
the assembly level, except to speed up computation
in critical areas of the code



Under the Covers
 Same components for all kinds of computer
 Five classic components of a computer – input, output, memory,
datapath, and control
 datapath + control = processor (CPU)
 Input/output includes
 User-interface devices
- Display, keyboard, mouse
 Storage devices
- Hard disk, CD/DVD, flash
 Network adapters
- For communicating with other computers



Opening the Box



A Safe Place for Data

 Volatile main memory


 Loses instructions and data when power is off

 Non-volatile secondary memory


 Magnetic disk
 Flash memory
 Optical disk (CDROM, DVD)



Networks

 Communication and resource sharing


 Local area network (LAN): Ethernet
 Within a building

 Wide area network (WAN): the Internet


 Wireless network: WiFi, Bluetooth



Instruction Set Architecture (ISA)
 ISA, or simply architecture – the abstract interface
between the hardware and the lowest level software that
encompasses all the information necessary to write a
machine language program, including instructions,
registers, memory access, I/O, …
 Enables implementations of varying cost and performance to run
identical software
 Directly related to the assembly language



Moore’s Law
 In 1965, Intel’s Gordon Moore predicted that the number of
transistors that can be integrated on a single chip would
double about every two years

[Figure: transistor counts per chip over time. Courtesy, Intel ®]
Another Example of Moore’s Law Impact

DRAM capacity growth over 3 decades

[Figure: DRAM chip capacity grew from 16K bits to 1G bits: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M, 1G]



But What Happened to Clock Rates and Why?

 Clock rates hit a “power wall”

[Figure: clock rate and power (Watts) over successive processor generations]



POWER CONSUMPTION



Power Trends

 In CMOS IC technology

Power = Capacitive load × Voltage 2 × Frequency

Over this period, supply voltage dropped from 5 V to 1 V; clock frequency grew about ×1000, yet power grew only about ×30



Reducing Power

 Suppose a new CPU has


 85% of capacitive load of old CPU
 20% voltage and 18% frequency reduction

Pnew     Cold × 0.85 × (Vold × 0.80)^2 × Fold × 0.82
-----  = -------------------------------------------  =  0.85 × 0.80^2 × 0.82  ≈  0.45
Pold                Cold × Vold^2 × Fold
 The power wall
 We can’t reduce voltage further
 We can’t remove more heat
 How else can we improve performance?
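A minimal C sketch of the slide’s calculation; since P = C × V² × F, the power ratio is simply the product of the three reduction factors:

#include <stdio.h>

int main(void) {
    double c = 0.85;  /* 85% of the old capacitive load */
    double v = 0.80;  /* 20% voltage reduction          */
    double f = 0.82;  /* 18% frequency reduction        */
    printf("Pnew/Pold = %.2f\n", c * v * v * f);  /* prints 0.45 */
    return 0;
}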



Uniprocessor Performance (§1.6 The Sea Change: The Switch to Multiprocessors)

 Constrained by power, instruction-level parallelism, and memory latency
Multiprocessors
 Multicore microprocessors
 More than one processor per chip

 Requires explicitly parallel programming


 Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer
 Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization

A Sea Change is at Hand
 The power challenge has forced a change in the design
of microprocessors
 Since 2002 the rate of improvement in the response time of
programs on desktop computers has slowed from a factor of 1.5
per year to less than a factor of 1.2 per year

 As of 2006 all desktop and server companies are


shipping microprocessors with multiple processors –
cores – per chip
Product          AMD Barcelona   Intel Nehalem   IBM Power 6   Sun Niagara 2
Cores per chip   4               4               2             8
Clock rate       2.5 GHz         ~2.5 GHz        4.7 GHz       1.4 GHz
Power            120 W           ~100 W?         ~100 W        94 W

 Plan of record is to double the number of cores per chip
per generation (about every two years)
Situation after 2007
 Moore’s law keeps holding, but more slowly
 The number of manufacturers keeps decreasing


Situation after 2007

 Technology scaling may end around the middle of the
next decade
 5 nm may be the final node
- distance between silicon atoms: ~0.5 nm
- size of a silicon atom: ~0.24 nm


Situation after 2007

[Figures: process technology roadmap. Source: Intel Corporation]
Performance



What Are Performance Metrics?
 Purchasing perspective
 given a collection of machines, which has the
- best performance ?
- least cost ?
- best cost/performance?
 Design perspective
 faced with design options, which has the
- best performance improvement ?
- least cost ?
- best cost/performance?
 Both require
 basis for comparison
 metric for evaluation
 Our goal is to understand which factors affect performance
and their relative weights



Comparing Throughput and Response Time
 Response time (execution time) – the time between the
start and the completion of a task
 Important to individual users

 Throughput (bandwidth) – the total amount of work done
in a given unit of time
 Important to data center managers



Comparing Throughput and Response Time

 Comparison:
 We will need different performance metrics, as well as
different sets of benchmark applications, for embedded and
desktop computers (more focused on response time) versus
servers (more focused on throughput)
 How are response time and throughput affected by
- Replacing the processor with a faster version?
- Adding more processors?
 We’ll focus on response time for now…



What is Speed?
 To maximize performance, need to minimize execution
time

performanceX = 1 / execution_timeX

If X is n times faster than Y, then

performanceX execution_timeY
-------------------- = --------------------- = n
performanceY execution_timeX

 Decreasing execution time frequently improves throughput



Relative Performance Example
 If computer A runs a program in 10 seconds and
computer B runs the same program in 15 seconds, how
much faster is A than B?
We know that A is n times faster than B if

performanceA execution_timeB
-------------------- = --------------------- = n
performanceB execution_timeA

The performance ratio is

execution_timeB     15
---------------  = ---- = 1.5
execution_timeA     10

So A is 1.5 times faster than B (or 50% faster)



Measuring Execution Time
 Elapsed time
 Total response time = wall clock time = elapsed time
- It includes everything needed to complete a task
- Processing, I/O operations, OS overhead, idle time
 Determines system performance



Measuring Execution Time
 CPU time
 Time spent processing a given job
- Subtract I/O time and other jobs’ shares
 Comprises user CPU time and system CPU time
 Different programs are affected differently by CPU and system
performance
 Example: ‘time’ in Unix prints

    90.7u 12.9s 2:39 65%

- 90.7u = user CPU time; 12.9s = system CPU time
- 2:39 = elapsed time, which also includes I/O and other processes
- 65% = CPU utilization = (90.7 + 12.9) / (2×60 + 39) = 0.65

 Our goal: user CPU time + system CPU time
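A small C sketch decoding the numbers in the ‘time’ output above (using the figures shown on the slide):

#include <stdio.h>

int main(void) {
    double user_cpu = 90.7;          /* 90.7u */
    double sys_cpu  = 12.9;          /* 12.9s */
    double elapsed  = 2 * 60 + 39;   /* 2:39 = 159 s of wall clock time */
    printf("CPU utilization = %.0f%%\n",
           100.0 * (user_cpu + sys_cpu) / elapsed);  /* prints 65% */
    return 0;
}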



Measuring Execution Time
 Elapsed time versus CPU time

[Figure: execution timeline — the OS interleaves P1.exe, P2.exe, and I/O.
Elapsed time for P1 spans the whole interval; user CPU time covers only the
intervals where P1’s own instructions (e.g., the swap code) run; system CPU
time covers the OS work done on P1’s behalf]

CPU time = User CPU Time + System CPU time



PERFORMANCE EQUATION



Review: Machine Clock Rate
 Clock rate (clock cycles per second, in MHz or GHz) is the
inverse of clock cycle time (clock period)

    CC = 1 / CR

[Figure: clock signal showing one clock period]

     10 nsec clock cycle  =>  100 MHz clock rate
      5 nsec clock cycle  =>  200 MHz clock rate
      2 nsec clock cycle  =>  500 MHz clock rate
      1 nsec (10^-9 s) clock cycle  =>  1 GHz (10^9 Hz) clock rate
    500 psec clock cycle  =>  2 GHz clock rate
    250 psec clock cycle  =>  4 GHz clock rate
    200 psec clock cycle  =>  5 GHz clock rate
Clock Cycles per Instruction
 Not all instructions take the same amount of time to
execute
 One way to think about execution time is that it equals the
number of instructions executed multiplied by the average time
per instruction

# CPU clock cycles     # Instructions     average clock cycles
for a program       =  for a program   ×  per instruction

 Clock cycles per instruction (CPI) – the average number


of clock cycles each instruction takes to execute
 A way to compare two different implementations of the same ISA

CPI for each instruction class:

    Class:   A   B   C
    CPI:     1   2   3
Using the Performance Equation
 Computers A and B implement the same ISA. Computer
A has a clock cycle time of 250 ps and an effective CPI of
2.0 for some program and computer B has a clock cycle
time of 500 ps and an effective CPI of 1.2 for the same
program. Which computer is faster and by how much?
Each computer executes the same number of instructions, I,
so
CPU timeA = I x 2.0 x 250 ps = 500 x I ps
CPU timeB = I x 1.2 x 500 ps = 600 x I ps

Clearly, A is faster … by the ratio of execution times

performanceA     execution_timeB     600 × I ps
------------  =  ---------------  =  ----------  =  1.2
performanceB     execution_timeA     500 × I ps
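The same comparison as a short C sketch; the instruction count I cancels in the ratio, so any value works:

#include <stdio.h>

int main(void) {
    double I = 1e9;                    /* same program on both machines */
    double timeA = I * 2.0 * 250e-12;  /* CPI = 2.0, cycle = 250 ps */
    double timeB = I * 1.2 * 500e-12;  /* CPI = 1.2, cycle = 500 ps */
    printf("A is %.1f times faster\n", timeB / timeA);  /* prints 1.2 */
    return 0;
}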



Effective (Average) CPI
 Computing the overall effective CPI is done by looking at
the different types of instructions and their individual
cycle counts and averaging
                              n
    Overall effective CPI  =  Σ  (CPIi × ICi)
                             i=1
 Where ICi is the count (or percentage) of instructions of
class i executed
 CPIi is the (average) number of clock cycles per instruction for
that instruction class
 n is the number of instruction classes

 The overall effective CPI varies by instruction mix – a


measure of the dynamic frequency of instructions across
one or many programs



THE Performance Equation
 Our basic performance equation is then
CPU time = Instruction_count x CPI x clock_cycle
or
                 Instruction_count × CPI
    CPU time  =  ------------------------
                        clock_rate

 These equations separate the three key factors that


affect performance
 Can measure the CPU execution time by running the program
 The clock rate is usually given
 Can measure overall instruction count by using
profilers/simulators without knowing all of the implementation details
 CPI varies by instruction type and ISA implementation for which
we must know the implementation details
Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                       Instruction_count   CPI   clock_cycle
Algorithm                      X            X
Programming language           X            X
Compiler                       X            X
ISA                            X            X         X
Core organization                           X         X
Technology                                            X


A Simple Example
Op       Freq   CPIi   Freq×CPIi   (load = 2)   (branch −1)   (2× ALU)
ALU      50%     1        .5           .5           .5           .25
Load     20%     5       1.0           .4          1.0          1.0
Store    10%     3        .3           .3           .3           .3
Branch   20%     2        .4           .4           .2           .4
Average CPI   Σ =        2.2          1.6          2.0          1.95

(the last three columns correspond to the three design changes below)

 How much faster would the machine be if a better data cache
reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC, so 2.2/1.6 = 1.375: 37.5% faster
 How does this compare with using branch prediction to save a
cycle off the branch time?
CPU time new = 2.0 x IC x CC, so 2.2/2.0 = 1.1: 10% faster
 What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC, so 2.2/1.95 = 1.128: 12.8% faster
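A C sketch (mirroring the table above) that recomputes the average CPIs and the resulting speedups:

#include <stdio.h>

int main(void) {
    /* ALU, Load, Store, Branch */
    double freq[] = {0.50, 0.20, 0.10, 0.20};
    double cpi[]  = {1.0, 5.0, 3.0, 2.0};
    double base = 0.0;
    for (int i = 0; i < 4; i++)
        base += freq[i] * cpi[i];                   /* 2.2 */

    double fast_load   = base - freq[1] * (5 - 2);  /* loads in 2 cycles: 1.6    */
    double fast_branch = base - freq[3] * (2 - 1);  /* branches save 1 cycle: 2.0 */
    double dual_alu    = base - freq[0] * 0.5;      /* ALU ops paired: 1.95      */

    printf("speedups: %.3f  %.2f  %.3f\n",          /* 1.375  1.10  1.128 */
           base / fast_load, base / fast_branch, base / dual_alu);
    return 0;
}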
Warning: MIPS as a Performance Metric
 MIPS: Millions of Instructions Per Second
 Doesn’t account for
- Differences in ISAs between computers
- Differences in complexity between instructions
MIPS = Instruction count / (Execution time × 10^6)
     = Instruction count / ((Instruction count × CPI / Clock rate) × 10^6)
     = Clock rate / (CPI × 10^6)
 MFLOPS: Millions of Floating Point Operations Per
Second
 CPI varies between programs on a given CPU
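A hypothetical C sketch (the machines and numbers are made up) of why MIPS can mislead: a machine can post the higher MIPS yet take longer on the same task:

#include <stdio.h>

int main(void) {
    double clock = 4e9;            /* both machines run at 4 GHz        */
    double icA = 2e9, cpiA = 1.0;  /* simple ISA: more instructions     */
    double icB = 1e9, cpiB = 1.5;  /* richer ISA: fewer instructions    */
    printf("MIPS:  A = %.0f  B = %.0f\n",
           clock / (cpiA * 1e6), clock / (cpiB * 1e6));  /* 4000 vs 2667 */
    printf("time:  A = %.3f s  B = %.3f s\n",
           icA * cpiA / clock, icB * cpiB / clock);      /* 0.500 vs 0.375 */
    return 0;
}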
BENCHMARKS



Workloads and Benchmarks
 Workloads: the programs normally run on the machine
 Evaluate two computers:
- Compare the execution time of our workload
 How can we test the computers without physically running our
workload?
- Using Benchmarks: programs chosen to measure performance



Workloads and Benchmarks
 Benchmarks – a set of programs that form a “workload”
specifically chosen to measure performance
 SPEC (Standard Performance Evaluation Corporation)
creates standard sets of benchmarks, starting with
SPEC89. The latest is SPEC CPU2017, which replaces
SPEC CPU2006.
www.spec.org

 There are also benchmark collections for power

workloads (SPECpower_ssj2008), for mail workloads
(SPECmail2008), for multimedia workloads
(MediaBench), for cloud workloads, …



Workloads and Benchmarks
 SPEC CPU2017
 CPU2017 has 43 benchmarks, organized into four suites:
 The SPECspeed®2017 Integer and SPECspeed®2017 Floating
Point suites are used to compare the time for a computer to
complete single tasks.
 The SPECrate®2017 Integer and SPECrate®2017 Floating
Point suites measure the throughput or work per unit of time.



Workloads and Benchmarks: SPEC CPU2017

[Tables: the SPEC CPU2017 benchmark programs]


SPEC CPU Benchmark

 Programs used to measure performance


 Supposedly typical of actual workload
 Standard Performance Evaluation Corp (SPEC)
 Develops benchmarks for CPU, I/O, Web, …
 SPECspeed®2017 (also valid for SPEC2006)
 Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
 Normalize relative to reference machine
 Summarize as geometric mean of performance ratios

( ∏ i=1..n  Execution time ratio_i )^(1/n)
Comparing and Summarizing Performance
 How do we summarize the performance for benchmark
set with a single number?
 First the execution times are normalized giving the “SPEC ratio”
(bigger is faster, i.e., SPEC ratio is the inverse of execution time)
 The SPEC ratios are then “averaged” using the geometric mean
(GM)
GM = ( ∏ i=1..n  SPEC ratio_i )^(1/n)

 Guiding principle in reporting performance measurements


is reproducibility – list everything another experimenter
would need to duplicate the experiment (version of the
operating system, compiler settings, input set used,
specific computer configuration (clock rate, cache sizes
and speed, memory size and speed, etc.))
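A C sketch computing the geometric mean of the SPEC ratios from the next slide (summing logs avoids overflow for long benchmark lists):

#include <stdio.h>
#include <math.h>

int main(void) {
    double ratio[] = {15.3, 11.8, 11.1, 6.8, 14.6, 10.5,
                      14.5, 19.8, 22.3, 9.1, 9.1, 6.0};
    int n = sizeof ratio / sizeof ratio[0];
    double log_sum = 0.0;
    for (int i = 0; i < n; i++)
        log_sum += log(ratio[i]);
    printf("GM = %.1f\n", exp(log_sum / n));  /* prints 11.7 */
    return 0;
}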
SPEC INT2006 on Barcelona (CC = 0.4 × 10^-9 s)

Name         IC×10^9    CPI    ExTime   RefTime (ref. machine)   SPECratio
perl          2,118     0.75     637        9,770                  15.3
bzip2         2,389     0.85     817        9,650                  11.8
gcc           1,050     1.72     724        8,050                  11.1
mcf             336    10.00   1,345        9,120                   6.8
go            1,658     1.09     721       10,490                  14.6
hmmer         2,783     0.80     890        9,330                  10.5
sjeng         2,176     0.96     837       12,100                  14.5
libquantum    1,623     1.61   1,047       20,720                  19.8
h264avc       3,102     0.80     993       22,130                  22.3
omnetpp         587     2.94     690        6,250                   9.1
astar         1,082     1.79     773        7,020                   9.1
xalancbmk     1,058     2.70   1,143        6,900                   6.0

Geometric mean = (15.3 · 11.8 · 11.1 · 6.8 ··· 9.1 · 6.0)^(1/12) = 11.7
SPEC INT2006 for Opteron X4 2356

Name        Description                    IC×10^9    CPI   Tc (ns)   Exec time   Ref time   SPECratio
perl        Interpreted string processing    2,118    0.75    0.40        637       9,770       15.3
bzip2       Block-sorting compression        2,389    0.85    0.40        817       9,650       11.8
gcc         GNU C Compiler                   1,050    1.72    0.40        724       8,050       11.1
mcf         Combinatorial optimization         336   10.00    0.40      1,345       9,120        6.8
go          Go game (AI)                     1,658    1.09    0.40        721      10,490       14.6
hmmer       Search gene sequence             2,783    0.80    0.40        890       9,330       10.5
sjeng       Chess game (AI)                  2,176    0.96    0.40        837      12,100       14.5
libquantum  Quantum computer simulation      1,623    1.61    0.40      1,047      20,720       19.8
h264avc     Video compression                3,102    0.80    0.40        993      22,130       22.3
omnetpp     Discrete event simulation          587    2.94    0.40        690       6,250        9.1
astar       Games/path finding               1,082    1.79    0.40        773       7,020        9.1
xalancbmk   XML parsing                      1,058    2.70    0.40      1,143       6,900        6.0
Geometric mean                                                                                  11.7

Note: the largest CPIs (e.g., mcf, omnetpp) reflect high cache miss rates.



Other Performance Metrics
 Power consumption – especially in the embedded market
where battery life is important
 For power-limited applications, the most important metric is
energy efficiency



AMDAHL’S LAW



Amdahl’s Law
 Pitfall: improving one aspect of a computer and expecting a
proportional improvement in overall performance

 Fm = fraction of execution time affected by the improvement
 Sm = factor of improvement (local speed-up)
 S = global factor of improvement (global speed-up)

    S = 1 / ( (1 − Fm) + Fm / Sm )

 The opportunity for improvement is affected by how much time the
modified event consumes
 Corollary 1: make the common case fast

 The performance enhancement possible with a given improvement
is limited by the amount the improved feature is used
 Corollary 2: the bottleneck will limit the improvement



Amdahl’s Law
 Example: multiply accounts for 80 s out of 100 s
 How much improvement in multiply performance is needed
to get 5× overall?

    20 = 80/n + 20   →   80/n = 0   →   Can’t be done!

 Study the limit cases of Amdahl’s law:
 Fm = 0 and Fm = 1
 Sm = 1 and Sm → ∞
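A C sketch of Amdahl’s law for the multiply example (Fm = 0.8); even an infinite local speed-up cannot push the global speed-up past 1/(1 − Fm) = 5:

#include <stdio.h>

/* S = 1 / ((1 - Fm) + Fm / Sm) */
double amdahl(double fm, double sm) {
    return 1.0 / ((1.0 - fm) + fm / sm);
}

int main(void) {
    printf("Sm = 10   -> S = %.2f\n", amdahl(0.8, 10.0));   /* 3.57  */
    printf("Sm = 100  -> S = %.2f\n", amdahl(0.8, 100.0));  /* 4.81  */
    printf("Sm = 1e12 -> S = %.2f\n", amdahl(0.8, 1e12));   /* ~5.00 */
    return 0;
}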

