Professional Documents
Culture Documents
Assembly Languages
Materials
The slides prepared by Kip Irvine for the book, Assembly Language
for Intel-Based Computers, 5th Ed.
The slides prepared by S. Dandamudi for the book, Fundamentals of
Computer Organization and Designs.
The slides prepared by S.
S Dandamudi for the book
book, Introduction to
Assembly Language Programming, 2nd Ed.
Introduction to Computer Systems, CMU
(http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/
15213-f05/www/)
Assembly Language & Computer Organization
Organization, NTU
(http://www.csie.ntu.edu.tw/~cyy/courses/assembly/
05fall/news/)
/
/)
(http://www.csie.ntu.edu.tw/~acpang/course/asm_2004)
Outline
Overview of Microcomputer
CPU, Memory, I/O
Instruction Execution Cycle
y
Central Processing Unit (CPU)
CISC vs. RISC
6 Instruction Set Design Issues
How Hardwares Execute Processors Instructions
Digital Logic Design (Combinational & Sequential Circuits)
Microprogrammed Control
Pipelining
3 Hazards
3 ttechnologies
h l i ffor performance
f
iimprovementt
Memory
Data Alignment
2 Design Issues (Cache, Virtual Memory)
I/O Devices
Overview of Microcomputer
Memory,
y, Input/Output,
p
p , Arithmetic/Logic
g Unit,, Control Unit
Stored-program Model
Both data and programs are stored in the same main memory
Sequential Execution
http://www.virtualtravelog.net/entries/2003-08-TheFirstDraft.pdf
What is Microcomputer
Microcomputer
Microprocessor (P)
Components of Microcomputer
registers
Central Processor Unit
(CPU)
ALU
CU
clock
l k
control bus
address bus
Memory Storage
Unit
I/O
Device
#1
I/O
Device
#2
CPU
Arithmetic and logic unit (ALU) performs arithmetic (add, subtract) and
logical (AND
(AND, OR
OR, NOT) operations
Registers store data and instructions used by the processor
Control unit (CU) coordinates sequence of execution steps
Clock
Datapath consists of registers and ALU(s)
Datapath
ALU output
RISC processor
ALU input
operand
operand
Clock
one cycle
1
0
System Bus
Execution Cycle
PC
memory
op1
op2
program
I-1 I-2 I-3 I-4
fetch
read
registers
g
registers
g
instruction
I-1 register
decode
write
w
Fetch
Decode
Fetch operands
Execute
Store output
p
write
w
flags
ALU
execute
(output)
See asm
asm_ch2_dl.ppt
ch2 dl ppt
CPU
CPU
CISC vs.
vs RISC
6 Instruction Set Design Issues
Number
N
b off Add
Addresses
Flow of Control
O
Operand
dT
Types
Addressing Modes
Instruction Types
Instruction Formats
Processor
Not in memory
Simplify design (e.g., fixed instruction size)
Typically
T
i ll use a microprogram
i
Example: Intel 80x86 family
Processor (cont.)
Processor (cont.)
Number of Addresses
Flow of Control
O
Operand
Types
Addressing Modes
Instruction Types
Instruction Formats
Number of Addresses
Four categories
3-address machines
2 for the source operands and one for the result
2-address machines
One address doubles as source and result
1-address machine
Accumulator machines
Accumulator is used for one source and result
0-address machines
Stack machines
Operands are taken from the stack
Result
R
lt goes onto
t the
th stack
t k
Three-address machines
Example
C statement
A=B+C*DE+F+A
Equivalent code:
mult
T
T,C,D
C D ;T = C*D
C D
add
T,T,B ;T = B+C*D
sub
T
T,T,E
T E ;T = B+C*D-E
add
T,T,F ;T = B+C*D-E+F
add
A
A,T,A
T A ;A = B+C*D-E+F+A
Two-address machines
One address doubles (for source operand & result)
Last example makes a case for it
Address T is used twice
Sample instructions
load
dest,src ; M(dest)=[src]
add
dest
dest,src
src ; M(dest)=[dest]+[src]
M(dest) [dest]+[src]
sub
dest,src ; M(dest)=[dest]-[src]
mult
lt
d
dest,src
t
; M(dest)=[dest]*[src]
M(d t) [d t]*[
]
Example
C statement
A=B+C*DE+F+A
Equivalent code:
load
T
T,C
C ;T = C
mult
T,D ;T = C*D
add
T
T,B
B ;T = B+C*D
sub
T,E ;T = B+C*D-E
add
T
T,F
F ;T = B+C*D-E+F
add
A,T ;A = B+C*D-E+F+A
One-address machines
Use special set of registers called accumulators
Specify one source operand & receive the result
Called accumulator machines
Sample instructions
load
addr ; accum = [addr]
store addr ; M[addr] = acc
accum
m
add
addr ; accum = accum + [addr]
sub
b
addr
dd ; accum = accum - [addr]
[ dd ]
mult
addr ; accum = accum * [addr]
Example
C statement
A=B+C*DE+F+A
Equivalent code:
load
C ;load C into accum
mult
D ;accum = C*D
add
B ;accum = C*D+B
sub
E ;accum = B+C*D-E
add
F ;accum = B+C*D-E+F
add
A ;accum = B+C*D-E+F+A
store A ;store accum contents in A
Zero-address machines
Stack supplies operands and receives the result
Special instructions to load and store use an address
Called stack machines (Ex: HP3000, Burroughs B5500)
Sample instructions
push
addr ; push([addr])
pop
addr ; pop([addr])
add
; push(pop + pop)
sub
b
; push(pop
h(
- pop)
)
mult
; push(pop * pop)
Example
C statement
A=B+C*DE+F+A
Equivalent code:
push
push
p
push
Mult
push
add
E
C
D
B
sub
p
push
add
push
add
pop
F
A
A
Load/Store Architecture
Sample instructions
load
store
t
add
sub
b
Rd,addr
addr,Rs
dd R
Rd,Rs1,Rs2
Rd
Rd,Rs1,Rs2
R 1 R 2
;Rd = [addr]
;(addr)
( dd ) = R
Rs
;Rd = Rs1 + Rs2
;Rd
Rd = R
Rs1
1 - Rs2
R 2
mult
Example
C statement
A = B + C * D E + F + A
Equivalent code:
load
R1,B
load
R2,C
load
R3,D
load
R4,E
load
R5,F
load
R6,A
mult
add
sub
add
add
store
R2,R2,R3
R2,R2,R1
R2,R2,R4
R2,R2,R5
R2,R2,R6
A,R2
Flow of Control
Branches
B
h
Unconditional
Conditional
C di i
l
Delayed branches
Procedure calls
Delayed procedure calls
Branches
Unconditional
Absolute address
PC-relative
Example: MIPS
Absolute address
j target
PC-relative
b target
e g , Pentium
e.g.,
e g , SPARC
e.g.,
Branches
Conditional
Jump
p is taken only
y if the condition is met
Two types
Set-Then-Jump
Test-and-Jump
Test
and Jump
Example:
p MIPS instruction
beq
Rsrc1,Rsrc2,target
Jumps to target
g if Rsrc1 = Rsrc2
Delayed branching
Procedure calls
Pentium
uses ret instruction
MIPS
uses jr instruction
Return address
In a (special) register
MIPS allows any general-purpose register
On the stack
Pentium
Delay slot
Parameter Passing
Faster
Limit the number of parameters
Recursive procedure
Stack-based ((e.g.,
g Pentium))
Stack is used
More general
Operand Types
Characters
Integers
Floating-point
I t ti overload
Instruction
l d
Operand Types
Separate instructions
Addressing Modes
Part of instruction
Memory
Instruction Types
Several types
yp of instructions
Data movement
Pentium:
mov
dest,src
Some do not provide direct data movement
instructions
Indirect
I di t d
data
t movementt
add
Rdest,Rsrc,0 ;Rdest = Rsrc+0
Arithmetic and Logical
Arithmetic
Logical
S: Sign bit (0 = +, 1= -)
Z: Zero bit (0 = nonzero
nonzero, 1 = zero)
O: Overflow bit (0 = no overflow, 1 = overflow)
C: Carry bit (0 = no carry
carry, 1 = carry)
E
Example:
l P
Pentium
ti
cmp
count,25
je
target
;compare count to 25
;subtract 25 from count
;jump if equal
Isolated I/O
Pentium supports isolated I/O
Separate I/O instructions
in
AX,io_port ;read from an I/O port
out
t
i
io_port,AX
t AX ;write
it t
to an I/O port
t
Instruction Formats
Two types
Fixed-length
Used by RISC processors
32-bit RISC processors use 32-bits wide instructions
Examples: SPARC,
SPARC MIPS,
MIPS PowerPC
Variable-length
Used by CISC processors
Memory operands need more bits to specify
Opcode
Microprogrammed Control
Virtual Machines
Abstractions for computers
Machine-independent
High-Level Language
Level 5
Assembly Language
Level 4
Operating System
Level 3
Instruction Set
Architecture
Level 2
Microarchitecture
L
Level
l1
Digital Logic
Level 0
Machine-specific
registers
Central Processor Unit
(CPU)
ALU
CU
clock
l k
control bus
address bus
Memory Storage
Unit
I/O
Device
#1
I/O
Device
#2
Consider 1
1-bus
bus Datapath
Assume all entities are
32-bit wide
1-bit
1
bit ALU
ALU Circuit in 1
1-bus
bus Datapath
Microprogrammed Control
32 32-bit
32 bit general-purpose
general purpose registers
PC register:
PCin, PCout, and PCbout
IR register:
IRout and IRbin
MAR register:
MARin, MARout, and MARbout
MDR register:
MDRin, MDRout, MDRbin and MDRbout
%G9,%G5,%G7
Implemented as
Transfer G5 contents to A register
Assert G7out
Instruct ALU to p
perform addition
Assert Cin
Load/store
Moves data between registers and memory
Register
Arithmetic and logic instructions
Branch
Jump
J
di
direct/indirect
t/i di t
Call
Procedures
P
d
iinvocation
ti mechanisms
h i
More
Software implementation
Example
add
%G9,%G5,%G7
Three steps
S1
G5out: Ain;
S2
G7out: ALU=add: Cin;
S3
Cout: G9in: end;
Simple
microcode
organization
Microprogram essentially
Mi
ti ll iimplements
l
t th
the FSM
discussed before
Control store
Store microprogram
Use PC
Similar to PC
Address generator
Generates appropriate address depending on the
Opcode, and
Opcode
Condition code inputs
Microcontroller
Microinstruction format
N d 90 bits
Needs
bit for
f each
h microinstruction
i i t ti
Horizontal
microinstruction
format
Vertical organization
64 control
t l signals
i
l ffor th
the 32 generall purpose registers
i t
Vertical organization
2-bus
2
bus Datapath
Example
add
dd
%G9
%G9,%G5,%G7
%G5 %G7
Pipelining
Pipelining
Introduction
3 Hazards
R
Resource,
D
Data
t and
dC
Control
t lH
Hazards
d
Pipelining
Pipelining
Overlapped execution
Increases throughput
Pipelining (cont.)
Pipelining (cont.)
Registers
Cache
Memory
Pipeline Stall
Hazards
Resource hazards
Occurs when two or more instructions use the same
resource
Also called structural hazards
D t hazards
Data
h
d
Caused by data dependencies between instructions
Example:
p Result produced
p
by
y I1 is read by
y I2
Control hazards
Default: sequential execution suits pipelining
Altering control flow (e.g., branching) causes problems
Resource Hazards
Example
Data Hazards
Example
I1: add
I2: sub
R2,R3,R4
R5,R6,R2
/* R2 = R3 + R4 */
/* R5 = R6 R2 */
Control Hazards
Performance Improvement
Superscalar
Replicates the pipeline hardware
Superpipelined
Increases the pipeline depth
Very long instruction word (VLIW)
Encodes multiple operations into a long instruction word
Hardware schedules these instructions on multiple
functional units (No run
run-time
time analysis)
add R1, R2, R3
; R1 = R2 + R3
sub R5, R6, R7
; R5 = R6 R7
and R4, R1, R5
; R4 = R1 AND R5
xor R9, R9, R9
; R9 = R9 XOR R9
cycle 1: add, sub, xor
cycle 2: and
Superscalar Processor
Ex: Pentium
Cyccles
2
3
4
5
6
7
8
9
10
11
S1
I-1
S2
I-2
I-3
I-1
I-2
I-3
S3
I-1
I-2
II-3
3
exe
S4
I-1
II-1
1
I-2
I-2
I-3
I-3
S5
S6
I-1
I-1
I-2
I-2
I-3
I-3
Superscalar
A superscalar processor has multiple execution pipelines.
In the following, note that Stage S4 has left and right
pipelines (u and v).
Stages
Cycless
S4
S1
S2
S3
1
2
3
I-1
I-2
I-3
I-1
I-2
I-1
4
5
I-4
I-3
I-4
I-2
I-3
I-1
I-1
I-2
I-4
I-3
I-3
3
I-2
I-4
I-1
I-2
I-1
I-4
I-3
I-4
I-2
I-3
6
7
8
9
10
S5
S6
I-4
Superpipelined Processor
Memory
Memory
Introduction
Building Memory Blocks
Alignment
l
off Data
2 Memory Design Issues
Cache
Virtual Memoryy
Memory (cont.)
Memory (cont.)
Memory (cont.)
Read cycle
1. Place address on the address bus
2. Assert memory read control signal
3. Wait for the memory to retrieve the data
Introduce wait states if using
g a slow memory
y
4. Read the data from the data bus
5. Drop the memory read signal
Memory (cont.)
Write cycle
Place address on the address bus
2.
Place data on the data bus
3.
Assert memory write signal
4.
Wait for the memoryy to retrieve the data
A 4 X 3 memory ddesign
i
using D flip-flops
Address
Data
Control signals
Read
Write
64M X 32
memory using
i
16M X 16 chips
Alignment of Data
Alignment
Slower memories
Problem: Speed gap between processor and memory
Solution: Cache memory
Size limitations
Programmer managed
Virtual memory
Automates overlay management
Some additional benefits
Memory Hierarchy
Cache Memory
Cache
C
h hit
hit: when
h d
data
t tto b
be read
d iis already
l d iin cache
h
memory
Cache miss: when data to be read is not in cache memory
memory.
When? compulsory, capacity and conflict.
Cache design: cache size
size, n-way
n-way, block size,
size replacement
policy
Example
for (i=0; i<M; i++)
for(j=0; j<N; j++)
X[i][j] = X[i][j] + K;
Repetitive use
Temporal locality
Prefetching
g data
Spatial locality
Mapping Function
Direct mapping
Specifies a single cache line for each memory block
Set-associative
Set
associative mapping
Specifies a set of cache lines for each memory block
Associative mapping
No restrictions
Direct Mapping
Set-Associate
Set
Associate Mapping
Virtual Memory
I/O Devices
Input/Output
Memory-mapped I/O
Reading and writing similar to memory read/write
Uses same memory read and write signals
Most p
processors use this I/O mapping
pp g
Isolated I/O
Separate I/O address space
Separate I/O read and write signals are needed
Pentium supports isolated I/O
Input/Output (cont.)
Input/Output (cont.)
Several ways
y of transferring
g data
Programmed I/O
Program
g
uses a busy-wait
y
loop
p
Anticipated transfer
Powerful technique
Handles unanticipated transfers
Interconnection
On-chip buses
Buses
B
to iinterconnect ALU and
d registers
i
Data
D
t and
d address
dd
b
buses tto connectt on-chip
hi caches
h
Internal buses
PCI,
PCI AGP
AGP, PCMCIA
External buses
Serial,
S i l parallel,
ll l USB
USB, IEEE 1394 (Fi
(FireWire)
Wi )
PC
y
Buses
System
ISA (Industry Standard
A hi
Architecture)
)
PCI (Peripheral Component
Interconnect)
AGP (Accelerated Graphics
Port))
Interconnection (cont.)
Bus transactions
Sequence of actions to complete a well-defined
well defined
activity
Involves a master and a slave
Bus operations
A bus
b s ttransaction
ansaction ma
may pe
perform
fo m one o
or mo
more
eb
buss
operations
Bus arbitration
Summary
Overview of Microcomputer
CPU, Memory, I/O
Instruction Execution Cycle
y
Central Processing Unit (CPU)
CISC vs. RISC
6 Instruction Set Design Issues
How Hardwares Execute Processors Instructions
Digital Logic Design (Combinational & Sequential Circuits)
Microprogrammed Control
Pipelining
3 Hazards
3 ttechnologies
h l i ffor performance
f
iimprovementt
Memory
Data Alignment
2 Design Issues (Cache, Virtual Memory)
I/O Devices