Design A Fast ALU For The MIPS ISA

ALU for Computers (MIPS)
design a fast ALU for the MIPS ISA

requirements?
support the arithmetic/logic operations: add, addi, addiu,
sub, subu, and, or, andi, ori, xor, xori, slt, slti, sltu, sltiu
design a multiplier
design a divider

Review Digital Logic
Gates
Combinational Logic
PLA: AND array, OR array

A D latch implemented with NOR gates.
A D flip-flop with a falling-edge trigger.


Positive edge-triggered D flip-flop (inputs D, CLK; output Q)
Value of D is sampled on the positive clock edge.
Q outputs the sampled value for the rest of the cycle.

Review: Edge-Triggering in Verilog
Module code has two bugs. Where?
module ff(D, Q, CLK);
  input D, CLK;
  output Q;
  always @ (CLK)
    Q <= D;
endmodule
Correct?
module ff(D, Q, CLK);
  input D, CLK;
  output Q;
  reg Q;
  always @ (posedge CLK)
    Q <= D;
endmodule

Traffic light controller
inputs: CLK, Change, Rst
outputs: R (red), Y (yellow), G (green)
If Change == 1 on a positive CLK edge, the traffic light changes.
If Rst == 1 on a positive CLK edge, RYG = 100.

State diagram: Traffic Light Controller
Rst == 1: RYG = 100
Change == 1: RYG = 100 -> 001 -> 010 -> 100
Timing diagram (Change, RYG): RYG = 100, 001, 010, 100

One-Hot Encoding
one D flip-flop per state bit, in ring order R, G, Y
Next State Combinational Logic (inputs Rst, Change) drives the D input of each flip-flop

State Elements: Traffic Light Controller
(figure: three D flip-flops holding R, G, Y)
wire next_R, next_Y, next_G;
output R, Y, G;
???

Value of D is sampled on the positive clock edge.
Q outputs the sampled value for the rest of the cycle.
module ff(Q, D, CLK);
  input D, CLK;
  output Q;
  reg Q;
  always @ (posedge CLK)
    Q <= D;
endmodule

State Elements: Traffic Light Controller
output R, Y, G;
ff ff_R(R, next_R, CLK);
ff ff_Y(Y, next_Y, CLK);
ff ff_G(G, next_G, CLK);

Next State Logic: Traffic Light Controller
inputs Rst and Change; next_R, next_G, next_Y feed the flip-flops that hold R, G, Y
assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);

output R, Y, G;
assign next_R = rst ? 1'b1 : (change ? Y : R);
assign next_Y = rst ? 1'b0 : (change ? G : Y);
assign next_G = rst ? 1'b0 : (change ? R : G);
ff ff_R(R, next_R, CLK);
ff ff_Y(Y, next_Y, CLK);
ff ff_G(G, next_G, CLK);

Logic Diagram: Traffic Light Controller
state sequence on Change == 1: RYG = 100 -> 001 -> 010
(figure: three D flip-flops in ring order R, G, Y)
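
The one-hot controller can be checked with a small behavioral model. The following is only a sketch in Python, not the slides' Verilog: the names `next_state` and `run` are made up for illustration, and the ring order follows the state diagram (RYG = 100 -> 001 -> 010 -> 100).

```python
def next_state(r, y, g, rst, change):
    """One-hot next-state logic for the ring R -> G -> Y -> R (assumed order)."""
    if rst:
        return (1, 0, 0)            # reset forces RYG = 100
    if change:
        # Each flip-flop loads the value of its predecessor in the ring:
        # next_R = Y, next_Y = G, next_G = R.
        return (y, g, r)
    return (r, y, g)                # hold the current state


def run(inputs, state=(0, 0, 0)):
    """Apply a list of (rst, change) pairs, one pair per positive clock edge."""
    trace = []
    for rst, change in inputs:
        state = next_state(*state, rst, change)
        trace.append("".join(str(bit) for bit in state))
    return trace


print(run([(1, 0), (0, 1), (0, 1), (0, 1), (0, 0)]))
# ['100', '001', '010', '100', '100']
```

Exactly one of R, Y, G is 1 after reset, which is the point of the one-hot encoding: the state bits are the outputs, so no output decoder is needed.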

ALU for MIPS ISA
design a 1-bit ALU using an AND gate, an OR gate, a full adder, and a mux
design a 32-bit ALU by cascading 32 1-bit ALUs

ALU for MIPS
a 1-bit ALU performing AND, OR, addition, and subtraction
if we set Binvert = Carryin = 1, then we can perform a - b,
because a - b = a + (NOT b) + 1 in 2's complement
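
As a rough illustration of the bit-slice idea, the sketch below models one ALU slice and ripples 32 of them in Python. The function names and the encoding of `op` (0 = AND, 1 = OR, 2 = ADD) are assumptions made for this example, not taken from the slides.

```python
def alu1(a, b, carry_in, binvert, op):
    """One ALU bit slice. op: 0 = AND, 1 = OR, 2 = ADD (assumed encoding)."""
    b = b ^ binvert                          # Binvert selects b or NOT b
    s = a ^ b ^ carry_in                     # full-adder sum bit
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    result = (a & b, a | b, s)[op]           # the output mux
    return result, carry_out


def alu32(a, b, binvert, op, width=32):
    """Ripple `width` slices; Carryin of bit 0 is tied to Binvert."""
    carry = binvert
    out = 0
    for i in range(width):
        bit, carry = alu1((a >> i) & 1, (b >> i) & 1, carry, binvert, op)
        out |= bit << i
    return out


print(alu32(7, 5, 0, 2))        # 12  (7 + 5)
print(alu32(7, 5, 1, 2))        # 2   (7 - 5, with Binvert = Carryin = 1)
print(hex(alu32(5, 7, 1, 2)))   # 0xfffffffe  (-2 in 32-bit 2's complement)
```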

ALU for MIPS
include a Less input for set-on-less-than (slt)


ALU for MIPS
design the most significant bit ALU
the most significant bit needs to do more work (detect overflow; the MSB is also used for slt)
how to detect an overflow
overflow = carryin{MSB} xor carryout{MSB}
overflow = 1 ; means overflow
overflow = 0 ; means no overflow
set-on-less-than
slt $1, $2, $3 ; if $2 < $3 then $1 = 1, else $1 = 0
               ; if the MSB of $2 - $3 is 1, then $1 = 1
               ; in 2's complement, the MSB of a negative number is 1

ALU for MIPS
a 1-bit ALU for the MSB
Overflow = Carryin XOR Carryout
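
The overflow rule can be tried out numerically. This is a sketch under stated assumptions: `add_with_overflow` and `slt` are hypothetical helper names, and `slt` XORs the sign bit with the overflow flag, a common refinement that goes slightly beyond the slide's "MSB of $2 - $3" rule (the two agree whenever the subtraction does not overflow).

```python
def add_with_overflow(a, b, subtract=False, width=32):
    """Return (result, overflow), with overflow = carry into MSB XOR carry out of MSB."""
    mask = (1 << width) - 1
    a &= mask
    b = (~b if subtract else b) & mask       # Binvert
    cin0 = 1 if subtract else 0              # Carryin of bit 0
    low = mask >> 1                          # all bits below the MSB
    carry_in_msb = ((a & low) + (b & low) + cin0) >> (width - 1)
    total = a + b + cin0
    carry_out_msb = total >> width
    return total & mask, carry_in_msb ^ carry_out_msb


def slt(a, b, width=32):
    """Signed a < b: sign of a - b, corrected by the overflow flag."""
    diff, overflow = add_with_overflow(a, b, subtract=True, width=width)
    return (diff >> (width - 1)) ^ overflow


print(add_with_overflow(7, 1, width=4))   # (8, 1): 0111 + 0001 overflows in 4 bits
print(slt(3, 5), slt(5, 3))               # 1 0
print(slt(-8, 1, width=4))                # 1, even though -8 - 1 overflows
```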

A 32-bit ALU constructed from 32 1-bit ALUs
A 32-bit ALU with zero detector
A Verilog behavioral definition of a MIPS ALU.

ALU for MIPS
the critical path of a 32-bit ripple carry adder is 32 x the carry propagation delay
how to solve this problem
design trick: use more hardware
design trick: look ahead, peek
carry lookahead adder (CLA)
CLA
a   b   cout
0   0   0     nothing happens
0   1   cin   propagate cin
1   0   cin   propagate cin
1   1   1     generate
propagate = a + b; generate = ab

ALU for MIPS
CLA using 4-bit as an example
two 4-bit numbers: a3a2a1a0, b3b2b1b0
pi = ai + bi; gi = aibi (for example, p0 = a0 + b0; g0 = a0b0)
c1 = g0 + p0c0
c2 = g1 + p1c1
c3 = g2 + p2c2
c4 = g3 + p3c3
substituting, c2 = g1 + p1g0 + p1p0c0, and so on, so every carry depends only on the p's, the g's, and c0
larger CLA adders can be constructed by cascading 4-bit CLA adders
other adders: carry select adder, carry skip adder
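
The 4-bit carry equations can be written out with every carry expanded in terms of p, g, and c0 only. The sketch below (the function name `cla4` is made up for illustration) does this in Python.

```python
def cla4(a, b, c0=0):
    """Add two 4-bit numbers with carry lookahead; returns (sum, c4)."""
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]   # propagate = a + b (OR)
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]   # generate  = ab (AND)
    # Expanded carries: none of them waits for another carry to ripple.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    s = 0
    for i in range(4):
        s |= (((a >> i) ^ (b >> i) ^ carries[i]) & 1) << i
    return s, c4


print(cla4(0b0110, 0b0111))   # (13, 0)
print(cla4(0b1111, 0b0001))   # (0, 1)
```

In hardware each expanded carry is a two-level AND-OR network, which is where the "use more hardware" trade-off shows up.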

Design Process
Divide and Conquer
use simple components
glue simple components together
work on the things you know how to do; the unknown will become obvious as you make progress
Successive Refinement
multiplier design
divider design

Multiplier
paper and pencil method
multiplicand      0110
multiplier      x 1001
                  0110
                 0000
                0000
               0110
product        0110110
n bits x m bits = m+n bits
binary: 0 -> place 0
        1 -> place a copy of the multiplicand

Multiply Hardware Version 1
32 bits x 32 bits, using a 64-bit multiplicand register, a 64-bit ALU, a 64-bit product register, and a 32-bit multiplier register
multiplicand register: 64 bits, shifts left
multiplier register: 32 bits, shifts right
64-bit ALU: ADD
product register: 64 bits, write
control: provides four control signals; checks the rightmost bit of the multiplier to decide whether to add 0 or the multiplicand

Multiply Algorithm Version 1
1. test multiplier0 (i.e., bit 0 of the multiplier)
1a. if multiplier0 = 1, add the multiplicand to the product and place the result in the product register
2. shift the multiplicand left 1 bit
3. shift the multiplier right 1 bit
4. 32nd repetition? if yes, done; if no, go to 1.

Multiply Algorithm Version 1 Example
0010 x 0101 = 0000 1010
iter.  step     multiplier  multiplicand  product
0      initial  0101        0000 0010     0000 0000
1      1a       0101        0000 0010     0000 0010
       2        0101        0000 0100     0000 0010
       3        0010        0000 0100     0000 0010
2      2        0010        0000 1000     0000 0010
       3        0001        0000 1000     0000 0010
3      1a       0001        0000 1000     0000 1010
       2        0001        0001 0000     0000 1010
       3        0000        0001 0000     0000 1010
4      2        0000        0010 0000     0000 1010
       3        0000        0010 0000     0000 1010
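
The version 1 loop can be sketched as a short Python function. This is an illustrative software model, not a hardware description; `multiply_v1` and its parameter `n` (the operand width) are names chosen for this example.

```python
def multiply_v1(multiplicand, multiplier, n=32):
    """Unsigned shift-and-add multiply, version 1 (2n-bit multiplicand and product)."""
    mask_2n = (1 << (2 * n)) - 1
    product = 0
    for _ in range(n):
        if multiplier & 1:                               # step 1: test multiplier0
            product = (product + multiplicand) & mask_2n # step 1a: add
        multiplicand = (multiplicand << 1) & mask_2n     # step 2: shift left
        multiplier >>= 1                                 # step 3: shift right
    return product


print(format(multiply_v1(0b0010, 0b0101, n=4), "08b"))   # 00001010
```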

Multiplier Algorithm Version 1
observations from version 1
1/2 of the bits in the multiplicand are always 0
using a 64-bit adder is wasteful (for 32 bits x 32 bits)
0s are inserted into the multiplicand as it shifts left, so the least significant bits of the product do not change once formed
3 steps per bit
shift the product to the right instead of shifting the multiplicand to the left? (by adding to the left half of the product register)

Multiply Hardware Version 2
32-bit multiplicand register, 32-bit ALU, 64-bit product register, 32-bit multiplier register
multiplicand register: 32 bits
multiplier register: 32 bits, shifts right
32-bit ALU: ADD
product register: 64 bits (two 32-bit halves), shifts right; the ALU result is written into the left half
control: checks the rightmost bit of the multiplier to decide whether to add 0 or the multiplicand

Multiply Algorithm Version 2
1. test multiplier0 (i.e., bit 0 of the multiplier)
1a. if multiplier0 = 1, add the multiplicand to the left half of the product and place the result in the left half of the product register
2. shift the product register right 1 bit
3. shift the multiplier register right 1 bit
4. 32nd repetition? if yes, done; if no, go to 1.

Multiply Algorithm Version 2 Example
multiplicand 0010, multiplier 0011: product = 0000 0110
iter.  step     multiplier  multiplicand  product
0      initial  0011        0010          0000 0000
1      1a       0011        0010          0010 0000
       2        0011        0010          0001 0000
       3        0001        0010          0001 0000
2      1a       0001        0010          0011 0000
       2        0001        0010          0001 1000
       3        0000        0010          0001 1000
3      2        0000        0010          0000 1100
       3        0000        0010          0000 1100
4      2        0000        0010          0000 0110
       3        0000        0010          0000 0110
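
A matching sketch of version 2, again as an illustrative Python model with made-up names: the multiplicand stays n bits wide and is added into the left half of the product, which then shifts right.

```python
def multiply_v2(multiplicand, multiplier, n=32):
    """Unsigned multiply, version 2: n-bit adder, product register shifts right."""
    product = 0
    for _ in range(n):
        if multiplier & 1:                    # step 1: test multiplier0
            # Step 1a: add into the left half. Python integers are unbounded,
            # so the adder's carry-out is kept and shifted back in below.
            product += multiplicand << n
        product >>= 1                         # step 2: shift product right
        multiplier >>= 1                      # step 3: shift multiplier right
    return product


print(format(multiply_v2(0b0010, 0b0011, n=4), "08b"))   # 00000110
```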

Multiply Version 2
Observations
the product register wastes space that exactly matches the size of the multiplier
3 steps per bit
combine the multiplier register and the product register
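
Multiply Hardware Version 3
32-bit multiplicand register, 32-bit ALU, 64-bit product register; the multiplier register is part of the product register
multiplicand register: 32 bits
32-bit ALU: ADD
product (multiplier) register: 64 bits, shifts right; the ALU result is written into the left half
control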

Multiply Algorithm Version 3
1. test product0 (the multiplier is in the right half of the product register)
1a. if product0 = 1, add the multiplicand to the left half of the product and place the result in the left half of the product register
2. shift the product register right 1 bit
3. 32nd repetition? if yes, done; if no, go to 1.

Multiply Algorithm Version 3 Example
1110 x 1011 = 1001 1010 (14 x 11 = 154)
iter.  step     multiplicand  product
0      initial  1110            0000 1011
1      1a       1110            1110 1011
       2        1110            0111 0101
2      1a       1110          1 0101 0101
       2        1110            1010 1010
3      2        1110            0101 0101
4      1a       1110          1 0011 0101
       2        1110            1001 1010
need to save the carry out of the adder (the leading 1 in the 1a rows) so that it can be shifted into the product
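
Version 3 is short enough to model in a few lines. The sketch below is a software illustration with assumed names; the saved carry from the example appears here as the extra high bit that Python's unbounded integers keep until the shift.

```python
def multiply_v3(multiplicand, multiplier, n=32):
    """Unsigned multiply, version 3: multiplier starts in the right half of product."""
    product = multiplier                      # left half = 0, right half = multiplier
    for _ in range(n):
        if product & 1:                       # step 1: test product0
            product += multiplicand << n      # step 1a: add to left half (carry kept)
        product >>= 1                         # step 2: shift product right
    return product


print(format(multiply_v3(0b1110, 0b1011, n=4), "08b"))   # 10011010
print(multiply_v3(14, 11, n=4))                          # 154
```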

Observations
2 steps per bit: because the multiplier and product share one register, a single right shift moves both (rather than the two shifts in versions 1 and 2)
MIPS registers Hi and Lo correspond to the left and right halves of the product
MIPS has the instruction multu
How about signed numbers in multiplication?
method 1: keep the signs of both numbers and use the magnitudes for multiplication; after 32 repetitions, change the product to the appropriate sign
method 2: Booth's algorithm
Booth's algorithm is more elegant for signed number multiplication
Booth's algorithm uses the same hardware as version 3

Booth's Algorithm
the motivation for Booth's algorithm is speed
example: 2 x 6 = 0010 x 0110
(figure: normal approach vs. Booth's approach, each applied to 0010 x 0110)
Booth's approach: replace a string of 1s in the multiplier by two actions
action 1: beginning of a string of 1s, subtract the multiplicand
action 2: end of a string of 1s, add the multiplicand

Booth's Algorithm
011111111111111111110
(labels over the bit string, left to right: end of run, middle of run, beginning of run)
current bit  bit to the right  explanation               action
             (previous bit)
1            0                 beginning of a run of 1s  subtract multiplicand from left half of product
1            1                 middle of a run of 1s     no arithmetic operation
0            1                 end of a run of 1s        add multiplicand to left half of product
0            0                 middle of a run of 0s     no arithmetic operation

Booth's Algorithm Example
-2 x 7 = -14; in signed binary, 1110 x 0111 = 1111 0010
To begin with, we put the multiplier in the right half of the product register
iteration  step         multiplicand  product    previous bit
0          initial      1110          0000 0111  0
1          sub.         1110          0010 0111  0
           shift right  1110          0001 0011  1
2          shift right  1110          0000 1001  1
3          shift right  1110          0000 0100  1
4          add          1110          1110 0100  1
           shift right  1110          1111 0010  0
the right shift is arithmetic: the sign bit of the product is copied into the vacated position
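
The table-driven rule above can be sketched in Python as follows. The function name is hypothetical, and the model assumes the multiplicand is not the most negative n-bit value (-2^(n-1)), because this simple register layout keeps no extra sign bit for its negation.

```python
def booth_multiply(multiplicand, multiplier, n=32):
    """Signed multiply with Booth's algorithm on a 2n-bit product register."""
    mask_n = (1 << n) - 1
    mask_2n = (1 << (2 * n)) - 1
    m = multiplicand & mask_n
    product = multiplier & mask_n             # multiplier in the right half
    prev = 0                                  # bit to the right of product0
    for _ in range(n):
        cur = product & 1
        if (cur, prev) == (1, 0):             # beginning of a run of 1s
            product = (product - (m << n)) & mask_2n
        elif (cur, prev) == (0, 1):           # end of a run of 1s
            product = (product + (m << n)) & mask_2n
        prev = cur
        sign = product >> (2 * n - 1)
        product = (product >> 1) | (sign << (2 * n - 1))   # arithmetic shift right
    if product >> (2 * n - 1):                # reinterpret as a signed value
        product -= 1 << (2 * n)
    return product


print(booth_multiply(-2, 7, n=4))   # -14
```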

Divide Algorithm
Paper and pencil
divisor 1011 ) 1010101010 dividend
the quotient is written on top; the remainder (modulo) is left at the bottom

Divide Hardware Version 1
64-bit divisor register, 64-bit ALU, 32-bit quotient register, 64-bit remainder register
divisor register: 64 bits, shifts right
quotient register: 32 bits, shifts left
64-bit ALU
remainder register: 64 bits, write
control
put the dividend in the remainder register initially


Divide Algorithm Version 1
start: place the dividend in the remainder register
1. subtract the divisor from the remainder and place the result in the remainder
2. test the remainder
2a. if remainder >= 0, shift the quotient left, setting the new rightmost bit to 1
2b. if remainder < 0, restore the original value by adding the divisor to the remainder and placing the sum in the remainder; shift the quotient left, setting the new least significant bit to 0
3. shift the divisor right 1 bit
4. n+1 repetitions? if yes, done; if no, go to 1.

Divide Algorithm Version 1 Example
7 / 2: quotient 0011, remainder 0001
iter.  step     quotient  divisor    remainder
0      initial  0000      0010 0000  0000 0111
1      1        0000      0010 0000  1110 0111
       2b       0000      0010 0000  0000 0111
       3        0000      0001 0000  0000 0111
2      1        0000      0001 0000  1111 0111
       2b       0000      0001 0000  0000 0111
       3        0000      0000 1000  0000 0111
3      1        0000      0000 1000  1111 1111
       2b       0000      0000 1000  0000 0111
       3        0000      0000 0100  0000 0111
4      1        0000      0000 0100  0000 0011
       2a       0001      0000 0100  0000 0011
       3        0001      0000 0010  0000 0011
5      1        0001      0000 0010  0000 0001
       2a       0011      0000 0010  0000 0001
       3        0011      0000 0001  0000 0001
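
The version 1 divide loop can be sketched in Python as below. The name `divide_v1` and the width parameter `n` are illustrative assumptions, and a nonzero divisor is assumed.

```python
def divide_v1(dividend, divisor, n=32):
    """Unsigned restoring division, version 1. Returns (quotient, remainder)."""
    remainder = dividend            # start: dividend in the 2n-bit remainder register
    divisor <<= n                   # divisor starts in the left half of its register
    quotient = 0
    for _ in range(n + 1):
        remainder -= divisor        # step 1: subtract
        if remainder >= 0:          # step 2a: shift a 1 into the quotient
            quotient = (quotient << 1) | 1
        else:                       # step 2b: restore, shift a 0 into the quotient
            remainder += divisor
            quotient <<= 1
        divisor >>= 1               # step 3: shift divisor right
    return quotient, remainder


print(divide_v1(7, 2, n=4))   # (3, 1)
```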

Observations
1/2 of the bits in the divisor are always 0
1/2 of the divisor register is wasted
1/2 of the 64-bit ALU is wasted
Possible improvement
instead of shifting the divisor to the right, shift the remainder to the left?
the first step cannot produce a 1 in the quotient, so switch the order to shift first and then subtract; this saves one iteration

Divide Hardware Version 2
32-bit divisor register, 32-bit ALU, 32-bit quotient register, 64-bit remainder register
divisor register: 32 bits
quotient register: 32 bits, shifts left
32-bit ALU
remainder register: 64 bits, shifts left
control

Divide Algorithm Version 2
1. shift the remainder left 1 bit
2. subtract the divisor from the left half of the remainder and place the result in the left half of the remainder
3. test the remainder
3a. if remainder >= 0, shift the quotient left, setting the new rightmost bit to 1
3b. if remainder < 0, restore the original value by adding the divisor to the left half of the remainder and placing the sum in the left half of the remainder; also shift the quotient left, setting the new least significant bit to 0
4. n repetitions? if yes, done; if no, go to 1.

Divide Algorithm Version 2 Example
iter.  step     quotient  divisor  remainder
0      initial  0000      0011     0000 1111
1      1        0000      0011     0001 1110
       2        0000      0011     1110 1110
       3b       0000      0011     0001 1110
2      1        0000      0011     0011 1100
       2        0000      0011     0000 1100
       3a       0001      0011     0000 1100
3      1        0001      0011     0001 1000
       2        0001      0011     1110 1000
       3b       0010      0011     0001 1000
4      1        0010      0011     0011 0000
       2        0010      0011     0000 0000
       3a       0101      0011     0000 0000
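
A sketch of the version 2 loop in Python, with assumed names and a nonzero divisor: the divisor stays n bits wide and is compared against the left half of the remainder register.

```python
def divide_v2(dividend, divisor, n=32):
    """Unsigned restoring division, version 2. Returns (quotient, remainder)."""
    remainder = dividend                  # dividend in the right half initially
    quotient = 0
    for _ in range(n):
        remainder <<= 1                   # step 1: shift remainder left
        remainder -= divisor << n         # step 2: subtract from the left half
        if remainder >= 0:                # step 3a
            quotient = (quotient << 1) | 1
        else:                             # step 3b: restore
            remainder += divisor << n
            quotient <<= 1
    return quotient, remainder >> n       # the remainder ends up in the left half


print(divide_v2(15, 3, n=4))   # (5, 0)
```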

Observations
3 steps (shift remainder left, subtract, shift quotient left)
Further improvement (version 3)
eliminate the quotient register by combining it with the remainder register, which is shifted left
the loop then contains only two steps, because shifting the remainder register shifts the remainder in the left half and the quotient in the right half at the same time
a consequence of combining the two registers is that the remainder is shifted one time too many in the last iteration
final correction step: shift back the remainder in the left half of the remainder register (i.e., shift only the left half right 1 bit)

Divide Hardware Version 3
32-bit divisor register, 32-bit ALU, 64-bit remainder register, 0-bit quotient register (quotient bits shift into the remainder register as the remainder register shifts left)
divisor register: 32 bits
32-bit ALU
remainder, quotient register: 64 bits, shifts left, write
control

Divide Algorithm Version 3
1. shift the remainder register left 1 bit
2. subtract the divisor from the left half of the remainder and place the result in the left half of the remainder
3. test the remainder
3a. if remainder >= 0, shift the remainder left, setting the new rightmost bit to 1
3b. if remainder < 0, restore the original value by adding the divisor to the left half of the remainder and placing the sum in the left half of the remainder; also shift the remainder left, setting the new least significant bit to 0
4. n repetitions? if yes, done; if no, go to 2.

Divide Algorithm Version 3 Example
iter.  step        divisor  remainder
0      initial     0101     0000 1110
       1           0101     0001 1100
1      2           0101     1100 1100
       3b          0101     0011 1000
2      2           0101     1110 1000
       3b          0101     0111 0000
3      2           0101     0010 0000
       3a          0101     0100 0001
4      2           0101     1111 0001
       3b          0101     1000 0010
       correction           0100 0010
correction step: shift the left half of the remainder register right 1 bit; the right half (0010) is the quotient
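
The combined-register version, including the final correction step, can be sketched as follows. The names are illustrative and a nonzero divisor is assumed. Python integers are unbounded, so the bit that the last left shift may push beyond 2n bits is not lost in this model.

```python
def divide_v3(dividend, divisor, n=32):
    """Unsigned restoring division, version 3 (quotient shares the remainder register)."""
    rem = dividend << 1                   # step 1: initial shift left
    for _ in range(n):
        rem -= divisor << n               # step 2: subtract from the left half
        if rem >= 0:
            rem = (rem << 1) | 1          # step 3a: shift left, new bit = 1
        else:
            rem += divisor << n           # step 3b: restore ...
            rem <<= 1                     # ... then shift left, new bit = 0
    quotient = rem & ((1 << n) - 1)       # right half holds the quotient
    remainder = (rem >> n) >> 1           # correction: shift the left half right 1 bit
    return quotient, remainder


print(divide_v3(14, 5, n=4))   # (2, 4)
```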

Observations
same hardware as multiply: need a 32-bit ALU to add and subtract and a 64-bit register to shift left and right
divide algorithm version 3 is called the restoring division algorithm for unsigned numbers
Signed numbers divide
simplest method
remember the signs of the dividend and divisor, make both positive, and finally complement the quotient and remainder as necessary
dividend and remainder must have the same sign
quotient is negative if the dividend sign and divisor sign disagree
SRT method (named after three people: Sweeney, Robertson, and Tocher)
an efficient algorithm

Floating Point Numbers
What can be represented in N bits?
unsigned          0               to  2^N - 1
2's complement    -2^(N-1)        to  2^(N-1) - 1
1's complement    -2^(N-1) + 1    to  2^(N-1) - 1
BCD               0               to  10^(N/4) - 1
How about
very small numbers, very large numbers
rationals, such as 2/3; irrationals, such as sqrt(2);
transcendentals, such as e, pi

Mantissa (aka Significand), Exponent (using a radix of 10)
6.12 x 10^23
fields: S E M
IEEE standard F.P.: 1.M x 2^(E-127)
single precision: S (1 bit), E (8 bits), M (23 bits)
mantissa = sign + magnitude; the magnitude is normalized with a hidden integer bit: 1.M
exponent = E - 127 (excess 127), 0 < E < 255
an FP number N = (-1)^S x 2^(E-127) x (1.M)
0 = 0 00000000 00000000000000000000000
-1.5 = 1 01111111 10000000000000000000000

Single Precision FP numbers
-0.75 = __________________________________
-5.0 = ___________________________________
7 = ____________________________________
-0.75 = -0.11b = -1.1 x 2^-1, E = 126:  1 01111110 10000000000000000000000
-5.0 = -101.0b = -1.01 x 2^2, E = 129:  1 10000001 01000000000000000000000
7 = 111b = 1.11 x 2^2, E = 129:         0 10000001 11000000000000000000000
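
These encodings can be checked with Python's standard `struct` module, which packs a value as an IEEE 754 single-precision number. The helper names below are made up for this sketch.

```python
import struct


def float_to_fields(x):
    """Split x, stored as IEEE 754 single precision, into (S, E, M)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF       # biased exponent E (excess 127)
    mantissa = bits & 0x7FFFFF           # 23-bit fraction M (hidden 1 not stored)
    return sign, exponent, mantissa


def fields_to_str(x):
    s, e, m = float_to_fields(x)
    return f"{s} {e:08b} {m:023b}"


for value in (-0.75, -5.0, 7.0, -1.5):
    print(value, "=", fields_to_str(value))
# -0.75 = 1 01111110 10000000000000000000000
# -5.0 = 1 10000001 01000000000000000000000
# 7.0 = 0 10000001 11000000000000000000000
# -1.5 = 1 01111111 10000000000000000000000
```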


Single precision FP number
What is the smallest number in magnitude?
(1.0) x 2^-126
What is the largest number in magnitude?
(1.11111111111111111111111)binary x 2^127 = (2 - 2^-23) x 2^127

single precision FP numbers
Exponent   Significand   Object represented
0          0             0
0          nonzero       denormalized numbers
1 to 254   anything      floating point numbers
255        0             infinity
255        nonzero       NaN (Not a Number)
other topics in FP numbers
1. extra bits for rounding
2. guard bit, sticky bit
3. algorithms for FP numbers

Double precision
64 bits total
52-bit significand
11-bit exponent (excess 1023 bias)
Number is: (-1)^S x (1.M) x 2^(E-1023)

Basic Addition Algorithm
Steps for Y + X, assuming Y >= X
1. Align binary points (denormalize the smaller number)
   a. compute Diff = Exp(Y) - Exp(X); Exp = Exp(Y)
   b. Sig(X) = Sig(X) >> Diff
2. Add the aligned components
   Sig = Sig(X) + Sig(Y)
3. Normalize the sum
   a. shift Sig right/left until the leading bit is 1, incrementing or decrementing Exp accordingly
   b. check for overflow in Exp
   c. round
   d. repeat step 3 if the result is no longer normalized

Addition Example
4-bit significand
1.0110 x 2^3 + 1.1000 x 2^2
align binary points (denormalize the smaller number)
  1.0110 x 2^3
  0.1100 x 2^3
add the aligned components
 10.0010 x 2^3
normalize the sum
  1.0001 x 2^4
no overflow, no rounding
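
The align, add, and normalize steps can be sketched on small integers. This is a simplified model under several assumptions: both operands are positive and normalized, bits shifted out during alignment are simply truncated (no guard bits, no rounding), and exponent overflow is not checked. A significand 1.ffff is stored as the integer 1ffff, and `fp_add` is a hypothetical name.

```python
def fp_add(sig_y, exp_y, sig_x, exp_x, frac_bits=4):
    """Add Y + X where value = (sig / 2**frac_bits) * 2**exp and exp_y >= exp_x."""
    # Step 1: align binary points by denormalizing the smaller number.
    diff = exp_y - exp_x
    sig_x >>= diff                        # truncates: bits shifted out are lost
    exp = exp_y
    # Step 2: add the aligned significands.
    sig = sig_y + sig_x
    # Step 3: normalize so that the leading 1 sits in the integer position.
    while sig >= (2 << frac_bits):        # sum is 10.xxxx or larger
        sig >>= 1
        exp += 1
    while sig and sig < (1 << frac_bits): # leading bit is 0 (not reached for positive sums)
        sig <<= 1
        exp -= 1
    return sig, exp


sig, exp = fp_add(0b10110, 3, 0b11000, 2)   # 1.0110 x 2^3 + 1.1000 x 2^2
print(format(sig, "05b"), exp)              # 10001 4, i.e. 1.0001 x 2^4
```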

Another Addition Example
1.0001 x 2^3 - 1.1110 x 2^1
4-bit significand; extra bit needed for accuracy
1. Align binary points:
     1.0001  x 2^3
   - 0.01111 x 2^3
2. Subtract the aligned components
     0.10011 x 2^3
3. Normalize
     1.0011 x 2^2 = 4.75
Without the extra bit, the result would be 0.1001 x 2^3 = 100.1b = 4.5, which is off by 0.25. This is too much!

Accuracy and Rounding
Want arithmetic to be fully precise
IEEE 754 keeps two extra digits on the right during intermediate calculations (guard digit, round digit)
The alignment step can cause data to be discarded (shifted out on the right)
2.56 x 10^0 + 2.34 x 10^2
    2.3400 x 10^2
  + 0.0256 x 10^2
    2.3656 x 10^2   (guard digit 5, round digit 6)
We have two digits to round on: 0 to 49 round down, 51 to 99 round up
Answer = 2.37 x 10^2
Without using guard and round digits, the answer would be 2.36 x 10^2
