
CHAPTER 2

SYLLABUS
Introduction,
Basic Architectural Features,
DSP Computational Building Blocks,
Bus Architecture and Memory,
Data Addressing Capabilities,
Address Generation Unit,
Programmability and Program Execution,
Features for External Interfacing.
BASIC ARCHITECTURAL FEATURES
A programmable DSP device should provide instructions
which are used to design programs for implementing
DSP algorithms.

The instruction set of a typical DSP device should
include the following:

Arithmetic operations such as ADD, SUBTRACT, MULTIPLY
etc
Logical operations such as AND, OR, NOT, XOR etc
Multiply and Accumulate (MAC) operation
Signal scaling operation for scaling the signal before and/or after
digital signal processing.

Thus, high speed h/w should be provided.
E.g. Multipliers: dedicated multipliers, shift & add/
micro-coded multiplier

In addition to the above provisions, the
architecture should also include,
On chip registers to store intermediate results
On chip memories to store signal samples (RAM)
On chip program memories to store programs
and fixed data such as filter coefficients (ROM)

BASIC ARCHITECTURAL FEATURES
Investigate the basic features that should
be provided in the DSP architecture to be
used to implement the following Nth order
FIR filter















BASIC ARCHITECTURAL FEATURES : PROBLEM
$$y[n] = \sum_{i=0}^{N-1} h[i]\, x[n-i], \qquad n = 0, 1, 2, \ldots$$

where x[n] = i/p sample, y[n] = o/p sample,
h[i] = ith filter coefficient,
x[n-i] = i/p sample i samples earlier than x[n]

In order to implement the above operation in a DSP, the architecture
requires the following features
i. A RAM to store the signal samples x(n), x(n-1), x(n-2), etc.
ii. A ROM to store the filter coefficients h(i)
iii. A h/w multiplier & adder (MAC unit) to perform Multiply and Accumulate
operation
iv. A register to keep track of accumulation i.e. an accumulator to store the
intermediate result
v. A register (signal pointer) to point the current signal sample in the memory
being used.
vi. A register (coefficient pointer) to point the current filter coefficient in the
memory
vii. A counter to keep track of the count of MAC operations that remain to be
done.
viii. Capability to scale the signal value x[n] as it is read from the memory &,
computed signal y[n] as it is stored in memory i.e. a shifter is used to shift
the input and/or output samples appropriately
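
The feature list above can be sketched in software as follows. This is only an illustrative model, not a description of any particular device: the names `fir_output`, `coeff_ptr`, `sample_ptr` and `count` are assumptions chosen to mirror the coefficient pointer, signal pointer and counter in the list, and scaling (feature viii) is left out.

```python
def fir_output(h, x_hist):
    """Compute one output y[n] = sum over i of h[i] * x[n-i].

    h      : list of N filter coefficients (plays the role of the ROM)
    x_hist : list of the N most recent samples, x_hist[i] = x[n-i] (the RAM)
    """
    acc = 0              # accumulator
    coeff_ptr = 0        # coefficient pointer
    sample_ptr = 0       # signal pointer
    count = len(h)       # number of MAC operations that remain
    while count > 0:
        acc += h[coeff_ptr] * x_hist[sample_ptr]   # one multiply-accumulate
        coeff_ptr += 1
        sample_ptr += 1
        count -= 1
    return acc


print(fir_output([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6 = 32
```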
BASIC ARCHITECTURAL FEATURES: SOLUTION
DSP COMPUTATIONAL BUILDING BLOCKS
These are h/w building blocks that carry out
basic DSP computations

Important issues to be taken into consideration
during design of building blocks are:
Accuracy
Optimization for functionality & speed
Design to be sufficiently general so that it can be
easily integrated with other blocks to implement
overall DSP system.
Basic building blocks:

Multiplier
Shifter
Multiply & Accumulate (MAC) Unit
Arithmetic logic unit
DSP COMPUTATIONAL BUILDING BLOCKS
MULTIPLIER
The advent of single chip multipliers paved the
way for implementing DSP functions on a VLSI
chip.
Parallel multipliers replaced the traditional shift
and add multipliers. Parallel multipliers take a
single processor cycle to fetch and execute the
instruction and to store the result.
They are also called as Array multipliers or
Braun multipliers.
The key features to be considered for a
multiplier are:
Accuracy
Dynamic range
Speed

The number of bits used to represent the operands
and whether they are implemented in fixed point or
floating point format decide the accuracy and the
dynamic range of the multiplier. Whereas speed is
decided by the architecture employed.

If the multipliers are implemented using hardware, the
speed of execution will be very high but the circuit
complexity will also increase considerably. Thus
there should be a tradeoff between the speed of
execution and the circuit complexity. Hence the
choice of the architecture normally depends on the
application.

MULTIPLIER
Consider the multiplication of two unsigned numbers A and B.
Let A be represented using m bits as ($A_{m-1} A_{m-2} \ldots A_1 A_0$)
and B be represented using n bits as ($B_{n-1} B_{n-2} \ldots B_1 B_0$).
Then the product of these two numbers is given by,

PARALLEL MULTIPLIER

$$A = \sum_{i=0}^{m-1} A_i 2^i, \qquad B = \sum_{j=0}^{n-1} B_j 2^j$$

$$P = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} A_i B_j 2^{i+j}$$
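
As a rough check of the product expression, the double sum can be evaluated bit by bit. The sketch below is a software model only; the function name `array_multiply` is an assumption, and the loop stands in for the AND gates (bit products) and adders of the hardware array.

```python
def array_multiply(a, b, m, n):
    """Unsigned m-bit by n-bit product as a sum of AND-ed bit products."""
    p = 0
    for i in range(m):
        a_i = (a >> i) & 1
        for j in range(n):
            b_j = (b >> j) & 1
            p += (a_i & b_j) << (i + j)   # A_i * B_j * 2^(i+j)
    return p


print(array_multiply(13, 11, 4, 4))   # 143
```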
The fig shows multiplication of two 4 bit
numbers, $A = A_3 A_2 A_1 A_0$ and $B = B_3 B_2 B_1 B_0$.
PARALLEL MULTIPLIER
BRAUN MULTIPLIER
MULTIPLIER
The hardware structure is regular, and a 4x4
multiplier requires 12 FAs and 16 AND gates.
In general an n x n multiplier requires n(n-1) FAs
and $n^2$ AND gates.
Several improvements on this basic structure
are done to increase the speed, reduce the
hardware complexity & power dissipation.

MULTIPLIER FOR SIGNED NUMBERS
Braun multiplier does not take into account
signs of numbers that are being multiplied.

Additional hardware is required (before &
after multiplier) when signed numbers,
represented in 2s complement form are
used.

The modified multiplier for handling signed
numbers is called as Baugh-Wooley
multiplier.
Consider 2 numbers A & B represented in 2s complement
format. Let A have m bits and B have n bits.

$$A = -A_{m-1} 2^{m-1} + \sum_{i=0}^{m-2} A_i 2^i$$

$$B = -B_{n-1} 2^{n-1} + \sum_{j=0}^{n-2} B_j 2^j$$

The product $P = P_{m+n-1} \ldots P_1 P_0$ can be written as

$$P = AB = A_{m-1} B_{n-1} 2^{m+n-2} + \sum_{i=0}^{m-2} \sum_{j=0}^{n-2} A_i B_j 2^{i+j} - \sum_{i=0}^{m-2} A_i B_{n-1} 2^{n-1+i} - \sum_{j=0}^{n-2} A_{m-1} B_j 2^{m-1+j}$$

The two subtractions can be represented as additions of 2s
complement numbers.
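
The signed product expansion (one positive sign-bit term, the unsigned partial products, and two subtracted cross terms) can be checked numerically. The sketch below is an assumption-laden software model, not the Baugh-Wooley hardware itself: `signed_product`, `bit` and `to_pattern` are hypothetical helper names, and the operands are passed as raw two's complement bit patterns.

```python
def bit(v, k):
    """Return bit k of the bit pattern v."""
    return (v >> k) & 1


def to_pattern(value, bits):
    """Two's complement bit pattern of a signed value in the given width."""
    return value & ((1 << bits) - 1)


def signed_product(a, b, m, n):
    """Product of an m-bit and an n-bit two's complement bit pattern."""
    a_sign, b_sign = bit(a, m - 1), bit(b, n - 1)
    p = (a_sign & b_sign) << (m + n - 2)          # sign-bit product term
    for i in range(m - 1):
        for j in range(n - 1):
            p += (bit(a, i) & bit(b, j)) << (i + j)
    for i in range(m - 1):
        p -= (bit(a, i) & b_sign) << (n - 1 + i)  # first subtraction
    for j in range(n - 1):
        p -= (a_sign & bit(b, j)) << (m - 1 + j)  # second subtraction
    return p


print(signed_product(to_pattern(-3, 4), to_pattern(5, 4), 4, 4))   # -15
```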
Conventional Shift and Add technique of multiplication
requires n cycles to perform the multiplication of two n bit
numbers. Whereas in parallel multipliers the time required
will be the longest path delay in the combinational circuit
used.

As DSP applications generally require very high speed, it is
desirable to have multipliers operating at the highest
possible speed by having parallel implementation.

Low access time of memories holding program & data.
Multipliers operated at high speeds with fully parallel implementation.
MULTIPLIER: SPEED
MULTIPLIER: BUS WIDTH
Consider the multiplication of two n bit numbers X and Y.
The product Z can be at most 2n bits long. In order to
perform the whole operation in a single execution cycle, we
require two buses of width n bits each to fetch the operands
X and Y and a bus of width 2n bits to store the result Z to
the memory.
Use the program bus itself to fetch one of the operands after
fetching the instruction, thus requiring only one bus to fetch
the operands. And the result Z can be stored back to the
memory using the same operand bus. But the problem with
this is the result Z is 2n bits long whereas the operand bus
is just n bits long.
We have two alternatives to solve this problem,
Use the n bits operand bus and save Z at two successive memory
locations. Although it stores the exact value of Z in the memory, it
takes two cycles to store the result.
Discard the lower n bits of the result Z and store only the higher
order n bits into the memory. It is not applicable for the applications
where accurate result is required.

Another alternative can be used for the applications where
speed is not a major concern, in which latches are used for
inputs and outputs thus requiring a single bus to fetch the
operands and to store the result.
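
The two storage alternatives for the 2n-bit product can be illustrated with a small sketch. The function name `product_words` is an assumption; operands are treated as unsigned n-bit values for simplicity.

```python
def product_words(x, y, n):
    """Return the (high, low) n-bit words of the 2n-bit unsigned product."""
    z = x * y
    mask = (1 << n) - 1
    return (z >> n) & mask, z & mask


high, low = product_words(0xFFFF, 0xFFFF, 16)
# Alternative 1: store both words at successive locations (exact, two cycles).
exact = (high << 16) | low
# Alternative 2: keep only the higher order word (one cycle, low word is lost).
truncated = high << 16
print(hex(high), hex(low), exact - truncated)
```

In the second alternative the discarded low word is exactly the difference between the stored and the true product.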
MULTIPLIER: BUS WIDTH
SHIFTERS
Shifters are used to either scale down or scale up
operands or the results to avoid errors from
overflows or underflows during computations.

The following scenarios give the necessity of a
shifter

While performing the addition of N numbers, each n bits long,
the sum can grow up to $n + \log_2 N$ bits long. If
the accumulator is only n bits long, then an overflow error
will occur.
This can be overcome by using a shifter to scale down the
operands by $\log_2 N$ bits.
Accuracy is reduced by $\log_2 N$ bits, but overflow is avoided.
The actual sum is obtained by scaling up the result by $\log_2 N$ bits.


SHIFTERS
Similarly while calculating the product of two n bit
numbers, the product can grow up to 2n bits long.
Generally the lower n bits get neglected when stored in n
bit accumulator resulting in loss of accuracy.
In case of signed numbers, the sign bit is shifted left to
save the sign of the product before saving the n higher
order bits.

Finally in case of addition of two floating-point
numbers, the numbers should be normalized to have
same exponent. Thus, one of the operands has to be
shifted appropriately to make the exponents of two
numbers equal.

From the above cases it is clear that, a shifter is
required in the architecture of a DSP.

SHIFTERS PROBLEMS
It is required to find the sum of 64, 16 bit
numbers. How many bits should the
accumulator have so that the sum can be
computed without the occurrence of overflow
error or loss of accuracy?

Length of sum = $n + \log_2 N$ bits
The sum of 64, 16 bit numbers can grow up to
$(16 + \log_2 64) = 22$ bits long. Hence the accumulator
should be 22 bits long in order to avoid overflow
error from occurring.

In the previous problem, it is decided to have
an accumulator with only 16 bits but shift the
numbers before the addition to prevent
overflow, by how many bits should each
number be shifted?


Scale down the operands by $\log_2 N$ bits.
As the length of the accumulator is fixed, the operands
have to be shifted by $\log_2 64 = 6$ bits to
the right prior to the addition operation, in order to avoid
the condition of overflow.

SHIFTERS PROBLEMS
If all the numbers in the previous problem are
fixed point integers, what is the actual sum of
the numbers?


Each number had been shifted to the right by 6 bits.
Thus, the actual sum can be obtained by shifting
the result by 6 bits to the left after the sum
has been computed. Therefore Actual Sum =
Accumulator content x $2^6$.

SHIFTERS PROBLEMS
What is the error in computation of the sum
in the previous problem?




Since the 6 lowest bits (LSBs) have been lost,
error = $2^6 - 1 = 63$
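
The scale-down, add, scale-up procedure from these problems can be sketched as below. This is only an illustration under stated assumptions: the samples are treated as unsigned 16 bit integers, and `scaled_sum` is a hypothetical helper name.

```python
def scaled_sum(values, shift):
    """Add values after scaling each one down, then scale the sum back up."""
    acc = 0
    for v in values:
        acc += v >> shift      # shift right before adding to avoid overflow
    return acc << shift        # shift left to recover the actual magnitude


# 64 worst-case 16 bit samples, each shifted right by log2(64) = 6 bits:
samples = [0xFFFF] * 64
accumulator = sum(v >> 6 for v in samples)
print(accumulator < (1 << 16))          # the running sum fits in 16 bits

# Values whose 6 lowest bits are zero are recovered exactly:
print(scaled_sum([64, 128, 192], 6))    # 384
```

Any bits below the shift amount are discarded from each operand, which is the loss of accuracy the text refers to.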
SHIFTERS PROBLEMS
In conventional microprocessors, shifting is
performed as in a shift register.

This requires one clock cycle for each single bit shift.

A multi bit shift therefore takes a large amount of time.

For high speed DSP computations, a shift of
several bits must be accomplished in a single clock
cycle.

BARREL SHIFTER
A barrel shifter connects the input lines
representing a word to a group of output
lines with the required shifts determined by
its control inputs.
For an input of length n, $\log_2 n$ control lines
are required. An additional control line is
required to indicate the direction of the shift.
Generally, the direction of shift remains fixed.

BARREL SHIFTER
Bits shifted out of the i/p word are discarded and
new bit positions are filled with zeros in case of
left shift & with most significant bit in case of
right shift (to maintain the sign of the shifted
result).

In a 4-bit shift right barrel shifter, shift to right by
0, 1, 2 or 3 bit positions can be controlled by
setting the control inputs ($S_3 S_2 S_1 S_0$)
appropriately. Only one control input can be high
at any time.

BARREL SHIFTER
SHIFT RIGHT BARREL SHIFTER

INPUT          SHIFT (SWITCH)   OUTPUT (B3 B2 B1 B0)
A3 A2 A1 A0    0 (S0)           A3 A2 A1 A0
A3 A2 A1 A0    1 (S1)           A3 A3 A2 A1
A3 A2 A1 A0    2 (S2)           A3 A3 A3 A2
A3 A2 A1 A0    3 (S3)           A3 A3 A3 A3
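
The behaviour in the table (vacated positions filled with the sign bit A3) can be modelled in a few lines. This is a behavioural sketch, not a gate-level design; `barrel_shift_right` is a hypothetical name and the shift amount is assumed to lie between 0 and width-1.

```python
def barrel_shift_right(word, shift, width=4):
    """Shift a width-bit word right by `shift` positions, filling with the MSB."""
    sign = (word >> (width - 1)) & 1
    out = word >> shift
    if sign:
        # Replicate the sign bit into the `shift` vacated high positions.
        out |= ((1 << shift) - 1) << (width - shift)
    return out & ((1 << width) - 1)


for s in range(4):
    print(s, format(barrel_shift_right(0b1010, s), "04b"))
```

For the input 1010 the loop prints 1010, 1101, 1110 and 1111, matching the rows A3A2A1A0, A3A3A2A1, A3A3A3A2 and A3A3A3A3.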

BARREL SHIFTER
As barrel shifter is a combinational circuit,
the time taken to implement the shift is the
total combinational delay involved in
decoding the control lines and setting up
paths from i/p to o/p lines.
This delay is only fraction of a clock cycle.
Shifting normally combined with data transfer
& executed in single clock cycle.
A Barrel Shifter is to be designed with 16
inputs for left shifts from 0 to 15 bits. How
many control lines are required to
implement the shifter?

$\log_2 n$ control lines are required.
As the number of bits used to represent the
input is 16, $\log_2 16 = 4$ control inputs are
required.

BARREL SHIFTER PROBLEM
MULTIPLY & ACCUMULATE (MAC) UNIT
Most of the DSP applications require the
computation of the sum of the products of a series
of successive multiplications.
Requires add/subtract unit and additional register
called accumulator at the o/p of multiplier.
In order to implement such functions a special unit
called a Multiply and Accumulate (MAC) unit is
required.
A MAC consists of a multiplier and a special
register called Accumulator.

Although addition and multiplication are two
different operations, they can be performed in
parallel.
By the time the multiplier is computing the
product, accumulator can accumulate the product
of the previous multiplications.
Thus if N products are to be accumulated, N-1
multiplications can overlap with N-1 additions.
During the very first multiplication, accumulator
will be idle and during the last accumulation,
multiplier will be idle. Thus N+1 clock cycles are
required to compute the sum of N products.

MULTIPLY & ACCUMULATE (MAC) UNIT
MACs are used to
implement the
functions of the
type A+BC.

A typical MAC unit
is as shown in the
figure

MULTIPLY & ACCUMULATE (MAC) UNIT
If a sum of 256 products is to be
computed using a pipelined MAC unit,
and if the MAC execution time of the unit
is 100nsec, what will be the total time
required to complete the operation?

N = 256
The MAC unit requires N+1 = 257 execution cycles.
As the single MAC execution time is
100 nsec, the total time required will be
(257 x 100 nsec) = 25.7 µsec
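
The N+1 cycle count can be seen in a small cycle-by-cycle model of the overlap between multiplier and accumulator. This is a sketch under assumptions: `pipelined_mac` and `product_reg` (the register between the two stages) are illustrative names, and each loop iteration stands for one MAC execution cycle.

```python
def pipelined_mac(a, b):
    """Return (sum of products, cycles used) for a two-stage pipelined MAC."""
    acc = 0
    product_reg = None                 # product waiting to be accumulated
    cycles = 0
    for k in range(len(a) + 1):
        if product_reg is not None:
            acc += product_reg         # accumulate the previous product
        # The multiplier works on the next pair in the same cycle.
        product_reg = a[k] * b[k] if k < len(a) else None
        cycles += 1
    return acc, cycles


total, cycles = pipelined_mac([1] * 256, [1] * 256)
print(cycles, cycles * 100, "nsec")    # 257 cycles, 25700 nsec = 25.7 usec
```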


MAC UNIT: PROBLEM
While designing a MAC unit, attention has to be
paid to the word sizes encountered at the input of
the multiplier and the sizes of the add/subtract
unit and the accumulator, as there is a possibility
of overflow and underflows.

Overflow/underflow can be avoided by using any of
the following methods.
Using shifters at the input and the output of the MAC
Providing guard bits in the accumulator
Using saturation logic

MAC UNIT: OVERFLOW & UNDERFLOW
Shifters can be provided at the input of the
MAC to normalize the data and at the output to
de-normalize the same.
Shifters may also be used
to discard redundant sign bit in 2s complement
product.
or to shift the output by required number of
positions before saving, to preserve the maximum
possible accuracy.
This is done when the number to be saved is
preceded by several leading 0s or 1s.

MAC UNIT: SHIFTERS
In order to preserve the accuracy, i/ps to the
multiplier are not normalized.
Repetitive MAC operations increase the size of
the accumulated sum.
To handle this, extra bits are provided in the
accumulator called the extension bits or guard
bits. So, the size of add/sub unit also increases.
After the computation of complete result,
these bits may be saved as a separate word
or, the sum along with the guard bits may be shifted
by required amount and saved as a single word.
MAC UNIT: GUARD BITS
Consider a MAC unit whose inputs are 16 bit
numbers. If 256 products are to be summed up
in this MAC, how many guard bits should be
provided for the accumulator to prevent
overflow condition from occurring?

16 x 16 bits = 32 bit product (n = 32)
256 such products are added (N = 256)
Max size of sum = $n + \log_2 N = 32 + \log_2 256 = 32 + 8 = 40$ bits
Number of guard bits = 8
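
The guard bit calculation can be expressed as a short helper. The names `guard_bits` and `accumulator_width` are assumptions; the integer `bit_length` call is used in place of a floating point logarithm to compute the ceiling of log2(N).

```python
def guard_bits(num_terms):
    """Extra accumulator bits needed to add num_terms values: ceil(log2 N)."""
    return (num_terms - 1).bit_length()


def accumulator_width(term_bits, num_terms):
    """Accumulator width that holds the sum of num_terms term_bits-wide values."""
    return term_bits + guard_bits(num_terms)


print(guard_bits(256), accumulator_width(32, 256))   # 8 40
print(guard_bits(64), accumulator_width(10, 64))     # 6 16
```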

MAC UNIT: PROBLEM

MAC UNIT: GUARD BITS: PROBLEM
What should be the minimum width of the
accumulator in a DSP device that receives
10 bit A/D samples and is required to add 64
of them without causing an overflow?
As it is required to calculate the sum of 64, 10
bit numbers, the sum can be as long as
$(10 + \log_2 64) = 16$ bits. Hence the accumulator should
be capable of handling these 16 bits. Thus the
guard bits required will be (16-10) = 6 bits.

MAC UNIT: PROBLEM
Overflow/underflow will occur if the result goes
beyond the most positive number or below the
most negative number the accumulator can
handle. Thus the overflow/underflow error can
be resolved by loading the accumulator with the
most positive number which it can handle at the
time of overflow and the most negative number
that it can handle at the time of underflow.
In saturation logic (limiting the contents of the
accumulator to its saturation limits), as soon as
an overflow or underflow condition is satisfied
the accumulator will be loaded with the most
positive or most negative number, overriding the
result computed by the MAC unit.

MAC UNIT: SATURATION LOGIC
Monitor carry into & out of MSB. If they are
not equal, overflow/ underflow occur
(decided by sign bit)
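
Both ideas, clamping to the saturation limits and detecting overflow by comparing the carry into and out of the MSB, can be sketched as follows. The function names and the 16 bit default width are assumptions made for illustration.

```python
def saturating_add(acc, value, bits=16):
    """Add two signed values and clamp the result to the accumulator range."""
    max_pos = (1 << (bits - 1)) - 1
    max_neg = -(1 << (bits - 1))
    result = acc + value
    if result > max_pos:
        return max_pos        # overflow: load most positive number
    if result < max_neg:
        return max_neg        # underflow: load most negative number
    return result


def msb_carries_differ(a, b, bits=16):
    """True if carry into the MSB differs from carry out of it.

    a and b are raw two's complement bit patterns of the given width.
    """
    low_mask = (1 << (bits - 1)) - 1
    full_mask = (1 << bits) - 1
    carry_in = ((a & low_mask) + (b & low_mask)) >> (bits - 1)
    carry_out = ((a & full_mask) + (b & full_mask)) >> bits
    return carry_in != carry_out


print(saturating_add(32767, 1), saturating_add(-32768, -1))
print(msb_carries_differ(0x7FFF, 0x0001), msb_carries_differ(0x0001, 0xFFFF))
```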
MAC UNIT: SATURATION LOGIC
ARITHMETIC LOGIC UNIT
A DSP device should be capable of handling
arithmetic instructions like ADD, SUB, INC,
DEC, NEG etc, logical operations like AND,
OR , NOT, XOR and compare etc in addition
to shift, multiply, & MAC operations.

It consists of status flag register, register file
and multiplexers.



ARITHMETIC LOGIC UNIT
ALU includes circuitry to generate status
flags after arithmetic and logic operations.
These flags include sign, zero, carry and
overflow.

Status of accumulator after arithmetic and
logic operations is used for program
sequencing and scaling.

ALU UNIT: STATUS FLAGS
Depending on the status of overflow and sign
flags, the saturation logic can be used to limit
the accumulator content to its most positive
or most negative values.

ALU UNIT: OVFLOW MANAGEMENT
Instead of moving data in and out of the
memory during the operation, for better
speed, a large set of general purpose
registers are provided to store the
intermediate results of arithmetic
computations along with the accumulator
ALU UNIT: REGISTER FILE
BUS ARCHITECTURE
Conventional microprocessors use Von Neumann architecture
for memory management wherein the same memory is used
to store both the program and data.
Although this architecture is simple, it takes more number of
processor cycles for the execution of a single instruction as
the same bus is used for both data and program.

In order to increase the speed of operation, separate
memories were used to store program and data and a
separate set of data and address buses have been given to
both memories, the architecture called as Harvard
Architecture.

BUS ARCHITECTURE
Although the usage of separate
memories for data and the instruction
speeds up the processing, it will not
completely solve the problem.
As many of the DSP instructions
require more than one operand, use
of a single data memory leads to
fetching of the operands one after the
other, thus increasing the delay of
processing.
This problem can be overcome by
using two separate data memories
for storing operands separately, thus
in a single clock cycle both the
operands can be fetched together.

BUS ARCHITECTURE
Although the above architecture improves
the speed of operation, it requires more
hardware and interconnections, thus
increasing the cost and complexity of the
system.

Therefore there should be a trade off
between the cost and speed while selecting
memory architecture for a DSP.

BUS ARCHITECTURE
MEMORY: On Chip Memory
In order to have a faster execution of the DSP
functions, it is desirable to have some memory
along with their buses, located on chip.
As dedicated address and data buses are used
to access the memory, on-chip memories are
faster.
The buses of off-chip memories are often
multiplexed to reduce the pin count on the DSP.
Processor makes simultaneous accesses to all
memories, fewer accesses to external
memories, thus reducing interconnection
requirements to external devices.
Speed and size are the two key parameters to
be considered with respect to the on-chip
memories.
ON CHIP MEMORY: Speed, Size
On-chip memories should match the speeds of
the ALU operations in order to maintain the
single cycle instruction execution of the DSP.

In a given area of the DSP chip, it is desirable to
implement as many DSP functions as possible
to get the best possible performance.
Thus the area occupied by the on-chip memory
should be minimized without compromising on
the essential features so that there will be a
scope for implementing more number of DSP
functions on- chip.
MEMORY: Organization Of On Chip Memories
Separate stack is provided that can be directly accessed by
program counter e.g. interrupt and subroutine call and return.
Ideally whole memory (program and data spaces) required
for the implementation of any DSP algorithm has to reside
on-chip so that the whole processing can be completed in a
single execution cycle.
Although it looks as a better solution, it consumes more
space on chip, reducing the scope for implementing any
functional block on-chip, which in turn reduces the speed of
execution.
Hence some other alternatives in which the on-chip memory
can be organized are:
As many DSP algorithms require instructions to be executed
repeatedly (MAC & Loop), the instruction can be stored in the
external memory, once it is fetched can reside in the instruction
cache.
As result is saved only after repetitions are completed (i.e. less
frequently) there is no need to provide a separate memory for this
purpose. It is sufficient to provide only 2 blocks of on-chip memories to
hold operands required for execution of instructions.

The access times for memories on-chip should be
sufficiently small so that it can be accessed more than
once in every execution cycle. This way, fewer memory
blocks would be required to hold instructions, operands
and result.
E.g. dual access on-chip memories are fast enough to be
accessed twice in each instruction cycle.

On-chip memories can be configured dynamically so
that they can serve different purpose at different times
depending on the requirement.
For example, if a DSP has 2 blocks of on-chip memory, one
would be configured as program memory and the other as data
memory. But for the execution of instructions which require 2
operands simultaneously, both the blocks can hold operands
and be used as data memories; and the instruction can be
fetched from external memory or stored in instruction cache.

MEMORY: Organization Of On Chip Memories
DATA ADDRESSING CAPABILITIES
Data accessing capability of a programmable DSP
device is configured by means of its addressing
modes. The summary of the addressing modes
used in DSP is as shown in the table below.
ADDRESSING MODES
Immediate Addressing Mode
In this addressing mode, data is included in the
instruction itself. Data need to be a fixed number known
at the time of writing the instructions e.g. filter
coefficients.

Register Addressing Mode
In this mode, one of the registers will be holding the
data and the register has to be specified in the
instruction.

Direct Addressing Mode
In this addressing mode, instruction holds the
memory location of the operand. Memory address mem
should be known explicitly.

Immediate: ADD #imm    A ← A + #imm
Register:  ADD reg     A ← A + reg
Direct:    ADD mem     A ← A + mem
Indirect Addressing Mode
In this addressing mode, the operand is accessed using
a pointer. A pointer is generally a register, which holds the
address of the location where the operands resides.



Indirect addressing mode can be enhanced with a capability
to manipulate the pointer register just before (pre-) or after
(post-) its use, i.e. it can be extended to include automatic
increment or decrement capabilities, which has led to the
following addressing modes.

Indirect:  ADD *addreg    A ← A + *addreg
ADDRESSING MODES
Indirect Addressing Mode
Additional hardware such as adder, register etc is required.

ADDRESSING MODES
Identify the addressing modes of the
operands in each of the following
instructions
ADD #1234h
ADD 1234h
ADD *AR+
ADD offsetreg-,*AR
ADDRESSING MODES Problem
What are the memory addresses of the
operands in each of the following cases of
indirect addressing modes? In each case, what
will be the content of the addreg after the
memory access? Assume that the initial
contents of the addreg and the offsetreg are
0200h and 0010h, respectively.
ADD *addreg-
ADD +*addreg
ADD offsetreg+,*addreg
ADD *addreg,offsetreg-
ADDRESSING MODES Problem
SPECIAL ADDRESSING MODES
For the implementation of some real time applications
in DSP, normal addressing modes will not completely
serve the purpose. Thus some special addressing
modes are required for such applications. e.g.
computing DFT using FFT.

Two special addressing modes are

Circular addressing mode

Bit-Reversed addressing mode

CIRCULAR ADDRESSING MODE
While processing the data samples coming continuously in a
sequential manner, circular buffers are used.
In a circular buffer the data samples are stored sequentially
from the initial location till the buffer gets filled up.
Once the buffer gets filled up, the next data samples will get
stored once again from the initial location.
This process can go forever as long as the data samples are
processed in a rate faster than the incoming data rate.
To access a data sample from a circular buffer, circular
addressing mode is used.
Circular Addressing mode requires three registers:
Pointer register to hold the address of current location (PNTR)
Start Address Register to hold the starting address of the buffer
(SAR)
End Address Register to hold the ending address of the buffer
(EAR)

There are four special cases in this addressing
mode. They are
SAR<EAR, and updated PNTR>EAR
SAR<EAR, and updated PNTR<SAR
SAR>EAR, and updated PNTR>SAR
SAR>EAR, and updated PNTR<EAR
The buffer size in the first two cases= EAR-
SAR+1
In the last two it is= SAR-EAR+1


CIRCULAR ADDRESSING MODE
The pointer updating algorithm is as shown below.
CIRCULAR ADDRESSING MODE
Case 1: SAR<EAR, updated PNTR>EAR
BL = EAR-SAR+1
NEW PNTR = updated PNTR-BL
Example (8 location buffer, addresses 0 to 7): SAR=0, EAR=7, BL=8,
updated PNTR=10, new PNTR=2
CIRCULAR ADDRESSING MODE
Case 2: SAR<EAR, updated PNTR<SAR
BL = EAR-SAR+1
NEW PNTR = updated PNTR+BL
Example (8 location buffer, addresses 0 to 7): SAR=0, EAR=7, BL=8,
updated PNTR=-3, new PNTR=5
CIRCULAR ADDRESSING MODE
Case 3: SAR>EAR, updated PNTR>SAR
BL = SAR-EAR+1
NEW PNTR = updated PNTR-BL
Example (8 location buffer, addresses 0 to 7): SAR=7, EAR=0, BL=8,
updated PNTR=9, new PNTR=1
CIRCULAR ADDRESSING MODE
Case 4: SAR>EAR, updated PNTR<EAR
BL = SAR-EAR+1
NEW PNTR = updated PNTR+BL
Example (8 location buffer, addresses 0 to 7): SAR=7, EAR=0, BL=8,
updated PNTR=-2, new PNTR=6
CIRCULAR ADDRESSING MODE: PROBLEM
A DSP has a circular buffer with the start and the
end addresses as 0200h and 020Fh respectively.
What would be the new values of the address
pointer of the buffer if, in the course of address
computation, it gets updated to
a. 0212h
b. 01FCh
Buffer Length = (EAR-SAR+1) = 020Fh-0200h+1 = 10h
a. New Address Pointer = Updated Pointer - buffer
length = 0212h-10h = 0202h
b. New Address Pointer = Updated Pointer + buffer
length = 01FCh+10h = 020Ch
CIRCULAR ADDRESSING MODE: PROBLEM
Repeat the previous problem for SAR=
0210h and EAR=0201h

Buffer Length = (SAR-EAR+1) = 0210h-0201h+1 = 10h
a. New Address Pointer = Updated Pointer -
buffer length = 0212h-10h = 0202h
b. New Address Pointer = Updated Pointer +
buffer length = 01FCh+10h = 020Ch
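
The pointer updating rule for all four cases can be collected into one function. This is a minimal sketch: `wrap_pointer` is a hypothetical name, and it assumes the updated pointer overshoots the buffer by less than one buffer length, so a single correction is enough.

```python
def wrap_pointer(updated, sar, ear):
    """Return the new pointer after a circular-buffer address update."""
    low, high = (sar, ear) if sar < ear else (ear, sar)
    buffer_length = high - low + 1
    if updated > high:
        return updated - buffer_length   # ran past the high address
    if updated < low:
        return updated + buffer_length   # ran below the low address
    return updated                       # still inside the buffer


print(hex(wrap_pointer(0x0212, 0x0200, 0x020F)))   # 0x202
print(hex(wrap_pointer(0x01FC, 0x0200, 0x020F)))   # 0x20c
```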
BIT REVERSED ADDRESSING MODE
To implement FFT algorithms we need to
access the data in a bit reversed manner.
Hence a special addressing mode called bit
reversed addressing mode is used to calculate
the index of the next data to be fetched.
It works as follows. Start with index 0. The
present index can be calculated by adding half
the FFT length to the previous index in a bit
reversed manner, carry being propagated from
MSB to LSB.
Current index = Previous index +B (1/2 x FFT size),
where +B denotes addition with the carry propagated from MSB to LSB.

Compute the indices for a 16-point FFT using bit reversed
addressing mode.

Add half the FFT length to the previous index in a
bit reversed manner, carry being propagated from
MSB to LSB.

Index   Previous index +B (1/2 FFT size)   Bit reversed index   Decimal
0       0000                               0000                 0
1       0000+1000                          1000                 8
2       1000+1000                          0100                 4
3       0100+1000                          1100                 12
4       1100+1000                          0010                 2
5       0010+1000                          1010                 10
6       1010+1000                          0110                 6
7       0110+1000                          1110                 14
8       1110+1000                          0001                 1
9       0001+1000                          1001                 9
10      1001+1000                          0101                 5
11      0101+1000                          1101                 13
12      1101+1000                          0011                 3
13      0011+1000                          1011                 11
14      1011+1000                          0111                 7
15      0111+1000                          1111                 15
BIT REVERSED ADDRESSING MODE: PROBLEM
Another method: increment the index normally in binary, then reverse its bits.

Index   Binary index       Bit reversed index   Decimal
0       0000               0000                 0
1       0000+0001=0001     1000                 8
2       0001+0001=0010     0100                 4
3       0010+0001=0011     1100                 12
4       0011+0001=0100     0010                 2
5       0100+0001=0101     1010                 10
6       0101+0001=0110     0110                 6
7       0110+0001=0111     1110                 14
8       0111+0001=1000     0001                 1
9       1000+0001=1001     1001                 9
10      1001+0001=1010     0101                 5
11      1010+0001=1011     1101                 13
12      1011+0001=1100     0011                 3
13      1100+0001=1101     1011                 11
14      1101+0001=1110     0111                 7
15      1110+0001=1111     1111                 15
BIT REVERSED ADDRESSING MODE: PROBLEM
Compute the indices for an 8-point FFT using bit reversed
addressing mode.

Index   Previous index +B (1/2 FFT size)   Bit reversed index   Decimal
0       000                                000                  0
1       000+100                            100                  4
2       100+100                            010                  2
3       010+100                            110                  6
4       110+100                            001                  1
5       001+100                            101                  5
6       101+100                            011                  3
7       011+100                            111                  7
BIT REVERSED ADDRESSING MODE: PROBLEM
Another method: increment the index normally in binary, then reverse its bits.

Index   Binary index     Bit reversed index   Decimal
0       000              000                  0
1       000+001=001      100                  4
2       001+001=010      010                  2
3       010+001=011      110                  6
4       011+001=100      001                  1
5       100+001=101      101                  5
6       101+001=110      011                  3
7       110+001=111      111                  7
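
The bit reversed index sequence can be generated in software by modelling the reverse-carry addition. The sketch below is illustrative only: `reverse_carry_add` and `bit_reversed_sequence` are hypothetical names, and the FFT size is assumed to be a power of two. Adding with the carry travelling from MSB to LSB is modelled by reversing both operands, adding normally, and reversing the result.

```python
def reverse_carry_add(index, increment, bits):
    """Add two bits-wide values with the carry propagating from MSB to LSB."""
    def rev(v):
        r = 0
        for _ in range(bits):
            r = (r << 1) | (v & 1)
            v >>= 1
        return r
    return rev((rev(index) + rev(increment)) & ((1 << bits) - 1))


def bit_reversed_sequence(fft_size):
    """Indices visited when half the FFT size is added in bit reversed manner."""
    bits = fft_size.bit_length() - 1
    index = 0
    sequence = []
    for _ in range(fft_size):
        sequence.append(index)
        index = reverse_carry_add(index, fft_size // 2, bits)
    return sequence


print(bit_reversed_sequence(8))    # [0, 4, 2, 6, 1, 5, 3, 7]
```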
ADDRESS GENERATION UNIT
The main job of the Address Generation Unit is
to generate the address of the operands
required to carry out the operation. They have
to work fast in order to satisfy the timing
constraints.

As the address generation unit has to perform
some mathematical operations in order to
calculate the operand address, it is provided
with a separate ALU.

Address generation typically involves one of the
following operations.
Getting value from immediate operand, register or a
memory location
Incrementing/ decrementing the current address
Adding/subtracting the offset from the current
address
Generating new address according to circular
addressing mode (get updated address from
current address & using pointer updating algorithm)
Generating new address using bit reversed
addressing mode

ADDRESS GENERATION UNIT
PROGRAMMABILITY AND PROGRAM CONTROL

A programmable DSP device should provide
the programming capability involving
branching, looping and subroutines i.e.
programs involving these should be easily
written.
Normal execution sequence altered conditionally
or unconditionally by using branching
A section of program is repeated the desired
number of times by using looping.
Structured software is developed by using
subroutine handling instructions

The implementation of repeat capability should be
hardware based so that it can be programmed with
minimal or zero overhead.
A dedicated register can be used as a counter to keep track
of number of times the execution of a block of instructions
remains to be repeated.

In a normal subroutine call, return address has to be
stored in a stack thus requiring memory access for
storing and retrieving the return address, which in turn
reduces the speed of operation.

PROGRAMMABILITY AND PROGRAM CONTROL

To save the return address as well as to restore it on
return, the processor requires to carry out memory read
and write operations using system data bus.

These add to the overhead and make overall program
execution slow, thus lowering the performance.

Hence a LIFO memory can be directly interfaced
with the program counter to save the return
address.

This avoids the use of system bus for accessing stack
and speeds up subroutine branching as well as its return.

PROGRAMMABILITY AND PROGRAM CONTROL

PROGRAM CONTROL
Like microprocessors, DSP also requires a control unit
to provide necessary control and timing signals for the
proper execution of the instructions.
In microprocessors, control is microcode based
and is implemented by means of a micro-coded
sequencer.
Each instruction is broken down into microinstructions
stored in micro memory (micro store) as a micro code.
Whenever one of the instructions need to be executed,
the corresponding microcode is called from the micro
store and executed.
It is easy to design and implementation uses less
hardware. But this process is not fast as several
accesses to micro store need to be done per
instruction.
For DSP, speed of execution is the critical
issue.

Hence the control is hardwired,
where the control unit is designed as a
single, comprehensive hardware unit taking
into account the complete instruction set of the
DSP.

Hardware complexity increases. The design
is not easy to change to incorporate
additional features, but is faster.

PROGRAM SEQUENCER
It is the part of the control unit that generates, in
sequence, the addresses needed to access
instructions.
It calculates the address of the next instruction to be
fetched. The next address can be from one of the
following sources.
Program Counter which is incremented after every
instruction fetch.
Instruction register which holds the address of instruction in
case of branching, looping and subroutine calls
Interrupt Vector table in case of interrupt service routine
Stack which holds the return address in case of return from
subroutine, interrupt service routine and end of loops.
Program sequencer acts as a multiplexer to select
the address of next instruction from one of the
input sources.
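
The multiplexer view of the sequencer can be sketched as a single selection function. The source labels ("pc", "instruction", "interrupt", "stack") and the parameter names are hypothetical and chosen only to mirror the four sources listed above; one address per instruction is assumed.

```python
def next_address(source, pc, instruction_target=None,
                 vector_table=None, interrupt_number=None, stack=None):
    """Select the next instruction address from one of four sources."""
    if source == "pc":           # normal sequential flow
        return pc + 1
    if source == "instruction":  # branch, loop or call target from the instruction
        return instruction_target
    if source == "interrupt":    # start of the interrupt service routine
        return vector_table[interrupt_number]
    if source == "stack":        # return from subroutine or interrupt
        return stack.pop()
    raise ValueError("unknown address source: " + source)


vectors = {0: 0x0004, 1: 0x0008}  # hypothetical interrupt vector table
saved = [0x0123]                  # return address saved earlier

print(next_address("pc", pc=0x0100))                                  # 257
print(next_address("instruction", pc=0x0100, instruction_target=64))  # 64
print(next_address("interrupt", pc=0x0100, vector_table=vectors,
                   interrupt_number=1))                               # 8
print(next_address("stack", pc=0x0100, stack=saved))                  # 291
```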
The program sequencer should have the following
circuitry:

A PC that is updated after every fetch

A counter to hold the count in case of looping

A stack to hold the return address for subroutines
and interrupt service routines, and for use while
executing loops and repeat instructions.


A logic block to check conditions for
execution of conditional jump and loop
(when to terminate) instructions.
This logic, called the condition logic, tests various
arithmetic conditions by means of status flags to decide
whether conditional jump and loop instructions are to be
executed.

This logic also monitors repeat and loop counters to
determine when these have to be terminated to return to
the normal program flow.
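
A minimal sketch of such condition logic is given below. The flag names (zero, negative, overflow) and the condition codes EQ, NE, LT and GE are assumptions for illustration; actual devices define their own flags and mnemonics.

```python
def condition_true(condition, flags):
    """Decide whether a conditional jump should be taken.

    flags is a dict with boolean entries 'zero', 'negative' and
    'overflow', as left behind by the previous arithmetic operation.
    """
    zero = flags["zero"]
    negative = flags["negative"]
    overflow = flags["overflow"]
    tests = {
        "EQ": zero,                  # result was zero
        "NE": not zero,              # result was non-zero
        "LT": negative != overflow,  # signed less-than
        "GE": negative == overflow,  # signed greater-than-or-equal
    }
    return tests[condition]


def repeat_finished(repeat_counter):
    """A loop or repeat ends when its counter has run down to zero."""
    return repeat_counter == 0


flags = {"zero": False, "negative": True, "overflow": False}
print(condition_true("LT", flags))  # True
print(repeat_finished(3))           # False
```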





FEATURES FOR EXTERNAL INTERFACE
A DSP device should be able to communicate with the
outside world, which provides the signals to be
processed and receives the processed signals.
Peripherals such as interfaces for interrupts, direct
memory access, serial i/o and parallel i/o are
needed in the DSP system.
A DSP is a digital device. Therefore, to process
analog signals, conversion from analog to digital
and from digital to analog representation needs to be
carried out outside the device.
The DSP should be capable of handling commonly
available serial and parallel signal converters.
Appropriate address, data and control signals
should be available to set up interfaces with the
peripherals.
A timer is needed to implement events at regular intervals of
time, e.g. periodically triggering the A/D converter to
start a conversion and interrupting the processor, so that
data acquisition can proceed in the background
simultaneously with execution of signal
processing programs.
SPEED ISSUES
For fast execution of algorithms, DSP
architecture includes features that facilitate
high speed operation and large throughput.

This is possible because of advances in VLSI
technology and design innovations.
SPEED ISSUES HARDWARE ARCHITECTURE
Functions like multiplication, scaling, loops and
repeats, and special addressing modes are
essential and need to be executed in the
shortest possible time.
This is achieved by specially designed hardware
units, such as multipliers, which reduce
overheads and increase the speed.
Harvard architecture, dual data memories, on-chip
memories and an instruction cache, with individual
buses for each, speed up the execution.
The other techniques to speed up the operation
in DSP are parallelism and pipelining.

SPEED ISSUES PARALLELISM
Separate functional units for the most commonly
used DSP operations, such as add, multiply and
shift, operate in parallel and increase the
throughput.
E.g. use of separate ALUs for computation of
data and address.
Provision of multiple memories with individual
buses for each.
Thus, an algorithm can perform more than one
operation at the same time and increase the
speed.

The instructions are structured to carry out
the required operations in parallel.
Complex hardware is required to control
these units and the controller is hardwired to
ensure high speed.
Data and instructions are fetched
simultaneously.
The multiply and accumulate operation is executed in a
single clock cycle through the following steps:
Fetch instructions & multiple data required for
computation.
Shift data as they are fetched in order to accomplish
scaling.
Carry out a multiplication operation on the fetched
data.
Add the product to the previously computed result in
the accumulator.
Save the accumulator contents in the memory
storage, if required.
Compute new addresses for the instruction and data
required for the next operation
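
The steps above can be sketched in software, one function call standing for one MAC cycle. In hardware the steps run in parallel; here they are written out in sequence only to make each one visible. The function names, the integer data and the right shift used for scaling are assumptions for illustration.

```python
def mac_cycle(acc, coeffs, samples, coeff_addr, sample_addr, shift=0):
    """Model of one MAC cycle; returns the new accumulator and addresses."""
    coeff = coeffs[coeff_addr]               # fetch the first operand
    sample = samples[sample_addr] >> shift   # fetch and scale the second operand
    product = coeff * sample                 # multiply
    acc = acc + product                      # add to the accumulated result
    return acc, coeff_addr + 1, sample_addr + 1  # addresses for the next cycle


def accumulate(coeffs, samples, shift=0):
    """Run one MAC cycle per coefficient and return the accumulator."""
    acc, c_addr, s_addr = 0, 0, 0
    for _ in range(len(coeffs)):
        acc, c_addr, s_addr = mac_cycle(acc, coeffs, samples,
                                        c_addr, s_addr, shift)
    return acc


print(accumulate([1, 2, 3], [4, 5, 6]))           # 1*4 + 2*5 + 3*6 = 32
print(accumulate([1, 2, 3], [4, 5, 6], shift=1))  # 1*2 + 2*2 + 3*3 = 15
```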
SPEED ISSUES PIPELINING
In a pipelined architecture, the instruction to be
executed is broken into a number of steps. A
separate unit performs each of these steps.
When the 1st of these units performs the 1st step on
the current instruction, the 2nd unit will be performing
the 2nd step on the previous instruction, the 3rd unit will be
performing the 3rd step on the instruction prior to that,
etc.
If p steps were required to complete the execution of
each instruction, it would take p units of time for the
complete execution of each instruction.
Since all the units will work all the time, one output
will flow out of the architecture at the end of each
time unit, and the throughput can be maintained at
one instruction per unit time.
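
The gain can be put in numbers under two assumptions that are not stated on the slide: every step takes the same time \tau, and the pipeline never stalls. N, the number of instructions executed, is a symbol introduced here.

\[
T_{\text{pipelined}} = (p + N - 1)\,\tau, \qquad
T_{\text{non-pipelined}} = p\,N\,\tau
\]

\[
\text{speed-up} = \frac{p\,N}{p + N - 1} \;\rightarrow\; p \quad \text{as } N \rightarrow \infty
\]

For example, with p = 5 and N = 100 the pipelined time is 104\tau against 500\tau without pipelining, a speed-up of about 4.8.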
SPEED ISSUES PIPELINING
Problems:
1. Dividing each instruction into steps that take equal amounts
of time, and designing the architectural units accordingly,
is not always possible; the slowest unit decides
the throughput.
2. Extra time is required at the start, as the pipeline has to
be filled before the result of the 1st instruction can flow out. This
initial delay is called pipeline latency (equal to the number of
units in the pipeline).
3. When there is a change in the instruction sequence, as in
branching or looping, the pipeline has to be cleared before the
steps of the new instruction are loaded, thereby causing delay.
This delay can be reduced by using additional hardware to predict
branching ahead of time and not filling the pipeline beyond
the branch instruction.
Time Slot   Step 1   Step 2   Step 3   Step 4   Step 5   Result
t0          Inst 1
t1          Inst 2   Inst 1
t2          Inst 3   Inst 2   Inst 1
t3          Inst 4   Inst 3   Inst 2   Inst 1
t4          Inst 5   Inst 4   Inst 3   Inst 2   Inst 1   Inst 1 complete
t5          Inst 6   Inst 5   Inst 4   Inst 3   Inst 2   Inst 2 complete

The execution of an instruction can be broken into 5
steps: instruction fetch, instruction decode, operand
fetch, execute, and save the result.
The output corresponding to the 1st instruction is
available after 5 units of time. However, once the
results start to come out, we get an output after
each unit of time.
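
The occupancy of a 5-step pipeline over successive time slots can be generated with a few lines of code. This is a sketch under the assumption of an ideal pipeline (equal step times, no branches or stalls); the function name pipeline_schedule is hypothetical.

```python
def pipeline_schedule(num_instructions, num_stages, num_slots):
    """Return, for each time slot, the instruction number in each stage.

    Instructions are numbered from 1; None marks an empty stage.
    Instruction k enters stage 1 in slot k - 1 and advances one
    stage per time slot.
    """
    rows = []
    for slot in range(num_slots):
        row = []
        for stage in range(num_stages):
            inst = slot - stage + 1
            row.append(inst if 1 <= inst <= num_instructions else None)
        rows.append(row)
    return rows


for slot, row in enumerate(pipeline_schedule(6, 5, 6)):
    cells = ["Inst %d" % i if i is not None else "-" for i in row]
    print("t%d" % slot, *cells)
```

The last stage first holds Inst 1 in slot t4, i.e. after 5 units of time, and holds a new instruction in every slot after that.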
SYSTEM LEVEL PARALLELISM & PIPELINING
EXAMPLE
8-tap (8 coefficients) FIR filter

The filter can be implemented in many ways
depending on the number of multipliers and
accumulators available.

Implementation using single MAC
Pipelined implementation using 8 multipliers & 8
accumulators.
Parallel implementation using 2 MAC units

\[
y[n] = \sum_{i=0}^{7} h[i]\, x[n-i]
\]

\[
\begin{aligned}
y[n]   &= x[n]h[0] + x[n-1]h[1] + x[n-2]h[2] + x[n-3]h[3] + x[n-4]h[4] + x[n-5]h[5] + x[n-6]h[6] + x[n-7]h[7] \\
y[n+1] &= x[n+1]h[0] + x[n]h[1] + x[n-1]h[2] + x[n-2]h[3] + x[n-3]h[4] + x[n-4]h[5] + x[n-5]h[6] + x[n-6]h[7] \\
y[n+2] &= x[n+2]h[0] + x[n+1]h[1] + x[n]h[2] + x[n-1]h[3] + x[n-2]h[4] + x[n-3]h[5] + x[n-4]h[6] + x[n-5]h[7] \\
y[n+3] &= x[n+3]h[0] + x[n+2]h[1] + x[n+1]h[2] + x[n]h[3] + x[n-1]h[4] + x[n-2]h[5] + x[n-3]h[6] + x[n-4]h[7]
\end{aligned}
\]


SPEED ISSUES: Implementation using single
MAC
One multiplier and one accumulator are available.

Each i/p is delayed from the previous by 8T, where
T = time taken by the MAC to compute one product term
and add it to the previously accumulated sum in the
accumulator.
I/p samples and filter coefficients are fed to the multiplier
through multiplexers, which are controlled such
that the correct combination of a sample and the
corresponding filter coefficient is fed to the
multiplier at a given time.
Each product term is generated and added to the
previously accumulated sum in the accumulator (MAC
unit).
After all the 8 product terms are accumulated, the
MAC contents are available as the o/p.



O/p y[n] is available 8T units after x[n] is made
available to the filter.
At this time, a new sample x[n+1] is applied to
the filter.
The filter then uses 8 samples, namely x[n+1],
x[n], x[n-1], ..., x[n-6], to compute y[n+1] after
another 8T units of time.
Thus, this implementation can take in a fresh i/p
sample once every 8T units of time and
generate an o/p sample at the same rate.
Maximum sampling rate = 1/(8T)
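
The single-MAC computation of one o/p sample can be sketched as a loop that performs one multiply-accumulate per coefficient. The function name fir_single_mac and the step counter are added for illustration; each counted step stands for one MAC time T.

```python
def fir_single_mac(h, x_history):
    """Compute one output sample with a single MAC unit.

    h         : filter coefficients h[0] .. h[7]
    x_history : samples x[n], x[n-1], ..., x[n-7], newest first
    Returns the output y[n] and the number of MAC steps used.
    """
    acc = 0
    mac_steps = 0
    for coeff, sample in zip(h, x_history):
        acc += coeff * sample  # one product term per MAC time T
        mac_steps += 1
    return acc, mac_steps


h = [1, 1, 1, 1, 1, 1, 1, 1]   # example coefficients
x = [1, 2, 3, 4, 5, 6, 7, 8]   # example samples, newest first
print(fir_single_mac(h, x))    # (36, 8): one output needs 8 MAC steps, i.e. 8T
```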

SPEED ISSUES: Pipelined implementation
using 8 multipliers & 8 accumulators.

More multipliers and accumulators can speed up the
implementation of the FIR filter.
8 multipliers and 8 accumulators are connected in a pipelined
structure.
Each multiplier computes one product and passes it to the
corresponding accumulator, which in turn adds it to the
summation passed on from the previous accumulator.
Since all multipliers and accumulators work all the time, a
new o/p sample is generated once every T units of time.
T = time required by the multiplier and accumulator to
compute one product term and add it to the sum passed on
from the previous stage of the pipeline.
A new i/p sample is taken in every T units of time and
an o/p sample is generated at the same rate.
This is 8 times faster than the single-MAC implementation.
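
One arrangement consistent with this description is sketched below: every stage multiplies the current i/p sample by its own coefficient and adds the partial sum handed on by the previous stage one time unit earlier. This particular wiring (the transposed form) is an assumption, since the figure is not reproduced here, and the function names are hypothetical.

```python
def fir_pipelined(h, samples):
    """Model a multiplier/accumulator chain producing one output per time unit."""
    taps = len(h)
    regs = [0] * taps  # one accumulator register per stage, initially empty
    outputs = []
    for x in samples:  # one loop pass = one time unit T
        new_regs = [0] * taps
        for k in range(taps):
            carried = regs[k - 1] if k > 0 else 0  # sum from the previous stage
            new_regs[k] = carried + h[taps - 1 - k] * x
        regs = new_regs
        outputs.append(regs[-1])  # the last stage holds the finished output
    return outputs


def fir_direct(h, samples):
    """Reference: y[n] = sum of h[i] * x[n-i], with x = 0 before the start."""
    return [sum(h[i] * samples[n - i] for i in range(len(h)) if n - i >= 0)
            for n in range(len(samples))]


h = [1, 2, 3, 4, 5, 6, 7, 8]
x = [1, 0, 2, -1, 3, 5, 0, 4, 1, 2]
print(fir_pipelined(h, x) == fir_direct(h, x))  # True
```

All 8 stages do useful work in every time unit, so one o/p sample emerges per T once the chain is full.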

SPEED ISSUES: Parallel implementation
using 2 MAC units
This uses 2 MAC units and an adder at the o/p.
Each MAC computes 4 of the 8 product terms.
I/p samples and filter coefficients are fed to the MACs
using multiplexers that are controlled such that the
correct combinations are fed to the 2 MACs at any given
time.
If T time units are required to compute one pair of
products and add them to the previously accumulated sums in
the MAC units, it will require 4T units of time to generate
the final o/p by adding the o/ps of the 2 MACs.
At this time, a new i/p sample can be applied to the
filter for computation of the next o/p sample.
The speed of this implementation is 2 times that of the
one-MAC implementation and one-fourth that of the pipelined
8-multiplier, 8-accumulator implementation.
The maximum rate at which i/p samples can be applied to
this filter implementation is 2 times that of the first
implementation and one-fourth that of the second.
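
The two-MAC scheme can be sketched as two accumulators updated in the same step, followed by a final addition. Assigning the first 4 product terms to one MAC and the last 4 to the other is an assumption; any split into two groups of 4 gives the same result. The function name fir_two_macs is hypothetical.

```python
def fir_two_macs(h, x_history):
    """Compute one output sample with two MAC units working in parallel.

    x_history holds x[n], x[n-1], ..., x[n-7], newest first.
    Returns the output y[n] and the number of parallel MAC steps used.
    """
    half = len(h) // 2
    acc_a = 0  # accumulator of the first MAC unit
    acc_b = 0  # accumulator of the second MAC unit
    steps = 0
    for i in range(half):
        acc_a += h[i] * x_history[i]                # first MAC, terms 0..3
        acc_b += h[half + i] * x_history[half + i]  # second MAC, terms 4..7
        steps += 1                                  # both happen in the same time T
    return acc_a + acc_b, steps                     # adder at the output


h = [1, 1, 1, 1, 1, 1, 1, 1]
x = [1, 2, 3, 4, 5, 6, 7, 8]
print(fir_two_macs(h, x))  # (36, 4): same output as one MAC, in 4 steps instead of 8
```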

8-TAP FIR FILTER IMPLEMENTATION: Performance Summary

Type of implementation                       Maximum sample rate   Maximum throughput
One MAC                                      1/(8T)                One sample in 8T units of time
Pipelined (8 multipliers & 8 accumulators)   1/T                   One sample in T units of time
Two MACs                                     1/(4T)                One sample in 4T units of time

T = MAC time
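
To get a feel for the numbers, the three sample rates can be evaluated for a given MAC time. The value T = 10 ns used below is purely an assumed figure for illustration, and the function name max_sample_rate is hypothetical.

```python
def max_sample_rate(mac_time, mac_times_per_output):
    """Maximum sample rate in samples per second.

    mac_time             : T, in seconds
    mac_times_per_output : number of MAC times T spent per output sample
    """
    return 1.0 / (mac_times_per_output * mac_time)


T = 10e-9  # assumed MAC time of 10 ns

implementations = [
    ("One MAC", 8),
    ("Pipelined (8 multipliers & 8 accumulators)", 1),
    ("Two MACs", 4),
]
for name, mac_times in implementations:
    rate_mhz = max_sample_rate(T, mac_times) / 1e6
    print("%-45s %6.1f MHz" % (name, rate_mhz))
```

With this assumed T, the three implementations support roughly 12.5 MHz, 100 MHz and 25 MHz sample rates respectively.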
