
Architectural structure of the ’C5x

The ’C5x uses an advanced, modified Harvard-type architecture based on the ’C25
architecture and maximizes processing power with separate buses for program memory and data
memory. The instruction set supports data transfers between the two memory spaces. All ’C5x
DSPs have the same CPU structure; however, they have different on-chip memory configurations
and on-chip peripherals.

Bus Structure
Separate program and data buses allow simultaneous access to program instructions and data,
providing a high degree of parallelism.
The ’C5x architecture is built around four major buses:
• Program bus (PB)
• Program address bus (PAB)
• Data read bus (DB)
• Data read address bus (DAB)
The PAB provides addresses to program memory space for both reads and writes. The PB carries the instruction code and immediate operands from program memory space to the CPU. The
DB interconnects various elements of the CPU to data memory space. The program and data buses
can work together to transfer data from on-chip data memory and internal or external program
memory to the multiplier for single-cycle multiply/accumulate operations.
Central Processing Unit (CPU)
The ’C5x CPU consists of these elements:
• Central arithmetic logic unit (CALU)
• Parallel logic unit (PLU)
• Auxiliary register arithmetic unit (ARAU)
• Memory-mapped registers
• Program controller
 Central Arithmetic Logic Unit (CALU)
The CPU uses the CALU to perform 2s-complement arithmetic.
The CALU consists of these elements:
• 16-bit × 16-bit multiplier
• 32-bit arithmetic logic unit (ALU)
• 32-bit accumulator (ACC)
• 32-bit accumulator buffer (ACCB)
• Additional shifters at the outputs of both the accumulator and the product register (PREG)
 Parallel Logic Unit (PLU)
The CPU includes an independent PLU, which operates separately from, but in parallel with,
the ALU. The PLU performs Boolean operations or the bit manipulations required of high-
speed controllers. The PLU can set, clear, test, or toggle bits in a status register, control register,
or any data memory location. The PLU provides a direct logic operation path to data memory
values without affecting the contents of the ACC or PREG. Results of a PLU function are
written back to the original data memory location.
 Auxiliary Register Arithmetic Unit (ARAU)
The CPU includes an unsigned 16-bit arithmetic logic unit that calculates indirect addresses by using inputs from the auxiliary registers (ARs), the index register (INDX), and the auxiliary register compare register (ARCR). The ARAU can autoindex the current AR while the data memory location is being addressed and can index either by ±1 or by the contents of the INDX. As a result, accessing data does not require the CALU for address manipulation; the CALU is therefore free for other operations in parallel.
 Memory-Mapped Registers
The ’C5x has 96 registers mapped into page 0 of the data memory space. All ’C5x DSPs have
28 CPU registers and 16 input/output (I/O) port registers but have different numbers of
peripheral and reserved registers. The memory-mapped registers are used for indirect data
address pointers, temporary storage, CPU status and control, or integer arithmetic processing
through the ARAU.
 Program Controller
The program controller contains logic circuitry that decodes the operational instructions,
manages the CPU pipeline, stores the status of CPU operations, and decodes the conditional
operations. The parallelism of the architecture lets the ’C5x perform three concurrent memory operations in any given machine cycle: fetch an instruction, read an operand, and write an operand.
The program controller consists of these elements:
• Program counter
• Status and control registers
• Hardware stack
• Address generation logic
• Instruction register

On-Chip Memory
The ’C5x architecture contains a considerable amount of on-chip memory to aid in system
performance and integration:
• Program read-only memory (ROM)
• Data/program dual-access RAM (DARAM)
• Data/program single-access RAM (SARAM)
The ’C5x has a total address range of 224K words × 16 bits. The memory space is divided into
four individually selectable memory segments: 64K-word program memory space, 64K-word
local data memory space, 64K-word input/output ports, and 32K-word global data memory space.
 Program ROM
All ’C5x DSPs carry a 16-bit on-chip maskable programmable ROM. This memory is used for
booting program code from slower external ROM or EPROM to fast on-chip or external RAM.
The on-chip ROM is selected at reset by driving the MP/MC pin low. If the on-chip ROM is
not selected, the ’C5x devices start execution from off-chip memory.
 Data/Program Dual-Access RAM
All ’C5x DSPs carry a 1056-word × 16-bit on-chip dual-access RAM (DARAM). The DARAM
is divided into three individually selectable memory blocks: 512-word data or program DARAM
block B0, 512-word data DARAM block B1, and 32-word data DARAM block B2. The DARAM
is primarily intended to store data values but, when needed, can be used to store programs as well.
DARAM blocks B1 and B2 are always configured as data memory; however, DARAM block B0
can be configured by software as data or program memory.
The DARAM can be configured in one of two ways:
• All 1056 words × 16 bits configured as data memory
• 544 words × 16 bits configured as data memory and 512 words × 16 bits configured as program memory
 Data/Program Single-Access RAM
All ’C5x DSPs except the ’C52 carry a 16-bit on-chip single-access RAM
(SARAM) of various sizes.
The SARAM can be configured by software in one of three ways:
• All SARAM configured as data memory
• All SARAM configured as program memory
• SARAM configured as both data memory and program memory
The SARAM is divided into 1K- and/or 2K-word blocks that are contiguous in the memory address space. All ’C5x CPUs support parallel accesses to these SARAM blocks. However, each SARAM block can be accessed only once per machine cycle.
SARAM supports more flexible address mapping than DARAM because SARAM can be
mapped to both program and data memory space simultaneously.
However, because of simultaneous program and data mapping, an instruction fetch and data
fetch that could be performed in one machine cycle with DARAM may take two machine
cycles with SARAM.

On-Chip Memory Protection


The ’C5x DSPs have a maskable option that protects the contents of on-chip memories. When the
related bit is set, no externally originating instruction can access the on-chip memory spaces.

On-Chip Peripherals
All ’C5x DSPs have the same CPU structure; however, they have different on-chip
peripherals connected to their CPUs. The ’C5x DSP on-chip peripherals available are:
• Clock generator:
The clock generator consists of an internal oscillator and a phase-locked loop (PLL) circuit. The
clock generator can be driven internally by a crystal resonator circuit or driven externally by a
clock source. The PLL circuit can generate an internal CPU clock by multiplying the clock source
by a specific factor, so you can use a clock source with a lower frequency than that of the CPU.
• Hardware timer:
A 16-bit hardware timer with a 4-bit prescaler is available. This programmable timer clocks at a
rate that is between 1/2 and 1/32 of the machine cycle rate (CLKOUT1), depending upon the
timer’s divide-down ratio. The timer can be stopped, restarted, reset, or disabled by specific status
bits.
• Software-programmable wait-state generators:
Software-programmable wait-state logic is incorporated in ’C5x DSPs, allowing wait-state generation without any external hardware for interfacing with slower off-chip memory
and I/O devices. This feature consists of multiple wait state generating circuits. Each circuit is
user-programmable to operate in different wait states for off-chip memory accesses.
• Parallel I/O ports:
A total of 64K I/O ports are available; sixteen of these ports are memory-mapped in data memory
space. Each of the I/O ports can be addressed by the IN or the OUT instruction. The memory-
mapped I/O ports can be accessed with any instruction that reads from or writes to data memory.
• Host port interface (HPI):
The HPI available on the ’C57S and ’LC57 is an 8-bit parallel I/O port that provides an interface
to a host processor. Information is exchanged between the DSP and the host processor through on-
chip memory that is accessible to both the host processor and the ’C57.
• Serial port:
Three different kinds of serial ports are available: a general-purpose serial port, a time-division
multiplexed (TDM) serial port, and a buffered serial port (BSP). Each ’C5x contains at least one
general-purpose, high-speed synchronous, full-duplexed serial port interface that provides direct
communication with serial devices such as codecs, serial analog-to-digital (A/D) converters, and
other serial systems. The serial port transmitter and receiver are double-buffered and individually
controlled by maskable external interrupt signals. Data is framed either as bytes or as words.
• Buffered serial port (BSP):
The BSP available on the ’C56 and ’C57 devices is a full-duplexed, double-buffered serial port with an autobuffering unit (ABU). The BSP provides flexibility in the data stream length. The
ABU supports high-speed data transfer and reduces interrupt latencies.
• Time-division multiplexed (TDM) serial port:
The TDM serial port available on the ’C50, ’C51, and ’C53 devices is a full-duplexed serial port
that can be configured by software either for synchronous operations or for time-division
multiplexed operations. The TDM serial port is commonly used in multiprocessor applications.
• User-maskable interrupts:
Four external interrupt lines (INT1–INT4) and five internal interrupts (a timer interrupt and four serial port interrupts) are user-maskable. When an interrupt service routine (ISR) is executed, the
contents of the program counter are saved on an 8-level hardware stack, and the contents of eleven
specific CPU registers are automatically saved (shadowed) on a 1-level-deep stack. When a return
from interrupt instruction is executed, the CPU registers’ contents are restored.

Addressing modes used in a DSP Processor


The different addressing modes used in a DSP Processor are:
1. Direct addressing
2. Indirect addressing
3. Immediate addressing
4. Dedicated-register addressing
5. Memory-mapped register addressing
6. Circular addressing

 Direct Addressing:
In the direct memory addressing mode, the instruction contains the lower 7 bits of the data
memory address (dma). The 7-bit dma is concatenated with the 9 bits of the data memory page
pointer (DP) in status register 0 to form the full 16-bit data memory address.
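
As an illustration of this address formation, here is a minimal C sketch (an assumption for clarity, not device code): the 9-bit DP supplies the upper bits and the instruction's 7-bit dma field supplies the lower bits of the 16-bit address.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of 'C5x direct-address formation: the 9-bit DP (from ST0) is
 * concatenated with the 7-bit dma field taken from the instruction word. */
static uint16_t direct_address(uint16_t dp, uint8_t dma)
{
    dp  &= 0x01FF;                        /* keep only the 9 DP bits   */
    dma &= 0x7F;                          /* keep only the 7 dma bits  */
    return (uint16_t)((dp << 7) | dma);   /* full 16-bit data address  */
}

int main(void)
{
    /* Example: DP = 0x004, dma = 0x23 -> address 0x0223 */
    printf("0x%04X\n", direct_address(0x004, 0x23));
    return 0;
}
```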

 Indirect addressing:
Indirect addressing can be used with all instructions except those with immediate operands or
with no operands.
The location of the operand in memory is determined by a combination of the contents of an auxiliary register, an optional displacement, and the index register. The auxiliary register arithmetic unit (ARAU) is the functional unit that calculates the effective address of the operand. This technique is particularly useful when blocks of data are being processed, since provision is made for automatically incrementing or decrementing the address stored in the auxiliary register after each reference.
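
The following C fragment is a conceptual model of this auto-increment behaviour (an assumption for illustration, not ’C5x assembly): the auxiliary register acts as a pointer that is updated after each access, so the CALU never spends a cycle on address arithmetic.

```c
#include <stdint.h>

/* Conceptual model of indirect addressing with post-increment:
 * sum a block of data memory while the "auxiliary register" steps
 * through the addresses on its own.                               */
static int32_t sum_block(const int16_t *data_mem, uint16_t ar, uint16_t n)
{
    int32_t acc = 0;
    while (n--) {
        acc += data_mem[ar];   /* operand located through the auxiliary register */
        ar++;                  /* ARAU-style post-increment of the address       */
    }
    return acc;
}
```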
 Immediate Addressing:
In immediate addressing, the data is contained in the instruction itself: the instruction word(s) contain the value of the immediate operand. The ’C5x has both 1-word (8-bit, 9-bit, and 13-bit constant) short immediate instructions and 2-word (16-bit constant) long immediate instructions.
 Short Immediate Addressing: In short immediate instructions, the operand is contained
within the instruction machine code.
 Long Immediate Addressing: In long immediate instructions, the operand is contained in
the second word of a two-word instruction.
 Dedicated-Register Addressing:
The dedicated-register addressing mode operates like the long immediate addressing
mode, except that the address comes from one of two special-purpose memory-mapped
registers in the CPU: the block move address register (BMAR) and the dynamic bit
manipulation register (DBMR). The advantage of this addressing mode is that the address of
the block of memory to be acted upon can be changed during execution of the program.

 Memory-Mapped Register Addressing:


With memory-mapped register addressing, the memory-mapped registers can be modified without affecting the current data page pointer (DP) value. In addition, any scratch-pad RAM (DARAM B2) location or any data page 0 location can be modified. The memory-mapped register addressing mode operates like the direct addressing mode, except that the 9 MSBs of the address are forced to 0 instead of being loaded with the contents of the DP. This allows the memory-mapped registers on data page 0 to be addressed directly, without the overhead of changing the DP or an auxiliary register.
The following instructions operate in the memory-mapped register addressing
mode. Using these instructions does not affect the contents of the DP:
• LAMM — Load accumulator with memory-mapped register
• LMMR — Load memory-mapped register
• SAMM — Store accumulator in memory-mapped register
• SMMR — Store memory-mapped register
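
For illustration, a minimal C sketch of the address formation used by this addressing mode (an assumption for clarity, not device code): the upper 9 bits are forced to 0, so the access always lands in data page 0 and the DP is left untouched.

```c
#include <stdint.h>

/* Memory-mapped register addressing: only the 7-bit dma field is used;
 * the 9 MSBs are forced to 0, so the address is always 0x0000-0x007F.  */
static uint16_t mmr_address(uint8_t dma)
{
    return (uint16_t)(dma & 0x7F);   /* data page 0, independent of DP */
}
```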

 Circular Addressing:
Many algorithms such as convolution, correlation, and finite impulse response (FIR) filters
can use circular buffers in memory to implement a sliding window, which contains the most
recent data to be processed. The 8-bit circular buffer control register (CBCR) enables and disables circular buffer operation.
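
As a sketch of the idea (a generic C model with assumed names, not the CBCR hardware mechanism itself), a circular buffer can hold the most recent N samples as the sliding window for an FIR filter:

```c
#include <stdint.h>

/* Generic circular-buffer model: the newest sample overwrites the oldest,
 * and the FIR filter slides its window over the most recent N samples.   */
#define N_TAPS 8

typedef struct {
    int16_t  data[N_TAPS];
    uint16_t idx;                       /* position of the oldest sample */
} circ_buf_t;

static void circ_put(circ_buf_t *b, int16_t sample)
{
    b->data[b->idx] = sample;                    /* overwrite oldest    */
    b->idx = (uint16_t)((b->idx + 1) % N_TAPS);  /* wrap around the end */
}

static int32_t fir_output(const circ_buf_t *b, const int16_t coeff[N_TAPS])
{
    int32_t  acc = 0;
    uint16_t j   = b->idx;                       /* start at oldest sample */
    for (int i = 0; i < N_TAPS; i++) {
        acc += (int32_t)coeff[i] * b->data[j];   /* multiply/accumulate    */
        j = (uint16_t)((j + 1) % N_TAPS);
    }
    return acc;
}
```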

Pipelining
Pipelining is an important technique used in many applications, such as digital signal processing (DSP) systems and microprocessors. It originates from the idea of a water pipe into which water is sent continuously without waiting for the water already in the pipe to come out. Accordingly, it shortens the effective critical path in most DSP systems; this can be exploited either to increase the clock speed or to reduce the power consumption at the same speed.

Thus pipelining is used extensively in DSP to increase speed, as it allows two or more operations to overlap during execution. In pipelining, a task is broken down into a number of distinct subtasks, which are then overlapped during execution.

Concept
Pipelining allows different functional units of a system to run concurrently. Consider an informal
example in the following figure. A system includes three sub-function units (F0, F1 and F2).
Assume that there are three independent tasks (T0, T1 and T2) being performed by these three
function units. The time for each function unit to complete a task is the same and will occupy a
slot in the schedule.

If we put these three units and tasks in a sequential order, the required time to complete them is
five slots.

However, if we pipeline T0 to T2 concurrently, the aggregate time is reduced to three slots.

Therefore, a well-designed pipeline can achieve a significant enhancement in speed.

Pipeline Structure in TMS320C5x


The four phases of the ’C5x pipeline structure and their functions are as follows:

1) Fetch (F) — This phase fetches the instruction words from memory and updates the program
counter (PC).

2) Decode (D) — This phase decodes the instruction word and performs address generation and
ARAU updates of auxiliary registers.
3) Read (R) — This phase reads operands from memory, if required. If the instruction uses indirect addressing, it reads the memory location pointed to by the ARP before the update made in the previous decode phase.

4) Execute (E) — This phase performs the specified operation and, if required, writes the results of a previous operation to memory.

Figure 7–1 illustrates the operation of the four-level pipeline for single-word, single-cycle instructions executing with no wait states. This is perfect overlapping in the pipeline, where all four phases operate in parallel. When more than one pipeline stage requires the same resource, such as memory or a CPU register, a pipeline conflict occurs. Since there is no priority among the four phases, a conflict can produce unexpected results; therefore, you should avoid any conflict between the four phases in order to get correct results.
Speedup = average instruction time (non-pipelined) / average instruction time (pipelined)   (1)

Example
In a non-pipelined processor, the instruction fetch, decode, and execute take 35 ns, 25 ns, and 40 ns, respectively. Determine the increase in throughput if the instruction steps were pipelined. Assume a 5 ns pipeline overhead at each stage, and ignore other delays.

In an ideal non-pipelined processor, the average instruction time is simply the sum of the times for instruction fetch, decode, and execute:
35 + 25 + 40 ns = 100 ns.
However, if we assume a fixed machine cycle, then each instruction takes three machine cycles to complete: 3 × 40 ns = 120 ns (the execute time, being the longest, determines the cycle time). This corresponds to a throughput of 8.3 × 10^6 instructions per second.

In the pipelined processor, the clock period is determined by the slowest stage plus the overhead, i.e. 40 + 5 = 45 ns. The throughput (when the pipeline is full) is 22.2 × 10^6 instructions per second.
Speedup = average instruction time (non-pipelined) / average instruction time (pipelined) = 120/45 ≈ 2.67
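
The arithmetic above can be checked with a short, self-contained C program (purely illustrative):

```c
#include <stdio.h>

/* Reproduces the example: 35/25/40 ns stages, 5 ns per-stage overhead,
 * fixed machine cycle set by the slowest stage.                         */
int main(void)
{
    const double slowest  = 40.0;   /* execute is the longest stage (ns)   */
    const double overhead = 5.0;    /* per-stage pipeline overhead (ns)    */

    double t_nonpipe = 3.0 * slowest;      /* 3 machine cycles of 40 ns = 120 ns */
    double t_pipe    = slowest + overhead; /* 40 + 5 = 45 ns per completed instr */

    printf("non-pipelined: %.0f ns/instr (%.1f MIPS)\n", t_nonpipe, 1e3 / t_nonpipe);
    printf("pipelined:     %.0f ns/instr (%.1f MIPS)\n", t_pipe, 1e3 / t_pipe);
    printf("speedup:       %.2f\n", t_nonpipe / t_pipe);
    return 0;
}
```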

Pipelining has a major impact on the system memory because it leads to an increased number of
memory accesses (typically by the number of stages). The use of Harvard architecture where data
and instructions lie in separate memory spaces promotes pipelining.
Fixed-point and 32-bit floating-point DSPs

In 1982, Texas Instruments introduced the TMS32010 — the first fixed-point DSP in the TMS320
family. Today, the TMS320 family consists of eight generations: the ’C1x, ’C2x, ’C2xx, ’C5x,
and ’C54x are fixed-point, the ’C3x and ’C4x are floating-point, and the ’C8x is a multiprocessor.

System developers, especially those who are new to digital signal processors (DSPs), are
sometimes uncertain whether they need to use fixed- or floating-point DSPs for their systems. Both
fixed- and floating-point DSPs are designed to perform the high-speed computations that underlie
real-time signal processing. Both feature system-on-a-chip (SOC) integration with on-chip memory
and a variety of high-speed peripherals to ensure fast throughput and design flexibility. Tradeoffs
of cost and ease of use often heavily influenced the fixed- or floating-point decision in the past.
Today, though, selecting either type of DSP depends mainly on whether the added computational
capabilities of the floating-point format are required by the application.

Fixed-point DSP hardware


Fixed-point DSP hardware performs strictly integer arithmetic functions. TI’s TMS320C62x™ fixed-point DSPs have two data paths operating in parallel, each with a 16-bit word width that provides signed integer values within a range from –2^15 to 2^15 – 1. TMS320C64x™ DSPs double the overall throughput with four 16-bit (or eight 8-bit or two 32-bit) multipliers. TMS320C5x™ and TMS320C2x™ DSPs, with architectures designed for handheld and control applications, respectively, are based on single 16-bit data paths.
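
As a brief illustration of what strictly integer arithmetic means in practice, here is a generic Q15 fixed-point multiply in C (an illustrative sketch; the names and saturation policy are assumptions, not a TI library routine):

```c
#include <stdint.h>

/* Q15 multiply: 16-bit signed fractions, 32-bit intermediate product,
 * rescaled and saturated back to the signed 16-bit range.             */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* 16x16 -> 32-bit product        */
    p >>= 15;                              /* rescale back to Q15            */
    if (p >  32767) p =  32767;            /* saturate to the signed         */
    if (p < -32768) p = -32768;            /* 16-bit range [-2^15, 2^15 - 1] */
    return (int16_t)p;
}
```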

Floating-point DSP hardware


TMS320C67x™ floating-point DSPs divide a 32-bit data path into two parts: a 24-bit mantissa that can be used either for integer values or as the base of a real number, and an 8-bit exponent. The 24 bits of the mantissa offer a range of about 16M values of precision, and the addition of the 8-bit exponent supports a vastly greater dynamic range than is available with the fixed-point format. The
C67x™ DSP can also perform calculations using industry-standard double-width precision (64
bits, including a 53-bit mantissa and an 11-bit exponent). Double-width precision achieves much
greater precision and dynamic range at the expense of speed, since it requires multiple cycles for
each operation.
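
The mantissa/exponent split can be illustrated with standard C (this shows the general floating-point idea as seen from C, not the C67x internal register layout):

```c
#include <math.h>
#include <stdio.h>

/* Split a real number into mantissa and exponent with the standard library:
 * x == mant * 2^exp, with 0.5 <= |mant| < 1.                                */
int main(void)
{
    float x = 12345.678f;
    int   exp;
    float mant = frexpf(x, &exp);
    printf("%g = %g * 2^%d\n", x, mant, exp);
    return 0;
}
```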

Accuracy
The greater accuracy of the floating-point format results from three factors. First, the 24-bit word
width in TI C67x™ floating-point DSPs yields greater precision than the C62x™ 16-bit fixed-
point word width, in integer as well as real values. Second, exponentiation vastly increases the
dynamic range available for the application. A wide dynamic range is important in dealing with
extremely large data sets and with data sets where the range cannot be easily predicted. Third, the
internal representations of data in floating-point DSPs are more exact than in fixed-point, ensuring
greater accuracy in end results.

Finally, there is the word width for holding the intermediate products of iterated multiply-accumulate (MAC) operations. For a single 16-bit by 16-bit multiplication, a 32-bit product would
be needed, or a 48-bit product for a single 24-bit by 24-bit multiplication. (Exponents have a
separate data path and are not included in this discussion.) However, iterated MACs require
additional bits for overflow headroom. In C62x fixed-point devices, this overflow headroom is 8
bits, making the total intermediate product word width 40 bits (16 signal + 16 coefficient + 8
overflow). Integrating the same proportion of overflow headroom in C67x floating-point DSPs
would require 64 intermediate product bits (24 signal + 24 coefficient + 16 overflow), which would
go beyond most application requirements in accuracy. Fortunately, through exponentiation the
floating-point format enables keeping only the most significant 48 bits for intermediate products,
so that the hardware stays manageable while still providing more bits of intermediate accuracy
than the fixed-point format offers. These word widths are summarized in the table below for several TI DSP architectures.

Word widths (in bits) for TI DSPs

TI DSP(s)        Format     Signal I/O      Coefficient     Intermediate result
C25x             fixed      16              16              40
C5x™/C62x™       fixed      16              16              40
C64x™            fixed      8/16/32         16              40
C3x™             floating   24 (mantissa)   24              32
C67x™ (SP)       floating   24 (mantissa)   24              24/53
C67x™ (DP)       floating   53              53              53
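
To make the guard-bit idea concrete, the following C sketch accumulates 16 × 16-bit products in an accumulator wider than 32 bits (int64_t here simply models the extra headroom; the C62x hardware uses a 40-bit accumulator, i.e. 8 guard bits):

```c
#include <stdint.h>

/* Iterated MAC with guard bits: each tap produces a 32-bit product, and the
 * wide accumulator absorbs the growth from repeated additions without
 * overflowing.                                                              */
static int64_t mac(const int16_t *x, const int16_t *h, int n)
{
    int64_t acc = 0;                       /* wide accumulator with headroom */
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];       /* 32-bit product per tap         */
    return acc;
}
```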

Cost versus ease of use


Today the early differences in cost and ease of use, while not altogether erased, are considerably
less pronounced. Scores of transistors can now fit into the same space required by a single
transistor a decade ago, leading to SOC integration that reduces the impact of a single DSP core
on die size and expense. Many DSP-based products, such as TI’s broadband, camera imaging,
wireless baseband and OMAP™ wireless application platforms, leverage the advantages of
rescaling by integrating more than a single core in a product targeted at a specific market. Fixed-
point DSPs continue to benefit more from cost reductions of scale in manufacturing, since they are
more often used for high-volume applications; however, the same reductions will apply to floating-point DSPs when high-volume demand for the devices appears. Today, cost has increasingly
become an issue of SOC integration and volume, rather than a result of the size of the DSP core
itself. The early gap in ease of use has also been reduced. TI fixed-point DSPs have long been
supported by outstandingly efficient C compilers and exceptional tools that provide visibility into
code execution. The advantage of implementing real arithmetic directly in floating-point hardware
still remains; but today advanced mathematical modeling tools, comprehensive libraries of
mathematical functions, and off-the-shelf algorithms reduce the difficulty of developing complex
applications—with or without real numbers—for fixed-point devices. Overall, fixed-point DSPs
still have an edge in cost and floating-point DSPs in ease of use, but the edge has narrowed until
these factors should no longer be overriding in the design decision.

Application
The data sets of different types of applications also lend themselves better to either fixed- or floating-point computation. Today, one of the heaviest uses of DSPs is in wired and wireless communications, where most data is transmitted serially in octets that are then expanded internally for 16-bit processing based on integer operations. Obviously, this data
set is extremely well-suited for the fixed-point format, and the enormous demand for DSPs in
communications has driven much of fixed-point product development and manufacturing.

Floating-point applications are those that require greater computational accuracy and flexibility
than fixed-point DSPs offer. For example, image recognition used for medicine is similar to audio
in requiring a high degree of accuracy. Many levels of signal input from light, x-rays, ultrasound
and other sources must be defined and processed to create output images that provide useful
diagnostic information. The greater precision of C67x signal data, together with the device’s more
accurate internal representations of data, enable imaging systems to achieve a much higher level
of recognition and definition for the user.

Wide dynamic range also plays a part in robotic design. Normally, a robot functions within a
limited range of motion that might well fit within a fixed-point DSP’s dynamic range. However,
unpredictable events can occur on an assembly line. For instance, the robot might weld itself to an
assembly unit, or something might unexpectedly block its range of motion. In these cases, feedback
is well out of the ordinary operating range, and a system based on a fixed-point DSP might not
offer programmers an effective means of dealing with the unusual conditions. The wide dynamic
range of a floating-point DSP, however, enables the robot control circuitry to deal with
unpredictable circumstances in a predictable manner.
