
EDN 2000

EDN'S ANNUAL DSP DIRECTORY HIGHLIGHTS THE ARCHITECTURES AVAILABLE FOR YOUR HOTTEST DESIGNS. HERE'S HELP IN SORTING THROUGH THE MYRIAD DSP DEVICES. YOU CAN ALSO ACCESS OUR FREQUENTLY UPDATED, FEATURE-TUNED DATABASE USING OUR SEARCH ENGINE TO FIND THE RIGHT DEVICE FOR YOUR DESIGN NEEDS.

DSP-architecture directory
By Markus Levy, Technical Editor
I'm beginning to sound like a broken record. (Remember those vinyl platters?) Every year I begin the introduction to EDN's DSP Directory by remarking on the tremendous growth in DSP technology, and it's no different this year. You can judge this growth from the number of new DSP companies and the number of new DSPs, and you'll find descriptions of both right here in this directory. What's the cause of all this excitement? Cellular phones, broadband communications, medical-imaging equipment, modems, audio equipment, motor control, and tons more. You've probably seen the commercials from Texas Instruments about DSP technology: another testimony that DSPs are penetrating our lives.

In addition, plenty of RISC processors keep popping up with DSP instruction sets. (The appearance of RISC processors with DSP instruction sets, however, raises the question, "Are they really still RISCs?") These processors include those from ARC, ARM, Improv Systems, Lexra, MIPS, Sandcraft, STMicro, Tensilica, and more (www.arccores.com, www.arm.com, www.improvsys.com, www.lexra.com, www.mips.com, www.sandcraft.com, www.st.com, www.tensilica.com). Although this directory does not cover these processors, you'll find them in the upcoming Microprocessor Directory. There, you'll also find the traditional RISC-DSP combos, such as Hitachi's SH-DSP, Infineon's TriCore, and Hyperstone's E1 (www.hitachi.com/semiconductor, www.hyperstone.de).

I may also be repeating myself when I talk about the new benchmarks from the EDN Embedded Microprocessor Benchmark Consortium (EEMBC). The consortium has been actively working on these industry-standard benchmarks for three years, and now the benchmarks are ready to go. Starting on April 11, you can go to the EEMBC Web site at www.eembc.org and get free, certified benchmark scores for DSPs and other processors. These benchmarks include cascaded biquad filters, the Viterbi add-compare-select function, autocorrelation, bit allocation, and FFTs.
If you don't find benchmark scores for your favorite DSP on EEMBC's Web site, urge the corresponding vendor to provide its EEMBC scores. Let them tie some quantitative performance information to those multiply-accumulate units and pipeline stages. Next year, we'll include those scores in this directory for direct comparisons.

The Motorola and Lucent Technologies joint venture has yielded the fruits of the two companies' labors in the StarCore scalable SC140 DSP core and the MSC8101. However, we're still anxiously awaiting a DSP architecture from the Intel/Analog Devices partnership. An introduction should happen soon. We're also waiting to see production quantities of TI's new 750-MHz C64x: the fastest RISC, er, DSP on the planet.

To help you sort through the myriad DSP devices, access our frequently updated database using our feature-tuned search engine to find the right device for your design needs (www.ednmag.com/ednmag/reg/micro.asp).
www.ednmag.com

60 edn | March 30, 2000

16 BITS
Analog Devices ADSP-21xx
Analog Devices TigerSharc DSP
BOPS ManArray
DSP Group cores
Equator Technologies MAP-CA
Infineon Carmel DSP 10XX and 20XX cores
LSI Logic ZSP DSPs
Lucent/Motorola StarCore SC100
Lucent Technologies DSP16xx
Lucent Technologies DSP16000
Massana FILU-200 DSP coprocessor core
Motorola DSP56800
NEC SPRX DSP
Philips REAL DSP
Texas Instruments TMS320C2000
Texas Instruments TMS320C5000
Texas Instruments TMS320C6000
3DSP SP-3 and SP-5 DSP cores
Zilog Z893x1/Z893x3

24 BITS
Motorola DSP563xx

32 BITS
Analog Devices SHARC DSP
Texas Instruments TMS320C3x
Texas Instruments TMS320C4x

64 BITS
Module Research Center's NeuroMatrix NM6403 DSP

FOR MORE INFORMATION...


Analog Devices, www.analog.com, Circle No. 376
Billions of Operations Per Second, www.bops.com, Circle No. 377
DSP Group, www.dspg.com, Circle No. 378
Equator Technologies, www.equator.com, Circle No. 379
Infineon Technologies, www.infineon.com, Circle No. 380
LSI Logic, www.lsil.com, Circle No. 381
Lucent Technologies, www.lucent.com, Circle No. 382
Massana, www.massana.com, Circle No. 383
Module Research Center, www.module.vympel.msk.ru/, Circle No. 384
Motorola Inc, www.motorola-dsp.com, Circle No. 385
NEC Electronics Inc, www.nec.com, Circle No. 386
Philips Semiconductor, www.semiconductors.philips.com, Circle No. 387
Texas Instruments Inc, www.micro.ti.com, Circle No. 388
3DSP Corp, www.3dsp.com, Circle No. 389
Zilog, www.zilog.com, Circle No. 390

SUPER CIRCLE NUMBER: For more information on the products available from all of the vendors listed in this box, circle one number on the reader service card. Circle No. 391

Photo courtesy Texas Instruments


DSP directory: 16 bits

Analog Devices ADSP-21xx


Highlights: Architecture features a 40-bit accumulator with 8 guard bits. Device offers single-cycle execution. DSP performs conditional execution of most instructions.

The ADSP-21xx family's CPU handles general processing needs and executes all instructions in a single cycle. All of Analog Devices' 16-bit DSPs are code-compatible, and many are also pin-compatible. All DSPs feature an algebraic programming syntax. The processor can execute multiple operations per cycle. The multiply-accumulate (MAC) unit, ALU, and barrel shifter are separate but cannot execute in parallel. Secondary registers shadow each execution unit's registers, allowing fast context switching for interrupt processing. If you need extended precision, you can address the MAC unit's 40-bit accumulator (which includes 8 guard bits) as two 16-bit registers and one 8-bit register and individually copy the contents to another register.

The barrel shifter moves 16-bit inputs left or right into a 32-bit register. The shifter also includes hardware support for logical and arithmetic shifts, as well as exponent detection and normalization for block floating point, which increases the effective precision of a 16-bit DSP. Algorithms such as FFTs, in which bits grow from stage to stage, use block floating point. A programmer may use the shifter to convert between fixed- and floating-point numbers. The ADSP-219x DSPs expand on the architecture by providing two 40-bit accumulators and a 40-bit shifter result.

ADSP-21xx family members have X and Y data-address generators (DAGs) and separate program and data buses. Two DAGs provide addresses for simultaneous dual-operand fetches (from data and program memory). Each DAG maintains and updates four address pointers. You may associate a length value with each pointer to implement automatic modulo addressing for circular buffers. While executing from the on-chip memory, the buses feed the X- and Y-data values for each MAC cycle.
Thus, you can use program memory as data memory to hold constants for single-cycle MAC processing. The program bus is free for MAC use when the CPU executes from on-chip program memory. The dual-ported program memory allows two memory accesses in one cycle. The ADSP-219x DSP uses an instruction cache to achieve three-bus performance. For access to external memory, the ADSP-21xx has a programmable wait-state generator for zero to 15 wait states.

Analog Devices' designers opted for a 16-bit-wide data word and a 24-bit-wide instruction word. The wider instruction word lets the device use more complex instructions and offers more flexibility than does a 16-bit operation code. For external-memory design, the different memory widths mean that if you let three 8-bit-wide memory chips share program and data, you sacrifice every third byte of the data-memory area. Analog Devices integrates as much as 2 Mbits of SRAM around its DSP core to help increase data-transfer efficiency. Many ADSP-21xxs also integrate DMA ports that connect to external hosts or external memory. These bidirectional, byte-wide ports can directly access as much as 4 Mbytes of external memory for off-chip storage of program overlays or data tables.

Addressing modes: The ADSP-21xx includes immediate, register-direct, memory-direct, and register-indirect addressing modes. The ADSP-219x adds register, indirect-postmodify, immediate-modify, and direct- and indirect-offset addressing modes. The program sequencer features internal loop counters and loop stacks, enabling looped code to execute with zero overhead. Each address generator supports as many as four circular buffers, each with three registers. The registers define the end, length, and access address. One address generator provides bit-reversed addressing. The ADSP-219x supports as many as 16 circular buffers by using a DAG shadow register and a set of base registers for additional circular-buffering flexibility.
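The modulo addressing that the address generators perform for circular buffers can be sketched in a few lines. This is a minimal illustration, not Analog Devices' implementation: the function names, the base/length/modify parameters, and the convention that a length of 0 means linear addressing are all assumptions made for the example.

```python
def post_modify(index, modify, base, length):
    """Return the next address after one post-modify step.

    The circular buffer is assumed to occupy base .. base + length - 1.
    A length of 0 is treated as plain linear addressing (an assumption
    of this sketch).
    """
    if length == 0:
        return index + modify
    # Wrap the offset from the buffer base back into 0 .. length - 1.
    return base + (index - base + modify) % length


def walk(base, length, modify, steps):
    """Collect the addresses visited by repeated post-modify updates."""
    addresses = []
    index = base
    for _ in range(steps):
        addresses.append(index)
        index = post_modify(index, modify, base, length)
    return addresses
```

Calling `walk(0x100, 4, 1, 6)` visits four addresses and then wraps back to the base, which is the behavior a FIR delay line relies on.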
Special instructions: The ADSP-21xx can conditionally execute most instructions. A do-until command establishes a sequence of instructions that can be arbitrary in length and nested four deep for repeat operations. The ADSP-219x allows as many as eight nesting levels. In addition to the standard arithmetic and logic instructions, the ALU supports division primitives. Because the ADSP-21xx is a nonpipelined machine, it incurs no penalties for jumps and calls.

Support: Analog Devices' software- and hardware-development tools include the company's VisualDSP integrated development environment, in-circuit emulators, and a development kit. VisualDSP provides the interface to an optimizing C compiler, an assembler, a linker, a simulator, and a debugger. Analog Devices' emulators are available for Universal Serial Bus, PCI, and Ethernet host platforms. An EZ-Kit Lite consists of an evaluation board and a limited but full-featured version of VisualDSP.
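The block-floating-point technique mentioned for the shifter in this entry can be illustrated in software. The sketch below assumes 16-bit two's-complement data and uses hypothetical function names; it shows the general idea (detect the smallest headroom in a block, then shift every value by that amount and record one shared exponent), not the ADSP-21xx instruction sequence.

```python
def redundant_sign_bits(value, width=16):
    """Count how far a two's-complement value can shift left without overflow."""
    if value < 0:
        value = ~value  # a negative value has the same headroom as its bit-inverse
    # width - 1 magnitude bits are available; bit_length() of them are in use.
    return (width - 1) - value.bit_length()


def block_normalize(block, width=16):
    """Shift a whole block left by its smallest headroom.

    Returns the normalized values and a shared exponent such that
    original == normalized * 2**exponent.
    """
    shift = min(redundant_sign_bits(v, width) for v in block)
    return [v << shift for v in block], -shift
```

For the block `[256, -512, 100]` the smallest headroom is 6 bits, so every value shifts left by 6 and the block carries a shared exponent of -6.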


Analog Devices TigerSharc DSP


Designed for the telecommunications infrastructure, the TigerSharc devices use a very-long-instruction-word (VLIW) load/store architecture to execute as many as four instructions per cycle with an interlocking, eight-stage pipeline. Instruction-level parallelism is determined before runtime to support deterministic execution for real-time applications. However, with an interlocked pipeline, the processor automatically inserts stall cycles whenever the result of an operation is unavailable.

The architecture comprises two computation blocks; each block contains a multiplier, an ALU, and a 64-bit shifter. The single-instruction, multiple-data (SIMD) features of this architecture allow you to perform arithmetic operations on multiple 32-bit floating-point values or multiple 8-, 16-, or 32-bit fixed-point values. Using both computation units, you can perform two 32×32-bit multiply-accumulates (MACs) per cycle, and, using both computation units with 16-bit data, you can perform eight 16×16-bit MACs per cycle. However, the complex architecture of TigerSharc will challenge programmers and code-generation tools to keep the pipeline and its computation units completely busy.

Another challenge will be to keep the processor fed. The architecture has two data-address generators that work with two 128-bit data buses to transfer as many as 256 bits per cycle between the computation units and memory. This approach results in 12 Gbytes/sec of on-chip bandwidth at 250 MHz. You can configure the three banks of on-chip memory to tune program and data partitioning for an application's needs. Applications should avoid going off-chip for memory accesses, but Analog Devices has plugged 14 DMA channels into TigerSharc to facilitate this process. The first device in this family, the ADSP-TS001, features integrated peripherals, such as 6 Mbits of SRAM; glueless multiprocessing; and link, external bus, and JTAG ports.
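A quick arithmetic check of the peak figures quoted above, offered as a sketch and not as vendor data. The function names are illustrative. Two 128-bit data buses at 250 MHz work out to 8 Gbytes/sec; the quoted 12 Gbytes/sec would correspond to 384 bits per cycle, for example if a third 128-bit bus were counted, which is an assumption here and not something the text states.

```python
def bandwidth_gbytes_per_sec(bits_per_cycle, clock_mhz):
    """Peak transfer rate in Gbytes/sec, taking 1 Gbyte as 10**9 bytes."""
    bytes_per_cycle = bits_per_cycle / 8
    return bytes_per_cycle * clock_mhz * 1e6 / 1e9


def macs_per_sec(macs_per_cycle, clock_mhz):
    """Peak multiply-accumulate rate for a given clock."""
    return macs_per_cycle * clock_mhz * 1_000_000


# Two 128-bit data buses, as described in the text.
data_buses_only = bandwidth_gbytes_per_sec(2 * 128, 250)

# Hypothetical case: three 128-bit transfers per cycle (an assumption).
three_buses = bandwidth_gbytes_per_sec(3 * 128, 250)

# Eight 16x16-bit MACs per cycle at 250 MHz.
peak_16bit_macs = macs_per_sec(8, 250)
```

Peak rates like these assume every bus and computation unit is busy on every cycle, which is the scheduling challenge the entry describes.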
The address generators support data addressing, pointer updates, circular buffering, and bit reversal. You can also use the address generators for general-purpose integer computations. Each computation unit has its own register file of 32 32-bit registers. You can combine two 32-bit registers for a single 64-bit register. You can also form 128-bit destinations for the multipliers by combining four consecutive 32-bit registers. The register file is orthogonal, making it easier for programming in C.

Addressing modes: TigerSharc offers immediate, bit-reversed, circular-modulo, and register-direct and -indirect addressing. An SIMD-memory-transfer mechanism lets a single load or store instruction specify that two memory transfers must occur between two memory blocks to or from two computation units.

Special instructions: The instruction set directly supports transformations between data types of higher and lower numerical precision, for example, going from fixed- to floating-point or 16- to 32-bit in one cycle. TigerSharc has no hardware modes; the instruction set supports arithmetic attributes, such as signed, unsigned, integer, and fractional. This simplifies high-level language programming. TigerSharc provides optional saturation for all cases.

Support: Analog Devices' software- and hardware-development tools include the company's VisualDSP integrated development environment, in-circuit emulators, and a development kit. VisualDSP provides the interface to an optimizing C compiler, an assembler, a linker, a simulator, and a debugger. Analog Devices' emulators are available for Universal Serial Bus, PCI, and Ethernet host platforms. An EZ-Kit Lite consists of an evaluation board and a limited but full-featured version of VisualDSP.

Highlights: VLIW architecture executes as many as four instructions per cycle. DSP has SIMD capabilities. The first TigerSharc integrates 6 Mbits of RAM.
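Bit-reversed addressing, which the address generators provide in hardware, is the index permutation a radix-2 FFT needs to reorder its input or output. The following sketch shows the permutation in software; the function names are illustrative, and it assumes a power-of-two buffer length.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Return the bit-reversed index sequence for a length-n buffer.

    n must be a power of two, as in a radix-2 FFT.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]
```

An address generator with bit-reversal support produces this sequence as a side effect of pointer updates, so the reordering costs no extra instructions.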


BOPS ManArray
Highlights: Device provides instruction-level parallelism with indirect VLIW. DSP supports 8-, 16-, and 32-bit fixed and floating point. Architecture features soft-macro DSP cores.

The Billions of Operations Per Second (BOPS) Inc ManArray is a family of reusable and scalable DSP cores. A developer can configure each of the family members into 16- and 32-bit, fixed-point formats; 32-bit, single-precision, floating-point formats; or both. The ManArray achieves a high level of parallelism by combining an indirect-very-long-instruction-word (iVLIW) architecture with single-instruction-multiple-data (SIMD) instructions and inherent multiprocessing capability. The SIMD instructions support 8-, 16-, and 32-bit packed data and 32-bit, single-precision, floating-point formats. A single-cycle, zero-latency, interprocessor-communications fabric and direct DMA access to all processing elements enhance the ManArray's parallelism.

The ManArray architecture comprises a sequence processor (SP) and a processing element (PE). The various product configurations combine one SP and multiple PEs. The SP handles program control and combines with a PE to form the smallest increment of the ManArray architecture: a single-SP, single-PE unit. Each SP/PE unit also includes the instruction and data-address-generation units. Each PE contains a multiported, 32×32-bit register file, an iVLIW-memory (VIM) unit, local data memory, and three bus interfaces. The bus interfaces include a 32-bit instruction bus; a 32-bit data bus; and a cluster switch (CS), a single-cycle PE-interconnect bus for moving data between the SP and PEs.

The five execution units comprise a multiply-accumulate unit (MAU), an ALU, a data-select unit (DSU), a 64-bit load unit, and a 64-bit store unit. The DSU supports data-manipulation instructions, such as shift, rotate, floating-point conversions, and SP-to-PE and PE-to-PE communications through the CS.
The register file logically performs as 32 32-bit registers or 16 64-bit registers supporting packed-data operations on an instruction-by-instruction basis. The BOPS2010 core, a 1×1-element array, comprises an SP/PE combination. The BOPS2020 uses the 1×1-element array and adds a PE through the CS to form a 1×2-element array. Likewise, the BOPS2040 is a 2×2-element array comprising one SP/PE combination and three PEs. The SP uses a 32-bit instruction set that supports both 1×1-element and N×M-element arrays and allows you to use one tool set for all combinations of the cores.

The topology of the BOPS architecture allows the devices to interconnect and organize a set of PEs into standard ring, mesh, torus, hypercube, and other organizations. The organization depends on algorithmic data-flow requirements. The topology type matters because the performance of any parallel algorithm depends on the efficiency of data movement on the processor and the cost of the interconnection mechanism.

The term iVLIW refers to the ManArray's ability to indirectly access an encapsulated instruction sequence in a horizontal VLIW format that can simultaneously execute operations. You create iVLIWs with a load-VLIW (LV) instruction, which identifies as many as five programmer-defined 32-bit instructions that make up the VLIW, as well as the VIM address in which to store the instructions. After executing the LV instruction, you issue a sequence of simple instructions that form the iVLIW. Once the SP and PEs store the iVLIW in VIM, your program can dispatch, or broadcast, an execute-iVLIW (XV) instruction to the SP and all PEs. The XV instruction contains the offset of the VIM address for the VLIW to execute in each PE. This nontraditional use of VLIWs effectively creates instructions for applications using 32-bit instruction paths. Using the iVLIW architecture, BOPS requires no large VLIW buses around the chip, as is common with VLIW machines.
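The load-then-execute flow described above can be modeled with a toy simulation. Everything in this sketch is hypothetical: the class, the method names, the tuple format for slot instructions, and the register model are illustrations of the idea only and say nothing about the real BOPS instruction encoding. The one detail taken from the text is the limit of five instructions per VLIW.

```python
import operator

MAX_SLOTS = 5  # the text allows as many as five 32-bit instructions per VLIW

ADD = operator.add
MUL = operator.mul


class ProcessingElement:
    """Toy PE with a register file and a local VLIW memory (VIM)."""

    def __init__(self):
        self.vim = {}          # VIM address -> list of stored slot instructions
        self.regs = [0] * 32   # 32 registers, as in the PE register file

    def load_vliw(self, vim_address, instructions):
        """Model of LV: store up to five simple instructions at one VIM address."""
        if len(instructions) > MAX_SLOTS:
            raise ValueError("a VLIW holds at most five instructions")
        self.vim[vim_address] = list(instructions)

    def execute_vliw(self, vim_address):
        """Model of XV: run every stored slot against the same register snapshot."""
        snapshot = list(self.regs)  # all slots read their operands "simultaneously"
        for dest, op, src_a, src_b in self.vim[vim_address]:
            self.regs[dest] = op(snapshot[src_a], snapshot[src_b])


def broadcast_xv(elements, vim_address):
    """One short XV instruction triggers the locally stored VLIW in every PE."""
    for pe in elements:
        pe.execute_vliw(vim_address)
```

The point of the model is that only the small VIM address travels over the instruction path at execute time; the wide instruction already sits in each PE.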
The VIMs allow BOPS to use a single 32-bit instruction bus in the array of PEs; this approach promotes scaling in both the number of PEs and the width of the iVLIWs, reducing the amount of program memory. The iVLIW architecture allows you to overlap the communications operations with the computation operations, thereby providing zero-latency data transfers between PEs. The architecture accomplishes this task by placing the communications instructions in the DSU and using software pipelining to transfer a result that an arithmetic-execution unit calculates in the previous machine cycle to any of the directly connected ManArray PEs. The load and store units provide independent datapaths between the SP data memory and the PEs and between each PE and its local data memory.

Addressing modes: The processors support array-parallel memory-addressing modes, including direct, base plus displacement, register indirect, and modulo indexed.

Special instructions: The MAU and ALU support floating-point and packed-data operations with saturation, and the DSU provides a complement of bit-manipulation, shift, rotate, and PE-to-PE-communications operations. ManArray fixed-point processors support a number of special DSP instructions, including packed-data multiply-accumulate and multiply complex data.

Support: The BOPS software-development kit comprises an integrated development environment, a visual system simulator, a cycle-accurate instruction-set simulator, a Gnu-C compiler with assembler/preprocessor/linker, and a basic DSP library. BOPS also has a compiler for Matlab. The instruction-set simulator provides views of all core resources, including the disassembly of iVLIWs in VIM and pipeline stages. You can find some demonstrations of this architecture at http://bopsnet.com/training/demos.shtml.
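A complex multiply on packed data, of the kind the special instructions above accelerate, reduces to four real multiplies arranged as two multiply-accumulate pairs. The sketch below is a software illustration only; the packing layout (real part in the high 16 bits, imaginary part in the low 16 bits) and the function names are assumptions, not the ManArray data format.

```python
def unpack16x2(word):
    """Split a 32-bit word into two signed 16-bit halves.

    Assumed layout for this sketch: high half = real, low half = imaginary.
    """
    def signed16(x):
        return x - 0x10000 if x & 0x8000 else x

    return signed16((word >> 16) & 0xFFFF), signed16(word & 0xFFFF)


def complex_multiply(a_re, a_im, b_re, b_im):
    """Multiply (a_re + j*a_im) by (b_re + j*b_im) with four real multiplies."""
    re = a_re * b_re       # first product
    re -= a_im * b_im      # multiply-subtract into the real accumulator
    im = a_re * b_im       # third product
    im += a_im * b_re      # multiply-add into the imaginary accumulator
    return re, im
```

For example, unpacking `0x0001FFFE` and `0x00030004` gives (1 - 2j) and (3 + 4j), whose product is 11 - 2j.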


DSP Group cores


Highlights: A variety of synthesizable cores is available. Teak is a dual-MAC core with parallelism capability. TeakLite is a cost-effective version. PalmDSPCore includes seven arithmetic units.

DSP Group developed PineDSPCore, OakDSPCore, TeakDSPCore, and PalmDSPCore and has 34 licenses. TeakDSPCore is a family of two synthesizable cores: TeakLite and Teak; both are binary-compatible with OakDSPCore, allowing them to easily replace OakDSPCore. DSP Group's PalmDSPCore is a family of fully synthesizable cores in 16-, 20-, and 24-bit versions. PalmDSPCore, a parallel-instruction-word architecture, has seven computation units that operate in parallel. The core includes two multipliers; a three-input ALU; and three adder-subtracter units (ASUs), one of which is a three-input, insertion/extraction unit. Other units include a barrel shifter and an exponent unit. PalmDSPCore also contains more register resources, including an accumulator-register file. PalmDSPCore is compatible with previous generations of the DSP Group cores.

PineDSPCore has two data buses and one program bus, two data-memory blocks for X and Y memory, a data-arithmetic-address-generator unit (DAAU), a multiplier, a 36-bit ALU, and two accumulators. It also includes two zero-overhead-loop mechanisms: a single-word instruction loop and a block repeat. OakDSPCore expands PineDSPCore capabilities by adding a bit-manipulation unit (BMU), an exponent unit, four nesting levels of block-repeat, and an expanded instruction set. OakDSPCore also has an indexed addressing mode and a software stack to improve its C-programming friendliness. Teak expands on OakDSPCore by adding a second multiply-accumulate (MAC) unit, support for faster task switching, and additional addressing modes. The multipliers for all the cores have two 16-bit input registers; they take two 16-bit, signed or unsigned numbers and deliver a 32-bit, 2's-complement product in one cycle.
Depending on the version, the PalmDSPCore's multipliers take 16-, 20-, or 24-bit inputs and yield a 32-, 40-, or 48-bit product, respectively. The shifter between the multiplier and the ALU sign-extends the product to accumulator size through the 4 guard bits (8 in Teak and PalmDSPCore). The ALU performs arithmetic/logical operations on the data operands. It also performs functions such as step division and rounding.

The BMU has a 36-bit barrel shifter (40 bits for Teak and 40, 48, or 56 bits for PalmDSPCore), a bit-field-operation unit, and two additional 36-bit accumulators (40 bits for Teak and 40, 48, or 56 bits for PalmDSPCore). The bit-field-operation unit reads from memory, modifies, and writes back to memory, bypassing the accumulator. Bypassing the accumulator not only frees a critical hardware resource, but also avoids the use of the accumulator's high-power-consumption circuitry. Cellular phones use bit-field operations to pack the bits before putting the data onto the channel. The four accumulators and a set of shadow registers enable rapid context switches; these accumulators can also evaluate 36-bit exponents (40 bits for Teak and 40, 48, or 56 bits for PalmDSPCore). PalmDSPCore's BMU supports parity, and its insert/extract unit supports packing and unpacking of bit fields into the accumulator. The accumulator optionally saturates out-of-range values as the CPU transfers them to registers or memory. Teak and PalmDSPCore include a saturation mode to support bit-exact standards.

At each cycle, the three buses move X- and Y-memory data to the MAC unit from X and Y data RAMs or ROMs while the program-control unit (PCU) fetches a new instruction from program-space ROM or RAM. The X-data bus also serves as the main CPU data bus by linking the two data RAMs, the status registers, the computation unit, the BMU, the DAAU, the PCU, and the user-definable registers.
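The role of the guard bits and of saturation on transfer, both described above, can be seen with a short numeric sketch. It models a 36-bit accumulator (32 bits plus 4 guard bits) in plain Python; the function names are illustrative, and the model ignores details such as rounding and fractional formats.

```python
def wrap(value, bits):
    """Wrap an integer to a two's-complement field of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate(value, bits=32):
    """Clamp a value to the signed range of the given width."""
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))


def mac_block(xs, ys, acc_bits=36):
    """Sum of products held in an accumulator that is acc_bits wide."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = wrap(acc + x * y, acc_bits)
    return acc


# Sixteen worst-case positive products of 16-bit operands.
xs = [32767] * 16
ys = [32767] * 16
with_guard = mac_block(xs, ys, acc_bits=36)     # fits thanks to 4 guard bits
without_guard = mac_block(xs, ys, acc_bits=32)  # wraps to a wrong, negative sum
stored = saturate(with_guard, 32)               # clamps when moved to 32 bits
```

With 4 guard bits the sum of 16 worst-case products stays representable, and saturation turns the final out-of-range value into the largest 32-bit number instead of a wrapped result.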
Teak and PalmDSPCore incorporate two buses for moving data from X memory, two for moving data from Y memory, and one for moving data from program memory, all in one cycle. Teak and PalmDSPCore use a switchbox unit rather than a main-CPU-data-bus method.

The DAAU generates X- and Y-memory addresses for each MAC cycle and modifies the pointers after operations, including modulo addressing. It has 16-bit pointer registers for addressing: a stack pointer, a base register, and registers for modulo and step postmodification. The DAAU contains an alternative bank of registers for interrupts and subroutines. The stack pointer references the top of the software stack for interrupts, subroutine-processing calls, and temporary variable saving. You can define four (16 in PalmDSPCore) additional on-chip registers that are not part of the DSP core. These registers can be handy for application-specific hardware, such as a Viterbi accelerator.

All cores support DMA operations, downloading capabilities from data-memory to program-memory space, and an automatic-boot procedure. OakDSPCore and PineDSPCore have a 64k-word X- and Y-data space and 64k-word program space. The X space and the program memory can be in internal memory, external memory, or both. TeakLite, Teak, and
PalmDSPCore split the 64k-word data space into X, Y (ROM/RAM), and Z space for peripherals. Teak uses linear program memory as large as 256k words and paging as large as 4M words. PalmDSPCore offers as much as 1M word of linear memory and 16M words of paged memory. All cores have a 16-bit loop counter for repeating instructions or instruction blocks as many as 65,536 times. OakDSPCore allows you to nest a repeat instruction in a loop block with as many as four levels of block nesting. Teak and PalmDSPCore allow you to use an infinite number of repeats and block repeats with a special instruction. All cores can operate from 1.8 to 2.7V and have power-management features to cut power dissipation. Internal control can automatically shut off unused functional units and memory.

Addressing modes: All cores support circular-buffering, direct, register, indirect, relative, and short and long immediate addressing modes. OakDSPCore, Teak, and PalmDSPCore also support index and long direct-addressing modes. Teak and Palm can have a quadruple-indirect-addressing mode, which simultaneously feeds the four inputs of the two multipliers, and a bit-reversal addressing mode.

Special instructions: The cores support conditional subroutine call/return from a subroutine and interruptible- and block-repeat instructions. (PineDSPCore has one repeat level and one block-repeat; OakDSPCore has four levels of nesting.) They also support division step; bit-field test, set, reset, and change (except for the PineDSPCore); compare; square; accumulate/subtract previous product; move data/program memory; conditionally modify accumulator; double-precision calculations; bit-field operations; exponent evaluation; normalization; context switching; minimum/maximum calculation with pointer latching and modification; delayed return; and automatic boot. Teak and Palm include special instructions to support coprocessors and built-in accelerators for Viterbi acceleration, FFTs, and other functions.
Palm adds support for delayed branches, many conditional instructions, and special instructions for least mean square, vector quantization, and other algorithms.

Support: DSP Group provides software-development tools along with evaluation/development boards and on-chip-emulation capabilities through the JTAG interface. The software tools include an assembler/linker, a preprocessor, a loader, a debugger, a C/C++ compiler, and the Assyst simulator. The debugger works in simulation or emulation mode and includes an application profiler. It contains support that allows you to extend the simulator and add logic to customize the debugger's hardware interface and to perform multicore debugging. All tools run under Windows and Solaris. Check out DSP Group's third-party developers at www.dspg.com/prodtech/core/3rdparty.htm.


Equator Technologies MAP-CA


Highlights: Equator partnered with Hitachi to provide a media processor for consumer applications. The architecture incorporates both VLIW and SIMD. The company claims the DSP yields good code with a C compiler only.

The media-accelerated processor (MAP-CA) targets multimedia products, such as set-top boxes, digital TVs, videoconferencing systems, and medical-imaging products. Equator's MAP-CA is more than a DSP: It combines both microprocessor and DSP functions in a very-long-instruction-word (VLIW) framework. The VLIW core executes four operations in parallel and supports partitioned single-instruction-multiple-data (SIMD) operations for 8-, 16-, 32-, and 64-bit data types. MAP-CA devices come with various coprocessors that offload the main CPUs, as well as integrated memory, a PCI interface, and a synchronous-DRAM (SDRAM) controller. The architecture supports dynamic address translation and virtual-memory protection.

The MAP-CA comprises two clusters, each with an Integer-ALU (I-ALU) and an Integer-Graphic-ALU (IG-ALU). Each I-ALU contains a load/store unit, a 32-bit integer ALU, and a branch unit. The I-ALUs perform integer, logical, flow-control, and memory-reference operations. The IG-ALU units contain a 32/64-bit integer ALU and a general-purpose signal and imaging-operation unit. The IG-ALU units perform integer, floating-point, and multimedia/graphics operations. Each cluster has its own register file containing 64 32-bit registers that a program can use in pairs as 64-bit registers, 16 1-bit predicate registers, and four special 128-bit registers.

Each MAP-CA instruction contains four operations and structures nearly all operations in three-operand, register-to-register format. No operations, including multicycle SIMD operations, block instruction issue, and most operation codes are predicated. The architecture uses register-scoreboard interlocks for load operations, and the compiler statically schedules all other operations.
The compiler encodes each operation in a 34-bit format: 32 bits plus a 2-bit field in a block header. Therefore, each four-operation instruction consumes 136 bits. In a typical VLIW instruction, some of the fields can contain no-operation instructions, leading to wasted space in the processor's cache or memory. MAP-CA solves this problem by storing the instruction in a compressed format using the block header for compression information.

Native data types include 1-bit logical values; 8-, 16-, 32-, and 64-bit integers; and 32-bit addresses. The media-intrinsic operations include partitioned operations over these data types. The processor uses the 1-bit logical values to support predicated execution, which allows partial speculation and eliminates branches.

MAP-CA coprocessors include a variable-length encoder/decoder (VLx), a video filter (VF), and Equator's patented DataStreamer. The 16-bit VLx RISC coprocessor has 32 16-bit registers and special hardware for bit-stream processing; hardware-accelerated MPEG-2 table look-up; and general-purpose, variable-length decoding. The eight-phase, 2-D VF takes a 4:2:0 or 4:2:2 YUV stream as input and scales it up or down as required. The VF pumps out scaled 4:4:4 YUV data to the SDRAM or the DRC through the video bus. DataStreamer is a programmable, 64-channel, multithreaded DMA engine that provides buffered data transfer between MAP-CA memory subsystems or between memories and I/O devices. It initiates transfers to and from memory. These transfers occur under software control without consuming cycles from other on-chip processing units.

The MAP-CA supports several on-chip memories and access to SDRAM and other memories via the 32-bit, 33- or 66-MHz PCI bus. The VLIW portion has a 32-kbyte instruction cache and a 32-kbyte data cache. The processor- and data-cache-refill mechanism can continue execution through 16 pending misses.
The CPU physically addresses both caches, and separate translation look-aside buffers provide virtual addressing for both. In addition to supporting the I-ALU ports, the data cache supports a port to the data-transfer switch within the DataStreamer. MAP-CA chips support several audio/video interfaces, including ITU-R BT.656 input and output, MPEG-2 Transport Channel Interface, an I2C selectable interface, and IEC958 and I2S digital-audio interfaces. On-chip memories include a 32-kbyte data cache, a 32-kbyte instruction cache, a 4-kbyte data memory and 4-kbyte program memory for VLx, a 6-kbyte line-buffer memory for VF, and an 8-kbyte buffer memory for the DataStreamer. The processor integrates key media-access I/O interfaces on chip along with one 32-bit, 33/66-MHz PCI bus.

Special instructions: The MAP-CA performs shift/extract/merge operations and 64-bit SIMD operations with 8-, 16-, and 32-bit partitions, including selection, comparison, selecting maximums and minimums, addition, multiply-add, complex multiplication, inner-product, and sum-of-absolute-difference. It also handles 128-bit partitioned SIMD operations, including inner-product with a new partition shift-in for efficient FIR operation and sum-of-absolute-difference with new partition shift-in for efficient block-matching operations.

Support: Equator offers its iMMediaTools software developer's kit, which includes a trace-scheduling C-language compiler; the FIRtree media-intrinsic C-language extensions; an assembler; a linker; a source-level debugger; an assembly-level debugger; a profiling CPU simulator; a virtual-machine, cycle-accurate simulator; and assorted libraries. MAP-CA supports the Microsoft Windows NT and Linux host-development environments.
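The sum-of-absolute-difference operation listed among the special instructions is the core of block matching in video encoders. The Python sketch below shows the arithmetic for eight unsigned 8-bit partitions in a 64-bit word; the function names and the choice of unsigned lanes are assumptions for illustration, not a description of the MAP-CA instruction itself.

```python
def sad_8bit(a: int, b: int) -> int:
    """Sum of absolute differences over the eight unsigned 8-bit lanes of two 64-bit words."""
    total = 0
    for shift in range(0, 64, 8):
        total += abs(((a >> shift) & 0xFF) - ((b >> shift) & 0xFF))
    return total

def block_sad(block_a, block_b):
    """Accumulate the SAD over two equally sized lists of 64-bit words (one word per row, say)."""
    return sum(sad_8bit(x, y) for x, y in zip(block_a, block_b))
```

A motion estimator would evaluate `block_sad` for each candidate position and keep the position with the smallest total.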


dspdirectory 16 bits

Infineon Carmel DSP 10XX and 20XX cores


Architecture features a customizable long-instruction word. DSP has 40-bit internal architecture. Devices support conditional execution-with-predication mechanism.

The Carmel DSP core is a family of licensable, fully synthesizable, 16-bit, fixed-point DSP cores. The first two members of the family are the 10XX and 20XX cores. Both cores operate from 1.2 to 2.7V and have low-power features. The Carmel DSP configurable-long-instruction-word (CLIW) architecture delivers very-long-instruction-word (VLIW) performance without sacrificing power-dissipation and code-compactness requirements. You can customize instructions through CLIW and obtain a high degree of parallelism, with the ability to simultaneously generate four addresses and perform four arithmetic operations and two data transfers. However, the compiler provides no support for this function, and you must hand-code CLIW instructions into your program. The Carmel DSP 20XX is binary-compatible with the 10XX and allows you to further modify and extend the core's instruction set.

Carmel DSP's modified Harvard architecture has separate program- and data-memory spaces. Carmel DSP features four 16-bit data buses and a 48-bit bus for reading four operands and two 24-bit instructions in a single cycle. The 10XX has six computation units that operate in parallel. These units include two 16×16-bit multiply-accumulate (MAC) units, two 40-bit ALUs, a 40-bit barrel shifter, and a 40-bit exponent unit supported by six 40-bit accumulators. The 20XX can have as many as 10 computation units that operate in parallel. These units can include four 16×16-bit MACs, two to four 40-bit ALUs, a 40-bit barrel shifter, a 40-bit exponent unit, and other accelerators for applications such as a 32-bit MAC for audio processing or quad 8-bit MACs for video processing. A programmer may split each accumulator into two 16-bit half-accumulators, which the program can use as a source or a destination.
In addition to accumulators, a program can fetch operands directly from memory and route them to any of the execution units. The program can also write the results directly to the same memory location without requiring a temporary register (that is, read-modify-write). This non-load-store design avoids the overhead typical of load-store architectures. The ALUs operate on 40-bit data and provide arithmetic and logic operations; back-trace support, which can accelerate Viterbi algorithms; and saturation and limit support. The MAC units operate on all combinations of signed/unsigned, 16-bit operands. Also, the MAC units can perform addition and subtraction operations, which allow the Carmel to complete four additions or subtractions per clock cycle. The barrel shifter supports both arithmetic and logical shifts by 0 to 40 bits left or right and rotate right through carry on 16-, 32-, or 40-bit operands. The exponent unit handles exponent detection and block floating point.

Addressing modes: The Carmel DSP family supports direct, indirect, index-by-register, index-by-immediate, and short and long addressing modes. Both cores support linear, bit-reversal, modulo, and special-modulo modification modes. The special-modulo mode enables you to stack memory buffers without the alignment restrictions usually associated with data buffers. Both cores can generate four independent 16-bit memory addresses.

Special instructions: The Carmel DSP uses a 24-bit instruction that you can extend to 48 bits for wider operand selection, larger immediate-operand fields, and direct-operand references. Carmel DSP's CLIW architecture extends the traditional DSP instructions into VLIW capability through an additional 96 bits of the CLIW memory. The application-specific CLIW instruction specifies as many as six parallel operations that can use the two ALUs and the two MAC units and perform two data moves on the 10XX.
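The saturation support mentioned for the 40-bit ALUs can be illustrated with a short Python sketch: a multiply-accumulate whose result is clamped to the 40-bit two's-complement range rather than wrapping, plus a clamp to 16 bits for storing a result. The function names are hypothetical, and the sketch assumes simple clamping; it does not model Carmel's mode bits or exact limiting rules.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp a signed integer to the range of a `bits`-wide two's-complement word."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, value))

def mac40(acc: int, x: int, y: int) -> int:
    """Multiply two 16-bit operands and accumulate, saturating the sum to 40 bits."""
    return saturate(acc + x * y, 40)
```

The 8 bits of headroom above a 32-bit product are what let a long sum of 16×16-bit products grow before saturation ever triggers.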
The Carmel DSP family supports conditional execution, using the Carmel DSP predication mechanism to avoid branch penalties, and fast context switching, using a register-bank-exchange instruction and a conditional execution-load instruction. The register-bank-exchange instruction allows you to specify which registers to shadow. The hardware-looping mechanism enables zero-overhead loops, nested to as many as four levels. The back-trace instructions accelerate the Viterbi-decoder implementation. All arithmetic/logic instructions support double precision. In addition, special instructions are available for square, division, minimum/maximum, block floating point, logical and arithmetic shifts, bit manipulations, fractional and integer arithmetic, limiting, saturation, and nearest and convergent rounding modes.

Support: Infineon provides an integrated program-development environment with uniform interfaces running under Windows. The software tools include an assembler/linker, a C compiler, and Tasking's (www.tasking.com) simulator and debugger. The simulator is instruction-, cycle-, bit-, and pipeline-accurate. Algorithm libraries are available as both C and assembly-language routines for common DSP applications and functions. An RTOS facilitates task-level debugging. Infineon provides an evaluation/development board that supports JTAG-based emulation using Carmel's on-chip debugging-support capability. These tools allow you to run programs in real time and within the application's hardware system. You can also check out Infineon's partner section at www.infineon.com/dsp.
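The difference between the nearest and convergent rounding modes named above shows up only on exact ties. The following Python sketch rounds a wide accumulator to a narrower result by dropping 16 low bits; it assumes that "nearest" means round-half-up and "convergent" means round-half-to-even, which is the usual DSP convention but is not spelled out in the text, and the function names are hypothetical.

```python
def round_nearest(acc: int, shift: int = 16) -> int:
    """Round half up: add half an LSB of the result, then drop `shift` low bits."""
    return (acc + (1 << (shift - 1))) >> shift

def round_convergent(acc: int, shift: int = 16) -> int:
    """Round half to even: exact ties go to the even neighbor, removing the upward bias."""
    half = 1 << (shift - 1)
    rounded = (acc + half) >> shift
    if acc & ((1 << shift) - 1) == half:  # the dropped bits are exactly one half
        rounded &= ~1                     # clear the LSB to land on the even value
    return rounded
```

For the value 2.5 (0x28000 with 16 fractional bits), `round_nearest` gives 3 and `round_convergent` gives 2; for 1.5 both give 2.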


LSI Logic ZSP DSPs


DSPs have a superscalar architecture. DSPs use static-branch prediction. Special instructions support Viterbi.

The ZSP architecture, which LSI Logic acquired in 1999, is a superscalar architecture comprising synthesizable cores and standard products. The ZSPs can issue as many as four instructions per cycle. The architecture's most prominent features are the dual multiply-accumulate (MAC) units and dual ALUs. It performs two 16-bit MACs or one 32-bit MAC per cycle. Results flow into a 40-bit accumulator. This C-friendly architecture implements an almost-orthogonal instruction set, flexible addressing modes, software-stack support, and a larger register file than that of traditional DSPs.

The core of the ZSP comprises a five-stage pipeline: fetch and decode, instruction group, read, execute operations, and write results into a register file. One processing unit handles instruction scheduling, making it easier for ZSP to develop custom instructions without affecting the datapath. The pipeline-control unit performs instruction grouping and resolves data and resource dependencies. It then dispatches the instructions to the individual functional units, which operate in parallel. Pipeline control also performs result bypassing and interrupt processing. Result bypassing moves results from functional units directly back into the pipeline without going through the write-back stage. The ZSP provides multitasking support and uses a six-level interrupt structure with programmable priority levels; a high-priority event can interrupt all instructions.

The ZSP's functional units share a register file of 24 16-bit registers. The core can simultaneously access only 16 of these registers. An instruction can swap eight of these registers. The register file serves as a source and destination for MAC operands. A 32-word instruction cache and a 48-word data cache are standard features. Separate 64-bit instruction and data buses feed these caches from memory.
An instruction prefetcher operates in parallel with the instruction-issue unit, grabs code to be executed from memory, and loads it into the cache. If the processor is processing a loop that is bigger than the 32-word cache, the ZSP's prefetcher fills the cache before the instruction execution. A data prefetcher operates in parallel with the execution unit, grabs data from memory, and loads it into cache. This feature allows the CPU to execute a load and a load-dependent instruction in the same cycle. The ZSP uses static-branch prediction; you should try to design your code so that program flow takes all backward branches. A mispredicted branch incurs a three- to five-cycle latency; this latency is only three cycles when the target instructions are still in the cache.

The DSP core contains an integrated non-cycle-stealing, eight-channel DMA unit. You can configure each channel to handle time-division-multiplexed data. The DMA unit has its own buses and operates independently of the processor. Two 16-bit timers are also standard features of the DSP core. LSI Logic's DSP has hardware support for four nested looping constructs. It also supports eight 32-bit or 16 16-bit barrel shifters. This support is possible because the DSP performs a single-cycle shift as large as 16 bits on each of the core's 16 registers. You can also concatenate two 16-bit registers and perform single-cycle, 32-bit shifts on that register combination.

Addressing modes: The ZSP performs bit-reversed addressing in hardware. The ZSP also has software support for bit-reversed addressing using a bit-reversing structure; an instruction flips the bits. The ZSP has hardware support for two circular buffers. The DSP has hardware support for immediate- or register-content-indexed, indirect, and register-to-register addressing modes.
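Bit-reversed addressing exists because a radix-2 FFT consumes or produces its data in bit-reversed index order. The Python sketch below shows the index mapping that such hardware (or the bit-flipping instruction) computes; the function names are hypothetical and the code illustrates the general technique, not the ZSP's implementation.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index (for example, 0b001 -> 0b100 when bits = 3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def bit_reversed_order(n: int):
    """Return the bit-reversed index sequence for an n-point FFT; n must be a power of two."""
    bits = n.bit_length() - 1
    if n < 1 or (1 << bits) != n:
        raise ValueError("n must be a power of two")
    return [bit_reverse(i, bits) for i in range(n)]
```

For an 8-point transform the access order is 0, 4, 2, 6, 1, 5, 3, 7. Doing this in the address generator removes a separate reordering pass over the buffer.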
Special instructions: The ZSP has add-compare-select and parallel-add and -subtract instructions for FFT and Viterbi; single-cycle bit-manipulation instructions; and specialized load instructions, which free programmers from the complexities of detailed pipeline management. Zero-overhead looping requires an "again" instruction to indicate the end of the loop and to perform the counter decrement.

Support: LSI offers a GNU-based compiler, a linker, and an assembler. The compiler supports new C fixed-point data types and employs a variety of C-intrinsic functions. Development platforms include an integrated debugging environment; flash; RS-232C and JTAG-based host communication and code downloading; two voiceband codecs; and analog I/O interfaces.
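The add-compare-select operation named above is the inner step of a Viterbi decoder: for each trellis state, add a branch metric to each of two predecessor path metrics, compare the sums, and keep the better one along with a decision bit for trace-back. The Python sketch below assumes that the smaller metric wins and that ties favor the first path; both are conventions chosen for illustration, and the function name is hypothetical.

```python
def add_compare_select(metric_a: int, branch_a: int, metric_b: int, branch_b: int):
    """Return (survivor_metric, decision_bit) for one trellis state.

    decision_bit is 0 when path A survives and 1 when path B survives.
    """
    candidate_a = metric_a + branch_a  # add
    candidate_b = metric_b + branch_b
    if candidate_a <= candidate_b:     # compare
        return candidate_a, 0          # select path A
    return candidate_b, 1              # select path B
```

A dedicated instruction collapses these two additions, the comparison, and the selection into a single operation per state.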

72 edn | March 30, 2000

www.ednmag.com


Lucent/Motorola StarCore SC100


DSP architecture is scalable. Variable-length instructions yield code efficiency and parallelism. DSP developed in conjunction with compiler for better C programming.

The StarCore architecture, which Lucent and Motorola collaborated to develop, is the foundation for a family of DSP cores. The first implementation of this architecture is the StarCore SC140 core, which Motorola has used in its MSC8101 DSP product. This device operates at 1.5V and 300 MHz and includes the company's 32-bit RISC core, extracted from the popular PowerQUICC II. A key aspect of the StarCore architecture is its variable-length-execution-set (VLES), or explicitly parallel instruction computing (EPIC), model that promotes scalable resources (such as multiple ALUs), a scalable instruction-set architecture, and increased orthogonality. Traditional very-long-instruction-word (VLIW) architectures use fixed-length instructions in a fixed-length execution set. This architecture sometimes requires the compiler to insert nonoperating instructions to fill unused instruction slots, and the instructions and execution sets have memory-alignment restrictions. On the other hand, VLES allows variable-length instructions with no alignment restrictions; specifically, the SC140 uses 2-, 4-, or 6-byte instructions.

To further increase flexibility and scalability, StarCore uses the prefix construct to add features to instructions. You may add one or two prefixes per instruction to accommodate additional registers and conditional execution. The StarCore architecture can also group multiple instructions into execution sets, which the architecture executes simultaneously. Two 64-bit data buses and a 128-bit-wide program-data bus allow the core to fetch as many as two prefixes and six instructions per cycle. The SC140 contains 16 40-bit registers and 27 32-bit address registers. Another key StarCore feature, the associated compiler, can detect parallelism and group independent instructions.
The SC140's compiler statically builds execution sets comprising one to six instructions. Relying on the compiler to encode the parallelism minimizes the silicon resources for instruction decoding and dispatching. The compiler performs a variety of optimizations, including software pipelining, function inlining, if conversion to exploit predicated execution, global scheduling, and sophisticated loop analysis and transformations to exploit zero-overhead nested loops. Traditional DSPs do not efficiently handle runtime stacks; in many cases, the DSPs provide no architectural support, or addressing modes using the stack are less efficient than those using absolute addressing. The SC140 compiler implements compiled stacks to provide the functions of stack accesses at the cost of absolute addressing, thus avoiding extra runtime overhead. The SC140 compiler also implements DSP-memory-specific optimizations, such as modulo addressing, postincrement addressing, and memory-access vectorization.

The SC100 incorporates a five-stage pipeline surrounded by four parallel data ALUs (DALUs). Each DALU contains a multiply-accumulate (MAC) unit and a bit-field unit (BFU). The MAC contains a multiplier and an adder that can perform a 16×16-bit multiply and add the 40-bit accumulator to the result. The BFU contains a 40-bit, parallel, bidirectional shifter with a 40-bit input and a 40-bit output, a mask-generation unit, and a logic unit. The BFU can perform multibit or single-bit left/right shift, bit-field insertion or extraction, count of leading bits, and several logical operations. The CPU can perform as many as four parallel arithmetic operations, logic operations, or both in each cycle. A program can use the destination of every arithmetic operation as a source operand for the operation immediately following with no time penalty. A separate bit-manipulation unit sets, clears, inverts, or tests a selected group of bits in a register or memory.
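The bit-field insertion and extraction that the BFU performs combine a shift with a generated mask. The Python sketch below shows the arithmetic on plain integers; the function names and argument order are hypothetical and do not correspond to StarCore mnemonics.

```python
def extract_field(word: int, offset: int, width: int) -> int:
    """Return the `width`-bit field of word that starts at bit `offset`."""
    return (word >> offset) & ((1 << width) - 1)

def insert_field(word: int, field: int, offset: int, width: int) -> int:
    """Return word with the `width`-bit field at `offset` replaced by `field`."""
    mask = ((1 << width) - 1) << offset           # what a mask-generation unit would build
    return (word & ~mask) | ((field << offset) & mask)
```

Extracting 8 bits at offset 4 from 0xABCD yields 0xBC, and inserting 0x12 into the same position yields 0xA12D. In software each call costs a shift, a mask build, and one or two logical operations; a bit-field unit does the same work in one instruction.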
The SC140 also contains two address-generation units (AGUs), which perform effective address calculations using the integer arithmetic necessary to address data operands in memory. They also contain the registers for generating the addresses. The AGUs implement linear, modulo, multiple-wrap-around-modulo, and reverse-carry arithmetic. They operate in parallel with other chip resources to minimize address-generation overhead. Two registers in each AGU support software-stack operation: the normal-mode and the exception-mode stack pointers. These registers can perform predecrement and postincrement updates. The existence of two stack pointers in the SC140 eases support of multitasking systems and optimizes stack usage for these systems. A program-control unit includes a program sequencer, which fetches and dispatches instructions, performs loop and branch control, and processes exceptions. The SC140 can handle as many as four hardware-nested do loops. The program sequencer detects the parallel execution set.

Addressing modes: The SC140 supports register-direct, address-register-indirect, and program-counter-relative addressing modes, as well as special address modes that use an immediate value to determine the data or the address of interest.
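The modulo arithmetic that the AGUs implement is what makes circular buffers free: a pointer that is postincremented past the end of a buffer wraps back to its start without a compare-and-branch. The Python sketch below models that update; the function names and the (base, length) parameterization are assumptions for illustration and do not reflect the SC140's register layout.

```python
def post_increment_modulo(addr: int, step: int, base: int, length: int):
    """Return (address to use now, updated pointer) for a circular buffer.

    The buffer occupies addresses base .. base + length - 1; step may be negative.
    """
    next_addr = base + (addr - base + step) % length
    return addr, next_addr

def walk(base: int, length: int, step: int, count: int):
    """List the addresses touched by `count` consecutive postincrement accesses."""
    addr = base
    touched = []
    for _ in range(count):
        used, addr = post_increment_modulo(addr, step, base, length)
        touched.append(used)
    return touched
```

Stepping forward through a four-word buffer at address 100 visits 100, 101, 102, 103 and then returns to 100, which is the access pattern of an FIR delay line.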

Special instructions: The SC140 multipliers support all combinations of signed and unsigned operands and both fractional and integer formats. The MAC units support add, subtract, negate, absolute value, and clear. The MAC units also support division iteration, comparison, maximum/minimum operations, transfers between registers, arithmetic-shift operations, and rounding. The SC140 supports a single-instruction-multiple-data version of maximum/minimum, additions, and subtractions (MAX2, ADD2, SUB2) by treating values in registers as packed pairs of 16-bit data operands. Using these instructions, the SC140 can perform eight 16-bit additions or maximum/minimum operations per cycle. The SC140 includes MAX2VIT, a special version of the maximum/minimum operation that works with the Viterbi shift-left instruction, a specialized move instruction that supports efficient implementation of Viterbi decoding algorithms.

Support: Tools include an assembler, an optimizer, a linker, a simulator, and an ANSI C- and C++-compliant C/C++ compiler. The compiler provides intrinsic support for International Telecommunications Union/European Telecommunications Standards Institute primitives. Green Hills Software (www.ghs.com) will also provide C/C++ support with its Multi development environment.
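The packed-pair instructions described above treat one register as two independent 16-bit values. The Python sketch below models ADD2 and MAX2 on a 32-bit word holding two signed 16-bit halves. The helper names, the 32-bit container, and the wrap-around on overflow are assumptions for illustration; the SC140's 40-bit registers and exact overflow behavior are not modeled.

```python
def unpack2(word: int):
    """Split a 32-bit word into two signed 16-bit halves, returned as (high, low)."""
    def signed16(value: int) -> int:
        return value - 0x10000 if value & 0x8000 else value
    return signed16((word >> 16) & 0xFFFF), signed16(word & 0xFFFF)

def pack2(high: int, low: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)

def add2(a: int, b: int) -> int:
    """Add the two halves independently; each half wraps on overflow (an assumption)."""
    a_high, a_low = unpack2(a)
    b_high, b_low = unpack2(b)
    return pack2(a_high + b_high, a_low + b_low)

def max2(a: int, b: int) -> int:
    """Take the signed maximum of each half independently."""
    a_high, a_low = unpack2(a)
    b_high, b_low = unpack2(b)
    return pack2(max(a_high, b_high), max(a_low, b_low))
```

With four data ALUs each executing one such two-lane operation, the count of eight 16-bit additions per cycle quoted above follows directly.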


Lucent Technologies DSP16xx


Device has a 16×16-bit multiplier. Features include a 36-bit ALU/shifter. All devices have internal ROM. Devices operate at 2.7 to 4.75V.

Lucent Technologies sells its DSP16xx-based products to the modem, wireless, and digital-telephony markets. The main execution unit of the DSP16xx is the data-arithmetic unit, which has a 16×16-bit multiplier and a 36-bit ALU/shifter with 4 guard bits and two accumulators. The dual accumulators let you halve the number of memory accesses and thus are useful for tasks such as computing autocorrelation lags. The multiplier and adder operate in parallel, and the multiplier has registered inputs and outputs. The multiply-accumulate (MAC) unit has a three-stage pipeline for fetching, multiplying, and accumulating. The MAC can shift the multiply result before running it through the ALU/shifter and into one of the accumulators. The instruction-stream pipeline has fetch, decode, and execute stages and runs in parallel with the MAC. The shallow pipeline limits the cost of branches to two cycles. The DSP16xx has an exposed pipeline, letting a programmer see data at any point. The programmer controls the fetching of data into the ALU and controls the multiply and add. This method minimizes the number of registers needed to hold temporary data and, therefore, minimizes power consumption and die size.

The DSP's X memory contains both program and coefficients and would typically become a bottleneck for MACs. However, for fast inner-loop processing, the program uses special instructions to load an inner-loop code block into a 15-instruction cache. Alternatively, if a DSP had program, X, and Y memories, an algorithm with few instructions would be an inefficient use of memory. The other advantage of the instruction cache is in power savings. The DSP16xx uses fixed-point, two's-complement arithmetic throughout. The bit-manipulation unit has a 36-bit barrel shifter, two 36-bit accumulators, and four general-purpose 16-bit registers.
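One way to read the claim that dual accumulators halve memory accesses for autocorrelation is that two adjacent lags can be accumulated in the same pass, so each fetched sample feeds both sums. The Python sketch below illustrates that idea only; it is an interpretation, not Lucent's code, and the function name and loop structure are hypothetical.

```python
def autocorr_pair(x, lag: int):
    """Return (r[lag], r[lag + 1]) in one pass, where r[k] = sum of x[i] * x[i + k]."""
    acc0 = 0  # accumulator for r[lag]
    acc1 = 0  # accumulator for r[lag + 1]
    n = len(x)
    for i in range(n - lag - 1):
        sample = x[i]                    # one fetch shared by both accumulators
        acc0 += sample * x[i + lag]
        acc1 += sample * x[i + lag + 1]
    if n > lag:                          # final term exists only for the shorter lag
        acc0 += x[n - lag - 1] * x[n - 1]
    return acc0, acc1
```

With a single accumulator, computing the two lags would take two separate passes over the sample buffer.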
The DSP16xx family, with its classic Harvard architecture, uses three internal buses to move instructions, coefficients, and data in parallel for high-throughput processing. The DSP defines two 64k-word address spaces: one for program and coefficients and one for data. Both X and Y buses connect to the same dual-port RAM. If references occur simultaneously to both ports of a bank, the chip incurs a one-instruction-cycle penalty and performs the data access first. Memory writes always take two cycles. A special address cycle, a compound addressing mode of the MAC units, allows both a read and a write to memory. The DSP has X- and Y-memory address generators, the XAAU and YAAU, each with its own internal adders and registers to hold address values and offsets. The XAAU has a 12-bit adder, a 12-bit static-offset register, and four 16-bit pointer registers. The YAAU has eight static registers and an adder. Programmers can access XAAU and YAAU registers. The X side has half as many registers as the Y side because signal processing typically requires fewer coefficient pointers. (The DSP stores coefficients on the X side.) Also, the Y side points to memory that requires more pointers.

Addressing modes: The DSP16xx has register- and memory-direct, register-indirect, and immediate addressing modes; it has no bit-reversed addressing.

Special instructions: Instructions for the DSP16xx include single/block-instruction hardware looping, conditional subroutine call, compare, compound addressing, exponent detection, bit-field extraction, bit shifting, and replacement. The DSP16xx has no rotation instructions.

Support: Lucent Technologies supplies a hardware-development system with an in-circuit-emulator pod. Evaluation and demo boards are also available. The company sells software-development tools, including an assembler/linker, a debugger, a simulator, and an application library. Lucent offers a Linkable Functional Simulator, a DSP simulator model that plugs into system-level simulation tools from EDA vendors.
This model allows you to develop your application at the system level using building blocks and to determine whether your design has the bandwidth to perform the task. The company also provides a cycle-accurate model of Lucent's DSP.


Lucent Technologies DSP16000


The DSP16000 has dual multiply-accumulate (MAC) units and supports 16×32- and 32×32-bit multiplies. The device's ALU supports 16-, 32-, and 40-bit operands; 32-bit datapaths come from the X and Y memories. Although this core is source-code backward-compatible with Lucent's DSP16xx core, significant differences exist between the two cores. The DSP16000 core performs two 32-bit fetches, two multiplies, and two accumulates in one cycle. The dual MAC units on the DSP16000 are part of the DSP's three-stage, parallel pipeline that begins with dual 32-bit registers. These registers serve as four 16-bit register inputs to the two signed 16×16-bit multipliers. The DSP16000's dual MAC architecture enables efficient, mixed-precision, 16×32-bit and double-precision, 32×32-bit multiplies.

You can use the data-arithmetic unit (DAU) to direct the multipliers' outputs to a 40-bit ALU with add-compare-select (ACS) capability; a 40-bit bit-manipulation unit (BMU); or a 40-bit, three-input adder/subtracter. The ALU supports 16-, 32-, and 40-bit operands and can perform specialized compare instructions to accelerate minimum and maximum operations. In addition, these compare instructions can store trace-back bits as a side effect to support Viterbi processing. The BMU performs operations such as insert, extract, and rotate bits. The DSP16000 lacks a barrel shifter, so the BMU must perform shifts, but the shifts take more than one cycle. The separate three-input adder/subtracter allows a 40-bit addition or subtraction in parallel with another operation that uses the ALU. The DSP can simultaneously send the result of an arithmetic operation into an accumulator and into the multipliers' input registers. This feedback loop avoids an explicit move instruction when you use the result as an input to a subsequent multiply.
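Mixed-precision and double-precision multiplies on 16×16-bit hardware rest on splitting each 32-bit operand into a signed high half and an unsigned low half and summing shifted partial products. The Python sketch below shows that decomposition. The function names are hypothetical, and how the DSP16000 sequences the partial products or handles the unsigned low halves on its signed multipliers is not described in the text, so treat this as the underlying arithmetic only.

```python
def split32(value: int):
    """Split a signed 32-bit value into (signed high half, unsigned low half)."""
    low = value & 0xFFFF
    high = (value - low) >> 16
    return high, low

def mul16x32(a16: int, b: int) -> int:
    """16x32-bit multiply built from two 16x16-bit partial products."""
    b_high, b_low = split32(b)
    return ((a16 * b_high) << 16) + a16 * b_low

def mul32x32(a: int, b: int) -> int:
    """32x32-bit multiply built from four 16x16-bit partial products."""
    a_high, a_low = split32(a)
    b_high, b_low = split32(b)
    return ((a_high * b_high) << 32) + ((a_high * b_low + a_low * b_high) << 16) + a_low * b_low
```

A 16×32-bit product needs two partial products and a 32×32-bit product needs four, so two multipliers working in parallel halve the number of multiply steps in each case.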
The register file contains eight nonorthogonal, 40-bit accumulators, which minimize a programmer's need to swap values between memory and registers, minimize code size, and allow more efficient compiler implementations. The DSP16000 can perform overflow saturation on the multiplier output and on the outputs of the three arithmetic units. Overflow saturation can also affect an accumulator value as program control transfers it to memory.

The DSP16000's code and data transfers rely on a modified Harvard architecture with separate 20-bit-address and 32-bit-data buses for the instruction/coefficient (X) and data (Y) memory spaces. The on-chip X and Y memories each have a 32-bit datapath to the X and Y registers, an essential ingredient in keeping the dual MAC units fed. Although the DSP can load the 32-bit X and Y registers in parallel with execution of one or two multiply operations, you must arrange the 16-bit operands as pairs in memory. In other words, on each 32-bit fetch, the data word at Address 0 goes into one multiplier, the data word at Address 1 goes into the other, and so on.

An on-chip cache holds as many as 31 32-bit instructions. You must use this cache with a do-loop instruction. When you use a do instruction, the cache-control circuitry loads the subsequent instructions into the cache as the pipeline executes them. Once the circuitry loads the loop into cache, the core can execute the loop as many as 65,535 times without overhead. The cache frees the instruction bus for X-memory fetches, allowing the DSP16000 to perform as a three-bus architecture. The DSP16000 does not perform nested looping; that is, you can perform cache-based, zero-overhead looping on only the innermost loop. The core supports as much as 1M words with a 20-bit address bus.

Addressing modes: The DSP16000 supports register- and memory-direct, register-indirect, immediate, and register+displacement addressing modes.
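The requirement to arrange 16-bit operands as pairs in memory can be pictured as follows: each 32-bit fetch delivers two adjacent samples, one to each multiplier, and two running sums are combined at the end. The Python sketch below models a dot product organized that way. Which half of the word feeds which multiplier, and all function names, are assumptions for illustration.

```python
def signed16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pack_pairs(samples):
    """Pack consecutive 16-bit samples two per 32-bit word (even index in the low half)."""
    if len(samples) % 2:
        raise ValueError("need an even number of samples")
    return [((samples[i + 1] & 0xFFFF) << 16) | (samples[i] & 0xFFFF)
            for i in range(0, len(samples), 2)]

def dual_mac_dot(x_words, y_words) -> int:
    """Dot product over packed words: one fetch per operand feeds two multiplies."""
    acc_even = 0
    acc_odd = 0
    for x_word, y_word in zip(x_words, y_words):
        acc_even += signed16(x_word) * signed16(y_word)              # multiplier 0
        acc_odd += signed16(x_word >> 16) * signed16(y_word >> 16)   # multiplier 1
    return acc_even + acc_odd
```

An N-tap sum of products then takes N/2 loop iterations, provided both operand arrays are laid out so that the elements to be multiplied together fall in matching halves.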
Because the device offers no bit-reversed addressing, software must perform this task. The DSP16000 supports two concurrent circular buffers. Addressing modes are oriented toward pointer arithmetic.

Special instructions: The DSP16000 supports a mixture of 16- and 32-bit instructions. Conditional execution of many instructions avoids branch penalties; branches take three cycles. A redo instruction re-executes code that has already been loaded into the cache using the do instruction. The DSP16000's trace-back encoder accelerates Viterbi decoding and performs mode-controlled side effects for Viterbi compare instructions. The side effects allow the DAU to store, without overhead, state information necessary for trace-back decoding. Additionally, you can use the compare instructions for determining the least common paths for Viterbi processing. Other special instructions include rounding, negation, absolute value, and fixed arithmetic.

Support: The DSP16000's software tool set includes an ANSI C compiler, an assembler, a linker, a debugger, and a simulator. Hardware tools include an in-circuit emulator and a development board. The C compiler (based on GNU C) performs numerous local and global optimizations to produce optimized code, emits information to enable C source debugging, and allows mixed C and assembly code. The assembler supports the ANSI C preprocessor to allow file inclusion, macro substitution, conditional assembly, and various numeric-constant formats. The assembler also allows expressions to include multiple user-defined labels and supports the Tcl preprocessor directives to allow the assembler to share macro operations with the debugger.

The debugger supports integrated debugging of one processor or multiple homogeneous or heterogeneous processors. It supports data and instruction breakpoints, software simulation with near real-time visibility, mixed assembly and C debugging, extensive code profiling, stand-alone or networked hardware emulation through the JTAG with the TargetView Communication System, hardware trace, and on-chip cycle count. Extensive on-chip debugging hardware lets you monitor many DSP16000s in real time. Another feature of the debugger is an architectural view, which provides a block-diagram view of the DSP's multiple processing units. As you step through the instructions of your application code in the debugger, the architectural-view utility graphically displays the data flow through the DSP. This feature enables you to view underused parts of the architecture and make code changes to increase code efficiency. Third-party tools, such as Synopsys COSSAP, Cadence SPW, and Mathworks Matlab, support the DSP16000 simulator. The software tools cost $1500, and the hardware tools cost $5000 to $7000.


Massana FILU-200 DSP coprocessor core


DSP subsystem uses host-processor resources. All registers include guard bits. DSP integrates a sine/cosine function to improve FFT performance.

Rather than create an entire DSP with all the bells and whistles of a stand-alone processor, Massana developed a 16-bit, fully synthesizable DSP coprocessor core that requires a host processor. This approach allows the company to deliver real DSP capability and use only 30,000 gates. The FILU-200 is a dual-multiply-accumulate (MAC)-instruction architecture with dual barrel shifters, a 20-bit internal datapath, and 44-bit accumulators. The architecture has 10 22-bit registers, which have 2 guard bits that allow your program to add two 20-bit numbers without worrying about overflow. The data RAM is 40 bits wide, allowing you to store two 20-bit values to maintain full precision.

Massana claims that the FILU-200 is compatible with all RISC processors. The host views the FILU-200 as a memory-mapped peripheral; the host-FILU interface provides the host with access to the three FILU RAMs and to the FILU control word. The device pipelines host accesses to the RAMs to separate the critical paths on the host I/O and RAM I/Os. The interface uses a bus-request/grant or master/slave protocol to share the RAM. When the FILU-200 is busy, it sets a busy bit in RAM. The host can access the RAM even when the FILU-200 is busy, but doing so stalls the FILU-200.

The FILU-200 comprises a fetch-decode-execute instruction pipeline that is maintained regardless of whether the CPU retrieves the function from program ROM, data RAM, or the 96-bit-wide program RAM. If the executing macroinstruction contains a jump, the CPU must execute the three macroinstructions in the pipeline before the jump takes effect. When executing from ROM, the CPU retrieves a very long instruction word (VLIW) from ROM every cycle. This retrieval provides the DSP with access to the parallelism of all the functional units.
The FILU-200 can perform as many as 19 operations in parallel every cycle; however, each of these operations must come from a different functional unit within the core. When the CPU executes from program or data RAM, the compiler encodes the VLIW to fit into the width of the memory, but this encoding reduces parallelism. When the CPU executes from data RAM, data accesses conflict with program accesses, as you would expect from a von Neumann architecture. But a programmer typically stores the key DSP functions in program ROM. The data RAM is 40 bits wide and comprises two consecutive 20-bit signed words that the CPU loads into Xn and Yn registers. Similarly, for writing data into memory, the CPU simultaneously stores the Xn and Yn registers in RAM.

The core contains a program-control unit (PCU), an address-generation unit (AGU), and a computational unit (CU). The CU implements two 22×16-bit signed or unsigned MAC instructions, as well as two barrel shifters that provide arithmetic shift (with saturation and convergent or nonconvergent rounding) and logical shifts. The CU also includes the block-exponent unit (BEU), which detects the exponent of the largest number in an array of numbers. Before the CPU shifts the data stored in the accumulators through the barrel shifters, the BEU compares the exponent with the MX register and updates the value accordingly. The FILU-200 has a six-deep hardware stack, allowing your program to contain as many as six nested subroutine calls. The DSP also supports zero-overhead block floating-point hardware.

The AGU provides dedicated hardware to support FFT-specific addressing needs. The AGU also indexes the sine/cosine ROM, which contains twiddle factors for the FFT operation. The AGU can perform as many as three parallel addressing instructions in each cycle. The AGU includes the data-addressing unit that provides linear, postincrement by N, radix-4 data, and bit-reversed increment and decrement operations.
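Block floating point, which the block-exponent unit supports, keeps one shared exponent for a whole array: find how much headroom the largest-magnitude element has, then shift every element by that amount. The Python sketch below shows the bookkeeping for signed 20-bit words, matching the 20-bit datapath mentioned earlier. The function names are hypothetical, and the sketch does not model the MX register or the FILU-200's actual update rule.

```python
def headroom(value: int, bits: int = 20) -> int:
    """Number of left shifts a signed `bits`-wide value can take without overflowing."""
    if value < 0:
        value = ~value  # maps -1 to 0, so negative values count redundant sign bits too
    return bits - 1 - value.bit_length()

def block_normalize(block, bits: int = 20):
    """Shift every element left by the headroom of the largest one; return (scaled, shift)."""
    shift = min(headroom(value, bits) for value in block)
    return [value << shift for value in block], shift
```

The returned shift is the block exponent: downstream code divides by 2**shift (or tracks it across FFT stages) to recover the true scale while the data itself uses the full word width.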
The AGU also includes the twiddle-addressing unit, which provides linear, postincrement by N, twiddle-factor linear, and radix-4 increment operations. A loop-control unit provides zero, data-move, increment, and decrement operations on all of the address and counter registers. A sine/cosine unit for FFT and rotor applications looks up coarse and fine sine and cosine values in the FILU-200's sine/cosine ROM.

Addressing modes: The FILU-200 supports direct addressing but only between address, counter, and data registers. It supports data-register-indirect and address-register-indirect modes. Every address register supports a linear-addressing mode, but it is postincrement or postdecrement only. A radix-4 butterfly-addressing mode provides data and twiddle access; the input data must be in order, and the output data is in bit-reversed order. The FILU-200 supports no immediate addressing.

Support: The FILU development system includes a 20-MHz FPGA implementation of the FILU-200. The platform also includes a 16-bit stereo-audio codec, a prototyping area, an LCD interface, an RS-232 interface, and a Motorola MCore as host. Each DSP core comes with a runtime library of common DSP routines, including FFT, FIR, IIR, and matrix and vector functions, that the host calls as C-language routines. An application-programming interface (API) allows the host programmer to access FILU RAM, to start execution of FILU-200 functions, and to check the status of FILU-200 execution. During host-code development, the C calls within the API initiate functions that simulate the behavior of the FILU-200 hardware. Massana's instruction-set simulator provides bit-true, cycle-accurate simulation of the FILU-200 in C. Massana provides a FILU-DMT, which is a FILU-200 core with preprogrammed G.Lite DSP functions and a synchronous serial interface for the analog front end (AFE). The AFE provides the ADC, DAC, interpolation, decimation, and front-end analog filtering in a single chip. The FILU-DMT is available on a PCI card, which enables a user to develop a soft G.Lite implementation on real-time hardware. The card includes the FILU-DMT implemented in a 20-MHz FPGA; a 33-MHz, 32-bit PCI interface; a G.Lite AFE using TI's TLFD500 codec; line drivers; and a data-access arrangement.

www.ednmag.com

80 edn | March 30, 2000
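The block-exponent detection that the FILU-200 description attributes to the BEU can be illustrated in software. The following is a minimal sketch, not Massana's implementation: it assumes 20-bit signed data words, and the names `headroom` and `block_normalize` are hypothetical. It finds how far the largest-magnitude sample in a block can shift left without overflowing and then applies that one shift to the whole block, which is the basic idea behind block floating point.

```python
WORD_BITS = 20  # assumption: 20-bit signed data words, as in the FILU-200 description


def headroom(value, bits=WORD_BITS):
    """Return how many places value can shift left and still fit in a signed word."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    shift = 0
    while shift < bits - 1 and lo <= (value << (shift + 1)) <= hi:
        shift += 1
    return shift


def block_normalize(block, bits=WORD_BITS):
    """Shift every sample left by the headroom of the largest-magnitude sample.

    Returns the scaled block and the shared shift (the block exponent).
    The block must not be empty.
    """
    shift = min(headroom(v, bits) for v in block)
    return [v << shift for v in block], shift


scaled, exponent = block_normalize([3, -100, 1000])
print(scaled, exponent)
```

Keeping one exponent per block rather than per sample is what lets fixed-point hardware retain precision across FFT stages without full floating-point units.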

dspdirectory 16

bits

Motorola DSP56800
The DSP family's parallel-instruction set controls three concurrent execution units: the data ALU, the address-generation unit (AGU), and the program controller. The general-purpose, C-style instruction set with its flexible addressing modes and bit-manipulation instructions enables you to write control code without worrying about DSP complexities. The data ALU provides single-cycle multiplies and multiply-accumulate (MAC) instructions with 36-bit accumulation (4 guard bits), as well as a set of logical and arithmetic operations. The ALU contains X0, Y0, and Y1 input registers; two accumulators, which can also serve as input registers; a MAC unit; a 16-bit barrel shifter; and automatic saturation logic. You can write ALU results back to either of the accumulators. Additionally, if you don't expect the ALU result to be 36 bits, then the result can go directly back to one of the three input registers without corrupting an accumulator value. The AGU can provide two data-memory addresses with address updates in one cycle. It contains five 16-bit pointer registers (one functioning as a stack pointer), an offset register, a modifier register for circular-buffer support, and two address ALUs (one supporting modulo arithmetic) to fetch two data items from memory every instruction cycle. The stack pointer has several addressing modes, improving compiler performance and supporting structured programming techniques, such as parameter passing and local variables. The 56800 supports an interruptible hardware do loop on a block of instructions of any size. In a set of nested loops, a programmer generally uses hardware looping for the innermost loop. Then, you can perform the outer loops using software looping and the 56800's data ALU register, AGU register, or a memory location to store the loop counter. To improve the performance of software looping, the 56800 supports a decrement instruction that operates directly on X memory and uses a conditional branch operation.
Furthermore, Motorola added an addressing mode that requires no address calculation and allows direct access to the first 64 locations in X memory; this approach makes the access faster than a long immediate access.

Addressing modes: The 56800 supports register-direct, short and long memory-direct, seven memory-indirect, and immediate addressing modes. It also supports short-branch offset and modulo arithmetic for circular buffers.

Special instructions: The 56800 performs hardware-do and -repeat looping on one instruction or a block of instructions. Single and dual parallel-move instructions perform memory accesses in parallel with ALU operation, allowing two data-memory accesses while fetching an instruction. The 56800 can perform bit-manipulation operations on any register or memory location, and it can perform single-cycle multiply and MAC with optional rounding, addition, subtraction, and squaring. Using a conditional transfer instruction with a compare instruction implements searching and sorting algorithms. If the specified condition is true, then the DSP performs a transfer from one register to another (for example, to store the array index of the maximum value in an array).

Device mixes DSP with control functions. Architecture features an interruptible hardware do loop. Device operates at 2.7V and 70 MHz.

Support: The 56800 uses Motorola's OnCE port for on-chip emulation through a standard JTAG interface. Metrowerks CodeWarrior, which Motorola now owns, offers an integrated development environment, a C compiler, an assembler, a linker, a code simulator, and a graphical source and assembly-level debugger for PCs. This package includes a 30-day evaluation license for CodeWarrior in the evaluation module (DSP56824EVM) and development system (DSP-56824ADS).
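The role of the 56800's 4 guard bits and saturation logic can be shown with a small model. This is a minimal sketch under stated assumptions: it treats the accumulator as a 36-bit two's-complement value that saturates to 32 bits only when the result is read out, and the helper names (`wrap`, `saturate`, `mac_sum`) are hypothetical rather than Motorola's terminology.

```python
ACC_BITS = 36  # 36-bit accumulation (32 bits plus 4 guard bits), per the text
OUT_BITS = 32  # assumption: results saturate to 32 bits when read out


def wrap(value, bits):
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate(value, bits):
    """Clamp value to the signed range of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def mac_sum(xs, ys):
    """Accumulate 16x16-bit products in a 36-bit accumulator, then saturate."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = wrap(acc + x * y, ACC_BITS)
    return saturate(acc, OUT_BITS)


# The running sum exceeds 32 bits midway, but the guard bits hold it,
# so the final result is exact once the negative products pull it back.
print(mac_sum([32767, 32767, 32767, -32768, -32768], [32767] * 5))
```

Without guard bits, the third product in this example would already have overflowed a 32-bit accumulator and corrupted the final sum.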

NEC SPRX DSP


NEC's µPD7701x, or SPRX, family of DSP cores also includes ASIC macros that integrate SPRX cores plus memories, serial interfaces, and host-CPU interfaces. The cores feature a pipelined multiply-accumulate (MAC) unit; a barrel shifter; and eight general-purpose, 40-bit register/accumulators. The µPD7701x has dual external-memory ports, one for 16-bit data and one for 32-bit programs, with two distinct 16k-word address spaces for data. The 32-bit instruction word helps to increase code efficiency by allowing a variety of operation parallelism. Memory-read and -write accesses can take a single cycle, although instruction pipelining may require an extra cycle for some instructions. A programmable wait-state generator allows you to divide each of the three external memory spaces into four regions and control the wait states of each region. The 7701x also supports an 8-bit-wide host interface and two serial channels. You can configure each of these for interrupt servicing or polling and to transfer 8- or 16-bit data. You can configure each of these serial channels as most- or least-significant-bit-first data format.

Cores are for NEC's 0.25- and 0.35-µm ASIC technology. Device features three internal buses for X- and Y-data and transfer buses. Unit has 16-bit data words and 32-bit instruction words. MAC unit provides 8 guard bits with no automatic saturation. Circular buffers are of arbitrary size and arbitrary increment amounts.

The 7701x comprises a data unit, a program unit, and a peripheral set. The data unit contains X- and Y-memory units, each of which has an address generator, eight 40-bit general-purpose registers, and a MAC-execution unit. The program unit contains the instruction-address unit with loop control, interrupt-control logic, program memory, and instruction-decode/control logic. A transfer bus links the two main units. Each main unit connects to an external data-memory interface: 14-bit addresses and 16-bit data for the data unit. The MAC unit has three 40-bit parallel subunits: a multiplier, a 40-bit ALU, and a 40-bit barrel shifter. Unlike many other DSP implementations, the MAC subunits have no dedicated input and output registers. Instead, the MAC is tightly integrated with a set of eight general-purpose, orthogonal registers. The core uses the X, Y, and transfer buses to load data into the general register set; the general register set provides the data to drive the MAC subunits, which can execute concurrently. In effect, the general register set, which is basically a multiport register file, serves as the interchange that links the data to the execution side of the processor. Two two-word RAM data-memory banks supply the X- and Y-data components for each MAC cycle. Each bank has its own address generator with a set of four address-pointer registers. Each unit also has an index-register link to the main data bus. Through this bus, code can load and modify the pointer and modification registers. Each unit also has a modulo register for circular buffering. Autoincrement and autodecrement can be by one, by the value stored in special registers, or by an immediate value. A special bit-reverse circuit handles bit-reversed addressing for each bank. You must directly load internal RAM under program control. Additionally, the DSP hardware supports automatic interruptible looping with a four-level loop stack that lets code nest so that it can loop under hardware control. The 7701x devotes 64 words of internal instruction RAM to interrupt vectors. Each interrupt handler comprises four instruction words, so you can code a short interrupt handler within the vector itself. You can independently enable interrupts, and the DSP services them on a fixed-priority basis. The program stack supports both call-return and interrupt nesting, which together can total as many as 15 calls deep.
You can nest zero-overhead loops of as many as 255 instructions as many as four levels deep, or deeper if you use software to save and restore the stack. You can also nest single-instruction repeats within any of these loops.

Addressing modes: The 7701x supports memory-direct, register-indirect, and immediate addressing. Hardware supports modulo and bit-reversed addressing for each data memory.

Special instructions: The 7701x supports conditional operations to minimize jumps, parallel register load/store, 1-bit shift-multiply-add, clip result, register-indirect subroutine call, register-indirect jump, zero-overhead single- and block-instruction hardware loop, repeat, floating-point normalization, and double precision. The 7701x lacks bit-manipulation instructions.

Support: NEC supplies the WB77016 Workbench assembler/linker/loader package. NEC also supplies the SM77016 simulator, which simulates I/O through timing files that you write in a programming language. This simulator allows you to control I/O details, such as inserting data into or extracting data from a running simulation. A PC-based plug-in development board offers in-circuit emulation using the 7701x's on-chip emulation features. NEC also offers a C compiler for the µPD7701x and a starter kit and has ported the SPOX real-time OS to 7701x DSPs. A variety of middleware, or software, libraries is also available from NEC.
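Bit-reversed addressing, which the 7701x and several other DSPs in this directory support in hardware, exists because an in-place radix-2 FFT reads or writes its data in bit-reversed index order. The sketch below shows only the index sequence that such an address generator produces; it is an illustration in Python with hypothetical function names, not a model of NEC's circuit.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Return the bit-reversed index sequence for an n-point buffer.

    n must be a power of two, as it is for a radix-2 FFT.
    """
    bits = n.bit_length() - 1
    if n <= 0 or (1 << bits) != n:
        raise ValueError("n must be a positive power of two")
    return [bit_reverse(i, bits) for i in range(n)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Doing this reordering in the address generator means the FFT loop spends no instructions on shuffling data.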

Philips REAL DSP


Philips invented the REAL (reconfigurable embedded DSP architecture low-power/low-cost) DSP architecture for use in internal company products, such as its Genie phone and a digital telephone-answering machine, but the architecture is gaining popularity among external customers, as well. REAL features a dual Harvard architecture with two 16-bit data buses connected to its data-computation unit (DCU). Each ALU in the DCU handles 32-bit data and 8 overflow bits and stores results in four 40-bit accumulators. Alternatively, you may split each ALU into two independent 16-bit ALUs. This dual multiplier/accumulator architecture can simultaneously calculate two independent output values, although algorithm-specific memory bottlenecks may require the designer to reuse the same coefficient data for both calculations. The REAL DSP has two independent address-computation units (ACUs), each of which has eight address pointers that perform automatic context switching during interrupts. The REAL DSP's VHDL-synthesis model allows designers to add application-specific execution units (AXUs) at specified points in the datapath or the ACUs. An AXU uses the standard DSP resources, and hooks in the DSP instruction decoder enable control of an AXU. Users can define AXUs or select them from library modules, such as a 40-bit barrel shifter, a normalization unit, and a division-support unit. Philips reserves a few bits in the application-specific-instruction (ASI) bit patterns to control AXU hardware. The assembler handles mapping of the AXU commands.

Special instructions: The REAL DSP's instruction set uses 16-bit operation codes, but you can extend the instruction set with 96-bit ASIs. An on-chip RAM or ROM look-up table contains these very-long-instruction-word-like ASIs, which contribute to a high level of parallelism. A 16-bit instruction entering the core contains an index into this table, which activates the corresponding ASI operations. If the silicon implementation of the DSP's look-up table is in RAM, your application can download sets of ASIs to the chip while the application is running and dynamically customize the DSP core. The assembler, linker, and instruction-set simulator account for the ASIs that you specify when writing your application program. You have to specify the keyword asi followed by all the operations that the core executes in parallel. The assembler/linker then checks for duplicate ASI instructions, translates the instructions to an ASI look-up table and, if necessary, downloads them to the DSP. You may use as many as 256 ASIs.

Support: Philips Semiconductor supports DSP-C, a proposed extension of ISO/ANSI C to better handle DSP-specific capabilities, such as fixed-point data types and dual Harvard architectures. The company is developing a C compiler for the REAL architecture using the Associated Compiler Experts (www.ace.nl) CoSy compiler-development platform.

DSP has dual Harvard architecture. Data-computation unit comprises two 16×16-bit signed multipliers and two 40-bit ALUs. Device is customizable using application-specific units and instructions.
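The ASI mechanism described above, in which a short instruction carries an index into a table of wide instructions and the assembler/linker merges duplicates, can be sketched as a simple table builder. This is an illustration only: the operation names, the tuple representation of an ASI, and the function `build_asi_table` are hypothetical, and the sketch says nothing about Philips' actual 96-bit encoding or assembler syntax.

```python
MAX_ASIS = 256  # the text states that you may use as many as 256 ASIs


def build_asi_table(program):
    """Build a look-up table of distinct ASIs and the index each use refers to.

    `program` is a list of ASI statements, each a sequence of operation names
    that execute in parallel. Duplicate statements share one table slot.
    """
    table = []      # slot number -> ASI (tuple of operations)
    slot_of = {}    # ASI -> slot number, used to detect duplicates
    indices = []    # one table index per statement in the program
    for ops in program:
        key = tuple(ops)
        if key not in slot_of:
            if len(table) >= MAX_ASIS:
                raise ValueError("program needs more than 256 distinct ASIs")
            slot_of[key] = len(table)
            table.append(key)
        indices.append(slot_of[key])
    return table, indices


program = [("mpy", "add"), ("load", "store"), ("mpy", "add")]
table, indices = build_asi_table(program)
print(table)    # two distinct ASIs
print(indices)  # [0, 1, 0]
```

The design trades a table look-up for code density: the program stream stays 16 bits wide while each index can trigger a wide, parallel operation.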

Texas Instruments TMS320C2000


TI based the TMS320C2000 DSPs on the 320C2xLP core that the company offers as part of its custom DSP capability. The C2xLP core keeps the same four-stage pipeline of the C5x, which allows it to operate as fast as 40 MHz. It also has a JTAG-emulation block like the C5x, instead of the older, in-circuit emulation of the C2x. The C2000 family comprises the C20x and the C24x product families; the two families differ in their memory and peripheral mixes. The C20x targets low-end telecommunications, and the C24x targets digital-motor control.

The C2xLP has a central arithmetic-logic unit (CALU), which feeds the 32-bit accumulator. The accumulator also acts as one of the inputs to the CALU. The other input to the CALU comes from either the 16×16-bit multiplier through a scaling shifter or the input data-scaling shifter. Software can rotate the contents of the accumulator through the carry bit to perform bit manipulation and testing. For implementing fractional arithmetic or justifying a fractional product, the C2xLP processes the product-register output through a product shifter to eliminate the extra bit in a multiplication. The product-scaling shifter allows as many as 128 product accumulates without overflowing the accumulator. The basic multiply-accumulate (MAC) cycle involves multiplying a data-memory value by a program-memory value and adding the result to the accumulator. When the C2000 repeats the MAC, the program counter automatically increments, freeing the program bus to fetch the second operand. This feature allows the MAC unit to achieve single-cycle execution.

Harvard architecture supports two separate bus structures. Dual-access RAM allows writes and reads to and from the RAM in the same cycle. Family members operate as low as 3.3V.

Similar to the C5x, the C2xLP can access 64,000 16-bit parallel I/O ports. The peripherals on C2000 devices, such as serial ports and software wait-state generators, are mapped in either the on-chip data or the I/O spaces.
Your program must use other I/O addresses to access off-chip peripherals mapped in the I/O space. You can use slower external memories using the on-chip software wait-state generator or the chip's Ready pin. Most of the devices on the C2000 platform can generate zero to seven wait states. The devices in the C24x generation have an onboard event manager to support motor-control applications. The event manager features three up/down timers and nine comparators, which you can couple with waveform-generation logic to create as many as 12 PWM outputs. The event manager supports symmetrical (centered) and asymmetrical (noncentered) PWM-generation capabilities. It also supports a space-vector PWM state machine, which implements a scheme for switching power transistors to yield longer transistor life and lower power consumption. A deadband-generation unit also helps protect power transistors. In addition, the event manager integrates four capture inputs, two of which can serve as direct inputs for optical-encoder quadrature pulses. The C24x family also integrates on-chip, 10-bit A/D converters. These successive-approximation converters convert an analog signal in as little as 500 nsec and have eight or 16 multiplexed input channels. Some of the newer C24x devices also offer autosequencing capabilities, which allow as many as 16 conversions in a session, and an independent sample/hold prescaler, which gives you greater flexibility by supporting a range of input impedances. Several devices in the C24x family include flash-memory arrays ranging from 8k to 32k words. Some C24x devices also allow you to program the flash at the sector level.

Addressing modes: The C2xLP supports immediate and paged-memory-direct addressing, in which 7 bits in an instruction concatenate with a 9-bit data-page pointer to access data RAM. It also supports register-indirect addressing using the 16 bits in one of eight auxiliary registers to access memory.
It can automatically postincrement or decrement auxiliary registers. The C2xLP offers no circular buffering.

Special instructions: A MAC-with-data-move instruction (MACD) adds a data move for on-chip RAM blocks to the MAC unit; as the CPU uses the input data values, the CPU shifts the values to the next memory location. MACD is an alternative to using a circular buffer and is useful for convolution and transversal filtering. The C2000 also offers single-instruction repeat, multiply and accumulate previous product, multiply and subtract previous product, accumulate previous product and move data, multiconditional branches and calls, store long immediate to data-memory locations, rotate accumulator left/right, and block move.

Support: TI offers Code Composer 4.10, an integrated, unified development environment that supports editing, building, debugging, profiling, and project management. TI's $1995 Code Composer for the PC includes an ANSI C compiler, an assembler, a linker, an instruction-set simulator, and real-time analysis and data visualization. TI offers an emulator that supports JTAG scan-based emulation for nonintrusive product test. The company also supplies a C compiler, a source-level C assembler/debugger, a linker, a simulator, a profiler, and an application library. Evaluation modules, prototype cards, emulators, and application algorithms are also available from third parties. TI also offers analog devices, such as data converters and a power-management supply, which you can combine with the C2000 DSPs. See www.ti.com/sc/4123 for more details.
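The MACD idea, in which each multiply-accumulate also copies the sample it just used to the next memory location, can be modeled in a few lines. This is a simplified sketch with a hypothetical function name: it ignores the C2xLP's pipelined product register and fractional scaling and shows only how the data move ages the delay line as a side effect of the filter loop, so no circular buffer is needed.

```python
def fir_macd(delay, coeffs, sample):
    """Compute one FIR output while shifting the delay line, MACD style.

    delay[k] holds x[n-k]. The list needs len(coeffs) + 1 entries because the
    oldest sample is moved one slot past the last tap before being discarded.
    The list is modified in place.
    """
    delay[0] = sample
    acc = 0
    for k in range(len(coeffs) - 1, -1, -1):  # start with the oldest tap
        acc += coeffs[k] * delay[k]           # multiply-accumulate
        delay[k + 1] = delay[k]               # data move to the next location
    return acc


coeffs = [1, 2, 3]
delay = [0] * (len(coeffs) + 1)
# Feeding an impulse returns the coefficients in order, then zero.
print([fir_macd(delay, coeffs, x) for x in [1, 0, 0, 0]])  # [1, 2, 3, 0]
```

Processing from the oldest tap toward the newest is what keeps the move from overwriting a sample before the filter has used it.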

Texas Instruments TMS320C5000


This 16-bit, fixed-point DSP-product family includes the older generation C5x, the mainstream C54x, and the new C55x. The C55x DSPs are source-code-compatible with the TMS320C54x DSPs, and the C5x DSPs are source-code-compatible with the C2x DSPs. The C54x focuses on low power consumption, but the C55x takes power to a new level: TI claims that a 300-MHz C55x delivers approximately five times higher performance than and dissipates one-sixth the core power of a 120-MHz C54x. Although the C5x is in full production, the company is shifting new designs toward the C54x and C55x. For more information on the C5x, please refer to www.ednmag.com/ednmag/reg/1999/041599/08cs_16.htm#tms320c5x or the handy online comparison sheet.

DSP is 16-bit, fixed-point device. The C55x has dual MAC units; the C54x has one MAC unit. The C55x has variable instruction size and no alignment restrictions. The C55x has 12 buses versus eight for the C54x. The devices operate at 0.9V at 300 MHz.

The C55x and C54x DSPs use a modified Harvard architecture. The C55x has 12 independent buses compared with eight for the C54x. Both architectures include one program bus and an associated program-address bus. The C55x bus is 32 bits wide, and the C54x bus is 16 bits wide. The C55x has three data-read buses and two data-write buses, whereas the C54x has two data-read buses and one data-write bus. Each data bus also has its own address bus. The corresponding address buses are 24 bits wide on the C55x and 16 bits wide on the C54x. The C54x can generate one or two data-memory addresses per cycle using two auxiliary register-arithmetic units. The four internal buses and dual address generators enable multiple-operand operations. The address-data-flow unit (ADFU) on the C55x contains dedicated hardware for managing the five data buses. The ADFU also provides an additional general-purpose, 16-bit ALU with shifting capability for simple arithmetic operations.
This ALU accepts immediate values from the instruction-buffer unit (IU) and communicates bidirectionally with memory, the ADFU registers, the data-computation-unit (DCU) registers, and the program-flow-unit (PFU) registers. Within the ADFU, the ALU can manipulate four general-purpose, 16-bit registers or any of the address-generation registers. Either the ALU or one of the three addressing-register ALUs (ARAUs) can modify the nine addressing registers used for indirect addressing. The three ARAUs provide independent address generators for each of the C55x's three data-read buses. This parallelism allows the DCU to read two 16-bit operands and a 16-bit coefficient during each CPU cycle. The DCU in the C55x contains dual multiply-accumulate (MAC) units that perform two 17×17-bit MAC operations in a single cycle. It also contains a 40-bit ALU, four 40-bit accumulator registers, a barrel shifter, and dedicated Viterbi-algorithm hardware (commonly used in error-control-coding schemes and also available with the C54x). Each MAC unit comprises a multiplier and a dedicated adder with 32- or 40-bit saturation logic. The three data-read buses can carry two data streams and a common coefficient stream to the two MAC units. You can use the ALU to operate on 32-bit data or split it to perform dual 16-bit operations. In addition to accepting inputs from the 40-bit accumulator registers of the DCU, the ALU accepts immediate values from the IU and communicates bidirectionally with memory, the ADFU registers, or the PFU registers. The C54x is a single 17×17-bit MAC machine with a dedicated 40-bit adder, two 40-bit accumulators, and a separate 40-bit ALU. Similar to the C55x, the C54x's ALU features a dual 16-bit configuration that enables dual single-cycle operations. The 40-bit adder at the output of the multiplier allows unpipelined MAC operations as well as dual addition and multiplication in parallel.
Single-cycle normalization and exponential encoding support floating-point arithmetic. Both architectures support a single barrel shifter that can shift 40-bit accumulator values as much as 31 bits to the left or right. The barrel shifter can supply a shifted value to the DCU's ALU as an input for further calculation. The instruction sets support the parallelism of the architectures with many two- and three-operand instructions and some 32-bit operands. Eight individually addressable auxiliary registers and a software stack aid a C compiler's efficiency. With the ability to execute variable-length instructions, the C55x deviates significantly from the C54x. Whereas the C54x instructions are fixed at 16 bits, the C55x instructions range from 8 to 48 bits. The C55x's IU buffers 64 bytes of code in a queue and includes the decoding logic that identifies the instruction boundaries of the variable-length instructions. The local repeat instruction uses the instruction-buffer queue to repeat or loop a block of code. The instruction-buffer queue can also perform speculative fetching of instructions while testing a condition for conditional program-flow-control instructions. The instruction decoder decodes instructions in sequential order rather than performing dynamic scheduling. This approach results in predictable execution time.
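The statement that the C55x's three data-read buses carry two data streams and one common coefficient stream to the two MAC units maps naturally onto an FIR filter that computes two outputs per pass. The sketch below is an illustration in Python, not TI code: the function name is hypothetical, and the inner loop simply makes visible that each step reads one coefficient and two data samples.

```python
def dual_mac_fir(x, h):
    """Compute the 'valid' FIR outputs of x filtered by h, two at a time.

    Output n is sum(h[k] * x[n + len(h) - 1 - k]). Each inner-loop step uses
    one shared coefficient and two data samples, as a dual-MAC datapath would.
    """
    taps = len(h)
    n_out = len(x) - taps + 1
    if n_out <= 0:
        return []
    y = [0] * n_out
    for n in range(0, n_out - 1, 2):
        acc0 = acc1 = 0
        for k in range(taps):
            coeff = h[k]                        # shared coefficient stream
            acc0 += coeff * x[n + taps - 1 - k]  # MAC unit 0, output n
            acc1 += coeff * x[n + taps - k]      # MAC unit 1, output n + 1
        y[n], y[n + 1] = acc0, acc1
    if n_out % 2:  # an odd output count leaves one result for a single MAC
        n = n_out - 1
        y[n] = sum(h[k] * x[n + taps - 1 - k] for k in range(taps))
    return y


print(dual_mac_fir([1, 2, 3, 4, 5], [1, 1]))  # [3, 5, 7, 9]
```

Sharing the coefficient read is what lets two MACs run every cycle with three operand fetches instead of four.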

The C55x has a PFU that tracks a program's execution point and generates the 24-bit addresses for instruction fetches for as much as 16 Mbytes of program memory. This unit includes hardware for looping and for speculative branching, conditional execution, and pipeline protection. A separate program counter is dedicated to fast returns from subroutines or interrupt service routines. The PFU also includes the logic for managing the instruction pipeline and the four CPU status registers. The PFU provides three levels of hardware loops by nesting a block-repeat operation within another block-repeat operation and including a single repeat in either or both of the repeated blocks. It also includes hardware to support conditional repeats. The PFU handles pipeline-control hazards and provides protection against write-after-read and read-after-write data hazards. When such data hazards occur in a C55x instruction stream, the pipeline-protection logic inserts cycles to maintain the intended order of operations and correct execution of the program. An integrated software wait-state generator in both DSPs allows you to use slower external memories. All devices support on-chip dual-access RAM (DARAM) that you can configure as data or program memory. The C55x has expanded options for synchronous-burst RAM, synchronous DRAM, and asynchronous SRAM and DRAM. A PLL allows you to throttle the clock, but the C55x core can also actively and automatically manage power consumption of on-chip peripherals and memory arrays. When a program is not accessing individual on-chip memory arrays, they switch into a low-power mode. The processor provides similar control for on-chip peripherals. Peripherals can enter low-power states when they are inactive and respond to processor requests and exit their low-power states without latency. The C55x also implements user-controllable, low-power Idle domains.
These domains are sections of the device that you can selectively enable or disable under software control. When you disable a domain, it enters the Idle state, maintaining register or memory contents. When you enable the domain, it returns to normal operation. On initial C55x devices, the Idle domains are the CPU, the DMA, the peripherals, the external memory interface, the instruction queue, and the clock-generation circuitry.

Addressing modes: The C54x supports single data-memory-operand addressing that also supports 32-bit operands. It also supports dual-data-memory-operand addressing, which parallel instructions use. It provides immediate, memory-mapped, circular, and bit-reversed addressing. In addition to the C54x modes, the C55x supports absolute addressing, register-indirect addressing, and the direct-addressing, or displacement, mode. The C55x's ADFU includes dedicated registers to support circular addressing for instructions that use indirect addressing. Your program can simultaneously use as many as five independent circular-buffer locations with as many as three independent buffer lengths. These circular buffers have no address-alignment constraints. The C54x supports two circular buffers of arbitrary lengths and locations.

Special instructions: The C54x performs dedicated-function instructions, such as FIR filters, single and block repeat, eight parallel instructions (for example, parallel store and multiply accumulate), multiply and accumulate and subtract (10 multiply instructions), and eight dual-operand memory moves. The C55x also has special instructions that take advantage of the additional functional units as well as increase parallelism capabilities. User-defined parallelism allows you to combine certain instructions to perform two operations. You can also combine a built-in parallel instruction with a user-defined parallel instruction.

Support: The eXpressDSP software-technology strategy includes DSP integrated development tools; a scalable, real-time software foundation; standards for application interoperability and reuse; and a growing base of TI DSP-based software modules from third parties (www.ti.com/sc/docs/general/dsp/expresssp/index.htm).
Code Composer Studio, an integrated suite of DSP-software-development tools, incorporates TI's C5000 C compiler with the Code Composer integrated development environment, DSP/BIOS, and Real-Time Data Exchange technologies. Third-party tools and application algorithms are also available. See www.ti.com/sc/4123 for more details.
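Circular addressing with arbitrary buffer locations and lengths, as the C5000 addressing-mode description mentions, amounts to a pointer update that wraps inside a window of memory. The following minimal sketch shows that update; the function name and the example addresses are hypothetical, and the sketch does not model TI's register set.

```python
def circular_advance(addr, step, base, length):
    """Post-modify addr by step, wrapping within [base, base + length).

    Neither base nor length needs any particular alignment, and step may be
    negative.
    """
    return base + (addr - base + step) % length


# Walk a 5-word buffer that starts at an unaligned address.
base, length = 103, 5
addr = base
visited = []
for _ in range(7):
    visited.append(addr)
    addr = circular_advance(addr, 2, base, length)
print(visited)  # [103, 105, 107, 104, 106, 103, 105]
```

Hardware that performs this wrap in the address generator lets a filter loop treat a fixed block of RAM as an endless delay line with no extra instructions.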

Texas Instruments TMS320C6000


TI's TMS320C6000 is a general-purpose DSP based on a very-long-instruction-word (VLIW) architecture. This architecture includes the fixed-point C62x, the floating-point C67x, and the new C64x families. The C64x is object-code-compatible with the C62x but with significant architectural enhancements and an initial operating frequency of 750 MHz. TI created the C67x by adding floating-point capability to six of the C62x's eight functional units, so the C67x instruction set is a superset of the C62x instruction set. The C6000 lacks a dedicated multiply-accumulate (MAC) unit. Instead, it performs MAC operations by using separate multiply and add instructions. Although this operation requires two instruction cycles, the pipelined effect yields apparent single-cycle execution. (Unless otherwise indicated, all C62x details that follow also apply to the C67x.)

TI expects the first C64x device to hit 750 MHz. VLIW architecture has RISC-like characteristics. C compiler is closely tied to the architecture. Eight functional units deliver parallelism.

This architecture comprises dual datapaths and dual matching sets of four functional units. The eight functional units on the C64x and C62x comprise two multiply (M) units and six 32-bit arithmetic units with a 40-bit ALU and a 40-bit barrel shifter. The C64x M units perform two 16×16-bit multiplies every clock cycle, compared with one multiply on the C62x. In addition, each M unit on the C64x can perform four 8×8-bit multiplies every clock cycle. Bit-count and rotate hardware on the M unit extends support for bit-level algorithms. The C64x also has beefed-up capability in other functional units. For example, the logical (L) units can perform byte shifts and quad 8-bit subtractions with absolute value. This absolute-difference instruction benefits motion-estimation algorithms. TI also added bidirectional variable shifts to the M and S units. The C64x D unit can perform 32-bit logical instructions in addition to the S and L units.
The L and D units can load 5-bit constants in addition to the S units' ability to load 16-bit constants. In the C64x, each functional-unit set has its own bank of 32 32-bit registers; the C62x has 16 32-bit registers per bank. A program can use the general-purpose registers for data, data-address pointers, or condition codes. In all C6xxx devices, you can use registers A4 through A7 and B4 through B7 for circular addressing. A program can use any register as a loop counter, which can free the standard-condition registers for other uses.

On the C64x, any member of a functional-unit set can access the other functional-unit set's register bank through one data bus; on the C62x, all units except the two data units have a data cross-path to the other set of units. The C64x data-cross-path accesses allow multiple units per side to simultaneously read the same cross-path source. Thus, one, multiple, or all the functional units on a side in a VLIW execute packet may use the cross-path operand for that side. In the C62x, only one functional unit per datapath per execute packet can access an operand from the opposite register file.

The C62x register files support packed 16-bit data through 40-bit, fixed-point and 64-bit, floating-point data. You can store values larger than 32 bits in register pairs. The C64x register file supports all the C62x data types, packed 8-bit types, and 64-bit fixed-point data types. Packed data types store four 8-bit values or two 16-bit values in a single 32-bit register or four 16-bit values in a 64-bit register pair. Each C64x multiplier can return a result as large as 64 bits, so an extra write port is available from the multipliers to the register file.

The C6000 families support no separate X- and Y-memory spaces. Instead, they provide a single data memory with two paths, 64 bits wide on the C64x and 32 bits wide on the C62x, for loading data from memory to the register banks. Two other 32-bit paths (64 bits for the C64x) store register values to memory. A 32-bit address bus supports these datapaths. The C64x can also access words and double words at any byte boundary using nonaligned loads and stores; the C62x requires alignment on 32- or 64-bit boundaries.

A 32-bit address bus addresses the program memory, but the single datapath is 256 bits wide. This width allows the C62x to fetch, but not necessarily execute, eight 32-bit instructions per cycle. TI calls this group of instructions a fetch packet.
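The packed data types described above, and the quad 8-bit subtraction with absolute value that the C64x L units perform, can be modeled in a few lines. The following is a minimal Python sketch, not TI code: the function names are hypothetical, and the lane ordering (lane 0 in the low byte) is an assumption made for illustration.

```python
def pack4x8(b3, b2, b1, b0):
    """Pack four unsigned 8-bit values into one 32-bit word (b0 in the low byte)."""
    for b in (b3, b2, b1, b0):
        if not 0 <= b <= 0xFF:
            raise ValueError("each lane must fit in 8 bits")
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0


def unpack4x8(word):
    """Split a 32-bit word into four unsigned 8-bit lanes, high lane first."""
    return tuple((word >> shift) & 0xFF for shift in (24, 16, 8, 0))


def quad_abs_diff(word_a, word_b):
    """Lane-wise |a - b| on four unsigned 8-bit lanes; nothing carries between lanes."""
    lanes = [abs(a - b) for a, b in zip(unpack4x8(word_a), unpack4x8(word_b))]
    return pack4x8(*lanes)


def sum_abs_diff(word_a, word_b):
    """Sum of the four lane differences, the inner step of a motion-estimation search."""
    return sum(unpack4x8(quad_abs_diff(word_a, word_b)))


a = pack4x8(10, 200, 30, 255)
b = pack4x8(12, 100, 30, 0)
print(hex(quad_abs_diff(a, b)))  # 0x26400ff
print(sum_abs_diff(a, b))        # 357
```

One 32-bit operation here stands in for four separate byte operations, which is the source of the per-cycle gains that packed instructions offer for pixel data.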
The C62x architecture does not allow execute packets to cross fetch-packet boundaries, resulting in compiler-generated nonoperation (NOP) instructions to pad fetch packets. The C64x architecture resolves this code-bloat issue with instruction packing in the instruction-dispatch unit. This approach removes execute-packet-boundary restrictions and eliminates all filler NOP instructions. The CPU can execute one to eight instructions per cycle, but data dependencies, instruction latencies, and resource conflicts limit optimal performance. Multiple execute packets allow fully parallel, fully serial, or parallel/serial combinations; therefore, eight serial instructions require the same code size as eight parallel instructions.

The compiler and assembly optimizer play big roles in establishing the sequence of instructions for the C6000 to execute. The programming tools link instructions in a fetch packet by the least significant bit of an instruction. If the bit is set, the C6000 executes the instruction in parallel with the subsequent instruction. The assembly optimizer checks dependencies and arranges parallelism among instructions. Therefore, the code executes as programmed on independent functional units, which eliminates the need for core features such as out-of-order execution or dependency-checking hardware.

Two devices from these families, the C6211 and C6711, are the industry's first DSPs with L1 and L2 on-chip cache memory. The C6211 incorporates a two-level cache structure with 4-kbyte Level 1 program and data caches. The internal Level 2 cache memory is a unified 64-kbyte data and instruction RAM. The C6211 also includes a 16-channel DMA controller that tracks 16 independent transfers and allows you to link each channel to a subsequent transfer. The C6202, C6203, and C6204 have a 32-bit expansion bus that replaces the 16-bit host-port interface and complements the external memory interface (EMIF). The second bus for I/O devices reduces the loading on the EMIF and increases data throughput. The EMIF and the expansion bus are independent of each other, allowing the CPU to perform concurrent accesses to both ports.

Addressing modes: The C6000 performs linear and circular addressing. However, unlike most other DSPs that have dedicated address-generation units, the C6000 calculates addresses using one or more of its functional units.

Special instructions: All C6000 processors conditionally execute all instructions, a method of reducing branching and, therefore, keeping the pipeline flowing. On the C64x, the MPYU4 instruction performs four 8×8-bit unsigned multiplies. The ADD4 instruction performs four 8-bit additions. All functional units can perform dual 16-bit addition/subtraction, compare, shift, minimum/maximum, and absolute-value operations.
The M units, and four of the six remaining functional units, support quad 8-bit addition/subtraction, compare, average, minimum/maximum, and bit-expansion operations. TI also added instructions that operate directly on packed 8- and 16-bit data. Bit-count and rotate hardware on the M unit extends support for bit-level algorithms, such as binary morphology, image-metric calculations, and encryption algorithms. The C64x's branch-and-decrement (BDEC) and branch-on-positive (BPOS) instructions combine a branch instruction with the decrement and test positive of a destination register, respectively. Another instruction helps reduce the number of instructions needed to set up the return address for a function call. The dual 16-bit arithmetic combines with six of the eight functional units and a bit-reverse (BITR) instruction to improve FFT cycle counts by a factor of two. The Galois-field-multiply instruction (GMPY4) provides a performance boost over the C62x for Reed-Solomon decoding using the Chien search. Special average instructions improve the performance of motion compensation by a factor of seven on a per-clock-cycle basis versus the C62x. The quad-absolute-difference instruction bolsters motion-estimation performance by a factor of 7.6 on a per-clock-cycle basis for an 8×8-bit minimum-absolute-difference (MAD) computation. The C64x provides data-packing and -unpacking operations to allow sustained high performance for the quad 8-bit and dual 16-bit hardware extensions. Unpack instructions prepare 8-bit data for parallel 16-bit operations. Pack instructions return parallel results to output precision, including saturation support.

Support: The eXpressDSP software-technology strategy includes DSP integrated development tools; a scalable, real-time software foundation; standards for application interoperability and reuse; and a growing base of TI DSP-based software modules from third parties (www.ti.com/sc/docs/general/dsp/expresssp/index.htm). The Code Composer Studio, an integrated suite of DSP-software-development tools, incorporates TI's C6000 C compiler with the Code Composer integrated development environment, DSP/BIOS, and Real-Time Data Exchange technologies. The assembly optimizer simplifies assembly-language programming and automatically schedules and parallelizes instructions from serial, inline assembly code. The assembler reads straight-line code without regard to registers or functional units and does the resource assignment.
Deterministic operation allows the debugger to lock-step through the code. The debugger performs code profiling to determine the amount of time the processor spends in various portions of the code. Free tools are available for a 30-day trial on the Web at www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm. Third-party tools and application algorithms are also available. See www.ti.com/sc/4123 for more details. TI offers hardware-emulation boards and starter kits.
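The way the C6000 tools chain instructions through the least significant bit of each instruction word can be illustrated with a short model. This is a minimal Python sketch under stated assumptions: the function name and the dummy instruction words are hypothetical, and closing an unfinished chain at the end of the fetch packet reflects the C62x behavior described earlier, not the C64x.

```python
def split_execute_packets(fetch_packet):
    """Group the 32-bit words of one fetch packet into execute packets.

    Bit 0 of each word is treated as the parallel bit: when it is set,
    the next instruction issues in the same cycle as this one.
    """
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # bit clear: the chain ends with this word
            packets.append(current)
            current = []
    if current:                    # assumption: close the chain at the fetch-packet boundary
        packets.append(current)
    return packets


def make_words(parallel_bits):
    """Build dummy instruction words whose bit 0 carries the given parallel bits."""
    return [(i << 1) | p for i, p in enumerate(parallel_bits)]


mixed = split_execute_packets(make_words([1, 1, 0, 0, 1, 0, 0, 0]))
print([len(p) for p in mixed])     # [3, 1, 2, 1, 1]

serial = split_execute_packets(make_words([0] * 8))
parallel = split_execute_packets(make_words([1] * 7 + [0]))
print(len(serial), len(parallel))  # 8 1
```

The fully serial and fully parallel cases both occupy the same eight words; only the bit pattern differs, which is why eight serial instructions cost the same code size as eight parallel ones.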

www.ednmag.com

March 30, 2000 | edn 89

dspdirectory 16

bits

3DSP SP-3 and SP-5 DSP cores


Architecture is dual-issue superscalar.
Packed operations support SIMD.
Addressing modes support matrix operations.
DSP operates on 8-, 16-, 24-, and 32-bit data.

3DSP, a venture-capital-funded company that started up in 1997, focuses on providing fixed-point, synthesizable, static DSP cores in VHDL or Verilog. You can also purchase the DSPs as hard cores. The product offerings include the scalar SP-3 and the binary-code-compatible, superscalar SP-5. Both cores have single-instruction-multiple-data (SIMD) capability. One instruction can perform four 8-bit operations, two 16-bit operations, or one 32-bit operation. The SP-5 can execute two instructions in a single cycle.

The cores incorporate a five-stage pipeline that handles all data dependencies and hazards in hardware and issues a nonoperation instruction when the data is unavailable in time. The SP-3 and SP-5, respectively, fetch one and two 32-bit instructions from the instruction cache every clock cycle. The CPU fetches no instructions if the pipeline is full or if a cache miss occurs. This approach eliminates unnecessary memory fetches and, therefore, reduces power consumption. The decoding unit of the CPU checks all data dependencies between instructions in the pipeline. If an operand is unavailable for execution, the decoding logic stalls the instruction until it is available. If the operand is unavailable in the register file but available in the pipeline, the decoding logic forwards the operand accordingly.

The DSPs' process-control instructions support zero-overhead looping, static-branch prediction, and dynamic-branch resolution to minimize branch penalties. If the CPU doesn't resolve the branch condition, it predicts and issues the branch result. The CPU resolves the speculative branch as soon as it determines the branch condition. If the branch prediction is correct, the DSP incurs no penalty. If the CPU mispredicts the branch, the penalty ranges from one-half to two cycles.
The SP-3 includes a multiply-accumulate (MAC) unit, a saturate/add unit, a shifter unit, and a logic unit; the SP-5 includes an extra MAC and an extra saturate/add unit. The company designed its DSPs modularly to allow reconfigurability. The baseline SP-3 and SP-5 cores have two and four 48-bit accumulators, respectively. These accumulators combine with the MAC units to handle 24×16-bit, two 16×16-bit, or four 16×8-bit MAC instructions. Configuration options for the SP-3 and SP-5 include two and four 56-bit accumulators, respectively, and 32×24- and 32×16-bit multipliers. In both DSPs, you can link two of the accumulators to provide 84-bit accuracy for multiprecision MAC operations. The saturate/add units perform 48-, 32-, or 16-bit saturation, as well as one 32-bit, two 16-bit, or four 8-bit adds. The shifter unit can handle both signed and unsigned shifts, a barrel shift, byte packing, bit extraction, and bit insertion. The logic unit performs various logic functions as well as leading-one detection and compare/select. You can add application-specific instructions to optimize performance.

A key feature of both cores is the memory-to-register-file architecture, in which the operands of the instruction come from either the data memory or the register file. The result of the instruction goes only to the register file. The SP-3 and SP-5, respectively, can perform a maximum of two and four load/store operations every clock cycle. The address-generation unit has eight circular buffers and four page registers. The SP-3 and SP-5 can perform two or four address calculations per cycle, respectively. The circular buffer supports transpose mode, which allows the programmer to perform 2-D matrix operations. You can use the bit-reverse mode in the circular buffer for FFT operations. The page registers provide a convenient way to index array-data structures. Both cores use one program memory and two data memories.
A system developer can program one-fourth of the program-memory space as set-associative cache; you can also partition the memory into multiple banks to improve DMA performance. The company also offers a synthesizable on-chip system-bus controller with 10 prioritized channels and a 600-Mbyte/sec, cycle-by-cycle multiplexed system bus. This controller includes support for 2-D data transfers, chained transfers, and infinite transfers. It also includes multiprocessor support and a variable-depth buffer to maximize both on-chip systems.

Addressing modes: The SP-3 and SP-5 have four page registers to support register-indirect addressing. Eight circular buffers support postincrement or postdecrement indirect addressing. Using the circular buffers, the SP-3 and SP-5 CPUs can read as many as two and four data operands, respectively, from memory in a cycle, and both can generate two new addresses for the next cycle. The circular buffer can also access multiple variable blocks of memories with no additional setup requirements. The DSPs support prioritized interrupts and 32 levels of nesting.

Special instructions: Huffman-encoding-related instructions target image-compression applications. The DSPs conditionally execute all instructions.

Support: The vendor's development-tool set includes an optimized C compiler, a cycle-accurate simulator, a software library, and a graphical-user-interface-based hardware debugger. It has a JTAG-compatible debugging port. The company also provides software libraries, including support for digital-still-camera applications; H.263 image-compression algorithms; standard voice codecs; FFTs; and a 2-D, discrete-cosine transform. An RTOS kernel supports multitasking applications and handles multiple priorities for tasks, semaphore queues for synchronizing events between tasks, system functions to handle interrupts, state preservation, and DMA requests.
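The bit-reverse mode that the circular buffers offer for FFT operations produces the familiar reordered access sequence of a radix-2 FFT. The following minimal Python sketch shows only the index arithmetic; the function names are hypothetical, and nothing here describes how 3DSP's address-generation hardware implements the mode.

```python
def bit_reverse(index, bits):
    """Return index with its low `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Access order for an n-point radix-2 FFT; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Generating these addresses in hardware saves the explicit reordering pass that software would otherwise run before or after the butterfly stages.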


Zilog Z893x1/Z893x3
Device has an accumulator-based, modified Harvard architecture.
The MAC unit includes a 16×16- to 16×24-bit multiplier with automatic truncation.
Features include an external I/O bus.
Device does no hardware-repeat looping or bit manipulation.

Zilog built the Z893xx family's architecture around a single-cycle multiply-accumulate (MAC) unit, which includes a 24-bit product register and a 24-bit accumulator and arithmetic-logic unit with no guard bits. The DSP runs from a 4k- or 8k-word ROM or one-time-programmable (OTP) program ROM. Two internal bus sets, a program-address/data-bus set and a data-address/data-bus set, allow the processor to access program and data concurrently with a MAC operation. The architecture's two RAM blocks can hold coefficients and data samples, which automatically feed directly into the MAC's input registers each cycle. RAM-block addressing automatically increments or decrements the address, which eliminates the need for data-address-generation code for each MAC cycle. Results of the MAC operation land in the product register and the 24-bit accumulator during each cycle. You can treat the product register as a general-purpose register when it is not performing multiplies. Although the Z893xx lacks a barrel shifter, a shifter between the product register and ALU allows you to shift the product result right by 3 bits before adding it to the accumulator.

You can use the external I/O bus to access external peripheral devices, such as an ADC. An external read/write takes one cycle. You can insert one wait state using software control to access slow peripherals; you can use the Wait pin for additional wait states. Running code from external memory takes one additional cycle for each instruction; the chip reads the data in one cycle, but the data is unavailable for processing until the next instruction cycle.

Z893x1 devices have a codec interface that is compatible with 8-bit PCMs, 16-bit codecs, and 64-bit stereo sigma-delta codecs. You can adapt many general-purpose peripherals, including 8- and 16-bit ADCs and DACs, to this interface. You can also use the interface as a high-speed serial port or general-purpose counter. The Z893x1 chips also have one 13-bit timer for the codec interface and one 13-bit timer for general-purpose uses. You can concatenate these timers for extended timing. The Z893x3 has an 8-bit half-flash ADC, a high-speed SPI, three counter/timers, and as many as three 8-bit ports. It also has a PLL-driven system clock that drives the DSP to operate as fast as 20 MHz from a 32-kHz watch crystal.

Addressing modes: The Z893xx supports memory-direct addressing for as many as 512 RAM-based words; it also supports register-indirect addressing to RAM or ROM with pointer registers and immediate, short-form direct addressing using 16-bit data registers in RAM. It provides one-cycle, external-peripheral addressing, treating the peripheral as a register. Modulo-addressing options include modulo 2 to 256 for data access.

Special instructions: The Z893xx performs compare register to accumulator, conditional execution of certain instructions, and conditional branching and subroutine calls.

Support: Zilog offers its Zilog Developer Studio, which comprises a macroassembler, a linker/loader, and a source-level debugger. Also available is the 3xx Compass/IDE, which comprises a C compiler, an assembler, a linker, a simulator, and application libraries. Zilog offers emulators and evaluation boards, OTP programming adapters, and target emulation pods supporting a design-in environment. Check out www.zilog.com/products/dspapp.html for additional information.
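The Z893xx MAC datapath described above (two auto-incrementing RAM pointers, a 24-bit accumulator without guard bits, and an optional 3-bit right shift of the product) can be approximated in a small behavioral model. This Python sketch rests on loud assumptions: the names are hypothetical, and dropping the low 8 bits of the 32-bit product to reach 24 bits is a guess at the truncation rule, not Zilog's documented behavior.

```python
def to_signed(value, bits):
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac_loop(coeffs, samples, taps, shift3=False):
    """Accumulate coeffs[i] * samples[i] in a 24-bit accumulator with no guard bits."""
    acc = 0
    c_ptr = s_ptr = 0
    for _ in range(taps):
        product = coeffs[c_ptr] * samples[s_ptr]  # 16x16 signed multiply
        product >>= 8                             # assumed truncation to 24 bits
        if shift3:
            product >>= 3                         # optional pre-accumulate shift
        acc = to_signed(acc + product, 24)        # no guard bits: overflow wraps
        c_ptr = (c_ptr + 1) % len(coeffs)         # auto-increment with modulo wrap
        s_ptr = (s_ptr + 1) % len(samples)
    return acc


full_scale = [0x4000] * 8
print(mac_loop(full_scale, full_scale, 8))               # -8388608
print(mac_loop(full_scale, full_scale, 8, shift3=True))  # 1048576
```

Under these assumptions, eight large products overflow the accumulator and wrap negative, whereas shifting each product right by 3 bits first leaves headroom, which suggests why a fixed shifter is useful on a part without guard bits.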


dspdirectory 24

bits

Motorola DSP563xx
The 563xx is Motorola's highest-performance fixed-point DSP architecture, achieving single-cycle instruction execution. Although the branch penalty is three cycles, the 563xx supports conditional ALU instructions, which often avoid the need to change program flow. When the processor executes a single-cycle multiply-accumulate (MAC) operation, the first execution stage does the multiply, and the second stage does the accumulate. The 563xx uses an interlocking mechanism that automatically inserts a nonoperation (NOP) instruction into the pipe to avoid stalls. This approach permits execution to catch up with data dependencies.

The 563xx is binary-code-compatible with the 56000, but the 563xx also supports addressing modes that include address-register program-counter (PC) relative. This mode is useful for multitasking and position-independent code, which lets a programmer deliver and relocate object modules without relinking to the original code. Motorola expanded addressing on the 563xx to the full 24 bits, up from 16 bits on the 56000 family. Unlike the DSP56000, which has a 16-location stack limit, the DSP563xx implements an overflow mechanism for the on-chip hardware stack to off-chip data memory. Although the mechanism prevents unrecoverable stack overflows, the chip takes a two-clock penalty when externally dumping stack entries.

The DMA has separate address and data buses. The DMA transfers data among memories (P, X, and Y) or among memory and peripherals or the external host buses (PCI or ISA). You can program the size of the program RAM, instruction cache, X-data RAM, and Y-data RAM. The static core operates from dc to 80 MHz and uses a PLL with a built-in prescaler that allows dynamic clock throttling. For additional power savings, the core automatically powers down unused memories, peripherals, and core logic on every instruction. The newest members of the 56300 family are the 56307 and 56311.
These devices include an on-chip enhanced filter coprocessor (EFCOP) that processes filter algorithms in parallel with the DSP's core operation. The EFCOP provides performance improvements for tasks such as voice coding and echo cancellation. It operates in modes to perform real- and complex-FIR filtering, infinite-impulse-response filtering, adaptive-FIR filtering, and multichannel-FIR filtering. The EFCOP has its own access to memory, as well as a port into DMA.

Addressing modes: The 563xx supports register-direct, address-register-indirect, PC-relative, immediate, and absolute addressing.

Special instructions: The 563xx's barrel shifter supports multibit-shift instructions in both directions and by any number of bits. The shifter also supports instructions for bit-stream parsing and generation. The device can conditionally execute parallel ALU instructions based on all possible condition codes. If the test condition is false, the processor executes an NOP instruction. The 563xx performs 16-bit arithmetic that is useful for handling various compression algorithms, such as LD-CELP (low-delay code-excited linear prediction). Normally, when using a 24-bit architecture for 16-bit arithmetic, performance degrades, because you have to round the 24-bit numbers in software.

Support: Motorola backs the 563xx family with a host of development tools. You can use an application-development system to evaluate the chip and debug target systems. The system comes with an application-development module, a host-interface card, a command converter, an assembler, simulator software, and a C compiler. The 563xx's JTAG-based OnCE port allows you to examine all internal buses in real time and record the last 12 change-of-flow instructions. Motorola provides the Suite56 hardware- and software-development tools for the DSP563xx family. A range of third-party tools complements these tools. Third-party tools include a compiler and debugger from Tasking (www.tasking.com) and a debugger from Domain Technologies (www.domaitec.com). The Motorola software tools are available on CD-ROM, or you can download them from www.motorola.com/SPS/WIRELESS/dsptools/index.htm. Metrowerks, now part of Motorola, will unify the look and feel of development tools for new processors under the Metrowerks Code Warrior style.

Device has a seven-stage pipeline, comprising two fetches, one decode, two address generations, and two executions.
Device has conditional-ALU instructions.
Architecture is register-based.
Six-channel DMA operates concurrently with core's execution units.
Most devices operate at 3.3V and have 5V-tolerant I/O; some operate at 1.8V and have 3.3V-tolerant I/O.
Filter coprocessor operates in parallel with the core.
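The real-FIR mode that the EFCOP runs alongside the core boils down to one multiply-accumulate per tap over a delay line. The following is a generic direct-form FIR sketch in Python, offered only as an illustration of that computation; the function name is hypothetical, and the code does not model the EFCOP's registers, data formats, or programming interface.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR filter: one multiply-accumulate per tap per input sample."""
    taps = len(coeffs)
    delay = [0] * taps            # circular delay line holding the last `taps` inputs
    head = 0                      # position of the newest sample
    out = []
    for x in samples:
        delay[head] = x
        acc = 0
        idx = head
        for c in coeffs:
            acc += c * delay[idx]     # coeffs[k] meets the sample from k steps ago
            idx = (idx - 1) % taps    # walk backward through the delay line
        out.append(acc)
        head = (head + 1) % taps
    return out


# An impulse input returns the coefficients themselves.
print(fir_filter([1, 0, 0, 0], [1, 2, 3]))  # [1, 2, 3, 0]
```

Because the inner loop touches nothing but the delay line and the coefficient table, it is the sort of work a coprocessor with its own memory access can take over while the core does something else.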


dspdirectory 32

bits

Analog Devices SHARC DSP


DSP supports fixed- and floating-point operations.
The new Hammerhead operates on SIMD.
SHARC architecture is known for having large integrated SRAMs.

The 32-bit fixed- and floating-point SHARC DSPs, the ADSP-2106x and the second-generation ADSP-2116x, integrate four internal buses, a large on-chip memory, and an I/O controller to offload I/O. Both single-instruction-multiple-data (SIMD) and single-instruction-single-data (SISD) versions are available. Within the SISD CPU core, the ALU, multiplier, and shifter operate in parallel to perform multifunction, single-cycle instructions. The SIMD core adds a second compute block that includes an additional parallel ALU, a multiplier, a shifter, and a register file. This arrangement allows both computation blocks to process the same instruction while operating on different data.

SHARC DSPs feature an enhanced Harvard architecture in which the data-memory bus transfers data and the program-memory bus transfers both instructions and data. With its separate program- and data-memory buses and on-chip instruction cache, the processor can simultaneously fetch two operands and an instruction from cache in one cycle. The 32-entry, 48-bit-wide instruction cache is selective: it caches only the instructions whose fetches conflict with accesses to program-memory data.

The SHARC DSP uses a general-purpose, 10-port, 32-register data-register file to transfer data between the computation units and the data buses and to store intermediate results; the ADSP-21160 duplicates this action. The 48-bit instruction word accommodates a variety of parallel operations for concise programming. For example, the ADSP-2106x DSPs can conditionally execute a multiply, an add, a subtract, and a branch in one instruction. SHARC DSPs feature two data-address generators (DAGs), which implement circular data buffers. These DAGs contain sufficient registers to allow you to create as many as 16 primary and 16 secondary circular buffers.
The DAGs automatically handle address-pointer wraparound for these buffers, which may start and end at any memory location. SHARC chips have two high-speed serial ports and a host/parallel port, providing a direct interface to off-chip memory, peripherals, and a host processor. Link ports facilitate interprocessor communication and bus arbitration among as many as six ADSP-2106x chips. The SHARC's CPU executes using on- or off-chip memory. Some SHARC chips contain as much as 512 kbytes of on-chip memory organized into two banks of dual-port RAM. You can use this memory to store a combination of 16-, 32-, or 40-bit data and 48-bit instructions and perform as many as four accesses per cycle: program memory for code and data, data memory for data, and an off-chip load using the chip's I/O controller.

SHARC's I/O controller executes I/O transfers in parallel with CPU execution. The I/O controller offloads reads and writes between on- and off-chip memory. The dual-ported, dual-banked nature of the memory, combined with the I/O processor, allows the core and the DMA to simultaneously access internal SRAM. The I/O controller manages all DMA channels, transferring data among internal and external memory and all peripherals, such as the host port, as many as eight serial ports, and six link ports. DMA operations generally do not interrupt or delay core thread execution. The DMA controller allows you to dynamically control the external-memory-bus width.

The synchronous serial ports support time-division-multiplexed serial streams and hardware companding and can transfer data as fast as 40 Mbps. In all but the ADSP-21065L, the six communication ports move data in 4-bit nibbles, transferring as much as 1 byte/clock cycle. With six links operating simultaneously, maximum throughput is 600 Mbytes/sec. The CPU, I/O controller, and peripherals interconnect and perform flexible, nonintrusive transfers through a multibus-crossbar-interconnection unit. To reduce bottlenecks, the interconnect crossbar permits unlimited data and instruction movement from external or internal memory or cache and permits I/O from on- or off-chip peripherals, all in one cycle.

The 21160, 21060, and 21062 provide six communication ports for array multiprocessing. These ports feed through the I/O controller and let you create meshes of DSPs that can access each other's memory spaces. (Point-to-point connections between DSP ports define each processor in the mesh.) The on-chip I/O controller sets up, runs, and responds to these ports.
Transfers pass through the I/O ports to and from internal memory. The I/O controller separates these transfers from mainstream DSP. A parallel port serves as a direct interface to off-chip memory, peripherals, or a host processor. As many as six SHARCs can share this interface with a host processor. SHARCs offer a unified address space using a 32-bit address bus and a 32- or 48-bit data bus. For a 100-MHz clock, the chip supports a 10-nsec access time with zero-wait-state memory. The special host interface supports both 16- and 32-bit µPs, as well as system buses, such as ISA and PCI. The host treats the SHARC as a memory-mapped device with direct writes or reads to internal memory.

The lowest priced SHARC DSP, the ADSP-21065L, also provides a synchronous DRAM (SDRAM) interface that transfers data to and from SDRAM as fast as 240 Mbytes/sec, or twice the clock frequency. The glueless SDRAM interface can access 16- or 64-Mbyte SDRAMs and enables you to connect to any one of four external memory banks.

Addressing modes: SHARC offers immediate, indexed, bit-reversed, circular-modulo, and register-direct and -indirect addressing. (It must use indirect addressing for off-chip memory access.)

Special instructions: SHARC provides bit manipulation, division iteration, reciprocal of square-root seed, conditional subroutine call, single and block repeat with zero-overhead looping, fixed- and floating-point compare, and conditional execution of most instructions. SHARC supports the IEEE-754 single-precision, floating-point format (23-bit data, 8-bit exponent, and sign bit) and a 40-bit extended IEEE format for additional accuracy (32-bit data).

Support: Analog Devices' software- and hardware-development tools include the company's VisualDSP integrated development environment, in-circuit emulators, and a development kit. VisualDSP provides the interface to an optimizing C compiler, an assembler, a linker, a simulator, and a debugger. Analog Devices' emulators are available for Universal Serial Bus, PCI, and Ethernet host platforms. An EZ-Kit Lite consists of an evaluation board and a limited but full-featured version of VisualDSP. Analog Devices based the SHARC assembly language on an algebraic syntax.
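The IEEE-754 single-precision layout mentioned above (a sign bit, an 8-bit exponent, and 23 bits of data) can be inspected on any host with Python's standard struct module. This minimal sketch uses a hypothetical helper name and does not cover the SHARC's 40-bit extended format.

```python
import struct


def float32_fields(value):
    """Split a number, rounded to IEEE-754 single precision, into its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits; the leading 1 is implied
    return sign, exponent, fraction


print(float32_fields(1.0))   # (0, 127, 0)
print(float32_fields(-6.5))  # (1, 129, 5242880)
```

For -6.5, which is -1.625 × 2^2, the exponent field holds 127 + 2 = 129 and the fraction field holds 0.625 × 2^23 = 5242880.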


Texas Instruments TMS320C3x


TI's TMS320C3x integrates a 32-bit, floating-point DSP multiply-accumulate (MAC) core. The C3x also performs fixed-point math based on 24-bit-wide data, or the lower 24-bit mantissa of the floating-point registers. Although most designers use the C3x for its floating-point capability, fixed-point math is occasionally useful for functions such as clipping of image data. On the µP side, the C3x supports a unified, flexible, 24-bit address space (16M 32-bit words). On the DSP side, the C3x processor performs single-cycle MAC processing. The processor receives the next instruction while accessing two data values for the current instruction's MAC cycle.

To help reduce core and code size, the C3x family does not support IEEE floating-point formats. The C3x format uses an implied sign bit to increase precision. In most applications, the difference in data format is relevant only if you are passing the data to another processor. However, you can convert between the C3x and IEEE formats if necessary.

The TMS320C3x DSP comprises memory/access, central-core, and I/O subsystems. The memory/access subsystem comprises separate program, data, and DMA buses, which allow parallel program fetches, data reads and writes, and DMA operations. This internal busing scheme enables programs to access the next instruction and two data values simultaneously and to transfer data to or from the I/O subsystem in one cycle. The data-address buses share a data bus that can make two sequential RAM accesses in one cycle because the buses run at twice the speed of the processor core. One 64-word, lockable, on-chip program cache automatically loads as the DSP accesses instructions from external memory. The two 4-kbyte dual-access RAM blocks hold parameters and constants for sum-of-products MAC processing, and a 32-kbyte ROM can hold code or coefficients for MAC processing (C30 only). The new TMS320VC33 has two additional, 16k×32-bit RAM blocks.
The central core has its own set of buses to move data and results. These buses move data among internal registers; an integer/floating-point multiplier; a parallel, 32-bit barrel shifter/ALU; and the memory subsystem. The core stores results in extended-precision or auxiliary registers. Two address generators in the subsystem generate the addresses to access the data memories. The core registers, eight 40-bit extended-precision registers, auxiliary registers, and key-control registers reside in a central multiported register file of 32 registers. The C3x uses a software stack to support context switching.

The third C3x subsystem, the I/O, comprises a single-channel DMA controller (dual channel in the C32) and a collection of peripherals that interlink with the peripheral-address and data-bus set. The memory-subsystem buses pass through a multiplexer and link to the peripheral bus, which serves the DMA controller and peripherals.

Device features 32-bit, floating-point architecture.
Harvard-like architecture has a von Neumann-like programming environment.

Addressing modes: The C3x supports register-direct, paged-memory-direct, register-indirect, and immediate addressing. A single circular buffer supports circular addressing and bit-reversed addressing for FFTs. The circular buffer requires block-size and base-pointer registers plus an auxiliary register that the buffer shares with X and Y memories.

Special instructions: The C3x performs single- or block-instruction hardware looping (supports nestable block repeats but lacks automatic save and restore of status); standard branches, which empty the pipe; delayed branches, which wait three program cycles before changing the program counter; interlocked access instructions for multiprocessing (load/store integer or floating-point value and signal interlocked); computed gotos (dynamic subroutine calls); and conversion of floating-point to integer and vice versa. You can also specify instructions to execute in parallel.

Support: TI supplies full-speed in-circuit emulators, evaluation modules, and starter kits. The C3x, except for the C33, lacks JTAG support but has a proprietary five-pin emulation interface. TI sells a tool set that includes a C compiler, an assembler/linker, a source-level debugger, a code profiler, a simulator, and an application library.
Third-party tools include C and ADA compilers, multiple OS products, filter-design packages, advanced graphical-design tools, and hardware tools. Check out www.ti.com/sc/docs/tools/dsp/ index.html for more information.
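The circular and bit-reversed addressing modes mentioned for the C3x can be illustrated in software. The following Python sketch models only the address arithmetic; the function names are hypothetical, and it ignores any alignment rules the hardware may place on the buffer base address.

```python
def circular_next(addr, step, base, block_size):
    """Advance addr by step, wrapping inside [base, base + block_size).

    base and block_size stand in for the base-pointer and block-size
    registers; this is a behavioral model, not the hardware algorithm.
    """
    return base + (addr - base + step) % block_size


def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index (0b001 becomes 0b100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Return the bit-reversed index order used by an n-point radix-2 FFT."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]
```

For an eight-point FFT, `bit_reversed_order(8)` gives the order in which a radix-2 algorithm reads or writes its samples. Stepping past the end of an eight-word buffer with `circular_next` wraps back to the base address.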

www.ednmag.com

March 30, 2000 | edn 95

dspdirectory 32

bits

Texas Instruments TMS320C4x


The C4x has seven internal buses and on-chip memories that help deliver single-cycle execution when walking through X and Y memories for a series of multiply-accumulate (MAC) operations. Rather than time-sharing a single bus system, the C4x features separate buses for program and two data fetches. Additionally, the C4x has a floating-point-unit multiplier, an ALU, and a barrel shifter for parallel operations. The C4x also performs 32-bit, fixed-point math based on either 32-bit memory values or the 32-bit mantissa of its 40-bit floating-point registers.

A 128-word cache enables the processor to deliver single-cycle pipelined execution and still use slower external memory. (It does not use the cache with internal memory.) Key inner routines fill the cache as they run. The CPU accesses an instruction from external memory and automatically loads the instruction into cache, which is divided into four 32-word segments or lines. The CPU uses a least-recently-used algorithm to select the cache segment for the new instructions. You can freeze a segment in the cache by setting cache-freeze bits in the CPU-status register.

Six 8-bit independent communications ports support point-to-point communications with networks of C40s and peripherals. (The C44 has only four ports.) Each port comprises eight data pins and four handshake signals. These ports free the 32-bit local and global external-memory buses for program or data accesses to the processor's 4G-word address space (C40 and C44 only). Program and data occupy a unified address space that you can configure according to your memory requirements. The local and global buses have different memory block assignments within each memory space. I/O can also use the external buses.

A six-channel DMA subsystem with its own address and data buses moves data between the communications ports and memory without altering the CPU's sequential threads. Such data movements do not overload the DSP with servicing overhead, although some data contention for memory may slow CPU execution.

Addressing modes: The C4x supports register-direct, paged-memory-direct, register-indirect, immediate, and circular addressing to support single-sized circular buffers. The CPU applies bit-reversed operations to register-indirect addressing only.

Special instructions: The C4x performs single or block instruction, zero-overhead hardware looping (nestable block repeats but without automatic save and restore of status), standard/delayed branches, interlocked access for multiprocessing (load/store integer or floating-point value and signal interlocked), conversion of floating point to integer and vice versa, reciprocal and reciprocal square-root seed, and conversion to and from IEEE floating-point formats. You can also specify some instructions to execute in parallel.

Support: Development system includes scan-based emulation via the C4x's JTAG test port. External hardware can use the JTAG port to control the processor and to set and monitor registers or memory. You can string multiple C4x chips on a JTAG circuit for parallel debugging. One processor breakpoint can halt execution in an array of C4x chips, and you can single-step them all in lock step. TI sells a C4x evaluation board with four processors that works with a number of host platforms. Software tools include a C compiler, a source-level debugger for parallel debugging, an assembler/linker, and a simulator. TI offers an application library. Third-party support includes the Spox, Parallel C, Virtuoso, and Helios OSs, as well as a variety of hardware tools.

DSP has 32-bit, floating-point architecture. Device features seven internal buses. Family members have five-port register files. Device features a 128-word cache.
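The cache behavior described for the C4x (four 32-word segments, least-recently-used replacement, and freezing) can be sketched as a small Python model. The class and method names are hypothetical. As simplifying assumptions, the model fills a whole segment on a miss and keeps one freeze flag per segment, following the wording above; the actual hardware may track validity per word and may freeze at a different granularity.

```python
SEGMENT_WORDS = 32  # words per cache segment, from the text
NUM_SEGMENTS = 4    # four segments give the 128-word cache


class SegmentCache:
    """Behavioral model of a four-segment instruction cache with LRU replacement."""

    def __init__(self):
        self.tags = [None] * NUM_SEGMENTS     # segment number held, or None if empty
        self.frozen = [False] * NUM_SEGMENTS  # assumed per-segment freeze flags
        self.lru = list(range(NUM_SEGMENTS))  # segment indices, least recent first

    def _touch(self, seg):
        """Mark a segment as the most recently used."""
        self.lru.remove(seg)
        self.lru.append(seg)

    def fetch(self, address):
        """Fetch an instruction word; return True on a hit, False on a miss."""
        tag = address // SEGMENT_WORDS
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        # Miss: replace the least recently used segment that is not frozen.
        victim = next((s for s in self.lru if not self.frozen[s]), None)
        if victim is not None:
            self.tags[victim] = tag
            self._touch(victim)
        return False
```

In this model, a loop that fits in a single 32-word segment misses once and then hits on every later fetch. Setting `frozen[seg] = True` for that segment keeps it resident while other code streams through the remaining three segments.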


dspdirectory 64

bits

Module Research Center's NeuroMatrix NM6403 DSP


The NeuroMatrix NM6403 DSP is a dual-core, superscalar processor that includes the 64-bit Vector coprocessor. It provides on-chip saturation and supports vector-vector, vector-matrix, and matrix-matrix multiplication. Module's NM6403 is the first dual-core, application-specific DSP processor based on the NeuroMatrix architecture. It provides scalable performance, a programmable operand width of 1 to 32 bits, and operation as fast as 50 MHz. This flexibility allows designers to trade precision for performance to suit their applications.

The NM6403 processor comprises a 32-bit RISC core and a patent-pending, 64-bit Vector coprocessor that supports vector operations with elements of variable bit length. Two identical programmable interfaces work with any memory types, and two communication ports are hardware-compatible with TI's TMS320C4x; this approach allows you to build multiprocessor systems. The RISC core has a five-stage pipeline that operates with 32- and 64-bit-wide instructions. (Each instruction usually executes two operations.) Two 64-bit interfaces support SRAM, DRAM, and extended-data-out DRAM and comprise two separate address-generation units that can access as much as 16 Gbytes of address space. Each interface supports two memory banks and can function in a shared-memory mode. Two DMA coprocessors transfer data between high-speed I/O communication ports and external memory.

The Vector coprocessor works on packed integer data comprising 64-bit blocks in the form of variable 1- to 64-bit words. The hardware also supports vector-matrix or matrix-matrix multiplication. The Vector coprocessor's core looks like an array multiplier, which the company calls the NeuroMatrix engine. The structure comprises cells that include a 1-bit memory (flip-flop) surrounded by several logical elements. You can combine the cells into several macrocells by using two 64-bit programmable registers. These registers define the borders between rows and columns of macrocells. Each macrocell performs the multiplication on variable input words using preloaded coefficients and accumulates the result from the macrocells in the column above it. The columns simultaneously calculate the results in one processor cycle. For 8-bit data and coefficients, the Vector coprocessor performs 24 multiplication/accumulations with 21-bit results in one 20-nsec processor cycle. The number of multiplications/accumulations depends on the length and number of words packed into a 64-bit block. The engine's configuration can change dynamically during the calculations. You can start the application with maximum precision and minimum performance and then dynamically increase the performance by reducing the data-word lengths. To avoid arithmetic overflow, the NM6403 uses two types of saturation functions with user-programmable saturation boundaries.

Addressing modes: The NM6403 supports 32-bit immediate, base, indexed, and relative addressing modes.

Special instructions: The processor uses vector instructions for efficient work on packets of as many as 32 64-bit data words. This type of instruction may define such operations as matrix-matrix, matrix-vector, or vector-vector multiplication; vector-vector addition/subtraction with saturation of results; and block moving. The NM6403 has conditional branch, call, and return instructions.

Support: The development tools for PCs cost $1995 and include an ANSI X3J16/95-0029 preliminary-standard-compatible C++ compiler, an assembler, an instruction-level simulator, a cycle-accurate simulator, a linker, a source-level debugger, a load/exchange library, and a set of application-specific vector-matrix libraries. RC Module also offers a dual-CPU PCI evaluation/development board for $1495. The vector-matrix library simplifies C-language programming and allows you to design DSP applications such as FFT, Sobel, and Hadamard transform. Module also provides an NM6403 Verilog model for Sun host platforms for system-level simulation and pc-board design.
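The packed multiply-accumulate with saturation that the NM6403 description outlines can be approximated in software. This Python sketch is a behavioral illustration only: the function names are hypothetical, the words are assumed to be signed two's-complement values packed from the low bits upward, and the saturation is a simple clamp rather than either of the chip's two saturation functions. One layout consistent with the quoted figure of 24 multiply-accumulates, assumed here rather than stated in the text, is eight 8-bit inputs feeding three columns.

```python
def pack_words(words, width):
    """Pack signed words of `width` bits into one integer block, lowest bits first."""
    mask = (1 << width) - 1
    block = 0
    for i, word in enumerate(words):
        block |= (word & mask) << (i * width)
    return block


def unpack_words(block, width):
    """Split a 64-bit block into signed two's-complement words of `width` bits."""
    if not 1 <= width <= 64:
        raise ValueError("width must be between 1 and 64")
    mask = (1 << width) - 1
    sign = 1 << (width - 1)
    words = []
    for i in range(64 // width):
        raw = (block >> (i * width)) & mask
        words.append(raw - (1 << width) if raw & sign else raw)
    return words


def saturate(value, low, high):
    """Clamp value to the programmable bounds [low, high]."""
    return max(low, min(high, value))


def vector_matrix_mac(block, coeffs, width, low, high):
    """Multiply the packed input vector by coeffs and saturate each column sum.

    coeffs[row][col] holds one row per packed input word and one column
    per output, mirroring the preloaded coefficients of the macrocells.
    """
    inputs = unpack_words(block, width)
    columns = len(coeffs[0])
    return [
        saturate(sum(x * row[c] for x, row in zip(inputs, coeffs)), low, high)
        for c in range(columns)
    ]
```

With eight 8-bit inputs and an 8-by-3 coefficient matrix, one call performs 24 multiply-accumulates and returns three column sums. Narrowing `width` packs more words into each block, which is the precision-for-throughput trade the text describes.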

