
Architectural Exploration for Viterbi Decoders

Kenny Lu
kennylu.hl@gmail.com
Rong Huang
rongster3544@gmail.com
Abstract: Convolutional codes are used in digital
communications to achieve low BER and high throughput.
A widely used decoder for convolutional codes is the Viterbi
decoder. This report examines how different architectures
and algorithms affect speed, power, and area. The feedback
nature of the algorithm is identified as the speed
bottleneck, and inefficient management of decision bits results
in large power and area. The analysis is based on hard
decoding of a (2,1,3) convolutional code.
I. INTRODUCTION
Communicating information is an essential part of many
applications such as television, satellite communication, digital
radio, and phones. One problem that must be dealt with when
transmitting bits through channels is the noise level. Simply
sending the desired sequence of bits would not work. One way
of solving this is by encoding the data. Convolutional codes
achieve low BER by adding redundancy to the source symbols.
This can be done with encoders that take a sequence of
inputs and produce a sequence of outputs based on some
pattern. The received pattern must then be decoded; the
result is the transmission of data with a low error rate. In this
report we discuss the Viterbi decoder. We will be
building a decoder for a (2,1,3) encoder. This means that for
every one bit input into the encoder, two bits
come out. The three means that each input bit has an
effect on three pairs of outputs. Another way of saying this is
that the code has a rate of 1/2 and a constraint length of 3.
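As a concrete illustration, the encoding step described above can be sketched in a few lines. The generator polynomials used here (the common octal 7/5 pair for a constraint-length-3 code) are an assumption; the report does not list the specific generators it uses.

```python
def conv_encode(bits, g0=0b111, g1=0b101):
    """Rate-1/2, constraint-length-3 convolutional encoder.

    g0 and g1 are ASSUMED generator polynomials (the common
    octal 7/5 pair); the report does not specify its generators.
    """
    state = 0  # two-bit shift register holding the previous two inputs
    out = []
    for b in bits:
        reg = (b << 2) | state                     # current input + history
        out.append(bin(reg & g0).count("1") % 2)   # parity against g0
        out.append(bin(reg & g1).count("1") % 2)   # parity against g1
        state = reg >> 1                           # shift: drop oldest bit
    return out

# Encoding 1 0 1 1 produces the symbol pairs 11 10 00 01:
print(conv_encode([1, 0, 1, 1]))  # [1, 1, 1, 0, 0, 0, 0, 1]
```

Each input bit produces two output bits, and each output pair depends on the current input plus the two previous inputs, matching the rate-1/2, constraint-length-3 description above.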
II. THE VITERBI ALGORITHM
The general diagram of a Viterbi Decoder is shown in
Figure 1. It consists of a Branch Metric Unit (BMU), Add
Compare Select Recursion Unit (ACSU), and Survivor
Management Unit (SMU).
Figure 1. The building blocks of a Viterbi Decoder
The BMU determines the reliability between two groups of
bit streams. For an encoder there are always 2^(K-1) different
states, so our encoder has four states. The
BMU compares two groups of these numbers and outputs the
Hamming distance, which is simply the number of differing bits
between them. This is called a hard decision.
(A soft decision is also possible but is more complex.) In our case
the only possible distances are 0, 1, and 2.
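The hard-decision branch metric above is just a Hamming distance between the received symbol and a branch label; a minimal sketch:

```python
def branch_metric(received, expected):
    """Hamming distance between a received symbol and the expected
    branch label; for a rate-1/2 code this is 0, 1, or 2."""
    return sum(r != e for r, e in zip(received, expected))

print(branch_metric((1, 1), (1, 1)))  # 0 - symbols agree
print(branch_metric((1, 0), (0, 0)))  # 1 - one bit differs
print(branch_metric((0, 0), (1, 1)))  # 2 - both bits differ
```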
The ACSU recursively calculates the path with the least
cost based on incoming branch costs. This unit compares the
path metrics after adding the newly inputted branch metrics
from the BMU. It then makes a decision based on the path
metrics, and the output is sent to the SMU. The operational
speed of the Viterbi decoder is limited by the ACSU due to its
recursive nature (pipelining can easily be introduced in the
BMU and SMU, since they are purely feedforward). One of the
main goals of this article is to show how this bottleneck can be
dealt with.
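The add-compare-select recursion can be sketched in software as follows. The dictionary-based trellis connectivity (a per-state list of predecessors) is an illustrative assumption, not the paper's hardware structure:

```python
def acs_step(path_metrics, branch_metrics, predecessors):
    """One add-compare-select recursion over all states.

    path_metrics[s]: accumulated metric of state s at stage k-1.
    branch_metrics[(p, s)]: BMU cost of the branch p -> s.
    predecessors[s]: states that can transition into s (ASSUMED
    table layout; the actual connectivity depends on the encoder).
    Returns the new metrics and the per-state decisions for the SMU.
    """
    new_metrics, decisions = {}, {}
    for s, preds in predecessors.items():
        # add: path metric + branch metric for each incoming branch
        candidates = [(path_metrics[p] + branch_metrics[(p, s)], p)
                      for p in preds]
        # compare and select: keep the least-cost path and its source
        new_metrics[s], decisions[s] = min(candidates)
    return new_metrics, decisions
```

The feedback is visible here: each call consumes the metrics produced by the previous call, which is why the recursion cannot be pipelined the way the feedforward BMU and SMU can.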
The SMU takes decision values from the ACSU and uses
them to find out the survivor sequences. The output from the
SMU is the decoded sequence. If we imagine a trellis diagram,
we can assign each node a pair of numbers (j,k). The j
represents the state of the node and the k represents the stage of
the node. The parameter N will denote the number of
rows/states in the trellis. Each node has a corresponding
decision value and input value. The decision value is the
previous state in the trellis. The input value is the value that the
encoder must have received to go from the previous state to the
current state. The survivor sequence of a node is the sequence
of inputs that results in the smallest path metric to reach that
node. One property that must be kept in mind is the merging
property: for each node in a stage k, the survivor
sequences will be the same except for the most recent X
stages. The value of X depends on the type of encoder used
as well as the signal-to-noise ratio. For our (2,1,3)
encoder we use an X of 15 [4].
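The (j,k) node bookkeeping above can be made concrete with a small predecessor table for the four-state trellis. The state bit packing used here (state = the last two input bits) is an assumption chosen for illustration:

```python
def build_trellis(n_states=4):
    """Predecessor table for a 4-state (K=3) trellis.

    State encoding is an ASSUMPTION: state packs the previous two
    input bits, so input b moves state s to (b << 1) | (s >> 1).
    Returns, for each state, its (previous_state, input_bit) pairs,
    i.e. the decision value and input value of each node.
    """
    preds = {s: [] for s in range(n_states)}
    for p in range(n_states):
        for b in (0, 1):
            nxt = ((b << 1) | (p >> 1)) & (n_states - 1)
            preds[nxt].append((p, b))
    return preds

print(build_trellis())
```

Note that both branches into a given state carry the same input bit; this matches the observation above that the input value is determined by the transition into the current state.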
III. BREAKING THE ACS BOTTLENECK
A. Carry Ripple ACS
The recursive nature of the ACS limits the speed and
throughput of Viterbi decoders. A 5-bit carry ripple ACS is
shown in Figure 2A. The critical path is the loop traced out by
the dotted line. The critical path is equal to the iteration
bound. Retiming provides no benefits and pipelining cannot
be applied. The delay through the loop is
Tsum + (N-1)Tcarry + NTmax
where N is the ACS wordlength. Since the delay is linearly
proportional to N, the decoder throughput is inversely
proportional to N. This relationship limits decoder throughput.
The maximum operating frequency is 909 MHz on a 32 nm
ASIC and 270 MHz on the Xilinx Virtex-7, as shown in Table
1. The carry-ripple ACS serves as the reference against which
alternative designs that improve throughput at the expense of
area and power will be compared.
B. Carry Save ACS
Carry save arithmetic can be used instead of carry ripple to
reduce delay. Figure 2B shows the implementation of a 5-bit
carry save ACS. Instead of rippling the carries through the
adders, the carries are passed to the code converter (CC)
preceding the maximum selection logic (MS) of the next
higher bit. Carry save arithmetic requires a more complex
comparison logic than carry ripple, which will be discussed
after analysis of the critical path.
The carry save implementation has an iteration bound of
Tadd+Tcc+2Tms
which is independent of the wordlength N. As traced out by
the red line in Figure 2B, the critical path is
Tadd+Tcc+NTms
The critical path is no longer a loop and benefits from
retiming. Figure 2C shows the ACS after retiming. The critical
path becomes
2Tadd + 2Tcc+2Tms
Although the number of adders and code converters is
doubled, the number of maximum select units becomes fixed.
The critical path delay is independent of N, and the speedup is
apparent for large N.
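The two critical-path expressions can be compared numerically. The unit gate delays used below are placeholders for illustration, not synthesis numbers:

```python
def ripple_acs_delay(n, t_sum=1.0, t_carry=1.0, t_max=1.0):
    """Carry-ripple ACS loop delay: Tsum + (N-1)*Tcarry + N*Tmax."""
    return t_sum + (n - 1) * t_carry + n * t_max

def retimed_carry_save_delay(t_add=1.0, t_cc=1.0, t_ms=1.0):
    """Retimed carry-save critical path: 2*Tadd + 2*Tcc + 2*Tms,
    independent of the wordlength N."""
    return 2 * (t_add + t_cc + t_ms)

# With unit delays (an assumption), the ripple delay grows with N
# while the retimed carry-save delay stays flat:
print([ripple_acs_delay(n) for n in (5, 8, 16)])  # [10.0, 16.0, 32.0]
print(retimed_carry_save_delay())                 # 6.0
```

This is the source of the speedup for large N: the carry-save structure trades a wordlength-dependent loop for a constant-depth one.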
The carry save arithmetic requires a more complex
comparison logic detailed in [2] and [3], since inputs have
ternary weights of 0, 1, and 2. While the carry-ripple
comparison can determine the larger input at the first differing
most significant bit, this cannot be done for carry-save due to
the ternary weights. The code converter was proposed in [3] to
simplify the logic.
The 5-bit carry-save ACS operates at 714 MHz as shown in
Table 1, which is slower than the reference implementation.
However, the retimed carry-save ACS operates faster at
1.25 GHz. The retimed implementation increases throughput
by 38% for both ASIC and FPGA. Area increases by 70%
and power by 100% for the ASIC, while resource utilization
increases by 70% for the FPGA.
C. Radix-4 ACS
Figure 3 shows the trellis of a 4-state convolutional code.
In each stage, any state has two possible transitions from the
preceding stage. The ACS units described previously are of
radix-2 architecture. A radix-2 ACS selects the state at K by
comparing two paths linking stages K-1 to K. It moves one
stage in the trellis during each ACS recursion. Figure 3 also
shows a radix-4 trellis. Since there is a one-to-one mapping
between the paths in the original radix-2 trellis and the collapsed
radix-4 trellis, the decoded paths are identical. In fact, a trellis
can be collapsed into any radix-2^M trellis.
In the ideal case where the radix-2^M ACS operates at the
same speed as the radix-2 ACS, the speedup is M times. Note
that the number of inputs to compare for a radix-2^M ACS is
2^M. To first order, that is an exponential increase in power and
area for a linear speedup. Radix-4 provides a reasonable
tradeoff.


Figure 3. Radix-2 and collapsed radix-4 trellis
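The two-stage recursion described above can be sketched as a four-way compare-select. The table-driven connectivity and the combined two-stage branch metrics are assumed representations for illustration:

```python
def radix4_acs_step(path_metrics, branch2, predecessors2):
    """One radix-4 ACS recursion: advances two trellis stages at once.

    path_metrics[p]: accumulated metric of state p at stage K-2.
    branch2[(p, s)]: combined metric of the two-stage path p -> s,
    as a radix-4 BMU would supply (16 values for a 4-state code).
    predecessors2[s]: states reaching s in two stages (ASSUMED table).
    The 4-way min can be realized as a 2-stage tree of maximum
    select units (hierarchical) or as one parallel comparison stage.
    """
    new_metrics, decisions = {}, {}
    for s, preds in predecessors2.items():
        candidates = [(path_metrics[p] + branch2[(p, s)], p)
                      for p in preds]
        new_metrics[s], decisions[s] = min(candidates)  # 4-way select
    return new_metrics, decisions
```

One call here replaces two radix-2 recursions, which is where the throughput gain comes from, at the cost of a wider comparison.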
We implemented two versions of the radix-4 ACS. The first
implementation uses hierarchical comparison and the second
uses parallel comparison. The hierarchical comparison is a
tree with M stages of comparisons that uses a total of 2^M - 1
maximum selection units. The parallel comparison has only one
stage but requires more maximum select units. Synthesis
results are shown in Table 1. As predicted, the parallel
implementation achieves a higher speed at further expense of
area and power. Maximum throughput increases by 57% and
47% compared to the reference design for the parallel and
hierarchical implementations, respectively.
Figure 4. Frequency, area, and power of ACS architectures
The radix-4 ACS requires a BMU that outputs branch metrics
for transitions from K-2 to K. It requires more area and power
than the radix-2 BMU to compute 16 metrics rather than 4.
Comparisons of their area and power are also shown in Table
1. Frequencies of 1 GHz (ASIC) and 300 MHz (FPGA) were
chosen to roughly match the ACS speeds; they are not the
maximum operating speeds.
Unit                          Freq (MHz)   Area (µm²)   Power (µW)
Carry-ripple ACS                     909          898          322
Carry-save ACS                       714         1294          337
Carry-save ACS (retimed)            1250         1531          667
Radix-4 ACS (hierarchical)           667         1730          438
Radix-4 ACS (parallel)               714         2169          582
Radix-4 BMU                         1000         1665          652
Radix-2 BMU                         1000          295          131
Table 1A. ASIC synthesis results for ACS and BMU (32 nm).
Unit                          Freq (MHz)   LUTs   Register Bits
Carry-ripple ACS                     270     42              31
Carry-save ACS                       216     52              39
Carry-save ACS (retimed)             347     72              57
Radix-4 ACS (hierarchical)           190     95              50
Radix-4 ACS (parallel)               239     92              49
Radix-4 BMU                          300     17              52
Radix-2 BMU                          300      3              10
Table 1B. FPGA synthesis results for ACS and BMU (Virtex-7).
D. Design Space for ACS
Figure 4 shows the design space for the ACS unit.
Frequencies of the ACS implementations are plotted against
their corresponding area and power for 32nm ASIC. The
carry-ripple implementation is the most power and area
efficient for low throughput decoders. The radix-4 parallel
implementation achieves the highest decoder throughput. A
higher throughput can be achieved by combining the radix-4
parallel and carry-save approaches, since our radix-4
implementation uses carry-ripple arithmetic.
IV. SMU IMPLEMENTATIONS
A. Register Exchange
In the register exchange method, registers keep track of the
most likely path for each state at each stage. The memory unit
is an N by X+1 array of registers. Earlier we said that the
merge length of our encoder is 15 (X = 15), so there are
X+1 = 16 registers for each state. The architecture for register
exchange can be seen in Figure 5. The decision values are the
dotted lines and the input values are the solid lines. The
decision values come from the ACS, and for each state all X+1
registers receive the same decision value in one cycle. The
input value is what is actually stored in the one-bit registers.
The registers in the most recent stage, X, each receive a fixed
input value in every cycle, determined by the encoder. For
example, in our design, state 00 always has 0, state 01 has 1,
state 10 has 0, and state 11 has 1. For the rest of the stages, the
input values that the registers take on depend on the decision
values from the ACS. Each decision value drives a 2-to-1
multiplexer that selects between two input values from the
previous stage. Because of the merging property, all of the
registers at stage 0 will almost always hold the same value, so
any one of them can be chosen.
Figure 5. Register Exchange Architecture
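One register-exchange cycle can be sketched as copying each predecessor's row and appending the state's fixed input bit. The dictionary layout here is an illustrative assumption, not the paper's register-level design:

```python
def register_exchange_step(rows, decisions, state_input):
    """One cycle of register exchange for an N x (X+1) register array.

    rows[s]: survivor input sequence currently ending in state s
    (oldest bit first). decisions[s]: ACS-selected predecessor of s.
    state_input[s]: fixed input bit that drives the encoder into s
    (determined by the code; an ASSUMED mapping in the test below).
    """
    new_rows = {}
    for s, prev in decisions.items():
        # copy the predecessor's survivor row, drop the oldest bit,
        # and append this state's fixed input bit at stage X
        new_rows[s] = rows[prev][1:] + [state_input[s]]
    return new_rows
```

Every row is rewritten on every cycle, which is exactly why RE is expected to dissipate the most power among the SMU variants discussed below.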
B. Traceback
The simple traceback method also involves using an N by
X+1 array of registers. In this case the registers are all
connected like shift registers between each stage. The
architecture can be seen in Figure 6. Each register contains an
input value as well as a decision value. The input values and
decision values are both shifted to the adjacent registers
during each cycle. The output is obtained by starting from any
one of the registers in column X and tracing back through
columns X-1 to 0. The decisions on which registers to trace
back to are made according to the decision values in each
register. The register we start with in column X does not
matter, since due to the merge property all traces will most
likely end up at the same register. Unlike in the register
exchange method, the decision values are now input into the
stage-X registers and shifted across, instead of into every
register at once. Since the values in each register are shifted
directly across, the input values for one row are always the
same. In this design the decision values are input into
multiplexers. For stages X down to 1, the multiplexers transfer
the decision values, but for stage 0 they output the input value
of the register.
Figure 6. Simple Traceback Architecture
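A software sketch of the traceback read-out, under the assumption that the decision columns and the per-state input-bit mapping are stored as plain tables:

```python
def traceback(decision_cols, state_input, start_state=0):
    """Trace survivors back through X+1 stored decision columns.

    decision_cols[k][s]: predecessor of state s chosen at stage k.
    state_input[s]: input bit implied by arriving in state s
    (fixed by the code; an ASSUMED mapping in the test below).
    Any start state works once the paths have merged.
    Returns the decoded bits, oldest first.
    """
    s = start_state
    bits = []
    for col in reversed(decision_cols):
        bits.append(state_input[s])  # bit that led into state s
        s = col[s]                   # follow the decision backwards
    bits.reverse()
    return bits
```

Only one path is followed per read-out rather than updating every register, which is why traceback is expected to make fewer decisions per cycle than register exchange.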
C. Modified Register Exchange
This method uses the pointer concept of modified register exchange (MRE) [6].
Once again we will use an N by X+1 array of registers. In the
original register exchange method, at a point in time, the
contents of the registers in one row represent the survivor
sequence that ends at the corresponding state. Here, it is also
the case that one row represents one survivor sequence, but it
no longer corresponds to the ending state. In the modified
method one row corresponds to the beginning state. There are
four register pointers that correspond to the four beginning
states. Figure 7 shows the contents of the pointer registers and
the contents of the array of registers throughout the process of
the algorithm. This is basically a change in perspective from
which state the survivor sequence ends in to which state the
survivor sequence begins in. A pointer register decides the
next state by using the decision value from an ACS (the
decision value represents the previous state). Since each
pointer can now take values from each ACS there is a
multiplexer for each state pointer to decide which ACS
decision value to use. This choice is made based on the current
state of the pointer along with the decisions of the four ACS.
If the ACS decision value corresponds to the current state of
the pointer then the pointer takes on the state that the ACS
corresponds to. Like the original register exchange method,
each state still corresponds to a fixed input value. Once the
transition to a state has occurred, the corresponding input
value is inserted into stage X of that row. One problem
with this method is that if two decision values claim the same
pointer as their previous state, one path must be abandoned:
the transition with the smaller branch metric is chosen, and
one source path is terminated. At stage 0 there is almost
always only one survivor sequence left, and that row is
output.
Figure 7. Pointer and Register contents for MRE.
D. Results and Comparison for SMU
Unit                   Area (µm²)   Power (µW)   Freq (MHz)
Register Exchange        2096.932      800.789         1000
Traceback                4752.984      975.357         1000
Pointer                  2520.533      834.352         1000
Table 2A. ASIC synthesis results for SMU.

Unit                   Reg Bits   LUTs   Freq (MHz)
Register Exchange           132     60         1015
Traceback                   359     53          123
Pointer                     148     79          558
Table 2B. FPGA synthesis results for SMU.
Table 2 shows the results for ASIC and FPGA. When
looking at these results we care most about lowering the
power, since the SMU is the most power-consuming unit in
the entire decoder architecture. We do not care much about
the frequency, since the critical path of the system is in the
ACS unit. Theoretically the original RE should be by far the
most power consuming, while TB should be the lowest. The
reason is that in RE a decision is made for each register during
each cycle, while in the TB method one path is traced back
per cycle, so fewer decisions are made per cycle. The main
goal of MRE is likewise to significantly reduce power
dissipation: MRE should dissipate less power because some
sequences are terminated during the process. The tables show
that the results are not what we expected; we suspect the
reason is the difficulty of estimation by the synthesizer. When
it comes to area, the TB implementation resulted in more area
than the other two implementations, which makes sense
because TB uses more memory. The TB implementation also
runs at a lower maximum frequency, which makes sense
because tracing back through the registers takes time; this is
acceptable since the real bottleneck is in the ACS unit. One
thing to note about the RE method is that it does not scale
well as the number of states increases, due to the complexity
of the wiring. One drawback of MRE is that the coded
sequence cannot be too long, due to the termination that
occurs during the process.
V. CONCLUSION
To meet the demanding requirements for data rate, power,
and area of modern digital communications, architectural
techniques need to be exploited to maximize the design space
for Viterbi decoders. As the synthesis results show, different
ACS and SMU architectures yield different speed, power, and
area. Low-level circuit tuning offers limited design choices, as
shown for the reference ACS design and the RE method; there
is little wiggle room within any single architecture. Based on
system specifications, architectural choices can be made by
exploring the design space.
REFERENCES
[1] Fettweis, G.; Meyr, H., "High-speed parallel Viterbi decoding:
algorithm and VLSI-architecture," IEEE Communications
Magazine, vol. 29, no. 5, pp. 46-55, May 1991.
[2] Parhi, K.K., "An improved pipelined MSB-first add-compare
select unit structure for Viterbi decoders," IEEE Transactions on
Circuits and Systems I: Regular Papers, vol. 51, no. 3,
pp. 504-511, March 2004.
[3] Fettweis, G.; Meyr, H., "High-rate Viterbi processor: a systolic
array solution," IEEE Journal on Selected Areas in
Communications, vol. 8, no. 8, pp. 1520-1534, Oct. 1990.
[4] Cypher, R.; Shung, C.B., "Generalized trace back techniques
for survivor memory management in the Viterbi algorithm,"
in Proc. IEEE Global Telecommun. Conf., San Diego,
Dec. 1990, pp. 1318-1322.
[5] Black, P.J.; Meng, T.H., "A 140-Mb/s, 32-state, radix-4
Viterbi decoder," IEEE Journal of Solid-State Circuits, vol. 27,
no. 12, pp. 1877-1885, Dec. 1992.
[6] El-Dib, D.A.; Elmasry, M.I., "Modified register exchange
Viterbi decoder for low-power wireless communications," IEEE
Transactions on Circuits and Systems I: Regular Papers, vol. 51,
no. 2, pp. 371-378, Feb. 2004.
