Kenny Lu
kennylu.hl@gmail.com
Rong Huang
rongster3544@gmail.com
Abstract: Convolutional codes are used in digital
communications to achieve low BER and high throughput.
A widely used convolutional code decoder is the Viterbi
decoder. This report examines how different architectures
and algorithms affect speed, power, and area. The feedback
nature of the algorithm is identified as the speed
bottleneck, and inefficient management of decision bits results
in large power and area costs. The analysis is based on hard
decoding of a (2,1,3) convolutional code.
I. INTRODUCTION
Communicating information is an essential part of many
applications such as television, satellite communication, digital
radio, and phones. One problem that must be dealt with when
transmitting bits through a channel is noise. Simply
sending the raw sequence of bits would not work reliably. One way
of solving this is by encoding the data. Convolutional codes
achieve low BER by adding redundancy to the source symbols.
This can be done by using encoders that take a sequence of
inputs and produce a sequence of outputs based on some
pattern. The received pattern must then be decoded. The
result is the transmission of data with a low error rate. In this
report we discuss the Viterbi decoder. We will be
building a decoder for a (2,1,3) encoder. This means that for
every one bit that is input into the encoder, two bits
come out. The three means that each input bit affects
three pairs of outputs. Another way of saying this is
that this is a code of rate 1/2 and constraint length 3.
II. THE VITERBI ALGORITHM
The general diagram of a Viterbi Decoder is shown in
Figure 1. It consists of a Branch Metric Unit (BMU), Add
Compare Select Recursion Unit (ACSU), and Survivor
Management Unit (SMU).
Figure 1. The building blocks of a Viterbi Decoder
The BMU determines the reliability between two groups of
bit streams. For an encoder there are always 2^(K-1) different
states, so for our encoder there will be four states. The
BMU looks at two groups of these numbers and gives us the
Hamming distance, which is simply how many bits differ
between the numbers. This is called a hard decision.
There is also a soft decision that is more complex. In our case
the only possible distances are 0, 1, and 2.
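As an illustrative sketch (not code from the report), the hard-decision branch metric of the BMU can be expressed as the Hamming distance between 2-bit symbols; the function names here are our own:

```python
# Sketch of a hard-decision branch metric unit (BMU) for a rate-1/2 code.
# For each received 2-bit symbol, the branch metric is the Hamming distance
# to the 2-bit symbol an encoder branch would have emitted (0, 1, or 2).

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two symbols."""
    return bin(a ^ b).count("1")

def branch_metrics(received: int, expected_symbols: list[int]) -> list[int]:
    """Branch metric for each candidate branch symbol."""
    return [hamming_distance(received, s) for s in expected_symbols]

# Example: received symbol 0b10 compared against the four possible
# 2-bit encoder outputs.
print(branch_metrics(0b10, [0b00, 0b01, 0b10, 0b11]))  # [1, 2, 0, 1]
```

A soft-decision BMU would replace the Hamming distance with a distance over quantized analog values, at higher hardware cost.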
The ACSU recursively calculates the path with the least
cost based on incoming branch costs. This unit compares the
path metrics after adding the newly inputted branch metrics
from the BMU. It then makes a decision based on the path
metrics and the output is sent to the SMU. The operational
speed of the Viterbi decoder is limited by the ACSU unit. This
is due to its recursive nature (pipelining can be easily
introduced in the BMU and SMU since they are purely
feedforward). One of the main goals in this article is to show
how this bottleneck can be dealt with.
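One recursion of the ACSU can be sketched in software as follows. This is a minimal behavioral model for the four-state trellis described in the text; the predecessor mapping assumes states are 2-bit shift-register values, and the function names are our own:

```python
# A minimal sketch of one radix-2 ACS recursion step for a 4-state trellis
# (states are 2-bit values, as for the (2,1,3) code in the text). Branch
# metrics would come from the BMU; here they are passed in as a function.

def acs_step(path_metrics, branch_metric):
    """One add-compare-select step.

    path_metrics: list of 4 path metrics at stage k-1.
    branch_metric(prev, cur): cost of the branch prev -> cur.
    Returns (new_path_metrics, decisions) where decisions[s] is the
    surviving predecessor of state s (the value sent to the SMU).
    """
    new_metrics, decisions = [], []
    for s in range(4):
        # State s is reachable from the two states whose low bit
        # equals the high bit of s (shift-register state update).
        preds = [s >> 1, (s >> 1) | 2]
        cands = [path_metrics[p] + branch_metric(p, s) for p in preds]
        best = 0 if cands[0] <= cands[1] else 1
        new_metrics.append(cands[best])   # add + compare
        decisions.append(preds[best])     # select: record the survivor
    return new_metrics, decisions
```

The key point is that `acs_step` at stage k needs the output of `acs_step` at stage k-1, which is exactly the feedback loop that prevents pipelining in hardware.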
The SMU takes decision values from the ACSU and uses
them to determine the survivor sequences. The output from the
SMU is the decoded sequence. If we imagine a trellis diagram,
we can assign each node a pair of numbers (j,k). The j
represents the state of the node and the k represents the stage of
the node. The parameter N will denote the number of
rows/states in the trellis. Each node has a corresponding
decision value and input value. The decision value is the
previous state in the trellis. The input value is the value that the
encoder must have received to go from the previous state to the
current state. The survivor sequence of a node is the sequence
of inputs that results in the smallest path metric to reach that
node. One property that must be kept in mind is the merging
property that says for each node in a stage k, the survivor
sequences will be the same, except for the most recent X
stages. The value of X depends on the type of encoder that is
used as well as the signal-to-noise ratio. For our (2,1,3)
encoder we are using an X of 15 [4].
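A traceback-style SMU can be sketched as below. The structure of `decisions` matches the ACS output described above; the helper `input_bit`, which recovers the encoder input that drives a transition, is our own illustrative naming:

```python
# Sketch of survivor management by traceback. decisions[k][s] is the
# surviving predecessor of state s at stage k (the ACS decision value);
# input_bit(prev, cur) returns the encoder input bit that causes the
# prev -> cur transition. By the merging property, after X stages
# (X = 15 here) all survivor paths agree on the older bits, so tracing
# back from any final state recovers the same decoded prefix.

def traceback(decisions, final_state, input_bit):
    bits = []
    state = final_state
    for dec in reversed(decisions):          # walk stages newest -> oldest
        prev = dec[state]                    # survivor predecessor
        bits.append(input_bit(prev, state))  # bit behind this transition
        state = prev
    return list(reversed(bits))              # decoded input sequence
```

For the 2-bit shift-register states used above, `input_bit` is simply `lambda prev, cur: cur & 1`, since the new state's low bit is the most recent encoder input.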
III. BREAKING THE ACS BOTTLENECK
A. Carry Ripple ACS
The recursive nature of the ACS limits the speed and
throughput of Viterbi decoders. A 5-bit carry ripple ACS is
shown in Figure 2A. The critical path is the loop traced out by
the dotted line. The critical path is equal to the iteration
bound. Retiming provides no benefits and pipelining cannot
be applied. The delay through the loop is
Tsum + (N-1)Tcarry + NTmax
where N is the ACS wordlength. Since the delay is linearly
proportional to N, the decoder throughput is inversely
proportional to N. This relationship limits decoder throughput.
The maximum operating frequency is 909MHz on a 32nm
ASIC and 270MHz on the Xilinx Virtex 7 as shown in Table
1. The carry ripple ACS serves as a reference that will be
compared with alternative designs which improve throughput
at the expense of area and power.
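To make the scaling concrete, the loop delay formula can be evaluated numerically. The gate delays below are hypothetical placeholders, not figures from the report; only the formula itself comes from the text:

```python
# Numeric sketch of the carry-ripple ACS loop delay
#     T_loop = Tsum + (N-1)*Tcarry + N*Tmax
# with illustrative (assumed) component delays in picoseconds. Because
# the loop cannot be pipelined, the maximum clock rate is 1 / T_loop.

Tsum, Tcarry, Tmax = 50, 30, 40  # assumed delays in ps, for illustration

def loop_delay_ps(N: int) -> int:
    return Tsum + (N - 1) * Tcarry + N * Tmax

for N in (5, 8, 16):
    d = loop_delay_ps(N)
    print(f"N={N:2d}: delay {d} ps, f_max {1e6 / d:.0f} MHz")
```

The delay grows linearly in the wordlength N, so throughput falls roughly as 1/N, which is the relationship the text identifies.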
B. Carry Save ACS
Carry save arithmetic can be used instead of carry ripple to
reduce delay. Figure 2B shows the implementation of a 5-bit
carry save ACS. Instead of rippling the carries through the
adders, the carries are passed to the code converter (CC)
preceding the maximum selection logic (MS) of the next
higher bit. Carry save arithmetic requires a more complex
comparison logic than carry ripple, which will be discussed
after analysis of the critical path.
The carry save implementation has an iteration bound of
Tadd + Tcc + 2Tms
which is independent of the wordlength N. As traced out by
the red line in Figure 2B, the critical path is
Tadd + Tcc + NTms
The critical path is no longer a loop and benefits from
retiming. Figure 2C shows the ACS after retiming. The critical
path becomes
2Tadd + 2Tcc + 2Tms
Although the number of adders and code converters is
doubled, the number of maximum select units becomes fixed.
The critical path delay is independent of N, and the speedup is
apparent for large values of N.
The carry save arithmetic requires a more complex
comparison logic detailed in [2] and [3], since inputs have
ternary weights of 0, 1, and 2. While the carry-ripple can
determine the larger input at the first different largest bit, this
cannot be done for the carry-save due to ternary weights. The
code converter was proposed in [3] to simplify the logic.
The 5-bit carry-save ACS operates at 714MHz as shown in
Table 1, which is slower than the reference implementation.
However, the retimed carry-save ACS operates faster at
1.25GHz. The retimed implementation for both ASIC and
FPGA increases throughput by 38%. Area increases by 70%
and power by 100% for ASIC while resource utilization
increases by 70% for FPGA.
C. Radix-4 ACS
Figure 3 shows the trellis of a 4-state convolutional code.
In each stage, any state has two possible transitions from the
preceding stage. The ACS units described previously are of
radix-2 architecture. A radix-2 ACS selects the state at K by
comparing two paths linking stages K-1 to K. It moves one
stage in the trellis during each ACS recursion. Figure 3 also
shows a radix-4 trellis. Since there is a one to one mapping
between the paths in the original radix-2 trellis and collapsed
radix-4, the decoded paths are identical. In fact, a trellis can be
collapsed into any radix-2^M trellis.
In the ideal case where the radix-2^M ACS operates at the
same speed as the radix-2 ACS, the speedup is M times. Note
that the number of inputs to compare for a radix-2^M ACS is
2^M. To first order, that is an exponential increase in power
and area for a linear speedup. Radix-4 provides reasonable
tradeoffs.
Figure 3. Radix-2 and collapsed radix-4 trellis
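Why the collapsed trellis decodes identically can be checked with a small numeric sketch. One radix-4 step minimizes over all 2^M = 4 two-stage paths at once, which, by associativity of the minimum, equals two successive radix-2 minimizations. The branch metric and state mapping below are illustrative, not from the report:

```python
# Sketch showing that one radix-4 ACS step over a collapsed trellis
# yields the same path metrics as two successive radix-2 ACS steps.

def radix2_step(pm, bm, preds):
    """One radix-2 step: minimize over single-stage paths."""
    return [min(pm[p] + bm(p, s) for p in preds(s)) for s in range(len(pm))]

def radix4_step(pm, bm, preds):
    """One radix-4 step: minimize directly over two-stage paths g -> m -> s."""
    return [min(pm[g] + bm(g, m) + bm(m, s)
                for m in preds(s) for g in preds(m))
            for s in range(len(pm))]

preds = lambda s: [s >> 1, (s >> 1) | 2]  # 4-state trellis from the text
bm = lambda p, s: (p ^ s) & 1             # toy branch metric (assumed)
pm = [0, 3, 1, 2]                         # arbitrary starting path metrics

two_radix2 = radix2_step(radix2_step(pm, bm, preds), bm, preds)
one_radix4 = radix4_step(pm, bm, preds)
assert two_radix2 == one_radix4           # identical metrics, as claimed
```

The trade-off visible here is that each radix-4 state compares four candidate paths instead of two, which is the source of the exponential growth in comparison hardware noted above.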
We implemented two versions of the radix-4 ACS. The first
implementation uses hierarchical comparison and the second
uses parallel comparison. The hierarchical comparison is a
tree with M stages of comparisons that uses a total of 2^M - 1
maximum selection units. The parallel comparison only has 1
stage but requires