Journal of Systems Engineering and Electronics
Vol. 24, No. 3, June 2013, pp. 382-399
Reduced bit low power VLSI architectures for motion estimation
Shahrukh Agha*, Shahid Khan, Shahzad Malik, and Raja Riaz
Department of Electrical Engineering, COMSATS Institute of Information Technology, Islamabad 44000, Pakistan
Abstract: Low power and real time very large scale integration
(VLSI) architectures of motion estimation (ME) algorithms for mobile devices and applications are presented. The power reduction
is achieved by devising a novel correction recovery mechanism
based on algorithms which allow the use of reduced bit sum of
absolute difference (RBSAD) metric for calculating matching error
and conversion to full resolution sum of absolute difference (SAD)
metric whenever necessary. Parallel and pipelined architectures
for high-throughput full search ME, corresponding to both the full
resolution SAD and the generalized RBSAD algorithm, are synthesized using Xilinx Synthesis Tools (XST). The ME designs
based on reduced bit (RB) algorithms reduce power consumption
by up to 45% and/or area by up to 38%.
Keywords: motion estimation (ME), very large scale integration
(VLSI), reduced bit sum of absolute difference (RBSAD).
DOI: 10.1109/JSEE.2013.00047
1. Introduction
It is common to use prediction techniques to compress video data. For example, in video sequences, a square region of the current frame s(i, j, k), called a macroblock (MB), is sought in a region of the reference frame s(i, j, k ± 1), called the search area, in an attempt to find a region which is similar to it. Here the reference frame either succeeds or precedes the current frame, (i, j) are the spatial coordinates of the pixel, and k is the temporal coordinate, i.e., the frame number. Such a technique is called motion estimation (ME) and is based on simplified optical flow constraints [1-3]. Such techniques have been widely used by video standards including MPEG-1, MPEG-2 and MPEG-4 as well as H.261, H.263 and H.264 [1-4]. There are different kinds of ME algorithms; the most common, and the most suitable for very large scale integration (VLSI) implementation, is the block matching algorithm (BMA) known as the full search ME (FSME) algorithm [2,3]. The FSME algorithm seeks the match of the current MB at each position of the reference MB in the search area, using an error metric called the sum of absolute difference (SAD), and is therefore computationally expensive.
Manuscript received October 9, 2011.
*Corresponding author.
The FSME is the most computationally intensive part of an MPEG-2 encoder, i.e., the dynamic instruction count (the instruction count at run time [5-10]) of the FSME is 60%-80% of the encoder's total dynamic instruction count [3,5-10]. The real time computational complexity of the FSME algorithm on a reduced instruction set computer (RISC) processor for a common intermediate format (CIF) video sequence involves billions of 8-bit arithmetic calculations and memory accesses per second [3]. External memory accesses are also slower than the arithmetic unit, and consume more power than on-chip memory accesses [3]. Hence it may not be possible for an ordinary sequential processor to achieve this throughput while meeting the power constraints of mobile device applications [3,11-13].
This has naturally led to the development of fast subsampling BMAs [1-3,14-16], such as the three-step search [14] and the diamond search (DS) [15]. These fast subsampling algorithms have lower computational complexity, although their quality can deteriorate as well. When these algorithms run on a multiprocessor based system [5-10], their computational complexity reduces further. VLSI implementations of these fast ME algorithms can become complex due to their irregular data flow. The lack of data reuse in these algorithms leads to multiple reads of the same data from the external memory and hence to more power consumption [3,11-13].
We aim here to assess the effects of an orthogonal form of acceleration for FSME which reduces the power consumption and either the area or the time. The computational burden on the encoder is reduced by reducing the resolution of the pixel data, i.e., by truncating the pixel values used in SAD calculations from 8-bit luminance to 4-bit values by excluding the lower four bits, and by employing spatiotemporal correlation based algorithms to compute the lower 4-bit SAD whenever necessary to compensate for the quality loss due to the reduced resolution. Operating on 4-bit numbers, adders may be made smaller and/or faster [17-20] (e.g., a 4-bit ripple adder is as fast as a carry select or carry save adder but is smaller), and the reduced bit SAD (RBSAD) calculations are faster and consume less power. The memory and the input/output (I/O) memory bandwidth are smaller as well [3,11-13,21-25]. A drawback of this method, pointed out previously [21-23], is that the reduced dynamic range, from 8-bit numbers to 4-bit numbers, generates poorer peak signal to noise ratio (PSNR) values through an increased error metric, especially for sequences with high spatial and temporal correlations (e.g., newscaster kind of sequences), and consequently requires a greater bit rate. In [24,25], low bit depth ME algorithms for low power and their hardware architectures were presented. The work in [26-28] showed the use of adaptive quantization in the MPEG-2 encoder for low power and for compensating for the quality loss due to using 4 bit/pixel on average. In [21-23], we presented FSME algorithms based on low resolution pixels along with a novel correction recovery mechanism, based on spatial and temporal correlations among the pixels, that have the benefits of high speed, low power and low area for VLSI architectures while the quality is kept at the same level. In [7-10], we considered the benefits of data level parallelism and multiprocessing in reducing the computational complexity of the ME process and proposed architectures optimized with respect to area, speed and power.
In the current work, we present power efficient parallel and pipelined VLSI architectures of these reduced resolution ME algorithms [21-23], which have low power, low area, high throughput and an efficient implementation due to the regular data flow of these algorithms. These architectures act as loosely coupled coprocessors for the main (serial) RISC processor [3]. Different ways of achieving the required throughput at low power by utilizing the reduced resolution algorithms are also considered.
A brief summary of the paper is as follows. Section 2
describes the mathematical forms of SAD, RBSAD and
correction metrics. Section 3 describes the ME algorithms
based on RBSAD metrics along with the novel correction
recovery mechanism. Section 4 describes results of these
ME algorithms in terms of quality. Section 5 describes the
parallel and pipelined VLSI architectures of these ME algorithms for high throughput and reduced power. Section 6
describes the implementation and comparison of full resolution and RB ME VLSI architectures in terms of area,
power and speed. Section 7 presents the conclusions.
2. RBSAD
As described above, the matching of the current frame square region with the reference frame square region in the search area is performed in terms of the SAD [3] error. The mean square error metric [1-3] can also be used; it represents quality more accurately but has higher computational complexity [3]. The SAD for a 16 × 16 macroblock is given by
$$\mathrm{SAD}(m, n) = \sum_{i=0,\,j=0}^{16} \left| F_C(i, j)_{\langle 7:0\rangle} - F_R(i+m, j+n)_{\langle 7:0\rangle} \right| \tag{1}$$
where F_C(i, j) is the (8-bit luminance) pixel value in the current frame at position (i, j) and F_R(i, j) is the corresponding pixel value in the reference frame.
The best match is the candidate with the minimum error (i.e., SAD); its coordinates (u, v), known as the motion vector (MV), are determined as
$$[u \;\; v] = \arg\min_{(m,n)} \mathrm{SAD}(m, n). \tag{2}$$
We define the RBSAD (i.e., RBSAD<7:4> ) as

$$\mathrm{RBSAD}_{\langle 7:4\rangle}(m, n) = \sum_{i=0,\,j=0}^{16} \left| F_C(i, j)_{\langle 7:4\rangle} - F_R(i+m, j+n)_{\langle 7:4\rangle} \right| \tag{3}$$
Since most of the hardware realizations, for FSME encoding, are designed in a bit slice fashion, (3) may be evaluated by exactly using the same hardware with a 4-bit version of the datapath and controller.
In addition, the value of RBSAD<7:4> in (3) may be corrected to (1) by adding the term δ(m, n) [21-23] given by
$$\delta(m, n) = \sum_{i=0,\,j=0}^{16} \varepsilon_{i,j}(m, n) \left( F_C(i, j)_{\langle 3:0\rangle} - F_R(i+m, j+n)_{\langle 3:0\rangle} \right) \tag{4}$$
where <3:0> refers to the lower four bits of the pixel and ε_{i,j}(m, n) is the sign of (F_C(i, j)<7:4> − F_R(i+m, j+n)<7:4>), except when (F_C(i, j)<7:4> − F_R(i+m, j+n)<7:4>) is zero, in which case ε_{i,j}(m, n) equals the sign of (F_C(i, j)<3:0> − F_R(i+m, j+n)<3:0>).
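The relation between (1), (3) and (4) can be checked numerically. The following is a minimal sketch rather than the authors' implementation: the function names, the use of flat Python lists for a block, and the convention that the reduced bit metric keeps the upper four bits in place (so that it takes values that are multiples of 16) are assumptions made for illustration.

```python
import random


def sad(cur, ref):
    """Full resolution SAD, as in (1), over two equally sized pixel lists."""
    return sum(abs(c - r) for c, r in zip(cur, ref))


def rbsad(cur, ref):
    """Reduced bit SAD, as in (3): the lower four bits of each pixel are masked off."""
    return sum(abs((c & 0xF0) - (r & 0xF0)) for c, r in zip(cur, ref))


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def correction(cur, ref):
    """Correction term, as in (4), from the lower four bits and a per-pixel sign."""
    total = 0
    for c, r in zip(cur, ref):
        hi = (c >> 4) - (r >> 4)          # difference of the upper four bits
        lo = (c & 0x0F) - (r & 0x0F)      # difference of the lower four bits
        eps = sign(hi) if hi != 0 else sign(lo)
        total += eps * lo
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    cur = [rng.randrange(256) for _ in range(256)]   # one 16 x 16 block, flattened
    ref = [rng.randrange(256) for _ in range(256)]
    # The corrected reduced bit metric reproduces the full resolution SAD.
    assert sad(cur, ref) == rbsad(cur, ref) + correction(cur, ref)
```

The per-pixel reasoning is that the sign of an 8-bit difference is decided by the upper four bits whenever they differ, because the lower four bits can change the difference by at most 15.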
3. Novel correction recovery mechanism

3.1 Correction algorithm
All natural sequences usually possess center biased (CB)
motion due to the spatial and temporal correlation among
the pixels and their motion fields between the successive
384
Journal of Systems Engineering and Electronics Vol. 24, No. 3, June 2013
frames [3,29,30]. Because of this, there is a high probability of finding the best match in the small region surrounding the center of the search area unless the motion
is fast. We assume the distribution of the matches to be
Gaussian [31], i.e.,
$$p(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \tag{5}$$
where σ is the standard deviation and p(·) is the probability density. The equivalent points, i.e., the points of equal error, lie on the circles
$$x^2 + y^2 = c^2,$$
while for the Laplacian error surface [32], i.e.,
$$p(x, y) = \frac{1}{2\sigma^2}\, e^{-\frac{\sqrt{2}\,(|x| + |y|)}{\sigma}}, \tag{6}$$
the equivalent points lie on the surface
$$|y| = c - |x|.$$
The Gaussian error surface may give a better representation for a random sequence, whereas the Laplacian error surface represents rectilinear motion, e.g., a panning camera, more efficiently.
Keeping this in view, a criterion [21-23] for conditionally using the correction term δ(m, n) for converting the RBSAD<7:4> to the full resolution SAD is given below:
if (RBSAD<7:4> <= T) then
    SAD = RBSAD<7:4> + δ(m, n)
else
    SAD = RBSAD<7:4>
end if
where T is some threshold value. The maximum value of the error in using RBSAD<7:4> alone as compared with SAD is 2 × 15 × 256. For sequences with high spatiotemporal correlations, there are many candidate MVs (CMVs) with RBSAD<7:4> equal to zero in the search area, whereas in the case of large or complex motion sequences, the RBSAD<7:4> values of the CMVs differ from each other significantly. The more the RBSAD<7:4> values differ from each other, the lower the probability of error in finding the best match using RBSAD<7:4> alone.
Substituting T = 0 in the above criterion, we obtain another criterion [21-23], which we call the correction algorithm:
if (RBSAD<7:4> == 0) then
    SAD = δ(m, n)
else
    SAD = RBSAD<7:4>
end if
i.e., when RBSAD<7:4> is zero, simply calculate the correction term δ(m, n) and use it as the SAD in the minimum SAD evaluation. The notion is that for sequences with high spatiotemporal correlations, there are many CMVs with RBSAD<7:4> equal to zero in the search area, hence the application of corrections under this condition can lead to correct MVs.
In this typical case when RBSAD<7:4> is zero, ε_{i,j}(m, n) will have the sign of (F_C(i, j)<3:0> − F_R(i+m, j+n)<3:0>) in (4) and register elements will not be required for storing the sign bits.
3.2 CB algorithm
Another way of reducing the power and area of the ME architecture while keeping the accuracy is to implement the CB
ME algorithm with the RB architecture. In the ME algorithm, the search for the best CMV is normally executed in a raster manner, i.e., from left to right and top to bottom. Due to the CB motion in natural sequences, a large number of blocks in the current frame can be regarded as stationary or quasi-stationary with respect to the reference frame. As RBSAD<7:4>(m, n) only has a resolution of 16 (i.e., it takes on values 16N, for integer N), there can be a number of CMVs which generate the same metric due to high spatiotemporal correlations, especially in slow motion sequences, e.g., newscaster kind of sequences such as the Claire, Akiyo, Missa and Grandmom sequences. We resolve this conflict by choosing the MV according to the CB assumption: if two MVs evaluate to the same error metric, then the point closest to the origin should be chosen, provided RBSAD is less than a threshold (T2), i.e., when the motion is small. With a Gaussian error surface, it would be the point on the circle with the smallest radius, while with a Laplacian error surface, it would be the point on the square with the smallest side length. Hence we choose the MV with the lowest value of |x| + |y|, which corresponds to the assumption of a Laplacian error surface (6). Pseudocode for the search of the best CMV in a CB manner using RBSAD, which we call the CB algorithm [21-23], is shown below:
if (((|vX| + |vY|) < (|mvX| + |mvY|)) and (RBSAD<7:4> <= T2)) then
    mvX = vX
    mvY = vY
end if
where (mvX, mvY) and (vX, vY) are the current best MV and the MV of the candidate block under test, respectively, RBSAD<7:4> is the minimum RBSAD<7:4> value found so far, and T2 is some threshold value serving as the measure of small and large displacements (motion).
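The tie-breaking rule can be illustrated with a short sketch. It is not the authors' implementation: the function name `cb_select`, the tuple layout of the candidates, and the explicit equality test used to detect a tie with the current minimum are assumptions made here to give the pseudocode a complete surrounding loop.

```python
def cb_select(candidates, t2):
    """Pick the best MV from (vx, vy, rbsad) tuples supplied in raster order.

    A candidate replaces the current best when its RBSAD is strictly smaller,
    or when it ties with the current minimum, that minimum does not exceed t2,
    and the candidate is closer to the origin in the |x| + |y| sense.
    """
    best_mv, best_err = None, None
    for vx, vy, err in candidates:
        if best_err is None or err < best_err:
            best_mv, best_err = (vx, vy), err
        elif err == best_err and err <= t2:
            if abs(vx) + abs(vy) < abs(best_mv[0]) + abs(best_mv[1]):
                best_mv = (vx, vy)
    return best_mv, best_err


if __name__ == "__main__":
    raster = [(-1, -1, 32), (0, -1, 16), (1, -1, 16),
              (-1, 0, 16), (0, 0, 16), (1, 0, 48)]
    print(cb_select(raster, t2=64))   # ((0, 0), 16): the tie is resolved towards the center
    print(cb_select(raster, t2=0))    # ((0, -1), 16): plain raster order, first minimum wins
```

With the threshold set to zero the tie-break is effectively disabled for non-zero minima, which reproduces the left-to-right raster behaviour.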
In hardware, the RB CB ME algorithm will be realized with only one RBSAD<7:4> unit and some extra logic for the CB algorithm implementation.
Fig. 1 and Fig. 2 show the MV distributions of the full resolution ME (using SAD) and the RB ME (using RBSAD<7:4> only) in a raster fashion, respectively, where the horizontal plane in the figures represents the search area, P is the half length of the search window and the height of the surface is the percentage of MVs.
Fig. 1 MV distribution of full resolution search algorithm in Claire sequence
Fig. 2 MV distribution of RB full search left-to-right raster algorithm in Claire sequence
Fig. 3 shows the MV distribution of the CB searching pattern. The CB RB ME distribution has 81% correct MVs (as compared with the MVs obtained from full resolution motion estimation), whereas the left-to-right raster RB ME has 51% correct MVs.
Fig. 3 MV distribution of RB full search CB raster algorithm in Claire sequence
3.3 Spatiotemporal algorithm
Exploring spatiotemporal (spatial and temporal) correlations in a different way leads to another algorithm for reducing the computational complexity by conditionally using the SAD metric together with the temporal and spatial correlations that exist between the motion fields of the same block in successive frames and between neighbouring blocks in the same frame, for example, when there is constant motion without much deformation of the foreground object(s). The algorithm, referred to as the spatiotemporal algorithm [21-23], is given below:
if ((mvT == 0) and (mvS == 0) and (SADmin_s <= T3) and (SADmin_t <= T4)) then
    Apply corrections (the δ(m, n)-term) in range [m n] ∈ {−1, 0, 1} × {−1, 0, 1}
else
    Search with CB RBSAD<7:4> in the whole search area
end if
where mvT is the MV of the corresponding macroblock in the previous frame (i.e., at the same location) and SADmin_t (or RBSADmin_t) is the corresponding minimum SAD or minimum RBSAD<7:4>. Similarly, mvS is the MV of the macroblock to the left in the current frame (or of the macroblock above if no macroblock exists in the left position), with minimum SAD or minimum RBSAD<7:4> denoted SADmin_s or RBSADmin_s, and T3 and T4 are some threshold values. For more accuracy, we use only the full resolution SAD value in the spatiotemporal algorithm, whether the spatiotemporal condition is satisfied or not, i.e.,
if ((mvT == 0) and (mvS == 0) and (SADmin_s <= T3) and (SADmin_t <= T4)) then
    Apply corrections or search with full resolution SAD in range [m n] ∈ {−1, 0, 1} × {−1, 0, 1}
else
    Apply corrections or search with full resolution SAD in the whole search area
end if
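The effect of the spatiotemporal condition on the number of candidates can be sketched as follows. This is an illustration only: the function name `search_offsets`, the representation of MVs as tuples, and the half window length `p = 8` with displacements from −p to p − 1 (chosen so that the full search yields 256 candidates) are assumptions, not details taken from the algorithm description.

```python
def search_offsets(mv_t, mv_s, sad_min_s, sad_min_t, t3, t4, p=8):
    """Return the candidate displacements (m, n) to examine for one macroblock.

    mv_t / sad_min_t: MV and minimum error of the co-located macroblock in the
    previous frame; mv_s / sad_min_s: the same for the spatial neighbour.
    """
    still = (mv_t == (0, 0) and mv_s == (0, 0)
             and sad_min_s <= t3 and sad_min_t <= t4)
    # 3 x 3 window when the condition holds, whole search area otherwise.
    span = range(-1, 2) if still else range(-p, p)
    return [(m, n) for n in span for m in span]


if __name__ == "__main__":
    quiet = search_offsets((0, 0), (0, 0), 100, 100, t3=200, t4=200)
    moving = search_offsets((1, 0), (0, 0), 100, 100, t3=200, t4=200)
    print(len(quiet), len(moving))   # 9 256
```

Under these assumptions a macroblock that satisfies the condition needs 9 error evaluations instead of 256.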
The notion is that if a particular block (current macroblock) does not move in the previous frame and has a
minimum SAD value less than the threshold, and adjacent
macroblocks of the current macroblock also do not move
far in the current frame and have a minimum SAD value
less than the threshold, then it is less likely for the current
macroblock to move far in the current frame. As shown in
the pseudocode of the spatiotemporal algorithm, the spatiotemporal MVs can reduce complexity by reducing the
search area. According to the former spatiotemporal algorithm, CB ME with RBSAD<7:4> (i.e., RB CB algorithm)
is used whenever the spatiotemporal condition is not satisfied. The MVs obtained from the RB CB algorithm are slightly less accurate than the MVs obtained from the full resolution SAD ME (as described above) and may not accurately represent the spatiotemporal correlation among the MVs, thus resulting in an increased error. This condition will not persist for long, however, as a large motion gives a large SAD or RBSAD<7:4> (spatiotemporal algorithm). With the spatiotemporal algorithm and the RBSAD<7:4> metric, 67% accurate MVs are obtained when compared with the exhaustive full resolution ME algorithm, whereas 97% accurate MVs are obtained when this algorithm is used with the full resolution SAD. In addition, the accuracy of ME can be increased by employing object shape description [3], which separates the foreground portion of images from the static background.
3.4 Small or zero motion algorithm
Another way of reducing the computational complexity and power consumption of full search ME is to first evaluate the SAD value at the center of the search area (or RBSAD<7:4> in the case of the RB architecture only), i.e., at the current macroblock position. If this SAD value is zero, the search for the current macroblock can be terminated, as zero is the smallest value of the SAD metric. The benefit comes when large portions of the images in the sequence belong to the background or the motion is slow. In other words, a zero SAD value can be used as a condition for terminating the ME search.
4. Results
The quality of any ME algorithm is usually assessed by the PSNR [3], defined in this case as the fractional root mean square (RMS) error between the predicted and true frames, expressed on a dB scale. Here, the prediction of the current frame is obtained from the previous frame using the MVs in the previous frame, although in the standard MPEG compression format motion compensation is applied to the reconstructed frame. The reconstructed frame is obtained by adding the quantized block or frame error to the motion compensated block or frame. The quality of compression is evaluated by computing the PSNR between the original and reconstructed frames. The reconstructed frame also depends upon the bit rate allocated and the nature of the motion. Simple and smaller motion sequences have a higher PSNR for a given bit rate. In this work, the aim of evaluating the PSNR is to evaluate the performance of the ME algorithms, which can represent the performance of the compression encoder.
Line 1 in Fig. 4 shows the difference in PSNR between the full search (full resolution) and the simple RB version (upper 4-bit) for 60 common intermediate format (352 × 288) frames of the Claire sequence. Line 2 in Fig. 4 shows the difference in PSNR between the full search (full resolution) and the RB CB search algorithm, where the negative values of the PSNR difference are due to the use of the mean square error metric [3] in the PSNR expression instead of the sum of absolute difference. Line 3 in Fig. 4 shows the difference in PSNR between the full search (full resolution) ME and the spatiotemporal algorithm. Line 4 in Fig. 4 shows the difference in PSNR between the full search (full resolution) and the RB version (upper 4-bit), with corrections applied according to the correction algorithm [21]. Line 5 in Fig. 4 shows the difference in PSNR between the full search (full resolution) and the three-step search algorithm [3].
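As a point of reference for the PSNR differences discussed here, the following minimal sketch computes the PSNR between a true frame and a predicted frame. It assumes the standard definition for 8-bit data based on the mean square error with a peak value of 255; the function name `psnr` and the use of flat pixel lists are illustrative choices and the exact expression used for Fig. 4 may differ.

```python
import math


def psnr(true_frame, predicted_frame, peak=255):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(true_frame) != len(predicted_frame) or not true_frame:
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((t - p) ** 2
              for t, p in zip(true_frame, predicted_frame)) / len(true_frame)
    if mse == 0:
        return math.inf                 # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    truth = [100, 120, 140, 160]
    prediction = [101, 119, 141, 159]   # every pixel is off by one
    print(round(psnr(truth, prediction), 2))
```

A PSNR difference between two ME algorithms is then simply the difference of two such values computed against the same true frame.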
Fig. 4 Comparison of different algorithms on PSNR
The maximum PSNR difference has fallen from 2.86 dB to 0.35 dB and in general is very much smaller than that. Indeed, the average PSNR difference falls to around 0.03 dB. For other sequences it is generally the case that the 4-bit SAD (RBSAD<7:4>) is already perfectly adequate [21-23]. However, the application of corrections allows for the possibility of improvement in quality when it is necessary. When using these algorithms, it is important to stress that the correction, although it can be applied to the search for the MV of every current frame macroblock, is actually applied to only 3% to 20% of the total number of SAD comparisons made, leading to significant power savings.
The application of the correction algorithm to the Claire sequence results in an average of 23% corrections, whereas its application to a large (complex) motion sequence, e.g., the rotating city (RC) sequence, results in no corrections. Similarly, the application of the spatiotemporal algorithm to the Claire sequence results in approximately 70% of macroblocks being searched in a search window of size 3 × 3, and in zero percent of macroblocks in the case of the RC sequence. The small motion algorithm with the full resolution SAD yields 8% of current frame macroblocks for which the search area computation is not carried out, i.e., ME for these current frame macroblocks is not done, whereas with the small motion algorithm along with the RBSAD<7:4> metric, search area computations are not carried out for 34% of macroblocks on average. The application of the small motion algorithm to the RC sequence results in zero percent of macroblocks for which the ME search is skipped.
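The center-first test of the small or zero motion algorithm (Section 3.4) can be sketched as below. The function name `zero_motion_search`, the `ref_at` callable that returns the candidate block at a displacement, and the returned count of metric evaluations are hypothetical and serve only to show where the saving comes from.

```python
def zero_motion_search(cur, ref_at, offsets, metric):
    """Full search with a center-first test.

    cur: current block; ref_at(m, n): candidate block at displacement (m, n);
    offsets: displacements to search; metric: SAD-like function of two blocks.
    Returns (best_mv, best_error, number_of_metric_evaluations).
    """
    center = metric(cur, ref_at(0, 0))
    if center == 0:
        return (0, 0), 0, 1               # zero is the smallest possible error
    best_mv, best_err, evals = (0, 0), center, 1
    for m, n in offsets:
        if (m, n) == (0, 0):
            continue                      # the center was already evaluated
        err = metric(cur, ref_at(m, n))
        evals += 1
        if err < best_err:
            best_mv, best_err = (m, n), err
    return best_mv, best_err, evals


if __name__ == "__main__":
    def abs_sum(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    blocks = {(0, 0): [1, 2, 3, 4], (1, 0): [9, 9, 9, 9]}
    result = zero_motion_search([1, 2, 3, 4], lambda m, n: blocks[(m, n)],
                                [(0, 0), (1, 0)], abs_sum)
    print(result)   # ((0, 0), 0, 1): the search stops after one evaluation
```

The percentages quoted above correspond to the fraction of macroblocks for which the first branch is taken.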
5. ME architecture (SAD and MV computation)
5.1 Full resolution parallel architecture
For comparison, we have based our architecture on that of the reference hardware description issued by ISO/IEC in 2002, the details of which may be found in [4,33]. Briefly, it has a single instruction multiple data (SIMD) stream architecture that processes the video data in parallel to estimate the MV for each block. The data regularity present in the FSME [3,11-13,34] makes it suitable for parallel VLSI architectures. Fig. 5 shows a block diagram of the architecture of the full resolution SAD ME. It consists of a search window memory (parallel memory) with 31 random access memory (RAM) blocks, each block with 31 locations × 8 bits; a current block RAM or circulatory shift registers of 256 locations (with 1 byte per location) to hold the current block data and make efficient reuse of the current macroblock data; an address generator (finite state machine) to generate the addresses for reading the reference and current block data from the external memory, writing them to the on-chip memories and then reading the data back from the on-chip memories; 16 processing elements (PEs) to compute the SAD values; bus multiplexers for distributing data to the PEs; and a compare unit and corresponding coordinate counter to find the block with the minimum SAD value and its coordinates (MVs) among all the candidates in a pipelined manner. The search window contains (31 − (16 − 1)) × (31 − (16 − 1)) = 256 candidate blocks (macroblocks) or CMVs, i.e., 16 rows of CMVs with 16 CMVs per row. One PE is used per column of CMVs. Each PE consists of an 8-bit subtractor and a 16-bit accumulator in the case of full resolution ME. Due to the high data regularity, the use of wide port memory reduces memory accesses, and the corresponding address generator is simple and consumes less area. The address generator generates row addresses for the parallel memory and the whole row of the search area is output. Starting from the first row of CMVs, a row of the search area is output for 16 clock cycles, in which 16 partial SADs corresponding to 16 CMVs are computed in parallel by the 16 PEs. Each partial SAD (composed of the sum of 16 absolute differences) is computed sequentially by each PE.
Fig. 5 Full resolution SAD full search ME architecture
Fig. 6 shows the bus structure (multiplexers) for distributing full resolution data to the corresponding PEs. In a similar manner, the next 15 rows are accessed to compute the remaining partial SADs for the computation of the full SADs of the first row of CMVs in a total of 16 × 16 clock cycles. Similarly, starting from the second row of CMVs, 16 rows of the search area are accessed to compute the SADs of the second row of CMVs. The comparison of SADs is pipelined with the computation of the SADs of the next row of CMVs, i.e., while the second row of CMVs is being processed, the completed SADs of the first row of CMVs are compared by the SAD compare unit shown in Fig. 5. The above process is repeated for the remaining rows of CMVs, i.e., 256 SADs are computed and the best match (i.e., the location of the minimum SAD) of the current frame macroblock is found in 16 × 256 clock cycles. Thus a quarter common intermediate format (QCIF) sized video frame (176 pixels × 144 pixels) takes approximately (176/16 × 144/16) × 16 × 256 = 405 504 clock cycles (excluding write cycles for block memories) for the ME between two frames; hence real time operation, i.e., 25 frames per second, requires 405 504 × 25 = 10 137 600 clock cycles per second.
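The clock cycle budget above follows from simple arithmetic, reproduced in the sketch below. The function name `me_clock_budget` and its parameters are illustrative; the figure of 16 × 256 clock cycles per macroblock is the one stated in the text and write cycles for the block memories are ignored, as in the text.

```python
def me_clock_budget(width, height, fps, mb_size=16, cycles_per_mb=16 * 256):
    """Return (macroblocks per frame, cycles per frame, cycles per second)."""
    macroblocks = (width // mb_size) * (height // mb_size)
    per_frame = macroblocks * cycles_per_mb
    return macroblocks, per_frame, per_frame * fps


if __name__ == "__main__":
    # QCIF frame at 25 frames per second.
    print(me_clock_budget(176, 144, 25))   # (99, 405504, 10137600)
```

Calling the same function with the CIF dimensions 352 × 288 gives the corresponding budget for the larger format under the same per-macroblock cost.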
Fig. 6 Bus architecture
Fig. 7 shows the PE architecture, consisting of an 8-bit subtractor, a 16-bit accumulator and 2's complement logic for implementing the absolute value operation, i.e., whenever the result of the subtractor is negative, the borrow bit of the subtractor causes the negative result to be converted into a positive number by applying 2's complement. All four of the above-mentioned algorithms (Section 3) are implemented using this parallel architecture.
Fig. 7 PE for serial accumulation (blocks: subtractor, sign bit, 2's complement, MUX, accumulator, SAD register)
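The behaviour of such a PE can be modelled at the bit level. The following is a behavioural sketch under assumptions, not a description of the synthesized circuit: the function names, the way the borrow bit is extracted from a result one bit wider than the operands, and the wrap-around accumulator are illustrative.

```python
def pe_abs_diff(a, b, width=8):
    """Absolute difference of two unsigned width-bit values, PE style."""
    mask = (1 << width) - 1
    raw = (a - b) & ((1 << (width + 1)) - 1)   # subtractor output with borrow bit
    borrow = raw >> width                      # 1 when a < b
    low = raw & mask
    if borrow:
        low = (~low + 1) & mask                # 2's complement restores the magnitude
    return low


def pe_accumulate(cur_pixels, ref_pixels, width=8, acc_bits=16):
    """Serially accumulate absolute differences in an acc_bits-wide register."""
    acc = 0
    for c, r in zip(cur_pixels, ref_pixels):
        acc = (acc + pe_abs_diff(c, r, width)) & ((1 << acc_bits) - 1)
    return acc


if __name__ == "__main__":
    # Worst case for one macroblock: 256 differences of 255 fit in 16 bits.
    print(pe_accumulate([255] * 256, [0] * 256))                       # 65280
    # Reduced bit case: 256 differences of 15 fit in 12 bits.
    print(pe_accumulate([15] * 256, [0] * 256, width=4, acc_bits=12))  # 3840
```

The two worst cases show why a 16-bit accumulator suffices for 8-bit pixels and a 12-bit accumulator for 4-bit pixels.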
5.1.1 Power efficient RB parallel architecture with CB unit
The RB version of the architecture in Fig. 5 contains a search window memory RAM with 31 blocks (each block of size 31 × 4 bits), a current block memory RAM with 16 blocks (each block of size 16 × 4 bits), 4-bit subtractors, 12-bit accumulators, and 4-bit and 12-bit registers for pipelining. Similarly, the reduced bus architecture and reduced comparator unit are implemented.
The RB CB algorithm is implemented using only the RB parallel architecture with additional logic for the CB algorithm implementation. The small motion algorithm is also implemented using this architecture with the addition of logic for zero detection at the center of the search area.
5.1.2 Power efficient RB parallel architectures with correction unit
Fig. 8 shows two similar RB parallel architectures for the implementation of the correction algorithm. The lower four bit RB unit at the right is activated only when the algorithmic condition is satisfied, leading to a saving in power while keeping the same quality. Fig. 9 shows the implementation of the correction algorithm with comparatively higher throughput, i.e., utilizing the lower four bit RB unit with the addition of dual port memories, the upper four bit unit also with dual port memories, two more address generators so that both units can access the upper and lower 4-bit data independently of each other, and bus multiplexers for distributing data to both units.
Fig. 8 Architecture for correction algorithm
Fig. 9 Architecture for correction algorithm with higher throughput
In a similar manner, the correction unit in Fig. 8 can be used to increase the throughput by simultaneously computing the RBSAD<7:4> of two consecutive rows of CMVs, reusing the data already fetched from the upper four bit memory: rows 1 to 16 of the search area are used to compute the RBSAD<7:4> of the first row of CMVs by the first unit, whereas rows 2 to 17 are required for the computation of the RBSAD<7:4> of the second row of CMVs, which can be carried out by the other RB unit, i.e., the correction unit. The additional cost of achieving this is an additional comparator and logic for the RB CB algorithm. The lower four bit memory is not required, as corrections are not computed.
Another advantage of using two similar RB architectures as shown in Fig. 8 is that, instead of loading the lower four bit data into the second RB architecture, we can load the upper four bit data of the search area of the next macroblock in the current frame, so as to increase the throughput by searching for two current frame macroblocks simultaneously with the RB CB algorithm.
The spatiotemporal algorithm is implemented by using the architecture in Fig. 8 with the addition of 2's complement logic to invert the sign of the difference of the lower four bit pixels whenever the sign of the corresponding difference of the upper four bit pixels is negative, according to (4). The implementation of the spatiotemporal algorithm also requires some logic for the spatiotemporal logical condition and storage elements for the MVs and SADs in the architecture given in Fig. 10. These elements vary in number from a couple of elements for storing spatial MVs to a number equal to the number of macroblocks per image for storing temporal MVs. Similarly, storage elements are required for storing the signs of the differences between the pixels corresponding to the upper four bits, where each sign requires 1 bit of storage according to (4).
5.2 Parallel architecture with parallel PE (full resolution and RB)
In order to increase the area and power efficiency of the correction algorithm, the above parallel architectures in Fig. 5, Fig. 8 and Fig. 9 are modified by replacing the sequential PEs with a parallel PE which consists of a parallel adder tree. Fig. 11 shows the architecture of an adder tree for the RB parallel PE [3,5-10], whereas a full resolution parallel adder tree would consist of eight 8-bit adders in the first level, four 9-bit adders in the second level, two 10-bit adders in the third level, an 11-bit adder in the fourth level and a 16-bit accumulator.
The throughput remains the same, as one SAD is now computed in 16 clock cycles whereas the CMVs are computed sequentially. The address generation unit also remains the same, with a minor modification in the address generation pattern. The main benefit of this architecture is that converting the non-zero RBSAD<7:4> value of a CMV to the full resolution SAD value requires 256 bits of memory for saving the signs of the RBSAD<7:4> difference terms according to (4), whereas the parallel architecture (Fig. 8, where CMVs are parallelized) requires 16 × 256 bits of memory for saving the signs for all the CMVs in a row. Another advantage is that for implementing the early jump out mechanism (i.e., the SAD calculation is terminated whenever the intermediate SAD value exceeds the minimum SAD value computed so far [34], which results in a saving of computations and power), only one comparator is required, as only one SAD corresponding to a CMV is being computed at a time. The drawback is that the local memory accesses increase; this can be reduced by using a 16 × 1 bit circulatory shift register for enabling and disabling memory blocks. The same applies to the RB version of this architecture. Fig. 10 shows a parallel architecture with a parallel PE. The address generation sequence for this architecture is as follows. Starting from the first row of the search area, 16 rows are accessed to compute the SAD of the first CMV of the first row of CMVs in 16 clock cycles from columns 1 to 16 of the search area memory.
Fig. 10 Parallel ME architecture with parallel PE
Fig. 11 An adder tree [3] for parallel PE for simultaneously calculating the sum of 16 RB absolute differences (pipeline registers are not shown)
The bus architecture remains the same except that now the data are distributed to the parallel PE. Then, starting again from the first row of the search area, 16 rows are accessed to compute the SAD of the next CMV in 16 clock cycles from columns 2 to 17. This process is repeated until all 256 CMVs are computed and compared in 16 × 256 clock cycles. As before, the RB version of the architecture in Fig. 10 contains a search window memory RAM with 31 blocks, each block with 31 locations × 4 bits, a current block memory RAM with 16 blocks, each block with 16 locations × 4 bits, 4-bit subtractors, a 12-bit accumulator, an RB parallel PE, and 4-bit and 12-bit registers for pipelining. Similarly, the reduced bus architecture and reduced comparator (compare unit) are utilized.
Fig. 12 shows a comparatively more power efficient way of computing the SAD using two similar RB units, one for computing the RBSAD<7:4> and the other for computing (m, n). Initially, the RBSAD<7:4>, corresponding to the upper four bits, is computed by the upper four bit unit, with the signs of the difference terms being stored in the corresponding memory. When RBSAD<7:4> exceeds the current minimum SAD value by 15 × 256 (a factor compensating for the minimum value of the correction), the computation for the current CMV terminates; otherwise the lower four bit unit is enabled to compute (m, n) for the full value of the SAD, while the upper four bit unit computes the RBSAD<7:4> of the next CMV in the pipeline. The same architecture can be used for the correction algorithm. Whenever the threshold condition is true, the lower four bit unit is enabled and the calculation of (m, n) begins in the pipeline with the upper four bit unit, while the <7:4> unit calculates the RBSAD<7:4> for the next CMV. In the same way as described in Sections 5.1.1 and 5.1.2, the correction unit here can also be utilized for high throughput and/or for power saving.
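The two-unit scheme can be illustrated in software. The sketch below is a hedged behavioural model, not the authors' hardware: it assumes that the upper unit sums the absolute differences of bits <7:4> and stores one sign per pixel, that the lower unit forms a signed correction from bits <3:0>, and that the RBSAD<7:4> is compared with its binary weight of 16. The names `upper_stage`, `lower_stage` and `search` are illustrative.

```python
import random


def upper_stage(cur, ref):
    """Upper four bit unit: RBSAD<7:4> plus the stored sign of each difference."""
    rbsad, signs = 0, []
    for a, b in zip(cur, ref):
        dh = (a >> 4) - (b >> 4)
        rbsad += abs(dh)
        signs.append((dh > 0) - (dh < 0))  # +1, -1 or 0
    return rbsad, signs


def lower_stage(cur, ref, signs):
    """Lower four bit unit: correction that restores the full resolution SAD."""
    corr = 0
    for a, b, s in zip(cur, ref, signs):
        dl = (a & 0xF) - (b & 0xF)
        corr += s * dl if s else abs(dl)   # each term is never below -15
    return corr


def search(cur, candidates):
    """Full search with early termination on the upper four bits."""
    n = len(cur)
    best_sad, best_idx, skipped = float("inf"), None, 0
    for idx, ref in enumerate(candidates):
        rbsad, signs = upper_stage(cur, ref)
        if 16 * rbsad - 15 * n > best_sad:  # lower bound on the SAD already too large
            skipped += 1
            continue
        sad = 16 * rbsad + lower_stage(cur, ref, signs)
        if sad < best_sad:
            best_sad, best_idx = sad, idx
    return best_idx, best_sad, skipped


random.seed(0)
cur = [random.randrange(256) for _ in range(256)]
candidates = [[random.randrange(256) for _ in range(256)] for _ in range(32)]
candidates[7] = [min(255, p + 1) for p in cur]  # a near-perfect match
idx, sad, skipped = search(cur, candidates)
print(idx, sad, skipped)
```

Because each correction term is at least -15, a candidate whose weighted RBSAD<7:4> exceeds the current minimum by more than 15 × 256 cannot win, so skipping the lower unit for it does not change the search result.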
The implementations of the RB CB algorithm and the spatiotemporal algorithms are the same as described in Section 5.1.2.
Fig. 12 Power efficient parallel ME architecture with parallel PE and SAD early termination
5.3 Parallel architecture (RB) for search area correlation
Comparing the RB parallel architecture with parallel PE against the full resolution parallel architecture with parallel PE, we can assume for the moment that area is saved due to the reduced memory, logic and interconnection. This saved area can be utilized so that two current frame macroblocks are searched simultaneously, as shown in Fig. 13. The search window memory now includes 47 RAM blocks, each RAM block with 31 locations × 4 bits. The search area width of 47 includes all the locations in the two horizontally adjacent search areas. Because the adjacent search areas share common areas, data reuse can be exploited effectively, improving the throughput and saving power. The drawback is that this architecture may not efficiently utilize the spatiotemporal algorithm for power reduction and speed improvement, as two current frame macroblocks are searched simultaneously. The same holds for the implementation of the small motion algorithm with this architecture. Only the RB CB algorithm is efficiently implemented with this architecture.
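The width of 47 and the resulting data reuse can be checked with simple arithmetic. The sketch below assumes, as an illustration, that the two search areas are each 31 × 31 pixels and are offset horizontally by one macroblock width (16 pixels); the variable names are not from the paper.

```python
SEARCH_SIZE = 31   # search area width and height in pixels
MB_SIZE = 16       # macroblock width, assumed horizontal offset between the two areas

shared_width = SEARCH_SIZE + MB_SIZE              # union of the two search areas
overlap = 2 * SEARCH_SIZE - shared_width          # columns common to both areas
separate_loads = 2 * SEARCH_SIZE * SEARCH_SIZE    # pixels loaded without reuse
shared_loads = shared_width * SEARCH_SIZE         # pixels loaded with reuse
saving = 1 - shared_loads / separate_loads

print(shared_width, overlap, f"{saving:.1%}")
```

Under these assumptions the 15 shared columns mean that roughly a quarter of the search window loads are avoided when two macroblocks are searched together.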
Fig. 13 RB parallel ME architecture for two current macroblocks
5.4 Parallel architecture with multiple parallel PEs (full resolution and RB) for the parallelisation of search area and high throughput and/or less power consumption
The parallel architecture and the parallel architecture with parallel PEs are combined for increased throughput, i.e., computing 256 SADs of 256 CMVs in 256 clock cycles. The RB version of this architecture results in reduced area and power consumption. The number of arithmetic and logical units, i.e., the PEs, has increased, but the local memory accesses have been reduced. All four algorithms are implemented using this architecture.
In [3], the throughput of the ME design is increased by utilizing two-dimensional systolic arrays [3,35] at a cost of more area but with fewer memory accesses. Other hardware approaches mentioned in [3] for pipelining and parallelisation involve the use of a dual port memory for storing the search area in such a way that one port broadcasts the left half of the search area (16 × 31), 1 byte per cycle in a pipelined way to the 16 PEs, while the other port broadcasts the right half of the search area (15 × 31), 1 byte per cycle [3], maintaining the parallelization. The current block data are passed to the PEs via delay elements. The SADs of the CMVs are available after 256 clock cycles plus the pipeline delay, with a difference of one clock cycle between them. However, the address generation and control unit in this case is more complex.
5.5 Pipelined RISC processor
In order to see the effect of the full resolution and RB parallel systolic VLSI architectures on the computational complexity of the serial ME process and its power consumption, a 4-stage pipelined RISC processor [36] based on a simple instruction set architecture consisting of 31 instructions (arithmetic, logical, load and store) is implemented along with the corresponding compiler and simulator. All the above mentioned architectures are then connected, one at a time, to the serial RISC processor (through one address and data bus, i.e., loosely coupled configurations) to investigate the reduction in the computational complexity and power consumption [3, 5–10]. The architectures together with the serial RISC processor form a processor-coprocessor system. In addition, a writing address generator for fetching data from the external memory and writing it to the coprocessor memories, and a direct memory access (DMA) unit for storing the incoming data to the external memory, are also attached to the serial processor. The purpose of the serial processor is to write the macroblock coordinates of the reference and current frames to the writing address generator, write the generated results to memories, run the sequential part of the encoder, communicate with the DMA through interrupts, etc. In addition, the use of serial and vector data caches in serial processor architectures can significantly reduce the data fetching from the external memory [3].
Fig. 14 shows the interconnection of the architectures with the RISC processor. The DMA [36] controller is utilized to increase the RISC processor efficiency. The clock frequency of the system should be high enough to achieve real time processing. Two address and data buses (not shown in Fig. 14) can further improve the performance. One address and data bus is dedicated to the DMA or the other processor for storing the incoming data into the external memories, whereas the main processor or the coprocessor reads data from the other external memories through the other bus. In this case, for real time ME, four external memories are required, three for storing the upcoming frames for ME and one for storing the reconstructed frame for motion compensation (shown in Fig. 14 as the video object plane (VOP) buffer RAM). Here by VOP we mean a frame. After two memories are written with two consecutive frames, the ME process starts while the third memory is written with the third frame.
Fig. 14 Interconnection of ME architecture with RISC processor
5.6 Parallel architecture with multiple parallel PEs (full resolution and RB) for the parallelisation of current frame macroblocks for high throughput and/or power saving
Similar to the parallel architecture with multiple parallel PEs (Section 5.4), where the search area and SAD parallelisation is increased, the parallelisation of the current frame macroblocks can also be increased for high throughput, e.g., by simultaneously executing the search of all the current frame macroblocks in a row. For this purpose, the search area parallel memory consists of 32 rows and 176 columns (for a QCIF sequence), i.e., 176 memory blocks (RAM), each memory block consisting of 32 locations × 8 bits. Similarly, the memory for a row of current frame macroblocks consists of 176 RAM blocks, each block consisting of 16 locations × 8 bits. In this way, the data common to the horizontally and vertically adjacent search areas can be reused efficiently, which reduces the external and internal memory accesses and increases the throughput. The address generation sequence for accessing the data from the on-chip search area memory and current block memory is the same as that of the parallel architecture with parallel PE.
In this case, a row of current frame macroblocks is searched in 16 × 256 clock cycles. Hence 25 QCIF frames are searched in 25 × (144/16) × 16 × 256 = 921 600 clock cycles, excluding the clock cycles for writing data to the search area and current macroblock memory.
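The cycle count and memory sizes above can be reproduced with a few lines of arithmetic. This is only a sketch of the bookkeeping; the constant names are illustrative, and the figures (QCIF frame size, 256 CMVs, 16 cycles per SAD, memory depths) are taken from the text.

```python
FRAMES = 25
QCIF_WIDTH, QCIF_HEIGHT = 176, 144
MB_SIZE = 16
CMVS_PER_SEARCH_AREA = 256
CYCLES_PER_SAD = 16          # one SAD every 16 clock cycles with a parallel PE

mb_rows = QCIF_HEIGHT // MB_SIZE                      # rows of macroblocks per frame
parallel_pes = QCIF_WIDTH // MB_SIZE                  # one parallel PE per macroblock in a row
cycles_per_mb_row = CYCLES_PER_SAD * CMVS_PER_SEARCH_AREA
total_cycles = FRAMES * mb_rows * cycles_per_mb_row

search_memory_bits = QCIF_WIDTH * 32 * 8              # 176 blocks x 32 locations x 8 bits
current_memory_bits = QCIF_WIDTH * 16 * 8             # 176 blocks x 16 locations x 8 bits

print(mb_rows, parallel_pes, cycles_per_mb_row, total_cycles)
print(search_memory_bits, current_memory_bits)
```

At 25 frames per second this corresponds to about 0.92 million search cycles per second, before the memory writing and processor cycles are added.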
When the SADs of the first row of CMVs of all the search areas are computed, the first line of the search area memory is never required again and can be replaced with the 17th line of the search areas of the second row of current frame macroblocks (as the first 15 lines of the search areas of the second row of current frame macroblocks correspond to the last 15 lines of the search areas of the first row of current frame macroblocks, which are already in the search area memory), while the SAD computation of the other rows of CMVs corresponding to the first row of current frame macroblocks is in progress. The same is repeated for the other rows of CMVs of the search areas of the first row of current frame macroblocks. In this case, the number of parallel PEs utilized is 11, as there are 11 macroblocks per row of the image. Similarly, 11 comparators are utilized.
As before, this architecture is also attached to the serial RISC processor. Only the correction algorithms and the RB CB algorithm can be efficiently implemented using this architecture. The implementation of the spatiotemporal and small motion algorithms can increase the resource usage due to the dependency of the algorithms on different search areas.
6. VLSI implementation
Register transfer level (RTL) implementations of the above mentioned serial and parallel ME architectures, corresponding to the full resolution and RB, were synthesized, placed and routed using Xilinx Synthesis Tools (XST version 9.1i) [37], targeting a Virtex 4 FPGA, XC4VLX60. In order to compare the architectures, the XST synthesis options are set such that only FPGA look up tables (LUTs) and registers are utilized. A simple compiler/assembler for compiling the program for the serial RISC processor is implemented. The size of the processor's register file is set such that the compiler can map all the program variables to the processor's register file for reduced complexity. When the compiler instead maps all the program variables to on-chip memory and uses the register file for storage of temporary variables, the computational complexity increases.
Tables 1–4 show the comparison of full resolution and RB architectures in terms of area usage, speed and power consumption. Since FPGAs are being utilized, the packing and routing of the architecture may utilize LUTs and registers differently depending upon the timing constraints, the complexity of the architecture and options set in the XST (e.g., speed or area optimisations). Hence the area and power comparison of the architectures (full resolution and RB) may result differently as compared with application specific integrated circuits (ASIC) implementations [38].
Table 1 Computational complexity at the required frequency (values in MIPS; the required frequency in MHz has the same numerical value)
VLSI architectures | Full resolution /MIPS | Correction algorithm /MIPS | RB CB algorithm /MIPS | Spatiotemporal algorithm (full resolution) /MIPS | Full resolution small motion algorithm /MIPS | RB small motion algorithm /MIPS
Serial RISC processor | 3 355 | – | – | 1 320.9 | 3 085 | –
Parallel architecture | 19.17 | 19.17 | 19.17 | 10.8 | 18.705 | 17.18
Parallel architecture with parallel PEs | 19.17 | 19.17 | 19.17 | 10.8 | 18.705 | 17.18
Parallel architecture with search area correlation | – | – | 11.25 | – | – | –
Parallel architecture with multiple parallel PEs for search area and SAD | 9.66 | 9.66 | 9.66 | 6.27 | 9.96 | 10.94
Parallel architecture with multiple parallel PEs for current frame | 4.72 | 4.72 | 4.72 | – | – | –
Table 2 Area consumption in terms of total LUTs and registers (total number of LUTs + registers)
VLSI architectures | Full resolution | Correction algorithm | RB CB algorithm | Spatiotemporal algorithm (full resolution) | Full resolution small motion algorithm | RB small motion algorithm
Serial RISC processor | 2 190+321=2 511 | – | – | 2 190+321=2 511 | 2 190+321=2 511 | –
Parallel architecture | 5 855+1 481=7 336 | 5 855+1 481=7 336 | 4 298+1 116=5 414 | 5 855+1 481=7 336 | 5 855+1 481=7 336 | 4 298+1 116=5 414
Parallel architecture with parallel PEs | 5 476+1 295=6 771 | 5 476+1 295=6 771 | 3 980+893=4 873 | 5 476+1 295=6 771 | 5 476+1 295=6 771 | 3 980+893=4 873
Parallel architecture with search area correlation | – | – | 5 382+1 298=6 680 | – | – | –
Parallel architecture with multiple parallel PEs for search area and SAD | 11 971+3 934=15 905 | 11 971+3 934=15 905 | 7 589+2 510=10 099 | 11 971+3 934=15 905 | 11 971+3 934=15 905 | 7 589+2 510=10 099
Parallel architecture with multiple parallel PEs for current frame | 13 130+2 418=15 548 | 13 130+2 418=15 548 | 8 039+1 535=9 574 | 13 130+2 418=15 548 | 13 130+2 418=15 548 | 8 039+1 535=9 574
Table 3 Dynamic power consumption at the required frequency (mW)
VLSI architectures | Full resolution | Correction algorithm | RB CB algorithm | Spatiotemporal algorithm (full resolution) | Full resolution small motion algorithm | RB small motion algorithm
Serial RISC processor | 5 864 | – | – | 2 309 | 5 392 | –
Parallel architecture | 40 | 34 | 32 | 22 | 39 | 29
Parallel architecture with parallel PEs | 36 | 32 | 30 | 20 | 35 | 27
Parallel architecture with search area correlation | – | – | 20 | – | – | –
Parallel architecture with multiple parallel PEs for search area and SAD | 22 | 17 | 16 | 14 | 22 | 18
Parallel architecture with multiple parallel PEs for current frame | 12.89 | 10.64 | 9.89 | – | – | –
Table 4 Dynamic power consumption of individual design components in FPGA at 63 MHz obtained from Xilinx Power Analyser (mW)
VLSI blocks | Clocks | Inputs | Logic | Outputs | Signals | Total dynamic power
Serial processor | 11.99 | 11.16 | 29.05 | 30.31 | 27.59 | 110.12
Full resolution parallel architecture | 44.65 | 8.82 | 32.86 | 10.98 | 32.98 | 130.30
Full resolution parallel architecture with parallel PE | 35.02 | 8.34 | 35.28 | 9.95 | 29.68 | 118.28
RB parallel architecture | 31.82 | 5.54 | 27.16 | 12.70 | 27.66 | 104.90
RB parallel architecture with parallel PE | 26.98 | 6.02 | 28.52 | 12.47 | 25.95 | 99.96
Full resolution parallel architecture with multiple parallel PEs for search area and SAD | 67.91 | 9.55 | 29.55 | 9.24 | 25.85 | 142.12
RB parallel architecture with search area correlation | 36.86 | 5.77 | 32.20 | 10.88 | 29.43 | 115.16
RB parallel architecture with multiple parallel PEs for search area | 47.47 | 5.59 | 21.09 | 7.91 | 19.50 | 101.58
Full resolution parallel architecture with multiple parallel PEs for current frame | 67.53 | 7.93 | 53.75 | 11.51 | 31.34 | 172.09
RB parallel architecture with multiple parallel PEs for current frame | 47.90 | 5.56 | 38.50 | 11.86 | 28.22 | 132.07
Dynamic power [3, 11–13] of the designs is estimated using Xilinx Power Analyser (XPower tools). The dynamic power is given by
$$P_{\mathrm{dynamic}} = C V^{2} f E \qquad (7)$$
where C represents the load capacitance, V the applied voltage, f the operating frequency, E the average number of data transitions per clock cycle of the design component, and fE the total number of transitions per second [37].
Area and power consumption values in Tables 2–4 show that the parallel RB ME architectures have reduced area and power consumption as compared with the parallel full resolution ME designs.
Table 1 shows the dynamic instruction count in million instructions per second (MIPS) associated with the ME process for 25 QCIF frames from the Claire sequence corresponding to different algorithms and architectures (where a missing entry indicates that the VLSI architecture implementation of the corresponding algorithm is not very significant). The instruction count is estimated from a simulator (instruction profiler) based on the above mentioned RISC processor. The clock cycles consumed by the writing address generator (i.e., four clock cycles for a single memory read from the external asynchronous memory and writing the data to the coprocessor memories) are estimated separately and added to the instruction count. Similarly, the clock cycles consumed by the ME architectures are added to estimate the frequency required. The frequency required to execute the ME process in real time, i.e., at a frame rate of 25 frames per second, is also shown. The dynamic instruction count of the ME process corresponding to the spatiotemporal algorithm with parallel architecture (Section 5.1) is less than the dynamic instruction count corresponding to the correction algorithm with parallel architecture. This reduction is due to the nature of the sequence [7–10]. For example, in case of a small motion sequence (e.g., the Claire sequence), the probability that the spatiotemporal condition of the spatiotemporal algorithm is true increases, which increases the search area terminations and hence decreases the instruction count. The same holds for the small motion algorithm with the Claire sequence. In case of the RC sequence (large rotary motion), however, the spatiotemporal condition of the spatiotemporal algorithm is never satisfied (from the results), hence the instruction count with the RC sequence would be equal to that of the full search ME with parallel architecture or the correction algorithm with parallel architecture. The same holds for the small motion algorithm with the RC sequence. Hence the frequencies mentioned in Table 1 corresponding to these algorithms, except for the RB CB algorithm, cannot be considered for the processing of all sequences. In other words, for the correction, spatiotemporal and small motion algorithms, the frequency of the corresponding VLSI architectures will be kept equal to the frequency of the VLSI architectures for full search ME or ME with the correction algorithm, and the architecture will be deactivated if the computations are completed earlier in order to save power.
Table 2 shows the area consumed by the VLSI architectures implemented on FPGA (Virtex 4, XC4VLX60), in terms of the total number of LUTs and total slice registers. Table 3 shows the dynamic power consumption of the algorithms and designs at the corresponding frequencies mentioned in Table 1. Initially, the dynamic power consumption of all the architectures, full resolution and their RB versions, corresponding to full search ME with a small motion (Claire) sequence and a large motion (rotating city) sequence is estimated at a frequency of 63 MHz (the maximum design operating frequency), as shown in Table 4. Table 4 shows the dynamic power consumption with the small motion (Claire) sequence; the dynamic power consumption corresponding to the large motion sequence is slightly higher in these processor-coprocessor configurations. The dynamic power estimate in FPGA is composed of the power consumption due to clock, logic, interconnects (signals), and I/O. The obtained power values are then scaled by the corresponding required frequencies mentioned in Table 1 according to (7), as shown in Table 3.
After synthesis, the approximate maximum operating frequency of all the designs in the Virtex 4 FPGA is 63 MHz. If the implementation is carried out in ASIC, the timings will improve due to the reduced routing and logic implementation, depending upon the technology utilized [38]. The RB parallel datapath normally improves the frequency as compared with the full resolution parallel datapath, but in this case the parallel datapaths are attached to the serial RISC processor, which has the maximum delay in the architecture due to the arithmetic logic unit (ALU); hence all the VLSI architectures (processor-coprocessor) have approximately the same maximum frequency. The frequency of operation can be increased by introducing pipelining. For example, in the serial pipelined RISC processor implementation, pipelining the path with the maximum delay (e.g., the ALU with multiplier) can increase the frequency of operation. The pipeline stalls only when there is a dependency among the instructions. Similarly, when the synthesis is done in FPGAs with the latest technology, the frequency of operation will improve further. A higher frequency may be beneficial in achieving real time processing of larger dimension frames.
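The scaling from Table 4 to Table 3 can be sketched as follows. The example assumes that C, V and E in (7) stay fixed, so that the dynamic power is proportional to the frequency; the function name `scale_power` and the pairing of each 63 MHz power value with a required frequency are illustrative readings of Tables 1 and 4, not code from the paper.

```python
F_MEASURED_MHZ = 63.0  # frequency at which the Table 4 values were obtained


def scale_power(power_at_63_mw, required_mhz):
    """Scale dynamic power linearly with frequency, as (7) implies for fixed C, V and E."""
    return power_at_63_mw * required_mhz / F_MEASURED_MHZ


# name: (total dynamic power at 63 MHz in mW, required frequency in MHz)
designs = {
    "Serial RISC processor": (110.12, 3355.0),
    "Full resolution parallel architecture": (130.30, 19.17),
    "Full resolution parallel architecture with parallel PE": (118.28, 19.17),
    "RB parallel architecture": (104.90, 19.17),
    "RB parallel architecture with parallel PE": (99.96, 19.17),
}

scaled = {name: scale_power(p63, f) for name, (p63, f) in designs.items()}
for name, power in scaled.items():
    print(f"{name}: {power:.1f} mW")
```

Rounded to the nearest milliwatt, these values agree with the corresponding full resolution and RB CB entries reported in Table 3.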
As can be seen from Table 2, when the parallelisation (area) is increased from the full resolution parallel architecture (Section 5.1) and the full resolution parallel architecture with parallel PE (Section 5.2) to the full resolution parallel architecture with multiple parallel PEs for increased throughput, the area increases. Depending upon the throughput, the frequency can be reduced to save power according to (7) and [37]. On the other hand, the corresponding RB architectures show a reduction in area in FPGA. For example, the RB version (i.e., RB CB algorithm) of the parallel architecture with parallel PE is 30% smaller (in resource utilization) than the full resolution parallel architecture with parallel PE. The saved area can be used for achieving further throughput and/or a reduction in frequency requirement and power consumption. For example, from Tables 1–3, the parallel architecture (RB) for search area correlation consumes approximately the same area as the full resolution parallel architecture with parallel PE, but with approximately double the throughput, or at the same throughput with 45% less power consumption.
The RB version of the parallel architecture with the RB CB algorithm (Table 3) shows a 20% reduction in power consumption and a 28% reduction in area as compared with the full resolution parallel architecture at the same frequency. A 45% reduction in power consumption is achieved by the full resolution parallel architecture with the spatiotemporal algorithm as compared with the full resolution full search parallel architecture. The RB version of the parallel architecture with parallel PE corresponding to the RB CB algorithm shows a 17% reduction in power consumption and a 28% reduction in area as compared with the corresponding full resolution architecture. A 45% reduction in power consumption is achieved by the full resolution parallel architecture with parallel PE with the spatiotemporal algorithm as compared with the corresponding full resolution full search architecture. Similarly, a 28% reduction in power and a 38% reduction in area are achieved by the RB version of the parallel architecture with multiple parallel PEs for search area and SAD (Section 5.4) as compared with the corresponding full resolution version.
Parallel architecture with search area correlation, parallel architecture with multiple parallel PEs for search area
and parallel architecture with multiple parallel PEs for
current frame can achieve real time processing rates for
CIF and QCIF video sequences at a design frequency of
63 MHz with a frame rate of 25 frames per second,
whereas the parallel architecture and parallel architecture
with parallel PE achieve real time processing rates for
QCIF sequences.
In order to compare the above architectures together in terms of speed, area and power consumption, a metric called the instructional complexity area product (ICAP) is defined, i.e., the product of instructional complexity and area, although Area × Time^2 metrics have been used in [3]. The architecture with the minimum value of ICAP represents the optimum architecture in terms of speed, power and area usage, provided it already meets the speed or time constraint. For this purpose, only the full resolution full search architectures and the RB architectures corresponding to the RB CB algorithm are considered. Using the corresponding complexity (or required frequency) values from Table 1 and the area values from Table 2, the RB parallel architecture with multiple parallel PEs for current frame has the minimum value of ICAP. Hence, the RB parallel architecture with multiple parallel PEs for current frame is the optimum in terms of speed, area usage and power consumption. In those cases where the area is more important than the power, architectures with reduced area can be utilized at a cost of more power while the throughput is the same.
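The ICAP comparison can be reproduced with a short calculation. The sketch below is an illustration only: the pairing of each architecture with a complexity value (Table 1) and a total area (Table 2) is an assumed reading of those tables, and the dictionary keys are shortened labels rather than names used by the authors.

```python
# label: (computational complexity in MIPS, total LUTs + registers)
architectures = {
    "Serial RISC processor": (3355.0, 2511),
    "FR parallel": (19.17, 7336),
    "FR parallel with parallel PE": (19.17, 6771),
    "FR multiple parallel PEs, search area and SAD": (9.66, 15905),
    "FR multiple parallel PEs, current frame": (4.72, 15548),
    "RB parallel": (19.17, 5414),
    "RB parallel with parallel PE": (19.17, 4873),
    "RB search area correlation": (11.25, 6680),
    "RB multiple parallel PEs, search area and SAD": (9.66, 10099),
    "RB multiple parallel PEs, current frame": (4.72, 9574),
}

# ICAP is the product of instructional complexity and area.
icap = {name: mips * area for name, (mips, area) in architectures.items()}
best = min(icap, key=icap.get)

for name, value in sorted(icap.items(), key=lambda item: item[1]):
    print(f"{name}: {value:,.0f}")
print("minimum ICAP:", best)
```

With these inputs the smallest product belongs to the RB architecture with multiple parallel PEs for current frame, in line with the conclusion stated above.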
Pipelining in FPGA also reduces the power consumption by reducing glitches [39]. Since pipelining is utilized in the architectures, an RB architecture has the benefit of a lower probability of glitches and a reduced consumption of resources for pipelining.
The performance of the RB design also depends upon the nature of the video sequence. For small motion sequences, the RBSAD<7:4> calculation has a lower toggling rate [3, 11–13, 37] but the rate of error corrections increases, whereas for large motion sequences, the RBSAD<7:4> has a greater toggling rate and fewer error corrections.
7. Conclusions
RBSAD can be used to assess the potential match between the CMVs and the current MB. However, the reduced dynamic range of the metric leads to somewhat reduced quality, typically for newscaster-type sequences.
The present work discusses the RB ME algorithms and their VLSI implementations for low power and high throughput while maintaining the accuracy (quality). The RB algorithm is based on spatiotemporal correlations, which leads to power savings of up to 45% for parallel/pipelined architectures. Depending upon the algorithm, the VLSI implementation consists either of one RB architecture or of two similar RB architectures, with one RB architecture activated conditionally, leading to power savings, or both RB architectures activated, leading to higher throughput.
There is a tradeoff between the area, throughput and
power consumption. Use of RB algorithms along with the
data reuse results in efficient parallel/pipelined VLSI architectures with low power, less area and high throughput.
Although the main aim of the current work is to show the significance of using RB algorithms in VLSI implementations in terms of power, area saving and throughput, the delay introduced in the design by the design I/O of the FPGA creates an upper bound on the frequency of operation. It is also important to mention that reducing logic may not always lead to a saving in power or area in FPGA; e.g., reducing logic in FPGA by reusing it can reduce logic but also increase the levels of design logic, which in turn may require more LUTs for routing. Since the RB architectures mentioned here are derived from their full resolution versions in a simple manner, the RB architectures have the benefits of area and power savings in FPGA as mentioned above.
Further data reuse, complexity reduction and power saving are possible by employing hierarchical or multiresolution techniques such as the successive elimination algorithm [3] and the binary sum pyramid algorithm [3] along
with RB algorithms although they are more beneficial for
the serial design due to the dependence on intermediate results. Similarly, feature matching ME algorithms [3] such
as integral projection methods [3] also reduce the computational complexity and memory accesses by using a sliding
window approach to reuse the already calculated results.
Early termination of SAD calculation also significantly
reduces the computational complexity and memory accesses [3]. In addition to SAD metric, minimized maximum error [3] and different pixel count metrics [3] are also
low complexity metrics, and their reduced resolution form
can be considered in further reducing the VLSI complexity
of the design.
Increased throughput is also possible by utilizing the
common area between the search areas as well as CMVs
by reading the whole image into the RAM at a cost of more
area.
References
[1] A. H. Sadka. Compressed video communications. London:
John Wiley & Sons, 2002.
[2] Z. N. Li, M. S. Drew. Fundamentals of multimedia, school of
computing science. New Delhi: Prentice-Hall of India, Private
Limited, 2005
[3] P. Kuhn. Algorithms, complexity analysis and VLSI architectures for MPEG4 motion estimation. Germany: Kluwer Academic Publishers, 2003.
[4] MPEG Compression Standard. www.mpeg.chiariglione.org
[5] V. A. Chouliaras J. L. Nunez, D. J. Mulvaney, et al. A multistandard video coding accelerator based on a vector architecture. IEEE Trans. on Consumer Electronics, 2005, 51(1): 160
167.
[6] V. A. Chouliaras, T. R. Jacobs, J. L. Nunez-Yanez, et al. Thread
parallel MPEG-2 and MPEG-4 encoders for shared memory
multiprocessors. International Journal of Computers and Applications, 2007, 29(4): 353361.
[7] V. A. Chouliaras, V. M. Dwyer, S. Agha. On the performance
improvement of sub-sampling MPEG-2 motion estimation algorithms with vector/SIMD architectures. Proc. of the Advanced Concepts for Interligent Vision Systems, 2005: 595
602.
[8] V. A. Chouliaras, S. Agha, T. R. Jacobs, et al. Quantifying the
benefit of thread and data parallelism for fast motion estimation in MPEG-2. IEE Electronic Letters, 2006, 42(13): 747
748.
[9] V. A. Chouliaras, J. L. Nunez-Yanez, S. Agha. Silicon Implementation of a parametric vector data path for real-time
MPEG2 encoding. Proc. the International Association of Science and Technology for Development, 2004.
[10] V. A. Chouliaras, V. M. Dwyer, S. Agha, et al. Manolopoulos. Customization of an embedded RISC CPU with SIMD extensions for video encoding. A case study. The VLSI Journal,
2008, 41: 135152.
[11] V. Ila, R. Garcia, F. Charot. VLSI architecture for an underwater robot vision system. Oceans Europe, 2005: 674679.
[12] C. Hisham, K. Komal, A. K. Mishra. Low power and less area
architecture for integer motion estimation. International Journal of Electronics, Circuits and Systems, 2009, 3(1): 1117.
[13] J. Miyakoshi, Y. Murachi, K. Hamano, et al. A low power
systolic array architecture for block-matching motion estimation. IEICE Trans. Electronics, 2005, 88(4): 559569.
[14] H. Jong, L. Chen, T. Chieuh. Accuracy improvement and cost
reduction of three step search block matching algorithm for
video coding. IEEE Trans. Circuits and Systems for Video
Technology, 1994, 4: 8891.
[15] J. Y. Tham, S. Ranganath, A. A. Kassim. A novel unrestricted
centre-biased diamond search algorithm. IEEE Trans. Circuits
and Systems for Video Technology, 1998, 8: 369377.
[16] B. Chanda, D. Dutta. Digital image processing and analysis,
New Delhi: Prentice Hall of India, 2003.
[17] M. A. Awan, A. Umar, S. A. Khan. Fast adder and subtractor design (-1+j)-based complex binary numbers. Proc. of the
International Conference on World Scientific and Engineering
Academy and Society, 2005: Article no. 13.
[18] B. Ramkumar, H. M. Kittur, P. M. Kannan. ASIC implementation of modified faster carry save adder. European Journal of
Scientific Research, 2010, 42(1): 5358.
[19] A. Th. Schwarzbacher, J. P. Silvennoinen, J. T. Timoney. Benchmarking CMOS adder structures. Proc. of the Irish
Systems and Signals Conference, 2002: 231234.
[20] M. D. Ciletti. Advanced digital design with the Verilog
HDL. Colorado: Springs, 2005.
[21] V. M. Dwyer, S. Agha, V. Chouliaras. Low power full search
block matching using reduced bit sad values for early termina-
tion. Mirage 2005, Versailles, France, 2005: 191196.

[22] V. M. Dwyer, S. Agha, V. A. Chouliaras. Reduced-bit, full
search block-matching algorithms and their hardware realizations. Proc. of the Conference on Advanced Concepts for Intelligent Vision Systems, 2005: 372380.
[23] S. Agha, V. M. Dwyer, V. Chouliaras. Motion estimation with
low resolution distortion metric. Electronics Letters, 2005,
41(12): 693694.
[24] A. Vlachos, V. Fotopoulos, A. N. Skodras. Low bit depth
representation motion estimation algorithms: a comparative
study. Journal of Real Time Image Processing, 2010, 5(3):
141148.
[25] A. Celebi, O. Urhan, I. Hamzaoglu. Efficient hardware
implementations of low bit depth motion estimation algorithms. IEEE Signal Processing Letters, 2009, 16(6): 513516.
[26] Z. L. He, K. K. Chan, C. Y. Tsui. Low power motion estimation
design using adaptive pixel truncation. IEEE Trans. on Circuits
and Systems for Video Technology, 2000, 10(5): 669678.
[27] S. Lee, S. I. Chae. Motion estimation algorithm using low resolution quantization. Electronics Letters, 1996, 32(6): 647648.
[28] Y. J. Baek, H. S. Oh, H. K. Lee. Block-matching criterion
for efficient VLSI implementation for motion estimation. Electronics Letters, 1996, 32(13): 11841185.
[29] H. Nisar, T. S. Choi. Fast motion estimation algorithm based
on spatio-temporal correlation and direction of motion vectors. IET Electronics Letters, 2006, 42(24): 1384–1385.
[30] M. Y. Kim, M. G. Jung. Motion estimation using cross center-biased distribution and spatio-temporal correlation of motion
vector. Proc. of the 8th International Conference on KES,
2004: 244–252.
[31] R. Weeks. Fundamentals of electronic image processing. New
Delhi: Prentice Hall of India, 2003.
[32] R. C. Gonzalez, R. E. Woods, S. L. Eddins. Digital image processing using MATLAB. India: Dorling Kindersley, 2006.
[33] MPEG Compression Standard. www.mpeg.chiariglione.org/
working documents.htm#MPEG-4
[34] V. S. K. Reddy, S. Sengupta, Y. M. Latha. New VLSI architecture for motion estimation algorithm. World Academy of Science, Engineering and Technology, 2007, 36: 68–71.
[35] K. K. Parhi. VLSI digital signal processing system, design and
implementation. New Jersey: John Wiley & Sons, 2003.
[36] M. M. Mano, C. R. Kime. Logic and computer design fundamentals. Pearson Prentice Hall, 2004.
[37] Xilinx. Xilinx Synthesis Tools (XST version 9.1i)
www.Xilinx.com
[38] I. Kuon, J. Rose. Measuring the gap between FPGAs and
ASICs. IEEE Trans. on Computer-Aided Design of Integrated
Circuits and Systems, 2007, 26(2): 203–215.
[39] J. Becker, M. Platzner, S. Vernalde. The impact of pipelining on energy per operation in field-programmable gate arrays. Proc. of the International Conference on Field Programmable Logic and Applications, 2004: 719–728.
Biographies
Shahrukh Agha was born in 1976. He received his Ph.D. in
software and hardware techniques for accelerating
MPEG motion estimation from Loughborough University, UK, in 2006. His research interests are low
power real time VLSI implementations of digital
video and signal processing algorithms. He is currently an assistant professor in the Department of
Electrical Engineering, COMSATS Institute of Information Technology, Islamabad, Pakistan.
E-mail: shahrukh agha@comsats.edu.pk
Shahid Khan received his Ph.D. in telecommunications engineering in the UK. His research areas are electronics and radio frequency communications. He is currently a professor and dean of the Department of
Electrical Engineering at COMSATS Institute of Information Technology, Islamabad, Pakistan.
E-mail: shahidk@comsats.edu.pk
Shahzad Malik received his Ph.D. in wireless networking. His research areas are electronics and wireless networks. He is currently a professor and chairman of the Department of Electrical Engineering at
COMSATS Institute of Information Technology, Islamabad, Pakistan.
E-mail: smalik@comsats.edu.pk
Raja Riaz received his bachelor of engineering degree, with Silver Medal, from the National University of Sciences and Technology (NUST), Pakistan, in 1998.
He received an M.S. degree from the Center for Advanced
Studies in Engineering (CASE), University of Engineering and Technology (UET), Taxila, Pakistan,
specialising in telecommunications, in 2003. He received
a second M.S. degree from NUST, Pakistan, with
control of dynamic systems as his area of specialization. He obtained his
Ph.D. degree in the field of ultra-wideband communications from the School
of Electrical and Computer Sciences, University of Southampton, UK, in
2010. His research interests include channel coding, communication systems, and stochastic processes.
E-mail: rajaali@comsats.edu.pk

