Abstract: Low power and real time very large scale integration
(VLSI) architectures of motion estimation (ME) algorithms for mobile devices and applications are presented. The power reduction
is achieved by devising a novel correction recovery mechanism
based on algorithms which allow the use of reduced bit sum of
absolute difference (RBSAD) metric for calculating matching error
and conversion to full resolution sum of absolute difference (SAD)
metric whenever necessary. Parallel and pipelined architectures
for high throughput of full search ME corresponding to both the full
resolution SAD and the generalized RBSAD algorithm are synthesized using Xilinx Synthesis Tools (XST), where the ME designs
based on reduced bit (RB) algorithms demonstrate reductions of up to 45%
in power consumption and/or up to 38% in area.
Keywords: motion estimation (ME), very large scale integration
(VLSI), reduced bit sum of absolute difference (RBSAD).
DOI: 10.1109/JSEE.2013.00047
1. Introduction
It is common to use prediction techniques to compress
video data. For example, in video sequences, a square region of the current frame, s(i, j, k), called a macroblock,
is sought in a relative area, called a search area, of the
reference frame (a frame either succeeding the current
frame or preceding it), s(i, j, k ± 1), where (i, j) are the
spatial coordinates of the pixel and k is the temporal coordinate, i.e., the frame number, in an attempt to find a
region which is similar to it. Such a technique is called
motion estimation (ME) based on simplified optical
flow constraints [13]. Such techniques have been widely
used by video standards including MPEG-1, MPEG-2 and MPEG-4 as well as H.261, H.263 and H.264 [1–4]. There are different kinds of ME algorithms; the most
common and most suitable algorithm for very large scale
integration (VLSI) implementations is the block matching
Manuscript received October 9, 2011.
*Corresponding author.
Shahrukh Agha et al.: Reduced bit low power VLSI architectures for motion estimation
2. RBSAD
As described above, the matching of the current frame
square region with the reference frame square region in the
search area is performed in terms of the SAD [3] error. The mean
square error metric [13] can also be used; it represents quality
more accurately [3] but has a higher
computational complexity. The SAD for a 16 × 16 macroblock is given
by
$$\mathrm{SAD}(m, n) = \sum_{i=0}^{15}\sum_{j=0}^{15} \left| F_C(i, j) - F_R(i + m, j + n) \right| \qquad (1)$$
Since most of the hardware realizations, for FSME encoding, are designed in a bit slice fashion, (3) may be evaluated by exactly using the same hardware with a 4-bit version of the datapath and controller.
In addition, the value of RBSAD<7:4> in (3) may be
corrected to (1) by adding the term Δ(m, n) [21–23] given
by
$$\Delta(m, n) = \sum_{i=0}^{15}\sum_{j=0}^{15} \delta_{i,j}(m, n)\,\bigl(F_C(i, j)_{\langle 3:0\rangle} - F_R(i + m, j + n)_{\langle 3:0\rangle}\bigr) \qquad (4)$$
where <3:0> refers to the lower four bits of the pixel
and δi,j(m, n) is the sign of (FC(i, j)<7:4> − FR(i +
m, j + n)<7:4>), except when (FC(i, j)<7:4> − FR(i +
m, j + n)<7:4>) is zero, in which case δi,j(m, n) equals the sign of
(FC(i, j)<3:0> − FR(i + m, j + n)<3:0>).
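The relation between the full resolution SAD, the upper four bit metric and the correction term of (4) can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: it assumes 8-bit pixels stored as flat lists, and it assumes RBSAD<7:4> is expressed in SAD units (multiples of 16). The function names are illustrative only.

```python
def sign(v):
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)

def sad(cur, ref):
    """Full resolution sum of absolute differences over two pixel lists."""
    return sum(abs(c - r) for c, r in zip(cur, ref))

def rbsad_upper(cur, ref):
    """Upper four bit metric, scaled to SAD units (assumed to be multiples of 16)."""
    return 16 * sum(abs((c >> 4) - (r >> 4)) for c, r in zip(cur, ref))

def correction(cur, ref):
    """Signed sum of lower four bit differences, as described for (4)."""
    total = 0
    for c, r in zip(cur, ref):
        hi_diff = (c >> 4) - (r >> 4)
        lo_diff = (c & 0x0F) - (r & 0x0F)
        # Sign of the upper difference, or of the lower one if the upper is zero.
        s = sign(hi_diff) or sign(lo_diff)
        total += s * lo_diff
    return total

# Two-pixel example: the corrected reduced bit metric equals the full SAD.
cur = [0x2F, 0x41]
ref = [0x30, 0x4A]
assert sad(cur, ref) == rbsad_upper(cur, ref) + correction(cur, ref)
```

Because each lower four bit difference lies between −15 and 15, the magnitude of the correction for a 256-pixel block cannot exceed 15 × 256.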
Journal of Systems Engineering and Electronics Vol. 24, No. 3, June 2013
frames [3,29,30]. Because of this, there is a high probability of finding the best match in the small region surrounding the center of the search area unless the motion
is fast. We assume the distribution of the matches to be
Gaussian [31], i.e.,
$$p(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (5)$$
where σ is the standard deviation and p(·) is the probability density. The equivalent points, the points of equal error,
are on the circles,
x² + y² = c²,
3.2 CB algorithm
$$p(x, y) = \frac{1}{\sigma^2}\, e^{-\frac{2(|x| + |y|)}{\sigma}}, \qquad (6)$$
SAD = RBSAD<7:4>
end if
where T is some threshold value. The maximum value of
error in using RBSAD<7:4> alone as compared with SAD
is ±15 × 256. For sequences with high spatiotemporal correlations, there are many candidate MVs (CMVs)
with RBSAD<7:4> zero in the search area, whereas in case
of large or complex motion sequences, RBSAD<7:4> of
CMVs differs from each other significantly. The more the
RBSAD<7:4> differs from each other, the less the error
probability of finding best match using RBSAD<7:4> is.
Substituting T = 0 in the above criterion, we obtain
another criterion [21–23], which we call the correction algorithm, given below.
if (RBSAD<7:4> == 0) then
SAD = Δ(m, n)
else
SAD = RBSAD<7:4>
end if
i.e., when RBSAD<7:4> is zero, simply calculate the correction term Δ(m, n) and use it as SAD in the minimum
SAD evaluation. The notion is that for sequences with high
spatiotemporal correlations, there are many CMVs with
RBSAD<7:4> zero in the search area, hence the application of corrections under this condition can lead to correct
MVs.
In this typical case when RBSAD<7:4> is zero, the sign term
δi,j(m, n) will have the sign of (FC(i, j)<3:0> − FR(i +
m, j + n)<3:0>) in (4) and the register elements will not
be required for storing the sign bits.
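The correction algorithm can be illustrated with a short Python sketch. It is a hedged illustration rather than the published implementation: pixels are assumed to be 8-bit values in flat lists, RBSAD<7:4> is assumed to be scaled to multiples of 16, and the function names are hypothetical. When RBSAD<7:4> is zero, every upper four bit difference is zero, so the correction term reduces to the plain sum of absolute lower four bit differences and no stored sign bits are needed.

```python
def rbsad_and_low_sad(cur, ref):
    """Return (RBSAD<7:4> in SAD units, sum of absolute lower four bit differences)."""
    rb = 16 * sum(abs((c >> 4) - (r >> 4)) for c, r in zip(cur, ref))
    low = sum(abs((c & 0x0F) - (r & 0x0F)) for c, r in zip(cur, ref))
    return rb, low

def correction_algorithm_metric(cur, ref):
    """Matching error under the correction algorithm (threshold T = 0)."""
    rb, low = rbsad_and_low_sad(cur, ref)
    # Zero RBSAD: the lower-bit sum is the exact SAD. Otherwise keep RBSAD.
    return low if rb == 0 else rb

# Upper bits match, so the metric is the exact SAD: |0x12-0x15| + |0x3F-0x30| = 18.
assert correction_algorithm_metric([0x12, 0x3F], [0x15, 0x30]) == 18
# Upper bits differ, so the uncorrected RBSAD (16) is used although the true SAD is 1.
assert correction_algorithm_metric([0x20], [0x1F]) == 16
```

The second case shows the kind of error that remains when no correction is applied.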
Another way of reducing power and area of ME architecture while keeping the accuracy is to implement the CB
ME algorithm with the RB architecture. In the ME algorithm, search for the best CMV is normally executed in a
raster manner, i.e., from left to right and top to bottom. Due
to the CB motion in natural sequences, a large number of
blocks in the current frame could be regarded as stationary
or quasi-stationary with respect to the reference frame. As
RBSAD<7:4> (m, n) only has a resolution of 16 (i.e., it
takes on values 16N , for integer N ), there can be a number of CMVs which generate the same metric due to high
spatiotemporal correlations especially in the case of slow
motion sequences, e.g., newscaster kind of sequences like
Claire, Akiyo, Missa and Grandmom sequences. We resolve this conflict by choosing the MV according to the CB
assumption, according to which if two MVs evaluate to the
same error metric then the point closest to origin should
be chosen, provided RBSAD is less than a threshold (T2 ),
i.e., when the motion is small. With a Gaussian error surface, it would be the point on the circle with the smallest
radius, while with a Laplacian error surface, it would mean
the point on the square with the smallest side length. Hence
choose the MV with the lowest value of |x| + |y|, which
corresponds to the assumption of a Laplacian error surface
(6). Pseudocode for the search for the best CMV in a CB
manner using RBSAD, which we call the
CB algorithm [21–23], is shown below:
if (((|vX | + |vY |) < (|mvX | + |mvY |)) and
(RBSAD<7:4> <= T2 )) then
mvX = vX
mvY = vY
end if
where (mvX , mvY ) and (vX , vY ) are the current best MV
and the MV of the candidate block under test respectively,
RBSAD<7:4> is the minimum RBSAD<7:4> value found
so far and T2 is some threshold value as the measure of
small and large displacements (motion).
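The tie-breaking rule can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: candidates arrive in raster order as ((vX, vY), RBSAD) pairs, the rule is applied only when a candidate ties with the current minimum, and the function name and argument layout are hypothetical.

```python
def cb_rb_search(candidates, t2):
    """Pick the best MV from ((vx, vy), rbsad) pairs with a center-biased tie-break.

    A candidate whose RBSAD equals the current minimum replaces the current best
    MV only if it is closer to the origin in |x| + |y| and the minimum is <= t2.
    """
    best_mv, best = None, None
    for (vx, vy), rb in candidates:
        if best is None or rb < best:
            best_mv, best = (vx, vy), rb
        elif rb == best and best <= t2:
            if abs(vx) + abs(vy) < abs(best_mv[0]) + abs(best_mv[1]):
                best_mv = (vx, vy)
    return best_mv, best

# Three candidates tie at RBSAD = 0; the zero vector wins under the CB rule.
ties = [((-1, -1), 0), ((0, 0), 0), ((1, 1), 0)]
assert cb_rb_search(ties, t2=16) == ((0, 0), 0)
```

With the tie-break disabled (for example a negative T2), the first candidate in raster order is kept, which corresponds to the plain left-to-right raster search.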
Fig. 3 shows the MV distribution of CB searching pattern. CB RB ME distribution has 81% correct MVs (as
compared with the MVs obtained from full resolution motion estimation) and the left-to-right raster RB ME has 51%
correct MVs.
else
Search with CB RBSAD<7:4> in the whole search area
end if
where mvT is the MV of the corresponding macroblock
in the previous frame (i.e., from the same location) and
SADmin_t (or RBSADmin_t) is the corresponding minimum SAD or minimum RBSAD<7:4>. Similarly, mvS is
the MV of the macroblock to the left in the current frame
(or the macroblock above if no macroblock exists in the left position)
with minimum SAD or minimum RBSAD<7:4> as
SADmin_s or RBSADmin_s, and T3 and T4 are some threshold values. For more accuracy we only use the full resolution
SAD value in the spatiotemporal algorithm whether the spatiotemporal condition is satisfied or not, i.e.,
if (mvT == 0 and mvS == 0 and SADmin_s <= T3 and
SADmin_t <= T4) then
Apply corrections or search with full resolution SAD in
range
[m, n] ∈ {−1, 0, 1} × {−1, 0, 1}
else
Apply corrections or search with full resolution SAD in
the whole search area
end if
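The selection of the search range in the spatiotemporal algorithm can be sketched as below. This Python fragment is a hedged illustration only: the half-width p = 8 of the full search area (giving 16 × 16 = 256 candidates) and the function and parameter names are assumptions made for the example.

```python
def search_offsets(mv_t, mv_s, sad_min_s, sad_min_t, t3, t4, p=8):
    """Return the candidate (m, n) offsets to be searched for one macroblock.

    mv_t: MV of the co-located macroblock in the previous frame.
    mv_s: MV of the left (or above) macroblock in the current frame.
    p:    assumed half-width of the full search area (offsets -p .. p-1).
    """
    small_motion = (
        mv_t == (0, 0) and mv_s == (0, 0)
        and sad_min_s <= t3 and sad_min_t <= t4
    )
    # 3 x 3 window when the spatiotemporal condition holds, full area otherwise.
    rng = range(-1, 2) if small_motion else range(-p, p)
    return [(m, n) for n in rng for m in rng]

# Condition satisfied: only 9 candidates are evaluated instead of 256.
assert len(search_offsets((0, 0), (0, 0), 100, 100, t3=200, t4=200)) == 9
assert len(search_offsets((3, 0), (0, 0), 100, 100, t3=200, t4=200)) == 256
```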
The notion is that if a particular block (current macroblock) does not move in the previous frame and has a
minimum SAD value less than the threshold, and adjacent
macroblocks of the current macroblock also do not move
far in the current frame and have a minimum SAD value
less than the threshold, then it is less likely for the current
macroblock to move far in the current frame. As shown in
the pseudocode of the spatiotemporal algorithm, the spatiotemporal MVs can reduce complexity by reducing the
search area. According to the former spatiotemporal algorithm, CB ME with RBSAD<7:4> (i.e., RB CB algorithm)
is used whenever the spatiotemporal condition is not satisfied. The MVs obtained from RB CB algorithm are slightly
less accurate than the MVs obtained from full resolution
SAD ME (as described above) and may not accurately represent the spatiotemporal correlation among the MVs, thus
resulting in an increased error. But this condition will not
persist for long as a large motion gives a large SAD or
RBSAD<7:4> (spatiotemporal algorithm). With the RBSAD<7:4>
metric, the spatiotemporal algorithm yields 67% accurate MVs when compared
with the exhaustive full resolution ME algorithm, whereas
97% accurate MVs are obtained when this algorithm is
used with full resolution SAD. In addition, the accuracy of
ME can be increased by employing object shape description [3] which separates the foreground portion of images
from the static background.
4. Results
The quality of any ME algorithm is usually assessed by
the PSNR [3] defined, in this case, as the fractional root
mean square (RMS) error between the predicted and true
frames, expressed on a dB scale. Here, the prediction of the
current frame is obtained from the previous frame using
MV in the previous frame although in the standard MPEG
compression format, motion compensation is used in the
reconstructed frame. The reconstructed frame is obtained
by adding the quantized block or frame error in the motion compensated block or frame. The quality of compression is evaluated by computing PSNR between the original and reconstructed frames. The construction of reconstructed frame also depends upon the bit rate allocated and
the nature of motion. Simple and smaller motion sequences
have higher PSNR for a given bit rate. In this work, the
aim of evaluating PSNR is to evaluate the performance of
ME algorithms which can represent the performance of the
compression encoder.
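As an illustration of the quality measure, a minimal Python sketch of PSNR is given below. It assumes the common 8-bit form 20 log10(peak/RMSE) with a peak value of 255; the exact normalization used for the reported results is not stated here, so the peak value and the function name are assumptions.

```python
import math

def psnr(pred, true, peak=255):
    """PSNR in dB between a predicted and a true frame given as flat pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    if mse == 0:
        return math.inf  # identical frames
    return 20 * math.log10(peak / math.sqrt(mse))

# A prediction that is off by one grey level everywhere has an RMS error of 1.
value = psnr([10, 10, 10, 10], [11, 9, 11, 9])
assert abs(value - 20 * math.log10(255)) < 1e-9
```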
Line 1 in Fig. 4 shows the difference in PSNR between the full search (full resolution) and the simple RB
Fig. 4
tating city (RC) sequence, results in zero corrections. Similarly, applying the spatiotemporal algorithm to the Claire sequence results in approximately 70%
of macroblocks being searched in a search window of
size 3 × 3, and in zero percent of macroblocks for the
RC sequence. The small motion algorithm with full resolution SAD yields 8% of current frame macroblocks for which
search area computation is not carried out, i.e., ME for
these current frame macroblocks is not done, whereas with
the small motion algorithm and the RBSAD<7:4> metric, search
area computations are skipped for 34% of macroblocks on average. Applying the small motion algorithm to the RC sequence results
in zero percent of macroblocks for which the ME search is
skipped.
ates row addresses for the parallel memory and the whole
row of search area is output. Starting from the first row
of CMVs, a row of search area is output for 16 clock cycles in which 16 partial SADs corresponding to 16 CMVs
are computed in parallel by the 16 PEs. Each partial SAD
(composed of sum of 16 absolute differences) is computed
sequentially by each PE.
Fig. 5
Fig. 6 shows the bus structure (multiplexers) for distributing full resolution data to the corresponding PEs. In a
similar manner the next 15 rows are accessed to compute
the remaining partial SADs for the computation of the full
SADs of the first row of CMVs in a total of 16 × 16 clock
cycles. Similarly, starting from the second row of CMVs,
16 rows of search area are accessed to compute the SADs of
the second row of CMVs. The comparison of SADs is done
in pipeline with the computation of the SADs of the next
row of CMVs, i.e., while the second row of CMVs is being
processed, the completed SADs of the first row of CMVs
are compared by an SAD compare unit as
shown in Fig. 5. The above process is repeated for the remaining rows of CMVs, i.e., 256 SADs are computed and
the best match (i.e., the location of minimum SAD) of the current frame macroblock is found in 16 × 256 clock cycles. Thus a quarter common intermediate format (QCIF)
sized video frame (176 pixels × 144 pixels) will take approximately (176/16 × 144/16) × 16 × 256 = 405 504
clock cycles (excluding write cycles for block memories)
for the ME between two frames. Hence real time
operation, i.e., 25 frames per second, would require
405 504 × 25 = 10 137 600 clock cycles per second.
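The cycle budget quoted above can be reproduced with a short calculation. The Python sketch below only restates that arithmetic; the constant and function names are illustrative and, as in the text, write cycles for the block memories are excluded.

```python
MB_SIZE = 16                 # macroblock side length in pixels
CYCLES_PER_MB = 16 * 256     # 256 CMVs, 16 clock cycles per row of 16 CMVs

def cycles_per_frame(width, height):
    """Clock cycles needed for full search ME of one frame."""
    macroblocks = (width // MB_SIZE) * (height // MB_SIZE)
    return macroblocks * CYCLES_PER_MB

def cycles_per_second(width, height, fps=25):
    """Clock cycles per second needed for real time operation at the given frame rate."""
    return cycles_per_frame(width, height) * fps

# QCIF (176 x 144) at 25 frames per second.
assert cycles_per_frame(176, 144) == 405504
assert cycles_per_second(176, 144) == 10137600
```

The result corresponds to a minimum clock frequency of roughly 10.14 MHz for this architecture on QCIF input.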
Fig. 7 (block labels recovered from the figure: 2's complement, subtractor, sign bit, MUX, accumulator, SAD register)
Fig. 8
Fig. 9
the SAD calculation is terminated whenever the intermediate SAD value exceeds the minimum SAD value computed so far [34] which results in saving of computations
and power), only one comparator would be required as
one SAD corresponding to a CMV is being computed at
a time. The drawback is that the number of local memory accesses
increases; this can be reduced by using a 16 × 1 bit
circular shift register for enabling and disabling memory blocks. The same applies to the RB version of this
architecture. Fig. 10 shows a parallel architecture with parallel PE. The address generation sequence for this architecture is as follows. Starting from the first row of the search
area, 16 rows are accessed to compute the SAD of the first
CMV of the first row of CMVs in 16 clock cycles from
columns 1 to 16 of the search area memory.
Fig. 11 An adder tree [3] for parallel PE for simultaneously calculating sum of 16 RB absolute differences (pipeline registers are not
shown)
for computing the RBSAD<7:4> and the other for computing the correction term Δ(m, n). Initially, the RBSAD<7:4> corresponding to the upper four bits is computed using the upper four bit unit, with the signs of the difference
terms being stored in the corresponding memory. When
RBSAD<7:4> exceeds the current minimum SAD value
by 15 × 256 (a factor compensating for the minimum
value of the correction), the computation for the current CMV
terminates; otherwise the lower bit unit is enabled to
compute Δ(m, n) for the full value of SAD while the upper bit
unit computes RBSAD<7:4> of the next CMV in the
pipeline. The same architecture can be used for the correc-
Fig. 12 Power efficient parallel ME architecture with parallel PE and SAD early termination
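The early termination test used with the upper and lower bit units can be modelled in software. The following Python sketch is a behavioural illustration under stated assumptions, not a description of the hardware: blocks are flat lists of 8-bit pixels, RBSAD<7:4> is scaled to multiples of 16, the full SAD is computed directly instead of as RBSAD<7:4> plus the correction term (the two are equal), and all names are hypothetical.

```python
MAX_CORRECTION = 15 * 256  # largest magnitude of the correction for 256 pixels

def rb_search_with_early_termination(cur, candidates):
    """Full search over (mv, ref_block) pairs with the RBSAD-based skip test.

    Returns (best_mv, best_sad, skipped), where skipped counts the candidates
    for which the lower-bit computation was not needed.
    """
    best_mv, best_sad, skipped = None, None, 0
    for mv, ref in candidates:
        rb = 16 * sum(abs((c >> 4) - (r >> 4)) for c, r in zip(cur, ref))
        # SAD >= rb - MAX_CORRECTION, so this candidate cannot beat best_sad.
        if best_sad is not None and rb > best_sad + MAX_CORRECTION:
            skipped += 1
            continue
        full = sum(abs(c - r) for c, r in zip(cur, ref))
        if best_sad is None or full < best_sad:
            best_mv, best_sad = mv, full
    return best_mv, best_sad, skipped

cur = [0] * 256
cands = [((0, 0), [1] * 256), ((1, 0), [255] * 256), ((2, 0), [0] * 256)]
# The second candidate is rejected from its upper bits alone.
assert rb_search_with_early_termination(cur, cands) == ((2, 0), 0, 1)
```

Since the skip test is a strict lower bound on the full SAD, the selected MV is the same as that of an exhaustive full resolution search over the same candidates.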
all the locations in the horizontally adjacent two search areas. Due to common areas between the adjacent search areas, data reuse can be effectively utilized resulting in improvement of the throughput and power saving. The drawback is that this architecture may not efficiently utilize the
spatiotemporal algorithm for power reduction and speed
improvement as there are two current frame macroblocks
being searched simultaneously. Similar is the case with the
implementation of the small motion algorithm with this architecture. Only the RB CB algorithm is efficiently implemented with this architecture.
Fig. 13
lel systolic VLSI architectures on the computational complexity of serial ME process and its power consumption,
a 4-stage pipelined RISC processor [36] based on a simple instruction set architecture consisting of 31 instructions
(arithmetic, logical and load, store) is implemented along
with the corresponding compiler and simulator. All the
above mentioned architectures are then connected, one at
a time, to the serial RISC processor (through one address
and data bus, i.e., in loosely coupled configurations) to investigate the reduction in the computational complexity and
power consumption [3,5–10]. The architectures with the
serial RISC processor form a processor-coprocessor system. In addition, a writing address generator for fetching
data from the external memory and writing it to coprocessor memories and a direct memory access (DMA) unit for
storing the incoming data to the external memory are also
attached with the serial processor. The purpose of the serial
processor is to write macroblock coordinates of the reference and current frames to the writing address generator,
writing generated results to memories, running the sequential part of the encoder, communicating with DMA through
interrupt etc. In addition, the use of the serial and vector data caches in serial processor architectures can have
a significant effect in reducing the data fetching from the
external memory [3].
Fig. 14 shows the interconnection of the architectures
with the RISC processor. The DMA [36] controller is
utilized to increase the RISC processor efficiency. The
clock frequency of the system should be fast enough to
achieve the real time processing. Two address and data
buses (not shown in Fig. 14) can further improve the performance. One address and data bus will be dedicated to
the DMA or the other processor for storing the incoming
data into the external memories, whereas the main processor or the coprocessor will read data from the other external memories through the other bus. In this case, for
the real time ME, four external memories are required,
Fig. 14
three for storing the upcoming frames for ME and one for
storing the reconstructed frame for motion compensation
(shown in Fig. 14 as the video object plane (VOP) buffer
RAM). Here by VOP we mean a frame. After two memories are written with two consecutive frames, the ME process will start while the third memory will be written with
the third frame.
6. VLSI implementation
Register transfer level (RTL) implementations of the
above mentioned serial and parallel ME architectures,
corresponding to the full resolution and RB, were synthesized and place-and-routed using Xilinx Synthesis
Tools (XST version 9.1i) [37], targeting Virtex 4 FPGA,
Full resolution
/MIPS
VLSI architectures
Correction
algorithm
/MIPS
RB CB
algorithm
/MIPS
3 355
(3 355 MHz)
19.17
(19.17 MHz)
19.17
(19.17 MHz)
19.17
(19.17 MHz)
19.17
(19.17 MHz)
9.66
(9.66 MHz)
4.72
(4.72 MHz)
9.66
(9.66 MHz)
4.72
(4.72 MHz)
19.17
(19.17 MHz)
19.17
(19.17 MHz)
11.25
(11.25 MHz)
9.66
(9.66 MHz)
4.72
(4.72 MHz)
Spatiotemporal
algorithm
(full resolution)
/MIPS
1 320.9
(1 320.9 MHz)
10.8
(10.8 MHz)
10.8
(10.8 MHz)
RB small
motion algorithm
/MIPS
3 085
(3 085 MHz)
18.705
(18.705 MHz)
18.705
(18.705 MHz)
17.18
(17.18 MHz)
17.18
(17.18 MHz)
6.27
(6.27 MHz)
9.96
(9.96 MHz)
10.94
(10.94 MHz)
Area consumption in terms of total LUTs and registers (total number of LUTs + registers)
Full resolution
Serial RISC processor: 2 190+321=2 511
Parallel architecture: 5 855+1 481=7 336
Parallel architecture with parallel PEs: 5 476+1 295=6 771
Parallel architecture with search area correlation:
Parallel architecture with multiple parallel PEs for search area and SAD: 11 971+3 934=15 905
Parallel architecture with multiple parallel PEs for current frame: 13 130+2 418=15 548
Correction
algorithm
RB CB algorithm
Spatiotemporal
algorithm
(full resolution)
RB small
motion algorithm
2 190+321=2 511
2 190+321=2 511
3 980+893=4 873
3 980+893=4 873
11 971+3 934=15 905 7 589+2 510=10 099 11 971+3 934=15 905 11 971+3 934=15 905 7 589+2 510=10 099
13 130+2 418=15 548 8 039+1 535=9 574 13 130+2 418=15 548 13 130+2 418=15 548 8 039+1 535=9 574
correlation
Parallel architecture with multiple
22
parallel PEs for search area and SAD
Parallel architecture with multiple
12.89
parallel PEs for current frame
VLSI architectures
mW
Correction
algorithm
34
32
RB CB
algorithm
32
30
Spatiotemporal algorithm
(full resolution)
2 309
22
20
RB small
motion algorithm
29
27
20
17
16
14
22
18
10.64
9.89
Table 4 Dynamic power consumption of individual design components in FPGA at 63 MHz obtained from Xilinx Power Analyser (all values in mW)
VLSI blocks                                            Clocks  Inputs  Logic  Outputs  Signals  Total dynamic power
Serial processor                                       11.99   11.16   29.05  30.31    27.59    110.12
Full resolution parallel architecture                  44.65   8.82    32.86  10.98    32.98    130.30
RB parallel architecture                               31.82   5.54    27.16  12.70    27.66    104.90
RB parallel architecture with parallel PE              26.98   6.02    28.52  12.47    25.95    99.96
RB parallel architecture with search area correction   36.86   5.77    32.20  10.88    29.43    115.16
(block name missing in source)                         47.47   5.59    21.09  7.91     19.50    101.58
(block name missing in source)                         67.53   7.93    53.75  11.51    31.34    172.09
(block name missing in source)                         47.90   5.56    38.50  11.86    28.22    132.07
(7)
instruction count is estimated from a simulator (instruction profiler) based on the above mentioned RISC processor, whereas the clock cycles consumed by the writing address generator (i.e., four clock cycles for a single memory
read from the external asynchronous memory and writing
data to the coprocessor memories) are estimated separately
and added to the instruction count. Similarly, the clock cycles consumed by the ME architectures are added to estimate the frequency required. The frequency required to execute the ME process in real time, i.e., at a frame
rate of 25 frames per second, is also shown. The dynamic instruction
count of the ME process corresponding to the spatiotemporal algorithm with the parallel architecture (Section 5.1) is less
than the dynamic instruction count corresponding to the
correction algorithm with the parallel architecture. This reduction is due to the nature of the sequence [7–10]. For example, in the case of a small motion sequence (e.g., the Claire sequence), the probability that the spatiotemporal condition
of the spatiotemporal algorithm is true increases, which in-
7. Conclusions
RBSAD can be used to assess the potential match between
the CMVs and the current MB. However, the reduced dynamic range of the metric leads to somewhat reduced quality typically for newscaster kind of sequences.
References
[1] A. H. Sadka. Compressed video communications. London:
John Wiley & Sons, 2002.
[2] Z. N. Li, M. S. Drew. Fundamentals of multimedia, school of
computing science. New Delhi: Prentice-Hall of India, Private
Limited, 2005
[3] P. Kuhn. Algorithms, complexity analysis and VLSI architectures for MPEG4 motion estimation. Germany: Kluwer Academic Publishers, 2003.
[4] MPEG Compression Standard. www.mpeg.chiariglione.org
[5] V. A. Chouliaras, J. L. Nunez, D. J. Mulvaney, et al. A multistandard video coding accelerator based on a vector architecture. IEEE Trans. on Consumer Electronics, 2005, 51(1): 160–167.
[6] V. A. Chouliaras, T. R. Jacobs, J. L. Nunez-Yanez, et al. Thread
parallel MPEG-2 and MPEG-4 encoders for shared memory
multiprocessors. International Journal of Computers and Applications, 2007, 29(4): 353–361.
[7] V. A. Chouliaras, V. M. Dwyer, S. Agha. On the performance
improvement of sub-sampling MPEG-2 motion estimation algorithms with vector/SIMD architectures. Proc. of the Advanced Concepts for Intelligent Vision Systems, 2005: 595–602.
[8] V. A. Chouliaras, S. Agha, T. R. Jacobs, et al. Quantifying the
benefit of thread and data parallelism for fast motion estimation in MPEG-2. IEE Electronic Letters, 2006, 42(13): 747–748.
[9] V. A. Chouliaras, J. L. Nunez-Yanez, S. Agha. Silicon implementation of a parametric vector data path for real-time
MPEG2 encoding. Proc. of the International Association of Science and Technology for Development, 2004.
[10] V. A. Chouliaras, V. M. Dwyer, S. Agha, et al. Manolopoulos. Customization of an embedded RISC CPU with SIMD extensions for video encoding. A case study. The VLSI Journal,
2008, 41: 135–152.
[11] V. Ila, R. Garcia, F. Charot. VLSI architecture for an underwater robot vision system. Oceans Europe, 2005: 674–679.
[12] C. Hisham, K. Komal, A. K. Mishra. Low power and less area
architecture for integer motion estimation. International Journal of Electronics, Circuits and Systems, 2009, 3(1): 11–17.
[13] J. Miyakoshi, Y. Murachi, K. Hamano, et al. A low power
systolic array architecture for block-matching motion estimation. IEICE Trans. Electronics, 2005, 88(4): 559–569.
[14] H. Jong, L. Chen, T. Chieuh. Accuracy improvement and cost
reduction of three step search block matching algorithm for
video coding. IEEE Trans. Circuits and Systems for Video
Technology, 1994, 4: 88–91.
[15] J. Y. Tham, S. Ranganath, A. A. Kassim. A novel unrestricted
centre-biased diamond search algorithm. IEEE Trans. Circuits
and Systems for Video Technology, 1998, 8: 369–377.
[16] B. Chanda, D. Dutta. Digital image processing and analysis.
New Delhi: Prentice Hall of India, 2003.
[17] M. A. Awan, A. Umar, S. A. Khan. Fast adder and subtractor design (-1+j)-based complex binary numbers. Proc. of the
International Conference on World Scientific and Engineering
Academy and Society, 2005: Article no. 13.
[18] B. Ramkumar, H. M. Kittur, P. M. Kannan. ASIC implementation of modified faster carry save adder. European Journal of
Scientific Research, 2010, 42(1): 53–58.
[19] A. Th. Schwarzbacher, J. P. Silvennoinen, J. T. Timoney. Benchmarking CMOS adder structures. Proc. of the Irish
Systems and Signals Conference, 2002: 231–234.
[20] M. D. Ciletti. Advanced digital design with the Verilog
HDL. Colorado: Springs, 2005.
[21] V. M. Dwyer, S. Agha, V. Chouliaras. Low power full search
block matching using reduced bit sad values for early termina-
Biographies
Shahrukh Agha was born in 1976. He received his Ph.D. in
software and hardware techniques for accelerating
MPEG motion estimation from Loughborough University, UK, in 2006. His research interests are low
power real time VLSI implementations of digital
video and signal processing algorithms. He is currently an assistant professor in the Department of
Electrical Engineering, COMSATS Institute of Information Technology, Islamabad, Pakistan.
E-mail: shahrukh agha@comsats.edu.pk
Shahid Khan received his Ph.D. in telecommunications engineering from the UK. His research areas are electronics and radio frequency communications. He is currently a professor and dean of the Department of
Electrical Engineering in COMSATS Institute of Information Technology, Islamabad, Pakistan.
E-mail: shahidk@comsats.edu.pk
Shahzad Malik received his Ph.D. in wireless networking. His research areas are electronics and wireless networks. He is currently a professor and chairman of the Department of Electrical Engineering in
COMSATS Institute of Information Technology, Islamabad, Pakistan.
E-mail: smalik@comsats.edu.pk
Raja Riaz received his bachelor of engineering degree from National University of Sciences and Technology (NUST), Pakistan, with a Silver Medal in 1998.
He received an M.S. degree from the Center for Advanced
Studies in Engineering (CASE), University of Engineering and Technology (UET), Taxila, Pakistan,
specialising in telecommunications, in 2003. He holds
another M.S. degree from NUST, Pakistan, with
control of dynamic systems as the area of specialization. He obtained his
Ph.D. degree in the field of ultra-wideband communications from the School
of Electrical and Computer Sciences, University of Southampton, UK, in
2010. His research interests include channel coding, communication systems, and stochastic processes.
E-mail: rajaali@comsats.edu.pk