2011 Lolunsiu Etc Ieee Transcsvt

This article has been accepted for publication in a future issue of this journal, but has not been
fully edited. Content may change prior to final publication.
Improved SIMD Architecture for High Performance Video Processors
Wing-Yee Lo, Daniel Pak-Kong Lun, Member, IEEE, Wan-Chi Siu, Senior Member, IEEE, Wendong
Wang, and Jiqiang Song, Senior Member, IEEE
Abstract: SIMD execution is without doubt an efficient way to
exploit the data level parallelism in image and video applications.
However, SIMD execution bottlenecks must be tackled in order to
achieve high execution efficiency. We first analyze in this paper
the implementation of two major kernel functions of H.264/AVC
namely, SATD and subpel interpolation, in conventional SIMD
architectures to identify the bottlenecks in traditional approaches.
Based on the analysis results, we propose a new SIMD
architecture with two novel features: (1) parallel memory
structure with variable block size and word length support; and (2)
configurable SIMD structure. The proposed parallel memory
structure allows great flexibility for programmers to perform
data access of different block sizes and different word lengths. The
configurable SIMD structure allows almost random register file
access and slightly different operations in ALUs inside SIMD. The
new features greatly benefit the realization of H.264/AVC kernel
functions. For instance, the fractional motion estimation,
particularly the half to quarter pixel interpolation, can now be
executed with minimal or no additional memory access. When
comparing with the conventional SIMD systems, the proposed
SIMD architecture can have a further speedup of 2.1X to 4.6X
when implementing H.264/AVC kernel functions. Based on
Amdahl's law, the overall speedup of the H.264/AVC encoding
application can be projected to be 2.46X. We expect significant
improvement can also be achieved when applying the proposed
architecture to other image and video processing applications.
Index Terms: Configurable SIMD, Parallel memory structure,
SIMD bottlenecks, video codec processor
I. INTRODUCTION
With the extensive use of image and video information in
modern computer applications, the development of high
performance image and video processing units has attracted
Manuscript received September 23, 2009; revised April 27, 2010 and
December 13, 2010. This work was supported in part by the Hong Kong
Polytechnic University under grant no 1-BB9B. Most of the research work and
implementation development were done in the Hong Kong Applied Science and
Technology Research Institute (ASTRI) and Beijing SimpLight
Nanoelectronics Ltd.
Wing-Yee Lo, Daniel Pak-Kong Lun and Wan-Chi Siu are with the Centre
for Signal Processing of the Department of Electronic and Information
Engineering of the Hong Kong Polytechnic University, Hung Hom, Kowloon
Hong Kong. (e-mail: winnielowingyee@gmail.com; enpklun@polyu.edu.hk;
enwcsiu@polyu.edu.hk).
Wendong Wang is with the SimpLight Nanoelectronics Ltd., Beijing, China.
(e-mail: wending.wang@simplnano.com).
Jiqiang Song is with the Intel Lab, Beijing, China (e-mail:
jiqiangsong@gmail.com).
much interest from both academic researchers and VLSI
system designers. Among the image and video processing
operations that are performed in general computer applications,
video coding is the most computation intensive operation that is
often used as the benchmark to measure the performance of a
video processor. For the rest of this paper, we shall focus on the
realization of the state-of-the-art video coding standard
H.264/AVC [1] and use it as an example to illustrate the merit
of the proposed video processor design.
To deal with the extremely high computational complexity
of video coding, one common approach is to exploit the data
level parallelism (DLP) in the execution. As different from
application specific ASIC designs, a general purpose video
processor should provide great flexibility for programmers
while exploiting the parallelism in the execution. For this
reason, the Single Instruction Multiple Data (SIMD)
architecture is most suitable and is widely adopted. Two
popular examples are Intel's MMX/SSE1/SSE2/SSE3 [2] and
Motorola's AltiVec [3], where multimedia SIMD instruction
set extensions have been added for efficient realization of video
processing applications.
In recent years, many researchers studied how much
performance can be gained after using SIMD instructions in
modern video codecs [4]-[9]. Simulation results using the reference
model demonstrate a speedup of 2X to 12X. A basic
requirement for employing SIMD instructions is that multiple
data elements can be fed into vector registers in exactly the
layout required, so that the same operation can be applied to
all of them. Although much
research effort [6] [10]-[12] has been made to address the
problem, there are often overheads and performance
bottlenecks when aligning the multiple data to feed into vector
registers. Extra memory loads and stores, unpacking, packing
and shuffling are often required that prevent SIMD execution
from achieving the peak performance. Besides, the memory
mis-alignment, stride memory access, memory latency, random
register file access and branch mis-prediction also prevent the
processor from fetching data in a timely fashion to achieve peak
throughput [13]-[15].
To address the aforementioned problems, our team has
designed and implemented a new SIMD based video processor
with architecture as shown in Fig. 1. Our video processor is a
5-stage pipeline multi-threaded multi-issue semi out-of-order
superscalar processor. It supports a maximum of 4 threads of
execution simultaneously. A maximum of 4 instructions can be
Copyright (c) 2011 IEEE. Personal use of this material is permitted. However, permission to use this material for any other
purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
issued in every clock cycle. The multi-thread and multi-issue
features hide the memory latency and branch
mis-prediction penalty bottlenecks. The processor is
implemented with TSMC 0.13 µm technology. Based on our
simulation and performance data, it is capable of encoding and
decoding video sequences with CIF resolution at 90 MHz and
30 MHz respectively. The die area is about 8 mm² including
32KB instruction cache, 48KB internal SRAM memory and
8KB LUT memory. A breakdown of the area used for various
functional units is shown in TABLE I.

Fig. 1. Proposed SIMD architecture.
TABLE I
PROPOSED VIDEO PROCESSOR AREA BREAKDOWN

| Functional unit | Area (mm²) | % |
|---|---|---|
| Logic | 2.7 | 33.6 |
| Local storage: synthesized memory | 3.9 | 48.6 |
| Local storage: register file | 1.4 | 17.8 |
| Total | 8.0 | 100 |

In our design, two novel features are introduced to reduce the overheads and performance bottlenecks when executing SIMD instructions for video coding operations. Firstly, a parallel memory structure with variable block size and word length support is proposed. The structure resolves the unaligned and stride memory access SIMD bottleneck problems. Our design accounts for the fact that video coding tasks operate on byte or word data and generate word and double-word results. Using a memory interleave scheme, the proposed parallel memory structure can load a block of data with size up to double-word in no more than 4 cycles. Compared with the previous parallel memory schemes, the proposed approach provides higher flexibility in block size and data length selection with low hardware complexity. The second proposed feature is the introduction of a configurable SIMD (CSIMD) structure using a look up table (LUT). It allows almost random register access and slightly different operations in ALUs inside the SIMD unit. It eliminates the problem in some video processors that sub-word data elements resulting from previous SIMD operations cannot be retrieved and used directly without packing, unpacking, and shuffling. It is common for these video processors to perform additional memory stores and re-loads to solve the problem. The proposed feature mitigates the performance impact by successfully reducing such additional memory accesses.

With these features, the performance of the processor is significantly improved. First of all, no packing and unpacking instructions are required in our video instruction set extension. Data shuffling within the SIMD registers can be accomplished in one cycle. These properties greatly benefit the realization of many major H.264/AVC kernel functions. For instance, the fractional motion estimation, particularly the half to quarter pixel interpolation, can now be executed by the proposed SIMD structure with minimal or no additional memory accesses. Compared with conventional SIMD structures, the proposed SIMD architecture can have a further speedup of 2.1X to 4.6X when implementing H.264/AVC encoder kernel functions. As these kernel functions are often used in other image and video processing operations, the proposed SIMD architecture can be generally applied to different image and video applications with significantly improved performance.

The paper is organized as follows. In Section II, we briefly discuss the previous works on SIMD architectural bottleneck analysis. In Section III, we analyze where the SIMD structure bottlenecks are and how the SIMD structure can be enhanced. Then the two proposed features are described in Section IV. The experimental results are shown in Section V. Finally the conclusion is drawn in Section VI.

II. RELATED WORKS
To allow SIMD architectures to achieve the peak throughput
performance, many researchers find ways to increase the
efficiency of loading data from memory and aligning them
within a vector register file for SIMD execution before the data
are used. A SIMD processor, MediaBreeze, was proposed in
[13][15] to alleviate the SIMD bottleneck problem. They
proposed a multi-dimensional vector instruction named as
Breeze instruction to speed up the nested loop operations,
which are often found in video applications. However, the
Breeze instruction structure is very complicated. It needs
dedicated instruction memory and decoder to store and decode
the Breeze instruction before execution. It can only fully
exploit 5-level looping in very regular execution functions such
as full search algorithm in motion estimation. For most fast
search algorithms, the block can be in any arbitrary location. In
this case, only 2-level looping in Breeze instruction can be
applied.
Although some SIMD architectures claim to support
unaligned memory access, their approaches often have
different limitations [14]. They include the need of multiple
aligned loads followed by data shift and OR operation, not
being thread safe, extra latency for crossing cache boundaries,
etc. To deal with these problems, it is shown in [14] that an
alignment network can be added after two-bank interleaved
L1-1 cache to reduce the cache line boundary penalty. However,
such an approach only handles the unaligned memory access
problem among many SIMD bottleneck problems.
Multi-bank vector memory is also used to reduce the SIMD
architecture overheads [16]-[20]. The image data are
interleaved and loaded into multiple memory modules
sequentially. In [16], a modulo addressing mode was
introduced to allow part of the bytes in a word to be accessed
from both ends of a circular buffer to reduce external memory
bandwidth. Chang et al. [17] proposed adding one extra
memory module in addition to the number of ALUs in SIMD
processor to solve the possible memory module conflict
problem. However, the number of memory modules must be
relatively prime to the supported stride values resulting in
larger hardware cost in address generation and shuffling logic.
In [18], a scalable data alignment scheme was proposed for
rectangular block data access using simpler memory address
generation. It is achieved by using a two-dimensional notation
for both pixel location and memory module number. However,
it is not flexible enough to support variable block sizes. It is
noted that a block based data access approach is often used in
many image and video processing applications while the block
size can be different for different algorithms. Flexibility should
be provided when designing a general purpose video processor
to allow data access of variable block size and word length
without greatly increasing hardware complexity.
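The relatively-prime condition mentioned above for the scheme in [17] can be illustrated with a short sketch. It assumes the simplest interleaving, where an address is stored in module `address mod num_modules`; the function names and the 16-lane width are illustrative choices and are not taken from [17], whose actual address mapping may differ.

```python
def modules_touched(num_modules, stride, lanes=16, start=0):
    """Module index hit by each lane, assuming module = address mod num_modules."""
    return [(start + lane * stride) % num_modules for lane in range(lanes)]


def conflict_free(num_modules, stride, lanes=16, start=0):
    """True when every lane reads from a different memory module."""
    return len(set(modules_touched(num_modules, stride, lanes, start))) == lanes


if __name__ == "__main__":
    # Compare 16 modules (one per ALU) with 17 modules (one extra module).
    for stride in (1, 2, 4, 8, 16):
        print(stride, conflict_free(16, stride), conflict_free(17, stride))
```

Under this assumption, 16 modules conflict for every even stride, while 17 modules stay conflict-free for any stride that is not a multiple of 17. The price is a modulo-17 address computation and a wider shuffle network, which is the hardware cost the text refers to.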
In [19]-[20], a video signal processor with read-permuter and
write-transposer placed, respectively, before and after the
vector register file was described. They facilitate data
reorganization in SIMD register before execution, but it still
needs N cycles to do an NxN transpose operation. Seo et al. [21]
on the other hand introduced diagonal memory organization
and programmable crossbars in their SIMD architecture. The
diagonal memory organization allows the horizontal and
vertical memory access without any conflict. Due to data access
complexity in H.264 algorithm, 3 programmable crossbar
shuffle networks are added such that any data shuffle patterns
required by H.264 algorithm can be supported. However, in
order to accommodate complex data access patterns, only
predefined fixed pattern crossbars are implemented. This
limitation requires the crossbar patterns to be pre-designed
based on the algorithm. They may not be flexible enough to
realize future algorithm enhancement or support new video
coding standards efficiently. Besides, the 3 shuffle networks
make the SIMD pipeline longer which may increase the branch
mis-prediction penalty and execute-to-consume latency
between pipeline stages.
Another deficiency of the traditional approaches is that they
do not have direct support to major kernel functions in image
and video processing. This is discussed in the next section.
III. ANALYSIS
As mentioned above, we use video coding as an example to
illustrate the deficiency of the traditional SIMD architectures in
supporting image and video processing kernel functions. It is
well known that motion estimation is the most computation
intensive function in H.264/AVC encoders. It contributes to
more than 50% among all computations [5][22]. If four
reference frames are used, motion estimation alone accounts for
more than 70% of computation [22]. The next most intensive
function is DCT/IDCT, which contributes about 10-20% of
computation. For H.264/AVC decoders, the most intensive
functions are interpolation and inverse transform. They
contribute to about 20% and 5-10% of computation
respectively [7][11]. For intra-frame coders, the most
complicated functions are SATD transform for cost generation
and mode selection, intra prediction and DCT/Q/IDCT [23].
They contribute to about 57%, 20% and 16% of computation
respectively. Therefore, if we can enhance the SIMD execution
in motion estimation, transform and mode decision, the overall
performance can be improved significantly. Among these
functions, SAD, SATD, DCT/IDCT, and subpel interpolation
are the main targets. In this section, we analyze two video
encoding kernel functions in detail in order to demonstrate
where the conventional SIMD architectures can be further
enhanced. Example code from the open-source VideoLAN X264
project [23] is used to illustrate our findings. The VideoLAN X264
source uses the most popular Intel MMX/SSE1-3 instructions
to realize SIMD functions.
A. 4x4 Block SATD
We first analyze the 4x4 SATD function in H.264/AVC. The
function comprises several smaller sub-functions: memory load,
subtraction, two-dimensional (2-D) Hadamard transform,
transpose and summation. We went through the source codes of
SATD in VideoLAN X264 source [23]. The numbers of
instructions used to complete these sub-functions with different
block sizes are listed in TABLE II.

TABLE II
INSTRUCTION COUNT BREAKDOWN OF SATD, IDCT AND DCT IN VIDEOLAN X264

| Function | Block Size | Memory Load | Subtraction | 1-D 4x4 Transform | Transpose | 1-D 4x4 Transform | Others | Total | MMX or SSE2 |
|---|---|---|---|---|---|---|---|---|---|
| SATD4x4 (pixel_satd_<blk_size> functions) | 4x4 | 8 | 12 | 12 | 12 | 12 | 19 | 75 | MMX |
| | 4x8 | 16 | 24 | 24 | 24 | 24 | 38 | 150 | MMX |
| | 8x4 | 8 | 12 | 12 | 18 | 12 | 19 | 81 | SSE2 |
| | 8x8 | 16 | 24 | 24 | 36 | 24 | 38 | 162 | SSE2 |
| | 8x16 | 32 | 48 | 48 | 72 | 48 | 76 | 324 | SSE2 |
| | 16x8 | 32 | 48 | 48 | 72 | 48 | 76 | 324 | SSE2 |
| | 16x16 | 64 | 96 | 96 | 144 | 96 | 152 | 648 | SSE2 |
| DCT4x4DC (idct4x4dc functions) | 4x4 | 4 | 0 | 12 | 12 | 12 | 9 | 49 | MMX |
| IDCT4x4DC (idct4x4dc) | 4x4 | 4 | 0 | 12 | 12 | 12 | 0 | 40 | MMX |
| DCT4x4Residual (sub<blk_size>_dct functions) | 4x4 | 8 | 12 | 14 | 12 | 14 | 0 | 60 | MMX |
| | 8x8 | 16 | 24 | 28 | 36 | 28 | 3 | 135 | SSE2 |
| | 16x16 | 64 | 96 | 112 | 144 | 112 | 9 | 537 | SSE2 |
| IDCT4x4Residual (add<blk_size>_idct functions) | 4x4 | 8 | 0 | 15 | 12 | 15 | 18 | 68 | MMX |
| | 8x8 | 24 | 0 | 30 | 36 | 30 | 38 | 158 | SSE2 |
| | 16x16 | 96 | 0 | 120 | 144 | 120 | 140 | 620 | SSE2 |

The operations under the "Others" column are improved more easily by other techniques
such as enhancing the SIMD instruction set extension. Hence they
are not discussed here. It can be seen that in VideoLAN X264
source, the SIMD instructions used to realize
memory load, subtraction, the two 1-D Hadamard transforms and
transpose contribute about 75% of the total instructions in the
4x4 SATD function for different block sizes. In fact, as can be
seen in TABLE II, these sub-functions are equally important in
functions such as DCT and IDCT. Their efficient realization
is therefore decisive for the overall performance.
Although these sub-functions are very simple, conventional
SIMD architectures often cannot achieve the peak throughput
due to the following 4 reasons:
1. lack of memory block load with different data length support;
2. limited support for data shuffling;
3. requirement of carrying out the same operation by all ALUs in SIMD for each SIMD instruction execution; and
4. inability to support cross bank data access in a SIMD register file.
In the VideoLAN 4x4 SATD function, the MOVD instruction is
used to load 4 pixel data bytes from memory to the lower double
word of a 64-bit MMX register while filling the upper double
word with zeros. Two PUNPCKLBW instructions are then
used to unpack 8 data bytes from the lower double words of two
MMX registers, 4 in each, into the destination registers (see Fig.
2). The instructions convert four packed data bytes to four
packed words before subtraction. The unpacking prevents
the results from overflowing in
subsequent operations. It can be seen that memory load and
subtraction alone can consume more
than 30% of the total execution cycles of the sub-functions.
This inefficiency can be removed by loading
data bytes from memory and extending them to data words before
writing the packed words into the register.

Fig. 2. Packed word subtraction from packed byte.
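The load, unpack and subtract sequence described above can be mimicked in a few lines of Python. This is only a behavioural sketch: registers are modelled as lists of bytes or words, the helper names follow the MMX mnemonics, and `load_extend` is a hypothetical stand-in for a load that widens bytes to words on the way into the register (it is not an actual instruction of the proposed processor).

```python
def movd(mem, addr):
    """MOVD model: load 4 bytes into the low half of a 64-bit register, zero the rest."""
    return list(mem[addr:addr + 4]) + [0, 0, 0, 0]


def punpcklbw(dst, src):
    """PUNPCKLBW model: interleave the low 4 bytes of dst and src (dst byte first)."""
    out = []
    for d, s in zip(dst[:4], src[:4]):
        out += [d, s]
    return out


def as_words(reg):
    """View 8 register bytes as 4 little-endian 16-bit words."""
    return [reg[i] | (reg[i + 1] << 8) for i in range(0, 8, 2)]


def psubw(a, b):
    """PSUBW model: packed 16-bit subtraction with wrap-around."""
    return [(x - y) & 0xFFFF for x, y in zip(a, b)]


def to_signed16(w):
    return w - 0x10000 if w & 0x8000 else w


def diff_row_mmx(cur, ref, addr):
    """Conventional path: 2 loads + 2 unpacks + 1 subtract for one 4-pixel row."""
    zero = [0] * 8
    a = as_words(punpcklbw(movd(cur, addr), zero))
    b = as_words(punpcklbw(movd(ref, addr), zero))
    return [to_signed16(w) for w in psubw(a, b)]


def load_extend(mem, addr):
    """Hypothetical load that zero-extends 4 bytes to 4 words while loading."""
    return [int(b) for b in mem[addr:addr + 4]]


def diff_row_extended(cur, ref, addr):
    """Same result with 2 widening loads + 1 subtract, no unpack step."""
    return [to_signed16(w) for w in psubw(load_extend(cur, addr), load_extend(ref, addr))]
```

Both paths produce the same signed differences; the second simply has no unpack instructions left to execute, which is the saving the text argues for.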
Conventional SIMD architectures often have limited support
for data shuffling. As can be seen in TABLE II, the number of
instructions for the implementation of matrix transpose can be
as high as 22% of the total instructions for computing SATD
and other kernel functions. Note that a matrix transpose
involves no arithmetic operations but only data shuffles. Most
of these instructions are not required if a dedicated hardware
construct is provided for data shuffling. In fact, data shuffling is
required in many other parts of a video codec, which further
justifies the need for an efficient data shuffling unit. Fig. 3
shows the basic operations carried out by the VideoLAN X264
source for implementing the matrix transpose of a 4x4 block. Since
there is no dedicated hardware for matrix transpose in MMX,
the most efficient way to perform the transpose is to use different
unpack instructions, which include PUNPCKLWD,
PUNPCKHWD, PUNPCKLDQ and PUNPCKHDQ. It is seen
that 8 instructions are required to implement the matrix
transpose, most of which are unnecessary if a dedicated
data shuffling unit is available. In the actual code of the
VideoLAN X264 source, 12 instructions instead of 8 are used
for each 4x4 block matrix transpose. Extra instructions are
required to store the temporary results generated in the
computation due to the insufficient number of registers.

Fig. 3. Basic operations of 4x4 matrix transpose.
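The 8-instruction transpose can be traced with a small model of the four unpack instructions. Each MMX register is represented as a list of four 16-bit words; the function names mirror the mnemonics, but the code is a sketch of their data movement only, not of the X264 source itself.

```python
def punpcklwd(a, b):
    """Interleave the two low words of a and b."""
    return [a[0], b[0], a[1], b[1]]


def punpckhwd(a, b):
    """Interleave the two high words of a and b."""
    return [a[2], b[2], a[3], b[3]]


def punpckldq(a, b):
    """Concatenate the low double words of a and b."""
    return [a[0], a[1], b[0], b[1]]


def punpckhdq(a, b):
    """Concatenate the high double words of a and b."""
    return [a[2], a[3], b[2], b[3]]


def transpose4x4(r0, r1, r2, r3):
    """Transpose a 4x4 block of words held in four registers using 8 unpack operations."""
    t0 = punpcklwd(r0, r1)   # r00 r10 r01 r11
    t1 = punpckhwd(r0, r1)   # r02 r12 r03 r13
    t2 = punpcklwd(r2, r3)   # r20 r30 r21 r31
    t3 = punpckhwd(r2, r3)   # r22 r32 r23 r33
    c0 = punpckldq(t0, t2)   # column 0
    c1 = punpckhdq(t0, t2)   # column 1
    c2 = punpckldq(t1, t3)   # column 2
    c3 = punpckhdq(t1, t3)   # column 3
    return [c0, c1, c2, c3]
```

None of the eight steps performs arithmetic. They only move words between lanes, which is why a dedicated shuffle path can absorb them.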
Since most operations in H.264/AVC are performed in block
mode, it is obvious that the efficiency of SIMD operations can
be significantly uplifted by having all data in a block loaded
into the register before the SIMD operations take place.
Assume the bit-width of the registers is large enough such that
all data in a block can be loaded into a register. Intuitively we
expect more data can be processed at the same time. However,
this is not the case, since different data in a block often need
slightly different operations. More commonly, data in a block need
to be combined with other data in the same block. Let
us take the computation of the 2-D Hadamard transform in
SATD as an example. The transform can be realized by
applying 1-D Hadamard transform to all columns and then all
rows of a data block. A length-4 1-D Hadamard transform is
defined as:
$$ Y = H \cdot X \tag{1} $$

where X is the input 4x4 data block and Y is the transformed
output. H is the transform matrix and is given by

$$ H = \begin{bmatrix} 1 & 1 & 1 & 1 \\ -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 \end{bmatrix}. $$

Its application to the columns of a 4x4 data
block can be implemented with the steps as shown in (2)-(7):

$$ A(i, j) = X(i, j) + X(i+1, j), \quad i = 0, 2,\ j = 0\text{-}3. \tag{2} $$
$$ B(i, j) = X(i, j) - X(i-1, j), \quad i = 1, 3,\ j = 0\text{-}3. \tag{3} $$
$$ Y(i, j) = A(i, j) + A(i+2, j), \quad i = 0,\ j = 0\text{-}3. \tag{4} $$
$$ Y(i, j) = B(i, j) + B(i+2, j), \quad i = 1,\ j = 0\text{-}3. \tag{5} $$
$$ Y(i, j) = A(i, j) - A(i-2, j), \quad i = 2,\ j = 0\text{-}3. \tag{6} $$
$$ Y(i, j) = B(i, j) - B(i-2, j), \quad i = 3,\ j = 0\text{-}3. \tag{7} $$

Fig. 4. Basic operations in 1-D Hadamard transforms.
Fig. 5. 1-D Hadamard transforms in 256-bit register.
Fig. 4 shows the basic operations carried out by the
VideoLAN X264 source for implementing the 1-D Hadamard
transforms. Since each Intel MMX register can only handle 4
data words at the same time, 8 PADDW / PSUBW instructions
are required to complete the transform for all columns of a 4x4
data block. In fact, 12 instructions rather than 8 are used in
the actual source code. This is again due to the insufficient number of
registers, which requires extra instructions for temporary
result storage.
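The count of 8 packed additions and subtractions follows directly from steps (2)-(7). The sketch below models each 4-word register as a Python list and checks the butterfly result against a direct matrix product. It assumes that the operators lost from (3), (6) and (7) in this copy are subtractions of the lower-indexed row, and the matrix `H` is the one implied by that reading; both are assumptions of this sketch.

```python
def paddw(a, b):
    """Packed add of two 4-word registers (no overflow modelling)."""
    return [x + y for x, y in zip(a, b)]


def psubw(a, b):
    """Packed subtract of two 4-word registers (no overflow modelling)."""
    return [x - y for x, y in zip(a, b)]


def hadamard_columns(X):
    """1-D Hadamard transform of every column of a 4x4 block, 8 packed operations."""
    a0 = paddw(X[0], X[1])   # (2), i = 0
    a2 = paddw(X[2], X[3])   # (2), i = 2
    b1 = psubw(X[1], X[0])   # (3), i = 1
    b3 = psubw(X[3], X[2])   # (3), i = 3
    y0 = paddw(a0, a2)       # (4)
    y1 = paddw(b1, b3)       # (5)
    y2 = psubw(a2, a0)       # (6)
    y3 = psubw(b3, b1)       # (7)
    return [y0, y1, y2, y3]


# Transform matrix implied by the steps above (assumed sign convention).
H = [[1, 1, 1, 1],
     [-1, 1, -1, 1],
     [-1, -1, 1, 1],
     [1, -1, -1, 1]]


def matmul(A, B):
    """Plain 4x4 matrix product, used only as a reference."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]
```

Each call to `paddw` or `psubw` stands for one MMX instruction working on one row of four words, so the eight calls correspond to the eight instructions counted in the text.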
To speed up the computation, an intuitive solution is to
increase the bit-width of the register such that all data in a block
can be processed at the same time. Assume now we have a
256-bit register as shown in the upper part of Fig. 5 such that all
16 data words of a block can be loaded into this register. We
expect that all 16 data can be processed at the same time, but in
fact they cannot. Based on (2)-(7), the operations to be
performed for implementing the 1-D Hadamard transforms are
shown in the lower part of Fig. 5. We can see that each data
word requires adding itself with, or subtracting itself from
another word lane in the register. This requirement deviates
from the operations performed by traditional SIMD structures,
which require all ALUs in the SIMD unit to perform exactly the
same operation whenever a SIMD instruction is executed.
Besides, each ALU must retrieve its operands from its own
register bank. Although later multimedia extensions add new
instructions to support cross bank operand retrieval, these
instructions are either limited to a few retrieval patterns or require
additional instructions to configure the retrieval pattern in
another register before use. This means that even if we
can throw in extra resource to provide long bit-width registers,
the problem cannot be resolved without a redesign of the SIMD
architecture. We show in Section IV how the proposed SIMD
architecture handles these problems.
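To make the two requirements concrete, the sketch below holds a whole 4x4 block in one 16-lane register (lane = 4*row + column) and performs each butterfly stage as a single operation in which every lane has its own partner lane and its own add/subtract selection. The table format is purely illustrative and is not the LUT encoding of the proposed CSIMD unit; the sign convention assumes that the operators lost from (3), (6) and (7) are subtractions of the lower-indexed row.

```python
def lane_op(reg, partner, subtract):
    """One 16-lane operation: each lane combines itself with its own partner lane."""
    return [reg[k] - reg[partner[k]] if subtract[k] else reg[k] + reg[partner[k]]
            for k in range(len(reg))]


def stage_tables(distance):
    """Per-lane partner and add/sub tables for a butterfly over rows `distance` apart."""
    partner, subtract = [], []
    for lane in range(16):
        row = lane // 4
        lower = (row // distance) % 2 == 1          # second member of the row pair
        partner.append(lane - 4 * distance if lower else lane + 4 * distance)
        subtract.append(lower)
    return partner, subtract


def hadamard_16lane(reg):
    """Column Hadamard transform of a 4x4 block in two 16-lane operations."""
    stage1 = lane_op(reg, *stage_tables(1))         # steps (2) and (3)
    return lane_op(stage1, *stage_tables(2))        # steps (4) to (7)
```

Half of the lanes add and half subtract in the same operation, and every lane reads an operand from another row. A conventional SIMD unit, which applies one operation to all lanes and reads each operand from the lane's own bank, cannot express either stage as one instruction.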
B. Fractional Motion Estimation
The fractional motion estimation is one of the new features of H.264/AVC. To illustrate the difficulty of realizing fractional motion estimation using SIMD structures, let us first recall the procedures for half and quarter pixel interpolation. As shown in Fig. 6, the half pixels located between two adjacent integer pixels (e.g. k) are interpolated by applying a 6-tap filter using three upper and three lower (A, C, G, M, R, T), or three left and three right integer pixels (E, F, G, H, I, J). After all half pixels adjacent to integer pixels are generated, the half pixels located between half pixels (e.g. e) are interpolated by applying the 6-tap filter using either three upper and three lower half pixels (a, b, d, f, g, h) or three left and three right pixels (i, j, k, m, n, q). Once all half pixels are generated, quarter pixels can be interpolated. Those located adjacent to integer and half pixels (e.g. 0), and between half pixels (e.g. 4) are estimated by linear interpolation with the corresponding horizontal and vertical pixels. The remaining quarter pixels (e.g. 13) are linearly interpolated with two diagonally adjacent half pixels (e.g. d, k).

The generation of quarter pixels requires the strict order of pixel generation described above. If the previously generated pixels are stored arbitrarily, the later pixel interpolation cannot easily be carried out with SIMD instructions without memory loads and stores, packing or shuffling beforehand. For example, as shown in Fig. 6, half pixel d works with half pixels k and m to generate quarter pixels 13 and 14, respectively. If, say, k and m are stored in the same row, they must be in different banks, so d cannot be stored in the same bank as both k and m. In general, it is highly likely that some of the quarter pixel interpolations have to be performed with half pixels stored in different rows of different register banks. Such nearly random register access imposes great difficulty for traditional SIMD executions. Note that at the time the half pixels are generated and stored, the quarter pixels required to be generated are still unknown. It is thus very difficult to devise an optimized storage plan for the half pixel data to solve the abovementioned problem. For this reason, the conventional way to interpolate the quarter pixels is to store the interpolated half pixels back to memory and then reload them with unpacking, packing and shuffling before further execution. This greatly reduces the SIMD execution throughput. In the following section, we show how the novel features of the proposed SIMD architecture handle the problems.

Fig. 6. Subpel interpolation.

IV. PROPOSED FEATURES
A. New Parallel Memory Structure

Most video applications process the acquired video data in
units of blocks. Hence, to avoid frequent memory access, it is
always desirable to load all data in a block to registers before
further operations take place. However, traditional memories
only allow sequential access. For this reason,
multi-bank or parallel memory structures were proposed to
allow multiple data access concurrently [16]-[20]. Similar to
the previous approaches, the proposed SIMD architecture is
equipped with a 32KB parallel memory structure that serves as a
buffer between the external memory and the register file as
shown in Fig. 7. The parallel memory is divided into 16
modules, each of which has a size of 2K bytes and has a
separate data bus connected to one of the 16 banks of a
register file. Each register bank has 32 rows and each element
of a register bank can store a 16-bit word (in fact, the register
file is constructed from thirty-two 256-bit registers).
Fig. 7. Proposed parallel memory structure. (Blocks shown: External Memory; Parallel Memory Structure, 16 banks x 2048 x 8 bits; Register File, 16 banks x 32 x 16 bits, and 16 ALUs.)

Fig. 8 shows the relationship between the logical offset address and physical address we have defined in the proposed parallel memory structure. In the proposed architecture, the logical address is unique for a memory location and the physical address is the real address generated to every memory module for data access. The data from external memory are interleaved and loaded to the internal parallel memory modules such that when accessing a data block, only one data from each memory module needs to be retrieved, no matter where the block is located. Fig. 9 shows how the pixels of an image are stored in different memory modules to facilitate data retrieval of different block sizes. In the figure, the characters grouped inside a dashed square block refer to the 16 memory modules that can be accessed by the same physical address. The numbers 0-9 and letters a-f denote the memory bank number (a-f stands for bank 10-15 respectively). To show that there is no access conflict when loading a block of data from the parallel memory to the register, a few examples are shown in Fig. 9. The characters grouped inside a solid square block refer to the data blocks to be retrieved to the register. It can be seen that for both 4x4 (Fig. 9a) and 8x2 (Fig. 9b) block accesses, one data will be retrieved from each parallel memory module no matter where the block is located (the data loading method for 2x8 block is similar to that in Fig. 9b except that the memory module assignment is transposed). However, it can be seen that the required data may be stored in different physical addresses (the pixels span different dotted boxes). An efficient address generator is needed to determine the required physical address for each memory module.

Fig. 8. Parallel memory logical offset address and physical address.
Fig. 9. Memory interleave to allow block access. (a) (b)

In fact, the data loading from external memory to internal memory modules is performed by a direct memory access (DMA) unit following mapping functions as described below. Let $A_s$ be the starting address of the part of a video frame to be loaded into the parallel memory and $A_f$ be the address of a pixel within that part of the video frame. Then

$$ A_f = A_s + A_{off} \tag{8} $$

where $A_{off}$ is the offset of the address of that pixel from the starting address. Assume that the video frame has the size of $N_x$ columns and $N_y$ rows. Then $A_{off}$ can always be written as

$$ A_{off} = y N_x + x \tag{9} $$

for x = 0 to $N_x - 1$ and y = 0 to $N_y - 1$. Alternatively, the indices x and y can be obtained from $A_{off}$ by:

$$ y = \lfloor A_{off} / N_x \rfloor \quad \text{and} \quad x = A_{off} \bmod N_x \tag{10} $$

where $\lfloor \cdot \rfloor$ is the floor function and $a \bmod b$ stands for a modulo b.
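Let {m, p} be the module number and the physical address
respectively of the parallel memory structure as shown in Fig. 8.
By inspecting Fig. 9, the mapping functions that the DMA unit
should use for loading the data to the parallel memory are:
For 4x4 block loading,

$$ p = \lfloor y/4 \rfloor \cdot N_x/4 + \lfloor x/4 \rfloor \tag{11} $$
$$ m = (x \bmod 4) + 4 \cdot (y \bmod 4) \tag{12} $$

For 8x2 block loading,

$$ p = \lfloor y/2 \rfloor \cdot N_x/8 + \lfloor x/8 \rfloor \tag{13} $$
$$ m = (x \bmod 8) + 8 \cdot (y \bmod 2) \tag{14} $$

Similarly, for 2x8 block loading,

$$ p = \lfloor y/8 \rfloor \cdot N_x/2 + \lfloor x/2 \rfloor \tag{15} $$
$$ m = (x \bmod 2) + 2 \cdot (y \bmod 8) \tag{16} $$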
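The three pairs of mapping functions share one pattern, which the following sketch parameterises by block width `bw` and height `bh` (4x4, 8x2 and 2x8 correspond to (4, 4), (8, 2) and (2, 8)). The function names and the assumption that the frame width `nx` is a multiple of `bw` belong to this sketch, and the 8x2 module formula follows the pattern of (12) and (16).

```python
def map_pixel(x, y, nx, bw, bh):
    """Return (module m, physical address p) for pixel (x, y) of a frame nx pixels wide."""
    p = (y // bh) * (nx // bw) + (x // bw)   # pattern of (11), (13), (15)
    m = (x % bw) + bw * (y % bh)             # pattern of (12), (14), (16)
    return m, p


def block_access(xs, ys, nx, bw, bh):
    """(m, p) pairs touched when reading the bw x bh block whose first pixel is (xs, ys)."""
    return [map_pixel(xs + xo, ys + yo, nx, bw, bh)
            for yo in range(bh) for xo in range(bw)]


def is_conflict_free(xs, ys, nx, bw, bh):
    """True when the 16 pixels of the block come from 16 different modules."""
    modules = [m for m, _ in block_access(xs, ys, nx, bw, bh)]
    return len(set(modules)) == bw * bh


if __name__ == "__main__":
    # An unaligned 4x4 block in a 16-pixel-wide frame still hits every module once.
    print(sorted(m for m, _ in block_access(3, 5, 16, 4, 4)))
    print(sorted({p for _, p in block_access(3, 5, 16, 4, 4)}))
```

Because a block of width `bw` and height `bh` always covers every residue of x modulo `bw` and of y modulo `bh`, the 16 module numbers are distinct wherever the block starts. Only the physical address differs between modules.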
Once the block size is known, the DMA unit will load the
data from external memory to the parallel memory following
the respective mapping functions. Then data can be retrieved
from the parallel memory to the register efficiently. Assume
that a 4x4 block with indices of the first pixel be {xs, ys} is to be
retrieved. The pixels in the block can be described by {xs+xo,
ys+yo}, where xo, yo = 0 to 3. Following from (11),
p ( y s y 0 ) / 4 * N x / 4 ( x s x 0 ) / 4
(17)
Let y s y s ' y s
and x s x s ' x s
such that y s ' and xs '
will be always divisible by 4. (17) can then be written as

p y s ' / 4 y s 4 y o / 4 * N s / 4
xs ' / 4 xs
xo / 4
(18)
The two floor functions in (18) can only be equal to 0 or 1.

Therefore, for any 4x4 block with indices of the first pixel be
{xs, ys}, all data of the block are stored in at most 4 different
physical addresses only in the parallel memory. So if the first
data is stored in ps, the rest must not be stored in address other
than {ps+1, ps+Nx/4, ps+Nx/4+1}. More specifically, let
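$$q_y = \lfloor (\langle y_s \rangle_4 + y_o)/4 \rfloor \quad \text{and} \quad q_x = \lfloor (\langle x_s \rangle_4 + x_o)/4 \rfloor \tag{19}$$
then
$$p = \begin{cases} p_s & q_y = 0 \text{ and } q_x = 0 \\ p_s + 1 & q_y = 0 \text{ and } q_x = 1 \\ p_s + N_x/4 & q_y = 1 \text{ and } q_x = 0 \\ p_s + N_x/4 + 1 & q_y = 1 \text{ and } q_x = 1 \end{cases} \tag{20}$$
The following further shows that, given only the indices of the
first pixel {xs, ys}, we can easily determine the physical address
for each module. Again we use the 4x4 block retrieval as an
example. From (12), it can be seen that:
$$m = \langle x_s + x_o \rangle_4 + 4 \langle y_s + y_o \rangle_4 \tag{21}$$
$$\langle m \rangle_4 = \langle x_s + x_o \rangle_4 \;\Rightarrow\; x_o = \langle \langle m \rangle_4 - \langle x_s \rangle_4 \rangle_4 \tag{22}$$
Note that m = 0 to 15 is the index to the 16 parallel memory
modules. Substituting (22) into (21), we have
$$m = \langle x_s + \langle \langle m \rangle_4 - \langle x_s \rangle_4 \rangle_4 \rangle_4 + 4 \langle y_s + y_o \rangle_4 \tag{23}$$
$$m = \langle m \rangle_4 + 4 \langle y_s + y_o \rangle_4 \tag{24}$$
Hence
$$\langle y_s + y_o \rangle_4 = (m - \langle m \rangle_4)/4 = \lfloor m/4 \rfloor \tag{25}$$
$$y_o = \langle \lfloor m/4 \rfloor - \langle y_s \rangle_4 \rangle_4 \tag{26}$$
Substituting (26) and (22) into (19), we can express qx and qy in
terms of m, xs and ys as follows:
$$q_y = \lfloor q_y'/4 \rfloor \quad \text{where} \quad q_y' = \langle y_s \rangle_4 + \langle \lfloor m/4 \rfloor - \langle y_s \rangle_4 \rangle_4 \tag{27}$$
$$q_x = \lfloor q_x'/4 \rfloor \quad \text{where} \quad q_x' = \langle x_s \rangle_4 + \langle \langle m \rangle_4 - \langle x_s \rangle_4 \rangle_4 \tag{28}$$
Since qx and qy must be equal to {0, 1}, (20) can be written as:
$$p = \begin{cases} p_s & q_y' < 4 \text{ and } q_x' < 4 \\ p_s + 1 & q_y' < 4 \text{ and } q_x' \ge 4 \\ p_s + N_x/4 & q_y' \ge 4 \text{ and } q_x' < 4 \\ p_s + N_x/4 + 1 & q_y' \ge 4 \text{ and } q_x' \ge 4 \end{cases} \tag{29}$$
The evaluation of qy and qx is very simple. The modulo
function $\langle \cdot \rangle_4$ can be implemented by extracting the last 2 bits
of the number. $\lfloor m/4 \rfloor$ can be implemented by shifting m to the
right by 2 bits. The addition and subtraction can be
implemented by a small adder. Finally, the comparison of
qy' and qx' with the constant 4 can be implemented by
checking the output carry bit of the small adder. All the above
can be implemented with fewer than 20 logic gates, which is
negligible as far as a video processor is concerned. The above
derivation can also be applied to 2x8 and 8x2 block retrieval:
a. 8x2 block access
$$q_y = \lfloor (\langle y_s \rangle_2 + \langle \lfloor m/8 \rfloor - \langle y_s \rangle_2 \rangle_2)/2 \rfloor = \lfloor q_y'/2 \rfloor \tag{30}$$
$$q_x = \lfloor (\langle x_s \rangle_8 + \langle \langle m \rangle_8 - \langle x_s \rangle_8 \rangle_8)/8 \rfloor = \lfloor q_x'/8 \rfloor \tag{31}$$
$$p = \begin{cases} p_s & q_y' < 2 \text{ and } q_x' < 8 \\ p_s + 1 & q_y' < 2 \text{ and } q_x' \ge 8 \\ p_s + N_x/8 & q_y' \ge 2 \text{ and } q_x' < 8 \\ p_s + N_x/8 + 1 & q_y' \ge 2 \text{ and } q_x' \ge 8 \end{cases} \tag{32}$$
b. 2x8 block access
$$q_y = \lfloor (\langle y_s \rangle_8 + \langle \lfloor m/2 \rfloor - \langle y_s \rangle_8 \rangle_8)/8 \rfloor = \lfloor q_y'/8 \rfloor \tag{33}$$
$$q_x = \lfloor (\langle x_s \rangle_2 + \langle \langle m \rangle_2 - \langle x_s \rangle_2 \rangle_2)/2 \rfloor = \lfloor q_x'/2 \rfloor \tag{34}$$
$$p = \begin{cases} p_s & q_y' < 8 \text{ and } q_x' < 2 \\ p_s + 1 & q_y' < 8 \text{ and } q_x' \ge 2 \\ p_s + N_x/2 & q_y' \ge 8 \text{ and } q_x' < 2 \\ p_s + N_x/2 + 1 & q_y' \ge 8 \text{ and } q_x' \ge 2 \end{cases} \tag{35}$$
For actual implementation, each parallel memory module is
installed with an address generation unit as shown in Fig. 10 for
the implementation of (29), (32), or (35) based on the selected
block size. An additional address generation unit is responsible
for pre-computing the 4 possible physical addresses in each
case. A 4-to-1 multiplexer is installed in each address
generation unit for selecting one of the supplied addresses
based on the results of (29), (32), or (35).

[Figure: 16 memory modules (m = 0 to 15), each with its own address generation unit (Add. Gen.) taking xs, ys and bs as inputs; the additional address generation unit takes xs, ys, bs, Nx, ds and dsb.]
Fig. 10. Address generation for parallel memory structure.
Note: bs is the block size; refer to (36) for the definition of ds and dsb.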
Fig. 11 shows an example of loading 352x40 pixels of a CIF
image from external memory into 16 internal memory modules.
Again we assume that 4x4 block access is selected, hence the
data loading follows (11) and (12). The numbers in the
figure are the physical addresses. To load the 16 pixels of the
upper 4x4 block in Fig. 11, physical addresses 174, 174, 174,
173, 174, 174, 174, 173, 86, 86, 86, 85, 86, 86, 86, 85 are
generated by the address generation units installed with
memory modules 0 to 15, respectively, following (27), (28), and
(29). The lower 4x4 box in Fig. 11 depicts the actual
implementation when the block access crosses the last row
(assume the last physical address in each memory module is 879). It
can be seen that the data accesses wrap back to the
beginning of the buffer. The physical addresses generated for
each memory module in this case are 3, 3, 3, 2, 795, 795, 795,
794, 795, 795, 795, 794, 795, 795, 795, 794.
Fig. 11. Memory module physical addresses for CIF image pixels.

It is known that the dominant word lengths in video
applications are 8 and 16 bits [26]. This is because most video
kernel functions take in pixel operands (8 bits in size) for
computation and generate 16-bit data words. Temporary
results with the size of a word or even a double word may be stored
back to the parallel memory structure from the registers and
retrieved later for further processing. Traditional parallel
memory structures do not explicitly support word or double word
data access. Extra packing, unpacking and shifting instructions
may be required to achieve word and double word data access,
which lowers the efficiency of execution. The vector memory access
in our architecture can easily be extended to support word and double
word memory access. When storing a data word from the
registers back to the parallel memory structure, its bytes are
stored to successive physical addresses of the same module.
The case for double word access is similar: the four bytes of a
double word are stored to four consecutive physical
addresses of the same module. For example, if the first byte of a
double word is stored into physical address 0x002 (logical
address 0x22) of module #2 in Fig. 8, the other 3 bytes will be
stored into addresses 0x003, 0x004, and 0x005 (logical
addresses 0x32, 0x42 and 0x52) of module #2. In general, when
accessing vector data of different sizes in the parallel memory
(either storing or retrieval), we use the following equations,
which are modified from (29) (again the 4x4 case is used
as an example without loss of generality):
$$p = \begin{cases} p_s + dsb & q_y' < 4 \text{ and } q_x' < 4 \\ p_s + 1 + dsb & q_y' < 4 \text{ and } q_x' \ge 4 \\ p_s + (ds \cdot N_x)/4 + dsb & q_y' \ge 4 \text{ and } q_x' < 4 \\ p_s + (ds \cdot N_x)/4 + 1 + dsb & q_y' \ge 4 \text{ and } q_x' \ge 4 \end{cases} \tag{36}$$
ds stands for the data size, which has value 1, 2 or 4 for data retrieval
of bytes, words, or double words, respectively. dsb is the index
of the byte to be accessed within a byte, word or double word.
ps is the physical address of the first pixel. Note that there is no
change to the evaluation of qx and qy. This means that the 16
address generators associated with the parallel memory
modules remain the same. Only the additional address
generator is involved in taking care of the changes in word length.
This greatly reduces the complexity of the hardware structure
for address generation. Overall, a 4x4 block of bytes, words
or double words can be retrieved from the parallel memory
modules or stored back to the parallel memory modules in 1, 2
and 4 cycles, respectively. The same applies to the 2x8 and 8x2
block access cases.
Compared with previous parallel memory structures for
video processing, such as [18], the proposed approach allows
more flexibility in data access. It supports not only data of
different word lengths but also data access of
different block sizes, such as 2x8, 4x4 and 8x2. This would be
difficult to achieve in [18] since the memory modules are
hardwired to a specific two-dimensional form. Although more
address generators are needed in the proposed approach, each
of them is so simple that its complexity is negligible as far as a
video processor is concerned. The support of word and double
word accesses is a unique feature that cannot be found in the
previous parallel memory structures [16]-[20].
B. Configurable SIMD
The second new feature of the proposed SIMD architecture
is the configurable SIMD [24], which provides two useful
functions: almost random RF access and MIMD-like (Multiple
Instruction Multiple Data) execution support. The SIMD
register file in the proposed SIMD architecture is viewed as 16
banks, each of which has 32 entries specified by the row address.
The almost random RF access in the proposed SIMD
architecture is supported by a mux control unit placed between
the registers and ALUs, as shown in Fig. 12, and the RF row
address control unit. Inside the mux control unit, there is a
crossbar switch by which each ALU can retrieve data from
any register bank. The full crossbar switch supports any
operand shuffling pattern used in matrix transpose, matrix
multiplication, subpel interpolation, luma intra prediction and
other operations in video applications. The RF row address
control unit is used to generate 16 row addresses, one for each RF
bank. With both the mux control unit and the RF row address
control unit, almost random RF access is supported.

Fig. 12. The proposed SIMD architecture with switching control.
Although two operands can be read from a register bank at
the same time, we impose several architectural constraints in
order to save hardware cost. Firstly, the crossbar switch
supports the shuffling of one operand only. That is, if a SIMD
operation requires two operands, one of them must still be
retrieved from the ALU's own bank. Secondly, the crossbar switch
allows a maximum of one data item coming from another register
bank, to prevent the register bank conflict problem and to
limit the number of register bank output ports to two.
Thirdly, the crossbar only shuffles word-size operands. If a
double-word-size operand shuffle is needed, two SIMD
instructions are used to shuffle the whole operand. Finally, the
computed result is restricted to be written back to the ALU's own RF
bank. Because of these hardware constraints, configurable
SIMD allows almost, but not fully, random RF access.
To support random RF access in configurable SIMD, a
look-up table named the configurable SIMD look-up table
(CSLUT) is introduced. The CSLUT is made up of 5 memory
modules. Their logical and physical addresses as well as their
structure are shown in Fig. 13. There are three major
configuration data in the table. The 80-bit row address
configuration data and the 64-bit bank configuration data specify
the register row addresses and register bank numbers of the 16
operands to be retrieved, respectively. The mux control unit
takes in the 64-bit bank configuration data from the CSLUT so that
each operand of an ALU can be retrieved from any bank. The RF
address control unit takes in the 80-bit row address configuration
data from the CSLUT and generates 16 row addresses, one for each RF
bank. If only the bank number configuration data in the CSLUT is
used, 16 operands on the same row are retrieved from the different
banks specified in the bank configuration data. If only the
row address configuration data in the CSLUT is used, each ALU in
the SIMD takes one operand from its own RF bank at the row address
specified in the row configuration data. Using both row and
bank configuration data, the operand in any row address can be
retrieved from any bank to achieve random RF access.

Fig. 13. CSLUT for configurable SIMD instruction.

Besides near random RF access, the proposed configurable
SIMD also provides MIMD-like execution support, i.e. it
allows a minor difference in operation among ALUs. To
accommodate this, a 16-bit miscellaneous column is introduced
into the CSLUT for indicating the slightly different operations
to be performed among the ALUs. For example, we can use this
column to define whether an addition or a subtraction is to be
performed by each ALU in a SIMD processor. This is useful in
many fast transform algorithms, including the Hadamard
transform used in the 4x4 SATD function, as will be shown in the
next section.
To access the table, a set of so-called CSIMD (Configurable
SIMD) instructions is provided in the instruction set. These
instructions have a particular field to store the address for
accessing a particular entry in the CSLUT. The formats of typical
and CSIMD instructions are shown in Fig. 14 for comparison. In
the figure, CMD is the instruction opcode. The MISC field
specifies execution controls such as operand shift bits, zero or
sign extension options, etc. For typical instructions, the register
row addresses of the two sources and one destination are specified
by RS1, RS2 and RD, respectively. Every ALU is required to
get two operands at row addresses RS1 and RS2 from its own
register bank and to write the execution
result to row address RD of its own register bank. That is,
typical instructions allow neither cross register bank data access
nor different row data access. For CSIMD instructions, the
CSLUT address is specified in the CSLUT_ADDR field. Each
ALU can get one of the operands at any row address of any
register bank specified in the CSLUT. Furthermore, slightly
different operations are allowed to be executed among the
ALUs as mentioned above. These features provide great flexibility that
fully addresses the problems of traditional SIMD architectures
as mentioned above.

Fig. 14. Instruction fields of a typical and CSIMD instruction.
A methodology similar to configurable SIMD is also
proposed in a patent application [25]. Compared to this patent,
the proposed configurable SIMD has additional advantages.
First of all, the configuration data in the patent is not stored in
an SRAM-based LUT but in a programmable logic array (PLA).
Hence the extent of reconfiguration is limited. That is also why
the patent design needs extra data called Pseudo Static Control
Information (PSCI), in addition to the configuration data retrieved
from the instruction field, to generate the reconfiguration data. The
PSCI dictates aspects of the functionality and behavior of
the execution unit and crossbar interconnect. It cannot be
dynamically reconfigured on a per-cycle basis via instructions. Instead,
a dedicated PSCI-setting instruction is used to update the PSCI
data from time to time. On the other hand, the proposed
configurable SIMD uses SRAM as configuration data storage,
which allows a much larger extent of reconfiguration. The
reconfiguration can be done dynamically on a per-cycle basis by
getting the look-up table entry address from the instruction.
Besides, the crossbar in the patent design is controlled totally by
PSCI data. It only provides shuffling of operands read from the
register file location specified by the source operand address
instruction field. It cannot allow random register file access as
the proposed approach does.

Fig. 15. 4x4 1-D Hadamard transform using only two CSIMD instructions.
In the following subsections, we demonstrate how the
implementation of H.264/AVC kernel functions is made simple
by the new CSIMD structure. We particularly use SATD and
fractional motion estimation as examples although similar
improvement can also be achieved in other kernel functions
such as intra prediction. To simplify our discussion, we assume
that video data are accessed in the form of 4x4 blocks.
Operations involving larger data blocks are composed by
combining the results of the constituting 4x4 blocks.
1) SATD Computation:
A SATD computation consists of data load, subtraction, 2-D
4x4 Hadamard transform, matrix transposes, taking the absolute
values of the transformed data and summation. Let us first consider the
realization of 2-D 4x4 Hadamard transform. As discussed
above, a 2-D 4x4 Hadamard transform can be implemented by
four length-4 1-D Hadamard transforms applied to the rows and
followed by another four applied to the columns. Fig. 4 shows
that at least 8 instructions are needed to perform each set of four
1-D Hadamard transforms using the MMX instruction set due
to insufficient register bit-width. We have also shown in Fig. 5
that even if we have the resource to install registers with
sufficient bit-width such that all data of a block can be loaded
into a register, we still cannot easily implement the 1-D
Hadamard transforms using SIMD instructions since different
operations are performed in different register banks and they
may require operands from different register banks. For the
proposed SIMD architecture, we use only 2 CSIMD
instructions to realize each set of four 1-D Hadamard
transforms as shown in Fig. 15. Before execution, the 4x4 input
data is placed in 256-bit SIMD register in, say, row 5. Each
CSIMD instruction takes one operand from its own RF bank
and one operand from other bank to perform either addition or
subtraction. For example, in the first CSIMD instruction, the
ALU0 (right one) takes data X00 from its own bank and data
X10 in bank 4 of row 5 to perform addition. ALU 3 (the fourth
one from right) takes data X10 from its own bank and data X00
in bank 0 of row 5 to perform subtraction. All configuration
information is specified in the row, bank and misc memory content
of the CSLUT. The misc configuration in the CSLUT is used
to specify whether addition (e.g. 1) or subtraction (e.g. 0) is
performed. The complete configuration data in the CSLUT to
perform each set of four length-4 1-D Hadamard transforms is
shown in TABLE III. Such a feature provides great flexibility in
program design and in turn leads to a reduction in SIMD
instructions in the program.
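The effect of the two CSIMD instructions can be checked with a small Python model that uses the bank and misc values of TABLE III. The sketch rests on assumptions that are not stated in the text: element X(r,c) of the 4x4 block is taken to sit in bank 4r+c (consistent with X10 being in bank 4), a misc bit of 0 is taken to mean "cross-bank operand minus own operand", and the function names are hypothetical.

```python
# TABLE III re-indexed by ALU number 0..15 (the table lists ALU f down to 0).
BANK_1 = [0x4, 0x5, 0x6, 0x7, 0x0, 0x1, 0x2, 0x3,
          0xC, 0xD, 0xE, 0xF, 0x8, 0x9, 0xA, 0xB]
MISC_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
BANK_2 = [0x8, 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF,
          0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7]
MISC_2 = [1] * 8 + [0] * 8


def csimd_addsub(row, bank_cfg, misc_cfg):
    """One CSIMD instruction: ALU i combines its own bank with bank_cfg[i]."""
    out = []
    for i in range(16):
        own, other = row[i], row[bank_cfg[i]]
        # misc = 1: addition; misc = 0: subtraction (operand order is assumed)
        out.append(own + other if misc_cfg[i] else other - own)
    return out


def hadamard_two_csimd(row5):
    """Four length-4 Hadamard transforms in two CSIMD instructions."""
    row6 = csimd_addsub(row5, BANK_1, MISC_1)   # first instruction reads row 5
    return csimd_addsub(row6, BANK_2, MISC_2)   # second instruction reads row 6


def hadamard_reference(row5):
    """Direct product with the Sylvester-ordered 4x4 Hadamard matrix."""
    h = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
    return [sum(h[k][r] * row5[4 * r + c] for r in range(4))
            for k in range(4) for c in range(4)]
```

Under these assumptions the two-instruction result equals the direct matrix product for any input, with each of the four transforms running over banks c, c+4, c+8 and c+12.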
The full crossbar switch also greatly enhances the
performance of matrix transpose. Referring to TABLE II, the
instruction counts to perform a 4x4 block transpose in
VideoLAN X264 are 12 and 9 when using MMX and SSE2,
respectively. With CSIMD, a 4x4 block transpose can be
carried out in one clock cycle. It is actually one of the shuffling
operations supported by the full crossbar switch. In fact, when
actually implementing the SATD function, the matrix transpose
operation is embedded into the second 1-D 4x4 Hadamard
transform. That is, we do not need to dedicate a CSIMD
instruction to the transpose operation; it is done
together with the second 1-D 4x4 Hadamard transform. Overall,
the proposed SIMD architecture takes only 4 instructions to
perform the first two steps of a 4x4 SATD function (from
memory load to 2-D Hadamard transform) while VideoLAN
X264 takes 56 instructions to do the same. In fact, the proposed
CSIMD structure can also greatly benefit the implementation of
a few other similar functions of H.264/AVC and MPEG4, such
as 4x4 IDCT/DCT and 4x4 matrix multiplication. In both cases,
the proposed SIMD architecture takes only 2 instructions to
finish.
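As a hedged illustration of why the transpose needs only one pass through the crossbar, the Python sketch below expresses it as a bank configuration. The names are hypothetical, and the layout (element X(r,c) in bank 4r+c) is an assumption carried over from the SATD example.

```python
# Bank configuration for a 4x4 transpose: the ALU at position (r, c),
# i.e. ALU 4r+c, reads the bank holding element (c, r).
TRANSPOSE_BANK = [4 * (i % 4) + i // 4 for i in range(16)]


def csimd_move(row, bank_cfg):
    """One crossbar pass: ALU i receives the operand stored in bank_cfg[i]."""
    return [row[bank_cfg[i]] for i in range(16)]
```

The configuration is a permutation of the 16 banks, so each bank supplies exactly one operand to another ALU.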
TABLE III
ROW, BANK AND MISC CONFIGURATION IN CSLUT FOR THE IMPLEMENTATION OF THE HADAMARD TRANSFORMS

ALU                   f e d c b a 9 8 7 6 5 4 3 2 1 0
First CSIMD    ROW    5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
               BANK   b a 9 8 f e d c 3 2 1 0 7 6 5 4
               MISC   0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
Second CSIMD   ROW    6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
               BANK   7 6 5 4 3 2 1 0 f e d c b a 9 8
               MISC   0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

2) Efficient H.264/AVC fractional motion estimation:
To compute the fractional motion estimation for a 4x4 block,
a maximum of 10x10 integer pixels is needed, which can be
loaded to 9 register rows, with row addresses 1 to 9 respectively,
as shown in Fig. 16. The square boxes in the figure represent
the integer pixels and the number inside each square box is the
register bank where the pixel data is stored. The 6-tap filtering
operation for integer to half interpolation is done by six
multiply-and-accumulate (MAC) instructions. In Fig. 16 (upper
left hand side), the solid line triangle half pixel c is generated by
multiplying integer pixels in row 2 of banks 4, 8, c, and integer
pixels in row 9 of banks 0, 4 and 8 with 6 filter taps and
summing the results up. It can be seen that the operations
require nearly random access to different rows of different
register banks. For instance, the circle quarter pixel 0 is
interpolated from half pixels in row 10 of bank 0 and in row 0
of bank c, while the quarter pixel 5 is interpolated from half
pixels in row 0 of bank 1 and row 10 of bank 5. The lower part
of Fig. 16 shows how the half and/or quarter pixels are
retrieved randomly from any bank of any row before execution.
Note that, for clarity, the multipliers and adders in the figure only
show the operations required for the interpolation; they do not
represent the real hardware. Also, the second operand of the
multiplier, the FIR filter tap, is not shown in the figure. As mentioned,
the integer to half interpolation is performed by 6 MAC
instructions. Each MAC takes one operand from the location
specified by the CSLUT table before it is multiplied with a filter tap
and then added to the previous MAC results. While such random
register access would introduce much difficulty to traditional
SIMD executions, the proposed CSIMD structure handles it
easily with the use of the CSLUT table and the crossbar switch.
TABLE IV shows the related information stored in the CSLUT
table required for the interpolation of the solid line and dotted
line triangle half pixels as well as the circle quarter pixels in Fig.
16. The B and R columns refer to the register bank and the
row number of the pixels to be retrieved and sent to the ALU to
perform one MAC operation. For each quarter pixel
interpolation, a CSIMD instruction will be issued and the
required entries in the CSLUT table will be retrieved. The
related register access information will be sent to the register
file and the crossbar switch. With the help of the crossbar
switch, one of the operands required in the interpolation can be
obtained from any row of any register bank. The whole
fractional motion estimation can be evaluated efficiently
without extra memory loads and stores, and without redundant
packing and unpacking operations.

Fig. 16. Subpel interpolation by CSIMD.

TABLE IV
REGISTER ROW NUMBER AND BANK INFORMATION IN CSLUT FOR SUBPEL INTERPOLATION.

       Integer to Half (dotted)         Integer to Half (solid)          Half to Quarter
ALU    B R  B R  B R  B R  B R  B R     B R  B R  B R  B R  B R  B R     B R  B R
0      1 4  2 4  3 4  0 9  1 9  2 9     8 2  c 2  0 9  4 9  8 9  c 9     0 0  c a
1      2 4  3 4  0 9  1 9  2 9  3 9     9 2  d 2  1 9  5 9  9 9  d 9     1 0  d a
2      3 4  0 9  1 9  2 9  3 9  0 5     a 2  e 2  2 9  6 9  a 9  e 9     2 0  e a
3      0 9  1 9  2 9  3 9  0 5  1 5     b 2  f 2  3 9  7 9  b 9  f 9     3 0  f a
4      5 4  6 4  7 4  4 9  5 9  6 9     c 2  0 9  4 9  8 9  c 9  0 7     4 0  0 a
5      6 4  7 4  4 9  5 9  6 9  7 9     d 2  1 9  5 9  9 9  d 9  1 7     5 0  1 a
6      7 4  4 9  5 9  6 9  7 9  4 5     e 2  2 9  6 9  a 9  e 9  2 7     6 0  2 a
7      4 9  5 9  6 9  4 5  5 5  6 5     f 2  3 9  7 9  b 9  f 9  3 7     7 0  3 a
8      9 4  a 4  b 4  8 9  9 9  a 9     0 9  4 9  8 9  c 9  0 7  4 7     8 0  4 a
9      a 4  b 4  8 9  9 9  a 9  b 9     1 9  5 9  9 9  d 9  1 7  5 7     9 0  5 a
a      b 4  8 9  9 9  a 9  b 9  8 5     2 9  6 9  a 9  e 9  2 7  6 7     a 0  6 a
b      8 9  9 9  a 9  b 9  8 5  9 5     3 9  7 9  b 9  f 9  3 7  7 7     b 0  7 a
c      d 4  e 4  f 4  c 9  d 9  e 9     4 2  8 2  c 2  0 9  4 9  8 9     c 0  8 a
d      e 4  f 4  c 9  d 9  e 9  f 9     5 2  9 2  d 2  1 9  5 9  9 9     d 0  9 a
e      f 4  c 9  d 9  e 9  f 9  c 5     6 2  a 2  e 2  2 9  6 9  a 9     e 0  a a
f      c 9  d 9  e 9  f 9  c 5  d 5     7 2  b 2  f 2  3 9  7 9  b 9     f 0  b a
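The six-MAC half pixel computation can be sketched in Python as follows. The operand locations are those listed in TABLE IV for ALU c in the "Integer to Half (solid)" column. The tap values (1, -5, 20, 20, -5, 1), the rounding by (sum + 16) >> 5 with clipping to 8 bits, and the quarter pixel average (a + b + 1) >> 1 come from the H.264/AVC luma interpolation definition rather than from this paper. The register file model `rf[row][bank]` and the function names are hypothetical.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264/AVC 6-tap luma filter

# TABLE IV, "Integer to Half (solid)", ALU c: six (bank, row) operand locations.
ALU_C_SOLID = [(0x4, 2), (0x8, 2), (0xC, 2), (0x0, 9), (0x4, 9), (0x8, 9)]


def half_pel(rf, operands):
    """Accumulate six MACs, one per CSIMD instruction, then round and clip."""
    acc = 0
    for tap, (bank, row) in zip(TAPS, operands):
        acc += tap * rf[row][bank]        # operand fetched from any row and bank
    return min(255, max(0, (acc + 16) >> 5))


def quarter_pel(a, b):
    """Quarter pixel as the rounded average of two neighbouring samples."""
    return (a + b + 1) >> 1


def flat_register_file(value, rows=12, banks=16):
    """Register file model filled with one pixel value (for experiments)."""
    return [[value] * banks for _ in range(rows)]
```

A flat region of pixel value 100 interpolates to 100, since the taps sum to 32 and the result is divided by 32 with rounding.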
V. EXPERIMENTAL RESULTS
Extensive simulations have been performed to evaluate the
performance of the proposed CSIMD architecture in two
aspects: memory accesses and cycle counts for computing
major H.264 kernel functions. To evaluate the performance in
memory accesses, two Baseline Profile C models were used in
our experiments for comparison. One is the Optimized JM
Encoder which is optimized from JM7.4 reference model 0 by
removing Main Profile features, dynamic memory allocation
and release, and rate-distortion optimization. The other one is
our CSIMD H.264 Encoder which is based on the Optimized
JM Encoder and further enhanced by using all proposed
features described in this paper namely, the advanced parallel
memory structure with variable block size and word length
support and the CSIMD structure that allows nearly random
register access. We use the number of memory accesses as a
yardstick for performance evaluation because they directly
affect, to a large extent, the overall computation time. The
proposed CSIMD H.264 Encoder is equipped with a
16-module parallel memory structure plus efficient address
generation units. The memory accesses here refer to the
accesses to the parallel memory. Note that for the proposed
CSIMD H.264 Encoder, data accesses to external memory are
handled by a hardware DMA unit, as in other
traditional parallel memory systems.
Based on the above, the numbers of memory accesses for
computing integer and fractional motion estimation (IME and
FME) required by the two models are evaluated. The motion
estimation is done on a CIF resolution image. TABLE V shows
the results we obtained in the simulation. In the table, LS and
VLS stand for the number of load/store and vector load/store
instructions, respectively. Since our algorithm uses a
bottom-up approach, the vector LS in CSIMD Encoder mainly
refers to 4x4 block load or store in our simulation. Note that
other block sizes, such as 2x8 or 8x2, can also be easily
implemented using the proposed parallel memory structure and
the address generation unit. The number of instructions
required for loading or storing a 4x4 block by Optimized JM
Encoder varies from block to block. It depends on whether the
block is aligned in memory. Since one vector LS instruction
can replace 16 scalar LS instructions at most, if the Optimized
JM Encoder and the CSIMD Encoder differed only in the
parallel memory structure, the scalar LS instructions required
by the Optimized JM Encoder (Opt.JM_LS) should be close to
the sum of the scalar LS (CSIMD_LS) instructions and 16
times the vector LS (CSIMDvec_LS) instructions required by the
CSIMD Encoder. However, it can be seen in TABLE V that
Opt.JM_LS >> CSIMD_LS + (16*CSIMDvec_LS)
(37)
This is particularly true in fractional motion estimation. It shows
that while the parallel memory structure can help to reduce
memory accesses, the introduction of other features in the
CSIMD Encoder, in particular the random register access
feature, gives a further saving in memory accesses. This is
especially the case for fractional motion estimation. In fact,
when using the proposed SIMD architecture for computing
motion estimation, less than 10% of the SIMD instructions in integer
motion estimation are CSIMD instructions, while more than
90% of the SIMD instructions in fractional ME are CSIMD
instructions. This explains why the improvement for fractional
ME is so significant. As a result, the total number of memory
accesses for integer motion estimation is reduced by ~10.7 times,
and that for fractional motion estimation is reduced by ~89.0
times. The table also shows that the total instruction
counts to perform the integer and fractional motion estimation
are reduced by ~9.7 and ~122.4 times compared with the
Optimized JM Encoder.

TABLE V
MEMORY ACCESS AND INSTRUCTION COUNT REDUCTION.

       Optimized JM                CSIMD Encoder
ME     LS           Instr. Cnt     LS        VLS       LS+(16*VLS)   Instr. Cnt
IME    10,565,010   35,446,064     476,452   507,295   8,593,172     3,668,072
FME    23,956,652   96,534,062     166,954   102,155   1,801,434     789,070

TABLE VI
EXECUTION CYCLES SPEEDUP VERSUS VIDEOLAN X264.

Function   Type       Block Size   Speedup
SATD4x4               4x4          2.9
SATD4x4               4x8          4.6
SATD4x4               8x4          2.6
SATD4x4               8x8          2.4
SATD4x4               8x16         2.5
SATD4x4               16x8         2.5
SATD4x4               16x16        2.3
DCT4x4     DC         4x4          2.7
DCT4x4     Residual   4x4          2.6
DCT4x4     Residual   8x8          2.4
DCT4x4     Residual   16x16        2.7
IDCT4x4    DC         4x4          2.1
IDCT4x4    Residual   4x4          3.5
IDCT4x4    Residual   8x8          3.3
IDCT4x4    Residual   16x16        3.7
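The reduction factors quoted above can be recomputed from the TABLE V figures with the short Python sketch below. One reading is assumed here: the memory access reduction is taken as the Optimized JM LS count divided by the sum of the CSIMD LS and VLS counts, which reproduces the ~10.7 and ~89.0 figures. The dictionary keys and function name are illustrative.

```python
# Counts copied from TABLE V.
TABLE_V = {
    "IME": {"jm_ls": 10_565_010, "jm_instr": 35_446_064,
            "ls": 476_452, "vls": 507_295, "instr": 3_668_072},
    "FME": {"jm_ls": 23_956_652, "jm_instr": 96_534_062,
            "ls": 166_954, "vls": 102_155, "instr": 789_070},
}


def summarize(row):
    """Derived quantities for one row of TABLE V."""
    return {
        # right-hand side of (37): scalar LS plus 16 times vector LS
        "scalar_equivalent": row["ls"] + 16 * row["vls"],
        # assumed definition: JM accesses over all CSIMD LS and VLS accesses
        "access_reduction": row["jm_ls"] / (row["ls"] + row["vls"]),
        "instr_reduction": row["jm_instr"] / row["instr"],
    }


if __name__ == "__main__":
    for name, row in TABLE_V.items():
        print(name, summarize(row))
```

For both rows the scalar equivalent computed this way matches the LS+(16*VLS) column of the table and stays below the Optimized JM LS count, as (37) states.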
To give an idea of how the proposed SIMD architecture
compares with the state-of-the-art SSE/MMX SIMD
architecture, the execution cycles to perform 4x4 SATD and
IDCT/DCT using the proposed CSIMD Encoder model and
VideoLAN X264 are estimated. We developed a performance
simulator to emulate our CSIMD Encoder. The simulator is a
cycle-accurate model. Since there is no VideoLAN X264
performance simulator, we modified our performance
simulator to emulate the SSE/MMX instructions in VideoLAN.
TABLE VI shows the speedup obtained by using the proposed CSIMD
Encoder as compared with VideoLAN X264 for the
computations of SATD and IDCT/DCT of different block sizes.
It is seen that an improvement of 2.1X to 4.6X can be
achieved. Note that the speedup of SATD for block size 4x8 is
exceptionally high. This is because Intel's SSE/MMX does not
support the strided load that would allow one row of 4 pixels from
each of the upper and lower 4x4 blocks inside the 4x8 block to be
loaded into one SSE register in VideoLAN. Hence the two 4x4 blocks
can only be processed separately in MMX registers.

TABLE VII
CYCLE COUNT REDUCTION FOR IMPLEMENTING SOME H.264 KERNEL
FUNCTIONS WHEN ENCODING 1 SECOND OF CIF SEQUENCE.

              Times / Frame          Cycle Count Reduction /
Function      I Frame    P Frame     Second (percentage)
SATD          75,655     70,916      95,992,506 (65.6%)
DCT4x4RES     11,120     6,480       6,498,960 (61.9%)
IDCT4x4RES    4,952      1,694       2,645,264 (71.6%)
TABLE VII further shows the simulation results when
computing SATD and IDCT/DCT in an H.264 encoding process.
In this simulation, one second of the video sequence Stefan (25
frames, 1I+24P) with CIF resolution was used. The
cycle count reduction obtained by using the proposed CSIMD Encoder
model as compared with the SIMD implementation in the
VideoLAN X264 source code is shown. It can be seen that
the execution cycles can be reduced by more than 60% using the
proposed CSIMD Encoder model. All the improvements
mentioned above stem from the advanced parallel memory
and CSIMD structures.
Based on Amdahl's Law [28], we can project the speedup
of the entire H.264/AVC encoding application from the kernel
function speedup, with respect to adopting the proposed
parallel memory structure and configurable SIMD feature in a
conventional SIMD architecture. Let T be the execution time
(measured in execution cycles) of the original H.264/AVC
encoding application, Tker be the execution time of the kernel
function and Tcsimd be the execution time of the kernel function
performed by our CSIMD Encoder. Amdahl's Law states that
the overall speedup of the application S is:
$$S = \frac{T}{T - T_{ker} + T_{csimd}} = \frac{1}{1 - \alpha + (\alpha / s)} \tag{38}$$
where $\alpha = T_{ker} / T$ is the proportion of the kernel
function in the entire application and $s = T_{ker} / T_{csimd}$ is the
speedup of the kernel function execution with respect to our
proposed features. It is easy to extend the overall application
speedup to the case of multiple kernel functions as below:
$$S = \frac{1}{1 - \sum_i \alpha_i + \sum_i (\alpha_i / s_i)} \tag{39}$$
Several kernel functions are taken into account in our calculation. They
include integer motion estimation (IME), fractional motion
estimation (FME), SATD, DCT and IDCT. TABLE VIII shows
the speedup of the kernel functions and their corresponding
proportions in the application based on our profiling
result. The speedup of IME and FME mainly comes from the
instruction count reduction shown in TABLE V, which is 9.7
and 122.4 respectively. It should be noted that the SATD in this
table refers only to inter mode decision and not to motion
estimation, because the SATD speedup in motion estimation is
already accounted for in the FME speedup. The speedup of SATD,
DCT and IDCT is from TABLE VI. According to equation (39), the
overall speedup of the H.264/AVC encoding application is 2.46X.
TABLE VIII
PROPORTION AND SPEEDUP OF KERNEL FUNCTIONS.

Kernel   Proportion (%)   Speedup
IME      12               9.7
FME      33               122.4
SATD     7                2.9
DCT      7                2.7
IDCT     13               2.1
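The projection of (39) can be reproduced from the TABLE VIII values with a few lines of Python. This is only a worked check of the arithmetic; the function and variable names are illustrative.

```python
# (proportion of total execution time, kernel speedup), from TABLE VIII.
KERNELS = {
    "IME": (0.12, 9.7),
    "FME": (0.33, 122.4),
    "SATD": (0.07, 2.9),
    "DCT": (0.07, 2.7),
    "IDCT": (0.13, 2.1),
}


def overall_speedup(kernels):
    """Amdahl's Law for several accelerated kernels, as in (39)."""
    accelerated = sum(alpha for alpha, _ in kernels.values())
    remaining = sum(alpha / s for alpha, s in kernels.values())
    return 1.0 / (1.0 - accelerated + remaining)


if __name__ == "__main__":
    print(f"projected overall speedup: {overall_speedup(KERNELS):.2f}X")
```

The five kernels cover 72% of the execution time, so the 28% that is not accelerated bounds the achievable overall speedup at about 3.6X even if the kernels were infinitely fast.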
Besides video coding functions, the new SIMD architecture
is generic and flexible enough to be useful in many other
image and video applications. To illustrate this, we have
applied the proposed SIMD architecture to the implementation
of several general video and image processing functions (e.g.
de-interlacing, scaling, transform, color space conversion, etc.).
Due to the flexibility provided by the proposed parallel
memory structure, we can support image and video
applications of different block sizes and word lengths. By
redefining the CSLUT table entries, we can realize these
applications efficiently using the CSIMD instructions. TABLE
IX shows the numbers of predefined entries in the CSLUT table
for the implementation of different major kernel functions in
each application. It can be seen that for implementing the listed
6 applications, only 689 entries are required. This shows that the
memory required for the storage of the CSLUT table is
insignificant as far as a general purpose video/image processor
is concerned. As such, the proposed SIMD architecture can
support multiple video applications well by simply using
different entries of the CSLUT table for different applications.
The proposed features increase the area of the video
processor by not more than 5% of the total area. As a brief
account, the CSIMD LUT contributes about a 4% increase in
area, while the crossbar switch and CSIMD control contribute
0.23% and 0.6% increases in area, respectively.
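As a rough check of the CSLUT storage claim, the Python sketch below sums the entry counts of TABLE IX and converts them to bytes. It assumes that each entry holds exactly the 80-bit row, 64-bit bank and 16-bit misc fields described earlier, with no extra padding or control bits, which the paper does not state explicitly. The variable names are illustrative.

```python
# Entry counts per application, copied from TABLE IX:
# (fractional interpolation, data shuffle, SATD, transform)
TABLE_IX = {
    "H.264/AVC Encoder": (147, 82, 8, 8),
    "H.264/AVC Decoder": (88, 52, 0, 8),
    "AVS-M Decoder": (62, 50, 0, 4),
    "AVS Decoder": (56, 27, 0, 4),
    "MPEG4 Decoder": (32, 28, 0, 4),
    "Image Processor": (0, 21, 0, 8),
}

ENTRY_BITS = 80 + 64 + 16  # row + bank + misc fields (assumed entry width)

total_entries = sum(sum(counts) for counts in TABLE_IX.values())
total_bytes = total_entries * ENTRY_BITS // 8

if __name__ == "__main__":
    print(total_entries, "entries,", total_bytes, "bytes")
```

Under this assumption the six applications together need under 14 KB of look-up table storage.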
TABLE IX
NUMBER OF CSLUT CONFIGURATION ENTRIES FOR DIFFERENT IMAGE AND VIDEO APPLICATIONS.
Video Application    Fractional Interpolation    Data Shuffle    SATD    Transform
H.264/AVC Encoder    147                         82              8       8
H.264/AVC Decoder    88                          52              0       8
AVS-M Decoder        62                          50              0       4
AVS Decoder          56                          27              0       4
MPEG4 Decoder        32                          28              0       4
Image Processor      0                           21              0       8
VI. CONCLUSION
In this paper, we have proposed a novel SIMD architecture with two new features, namely, a parallel memory structure with variable block size and word length support, and a configurable SIMD (CSIMD) structure using a look-up table. When applied to block-based image or video applications, the proposed parallel memory structure provides extra flexibility in supporting data access with multiple block sizes and multiple word lengths by changing only a few parameters in the address generation units. The hardware complexity of implementing these address generation units is negligible as far as a general purpose image and video processor is concerned. By using the proposed parallel memory structure, a vector of 16 bytes, words or double words can be retrieved from (or stored to) the memory in 1, 2 or 4 cycles, respectively. On the other hand, the proposed CSIMD structure allows nearly random data access to SIMD registers by means of a crossbar switch. Programmers can specify the row number and the register bank to be accessed in the CSLUT, which we have shown to require only a small amount of internal memory for its implementation. Programmers can also use the CSLUT to define slightly different operations among the ALUs.
With these features, the SIMD performance when implementing matrix transpose, DCT/IDCT transform and SATD can be significantly improved. The H.264/AVC fractional motion estimation can also be implemented efficiently, and the number of memory accesses can be greatly reduced. In fact, the proposed CSIMD structure can also greatly benefit the implementation of other kernel functions such as the Luma 4x4 intra prediction. Owing to the page limit, this has not been explained in detail in this paper.
REFERENCES
[1] Thomas Wiegand, Gary J. Sullivan, Gisle Bjontegaard, and Ajay Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, Jul. 2003.
[2] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture [Online]. Available: http://www.intel.com/products/processor/manuals.
[3] K. Diefendorff, P. K. Dubey, R. Hochsprung, and H. Scales, "AltiVec Extension to PowerPC Accelerates Media Processing," IEEE Micro, vol. 20, no. 2, pp. 85-95, Mar.-Apr. 2000.
[4] Yong-Hwan Kim, Jin-Woo Yoo, Seong-Won Lee, Joonki Paik, and Byeongho Choi, "Optimization of H.264 Encoder Using Adaptive Mode Decision and SIMD Instructions," Proc., International Conference on Consumer Electronics, pp. 289-290, Jan. 2005.
[5] Yu Shengfa, Chen Zhenping, and Zhuang Zhaowen, "Instruction-Level Optimization of H.264 Encoder Using SIMD Instructions," Proc., International Conference on Communications, Circuits and Systems Proceedings, vol. 1, pp. 126-129, Jun. 2006.
[6] Marco Raggio, Massimo Bariani, Ivano Barbieri, and Davide Brizzolara, "H.264 Implementation on SIMD VLIW Cores," STreaming Day 07, Genova, Sep. 2007.
[7] Juyup Lee, Sungkun Moon, and Wonyong Sung, "H.264 Decoder Optimization Exploiting SIMD Instructions," Proc., IEEE Asia-Pacific Conference on Circuits and Systems, vol. 2, pp. 1149-1152, Dec. 2004.
[8] Lv Huayi, Ma Lini, and Liu Hai, "Analysis and Optimization of the UMHexagonS Algorithm in H.264 Based on SIMD," Communication Systems, Networks and Applications, pp. 239-244, Jun.-Jul. 2010.
[9] Ali R. Iranpour and Krzysztof Kuchcinski, "Evaluation of SIMD Architecture Enhancement in Embedded Processors for MPEG-4," Proc., Euromicro Symposium on Digital System Design, pp. 262-269, Aug. 2004.
[10] Ye Jianhong and Liu Jilin, "Fast Parallel Implementation of H.264/AVC Transform Exploiting SIMD Instructions," Proc., International Symposium on Intelligent Signal Processing and Communication Systems, pp. 870-873, Nov. 2007.
[11] Joohyun Lee, Gwanggil Jeon, Sangjun Park, Taeyoung Jung, and Jechang Jeong, "SIMD Optimization of the H.264/SVC Decoder with Efficient Data Structure," Proc., IEEE International Conference on Multimedia and Expo, pp. 69-72, 2008.
[12] Stephen Warrington, Hassan Shojania, Subramania Sudharsanan, and Wai-Yip Chan, "Performance Improvement of the H.264/AVC Deblocking Filter Using SIMD Instructions," Proc., IEEE International Symposium on Circuits and Systems, pp. 21-24, May 2006.
[13] Deepu Talla, Lizy Kurian John, and Doug Burger, "Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements," IEEE Transactions on Computers, vol. 52, no. 8, pp. 1015-1031, Aug. 2003.
[14] Mauricio Alvarez, Esther Salami, Alex Ramirez, and Mateo Valero, "Performance Impact of Unaligned Memory Operations in SIMD Extensions for Video Codec Application," Proc., IEEE International Symposium on Performance Analysis of Systems and Software, pp. 62-71, Apr. 2007.
[15] Deependra Talla, "Architectural Techniques to Accelerate Multimedia Applications on General-Purpose Processors," Ph.D. dissertation, University of Texas at Austin, 2001.
[16] Jarno K. Tanskanen, Tero Sihvo, and Jarkko Niittylahti, "Byte and Modulo Addressable Parallel Memory Architecture for Video Coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 11, pp. 1270-1276, Nov. 2004.
[17] Hoseok Chang, Junho Cho, and Wonyong Sung, "Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit," Proc., IEEE Workshop on Signal Processing Systems Design and Implementation, pp. 71-76, Oct. 2006.
[18] Georgi Kuzmanov, Georgi Gaydadjiev, and Stamatis Vassiliadis, "Multimedia Rectangularly Addressable Memory," IEEE Transactions on Multimedia, vol. 8, no. 2, pp. 315-322, Apr. 2006.
[19] Zhi Zhang, Xiaolang Yan, and Xing Qin, "An Efficient Programmable Engine for Interpolation of Multi-Standard Video Coding," Proc., IEEE International Conference on ASIC, pp. 750-753, Oct. 2007.
[20] Kunjie Liu, Xing Qin, Xiaolang Yan, and Li Quan, "A SIMD Video Signal Processor with Efficient Data Organization," IEEE Asian Solid-State Circuits Conference, pp. 115-118, 2006.
[21] S. Seo, M. Woh, S. Mahlke, T. Mudge, S. Vijay, and C. Chakrabarti, "Customizing Wide-SIMD Architecture for H.264," IEEE International Symposium on Systems, Architecture, Modeling and Simulation, pp. 172-179, Jul. 2009.
[22] Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen, "Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 3, pp. 378-401, Mar. 2005.
[23] X264 Free H.264/AVC Encoder [Online]. Available: http://www.videolan.org/developers/x264.html.
[24] Wing-Yee Lo and Simon Moy, "Configurable SIMD Processor Instruction Specifying Index to LUT Storing Information for Different Operation and Memory Location for Each Processing Unit," U.S. Patent 7,441,099 B2, filed October 2006, granted October 21, 2008.
[25] Simon Knowles, "Apparatus and Method for Configurable Processing," US 2006/0253689 A1, published November 9, 2006.
[26] Keith Diefendorff and Pradeep K. Dubey, "How Multimedia Workloads Will Change Processor Design," Computer, vol. 30, iss. 9, pp. 43-45, Sep. 1997.
[27] H.264/AVC JM Software Reference Model [Online]. Available: http://iphome.hhi.de/suehring/html.
[28] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1996.
Wing-Yee Lo received the B.Eng. (hons.) degree from Northumbria University, U.K., and the M.Phil. degree from the Chinese University of Hong Kong, both in Electronics Engineering. She has more than 10 years of ASIC design experience at Motorola Semiconductors Hong Kong Ltd., VTech Communications Ltd., Hong Kong Applied Science and Technology Institute, and Beijing SimpLight Nanoelectronics Ltd. She is very familiar with the ASIC design flow and has been working on various SoC chips for mobile and consumer products, video processor architectural analysis and parallel processor designs. She joined a Shenzhen startup company as Director of ASIC Engineering in 2009. She is currently a doctoral candidate at the Hong Kong Polytechnic University.
Daniel Pak-Kong Lun (M'91) received the B.Sc. (hons.) degree from the University of Essex, Essex, U.K., and the Ph.D. degree from the Hong Kong Polytechnic University, Hong Kong, in 1988 and 1991, respectively. He is now an Associate Professor and Associate Head of the Department of Electronic and Information Engineering, the Hong Kong Polytechnic University. His research interests include digital signal processing, wavelets, and multimedia technology. Dr. Lun is a Chartered Engineer and a corporate member of the IET and HKIE. (Home Page: http://www.eie.polyu.edu.hk/~enpklun)
Wan-Chi Siu (M'77, SM'90) received the MPhil
degree from The Chinese University of Hong Kong and
the PhD Degree from Imperial College of Science,
Technology & Medicine in October in 1977 and 1984
respectively. He joined The Hong Kong Polytechnic
University as a Lecturer in 1980 and has become Chair
Professor in the Department of Electronic and
Information Engineering since 1992. He was Head of
the same department and subsequently Dean of
Engineering Faculty between 1994 and 2002. He is now
Director of Centre for Signal Processing of the same university. He is an expert
in Digital Signal Processing, specializing in fast algorithms and video coding,
and has published 380 research papers, over 160 of which appeared in
international journals, such as IEEE Transactions on CSVT. His research
interests also include transforms, image coding, wavelets, and computational
aspects of pattern recognition. Professor Siu has served as Guest Editor,
Associate Editor and Member of the editorial board of a number of journals,
including IEEE Transactions on Circuits and Systems, Pattern Recognition,
Journal of VLSI Signal Processing Systems for Signal, Image, Video
Technology, and the EURASIP Journal on Applied Signal Processing. He is a
very popular lecturing staff member within the University, while outside the
University he has been a keynote speaker of over 10 international/national
conferences in the recent 10 years, and an invited speaker of numerous
professional events, such as IEEE CPM2002 (keynote speaker, Taipei,
Taiwan), IEEE ISIMP2004 (keynote speaker, Hong Kong), and IEEE
ICICS07 (invited speaker, Singapore) and IEEE ICNNSP2008 (keynote
speaker, Zhenjiang). He is the organizer of many international conferences,
including the MMSP08 (Australia) as General Co-Chair, and three IEEE
Society sponsored flagship conferences: ISCAS1997 as Technical Program
Chair; ICASSP2003 as the General Chair; and recently ICIP2010 as the
General Chair (2010 IEEE International Conference on Image Processing,
which was held in Hong Kong, 26-29 September 2010). Prof. Siu is also the
President Elect (2011-13) of a new professional association, the Asia-Pacific
Signal and Information Processing Association, APSIPA. He is a member
(2010-2012) of the Engineering Panel and also was a member of the Physical
Sciences and Engineering Panel (1991-1995) of the Research Grants Council
(RGC), Hong Kong Government. In 1994, he chaired the first Engineering and
Information Technology Panel of the Research Assessment Exercise (RAE) to
assess the research quality of 19 departments from all universities in Hong
Kong. (Home Page : http://www.eie.polyu.edu.hk/~wcsiu/mypage.htm)
Wendong Wang received the B.S. degree in electrical engineering from Shandong University, China, and the M.S. degree in computer science from Beijing University of Technology, China, in 1997 and 2004, respectively. He is a senior software engineer at SimpLight Nanoelectronics Ltd., Beijing, and focuses on computer architecture analysis and video processing algorithm development.
Jiqiang Song (M'01, SM'07) received the B.Sc. and Ph.D. degrees from Nanjing University, China, in 1996 and 2001, respectively, both in Computer Science and Application. He worked in the Department of Computer Science and Engineering of the Chinese University of Hong Kong as a Postdoctoral Fellow from 2001 to 2004. After that, he joined Hong Kong Applied Science and Technology Institute as Algorithm Lead in a video processor project. In 2006, he worked at SimpLight Nanoelectronics Ltd., Beijing, as R&D Director of Multimedia and was engaged in multimedia SIMD processor development. He joined Intel Labs China as Staff Research Scientist in 2008. His research interests include graphics recognition, video encoding, and image and video processing. He has published over 30 research papers in international journals and conferences.
