I. INTRODUCTION
Copyright (c) 2011 IEEE. Personal use of this material is permitted. However, permission to use this material for any other
purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
[Table: area breakdown of the processor; caption lost in extraction.]

Component                 Area (mm2)     %
Logic                     2.7            33.6
Local Storage:
  Synthesized Memory      3.9            48.6
  Register File           1.4            17.8
Total                     8.0            100
interleaved and loaded into multiple memory modules
sequentially. In [16], a modulo addressing mode was
introduced to allow part of the bytes in a word to be accessed
from both ends of a circular buffer to reduce external memory
bandwidth. Chang et al. [17] proposed adding one extra memory module in addition to the number of ALUs in a SIMD processor to solve the possible memory module conflict problem. However, the number of memory modules must be relatively prime to the supported stride values, resulting in a larger hardware cost in the address generation and shuffling logic.
In [18], a scalable data alignment scheme was proposed for
rectangular block data access using simpler memory address
generation. It is achieved by using a two-dimensional notation
for both pixel location and memory module number. However,
it is not flexible enough to support variable block sizes. It is noted that a block-based data access approach is often used in many image and video processing applications, while the block size can differ between algorithms. Flexibility should be provided when designing a general purpose video processor to allow data access with variable block size and word length without greatly increasing hardware complexity.
In [19]-[20], a video signal processor with a read-permuter and a write-transposer placed, respectively, before and after the vector register file was described. They facilitate data reorganization in SIMD registers before execution, but the design still needs N cycles to perform an NxN transpose operation. Seo et al. [21]
on the other hand introduced diagonal memory organization
and programmable crossbars in their SIMD architecture. The
diagonal memory organization allows the horizontal and
vertical memory access without any conflict. Due to the data access complexity of the H.264 algorithm, three programmable crossbar shuffle networks are added such that any data shuffle pattern required by the H.264 algorithm can be supported. However, in order to accommodate complex data access patterns, only predefined fixed-pattern crossbars are implemented. This limitation requires the crossbar patterns to be pre-designed based on the algorithm. They may not be flexible enough to realize future algorithm enhancements or support new video coding standards efficiently. Besides, the three shuffle networks lengthen the SIMD pipeline, which may increase the branch mis-prediction penalty and the execute-to-consume latency between pipeline stages.
Another deficiency of the traditional approaches is that they lack direct support for major kernel functions in image and video processing, as will be discussed in the next section.
III. ANALYSIS
As mentioned above, we use video coding as an example to
illustrate the deficiency of the traditional SIMD architectures in
supporting image and video processing kernel functions. It is
well known that motion estimation is the most computation
intensive function in H.264/AVC encoders. It contributes to
more than 50% of all computations [5][22]. If four reference frames are used, motion estimation alone accounts for more than 70% of the computation [22]. The next most intensive
[Table comparing kernel functions at block sizes 4x4 through 16x16 with their pixel counts and MMX/SSE2 instruction counts (e.g. 49 MMX, 40 MMX, 60 MMX, 135 SSE2, 537 SSE2, 68 MMX, 158 SSE2, 620 SSE2); table layout garbled in extraction.]
4x4 SATD function for different block sizes. In fact, as can be
seen in TABLE II, these sub-functions are equally important in
functions such as DCT and IDCT. Their efficient realization is obviously decisive in improving the overall performance.
Although these sub-functions are very simple, conventional
SIMD architectures often cannot achieve the peak throughput
due to the following four reasons:
1. lack of memory block load with different data length support;
2. limited support for data shuffling;
3. the requirement that all ALUs in the SIMD unit carry out the same operation in each SIMD instruction execution; and
4. inability to support cross-bank data access in a SIMD register file.
In the VideoLAN 4x4 SATD function, the MOVD instruction is used to load 4 pixel data bytes from memory into the lower double word of a 64-bit MMX register while filling the upper double word with zeros. Two PUNPCKLBW instructions are then used to unpack 8 data bytes from the two lower double words of
        [ 1  1  1  1 ]
    H = [ 1  1 -1 -1 ]
        [ 1 -1 -1  1 ]
        [ 1 -1  1 -1 ].
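The load-and-widen step just described can be sketched scalar-wise (a simulation of the MMX semantics for illustration, not actual intrinsics; the helper names are ours):

```python
def movd_zero_extend(mem):
    """Simulate MOVD: load 4 bytes from memory into the low double word
    of a 64-bit register (modeled as a list of 8 bytes), zero-filling
    the upper double word."""
    return list(mem[:4]) + [0, 0, 0, 0]

def punpcklbw(dst, src):
    """Simulate PUNPCKLBW: interleave the low 4 bytes of dst and src
    into 8 byte lanes (dst0, src0, dst1, src1, ...)."""
    out = []
    for a, b in zip(dst[:4], src[:4]):
        out += [a, b]
    return out

reg = movd_zero_extend(bytes([10, 20, 30, 40]))
# Unpacking against an all-zero register widens each byte into a
# 16-bit word (little-endian: data byte followed by a zero high byte).
widened = punpcklbw(reg, [0] * 8)
```

Unpacking two data registers against each other, as the x264 source does, interleaves the pixel bytes of the two operands instead of zero-extending them.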
    Y(i, j) = A(i, j) + A(i+2, j),   i = 0, j = 0-3.    (4)
    Y(i, j) = B(i, j) + B(i+2, j),   i = 1, j = 0-3.    (5)
    Y(i, j) = A(i, j) - A(i-2, j),   i = 2, j = 0-3.    (6)
    Y(i, j) = B(i, j) - B(i-2, j),   i = 3, j = 0-3.    (7)
Fig. 4 shows the basic operations carried out by the VideoLAN X264 source for implementing the 1-D Hadamard transforms. Since each Intel MMX register can only handle 4 data words at a time, 8 PADDW / PSUBW instructions are required to complete the transform for all columns of a 4x4 data block. In fact, 12 instructions rather than 8 are used in the actual source code, again because an insufficient number of registers requires extra instructions to deal with temporary result storage.
To speed up the computation, an intuitive solution is to
increase the bit-width of the register such that all data in a block
can be processed at the same time. Assume now we have a
256-bit register as shown in the upper part of Fig. 5 such that all
16 data words of a block can be loaded into this register. We
expect that all 16 data can be processed at the same time, but in
fact they cannot. Based on (2)-(7), the operations to be
performed for implementing the 1-D Hadamard transforms are
shown in the lower part of Fig. 5. We can see that each data word must be added to, or subtracted from, a word in another lane of the register. This requirement deviates from the operations performed by traditional SIMD structures, which require all ALUs in the SIMD unit to perform exactly the same operation whenever a SIMD instruction is executed.
Besides, each ALU must retrieve its operands from its own register bank. Although later multimedia extensions add new instructions to support cross-bank operand retrieval, these instructions are either limited to a few retrieval patterns or require additional instructions to configure the retrieval pattern in another register before use. The above means that even if we
can throw in extra resource to provide long bit-width registers,
the problem cannot be resolved without a redesign of the SIMD
architecture. We show in Section IV how the proposed SIMD
architecture handles these problems.
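To make the cross-lane dependency concrete, the following scalar sketch applies a length-4 1-D Hadamard transform to each column of a 4x4 block held in a flat 16-lane "register". The two-stage butterfly grouping below is the standard decomposition (equivalent to the row-pair grouping of (2)-(7), which is partly garbled above); note how every output lane needs an operand from a different lane:

```python
H = [[1, 1, 1, 1],
     [1, 1, -1, -1],
     [1, -1, -1, 1],
     [1, -1, 1, -1]]

def hadamard_cols_butterfly(reg):
    """Length-4 1-D Hadamard on each column of a 4x4 block stored
    row-major in a flat 16-element list (lane k holds X[k//4][k%4])."""
    out = [0] * 16
    for j in range(4):
        x0, x1, x2, x3 = (reg[i * 4 + j] for i in range(4))
        a0, a1 = x0 + x2, x1 + x3      # stage 1: cross-lane adds
        b0, b1 = x0 - x2, x1 - x3      # stage 1: cross-lane subtracts
        out[0 * 4 + j] = a0 + a1       # stage 2 butterflies
        out[1 * 4 + j] = b0 + b1
        out[2 * 4 + j] = b0 - b1
        out[3 * 4 + j] = a0 - a1
    return out

def hadamard_cols_matrix(reg):
    """Reference: direct H * X applied to each column."""
    out = [0] * 16
    for j in range(4):
        col = [reg[i * 4 + j] for i in range(4)]
        for i in range(4):
            out[i * 4 + j] = sum(H[i][k] * col[k] for k in range(4))
    return out
```

The butterfly and the direct matrix product agree lane for lane, yet no lane of the butterfly can be computed from that lane's own data alone.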
B. Fractional Motion Estimation
[Figure: integer-pixel positions (capital letters A-U) and fractional half- and quarter-pixel positions (lowercase letters a-q) used in sub-pel interpolation; diagram garbled in extraction.]
further operations take place. However, traditional memory storage only allows sequential memory access. For this reason, multi-bank or parallel memory structures were proposed to allow multiple data accesses concurrently [16]-[20]. Similar to the previous approaches, the proposed SIMD architecture is equipped with a 32KB parallel memory structure serving as a buffer between the external memory and the register file, as shown in Fig. 7. The parallel memory is divided into 16 modules, each of which has a size of 2K bytes and a separate data bus connected to one of the 16 banks of the register file. Each register bank has 32 rows, and each element of a register bank can store a 16-bit word (in fact, the register file is constructed from 32 256-bit registers).
[Fig. 7: (a) the external memory feeds a parallel memory structure of 16 banks x 2048 x 8 bits; (b) the parallel memory feeds a register file of 16 banks x 32 x 16 bits connected to 16 ALUs.]
[Equation (8) is garbled in extraction.] In (8), Aoff is the offset of the address of that pixel from the starting address. Assume that the video frame has the size of Nx columns and Ny rows. Then Aoff can always be written as

    Aoff = y * Nx + x,   for x = 0 to Nx-1 and y = 0 to Ny-1.    (9)

Alternatively, the indices x and y can be obtained from Aoff by

    y = floor(Aoff / Nx)  and  x = Aoff mod Nx.    (10)
[Equations (11)-(16), the module-number and physical-address mapping functions for the supported block sizes and word lengths, are garbled in extraction.]
Once the block size is known, the DMA unit loads the data from the external memory to the parallel memory following the respective mapping functions. Data can then be retrieved from the parallel memory to the registers efficiently. Assume that a 4x4 block whose first pixel has indices {xs, ys} is to be retrieved. The pixels in the block can be described by {xs+xo, ys+yo}, where xo, yo = 0 to 3. Following from (11),

    p = floor((ys + yo) / 4) * (Nx / 4) + floor((xs + xo) / 4).    (17)
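As a quick illustration of (17) (using integer floor division, and hypothetical values for Nx, xs and ys), an unaligned 4x4 block maps its pixels onto at most four distinct values of p:

```python
def p_of(x, y, Nx):
    """Mapping of (17) for a single pixel at column x, row y
    (floor division assumed)."""
    return (y // 4) * (Nx // 4) + (x // 4)

Nx = 16            # hypothetical frame width, a multiple of 4
xs, ys = 5, 6      # first pixel of a 4x4 block, deliberately unaligned
ps = p_of(xs, ys, Nx)
touched = {p_of(xs + xo, ys + yo, Nx)
           for yo in range(4) for xo in range(4)}
# The block straddles a 4x4 tile boundary in both directions, so it
# touches exactly four p values: ps, ps+1, ps+Nx/4 and ps+Nx/4+1.
```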
[Equations (18)-(28), which expand the module number m and the carries qx, qy in terms of the first-pixel indices {xs, ys} and the in-block offsets {xo, yo}, are garbled in extraction.]

In particular, for 4x4 block retrieval, the physical address p takes one of only four values depending on the carries qx and qy:

    p = ps,              if qy = 0 and qx = 0
        ps + 1,          if qy = 0 and qx = 1
        ps + Nx/4,       if qy = 1 and qx = 0
        ps + Nx/4 + 1,   if qy = 1 and qx = 1.    (20)

Since qx and qy must be equal to 0 or 1, (20) can be written in a per-module form; the resulting expressions (29)-(35), together with the variants using Nx/8 and Nx/2 for the other word lengths, are garbled in extraction, as is (36), which defines ds and dsb.

The following further shows that with only the indices of the first pixel {xs, ys}, we can easily determine the physical address for each module. Again we use the 4x4 block retrieval as an example. From (12), it can be seen that: [the remainder of this derivation is garbled in extraction].

[Figure: address generation hardware. One address generation unit (Add. Gen.) per module, m = 0 to 15, each taking xs, ys, bs, Nx, ds and dsb as inputs and driving one of the 16 memory modules. Note: bs = block size; refer to (36) for the definition of ds and dsb.]
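Since the derivation here is partly garbled, the following self-contained check (under the assumption that ps is the first-pixel value of (17) and that qx, qy are the one-bit column and row carries) confirms that the per-pixel mapping of (17) collapses to the four cases of (20):

```python
def p_direct(xs, ys, xo, yo, Nx):
    """Per-pixel physical address, as in (17) (floor division assumed)."""
    return ((ys + yo) // 4) * (Nx // 4) + ((xs + xo) // 4)

def p_cases(xs, ys, xo, yo, Nx):
    """Four-case form of (20): ps plus one-bit row/column carries."""
    ps = (ys // 4) * (Nx // 4) + (xs // 4)   # address of the first pixel
    qy = (ys + yo) // 4 - ys // 4            # 0 or 1, since yo < 4
    qx = (xs + xo) // 4 - xs // 4            # 0 or 1, since xo < 4
    assert qx in (0, 1) and qy in (0, 1)
    return ps + qy * (Nx // 4) + qx

# Exhaustively compare the two forms over a small range of positions.
ok = all(p_direct(xs, ys, xo, yo, 64) == p_cases(xs, ys, xo, yo, 64)
         for xs in range(8) for ys in range(8)
         for xo in range(4) for yo in range(4))
```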
bank. With both the mux control unit and the RF row address control unit, almost-random RF access is supported.
Although two operands can be read from a register bank at the same time, we impose several architectural constraints in order to save hardware cost. Firstly, the crossbar switch supports the shuffling of one operand only. That is, if a SIMD operation requires two operands, one of them must still be retrieved from the ALU's own bank. Secondly, the crossbar switch allows a maximum of one datum to come from another register bank, to prevent the register bank conflict problem and to limit the number of register bank output ports to two. Thirdly, the crossbar only shuffles word-size operands. If a double-word-size operand shuffle is needed, two SIMD instructions are used to shuffle the whole operand. Finally, the computed result must be written back to the ALU's own RF bank. Because of these hardware constraints, configurable SIMD allows only almost, but not fully, random RF access.
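These constraints can be captured in a toy model of the register file read path (the data layout and function names are ours, for illustration only):

```python
BANKS, ROWS = 16, 32
rf = [[0] * ROWS for _ in range(BANKS)]   # rf[bank][row]

def csimd_op(op, own_row, cfg, dest_row):
    """One configurable SIMD operation under the stated constraints:
    ALU k reads operand A from its own bank (row own_row), operand B
    from the single foreign (bank, row) in cfg[k], and writes the
    result back to its own bank only."""
    results = []
    for k in range(BANKS):
        a = rf[k][own_row]          # constraint 1: one operand from own bank
        b_bank, b_row = cfg[k]      # constraint 2: at most one foreign datum
        b = rf[b_bank][b_row]
        results.append(op(a, b))
    for k in range(BANKS):
        rf[k][dest_row] = results[k]   # constraint 4: own-bank write-back

# Example: row 5 holds 0..15; pair each lane with its mirror lane.
for k in range(BANKS):
    rf[k][5] = k
cfg = [(BANKS - 1 - k, 5) for k in range(BANKS)]
csimd_op(lambda a, b: a + b, 5, cfg, 6)
```

After the call, every lane of row 6 holds k + (15 - k) = 15, something a conventional SIMD add could not produce in one instruction.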
To support random RF access in configurable SIMD, a
look-up table, named the configurable SIMD look-up table (CSLUT), is introduced. The CSLUT is made up of 5 memory modules. Their logical and physical addresses as well as their structure are shown in Fig. 13. There are three major types of configuration data in the table. The 80-bit row address configuration data and the 64-bit bank configuration data specify, respectively, the register row addresses and register bank numbers of the 16 operands to be retrieved. The mux control unit takes in the 64-bit bank configuration data from the CSLUT so that each ALU operand can be retrieved from any bank. The RF address control unit takes in the 80-bit row address configuration data from the CSLUT and generates 16 row addresses, one for each RF bank. If only the bank number configuration data in the CSLUT is used, 16 operands on the same row, from the banks specified in the bank configuration data, are retrieved. If only the row address configuration data in the CSLUT is used, each ALU in the SIMD unit takes one operand at any row address specified in the row configuration data from its own RF bank. Using both row and
Fig. 15. 4x4 1-D Hadamard transform using only two CSIMD instructions.
Besides, the crossbar in the patented design is controlled entirely by PSCI data. It only provides shuffling on operands read from the register file location specified by the source operand address field of the instruction. It does not allow random register file access as the proposed approach does.
In the following subsections, we demonstrate how the
implementation of H.264/AVC kernel functions is made simple
by the new CSIMD structure. We particularly use SATD and fractional motion estimation as examples, although similar improvements can also be achieved in other kernel functions such as intra prediction. To simplify our discussion, we assume that video data are accessed in the form of 4x4 blocks. Operations involving larger data blocks are composed by combining the results of the constituent 4x4 blocks.
1) SATD Computation:
A SATD computation consists of data load, subtraction, a 2-D 4x4 Hadamard transform, matrix transposes, taking the absolute value of the transformed data, and summation. Let us first consider the
realization of 2-D 4x4 Hadamard transform. As discussed
above, a 2-D 4x4 Hadamard transform can be implemented by
four length-4 1-D Hadamard transforms applied to the rows and
followed by another four applied to the columns. Fig. 4 shows
that at least 8 instructions are needed to perform each set of four
1-D Hadamard transforms using the MMX instruction set due
to insufficient register bit-width. We have also shown in Fig. 5
that even if we have the resource to install registers with
sufficient bit-width such that all data of a block can be loaded
into a register, we still cannot easily implement the 1-D
Hadamard transforms using SIMD instructions since different
operations are performed in different register banks and they
may require operands from different register banks. For the
proposed SIMD architecture, we use only 2 CSIMD
instructions to realize each set of four 1-D Hadamard
transforms, as shown in Fig. 15. Before execution, the 4x4 input data is placed in a 256-bit SIMD register, say in row 5. Each CSIMD instruction takes one operand from its own RF bank and one operand from another bank to perform either addition or subtraction. For example, in the first CSIMD instruction, ALU 0 (the rightmost one) takes data X00 from its own bank and data X10 in bank 4 of row 5 to perform addition. ALU 3 (the fourth one from the right) takes data X10 from its own bank and data X00 in bank 0 of row 5 to perform subtraction. All configuration information is specified in the row, bank and misc memory contents.
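For reference, the whole SATD pipeline just described can be written scalar-wise (a functional sketch of the computation, not the two-instruction CSIMD sequence; note that some SATD definitions halve the final sum, which we omit here):

```python
def hadamard4(v):
    """Length-4 1-D Hadamard transform via two butterfly stages."""
    a0, a1 = v[0] + v[2], v[1] + v[3]
    b0, b1 = v[0] - v[2], v[1] - v[3]
    return [a0 + a1, b0 + b1, b0 - b1, a0 - a1]

def satd4x4(cur, ref):
    """Sum of absolute transformed differences of two 4x4 blocks:
    difference, row pass, column pass, absolute sum."""
    d = [[cur[i][j] - ref[i][j] for j in range(4)] for i in range(4)]
    rows = [hadamard4(r) for r in d]                   # transform rows
    cols = [hadamard4([rows[i][j] for i in range(4)])  # then columns
            for j in range(4)]
    return sum(abs(v) for col in cols for v in col)
```

Identical blocks give a SATD of zero; a uniform difference of 1 across all 16 pixels contributes only to the DC term and yields 16.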
[Fig. 15: register bank/row data flow of the two CSIMD instructions; diagram contents garbled in extraction.]

[TABLE III (the ROW, BANK and MISC CSLUT entries for the first and second CSIMD instructions) garbled in extraction.]

[Additional figure residue (register rows with an averaging step, apparently Fig. 16 for sub-pel interpolation) garbled in extraction.]
operation for integer-to-half interpolation is done by six multiply-and-accumulate (MAC) instructions. In Fig. 16 (upper left-hand side), the solid-line triangle half pixel c is generated by
TABLE IV
REGISTER ROW NUMBER AND BANK INFORMATION IN CSLUT FOR SUBPEL INTERPOLATION.
(B = bank, R = row, in hexadecimal. The first two groups list the six (B, R) operand pairs per ALU for integer-to-half interpolation; the original sub-table labels were lost in extraction.)

ALU | Group 1 (B R x6)          | Group 2 (B R x6)          | Half to Quarter (B R)
 0  | 1 4 2 4 3 4 0 9 1 9 2 9   | 8 2 c 2 0 9 4 9 8 9 c 9   | 0 0 c a
 1  | 2 4 3 4 0 9 1 9 2 9 3 9   | 9 2 d 2 1 9 5 9 9 9 d 9   | 1 0 d a
 2  | 3 4 0 9 1 9 2 9 3 9 0 5   | a 2 e 2 2 9 6 9 a 9 e 9   | 2 0 e a
 3  | 0 9 1 9 2 9 3 9 0 5 1 5   | b 2 f 2 3 9 7 9 b 9 f 9   | 3 0 f a
 4  | 5 4 6 4 7 4 4 9 5 9 6 9   | c 2 0 9 4 9 8 9 c 9 0 7   | 4 0 0 a
 5  | 6 4 7 4 4 9 5 9 6 9 7 9   | d 2 1 9 5 9 9 9 d 9 1 7   | 5 0 1 a
 6  | 7 4 4 9 5 9 6 9 7 9 4 5   | e 2 2 9 6 9 a 9 e 9 2 7   | 6 0 2 a
 7  | 4 9 5 9 6 9 4 5 5 5 6 5   | f 2 3 9 7 9 b 9 f 9 3 7   | 7 0 3 a
 8  | 9 4 a 4 b 4 8 9 9 9 a 9   | 0 9 4 9 8 9 c 9 0 7 4 7   | 8 0 4 a
 9  | a 4 b 4 8 9 9 9 a 9 b 9   | 1 9 5 9 9 9 d 9 1 7 5 7   | 9 0 5 a
 a  | b 4 8 9 9 9 a 9 b 9 8 5   | 2 9 6 9 a 9 e 9 2 7 6 7   | a 0 6 a
 b  | 8 9 9 9 a 9 b 9 8 5 9 5   | 3 9 7 9 b 9 f 9 3 7 7 7   | b 0 7 a
 c  | d 4 e 4 f 4 c 9 d 9 e 9   | 4 2 8 2 c 2 0 9 4 9 8 9   | c 0 8 a
 d  | e 4 f 4 c 9 d 9 e 9 f 9   | 5 2 9 2 d 2 1 9 5 9 9 9   | d 0 9 a
 e  | f 4 c 9 d 9 e 9 f 9 c 5   | 6 2 a 2 e 2 2 9 6 9 a 9   | e 0 a a
 f  | c 9 d 9 e 9 f 9 c 5 d 5   | 7 2 b 2 f 2 3 9 7 9 b 9   | f 0 b a
the sum of the scalar LS (CSIMD_LS) instructions and 16 times the vector LS (CSIMDvec_LS) instructions required by the CSIMD Encoder. However, as can be seen in TABLE V,

    Opt.JM_LS >> CSIMD_LS + 16 * CSIMDvec_LS.    (37)
[TABLE VI: kernel speedups (column labels reconstructed from the garbled header):

             DCT4x4 Residual      DCT4x4 DC    IDCT4x4 Residual
             4x4   8x8   16x16    4x4          4x4   8x8   16x16
Speedup      2.6   2.4   2.7      2.1          3.5   3.3   3.7  ]

[TABLE VII: execution cycle counts for SATD, DCT4x4RES and IDCT4x4RES (row labels garbled in extraction; the final row is the percentage cycle reduction):

             SATD          DCT4x4RES     IDCT4x4RES
             75,655        11,120        4,952
             70,916        6,480         1,694
             95,992,506    6,498,960     2,645,264
Reduction    (65.6%)       (61.9%)       (71.6%)  ]
one SSE register in VideoLAN. Hence the two 4x4 blocks can only be processed separately in MMX registers.
TABLE VII further shows the simulation results when computing SATD and IDCT/DCT in an H.264 encoding process. In this simulation, one second of the video sequence Stefan (25 frames, 1I+24P) at CIF resolution was used. The table shows the cycle count reduction achieved by the proposed CSIMD Encoder model compared with the SIMD implementation using the VideoLAN X264 source code. It can be seen that more than 60% of the execution cycles can be eliminated using the proposed CSIMD Encoder model. All the improvement mentioned above stems from the advanced parallel memory and CSIMD structures.
Based on Amdahl's Law [28], we can project the speedup of the entire H.264/AVC encoding application from the kernel function speedup obtained by adopting the proposed parallel memory structure and configurable SIMD feature in a conventional SIMD architecture. Let T be the execution time (measured in execution cycles) of the original H.264/AVC encoding application, Tker be the execution time of the kernel function, and Tcsimd be the execution time of the kernel function performed by our CSIMD Encoder. Amdahl's Law states that the overall speedup S of the application is:

    S = T / (T - Tker + Tcsimd) = 1 / (1 - α + α/s),    (38)

where α = Tker / T is the proportion of the kernel function in the entire application and s = Tker / Tcsimd is the speedup of the kernel function execution with respect to our proposed features. It is easy to extend this to the overall application speedup when there are multiple kernel functions:

    S = 1 / (1 - Σi αi + Σi (αi / si)).    (39)
Several kernel functions are included in our calculation: integer motion estimation (IME), fractional motion estimation (FME), SATD, DCT and IDCT. TABLE VIII shows the kernel function speedups and their corresponding proportions in the application based on our profiling results. The speedups of IME and FME mainly come from the instruction count reduction shown in TABLE V, which is 9.7 and 122.4 respectively. It should be noted that the SATD in this table refers only to inter mode decision and not to motion estimation, because the SATD speedup is already accounted for in the FME speedup. The speedups of SATD, DCT and IDCT are from TABLE VI. According to equation (39), the overall speedup of the H.264/AVC encoding application is 2.46X.
TABLE VIII
PROPORTION AND SPEEDUP OF KERNEL FUNCTIONS.

Kernel   Proportion (%)   Speedup
IME      12               9.7
FME      33               122.4
SATD     7                2.9
DCT      7                2.7
IDCT     13               2.1
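The 2.46X figure follows directly from (39) and the TABLE VIII profile:

```python
# Overall speedup per (39): proportions alpha_i and kernel speedups s_i
# taken from TABLE VIII.
kernels = {           # kernel: (proportion of total cycles, speedup)
    "IME":  (0.12, 9.7),
    "FME":  (0.33, 122.4),
    "SATD": (0.07, 2.9),
    "DCT":  (0.07, 2.7),
    "IDCT": (0.13, 2.1),
}
serial = 1.0 - sum(a for a, _ in kernels.values())   # unaccelerated fraction
accel = sum(a / s for a, s in kernels.values())      # accelerated fraction
S = 1.0 / (serial + accel)                           # overall speedup, ~2.46
```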
applied the proposed SIMD architecture to the implementation
of several general video and image processing functions (e.g.
de-interlacing, scaling, transform, color space conversion, etc.).
Due to the flexibility provided by the proposed parallel memory structure, we can support image and video applications with different block sizes and word lengths, and by redefining the CSLUT table entries we can realize these applications efficiently using the CSIMD instructions. TABLE IX shows the number of predefined entries in the CSLUT table for the implementation of the major kernel functions in each application. It can be seen that only 689 entries are required to implement the 6 applications listed. This shows that the
memory required for the storage of the CSLUT table is
insignificant as far as a general purpose video/image processor
is concerned. As such, the proposed SIMD architecture can
support multiple video applications well by simply using
different entries of the CSLUT table for different applications.
The proposed features increase the area of the video processor by no more than 5% of the total. As a brief account, the CSIMD LUT contributes about a 4% increase in area, while the crossbar switch and the CSIMD control contribute 0.23% and 0.6% increases respectively.
TABLE IX
NUMBER OF CSLUT CONFIGURATION ENTRIES FOR DIFFERENT IMAGE AND VIDEO APPLICATIONS.

Video Application     Fractional      Data      SATD   Transform
                      Interpolation   Shuffle
H.264/AVC Encoder     147             82        8      8
H.264/AVC Decoder     88              52        0      8
AVS-M Decoder         62              50        0      4
AVS Decoder           56              27        0      4
MPEG4 Decoder         32              28        0      4
Image Processor       0               21        0      8
VI. CONCLUSION
In this paper, we have proposed a novel SIMD architecture with two new features, namely a parallel memory structure with variable block size and word length support, and a configurable SIMD (CSIMD) structure using a look-up table. When applied to block-based image or video applications, the proposed parallel memory structure provides extra flexibility in supporting data access with multiple block sizes and multiple word lengths by changing only a few parameters in the address generation units. The hardware complexity of implementing these address generation units is negligible as far as a general purpose image and video processor is concerned. By using the proposed parallel memory structure, a vector of 16 bytes, words or double words can be retrieved from (or stored to) the memory in 1, 2 and 4 cycles respectively. On the other hand, the proposed CSIMD structure allows nearly random data access to SIMD registers by means of a crossbar switch. Programmers can specify the row number and the register bank to be accessed in the CSLUT table, which we have shown to require only a small amount of internal memory for its implementation. Programmers can also use the CSLUT table to define slightly different operations among the ALUs.
[16]
no. 11, pp. 1270-1276, Nov. 2004.
[17] Hoseok Chang, Junho Cho, and Wonyong Sung, Performance
Evaluation of an SIMD Architecture with a Multi-bank Vector Memory
Unit, Proc., IEEE Workshop on Signal Processing Systems Design and
Implementation, pp. 71-76, Oct. 2006.
[18] Georgi Kuzmanov, Georgi Gaydadjiev, and Stamatis Vassiliadis,
Multimedia Rectangularly Addressable Memory, IEEE Transactions on
Multimedia, vol. 8, no. 2, pp. 315-322, Apr. 2006.
[19] Zhi Zhang, Xiaolang Yan, and Xing Qin, An Efficient Programmable
Engine for Interpolation of Multi-Standard Video Coding, Proc., IEEE
International Conference on ASIC, pp. 750-753, Oct. 2007.
[20] Kunjie Liu, Xing Qin, Xiaolang Yan, and Li Quan, A SIMD Video Signal Processor with Efficient Data Organization, IEEE Asian Solid-State Circuits Conference, pp. 115-118, 2006.
[21] S. Seo, M. Woh, S. Mahlke, T. Mudge, S. Vijay, and C. Chakrabarti, Customizing Wide-SIMD Architectures for H.264, IEEE International Symposium on Systems, Architecture, Modeling and Simulation, pp. 172-179, Jul. 2009.
[22] Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen, Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder, IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 3, pp. 378-401, Mar. 2005.
[23] X264 Free H.264/AVC Encoder [Online]. Available:
http://www.videolan.org/developers/x264.html.
[24] Wing-Yee Lo and Simon Moy, Configurable SIMD Processor Instruction Specifying Index to LUT Storing Information for Different Operation and Memory Location for Each Processing Unit, U.S. Patent 7,441,099 B2, filed Oct. 2006, granted Oct. 21, 2008.
[25] Simon Knowles, Apparatus and Method for Configurable Processing, U.S. Patent Application 2006/0253689 A1, published Nov. 9, 2006.
[26] Keith Diefendorff and Pradeep K. Dubey, How Multimedia Workloads Will Change Processor Design, Computer, vol. 30, iss. 9, pp. 43-45, Sep. 1997.
[27] H.264/AVC JM Software Reference Model [Online]. Available:
http://iphome.hhi.de/suehring/html.
[28] D. A. Patterson and J. L. Hennessy, Computer Architecture: A
Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1996.