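$$\tilde{\Phi}(\tilde{\mathbf{m}}_0) = \frac{1}{k(\mathbf{m}_0)} \sum_{\mathbf{m} \in F} \Phi(\mathbf{m}) \, s\big(\Phi(\mathbf{m}_0), \Phi(\mathbf{m})\big) \, c(\mathbf{m}_0, \mathbf{m}). \tag{1}$$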
The term $\mathbf{m} = (m, n)$ denotes the pixel coordinates in the image to be filtered, and $\mathbf{m}_0 = (m_0, n_0)$ and $\tilde{\mathbf{m}}_0 = (\tilde{m}_0, \tilde{n}_0)$ represent the coordinates of the centered pixel in the noisy and in the filtered images, respectively. With these notations, $\tilde{\Phi}(\tilde{\mathbf{m}}_0)$ means the gray value of the filtered pixel, and $\Phi(\mathbf{m})$ identifies the gray values of the pixels spatially neighboring $\Phi(\mathbf{m}_0)$ in the filter window $F$.
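The following expressions (2) and (3) describe the photometric and the geometric components $s(\Phi(\mathbf{m}_0), \Phi(\mathbf{m}))$ and $c(\mathbf{m}_0, \mathbf{m})$, respectively:
$$s\big(\Phi(\mathbf{m}_0), \Phi(\mathbf{m})\big) = \exp\left(-\frac{1}{2}\left(\frac{|\Phi(\mathbf{m}_0) - \Phi(\mathbf{m})|}{\sigma_{\mathrm{ph}}}\right)^{2}\right) \tag{2}$$
$$c(\mathbf{m}_0, \mathbf{m}) = \exp\left(-\frac{1}{2}\left(\frac{\|\mathbf{m}_0 - \mathbf{m}\|}{\sigma_{c}}\right)^{2}\right) \tag{3}$$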
where the parameters $\sigma_{\mathrm{ph}}$ and $\sigma_{c}$ regulate the width of the Gaussian curve assigned to $s(\Phi(\mathbf{m}_0), \Phi(\mathbf{m}))$ and $c(\mathbf{m}_0, \mathbf{m})$, respectively.
The photometric component compares the gray value of the centered pixel with the gray values of the spatial neighborhood and computes the corresponding weight coefficients depending on the factor $\sigma_{\mathrm{ph}}$. The more the absolute difference of the gray values exceeds $\sigma_{\mathrm{ph}}$, the lower the corresponding filter coefficient, and vice versa. The domain filter $c(\mathbf{m}_0, \mathbf{m})$ acts as a standard low-pass filter whose weights decrease with the spatial distance between the centered pixel and the pixels in the neighborhood.
Normalization with
$$k(\mathbf{m}_0) = \sum_{\mathbf{m} \in F} s\big(\Phi(\mathbf{m}_0), \Phi(\mathbf{m})\big) \, c(\mathbf{m}_0, \mathbf{m}) \tag{4}$$
guarantees that the range of the filtered images does not change significantly due to the filtering. Because the coefficients of the photometric component depend on the image data and cannot be computed in advance, the division by the normalization factor cannot be avoided by prescaling the filter coefficients.
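
As a point of reference for (1) to (4), the following floating-point sketch filters a single pixel. It is only an illustration, not the hardware design described later: the function name `bilateral_pixel`, the list-of-lists image layout, and the decision to skip window positions outside the image are assumptions made here.

```python
import math

def bilateral_pixel(image, m0, n0, radius, sigma_ph, sigma_c):
    """Filter one pixel of a 2-D list of gray values following (1)-(4)."""
    center = image[m0][n0]
    acc = 0.0  # numerator of (1)
    k = 0.0    # normalization factor k(m0) of (4)
    for m in range(m0 - radius, m0 + radius + 1):
        for n in range(n0 - radius, n0 + radius + 1):
            # Assumption: window positions outside the image are skipped.
            if not (0 <= m < len(image) and 0 <= n < len(image[0])):
                continue
            value = image[m][n]
            # Photometric weight (2): Gaussian of the gray value difference.
            s = math.exp(-0.5 * ((center - value) / sigma_ph) ** 2)
            # Geometric weight (3): Gaussian of the spatial distance.
            dist = math.hypot(m - m0, n - n0)
            c = math.exp(-0.5 * (dist / sigma_c) ** 2)
            acc += value * s * c
            k += s * c
    # k is never zero because the center pixel contributes s = c = 1.
    return acc / k
```

Because `s` depends on the gray values inside the current window, `k` changes from pixel to pixel, which is why the final division remains in the data path.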
IV. DESIGN CONCEPT
The image data, as well as all constants and coefficients used in the following design concept, are integer numbers. As discussed in Section VI, there is no need to implement floating-point computation. With the aid of the presented design concept, the bilateral filter can be realized as a highly parallelized pipeline structure that emphasizes effective resource utilization. In this paper, the data paths are detailed. The control signals are not described here.
4096 IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, VOL. 61, NO. 8, AUGUST 2014
Fig. 1. Order of the functional units of the bilateral filter.
Fig. 2. Principle of the input data retrieval for the image filtering.
For the design description, a window size of 5 × 5 is chosen. This window size is a tradeoff between high noise reduction and low blurring effect.
The design concept for the implementation of the bilateral filter is subdivided into three functional blocks. The block-based design approach reduces design complexity and simplifies validation [34]. Fig. 1 presents these units and their order in the concept. The input data marked by Data_in are read line by line and arranged for further processing in the register matrix. The second unit is the photometric filter, which weights the input data according to the intensity of the processed pixels. The filtering is completed by the geometric filter, and the filtered data are marked by Data_out.
A. Register Matrix
The photometric filter component, also often referred to as a range filter in the related literature, is a nonlinear filter. This means that the filter coefficients change for every filter position. Thus, the pixel weights for the photometric component have to be calculated separately for every pixel in the filter window. The number of weights depends on the filter window size. Here, 24 weights have to be computed for the filtering of one image pixel.
The filter window is first shifted along the input lines representing the image rows, moving one row down every time the preceding row has been filtered. Consequently, at least five lines have to be stored for the period of time during which a line is filtered. An external image buffer is undesired because of the additional resources required by the memory controller and the additional latency caused by the memory accesses. Therefore, the five input lines are stored in line storages which are implemented as block RAMs for data with N bits. The five input lines are called image rows or rows in the following. These five rows include the row to be filtered, the two preceding rows, and the two succeeding rows.
This arrangement is depicted in Fig. 2. The pixel being filtered is marked by mid_pix. This pixel and its neighborhood in the solid box represent the kernel of the bilateral filter. After the middle row has been filtered, the outer preceding row line storage n-2 moves out of the register matrix. As the input data are read into the register matrix pixel by pixel, the content of the line storages and of the filter kernel is shifted by one pixel at each clock event. This shift emulates the shift of the filter kernel. In this way, at the end of an image line, all remaining rows are shifted one row down. The former succeeding row line storage n + 1 can now be processed. The output lines form the output image, which is stored externally.
Fig. 3. Register matrix of the kernel-based design concept.
The parallel calculation of 24 weights in the photometric filter component and the subsequent weighting in the geometric component, combined with the final normalization at the filter output, require a large amount of resources considering the short time budget of just one pixel cycle. Due to the flexibility of the clock management in FPGAs, this challenge can be met. The solution is offered by our kernel-based design concept in Fig. 3. The single registers are interconnected in such a manner that, aside from the shift of the filter window by one pixel, the entire kernel is provided to the next filter stage simultaneously. This is an important advantage of the presented kernel-based design concept, as no extra data buffer is required. On the other hand, it is necessary to process all 25 pixels in one pixel cycle in order to keep up with the reading of the input lines into the register matrix.
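
The behavior of the line storages and the register matrix can be approximated in software by a single shift buffer that is advanced by one pixel per step. The sketch below is a simplified behavioral model, not the register-level structure of Fig. 3: the name `stream_windows`, the row-major pixel stream, and the minimal buffer depth of (k - 1) lines plus k pixels are assumptions made for illustration, and image borders are simply skipped.

```python
from collections import deque

def stream_windows(pixels, width, k=5):
    """Yield (row, col, window) for each complete k x k window of a pixel stream.

    pixels: iterable of gray values in row-major order
    width:  number of pixels per image line
    """
    depth = (k - 1) * width + k   # pixels that must be held to span k lines
    buf = deque(maxlen=depth)     # shifts by one pixel per input sample
    for index, p in enumerate(pixels):
        buf.append(p)
        if len(buf) < depth:
            continue              # buffer not filled yet
        row, col = divmod(index, width)
        if col < k - 1:
            continue              # window would wrap around the line end
        window = [[buf[r * width + c] for c in range(k)] for r in range(k)]
        # The newest pixel is the bottom-right corner of the window.
        yield row - k // 2, col - k // 2, window
```

For a 6 × 6 test image streamed as `range(36)`, the generator produces four windows, the first one centered at row 2, column 2.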
The output of the register matrix is sorted into groups, in this case into six groups, and fed synchronously into the photometric filter component at four times the pixel clock frequency. The number of groups is explained by the symmetry of the geometric filter component, which is discussed later in Section IV-C. The sorting is done by multiplexing the pixels in the manner shown in Fig. 3. The quadruplication of the filter processing clock is implemented by setting the select signal of the multiplexers four times in one pixel clock. Here, the clock domain changes to four times the input pixel clock. The counter at the top of Fig. 3 generates the select signal and thus controls the readout of the register matrix. This counter is clocked with the quadruple pixel clock as well. The counter is first enabled after the whole register matrix is filled. The pixels in each group are processed in parallel while each group is pipelined through to the register matrix output stage. The pixel in the center of the filter window is not a part of any group and is forwarded to a latch belonging to the input stage of the photometric filter component. The sorting of the pixels into groups and the quadruplication of the pixel clock are the key to the presented synchronous FPGA design concept using a parallelized pipeline architecture.
GABIGER-ROSE et al.: FPGA-BASED FULLY SYNCHRONIZED DESIGN OF BILATERAL FILTER 4097
Fig. 4. Abstract illustration of the photometric filter component.
B. Photometric Component
After the register matrix has been filled, the grouped image data are provided to the photometric filter component, which is pictured in Fig. 4. At the output of the photometric filter, the weighted pixels appear, still sorted into groups, accompanied by the weighted mid_pix. Additionally, the photometric coefficients have to be forwarded for the required normalization at the last stage of the filtering according to (4). Thus, in parallel to the pixels, the photometric coefficients also have to be processed by the geometric filter in order to obtain the normalization factor defined in (4). For this reason, the output of the photometric filter consists of the following:
1) weighted pixels sorted into groups 0 ... 5;
2) the weighted pixel being filtered, marked by mid_pix;
3) photometric coefficients corresponding to groups 0 ... 5.
In further stages of the design, the weighted pixel values, i.e., the outputs of the multipliers, are named by their groups 0 ... 5.
A detailed functional flow block diagram of the photometric filter is shown in Fig. 5. The pixel in the center of the filter window has to be available during the calculation of the required 24 pixel weights. Latching the centered pixel allows the computation of the gray value differences between the centered pixel and the remaining pixels inside the filter window. Each group contains four pixels. A separate pipeline belonging to each group makes it possible to process the entire neighborhood of mid_pix within one pixel clock cycle. All six pipelines are designed identically.
Fig. 5. Photometric filter component.
Fig. 6. Processing order of input data in the photometric filter component.
The arrangement and the processing order of the input data of the photometric component are shown in Fig. 6. At the first internal clock event $t_0$, the first pixels of each group are provided to the respective pipeline. At the second internal clock event $t_1$, the second pixels of each group enter the component. This organization of groups allows the processing of the whole filter window in four internal clock cycles corresponding to one pixel cycle. In the upper part of Fig. 5, the processing path for group 0 is shown; in the lower part, there is the processing path for group 5.
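
The time-multiplexed feeding of the six pipelines can be sketched as a schedule with one entry per internal clock event. The grouping used below is an interpretation of the text, since Fig. 3 and Fig. 6 are not reproduced here: it assumes that each window row without its middle-column pixel forms one of the groups 0 to 4, that the middle column without mid_pix forms the last group, and that the middle-column pixels arrive as symmetric partners back to back. The function names are hypothetical.

```python
def input_schedule(window):
    """Return, per internal clock event, the pixels entering the pipelines."""
    k = len(window)
    half = k // 2
    side_cols = [c for c in range(k) if c != half]
    # Assumed order for the middle column: symmetric partners back to back.
    mid_rows = []
    for d in range(half, 0, -1):
        mid_rows += [half - d, half + d]
    ticks = []
    for t in range(k - 1):
        # One pixel per row-based group (pipelines 0 .. k-1) ...
        entering = [window[r][side_cols[t]] for r in range(k)]
        # ... plus one pixel of the middle-column group (last pipeline).
        entering.append(window[mid_rows[t]][half])
        ticks.append(entering)
    return ticks

def abs_differences(window):
    """Absolute gray value differences to the latched center pixel, per tick."""
    half = len(window) // 2
    mid_pix = window[half][half]
    return [[abs(p - mid_pix) for p in tick] for tick in input_schedule(window)]
```

For a 5 × 5 window, this yields four ticks with six pixels each, so all 24 neighbors of mid_pix are covered within one pixel cycle.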
Fig. 7. Limitation of the number of coefficients.
The combinatory blocks comb.0 ... 5 compute the absolute gray value difference required by (2). In order to keep the design synchronous, the gray values of each pipeline are registered during the difference calculation. The upper path in Fig. 5 shows the required registers labeled group 0, which make sure that the gray value appears at the input of the multiplier at the same time as the corresponding photometric coefficient. Throughout the following, we use registers to keep our design synchronous, which makes any additional delay control inside our architecture redundant.
To avoid the calculation of the expensive exponential, all possible values of the function (2) are precalculated and stored in a lookup table (LUT). The absolute difference of the gray values itself is directly interpreted as the address of the corresponding weight coefficient in the LUT.
Due to the quantization, the number of weight coefficients is limited. This limit depends on three parameters:
1) the word length N of the input data;
2) the parameter $\sigma_{\mathrm{ph}}$;
3) the word length W of the coefficients.
The first point means that increasing the color depth of an image causes a larger number of intensity differences that have to be stored in the LUT. Depending on the parameter $\sigma_{\mathrm{ph}}$, the slope of the Gaussian curve is steeper or flatter, which influences the number of coefficients different from zero after the quantization. The word length W itself determines which coefficients are actually different from zero after the quantization.
In Fig. 7, the coefficients are plotted for N = 8 b, W = 8 b, and $\sigma_{\mathrm{ph}} = 60$. As the negative exponential converges toward zero for increasing gray value differences, there are only a limited number of quantized coefficients that are different from zero. Considering the example in Fig. 7, there are only 188 coefficients to be stored. For simplification of the internal control, the number of coefficients is extended to the next power of 2, resulting in the highest address $2^{P} - 1$. In the example, the highest address is 255. The coefficients are stored in the LUT of each pipeline in the initialization phase of the filtering.
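
The precalculation of the LUT can be sketched as follows. The scaling to the highest coefficient $2^{W} - 1$ and the rounding to the nearest integer are assumptions, as the quantization rule is not spelled out here, so the number of nonzero entries this sketch produces need not match the 188 coefficients read from Fig. 7. The function name `build_photometric_lut` is hypothetical.

```python
import math

def build_photometric_lut(n_bits=8, w_bits=8, sigma_ph=60.0):
    """Quantize (2) for all gray value differences 0 .. 2**n_bits - 1.

    Returns (lut, p_bits, nonzero): the table cut to 2**p_bits entries,
    the address width P, and the number of nonzero coefficients.
    """
    scale = (1 << w_bits) - 1  # assumed: highest coefficient is 2**W - 1
    full = [round(scale * math.exp(-0.5 * (d / sigma_ph) ** 2))
            for d in range(1 << n_bits)]
    nonzero = sum(1 for coeff in full if coeff > 0)
    # Extend the table to the next power of 2, but never beyond 2**n_bits.
    p_bits = min(n_bits, nonzero.bit_length())
    return full[: 1 << p_bits], p_bits, nonzero
```

The Gaussian is monotonically decreasing, so the nonzero coefficients occupy the lowest addresses and the tail of the table is filled with zeros.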
Fig. 8. Abstract illustration of the geometric filter component.
If N is greater than P, a logical disjunction of the upper (N - P) bits checks whether the gray value difference is greater than the chosen limit $2^{P} - 1$. The result of the disjunction selects the coefficient address. If the gray value difference is greater than the limit, the weight coefficient is set to zero, which is stored at the address $2^{P} - 1$. In the opposite case, the corresponding coefficient is read out of the LUT. This coefficient may also be zero, as the coefficient table is extended up to the address $2^{P} - 1$. During the readout of the coefficient, the related gray value is registered for synchronicity. At the next internal clock event, the gray values of each group are multiplied by the corresponding coefficients while registering the coefficients in coeff. group 0 ... 5 for the final normalization.
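
The address selection described above reduces to a test of the upper bits of the gray value difference. The following sketch uses the hypothetical name `coefficient_address` and models the disjunction with a right shift.

```python
def coefficient_address(diff, n_bits, p_bits):
    """Map an absolute gray value difference to a LUT address.

    If any of the upper (n_bits - p_bits) bits is set, the difference exceeds
    2**p_bits - 1 and the address of the stored zero coefficient is returned.
    """
    limit = (1 << p_bits) - 1
    # (diff >> p_bits) is nonzero exactly when the OR of the upper bits is 1.
    if n_bits > p_bits and (diff >> p_bits) != 0:
        return limit
    return diff
```

With 12-b input data and P = 8, every difference above 255 is mapped to the address 255, so the LUT does not grow with the input word length.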
The pixel in the center of the filter window does not belong to any group and is processed separately. This pixel is multiplied by the highest coefficient $2^{W} - 1$ and delayed by the registers photo_k middle and geom_in middle for synchronicity.
C. Geometric Component
For the design of the geometric filter component, advantage is taken of its separability and its symmetry. Because of the separability, the geometric filter is split into a vertical and a horizontal part. Therefore, 2-D filtering is replaced by successive 1-D filtering in vertical and horizontal directions. This solution is preferred in the design of the geometric filter because 1-D filtering can be implemented more efficiently. Both parts are implemented twice to filter the weighted image data and the photometric weights simultaneously, which is shown in Fig. 8.
The input of the vertical component parts is the 2-D array of the filter window and the 2-D array of the corresponding coefficients. Each output is a 1-D vector in which each entry represents one filtered and cumulated column. The coefficients of the geometric component are labeled C_0, C_1, C_2. The output of the geometric filter consists of the filtered unnormalized gray value (kernel result) and the normalization factor (norm result).
Due to the symmetry of the weight coefficients of the geometric component, the order of multiplication and addition is swapped in both filter parts. This fact plays an important role in pixel group formation. At first, the weighted gray values which are located at the same distance from the centered pixel in the filter window are summed up [35]. Because of the equal distance, these gray values are weighted with the same coefficient anyway. For a 5 × 5 window, there are always 4 or 8 pixels at the same distance from the centered pixel. For the simplicity of the design, it makes sense to assemble the pixels into equally large groups. Smaller groups allow for better handling of the design. For this reason, the pixels are divided into groups of four with regard to the subsequent processing explained in the following sections. After the accumulation of the pixels according to their symmetry, the sum is multiplied by the corresponding coefficient. The horizontal processing is done in the same way.
Fig. 9. Vertical part of the geometric filter component.
The coefficients for the geometric component are scaled in such a manner that the sum of the vertical coefficients (and the horizontal ones, respectively) is equivalent to the so-called normalized one [35]. For signed coefficients with the word length W, the normalized one is equal to $2^{W-1}$. This means that the division of the weighted gray values and photometric coefficients after geometric filtering can be realized as a simple shift operation. In the last stage, the normalized filtered gray value has to be divided by the normalized product of the photometric coefficients. The geometric coefficients are calculated in advance and stored in a block RAM.
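
The scaling to the normalized one and the swapped order of addition and multiplication can be illustrated with a 1-D sketch. The rounding rule, the correction of the center tap so that the sum equals $2^{W-1}$ exactly, and the indexing of the taps by their distance from the center are assumptions made here; the mapping of these taps to the labels C_0, C_1, C_2 is not specified in the text. Both function names are hypothetical.

```python
import math

def geometric_coefficients(k=5, sigma_c=1.0, w_bits=8):
    """Integer taps indexed by distance 0 .. k//2 from the window center.

    The taps of the full 1-D kernel sum to the normalized one 2**(w_bits - 1).
    """
    half = k // 2
    one = 1 << (w_bits - 1)
    raw = [math.exp(-0.5 * (d / sigma_c) ** 2) for d in range(half + 1)]
    total = raw[0] + 2 * sum(raw[1:])
    coeffs = [round(one * r / total) for r in raw]
    # Assumed correction: the center tap absorbs the rounding error.
    coeffs[0] = one - 2 * sum(coeffs[1:])
    return coeffs

def folded_fir(samples, coeffs, w_bits=8):
    """1-D symmetric filter: add symmetric partners first, then multiply."""
    half = len(coeffs) - 1
    acc = coeffs[0] * samples[half]
    for d in range(1, half + 1):
        acc += coeffs[d] * (samples[half - d] + samples[half + d])
    # Division by the normalized one is a right shift by (w_bits - 1).
    return acc >> (w_bits - 1)
```

For a five-tap kernel, the folding needs three multiplications instead of five, and a constant input passes through unchanged because the taps sum to $2^{W-1}$.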
1) Vertical Component Part: The first stage of the geometric component is the vertical part, which is pictured in Fig. 9. With the aid of Fig. 6, it can be seen that the pixels of the first column numbered 1, 2, 3, 4, 5 and the first pixel of the middle column numbered 11 enter the vertical component part simultaneously. For the corresponding photometric coefficients, the same order of processing is valid.
The groups 0, 1, 2, 3, 4, which cover all columns with the exception of the centered column, are processed as shown in the upper part of Fig. 9. The geometrically symmetrical pixels are first cumulated and then multiplied by the geometric weight coefficient. All coefficients for the geometric filter are constant for the chosen filter window size. Due to the scaling of the geometric coefficients, it is assured that the accumulation does not result in a carry. The registers REGcol 0,1,2 in this part of the design are used to delay weighted data to maintain synchronicity. After the multiplication, the weighted values are summed up by the adder tree to one value at each internal clock event.
Fig. 10. Horizontal part of the geometric filter component.
The processing of the centered column is detailed in the lower part of Fig. 9. The centered pixel is weighted and delayed by REGcen so that this pixel and the remaining pixels in the centered column can be fed to the input of the adder tree simultaneously. The remaining pixels enter the dedicated processing path one by one. They were multiplexed in the register matrix in such a way that they can be combined pairwise and multiplied by the same coefficient in the geometric component. In order to weight the pixels properly, every incoming pixel is stored in the register REGcol mid so that the subsequently calculated sum is valid at every second internal clock event. The multiplexing of the filter coefficients with zeros assures that invalid sums vanish due to the multiplication by zero and do not falsify the result.
As shown in Fig. 8, the vertical part of the geometric filter for the weighting of the photometric coefficients is designed identically.
2) Horizontal Component Part: In Fig. 10, the horizontal part of the geometric component is displayed. After processing in the vertical dimension, the filter window is reduced to one row, and its elements are computed at one internal clock event each. In order to be able to reuse the symmetrical design, the values of the filtered columns 0, 1, 3, 4 are stored in the shift registers according to the order of their reception. The filtered photometric coefficients are stored in the same way. Since the content of the shift register in the left part of Fig. 10 is valid at every fourth internal clock event, the time domain changes here to the domain of the pixel clock. This domain change is indicated by the dashed line in Fig. 10. All operations on the right-hand side of the dashed line are executed according to the pixel clock.
Fig. 11. Final normalization of the filtered data.
At every pixel clock event, the valid column values are written to the registers which perform the division of the weighted gray values by the normalized one. The division is implemented through a shift operation. The remaining processing is similar to the processing described for the vertical part. The geometrically symmetrical pixels are first cumulated and afterward multiplied by the geometric weight coefficient. For the geometric filtering in the horizontal direction, the same geometric coefficients are used as for the vertical filtering. The final division by the normalized one is performed in the next stage.
D. Normalization
At the final stage, the kernel result has to be normalized by the norm result as shown in Fig. 11. After the final accumulation of these values, they are both divided by the normalized one again. In this manner, the word lengths of the weighted gray values and of the norm are both (W - 1) bits shorter. Finally, after the division, N bits of the final result are forwarded to the output of the bilateral filter.
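
To tie the stages together, the following integer-only sketch processes one complete filter window: photometric weighting through a LUT, separable geometric filtering of both the weighted pixels and the coefficients with a shift by (W - 1) bits after each direction, and the final division. It is a behavioral approximation under stated assumptions, not the pipelined architecture itself: the taps in `GEO`, the LUT quantization in `make_lut`, the guard against a zero norm, and all names are chosen here for illustration.

```python
import math

W_BITS = 8          # coefficient word length W
GEO = [52, 31, 7]   # assumed taps for distances 0, 1, 2; they sum to 2**(W_BITS - 1)

def make_lut(sigma_ph, n_bits=8, w_bits=W_BITS):
    """Quantized photometric coefficients, assuming rounding to nearest."""
    scale = (1 << w_bits) - 1
    return [round(scale * math.exp(-0.5 * (d / sigma_ph) ** 2))
            for d in range(1 << n_bits)]

def filter_window(window, lut, geo=GEO, w_bits=W_BITS, n_bits=8):
    """Integer bilateral filtering of one k x k window of gray values."""
    k = len(window)
    half = k // 2
    shift = w_bits - 1
    center = window[half][half]
    # Photometric stage: coefficient lookup and pixel weighting.
    coeff = [[lut[abs(p - center)] for p in row] for row in window]
    weighted = [[p * c for p, c in zip(prow, crow)]
                for prow, crow in zip(window, coeff)]
    taps = [geo[abs(i - half)] for i in range(k)]

    def separable(values):
        # Vertical pass, then horizontal pass; each followed by a shift
        # that divides by the normalized one.
        columns = [sum(taps[r] * values[r][c] for r in range(k)) >> shift
                   for c in range(k)]
        return sum(taps[c] * columns[c] for c in range(k)) >> shift

    kernel_result = separable(weighted)
    norm_result = separable(coeff)
    if norm_result == 0:      # assumed guard, not part of the description
        return center
    # Keep N bits of the quotient.
    return min(kernel_result // norm_result, (1 << n_bits) - 1)
```

A window of constant gray value is returned unchanged, and a single outlier whose photometric coefficient quantizes to zero does not influence the result.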
E. Design Scalability
In the previous sections, we detailed the filter design for the 5 × 5 kernel. However, depending on the application, another kernel size might be required. For small images, a 3 × 3 window size is more suitable to prevent blurring. Some authors choose to work with a larger kernel of 11 × 11 pixels [36]. Our design can be scaled for different kernel sizes. Starting at the register matrix, it has to be dimensioned according to the required kernel size. The kernel size in one dimension is denoted by K in the following:
$$N_{\mathrm{groups}} = K + 1 \tag{5}$$
where $N_{\mathrm{groups}}$ means the number of pixel groups. The number of line storages equals K. The number of required multiplexers equals $N_{\mathrm{groups}}$. The multiplexing pattern of the pixels remains unchanged for every kernel size. According to the symmetry of the kernel, the pixels have to be grouped into $N_{\mathrm{groups}}$ groups containing $n_{\mathrm{group\_member}}$ pixels each
$$n_{\mathrm{group\_member}} = K - 1. \tag{6}$$
The groups are always built up in such a manner that each row except for the middle pixel forms a pixel group. The middle column represents the last pixel group, in which particular attention has to be paid to the arrangement of the pixels in order to keep the weighting in the geometric component valid.
Furthermore, the number of pipelines, including combinatory blocks and coefficient LUTs in the photometric component, equals $N_{\mathrm{groups}}$. The design of the pipelines remains the same. The number of pipelines in the vertical part of the geometric component changes according to the kernel size. For the structure in the upper part of Fig. 9, (K + 1)/2 pipelines are required because the geometrical symmetry of the pixels has to be taken into account. The lower part of the vertical geometric component remains unchanged except for the multiplexer, which has $n_{\mathrm{group\_member}}$ inputs according to the required filter window size. The shift register of the horizontal part of the geometric component has to be dimensioned for (K - 1) values. The number of connected pipelines has to be adjusted to the length of the shift register, taking the geometrical symmetry into account again. The processing of the centered column remains unchanged. The same holds for the normalization coefficients as well.
Finally, if the maximal operating frequency $f_{\mathrm{operating}}$ is known, the internal clock frequency $f_{\mathrm{internal}}$ can be determined as follows:
$$f_{\mathrm{internal}} = f_{\mathrm{operating}} \cdot n_{\mathrm{group\_member}}. \tag{7}$$
According to the internal clock frequency $f_{\mathrm{internal}}$, the counter which generates the select signal for the multiplexers and the enable signal EnREG for the horizontal part of the geometric component has to be adjusted.
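
The scaling rules of this section, including (5) to (7), can be collected in a small helper. The function name `scaled_design` and the dictionary keys are hypothetical; the restriction to odd K of at least 3 is an assumption that follows from the need for a centered pixel.

```python
def scaled_design(k, f_operating_hz):
    """Derive the structural parameters for a K x K kernel."""
    if k < 3 or k % 2 == 0:
        raise ValueError("kernel size K must be odd and at least 3")
    return {
        "n_groups": k + 1,                          # (5)
        "n_group_member": k - 1,                    # (6)
        "line_storages": k,
        "multiplexers": k + 1,
        "photometric_pipelines": k + 1,
        "vertical_pipelines": (k + 1) // 2,         # upper part of Fig. 9
        "shift_register_length": k - 1,             # horizontal part
        "f_internal_hz": f_operating_hz * (k - 1),  # (7)
    }
```

For K = 5 and an operating frequency of 40 MHz, this gives six groups of four pixels, i.e., 24 neighbors, and an internal clock of 160 MHz.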
V. IMAGE QUALITY ASSESSMENT
To evaluate the performance of the noise reduction and the accuracy of the detail preservation, criteria for the image quality assessment are required. The criteria chosen in this work are $\mathrm{PSNR}_{\mathrm{dB}}$ and MSSIM.
1) $\mathrm{PSNR}_{\mathrm{dB}}$: The well-known peak signal-to-noise ratio $\mathrm{PSNR}_{\mathrm{dB}}$ in decibels is defined as follows:
$$\mathrm{PSNR}_{\mathrm{dB}} = 20 \log_{10}\left(\frac{GV_{\max}}{\sqrt{\mathrm{MSE}}}\right) \tag{8}$$
$$\mathrm{MSE} = \frac{1}{MN} \sum_{\mathbf{m}} \big(\Phi_{\mathrm{ref}}(\mathbf{m}) - \Phi(\mathbf{m})\big)^{2} \tag{9}$$
where MSE denotes the mean squared error between the image to be compared and the reference image. $GV_{\max}$ represents the maximum gray value depending on the word length after the digitization of the images. The noiseless M × N image with gray values $\Phi_{\mathrm{ref}}(\mathbf{m})$ provides the reference for the measurement of the MSE. The gray values $\Phi(\mathbf{m})$ originate from the image to be compared. Considering the quality of the noise filter, $\mathrm{PSNR}_{\mathrm{dB}}$ describes the capability of the filter to suppress noise regardless of the perceived visual quality of the filtered image.
2) MSSIM: The mean structural similarity index MSSIM is a method for the assessment of the image quality that takes advantage of the characteristics of the human visual system [37]. First, the local structural similarity SSIM of the 11 × 11 image blocks $v(\Phi_{\mathrm{ref}})$ and $v(\Phi)$ is calculated
$$\mathrm{SSIM}\big(v(\Phi_{\mathrm{ref}}), v(\Phi)\big) = l\big(v(\Phi_{\mathrm{ref}}), v(\Phi)\big) \cdot c\big(v(\Phi_{\mathrm{ref}}), v(\Phi)\big) \cdot s\big(v(\Phi_{\mathrm{ref}}), v(\Phi)\big) \tag{10}$$
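where $l(v(\Phi_{\mathrm{ref}}), v(\Phi))$, $c(v(\Phi_{\mathrm{ref}}), v(\Phi))$, and $s(v(\Phi_{\mathrm{ref}}), v(\Phi))$ compare the luminance, the contrast, and the structure of the two blocks, respectively. Then, with $J$ denoting the number of image blocks, the mean structural similarity
$$\mathrm{MSSIM}(\Phi_{\mathrm{ref}}, \Phi) = \frac{1}{J} \sum_{j=1}^{J} \mathrm{SSIM}\big(v_j(\Phi_{\mathrm{ref}}), v_j(\Phi)\big) \tag{11}$$
of an entire image represented by $\Phi$ is identified. The value MSSIM = 1 means that two images are completely identical. The smaller the MSSIM, the less structural similarity the two images show. The detailed description of MSSIM can be found in [37].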
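
The PSNR criterion of (8) and (9) can be computed with a few lines. The sketch below assumes images given as equally sized 2-D lists of gray values and uses the hypothetical name `psnr_db`; MSSIM is omitted because its block statistics are specified in [37] rather than here.

```python
import math

def psnr_db(reference, image, gv_max=255):
    """Peak signal-to-noise ratio in decibels according to (8) and (9)."""
    rows = len(reference)
    cols = len(reference[0])
    mse = sum((reference[m][n] - image[m][n]) ** 2
              for m in range(rows) for n in range(cols)) / (rows * cols)
    if mse == 0:
        return math.inf  # identical images
    return 20 * math.log10(gv_max / math.sqrt(mse))
```

For an 8-b image, an MSE of 1 corresponds to about 48.13 dB.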
VI. RESULTS
After an implementation in Matlab, the proposed architecture of the bilateral filter was implemented in VHDL and simulated with ModelSim. A test image was filtered by the Matlab implementation as well as by the ModelSim simulation, and the filtered images were compared. The purpose of this comparison is to analyze the image quality drop due to the quantization of the filter coefficients in our FPGA design.
The test image Lighthouse shown in Fig. 12(a) is an 8-b grayscale image with a size of 512 × 512 pixels. Hence, in the following, $GV_{\max} = 255$ is used.
In order to apply the bilateral filter to a color image, the color data have to be transformed into the CIELab color space [1]. The structure of the filter remains unchanged. However, processing of color images is beyond our research interest, so no results on this topic will be reported.
A. Performance Analysis
For the comparison of the filtering capability between the Matlab implementation and the ModelSim simulation, Gaussian noise with standard deviation $\sigma_{\mathrm{noise}} \in \{10, 20, 30, 40, 50, 60\}$ was added to the test image.
In Fig. 12, the test image is contrasted with its noisy counterpart with $\sigma_{\mathrm{noise}} = 20$ and two filtered images. The filter parameters $\sigma_{\mathrm{ph}} = 3\sigma_{\mathrm{noise}}$ and $\sigma_{c} = 1$ were chosen for the photometric and geometric components, respectively. For filtering in Matlab, no quantization of the filter coefficients was applied. The corresponding filtered image is shown in Fig. 12(c). For the simulation with ModelSim, the coefficient word length W = 8 was used. The simulation result is shown in Fig. 12(d).
Fig. 12. (a) Original image. (b) Noisy image with $\sigma_{\mathrm{noise}} = 20$. (c) Filtering in Matlab. (d) Filtering in ModelSim.
Fig. 13. Performance comparison of the Matlab implementation and the ModelSim simulation.
Between the Matlab implementation and the ModelSim simulation, no visually distinguishable difference can be registered.
The results of the quantitative comparison between the Matlab implementation and the ModelSim simulation are contrasted in Fig. 13 and summarized in Table I. As our recent research shows, by adjusting $\sigma_{\mathrm{ph}}$ as a multiple of the measured standard deviation of noise rather than by a single constant, even better $\mathrm{PSNR}_{\mathrm{dB}}$ can be achieved. Thus, an optimal setting for the filter can be chosen which reduces noise and prevents blurring at the same time as far as possible. Exceeding this point causes oversmoothing, and choosing the adjusting parameter below this point leads to insufficient noise suppression. The discussion of this topic is important but beyond the scope of this paper. For more details, refer to [38].
Fig. 13 reveals that, for increasing noise levels, $\mathrm{PSNR}_{\mathrm{dB}}$ and MSSIM both increase after noise filtering. For higher standard deviation of noise, the gain is higher. Using our setting $\sigma_{\mathrm{ph}} = 3\sigma_{\mathrm{noise}}$, averaging with higher weights is performed for increasing noise levels. Owing to this fact, $\mathrm{PSNR}_{\mathrm{dB}}$ rises by a higher amount. MSSIM also increases because the geometric component remains narrow, preventing oversmoothing.
TABLE I
FILTERING RESULTS
TABLE II
SYNTHESIS RESULT
The numbers in Table I show that applying the presented filter architecture delivers results almost as good as those of the Matlab implementation. The slight decrease of the image quality in the ModelSim simulation is explained by coefficient quantization and by rounding of the internal values during the shift operations. No artifacts caused by quantization are introduced into the filtered image. In summary, the simulation results are highly satisfying.
B. Verification
For verification, a Virtex-5 FPGA platform equipped with a Virtex XC5VLX50-1 device was used. The shortened synthesis report of the filter design is shown in Table II. A long-term trial proved that the design is suitable for real-time processing. The FPGA board was connected to a camera with a 12-b resolution depth, generating 30 fps at a full resolution of 1024 × 1024 pixels.
Due to the technical specification of the camera, pauses between the frames are necessary so that 30 fps is the maximally achievable frame rate. Thus, the maximal data flow reaches approximately 31.5 Mpixel/s. Consequently, we restricted the clock frequency of our design to 40 MHz in this application. The internal clock frequency is 160 MHz. With this clock rate, a maximal throughput of 38 fps is possible.
With a different camera, an even higher frame rate is achievable. Using our FPGA platform, the maximal possible internal frequency shown in Table II is 220 MHz. Hence, the maximal operating frequency of our filter design with the contemplated FPGA Virtex-5 equals 55 MHz. Considering the image resolution of 1024 × 1024 pixels, the following frame rate can be computed:
$$\left((1024 \times 1024)\,\frac{\text{pixels}}{\text{frame}} \cdot 18.18\,\frac{\text{ns}}{\text{pixel}}\right)^{-1} = 52.45\,\frac{\text{frames}}{\text{second}}. \tag{12}$$
This calculation is valid only for a throughput of 1 pixel/cycle, which is given by our design.
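
The arithmetic behind (12) and the 38 fps quoted above can be reproduced with a short helper. It assumes a throughput of one pixel per operating clock cycle and no pauses between frames; the function name `frame_rate` is hypothetical.

```python
def frame_rate(f_internal_hz, k, width, height):
    """Frames per second at 1 pixel per operating clock cycle."""
    # (7) rearranged: the operating clock is the internal clock divided by K - 1.
    f_operating_hz = f_internal_hz / (k - 1)
    return f_operating_hz / (width * height)
```

With an internal clock of 220 MHz and K = 5, `frame_rate(220e6, 5, 1024, 1024)` evaluates to about 52.45 fps; with 160 MHz, it evaluates to about 38.1 fps.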
TABLE III
CITED FPGA IMPLEMENTATIONS OF THE BILATERAL FILTER
The total delay of the output pixels of our architecture with a kernel size of 5 × 5 pixels applied to an image of 512 × 512 pixels is 2560 + 36 cycles. The time required for filling up the register matrix, depending on the kernel size and image width, results in a delay of 5 × 512 = 2560 cycles. The processing time from the multiplexers in the register matrix to the output of the normalization stage is constant and does not depend on the kernel size. The critical operations are performed at the internal clock frequency. If the kernel size is changed, the pixel groups have to be reordered, and the internal clock has to be adjusted according to (7). In this case, the processing time still amounts to 36 cycles. The normalization by division costs 24 cycles, which is about two thirds (66%) of the whole processing time.
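
The delay budget can be written as a one-line model. The sketch assumes that the fill-up time is exactly K image lines and treats the 36 processing cycles as a constant, as stated above; the function name `total_delay_cycles` is hypothetical.

```python
def total_delay_cycles(k, width, processing_cycles=36):
    """Fill-up time of the line storages plus the constant processing delay."""
    fill_up = k * width  # K image lines of `width` pixels each
    return fill_up + processing_cycles
```

For K = 5 and an image width of 512 pixels, `total_delay_cycles(5, 512)` returns 2596 cycles, of which only the fill-up term changes with the kernel size or the image width.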
For the evaluation of the performance of the filter design, a comparison with other implementations from the references is given in Table III. Except for the authors of [32], all other authors implement the original bilateral filter from [1]. From [32], the fully parallel architecture is used for the comparison in Table III. All filters are implemented on FPGAs of different families and generations, which limits the significance of the comparison. Still, itemizing some features like the maximum clock frequency of the design or the resource demand might give a good insight.
Our design works at the highest clock frequency. However, considering the kernel size of 5 × 5 pixels and the switching of the time domain, our architecture presents only the third highest frame rate. The picture changes if we implement a 3 × 3 filter kernel. In this case, the operating frequency is 110 MHz, and the resulting frame rate doubles, which puts the performance of our design in second place.
Regarding the resource demand, it should be clear that the logic elements of Altera and the logic slices of Xilinx are built differently. The values in Table III give merely a hint at the FPGA area used by each design. On the other hand, the number of required multipliers can be compared directly. In [30], the number of multipliers is not available. According to the statement of the authors of [33], an efficient parallel implementation of a bilateral filter for a 5 × 5 mask requires 25 multipliers. We have shown that our design concept is efficient and requires only 23 multipliers. Therefore, considering the implemented window size of 5 × 5 pixels, we use the resources more economically.
VII. CONCLUSION
In this paper, we have given a detailed description of an
FPGA design of the bilateral filter for real-time image
processing. The advantages of our design can be summarized in
the following points.
1) The filter design for a kernel size of 5 × 5 shown here
utilizes the FPGA resources economically, which makes
it feasible to implement the filter on a common
medium-sized FPGA.
2) The introduced register matrix at the first stage of the
filter makes external image storage redundant,
contributing to the decrease of the resource demand of the filter
implementation.
3) The shown architecture is synchronous and capable of
real-time processing supporting high clock frequencies.
The maximum operating frequency depends on the chosen
FPGA family.
4) In conceiving our filter architecture, we kept in mind the
scalability of the design in order to enable the
implementation of arbitrary filter window sizes with low effort.
5) The shown filter architecture assures a constant
processing delay independent of the filter window size. The total
delay is the sum of the processing delay and the fill-up
time of the line storage, which depends on the kernel size
and image width.
6) Image quality assessment in terms of PSNR (in dB) and
structural similarity confirmed that the image quality loss due
to coefficient quantization and to rounding of the
internal results is negligible.
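For reference, the PSNR used in point 6) can be computed as in the following Python sketch. It assumes 8-bit gray values (maximum 255) and the common definition PSNR = 10 log10(MAX^2/MSE). The function name and the small sample images are hypothetical and serve only to show how a floating-point result might be compared with its rounded integer counterpart.

```python
import math


def psnr_db(reference, test, max_value=255):
    """PSNR in dB between two equally sized gray-value images (nested lists)."""
    ref_flat = [p for row in reference for p in row]
    test_flat = [p for row in test for p in row]
    if not ref_flat or len(ref_flat) != len(test_flat):
        raise ValueError("images must be non-empty and of equal size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref_flat, test_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2 x 2 example: a floating-point filter output and the same
# output after rounding to integers, as an integer-only datapath would do.
float_result = [[100.4, 101.6], [99.5, 100.2]]
fixed_result = [[round(v) for v in row] for row in float_result]

print(f"PSNR of rounded vs. floating-point result: "
      f"{psnr_db(float_result, fixed_result):.1f} dB")
```

A high PSNR between the two results indicates that rounding alone introduces only a small error. The structural similarity index of [37] would be evaluated separately.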
REFERENCES
[1] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color
images," in Proc. IEEE ICCV, 1998, pp. 839–846.
[2] B. Zhang and J. P. Allebach, "Adaptive bilateral filter for sharpness
enhancement and noise removal," IEEE Trans. Image Process., vol. 17,
no. 5, pp. 664–678, May 2008.
[3] B. Yan and A.-D. Saleh, "Structure enhancing bilateral filtering of
images," in Proc. IEEE PCSPA, 2010, pp. 614–617.
[4] M. de-Frutos-López, H. Medina-Chanca, S. Sanz-Rodríguez,
C. Peláez-Moreno, and F. Díaz-de-María, "Perceptually-aware bilateral
filter for quality improvement in low bit rate video coding," in Proc.
IEEE PCS, 2012, pp. 477–480.
[5] J. Won Lee, R.-H. Park, and S. Chang, "Noise reduction and adaptive
contrast enhancement for local tone mapping," IEEE Trans. Consum.
Electron., vol. 58, no. 2, pp. 578–586, May 2012.
[6] J. Giraldo, Z. Kelm, L. Yu, J. Fletcher, B. Erickson, and C. McCollough,
"Comparative study of two image space noise reduction methods for
computed tomography: Bilateral filter and nonlocal means," in Proc. Conf.
IEEE EMBS, 2009, pp. 3529–3532.
[7] L. Yu, A. Manduca, J. Trzasko, N. Khaylova, J. Kofler, C. McCollough,
and J. Fletcher, "Sinogram smoothing with bilateral filtering for
low-dose CT," in Proc. SPIE Med. Imag.: Phys. Med. Imag., 2008, vol. 6913,
pp. 691329-1–691329-8.
[8] A. Gabiger, R. Weigel, S. Oeckl, and P. Schmitt, "Enhancement of CT
image quality via bilateral filtering of projections," in Proc. 1st Int. Conf.
Image Formation X-ray Comput. Tomography, 2010, pp. 140–143.
[9] A. Gabiger-Rose, R. Rose, M. Kube, P. Schmitt, and R. Weigel, "Noise
adaptive bilateral filtering of projections for computed tomography," in
Proc. 11th Int. Meet. Fully Three-Dimens. Image Reconstruction Radiol.
Nucl. Med., 2011, pp. 306–309.
[10] A. Gabiger, M. Kube, and R. Weigel, "A synchronous FPGA design of
a bilateral filter for image processing," in Proc. IEEE IECON, 2009,
pp. 1990–1995.
[11] T. Riesgo, Y. Torroja, and E. de la Torre, "Design methodologies based
on hardware description languages," IEEE Trans. Ind. Electron., vol. 46,
no. 1, pp. 3–12, Feb. 1999.
[12] T. Q. Pham and L. J. van Vliet, "Separable bilateral filtering for fast video
preprocessing," in Proc. IEEE ICME, 2005, pp. 1–4.
[13] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of
high-dynamic-range images," ACM Trans. Graph., vol. 21, no. 3, pp. 257–266,
Jul. 2002.
[14] S. Paris and F. Durand, "A fast approximation of the bilateral filter using
a signal processing approach," in Proc. ECCV, 2006, pp. 568–580.
[15] J. Chen, S. Paris, and F. Durand, "Real-time edge-aware image processing
with the bilateral grid," ACM Trans. Graph., vol. 26, no. 3, pp. 1–9,
Jul. 2007.
[16] Q. Yang, K.-H. Tan, and N. Ahuja, "Real-time O(1) bilateral filtering," in
Proc. IEEE CVPR, 2009, pp. 557–564.
[17] M. M. Bronstein, "Lazy sliding window implementation of the bilateral
filter on parallel architectures," IEEE Trans. Image Process., vol. 20, no. 6,
pp. 1751–1756, Jun. 2011.
[18] B. Weiss, "Fast median and bilateral filtering," ACM Trans. Graph.,
vol. 25, no. 3, pp. 519–526, Jul. 2006.
[19] F. Porikli, "Constant time O(1) bilateral filtering," in Proc. IEEE CVPR,
2008, pp. 1–8.
[20] Y.-C. Tseng, P.-H. Hsu, and T.-S. Chang, "A 124 Mpixels/sec VLSI
design for histogram-based joint bilateral filtering," IEEE Trans. Image
Process., vol. 20, no. 11, pp. 3231–3241, Nov. 2011.
[21] F. Hannig, M. Schmid, J. Teich, and H. Hornegger, "A deeply pipelined
and parallel architecture for denoising medical images," in Proc. IEEE
FPT, 2010, pp. 485–490.
[22] L. Costas, P. Colodrón, J. J. Rodríguez-Andina, J. Fariña, and
M.-Y. Chow, "Analysis of two FPGA design methodologies applied to
an image processing system," in Proc. IEEE ISIE, 2010, pp. 3040–3044.
[23] N. Sudha and A. R. Mohan, "Hardware-efficient image-based robotic path
planning in a dynamic environment and its FPGA implementation," IEEE
Trans. Ind. Electron., vol. 58, no. 5, pp. 1907–1920, May 2011.
[24] R. Marin, G. León, R. Wirz, J. Sales, J. M. Claver, P. J. Sanz, and
J. Fernández, "Remote programming of network robots within the UJI
industrial robotics telelaboratory: FPGA vision and SNRP network
protocol," IEEE Trans. Ind. Electron., vol. 56, no. 12, pp. 4806–4816, Dec. 2009.
[25] E. Monmasson and M. N. Cirstea, "FPGA design methodology for
industrial control systems - A review," IEEE Trans. Ind. Electron., vol. 54,
no. 4, pp. 1824–1842, Aug. 2007.
[26] J. J. Rodriguez-Andina, M. J. Moure, and M. D. Valdes, "Features, design
tools, and application domains of FPGAs," IEEE Trans. Ind. Electron.,
vol. 54, no. 4, pp. 1810–1823, Aug. 2007.
[27] H. Zhuang, K.-S. Low, and W.-Y. Yau, "Multichannel pulse-coupled
neural-network-based color image segmentation for object detection,"
IEEE Trans. Ind. Electron., vol. 59, no. 8, pp. 3299–3308, Aug. 2012.
[28] S. Jin, D. Kim, T. T. Nguyen, D. Kim, M. Kim, and J. W. Jeon, "Design and
implementation of a pipelined datapath for high-speed face detection using
FPGA," IEEE Trans. Ind. Informat., vol. 8, no. 1, pp. 158–167, Feb. 2012.
[29] Y. Chen and V. Dinavahi, "Digital hardware emulation of universal
machine and universal line models for real-time electromagnetic transient
simulation," IEEE Trans. Ind. Electron., vol. 59, no. 2, pp. 1300–1309,
Feb. 2012.
[30] C. Charoensak and F. Sattar, "FPGA design of a real-time implementation
of dynamic range compression for improving television picture," in Proc.
IEEE ICICS, 2007, pp. 1–5.
[31] A. Rosado-Muñoz, M. Bataller-Mompeán, E. Soria-Olivas, C. Scarante,
and J. F. Guerrero-Martínez, "FPGA implementation of an adaptive filter
robust to impulsive noise: Two approaches," IEEE Trans. Ind. Electron.,
vol. 58, no. 3, pp. 860–870, Mar. 2011.
[32] T. Q. Vinh, J. H. Park, Y.-C. Kim, and S. H. Hong, "FPGA implementation
of real-time edge-preserving filter for video noise reduction," in Proc.
IEEE ICCEE, 2008, pp. 611–614.
[33] H. Dutta, F. Hannig, J. Teich, B. Heigl, and H. Hornegger, "A design
methodology for hardware acceleration of adaptive filter algorithms in
image processing," in Proc. IEEE ASAP, 2006, pp. 331–340.
[34] R. Chen, L. Chen, and L. Chen, "System design consideration for digital
wheelchair controller," IEEE Trans. Ind. Electron., vol. 47, no. 4,
pp. 898–907, Aug. 2000.
[35] R. Turney, "Two-dimensional linear filtering," in Application Note: Xilinx
FPGAs, 2007, pp. 1–8.
[36] M. Zhang and B. K. Gunturk, "Multiresolution bilateral filter for image
denoising," IEEE Trans. Image Process., vol. 17, no. 12, pp. 2324–2333,
Dec. 2008.
[37] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality
assessment: From error visibility to structural similarity," IEEE Trans. Image
Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[38] A. Gabiger-Rose, M. Kube, P. Schmitt, R. Weigel, and R. Rose, "Image
denoising using bilateral filter with noise-adaptive parameter tuning," in
Proc. IEEE IECON, 2011, pp. 4515–4520.
Anna Gabiger-Rose (S'09) was born in
Ordshonikidse, Ukraine, in 1978. She received the
Dipl.-Ing. degree in electrical engineering, electro-
nics, and information technology from the Friedrich-
Alexander University of Erlangen-Nuremberg,
Erlangen, Germany, in 2007.
From 2001 to 2007, she was a Student Assistant
with the Department of Contactless Test and Mea-
suring Systems, Fraunhofer Institute for Integrated
Circuits, Erlangen. She is currently a Research As-
sistant with the Institute for Electronics Engineering,
University of Erlangen-Nuremberg. Her research interests include the design of
embedded systems for image processing and the investigation of digital filtering
techniques for image quality enhancement.
Mrs. Gabiger-Rose is a member of the IEEE Industrial Electronics Society.
She served as a reviewer for the 35th Annual Conference of the IEEE Industrial
Electronics Society (IECON'09).
Matthias Kube was born in Mainz, Germany, in
1975. He received the Dipl.-Ing. FH (M.Sc.) degree
in electrical engineering and microelectronics from
the Georg-Simon-Ohm University of Applied Sci-
ence of Nuremberg, Nuremberg, Germany, in 2002.
Since 2003, he has been working as a member of
the research staff at the Department of Contactless
Test and Measuring Systems, Fraunhofer Institute
for Integrated Circuits, Erlangen, Germany. He has
the technical leadership for the development of an
innovative indirect converting X-ray detector with
conventional optical sensors for scientific and industrial applications of
nondestructive testing (NDT), which is optimized for tasks that require a high
dynamic range, a high speed, and a long life cycle. His research interests
include optical sensors and cameras, field-programmable-gate-array design,
embedded systems for image processing, and X-ray imaging for NDT.
Robert Weigel (S'88–M'89–SM'95–F'02) was born
in Ebermannstadt, Germany, in 1956. He received
the Dr.-Ing. and Dr.-Ing.habil. degrees in electrical
engineering and computer science from the Mu-
nich University of Technology, Munich, Germany, in
1989 and 1992, respectively.
He was a Research Engineer from 1982 to 1988,
a Senior Research Engineer from 1988 to 1994, and
a Professor for RF Circuits and Systems from 1994
to 1996 with the Munich University of Technology.
From 1996 to 2002, he was the Director of the
Institute for Communications and Information Engineering, University of Linz,
Linz, Austria. Since 2002, he has been the Head of the Institute for Electronics
Engineering, University of Erlangen-Nuremberg, Erlangen, Germany.
Dr. Weigel was the recipient of the IEEE Microwave Applications Award in
2007. Within IEEE Microwave Theory and Techniques Society (MTT-S), he has
been the Founder and Chair of the Austrian Communications/Microwave Theory
and Techniques Society Joint Chapter and Region 8 Coordinator. He is the Chair
of MTT-2 Microwave Acoustics and the MTT-S President-Elect in 2013.
Richard Rose (S'09) was born in Nuremberg,
Germany, in 1981. He received the Dipl.-Ing. degree
in electrical engineering, electronics, and informa-
tion technology from the Friedrich-Alexander Uni-
versity of Erlangen-Nuremberg, Erlangen, Germany,
in 2007.
In 2008, he joined the Institute for Electronics
Engineering, University of Erlangen-Nuremberg, as
a Research Assistant, and since 2010, he has been the
Team Leader of the System Engineering group. His
research interests include digital signal processing,
receiver design, antenna design, localization techniques, and wireless commu-
nication systems.
Mr. Rose is a member of the IEEE Microwave Theory and Techniques So-
ciety, the IEEE Signal Processing Society, the IEEE Antennas and Propagation
Society, and the IEEE Communications Society. He served as a reviewer for the
journal of Mathematical Problems in Engineering and the International Journal
of Electronics and Communications.