
An FPGA Implementation of 3D Numerical Simulations on a 2D SIMD Array Processor

Yutaro Ishigaki, Yoichi Tomioka, Tsugumichi Shibata, and Hitoshi Kitazawa


Email: 50014645202@st.tuat.ac.jp ytomioka@cc.tuat.ac.jp
shibata.tsugumichi@lab.ntt.co.jp kitazawa@cc.tuat.ac.jp
Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei, Tokyo, Japan
NTT Device Technology Laboratories, 3-1 Morinosato Wakamiya, Atsugi, Kanagawa, Japan

Abstract: Three-dimensional (3D) numerical simulation is an indispensable technique for various analyses of physical phenomena, but it generally requires a large amount of computation. In this paper, we propose an FPGA-based accelerator for 3D numerical simulations and focus on acceleration of the 3D finite-difference time-domain (FDTD) method. This accelerator consists of a 2D single instruction multiple data (SIMD) array processor, and it can execute 3D parallel computing with little data transfer overhead by applying a virtual processing-elements cuboid (VPEC) with synchronous shift data transfer. We demonstrate that the experimental hardware implemented on an Altera Stratix V FPGA (5SGSMD5K2F40C2N) is 3.1 times faster than parallel computing on the NVIDIA Tesla C2075, and that it reaches a 94.57% operating rate of the calculation units for the computation of the 3D FDTD method. The proposed accelerator is also suitable for multi-chip composition.

I. INTRODUCTION

Three-dimensional (3D) numerical simulation is an indispensable technique for various analyses of physical phenomena. However, it generally requires a large amount of computation, so acceleration of 3D numerical simulation is important for many fields of science and engineering.

As an approach to accelerating numerical simulations, parallel processing based on general-purpose computing on graphics processing units (GPGPU) [1] has been proposed and is now widely used. The GPU has many processing cores, and when all of them operate efficiently, the GPU achieves very high performance. However, simulations of physical phenomena generally require a considerable amount of memory access, i.e., multiple data accesses per unit operation, and GPU computing suffers from a memory-access bottleneck caused by the memory structure of GPUs [2]. Moreover, tuning to reduce this bottleneck and achieve the best performance is difficult without deep knowledge of GPU computing.

Another approach to such acceleration is to implement an accelerator dedicated to numerical simulations on field-programmable gate arrays (FPGAs), whose performance is advancing rapidly. Several FPGA-based accelerators for numerical simulations have been proposed in [2]-[4]. However, the accelerators proposed in [2] and [3] cannot solve 3D problems. The scalable array processor proposed in [4] can execute 3D parallel computing, but its performance is limited by the DRAM's memory bandwidth. Thus, it is not easy for an FPGA-based accelerator to implement 3D parallel processing and achieve high performance.

In this paper, we propose an FPGA-based accelerator for 3D numerical simulations, and we focus on acceleration of the 3D finite-difference time-domain (FDTD) method [5] as a representative of 3D numerical simulations. This accelerator consists of a 2D single instruction multiple data (SIMD) [6] array processor. In our proposal, the 2D SIMD array processor executes 3D parallel computing with little data transfer overhead, and the accelerator can be extended easily to a multi-FPGA implementation. We implement the accelerator on a high-end FPGA and execute the 3D FDTD method for electromagnetic simulations of waveguides. Moreover, we compare its performance with GPGPUs.

II. PARALLEL 3D COMPUTING ON A 2D SIMD ARRAY PROCESSOR

In this section, we describe several techniques for parallel computing to solve 3D numerical simulations. We apply the SIMD architecture as our basic scheme for parallel computing. A SIMD processor consists of multiple PEs that are specialized in calculation. All PEs execute the same instruction simultaneously, but each PE processes different data. The SIMD architecture delivers its best performance when it operates on independent data; thus, SIMD is suitable for problems that have data-level parallelism [6]. A SIMD array processor therefore offers simple control and high performance for numerical simulations.

Basically, spatial parallelism is realized by dividing the analysis domain into 3D computing grids and assigning each of them to a PE, as shown in Fig. 1. The discretized element of the 3D computing grid is called a node or cell. Each PE stores the data of its assigned nodes in its local memory and processes the nodes one by one. Because the processing of a node generally requires its neighboring nodes, PE-to-PE communication is indispensable. Composing a 3D array of PEs is a basic idea for 3D parallel computing. However, a 3D array is not suitable for FPGAs because the structure of an FPGA is 2D, and a 3D array requires a great deal of wiring resources. It is likely to be infeasible especially in the case of a multi-FPGA implementation because it needs significant I/O bandwidth between FPGAs. Therefore, we introduce the virtual processing-elements cuboid (VPEC) technique to execute highly parallel 3D processing on the 2D SIMD array with little data transfer overhead.



In VPEC, the 3D PE array is sliced in the z-direction, and the slices are connected to each other in the x-direction to compose a 2D SIMD array, as shown in Fig. 1. In order to realize VPEC, we employ synchronous shift data transfer [2], [7], which is a technique for communication between the PEs on a SIMD array processor. In this technique, all of the PEs transfer data to adjacent PEs simultaneously in the same direction. The PEs are also able to transfer data to nonadjacent PEs by multiple synchronous shift data transfers. Therefore, z-directional communication on the VPEC is realized by using multiple x-directional synchronous shift data transfers, as shown in Fig. 1. We can save routing resources with this technique.

Fig. 1. Virtual processing-elements cuboid (VPEC): a virtual 3D array of PEs implemented on a 2D SIMD array processor. Z-directional communication is realized by using multiple x-directional synchronous shift data transfers, so z-directional wires are saved. Note that all PEs execute the same data transfer simultaneously in synchronous shift data transfer.
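To make the VPEC mapping concrete, the short C sketch below flattens a virtual PE coordinate (x, y, z) onto the physical 2D array and counts the synchronous shifts needed for a z-directional transfer. The function names and the example dimensions (taken from the floating-point four-stage BPF row of Table I) are ours, and the column ordering, with z-slices laid side by side along x so that W = X x Z and H = Y, is our reading of Fig. 1 rather than code from the accelerator's tool flow.

```c
#include <stdio.h>

/* Virtual PE-cuboid dimensions: X x Y x Z PEs (example values from Table I). */
#define NX 4   /* PEs in x per z-slice              */
#define NY 10  /* PEs in y (= physical rows H)      */
#define NZ 6   /* z-slices concatenated along x     */

/* Physical 2D array: W = NX * NZ columns, H = NY rows (cf. Table I). */
static int phys_col(int x, int z) { return z * NX + x; }
static int phys_row(int y)        { return y; }

/* Number of synchronous x-shifts needed to reach the PE holding (x, y, z +/- 1).
 * Every z-step is NX columns away, which is why the z-directional transfer time
 * grows with the x-directional number of PEs, as noted in Section III. */
static int shifts_for_z_step(void) { return NX; }

int main(void) {
    int x = 1, y = 2, z = 3;
    printf("virtual PE (%d,%d,%d) -> physical (col %d, row %d)\n",
           x, y, z, phys_col(x, z), phys_row(y));
    printf("z-neighbour reached after %d x-directional shifts; "
           "x/y neighbours need 1 shift\n", shifts_for_z_step());
    printf("2D array size: W = %d, H = %d\n", NX * NZ, NY);
    return 0;
}
```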
As shown in Fig. 1, z-directional data transfer takes more clocks than x- and y-directional transfers. However, data transfer between PEs normally does not occur frequently while computing all of the nodes assigned to each PE. Therefore, if the computing time of each PE is longer than the transfer time, we can reduce the time loss of the transfers by controlling the order of calculation and the transfer timing optimally. We show the optimal control of the 3D FDTD method in the next section.

In this paper, the PE architecture is optimized for computation of the 3D FDTD method. We can also apply the VPEC to other numerical simulations, such as heat conduction and fluid dynamics, by optimizing the PE composition for each computation.

III. CONTROL OF THE 3D FDTD METHOD

In this section, we describe the control method of the 3D FDTD method on the VPEC. We decompose each of the electromagnetic field-update equations shown in [5] into two pseudo codes so that they can be calculated with the same pipelined datapath; e.g., the codes for E_x are as follows:

  E_x(i,j,k) \leftarrow E_x(i,j,k) + C_1 \{ H_z(i,j,k) - H_z(i,j-1,k) \},   (1)
  E_x(i,j,k) \leftarrow E_x(i,j,k) - C_2 \{ H_y(i,j,k) - H_y(i,j,k-1) \},   (2)
  C_1 = \Delta t / (\varepsilon \Delta y),  C_2 = \Delta t / (\varepsilon \Delta z),   (3)

where i, j, and k denote the discrete position, \varepsilon indicates the permittivity, \Delta t is the time interval, and \Delta y and \Delta z are the sizes of a node in the y- and z-directions. In the 3D FDTD method, we must execute six field-update computations (E_x, E_y, E_z, H_x, H_y, H_z); i.e., the proposed accelerator executes a dozen pseudo codes. As shown in these codes, the adjacent nodes' data are necessary for updating a node. By this decomposition, the PE updates nodes in a single direction, as shown in Fig. 2. For example, in the case of Code (1), the PE updates nodes in the y-direction.
As shown in Fig. 2, suppose that w x h x d nodes are assigned to each PE. The data of adjacent PEs are required when the PE calculates the nodes on the surface of the PE cuboid in Fig. 2; we call these nodes edge nodes. In our control method, when the PE reads the edge nodes' data for its own calculation, it also transfers them to the adjacent PE, as shown in Fig. 2. Therefore, if the computing time to update each line is longer than each transfer time, there is no time loss. In this control method, the computing time depends on the number of nodes in each direction, and the transfer time is determined by the hardware architecture. Note that the z-directional data transfer time depends on the x-directional number of PEs in the VPEC because of the multiple synchronous shift data transfers. Thus, we optimize the number of nodes and PEs, and the composition of the VPEC, to achieve the best performance.

Fig. 2. All of the PEs update nodes in a single direction. When a PE reads the edge nodes' data for the calculation, it transfers them to the adjacent PE.
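The following C sketch is our illustration of how Code (1) could be scheduled on one PE: the w x h x d assigned nodes are updated line by line in the y-direction, which is exactly when the edge-node values are read and simultaneously forwarded to the neighboring PE. The array layout, loop order, and constant handling are assumptions made for illustration, not the accelerator's actual control code; the block dimensions are taken from the floating-point BPF row of Table I.

```c
/* Decomposed E_x update, Code (1): Ex(i,j,k) += C1 * (Hz(i,j,k) - Hz(i,j-1,k)).
 * One PE holds a w x h x d block of nodes; IDX flattens (i,j,k) into its
 * local memory (layout assumed for this sketch). */
#define W_NODES 5
#define H_NODES 4
#define D_NODES 23
#define IDX(i, j, k) (((k) * H_NODES + (j)) * W_NODES + (i))

static float Ex[W_NODES * H_NODES * D_NODES];
static float Hz[W_NODES * H_NODES * D_NODES];
static float C1[W_NODES * H_NODES * D_NODES];  /* dt / (eps * dy), stored per node here */

void update_ex_code1(void)
{
    for (int k = 0; k < D_NODES; k++) {
        for (int i = 0; i < W_NODES; i++) {
            /* The PE walks each line in the y-direction (Fig. 2).  The sketch
             * skips j = 0, which would use the Hz value received from the
             * y-adjacent PE; in the accelerator that edge-node value is
             * forwarded by the neighbour at the moment the neighbour reads it,
             * so the transfer overlaps with computation and costs no time. */
            for (int j = 1; j < H_NODES; j++) {
                int n = IDX(i, j, k);
                Ex[n] += C1[n] * (Hz[n] - Hz[IDX(i, j - 1, k)]);
            }
        }
    }
}
```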

IV. HARDWARE ARCHITECTURE

We expand the 2D FDTD accelerator [2] to a 3D accelerator with little data transfer overhead. Here we explain the 2D SIMD array architecture and the PE composition for the 3D FDTD method.

A. 2D SIMD Array Processor

In order to implement the VPEC, we compose a 2D SIMD array processor on an FPGA, as shown in Fig. 3. This SIMD processor consists of the 2D arrayed PEs, a control processor (CP), and several peripherals. The PEs are specialized in the calculation of the 3D FDTD method and calculate either 32-bit fixed-point or 32-bit floating-point numbers.

Fig. 3. 2D SIMD array processor. This processor consists of the PE array, the CP, and several peripherals.

The CP controls the PEs' operation and the communication between the PEs and the host PC or between the PEs and the SRAM blocks. The control instructions for the PEs are generated on the host PC and stored in the PE control memory. The CP fetches the instructions from this memory in sequence. The instructions are very long instruction word (VLIW) [8] type operation codes that consist of 80-bit control signals. VLIW allows the PEs to execute multiple operations simultaneously, such as memory read, memory write, calculation, and data transfer.
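The paper states only that each instruction is an 80-bit VLIW word whose fields drive memory reads, memory writes, the calculation units, and data transfer in parallel. The struct below is a purely hypothetical split of those 80 bits, sketched to show the idea of one wide control word steering several PE resources in the same clock; none of the field names or widths come from the paper.

```c
#include <stdint.h>

/* Hypothetical decomposition of the 80-bit VLIW control word.  The real field
 * names, widths, and encodings are not given in the paper; this only
 * illustrates several operations being issued by a single wide word. */
typedef struct {
    unsigned mem_read_addr  : 12;  /* data-memory read address      (assumed) */
    unsigned mem_write_addr : 12;  /* write-back address            (assumed) */
    unsigned const_addr     : 10;  /* constant-memory address       (assumed) */
    unsigned alu_ctrl       : 6;   /* add/sub and mux selects       (assumed) */
    unsigned transfer_ctrl  : 8;   /* shift direction and routing   (assumed) */
    unsigned write_enable   : 4;   /* memory write enables          (assumed) */
    unsigned index_ctrl     : 16;  /* index/boundary control        (assumed) */
    unsigned reserved       : 12;  /* spare bits                    (assumed) */
} vliw_word_t;                     /* control fields total 80 bits            */
```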
B. PE Architecture

Each PE has a pipelined datapath that consists of two adder-subtractors, one multiplier, two dual-port memories, and two single-port memories, as shown in Fig. 4. This datapath is a six-stage pipeline dedicated to the calculation o = a +/- b x (c +/- d). This calculation is suitable for the decomposed field-update computations such as Codes (1) and (2). The PE can execute this calculation with a throughput of one result per clock.

Fig. 4. Pipelined datapath of a PE. This datapath is a six-stage pipeline (address set, memory read, addition/subtraction, multiplication, addition/subtraction, and write back) dedicated to the calculation o = a +/- b x (c +/- d).

In Fig. 4, EMem and HMem are data memories storing the electromagnetic field components. ConstMem stores some constant data, and IndexMem stores the structure data of the computing grid. The structure data control the range of computation and the conditional execution for the boundary of the analysis domain; because the processor is based on the SIMD architecture, such data are essential for the conditional execution.

By controlling OutData, InData, and MUXdir, communication with the four adjacent PEs can be implemented. In addition, multiple synchronous shift data transfers are performed by using MUXshift.
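A small behavioral model of what the datapath computes each clock may be helpful. The C function below evaluates o = a +/- b x (c +/- d) in the same order as the two add/subtract stages around the multiplier in Fig. 4 and shows how Code (1) maps onto the operands; it is our own sketch and ignores the pipelining, the fixed-point mode, and the memory and multiplexer structure.

```c
/* Behavioural model of the PE datapath result: o = a (+/-) b * (c (+/-) d).
 * sub_inner selects c - d, sub_outer selects a - b*(...), mirroring the two
 * add/subtract stages that surround the multiplier in Fig. 4. */
static inline float pe_datapath(float a, float b, float c, float d,
                                int sub_inner, int sub_outer)
{
    float inner = sub_inner ? (c - d) : (c + d);   /* first add/sub stage  */
    float prod  = b * inner;                       /* multiply stage       */
    return sub_outer ? (a - prod) : (a + prod);    /* second add/sub stage */
}

/* Code (1) expressed as one datapath operation:
 *   Ex(i,j,k) <- Ex(i,j,k) + C1 * (Hz(i,j,k) - Hz(i,j-1,k))              */
float ex_update_example(float ex, float c1, float hz, float hz_ym1)
{
    return pe_datapath(ex, c1, hz, hz_ym1, /*sub_inner=*/1, /*sub_outer=*/0);
}
```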
V. IMPLEMENTATION AND PERFORMANCE EVALUATION

In this section, we describe the implementation results and the performance comparison when the proposed accelerator executes electromagnetic simulations. We analyzed the two waveguides shown in Figs. 5 and 6 with the impedance boundary condition [9] at each waveguide's ports.

Fig. 5. Canonical problem of waveguide analysis given in [10]: a four-stage waveguide bandpass filter (BPF).

Fig. 6. An asymmetrical resonant iris given in [9]. This waveguide must be solved by the 3D FDTD method.

A. Synthesis Result

We have implemented the proposed accelerator described in the preceding section on an Altera Stratix V 5SGSMD5K2F40C2N FPGA. The sizes of the VPECs implemented on the FPGA, optimized for each case, are summarized in Table I. In Table I, w, h, and d are the x-, y-, and z-directional numbers of nodes assigned to each PE, respectively; X, Y, and Z are the numbers of PEs in each direction; and W and H are the array sizes shown in Fig. 3. Table II shows the maximum resource utilization for the accelerator compiled with the Altera Quartus II software; in this case, the accelerator is implemented with the floating-point calculation unit, and the array size is optimized for the four-stage BPF. We set the clock of the FPGA to 100 MHz.

B. Electromagnetic Simulation Result

We ran electromagnetic simulations of the waveguides on the accelerator, computing 65,536 time steps with the nodes shown in Figs. 5 and 6. As part of the simulation results, Fig. 7 shows the visualized absolute values of the electric field in the asymmetrical resonant iris.

Fig. 7. An example of the visualized absolute values of the electric field, calculated by |E| = sqrt(E_x^2 + E_y^2 + E_z^2).

C. Performance Comparison with GPGPU

In this section, we compare the performance of the FPGA accelerator with that of GPGPUs. We have implemented the 3D FDTD method on NVIDIA GPUs using CUDA. We set optimal thread and block sizes for CUDA, and one thread updates one node. In addition, we employ the constant memory of each GPU and the reduction technique [11]. The GPUs compute with single-precision (32-bit) floating-point numbers.

TABLE I. THE SIZES OF VPECs.

                  Four-stage BPF (Fig. 5)                          Asymmetrical resonant iris (Fig. 6)
                  Nodes/PE   PEs/direction   2D array size         Nodes/PE   PEs/direction   2D array size
                  w  h  d    X  Y   Z        W   H   Total         w  h  d    X  Y   Z        W   H   Total
Fixed point       5  3  15   4  13  9        36  13  468           4  3  13   4  11  10       40  11  440
Floating point    5  4  23   4  10  6        24  10  240           4  4  19   4  8   7        28  8   224

TABLE II. MAXIMUM RESOURCE UTILIZATION.

Device         Altera Stratix V 5SGSMD5K2F40C2N
ALMs           160,317 / 172,600 (93%)
Registers      99,314 / 690,400 (14%)
DSP blocks     1,200 / 1,590 (75%)
M20K blocks    2,013 / 2,014 (100%)

TABLE III. PERFORMANCE COMPARISON BETWEEN PROPOSED ACCELERATOR AND GPGPU.

                                              Calculation time
Accelerator            Computing unit         Four-stage BPF    Asymmetrical resonant iris
GeForce GTX 780        32-bit floating point  15.02 sec         12.31 sec
Tesla C2075            32-bit floating point  11.08 sec         8.36 sec
Proposed accelerator   32-bit fixed point     1.89 sec          1.34 sec
Proposed accelerator   32-bit floating point  3.75 sec          2.53 sec
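As a quick sanity check on the peak-performance figures and the average speedup quoted below, the following C snippet recomputes them from the formula N_core x N_op x f_clk and the Table III timings. All input numbers come from the paper; the snippet itself is ours.

```c
#include <stdio.h>

int main(void) {
    /* Peak single-precision performance: cores x ops/clock x clock. */
    double tesla_gflops = 448.0 * 2.0 * 1.15;   /* about 1,030 GFLOPS                       */
    double fpga_gflops  = 240.0 * 3.0 * 0.1;    /* 240 PEs x 3 ops x 100 MHz = 72 GFLOPS    */

    /* 32-bit floating-point calculation times from Table III (seconds). */
    double bpf_speedup  = 11.08 / 3.75;         /* four-stage BPF                           */
    double iris_speedup = 8.36 / 2.53;          /* asymmetrical resonant iris               */

    printf("peak: Tesla C2075 %.0f GFLOPS, proposed %.0f GFLOPS (ratio about 1/%.0f)\n",
           tesla_gflops, fpga_gflops, tesla_gflops / fpga_gflops);
    printf("floating-point speedup vs. Tesla C2075: %.2f (BPF), %.2f (iris), average %.2f\n",
           bpf_speedup, iris_speedup, (bpf_speedup + iris_speedup) / 2.0);
    return 0;
}
```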
We computed 65,536 time steps, the same as for the FPGA accelerator. The best performance of the NVIDIA GeForce GTX 780 and the Tesla C2075 is listed in Table III.

The performance of our proposed accelerator is also listed in Table III. The calculation time does not include the transfer time between the host PC and the GPU's global memory or the FPGA. From this table, it is observed that our proposed accelerator is 3.1 times faster than the Tesla C2075 on average for single-precision floating-point calculations.

Each accelerator's peak performance is given by the calculation N_core x N_op x f_clk, where N_core denotes the number of processing cores, N_op the number of operations per clock of each core, and f_clk the operation clock. The Tesla C2075 delivers up to 448 x 2 x 1.15 GHz = 1,030 GFLOPS of peak single-precision performance. On the other hand, the proposed accelerator delivers up to 240 x 3 x 100 MHz = 72 GFLOPS, which is approximately 1/14 of the Tesla C2075. However, its calculation time is about three times shorter than that of the Tesla C2075 because the GPUs suffer from a considerable memory-access bottleneck. In contrast, the proposed accelerator suffers from neither the memory-access bottleneck nor the time loss for the data transfers on the VPEC. That is, the proposed 2D SIMD array processor executes the 3D numerical simulation efficiently; the proposed accelerator reaches a 94.57% operating rate of the calculation units for the computation of the 3D FDTD method.

D. Feasibility of Multi-FPGA Implementation

In order to calculate larger problems, a multi-FPGA implementation is required. By employing the VPEC technique and synchronous shift data transfer, our architecture is suitable for a multiple-FPGA composition because these techniques reduce the I/O bandwidth between FPGAs. In the proposed composition, the communication is limited to local data transfers between PEs, so the timing constraint is not severe compared with an architecture using global communication. Moreover, the hardware architecture and the control method are basically the same as those of the single-chip implementation, independent of the number of FPGAs.

VI. CONCLUSIONS

In this paper, we proposed the virtual processing-elements cuboid (VPEC), which executes parallel 3D computing on a 2D SIMD array processor. With the VPEC, routing resources can be saved by using synchronous shift data transfer. On the basis of these techniques, we implemented an accelerator for the 3D FDTD method on an Altera Stratix V FPGA (5SGSMD5K2F40C2N). It was 3.1 times faster than the Tesla C2075 for 32-bit floating-point computation on the electromagnetic simulations of the waveguides. The proposed 2D SIMD array processor executes the 3D FDTD method efficiently without the memory-access bottleneck and the time loss for data transfer, and it reaches a 94.57% operating rate of the calculation units for the computation of the 3D FDTD method. The proposed accelerator is likely to be able to calculate larger problems through the use of multiple FPGAs, and it is suitable for multi-chip composition by applying the VPEC technique with synchronous shift data transfer.

REFERENCES

[1] M. Livesey, J. F. Stack, Jr., F. Costen, T. Nanri, N. Nakashima, and S. Fujino, "Development of a CUDA implementation of the 3D FDTD method," IEEE Antennas and Propagation Magazine, vol. 54, no. 5, pp. 186-195, Oct. 2012.
[2] R. Takasu, Y. Tomioka, Y. Ishigaki, N. Li, T. Shibata, M. Nakanishi, and H. Kitazawa, "An FPGA implementation of the two-dimensional FDTD method and its performance comparison with GPGPU," IEICE Trans. on Electronics, vol. E97-C, no. 7, Jul. 2014.
[3] K. Sano, High-Performance Computing Using FPGAs (chapter "FPGA-based Systolic Computational-Memory Array for Scalable Stencil Computations"). Springer, 2013, pp. 279-303.
[4] K. Sano, Y. Hatsuda, and S. Yamamoto, "Multi-FPGA accelerator for scalable stencil computation with constant memory bandwidth," IEEE Trans. on Parallel and Distributed Systems, vol. 25, no. 3, pp. 695-705, Mar. 2014.
[5] K. S. Yee, "Numerical solution of initial boundary value problems involving Maxwell's equations in isotropic media," IEEE Trans. on Antennas and Propagation, vol. AP-14, no. 3, pp. 302-307, May 1966.
[6] D. A. Patterson and J. L. Hennessy, Computer Organization and Design, 4th ed. Morgan Kaufmann, 2008.
[7] Y. Tomioka, R. Takasu, T. Aoki, E. Hosoya, and H. Kitazawa, "FPGA implementation of exclusive block matching for robust moving object extraction and tracking," IEICE Trans. on Information and Systems, vol. E97-D, no. 3, pp. 573-582, Mar. 2014.
[8] J. L. Hennessy and D. A. Patterson, Computer Architecture, 4th ed. Morgan Kaufmann, 2007.
[9] T. Shibata and T. Itoh, "Generalized-scattering-matrix modeling of waveguide circuits using FDTD field simulations," IEEE Trans. on Microwave Theory and Techniques, vol. 46, no. 11, pp. 1742-1751, Nov. 1998.
[10] http://www.ieice.org/es/est/activities/kihan mst/01/main.html.
[11] M. Harris, "Optimizing parallel reduction in CUDA," http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
