
A 128/256-Point Pipeline FFT/IFFT Processor for MIMO OFDM System IEEE 802.16e
Simeng Li, Huxiong Xu, Wenhua Fan, Yun Chen, Xiaoyang Zeng
State Key Lab. of ASIC and System, Fudan University, Shanghai, P.R. China
Email: {082052001, 082052056, 09212020009, chenyun, xyzeng}@fudan.edu.cn

Abstract: In this paper, we present a novel 128/256-point FFT/IFFT processor for IEEE 802.16e applications based on MIMO-OFDM. A pipeline FFT architecture is proposed to handle 1-4 multiple data sequences efficiently and to increase the throughput. Furthermore, the design needs less hardware than the conventional approach of using individual parallel processors. The signal-to-quantization-noise ratio (SQNR) is 42.7 dB. The proposed FFT has been designed in 0.13 µm technology with a core size of 1.470 × 1.469 mm².

I. INTRODUCTION

The multiple-input multiple-output (MIMO) technique has been combined with OFDM technology in wireless communication systems to enhance the link throughput as well as the robustness of transmission over frequency-selective fading channels. This technology has been employed in the physical layer specification of the emerging IEEE 802.16e standard to provide broadband wireless access services [1]. According to the optional specification, the Alamouti-scheme space-time block code (STBC) is adopted for the 2×1 MISO transmission mode. In addition, a receiver dealing with 1-4 sequences can be designed to further improve the system performance. As a consequence, a 4×4 MIMO OFDM system for IEEE 802.16e WMAN is considered in this paper. The receiver of the MIMO-OFDM system contains four RFs, four analog-to-digital converters (ADCs), four FFTs, a MIMO equalizer, four De-QAM and de-interleaver blocks, a de-spatial parser, a de-puncturer, a channel decoder, a synchronization block, and a channel estimation block [5]. However, the hardware cost also increases significantly, because more memory and more complex multipliers are needed to process multiple data streams simultaneously.
The fast Fourier transform (FFT) and the inverse fast Fourier transform (IFFT) are the crucial computational blocks for baseband multicarrier demodulation and modulation, respectively, in an OFDM system. Various FFT architectures have been proposed. Among them, the pipeline structure is suitable for a short-length FFT processor whose size is smaller than 512, since it can provide high throughput with moderate hardware cost. Multipath delay commutator (MDC) and single-path delay feedback (SDF) are two common realizations of the pipeline architecture [2]. Typically, MDC achieves higher throughput at a higher hardware cost, whereas SDF offers the opposite trade-off. Recently, several pipeline-based works have been proposed to deal with multiple data sequences, such as mixed-radix multipath delay feedback (MRMDF) [5]. However, multiple butterfly units (BUs) are required within each pipeline stage. In this paper, a 128/256-point pipeline FFT/IFFT is proposed to deal efficiently with 4 parallel sequences. The input reordering forms the data from the same channel into the same frame, so that the processor works for 1-4 channels simultaneously and also supports several throughput rates.

This paper is organized as follows. Section II describes the 128/256-point FFT algorithm and the IFFT algorithm. Section III describes the proposed FFT/IFFT architecture. Section IV compares its hardware cost and throughput rate with some existing 128/256-point FFT architectures. Conclusions are given in Section V.

This work was supported by Shanghai Scientific and Technological Commission under Grant No. 08700741100.

II. ALGORITHM

The N-point discrete Fourier transform (DFT) of an N-point sequence {x(n)} is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N} = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \quad 0 \le k \le N-1, \tag{1}$$

where x(n) and X(k) are complex numbers. The twiddle factor is

$$W_N^{kn} = e^{-j 2\pi kn/N}. \tag{2}$$
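As a point of reference for the definitions in (1) and (2), the following minimal sketch evaluates the DFT directly. The function names `twiddle_table` and `dft_direct` are illustrative and do not come from the paper; the code is a floating-point model of the arithmetic only, not of the proposed hardware.

```python
import cmath


def twiddle_table(n_points):
    """Precompute W_N^m = exp(-j*2*pi*m/N) for m = 0..N-1, as in (2)."""
    return [cmath.exp(-2j * cmath.pi * m / n_points) for m in range(n_points)]


def dft_direct(x):
    """Evaluate (1) term by term; this costs N*N complex multiplications."""
    n_points = len(x)
    w = twiddle_table(n_points)
    # W_N^{kn} is periodic in kn with period N, so the exponent is reduced mod N.
    return [
        sum(x[n] * w[(k * n) % n_points] for n in range(n_points))
        for k in range(n_points)
    ]


if __name__ == "__main__":
    # A unit impulse at n = 0 transforms to an all-ones spectrum.
    spectrum = dft_direct([1, 0, 0, 0, 0, 0, 0, 0])
    print([round(abs(v), 6) for v in spectrum])
```

Because every output point visits every input point, the work grows with the square of the transform length, which is the cost that the FFT decompositions in this section reduce.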

The computational complexity of (1) is $O(N^2)$ when the required computations are executed directly. It can be reduced to $O(N \log_r N)$ using the Cooley-Tukey FFT algorithm [3], where r denotes the radix of the FFT algorithm. For a DFT of fixed length, the computational complexity decreases as the radix increases, since fewer stages are needed. Based on the Cooley-Tukey algorithm, for the 128-point FFT, (1) can be reformulated as

$$X(k_1 + 2k_2 + 16k_3) = \sum_{n_3=0}^{7} \left\{ \left[ \sum_{n_2=0}^{7} \left( \sum_{n_1=0}^{1} x(n_3 + 8n_2 + 64n_1)\, W_2^{k_1 n_1} \right) W_{128}^{k_1 (n_3 + 8n_2)}\, W_8^{k_2 n_2} \right] W_{64}^{k_2 n_3} \right\} W_8^{k_3 n_3}, \tag{3}$$

for $k_1 = 0, 1$; $k_2 = 0, \ldots, 7$; $k_3 = 0, \ldots, 7$.

Figure 1. Block diagram of the proposed reconfigurable pipeline FFT.

Figure 2. Block diagram of data reordering.

978-1-4244-5309-2/10/$26.00 ©2010 IEEE
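As a rough check of the three-dimensional index mapping behind (3), the sketch below evaluates the nested 2 × 8 × 8 sums literally and applies them to a test tone. It assumes the input index map n = n3 + 8·n2 + 64·n1 and the output index map k = k1 + 2·k2 + 16·k3; the names `w`, `dft128_by_index_mapping` and `tone` are illustrative and not taken from the paper. It models the arithmetic in floating point and says nothing about the pipeline timing.

```python
import cmath


def w(base, exponent):
    """Twiddle factor W_base^exponent = exp(-j*2*pi*exponent/base)."""
    return cmath.exp(-2j * cmath.pi * exponent / base)


def dft128_by_index_mapping(x):
    """128-point DFT written as the nested 2 x 8 x 8 sums of (3)."""
    if len(x) != 128:
        raise ValueError("expected 128 samples")
    out = [0j] * 128
    for k3 in range(8):
        for k2 in range(8):
            for k1 in range(2):
                outer = 0j
                for n3 in range(8):
                    middle = 0j
                    for n2 in range(8):
                        # Innermost sum: a 2-point butterfly over n1.
                        inner = sum(
                            x[n3 + 8 * n2 + 64 * n1] * w(2, k1 * n1)
                            for n1 in range(2)
                        )
                        # Inter-stage twiddle, then the 8-point sum over n2.
                        middle += inner * w(128, k1 * (n3 + 8 * n2)) * w(8, k2 * n2)
                    # Second inter-stage twiddle, then the 8-point sum over n3.
                    outer += middle * w(64, k2 * n3) * w(8, k3 * n3)
                out[k1 + 2 * k2 + 16 * k3] = outer
    return out


def tone(bin_index, n_points=128):
    """Complex exponential whose DFT is n_points at bin_index and 0 elsewhere."""
    return [cmath.exp(2j * cmath.pi * bin_index * n / n_points) for n in range(n_points)]


if __name__ == "__main__":
    spectrum = dft128_by_index_mapping(tone(37))
    peak = max(range(128), key=lambda k: abs(spectrum[k]))
    print(peak, round(abs(spectrum[peak]), 6))
```

A single tone is a convenient test input because any sample that the index map misses or visits twice shows up as leakage into other bins.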

In the 256-point FFT calculation, (1) can be reformulated as

$$X(k_1 + 4k_2 + 32k_3) = \sum_{n_3=0}^{7} \left\{ \left[ \sum_{n_2=0}^{7} \left( \sum_{n_1=0}^{3} x(n_3 + 8n_2 + 64n_1)\, W_4^{k_1 n_1} \right) W_{256}^{k_1 (n_3 + 8n_2)}\, W_8^{k_2 n_2} \right] W_{64}^{k_2 n_3} \right\} W_8^{k_3 n_3}, \tag{4}$$

for $k_1 = 0, \ldots, 3$; $k_2 = 0, \ldots, 7$; $k_3 = 0, \ldots, 7$.

The 128-point FFT calculation can be decomposed into one stage of radix-2 butterfly calculation followed by three stages of radix-4 butterfly calculation, and the 256-point calculation into four stages of radix-4 butterfly calculation. The IDFT of an N-point sequence X(k), k = 0, 1, ..., N-1, is defined as

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}. \tag{5}$$

In order to implement the IDFT efficiently, (5) can be rewritten as

$$x(n) = \frac{1}{N} \left[ \sum_{k=0}^{N-1} X^{*}(k)\, W_N^{nk} \right]^{*}, \tag{6}$$

where * denotes the complex conjugate. The IFFT can thus be realized using the FFT algorithm with additional interchange operations and normalization [4].

III. PROPOSED FFT PROCESSOR FOR MIMO OFDM SYSTEM

A. Proposed Architecture

The architecture of the proposed 128/256-point pipeline FFT/IFFT processor is shown in Fig. 1. The radix mode selects 128- or 256-point operation, and a control signal selects FFT or IFFT operation. When the IFFT is performed, the input sequences are conjugated and processed as in the FFT, and the output is then conjugated and divided by N. By taking advantage of the pipeline FFT architecture, the processor can handle 128- and 256-point FFT/IFFT with 1-4 parallel data sequences in one stage of reconfigurable radix-2/2² FFT operation followed by three stages of radix-4 FFT operation.

The proposed architecture can also support several channels. For 1, 2, 3, or 4 data paths, four data from the same channel are grouped into one frame, and the FFT operation is then performed in 4 parallel paths. In one-channel mode, 4 parallel data are calculated simultaneously, so the throughput rate is 4 times that of the 4-channel mode at the same clock rate. The pipeline architecture is introduced to increase the throughput. The optimized memory blocks in each stage are used for computational storage and I/O buffers.

In data reordering, an interleaver is introduced to reorder the input data of the 4 paths [5]. As shown in Fig. 2, the 4 individual data paths are reordered into 4×4 blocks, each of which contains the frames of the same index from input channels A, B, C, and D. The following 4 stages of butterfly processing can therefore receive one frame of 4 data from the same channel at a time and perform the FFT operation on 4 parallel data sequences more efficiently. Moreover, the interleaver can also work in the 1-, 2-, or 3-sequence modes.

B. Pipeline Radix-2/2² Butterfly Unit in Stage 1

As shown in Fig. 3, stage 1 contains a control signal and address generator, 3 blocks of register files, each of which can store 256 complex data, 4 radix-2/2² butterfly units, and 4 complex multipliers that perform the multiplications by the twiddle factors. Owing to the periodicity of the twiddle factors, only 1/8 of the cosine and sine points are stored in ROMs [6]. The state controller manages the whole computation process and generates the read/write addresses and control signals for the register files and the 4 radix-2/2² butterfly units. The 4 parallel input data sequences, reordered into 4×4 blocks each containing data frames A, B, C, and D of the same index, are stored in the register file. Computed data are written back to the memory according to the pipeline radix-4 schedule and then output to the next stage.

Figure 3. Block diagram of stage 1 radix-2/2² FFT.

The radix-2/2² butterfly is the kernel computation unit in stage 1, as shown in Fig. 4. Designed from the SFG of the radix-4 operation shown in Fig. 5, it is constructed from modified complex adders, multiplexers, and 3 data paths fed back to the registers to fetch the input data and store the temporary values. A mode signal selects the radix by bypassing adders, so that the radix-4 butterfly unit can be configured into radix-2 mode. With this structure, a butterfly with approximately the hardware resources of a radix-4 unit supports reconfigurable radix-2/2² computation.

Figure 4. Block diagram of the radix-2/2² butterfly unit in stage 1.

Figure 5. SFG of the radix-2/2² butterfly unit in stage 1.

The four data in each frame are processed in the 4 radix-2/2² butterfly units, respectively. The four butterfly units (BU4) can receive their inputs at each radix-2/2² calculation, as shown in Fig. 4. For instance, the 1st radix-4 butterfly unit handles the radix-4 operation of frame A, index 0. In the first 192×4 clock cycles, the 3N/4×4 data sequence is stored in the registers. For data frame A, the operation does not start until the 0th (x[0]), the 64th (x[0 + 256/4]), the 128th (x[0 + 256×2/4]), and the 192nd (x[0 + 256×3/4]) points are ready. As soon as the 192nd input datum arrives, the radix-4 operation is triggered. Once the radix-4 computation is done and the result has been delivered to the next pipeline stage after multiplication with the twiddle factors stored in the ROM, the data of the same index of the next data frame, B, are collected and computed. After the BU4 calculation of the 4 frames of the same index is finished, the next operation on data frame A is triggered as soon as the 196th (x[0 + 256×3/4 + 4]) point is collected. The other three BU4s act in the same way. The operation is thus fully pipelined, with no waiting latency or idle cycles, and the throughput of the pipeline structure is maximized.

C. Pipeline Butterfly Unit in Stage 2 and Stage 3

As shown in Fig. 6, this module consists of 2 stages of radix-4 butterfly units and complex multipliers for nontrivial twiddle factor multiplications in the four parallel data paths. The radix-4 butterfly unit, BU4 in Fig. 6, is designed directly from the radix-4 butterfly SFG in Fig. 5, without the bypass to radix-2 mode. Every stage in the pipeline structure needs 3 register banks and one ROM for twiddle factor storage. In stages 2 and 3, the registers are used as delay buffers for the 4 data paths in blocks.

Figure 6. Block diagram of pipeline stage 2 and stage 3 radix-4 FFT: (a) block diagram of the pipeline radix-4 butterfly unit and (b) input and output data in the 4 frames of each block.

D. Radix-4 Butterfly Unit in Stage 4

As shown in Fig. 7, owing to the 4 parallel operations in the previous stages, the radix-4 butterfly in stage 4 receives its 4 data inputs simultaneously from the 4 output sequences of stage 3. Thus a delay buffer is no longer needed in this stage. The radix-4 butterfly unit in this stage is mapped from the SFG in Fig. 5.

E. Memory Consideration

To obtain a certain precision and a simple hardware implementation, a fixed-point data format is used in this design.
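To make the fixed-point trade-off concrete, the sketch below quantizes the input samples to a signed fixed-point grid and measures the resulting SQNR at the DFT output. It is a floating-point model under simplifying assumptions: only the input is quantized (internal rounding, scaling and twiddle quantization are not modeled), the input is uniformly distributed in [-1, 1), and the helper names `quantize`, `dft` and `sqnr_db` are illustrative. It is therefore not expected to reproduce the SQNR figure reported for the actual design.

```python
import cmath
import math
import random


def quantize(value, bits):
    """Round a real value in [-1, 1) to a signed fixed-point grid of `bits` bits."""
    scale = 1 << (bits - 1)
    code = max(-scale, min(scale - 1, round(value * scale)))  # saturate at the ends
    return code / scale


def quantize_complex(z, bits):
    """Quantize the I and Q parts independently."""
    return complex(quantize(z.real, bits), quantize(z.imag, bits))


def dft(x):
    """Reference DFT in double precision (direct evaluation)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def sqnr_db(reference, degraded):
    """10*log10(signal power / error power) between two spectra."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - d) ** 2 for r, d in zip(reference, degraded))
    return 10.0 * math.log10(signal / noise)


def input_quantization_sqnr(n_points=256, bits=10, seed=0):
    """SQNR at the DFT output when only the input is quantized to `bits` bits."""
    rng = random.Random(seed)
    x = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_points)]
    xq = [quantize_complex(v, bits) for v in x]
    return sqnr_db(dft(x), dft(xq))


if __name__ == "__main__":
    for bits in (8, 10, 13):
        print(bits, "bits:", round(input_quantization_sqnr(bits=bits), 1), "dB")
```

Each additional bit halves the quantization step, so the measured SQNR in this model should rise by roughly 6 dB per bit; a real datapath also accumulates rounding noise in every stage, which this sketch leaves out.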


Figure 7. Block diagram of radix-4 in stage 4, n = 0-64 from each data channel.

TABLE I. COMPARISON OF DIFFERENT DESIGNS

|                   | 4-parallel pipeline (proposed) | 2-parallel MRDS [7] | MRMDF [5]        | Folding Processing [8] |
| Technology        | 0.13 µm                        | 0.18 µm             | 0.13 µm          |                        |
| FFT Size          | 128/256                        | 256                 | 128/64           | 256                    |
| Algorithm         | Radix-2/2², Radix-4            | Radix-4, Radix-8    | Radix-2, Radix-8 | Radix-2²               |
| Sequence          | 1-4                            | 2                   | 1-4              | 2                      |
| Delay Element     | 1008 (4N-16)                   | 510×2 (2N-2)        | 512 (4N-4)       | 510×2 (2N-2)           |
| C. Multiplier     | 12                             | 2                   | 6                | 3                      |
| Wordlength (bits) | I:10/O:13                      | I:11/O:15           | 12               |                        |
| Clock Rate        | R                              | 2R                  | R                | 2R                     |
| Throughput Rate   | 4R                             | 2R                  | 4R               | 2R                     |
| Area (µm²)        | 1470 × 1469                    | -                   | 660 × 2142       | -                      |

The wordlength of the input data is 10 bits and that of the output data is 13 bits. According to (2), (3), and the SFG of radix-4 in Fig. 5, every stage in the pipeline structure needs 3 banks of registers and a ROM for twiddle factor storage. In practice, large registers are implemented using register files. Taking the first stage of radix-2/2² as an example, for the radix-4 operation, x[n], x[n+N/4], x[n+N/2], and x[n+3N/4] must be stored in three 256 × 26-bit register files, as complex symbols with 13 bits each for I and Q.

F. Modified Multiplier

The twiddle factor multipliers (C. Multiplier in Table I) in stages 2, 3, and 4 are complex multipliers implemented as constant multipliers with address-mapping multiplexers. Owing to the periodicity of the twiddle factors, only 1/8 of the cosine and sine points are stored [6]. Constant multiplication can be carried out using only these nine sets of constants by appropriately swapping their real and imaginary parts and choosing the appropriate sign [10]. The area of the multiplier is thereby reduced.

IV. PERFORMANCE AND COMPARISON

The design is synthesized in SMIC 0.13 µm 1P8M technology with a 1.2 V power supply. The maximum operating frequency is 125 MHz. The core size is 1470 × 1469 µm², including memory. With an input wordlength of 10 bits and an output wordlength of 13 bits, the SQNR of the FFT processor is 42.7 dB. The latency is 260 clock cycles. IEEE 802.16e has a subcarrier spacing of 10.94 kHz and 128/512/1024/2048 subcarriers. Working at the required frequency, this 128/256-point pipeline FFT/IFFT processor can meet the throughput requirement of 802.16e for 128/256-point operation and supports FFT calculation for 1 to 4 channels in MIMO OFDM applications.

V. CONCLUSION

In this paper, a novel 128/256-point FFT/IFFT processor for the IEEE 802.16e MIMO OFDM system is designed by taking advantage of the pipeline architecture and a reconfigurable butterfly unit so as to achieve low power dissipation and small area. To process 1-4 simultaneous data sequences, data reordering and grouping are introduced, and the processor can provide different throughput rates more efficiently. The design is synthesized in SMIC 0.13 µm 1P8M technology with a 1.2 V power supply, with a core size of 1470 × 1469 µm².

REFERENCES

[1] WiMAX Forum, "Mobile WiMAX-Part I: A technical overview and performance evaluations," Feb. 21, 2006.
[2] Shousheng He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," URSI International Symposium on Signals, Systems, and Electronics, vol. 29, pp. 257-262, Oct. 1998.
[3] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Compt., vol. 5, no. 5, pp. 87-109, 1965.
[4] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[5] Yu-Wei Lin and Chen-Yi Lee, "Design of an FFT/IFFT Processor for MIMO OFDM Systems," IEEE Trans. Circuits and Systems I, vol. 54, no. 4, pp. 807-815, Apr. 2007.
[6] Yu-Wei Lin, Hsuan-Yu Liu, and Chen-Yi Lee, "A dynamic scaling FFT processor for DVB-T applications," IEEE J. Solid-State Circuits, vol. 39, no. 11, pp. 2005-2013, Nov. 2004.
[7] Fang-Li Yuan, Yi-Hsien Lin, Chih-Feng Wu, Muh-Tian Shiue, and Chorng-Kuang Wang, "A 256-Point Dataflow Scheduling 2×2 MIMO FFT/IFFT Processor for IEEE 802.16 WMAN," in Proc. IEEE ASSCC, Nov. 2008, pp. 309-312.
[8] Ludwig Schwoerer and Ernst Zielinski, "Optimized FFT Architecture for MIMO Applications," in Proc. European Signal Processing Conference (EUSIPCO), Sep. 2005.
[9] Koushik Maharatna, Eckhard Grass, and Ulrich Jagdhold, "A 64-Point Fourier Transform Chip for High-Speed Wireless LAN Application Using OFDM," IEEE J. Solid-State Circuits, vol. 39, no. 3, pp. 484-493, Mar. 2004.
