A VLSI Array Processor For 16-Point FFT
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 26, NO. 9, SEPTEMBER 1991
Abstract-This paper describes an implementation of a two-dimensional array processor for fast Fourier transform (FFT) using a 2-μm CMOS technology. The array processor, which is dedicated to 16-point FFT, implements a 4 × 4 mesh array of 16 processing elements (PE's) working in parallel. Design considerations both in chip level and in PE level have been examined. A layout design methodology based on bit-slice units (BSU's) results in a very simple design and easy debugging, as well as a regular interconnection scheme through abutment. It contains about 48 000 transistors on an area of 53.52 mm², excluding the 83-pad area, and operation is on a 15-MHz clock. The array processor performs 24.6 million complex multiplications per second, and computes a 16-point FFT in 3 μs.

[...] FFT using half-butterfly arithmetic (HBA) will be presented in Section III. Section IV will describe the circuit and layout design. Finally, fabrication and some measurement results will be discussed in Section V.

II. DESIGN CONSIDERATIONS

This section discusses design considerations for our FFT array processor including chip architecture, the arithmetic function of a PE, and time complexity.

A. Chip Architecture
Authorized licensed use limited to: NED UNIV OF ENGINEERING AND TECHNOLOGY. Downloaded on April 06,2022 at 07:26:28 UTC from IEEE Xplore. Restrictions apply.
LEE et al.: VLSI ARRAY PROCESSOR FOR 16-POINT FFT 1287
TABLE I
COMPARISON OF VARIOUS PARALLEL ARCHITECTURES FOR FFT COMPUTATION
Fig. 1. (a) Two-dimensional mesh array for FFT computation. (b) Arithmetic function of a PE.

[...] PE's for N-point FFT computation is considered to be a special case with P = 2.

On the other hand, a 2-D mesh array composed of N PE's whose arithmetic function is also the usual radix-2 butterfly [19] can perform data shuffling operation without $O(\log_2 N)$ bits of address. It, however, reveals some inefficiencies as follows: a) it requires data reshuffling operations after butterfly computation; and b) only half of the PE's in the array participate in butterfly computation.

In order to resolve these problems inherent in 2-D mesh array implementation for FFT, we make a compromise by adopting half-butterfly arithmetic (HBA) [16] as [...] the usual radix-2 butterfly. The HBA takes its name from the definition given in (1), which represents half of the conventional radix-2 butterfly:

$$\mathrm{HBA}\& = f(q-1, k) \;\&\; f(q-1, k + D(q)) \cdot W_p^t \tag{1}$$

where

$$q = 1, 2, 3, \ldots, \log_2 N$$

$$W_p^t = \exp(-j 2\pi p t / N).$$

In (1), the operator "&" denotes either "+" (addition) or "-" (subtraction) and $D(q)$ represents the shuffling distance [...]

C. Time Complexity

The time required to compute N-point FFT can be evaluated in terms of total time for data shuffling and butterfly arithmetic through $\log_2 N$ butterfly stages. A summation of the time contributions of all shuffling distances will give the total shuffling time of $2(m-1)T_s$, if a unit shuffling distance takes time $T_s$. Also, the total time for butterfly arithmetic will be given by $T_{hb} \log_2 N$, where $T_{hb}$ represents the time for HBA. Both $T_{hb}$ and $T_s$ can be considered to be a unit clock if the HBA is implemented with parallel arithmetic.

Asymptotically, however, $2(m-1)T_s + T_{hb} \log_2 N$ can be reduced to $O(m)$, since $T_s$ and $T_{hb}$ are independent of N. This fact suggests that a bit-parallel communication scheme between PE's is preferable.

III. ALGORITHM FOR FFT COMPUTATION ON THE ARCHITECTURE

This section describes an algorithm for FFT computation using HBA. A procedure for computing a 16-point [...]
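As a small numerical companion to Section II, the following Python sketch evaluates one HBA pair from (1) and the time estimate $2(m-1)T_s + T_{hb}\log_2 N$. It is only an illustration under stated assumptions: $m$ is taken to be $\sqrt{N}$, the side of the square mesh (consistent with the 4 × 4 array), and the names `hba`, `twiddle`, and `fft_time` as well as the sample operands are hypothetical, not taken from the paper.

```python
import cmath
import math


def twiddle(p, t, n):
    """W_p^t = exp(-j*2*pi*p*t/N), as defined below (1)."""
    return cmath.exp(-2j * math.pi * p * t / n)


def hba(f_k, f_k_plus_d, w, op):
    """Half-butterfly of (1): f(q-1,k) & f(q-1,k+D(q))*W, with op '+' or '-'."""
    product = f_k_plus_d * w
    return f_k + product if op == "+" else f_k - product


def fft_time(n, t_s=1, t_hb=1):
    """Evaluate 2(m-1)*T_s + T_hb*log2(N), ASSUMING m = sqrt(N)."""
    m = math.isqrt(n)
    stages = n.bit_length() - 1  # log2(N) when N is a power of two
    if m * m != n or 1 << stages != n:
        raise ValueError("n must be a power of two and a perfect square")
    return 2 * (m - 1) * t_s + t_hb * stages


# Arbitrary sample operands (illustrative only).
a, b = 1 + 2j, 3 - 1j
w = twiddle(1, 1, 16)
plus_half = hba(a, b, w, "+")   # a + b*w
minus_half = hba(a, b, w, "-")  # a - b*w

# Taken together, the two halves are the two outputs of a radix-2 butterfly.
print(plus_half, minus_half)
print(fft_time(16))  # 2*(4-1)*1 + 1*4 -> 10 unit clocks
```

With unit values for $T_s$ and $T_{hb}$ the estimate for N = 16 is 10 clocks; this is the idealized bit-parallel figure of the formula, not a measured cycle count of the chip.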
1288 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 26, NO. 9, SEPTEMBER 1991
Fig. 3. Chip architecture.
BEGIN
FOR q = 1 TO 4 DO IN PARALLEL for all PE’s
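The stage loop above relies on the data-shuffling and arithmetic-type rules listed as Algorithms I and II in the Appendix. A minimal Python sketch of those two rules for the 4 × 4 array is given here. The helper names `shuffle_distance` and `stage_plan` are illustrative and do not come from the paper, and the sketch only tabulates destinations and signs; it does not model twiddle factors or the PE data path.

```python
def shuffle_distance(q):
    """Shuffling distance D(q): 2^(2-q) for q <= 2, otherwise 2^(4-q)."""
    return 2 ** (2 - q) if q <= 2 else 2 ** (4 - q)


def stage_plan(q, size=4):
    """Map each PE(i, j) to (destination PE, HBA sign) for stage q."""
    d = shuffle_distance(q)
    plan = {}
    for i in range(size):
        for j in range(size):
            if q <= 2:
                # Stages 1-2: test and shuffle along index i.
                plus = i % (2 * d) < d
                dest = (i + d, j) if plus else (i - d, j)
            else:
                # Stages 3-4: test and shuffle along index j.
                plus = j % (2 * d) < d
                dest = (i, j + d) if plus else (i, j - d)
            plan[(i, j)] = (dest, "+" if plus else "-")
    return plan


if __name__ == "__main__":
    for q in range(1, 5):
        dest, sign = stage_plan(q)[(0, 0)]
        print(f"q={q}: D(q)={shuffle_distance(q)}, PE(0,0) -> PE{dest}, HBA{sign}")
    total = sum(shuffle_distance(q) for q in range(1, 5))
    print("total shuffling distance:", total)
```

The distances come out as 2, 1, 2, 1, so the total is 6, which agrees with the $2(m-1)$ term of Section II-C if $m = 4$ is taken as the side of the mesh. In every stage the mapping pairs the PE's up, and half of them compute HBA+ while the other half compute HBA-.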
TABLE II
LAYOUT STATISTICS

Block            Area (mm²)    Transistors
BSU              0.132         150
[...]            0.967         [...]
PE               3.345         3000
4 × 4 Array      53.52         48 000
Die size including 83 pads: 76.56 mm²
[...] layout at the chip level is accomplished by a simple arrangement of the PE's in a 4 × 4 mesh array.

The chip contains about 48 000 transistors in an area of 53.52 mm², and die size including 83 pads is 76.56 mm². The layout area of a BSU containing 150 transistors is 0.132 mm², and a PE composed of 3000 transistors requires 3.345 mm². Some layout statistics are summarized in Table II.
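The figures quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the numbers stated in this section; the variable names are illustrative, and the derived densities and area fraction are simple ratios computed here, not values reported in the paper.

```python
import math

# Values quoted in the text of this section.
NUM_PE = 16
PE_AREA_MM2 = 3.345
PE_TRANSISTORS = 3000
BSU_AREA_MM2 = 0.132
BSU_TRANSISTORS = 150
ARRAY_AREA_MM2 = 53.52
ARRAY_TRANSISTORS = 48_000
DIE_AREA_MM2 = 76.56

# The 4 x 4 array is 16 abutted PE's, so its totals should be 16x the PE values.
array_area = NUM_PE * PE_AREA_MM2
array_transistors = NUM_PE * PE_TRANSISTORS
assert math.isclose(array_area, ARRAY_AREA_MM2)
assert array_transistors == ARRAY_TRANSISTORS

# Derived ratios (computed here for illustration).
pe_density = PE_TRANSISTORS / PE_AREA_MM2     # transistors per mm^2 in a PE
bsu_density = BSU_TRANSISTORS / BSU_AREA_MM2  # transistors per mm^2 in a BSU
array_fraction = ARRAY_AREA_MM2 / DIE_AREA_MM2

print(f"array area: {array_area:.2f} mm^2, transistors: {array_transistors}")
print(f"PE density: {pe_density:.0f}/mm^2, BSU density: {bsu_density:.0f}/mm^2")
print(f"array share of die area: {array_fraction:.1%}")
```

The check confirms that the array area and transistor count are exactly 16 times the per-PE values, as expected when the chip-level layout is a plain tiling of PE's.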
Since all PE's in the chip are synchronized by a global clock signal, the clock skew caused by different path lengths from the clock pad to each PE was eliminated by an H-tree clock distribution scheme [18].
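The reason an H-tree removes path-length skew can be seen from a purely geometric sketch. The Python code below builds an idealized H-tree over a 4 × 4 grid of unit pitch and checks that every leaf is reached through the same wire length. It is an assumption-laden illustration (ideal wires, unit pitch, root at the array center), not a description of the actual routing on the chip, and `h_tree_leaves` is a hypothetical helper name.

```python
def h_tree_leaves(levels, x=0.0, y=0.0, arm=1.0, length=0.0, horizontal=True):
    """Return [(x, y, wire_length)] for the leaves of an idealized H-tree.

    Each level splits the wire in two; branches alternate between horizontal
    and vertical, and the arm length halves after every vertical branch.
    """
    if levels == 0:
        return [(x, y, length)]
    leaves = []
    for sign in (-1.0, 1.0):
        if horizontal:
            leaves += h_tree_leaves(levels - 1, x + sign * arm, y,
                                    arm, length + arm, False)
        else:
            leaves += h_tree_leaves(levels - 1, x, y + sign * arm,
                                    arm / 2, length + arm, True)
    return leaves


# Four binary splits reach 2^4 = 16 leaves, one per PE of the 4 x 4 array.
leaves = h_tree_leaves(4)
lengths = {length for _, _, length in leaves}
print(len(leaves), "leaves, distinct path lengths:", lengths)
```

All 16 leaves land on distinct grid positions and share one path length, which is the property the text relies on to equalize clock arrival at every PE.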
The layout design methodology adopted in this paper features: a) regular layout structure; b) fast design and easy debugging; c) minimum design errors; and d) flexible extension in both array size and data bit width.

Fig. 8. Floor plan of a PE.
Fig. 11. Area estimation of the 2-D mesh array for FFT. Area is normalized by layout area of a PE, $A_0$.

[...] FFT every 3 μs, which is equivalently 24.6 million complex multiplications per second.
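The 3-μs figure can be turned into a sample throughput and a clock-cycle count with simple arithmetic. The sketch below assumes that the 3 μs refers to operation on the 15-MHz clock mentioned in the abstract; `samples_per_second` is an illustrative helper name, not something defined in the paper.

```python
FFT_POINTS = 16      # transform length
FFT_TIME_S = 3e-6    # one 16-point FFT every 3 microseconds
CLOCK_HZ = 15e6      # clock frequency quoted in the abstract


def samples_per_second(points, seconds):
    """Throughput expressed as samples processed per unit time."""
    return points / seconds


throughput_hz = samples_per_second(FFT_POINTS, FFT_TIME_S)
cycles_per_fft = round(FFT_TIME_S * CLOCK_HZ)

print(f"throughput: {throughput_hz / 1e6:.2f} M samples/s")
print(f"clock cycles per FFT (assuming 15 MHz): {cycles_per_fft}")
```

Sixteen samples every 3 μs is about 5.3 million samples per second, which corresponds to the throughput rate discussed in Section VI.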
VI. DISCUSSION

According to the measurement results of the fabricated chip, the throughput rate (i.e., the number of samples processed per unit time) of our array processor is estimated to be 5.3 MHz, which is much higher than the 40 kHz of the processor computing the discrete Fourier transform (DFT) [21]. It has been reported that the programmable DSP processors execute 512-point complex FFT within 4.5 ms [23] and 1024-point FFT in 1.0-1.33 ms [3], [24], achieving throughput rates in the range of 0.1-1.0 MHz, which are much lower than that of our array processor. Also, it is comparable to 8.8 MHz of the FFT processor based on radix-4 butterfly [4] and 10 MHz of the FFT system built with 27 IC's [5].

Although the array processor described in this paper implements only 16 PE's with a 2-μm CMOS process, it will be possible to integrate more than 16 PE's into a single chip if a submicrometer technology is used. The area $A(N, \lambda)$ required for an array of N PE's can be parameterized as follows:

$$A(N, \lambda) = A_0 N \lambda^2 \times 10^6. \tag{3}$$

In (3), $\lambda$ denotes half of the minimum feature size of a given technology, and $A_0$ represents the physical layout area of a PE. The value of $A_0$ will depend on the physical realization of PE. In our design using double-metal layer, $A_0$ is 3.345 (see Table II).

Fig. 11 shows this area estimation using the $\lambda$ parameter, which is normalized by $A_0$. Based on Fig. 11, it will be possible to implement an array processor for 32 × 32 2-D FFT in an area of about 1 cm² if a submicrometer CMOS technology is used.

VII. CONCLUSIONS

An array processor dedicated to compute 16-point FFT has been implemented using a 2-μm CMOS process. To achieve an area-efficient computation of the FFT, some design considerations have been examined both in array level and in PE level. The novel features in designing the array processor are the half butterfly arithmetic concept realized with the distributed arithmetic, and the hierarchical design methodology based on bit-slice units.

The prototype implementation of this paper leads us to conclude that the 2-D mesh array architecture and algorithm presented in this paper will be one of the most attractive candidates for the real-time computation of the FFT in VLSI.

APPENDIX

Algorithm I: Data Shuffling Procedure Based on HBA

BEGIN
  FOR q = 1 TO 4 DO IN PARALLEL for all i and j (i = j = 0, 1, 2, 3)
  BEGIN
    IF (q ≤ 2) THEN
      D(q) = 2^(2-q)
      IF (i MOD 2^(3-q) < 2^(2-q)) THEN
        move data in PE(i, j) to PE(i + D(q), j)
      ELSE
        move data in PE(i, j) to PE(i - D(q), j)
    ELSE
      D(q) = 2^(4-q)
      IF (j MOD 2^(5-q) < 2^(4-q)) THEN
        move data in PE(i, j) to PE(i, j + D(q))
      ELSE
        move data in PE(i, j) to PE(i, j - D(q))
  END
END.

Algorithm II: Determination of the Arithmetic Type of the HBA&

BEGIN
  FOR q = 1 TO 4 DO IN PARALLEL for all i and j (i = j = 0, 1, 2, 3)
  BEGIN
    IF (q ≤ 2) THEN
      IF (i MOD 2^(3-q) < 2^(2-q)) THEN
        PE(i, j) compute the HBA+
      ELSE
        PE(i, j) compute the HBA-
    ELSE
      IF (j MOD 2^(5-q) < 2^(4-q)) THEN
        PE(i, j) compute the HBA+
      ELSE
        PE(i, j) compute the HBA-
  END
END.

REFERENCES

[1] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[2] K. L. Kloker, B. Lindsley, N. Baron, and G. R. L. Sohie, "Efficient FFT implementation on an IEEE floating-point digital signal processor," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1988, pp. 1399-1402.
[3] R. W. Linderman, P. M. Chau, W. H. Ku, and P. P. Reusens, "Cusp: A 2-μm CMOS digital signal processor," IEEE J. Solid-State Circuits, vol. SC-20, pp. 761-769, June 1985.
[4] J. O'Brien, J. Mather, and B. Holland, "A 200 MIPS single-chip 1K FFT processor," in ISSCC Dig. Tech. Papers, Feb. 1989, pp. 166-167.
[5] J. Fox, G. Surace, and P. A. Thomas, "A self-testing 2-μm CMOS chip set for FFT applications," IEEE J. Solid-State Circuits, vol. SC-22, pp. 15-19, Feb. 1987.
[6] K. Yamashita et al., "A wafer-scale 170 000-gate FFT processor with built-in test circuits," IEEE J. Solid-State Circuits, vol. 23, pp. 336-341, Apr. 1988.
[7] W. Cochran et al., "What is the fast Fourier transform?," IEEE Trans. Audio Electroacoust., vol. AU-15, pp. 45-55, 1967.
[8] H. Stone, "Parallel processing with the perfect shuffle network," IEEE Trans. Comput., vol. C-20, pp. 153-161, Dec. 1986.
[9] M. K. Lee, "Systolic array for FFT computation," in '87 Multi Project Chip (MPC'87) Development, Dept. Electronic Eng., Yonsei Univ., Seoul, Korea, Aug. 1988.
[10] K. W. Shin, "A VLSI architecture for the parallel computation of FFT," Ph.D. dissertation, Dept. Electronic Eng., Yonsei Univ., Seoul, Korea, Aug. 1990.
[11] K. W. Shin, B. Y. Choi, and M. K. Lee, "A VLSI architecture of systolic array for FFT computation," J. KITE, vol. 25, no. 9, pp. 97-106, 1988.
[12] B. Y. Choi, B. H. Kang, J. K. Lee, K. W. Shin, and M. K. Lee, "VLSI implementation of two-dimensional FFT algorithm on systolic array," in Proc. TENCON 87 IEEE Region-10 Conf., 1987, pp. 125-131.
[13] E. E. Swartzlander, W. K. W. Young, and S. J. Joseph, "A radix 4 delay commutator for fast Fourier transform processor implementation," IEEE J. Solid-State Circuits, vol. SC-19, no. 5, pp. 702-709, Oct. 1984.
[14] T. Wiley, T. S. Durrani, and R. Champinin, "An FFT systolic processor and its applications," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1984, vol. 2, pp. 4.1-4.4.
[15] L. W. Chang and M. Y. Chen, "A new systolic array for discrete Fourier transform," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 36, no. 10, pp. 1665-1666, 1988.
[16] K. W. Shin and M. K. Lee, "A VLSI architecture for parallel computation of FFT," Systolic Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1989, pp. 116-125.
[17] S. A. White, "A simple FFT butterfly arithmetic unit," IEEE Trans. Circuit Syst., vol. CAS-28, no. 4, pp. 352-355, Apr. 1981.
[18] A. L. Fisher and H. T. Kung, "Synchronizing large VLSI processor arrays," IEEE Trans. Comput., vol. C-34, no. 8, pp. 734-740, Aug. 1985.
[19] C. D. Thomson, "A complexity theory for VLSI," Ph.D. dissertation, Comput. Sci. Dept., Carnegie-Mellon Univ., Pittsburgh, PA, Aug. 1980.
[20] I. R. Mactaggart and M. A. Jack, "A single chip radix-2 FFT butterfly architecture using parallel data distributed arithmetic," IEEE J. Solid-State Circuits, vol. SC-19, no. 3, pp. 368-373, June 1984.
[21] J. L. V. Meerbergen and F. J. V. Wyk, "A 256-point discrete Fourier transform processor fabricated in 2-μm NMOS technology," IEEE J. Solid-State Circuits, vol. SC-18, no. 5, pp. 604-609, Oct. 1983.
[22] H. S. Lee, H. Mori, and H. Aiso, "Parallel processing FFT for VLSI implementation," Trans. IECE Japan, vol. E68, no. 5, pp. 284-291, May 1985.
[23] Y. Kawakami et al., "A 32b floating point CMOS digital signal processor," in ISSCC Dig. Tech. Papers, Feb. 1986, pp. 86-87.
[24] A. Kanuma et al., "A 20 MHz 32b pipelined CMOS image processor," in ISSCC Dig. Tech. Papers, Feb. 1986, pp. 102-103.

Moon-Key Lee (S'77-M'79) was born in Seoul, Korea, in 1941. He earned the B.S., M.S., and D.E. degrees in electrical engineering from Yonsei University, Seoul, Korea in 1965, 1967, and 1973, respectively. In 1980 he received the Ph.D. degree from the University of Oklahoma.
He was a Lecturer in the Department of Electrical Engineering, Yonsei University, from 1968 to 1970. From 1970 to 1976 he was with Kyunghee University, Seoul, Korea, where he held the positions of Assistant Professor, Associate Professor, and Chairman of the Electronic Engineering Department. He was Director of the IC design division at Korea Institute of Electronic Technology (ETRI, at present), Kumi, Korea, from 1980 to 1982. In 1982 he joined the faculty of Yonsei University, Seoul, Korea, where he is currently Professor and Chairman of the Electronic Engineering Department. He is a founder of the Research Institute of ASIC Design (RIAD), which was established in 1989 and located at Yonsei University. He has published five college textbooks on electronic engineering. He has authored and co-authored over 100 papers on integrated circuit design and computer-aided design. He made pioneering contributions to IC design education by performing in the Multi Project Chip research, which was participated in by eight universities and supported by the Ministry of Science and Technology of Korea from 1985 through 1988, for the first time in Korea. His current research interests include VLSI design and computer-aided design and software.
Dr. Lee was the Chairman of the Chapter Promotion and educational activities in the IEEE Korea Section. He has served as Chairman of the IEEE Circuits and Systems chapter in Korea which was formed in 1989. In 1985, he organized the International Workshop on VLSI and CAD, Seoul, Korea and served as Chairman. He was the Program Chairman of the 1989 International Conference on VLSI and CAD (ICVC), Seoul, Korea. He was the Chairman of the financial committee of TENCON 87, IEEE Region 10 Conference, Seoul, Korea, in 1987. From 1982 through 1987 he was a member of the editorial committee and an Associate Editor of the Journal of Korean Institute of Telematics and Electronics (KITE). He received the distinguished honor award from KITE in 1984. In 1986, he was awarded the 40th anniversary prize for the outstanding paper published in the journal of KITE. Since 1988 he has served on the Board of Directors of the Korean Institute of Telematics and Electronics (KITE).

Kyung-Wook Shin was born in Chongiu, Korea, in 1961. He received the M.S. and Ph.D. degrees in electronic engineering from Yonsei University, Seoul, Korea, in 1986 and 1990, respectively. During his Ph.D. course, he worked on the design of a VLSI array processor for parallel computation of fast Fourier transform (FFT).
In 1990 he joined the Integrated Circuits Technology Department of the Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea. In 1991 he joined the faculty of Kum Oh Institute of Technology, Kyungbuk, Korea. His research interests include VLSI signal processing, algorithm-oriented VLSI array architectures, and digital design.

Jang-Kyu Lee was born in Seoul, Korea, in 1963. He received the B.S. degree in electronic engineering from Sogang University, Seoul, Korea, in 1986, and the M.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1988.
In 1988 he joined the Semiconductor Division of Samsung Electronics, Inc., Kyungki Do, Korea, where he works on memory IC design. His research interests include DRAM design and application-specific memory design.