1286 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 26, NO. 9, SEPTEMBER 1991

A VLSI Array Processor for 16-Point FFT

Moon-Key Lee, Member, IEEE, Kyung-Wook Shin, and Jang-Kyu Lee

Abstract: This paper describes an implementation of a two-dimensional array processor for fast Fourier transform (FFT) using a 2-μm CMOS technology. The array processor, which is dedicated to 16-point FFT, implements a 4 × 4 mesh array of 16 processing elements (PE's) working in parallel. Design considerations both in chip level and in PE level have been examined. A layout design methodology based on bit-slice units (BSU's) results in a very simple design and easy debugging, as well as a regular interconnection scheme through abutment. It contains about 48 000 transistors on an area of 53.52 mm², excluding the 83-pad area, and operation is on a 15-MHz clock. The array processor performs 24.6 million complex multiplications per second, and computes a 16-point FFT in 3 μs.

Manuscript received May 21, 1990; revised April 22, 1991.
M.-K. Lee is with the VLSI and CAD Laboratory, Department of Electronic Engineering, Yonsei University, Seoul, 120-749, Korea.
K.-W. Shin was with the Department of Electronic Engineering, Yonsei University, Seoul, 120-749, Korea. He is now with the Department of Electronic Engineering, Kum Oh Institute of Technology, 188 Shinpyung-Dong, Kumi City, Kyungbuk 730-701, Korea.
J.-K. Lee was with the Department of Electronic Engineering, Yonsei University, Seoul, 120-749, Korea. He is now with the Semiconductor Division, Samsung Electronics, Inc., Kyungki Do, 449-900, Korea.
IEEE Log Number 9101784.
0018-9200/91/0900-1286$01.00 © 1991 IEEE

I. INTRODUCTION

Recent advances in very large scale integration (VLSI) technology make the implementation of array processors technologically realizable and economically feasible. So, a new approach adopting algorithm-based VLSI array processors is becoming increasingly competitive for real-time digital signal processing (DSP) and image processing applications [1].

The fast Fourier transform (FFT) is probably one of the most important algorithms in DSP applications. There are two approaches for computing this transform: software implemented on a programmable DSP [2], [3], [23], [24], and dedicated FFT processor development [4]-[6]. Real-time DSP favors the use of the latter, which offers parallel processing capability.

In this paper, a dedicated array processor implementation is described. As known, the FFT algorithm requires global data exchanges for butterfly computation, thus the direct implementation of the FFT flow graph onto silicon is considered to be inefficient because of its inherent communication cost in terms of area and time. To achieve area-efficient implementation of the FFT with a high degree of regularity and parallelism, we utilize two-dimensional mesh array of processing elements (PE's).

This paper is organized as follows. Section II describes design considerations. Then an algorithm to compute FFT using half-butterfly arithmetic (HBA) will be presented in Section III. Section IV will describe the circuit and layout design. Finally, fabrication and some measurement results will be discussed in Section V.

II. DESIGN CONSIDERATIONS

This section discusses design considerations for our FFT array processor including chip architecture, the arithmetic function of a PE, and time complexity.

A. Chip Architecture

A number of approaches for FFT implementation are known including an FFT network [7], a perfect-shuffle network [8], a 2-D mesh array [9]-[12], pipelined architectures utilizing delay commutator [13] or systolic elevator [14], and a 1-D systolic DFT array [15]. Table I compares these architectures in terms of area, time, and regularity. From Table I, it is easily found that the FFT network and the perfect-shuffle network are not suitable for VLSI realization since irregular interconnections between PE's dominate the chip area. The pipelined architectures utilizing delay commutator or systolic elevator require additional hardware overhead of $O(N \log N)$ for the data shuffling of N-point FFT.

Our design goal in the architectural level is to determine a parallel architecture satisfying the following criteria: 1) highly parallel processing capability; 2) architectural regularity; 3) local data communications; and 4) no hardware overhead for data shuffling operation. Since the minimization of chip area with architectural regularity is considered much more important than any other factors in VLSI design, we choose the 2-D mesh array shown in Fig. 1(a) as the chip architecture.

B. Arithmetic Function of PE

In the implementation of a 2-D mesh array for FFT, careful examinations into the trade-offs between the number of PE's, the arithmetic function of PE, and control complexity should be made.

When N-point FFT is computed using a 2-D mesh array having M (where $2M \le N$) PE's whose arithmetic functions are the usual radix-2 butterfly [22], each PE requires P = N/M data storages, and each datum should be identified using $O(\log_2 N)$ bits of address in order to control data shuffling operations through $\log_2 N$ butterfly stages. In this case, all PE's participate in butterfly computations. Note that a 2-D mesh array of N/2


Authorized licensed use limited to: NED UNIV OF ENGINEERING AND TECHNOLOGY. Downloaded on April 06,2022 at 07:26:28 UTC from IEEE Xplore. Restrictions apply.

PE's for N-point FFT computation is considered to be a special case with P = 2.

On the other hand, a 2-D mesh array composed of N PE's whose arithmetic function is also the usual radix-2 butterfly [19] can perform data shuffling operation without $O(\log_2 N)$ bits of address. It, however, reveals some inefficiencies as follows: a) it requires data reshuffling operations after butterfly computation; and b) only half of the PE's in the array participate in butterfly computation.

In order to resolve these problems inherent in 2-D mesh array implementation for FFT, we make a compromise by adopting half-butterfly arithmetic (HBA) [16] as the arithmetic function of PE (see Fig. 1(b)), rather than the usual radix-2 butterfly. The HBA takes its name from the definition given in (1), which represents half of the conventional radix-2 butterfly:

$$\mathrm{HBA}\& = f(q-1, k) \;\&\; f(q-1, k + D(q)) \cdot W^{pt} \qquad (1)$$

where

$$q = 1, 2, 3, \ldots, \log_2 N$$
$$W^{pt} = \exp(-j 2\pi p t / N).$$

In (1), the operator "&" denotes either "+" (addition) or "-" (subtraction) and D(q) represents the shuffling distance of the qth butterfly stage. An algorithm for computing FFT using HBA will be described in Section III.

The adoption of HBA as the arithmetic function of PE results in a simple control which does not require $O(\log_2 N)$ bits of address, as well as 100% PE utilization (i.e., no idle PE's during butterfly computation) and no data reshuffling. These facts reveal the merits of using the HBA concept.

TABLE I
COMPARISON OF VARIOUS PARALLEL ARCHITECTURES FOR FFT COMPUTATION

Area | interconnection | interconnection | PE's | PE's + DC's | PE's + SE's | PE's

PE's: Processing Elements, DC's: Delay Commutators, SE's: Systolic Elevators

Fig. 1. (a) Two-dimensional mesh array for FFT computation. (b) Arithmetic function of a PE. The operator "&" denotes "+" or "-".

C. Time Complexity

The time required to compute N-point FFT can be evaluated in terms of total time for data shuffling and butterfly arithmetic through $\log_2 N$ butterfly stages. A summation of the time contributions of all shuffling distances will give the total shuffling time of $2(\sqrt{N} - 1)T_s$, if a unit shuffling distance takes time $T_s$. Also, the total time for butterfly arithmetic will be given by $T_{hb} \log_2 N$, where $T_{hb}$ represents the time for HBA. Both $T_{hb}$ and $T_s$ can be considered to be a unit clock if the HBA is implemented with parallel arithmetic.

Asymptotically, however, $2(\sqrt{N} - 1)T_s + T_{hb} \log_2 N$ can be reduced to $O(\sqrt{N})$, since $T_s$ and $T_{hb}$ are independent of N. This fact suggests that a bit-parallel communication scheme between PE's is preferable.

III. ALGORITHM FOR FFT COMPUTATION ON THE ARCHITECTURE

This section describes an algorithm for FFT computation using HBA. A procedure for computing a 16-point


FFT is described as follows:

BEGIN
  FOR q = 1 TO 4 DO IN PARALLEL for all PE's
  BEGIN
    data shuffling (ALGORITHM I)
    HBA computation (ALGORITHM II)
  END
END.

Initially, 16 input data are loaded into the array in row-major order. Then, data shuffling and HBA computation are repeated through four (i.e., $\log_2 16$) butterfly stages according to Algorithms I and II, respectively (see the Appendix). After the fourth butterfly stage has been completed, the array pipes out the results and accepts new input data in parallel and pipelined fashion through each of the four boundary PE's.

Fig. 2 visualizes these procedures. In the first butterfly stage (Fig. 2(a)), PE's in the first row send and receive data to and from the PE's in the third row; at the same time, the PE's in the second row send and receive data to and from the PE's in the fourth row. It takes two units of time to carry out this data shuffling operation because the shuffling distance is 2. Then, the PE's in the first and second rows, denoted as black circles, compute HBA+, and the PE's in the third and fourth rows, denoted as white circles, compute HBA-. Similar operations are repeated in the second (Fig. 2(b)), third (Fig. 2(c)), and fourth (Fig. 2(d)) butterfly stages.

Fig. 2. Sixteen-point FFT computation using HBA: (a) first butterfly stage (q = 1), (b) second butterfly stage (q = 2), (c) third butterfly stage (q = 3), and (d) fourth butterfly stage (q = 4).

IV. CIRCUIT DESIGN AND LAYOUT

A. Circuit Design

The chip architecture is a mesh array of 16 PE's as shown in Fig. 3. A PE consists of five blocks: two data routing units (DRU's), an HBA unit (HBAU), a coefficient store, and an internal control logic (ICL), as shown in Fig. 4.

Fig. 3. Chip architecture.

Fig. 4. Internal block diagram of PE.

The DRU's perform data communications with four nearest-neighbor PE's to perform data shuffling. Each DRU is composed of two multiplexers (MUX's) and a pipeline register. The MUX in the DRU controls data shuffling directions (i.e., vertical data shuffling or horizontal data shuffling). The pipeline register can be reconfigured depending on control inputs to serve three purposes. In data shuffling mode, it behaves as a parallel-in, parallel-out (PIPO) register to provide bit-parallel data communication with neighboring PE's. During the HBA computation, the pipeline register is reconfigured into a parallel-in, serial-out (PISO) register to carry out arithmetic shift right. Also, in self-testing mode, it is reconfigured into a linear feedback shift register (LFSR) to generate pseudo-random test patterns.
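The stage schedule of Section III (data shuffling per Algorithm I, HBA type per Algorithm II in the Appendix) can be exercised in software. The following Python sketch is only an illustration under stated assumptions, not the chip's arithmetic. The function names are hypothetical. The twiddle exponents are not listed in the text, so the sketch assumes the standard radix-2 decimation-in-time values for natural-order input. It also assumes that an HBA+ PE forms "own + W × received" while an HBA- PE forms "received - W × own". Under these assumptions the results end up in bit-reversed order across the mesh.

```python
import cmath

N, SIDE, STAGES = 16, 4, 4  # 16 points, 4 x 4 mesh, log2(16) butterfly stages


def shuffle_distance(q):
    """D(q) of Algorithm I: 2, 1 (vertical), then 2, 1 (horizontal)."""
    return 2 ** (2 - q) if q <= 2 else 2 ** (4 - q)


def is_plus_pe(q, i, j):
    """Algorithm II: True if PE(i, j) computes HBA+ in stage q."""
    if q <= 2:
        return i % 2 ** (3 - q) < 2 ** (2 - q)
    return j % 2 ** (5 - q) < 2 ** (4 - q)


def partner(q, i, j):
    """Algorithm I: the PE that exchanges data with PE(i, j) in stage q."""
    d = shuffle_distance(q) if is_plus_pe(q, i, j) else -shuffle_distance(q)
    return (i + d, j) if q <= 2 else (i, j + d)


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def twiddle(q, i, j):
    """ASSUMPTION: standard DIT twiddle for natural-order input."""
    block = (SIDE * i + j) >> (STAGES - q + 1)
    exponent = bit_reverse(block, q - 1) * (N >> q)
    return cmath.exp(-2j * cmath.pi * exponent / N)


def mesh_fft16(x):
    """Run the four shuffle/HBA stages on row-major input x[4*i + j]."""
    grid = [[complex(x[SIDE * i + j]) for j in range(SIDE)] for i in range(SIDE)]
    for q in range(1, STAGES + 1):
        new = [[0j] * SIDE for _ in range(SIDE)]
        for i in range(SIDE):
            for j in range(SIDE):
                pr, pc = partner(q, i, j)
                own, received = grid[i][j], grid[pr][pc]
                w = twiddle(q, i, j)
                if is_plus_pe(q, i, j):
                    new[i][j] = own + w * received   # HBA+ (assumed operand roles)
                else:
                    new[i][j] = received - w * own   # HBA- (assumed operand roles)
        grid = new
    flat = [grid[i][j] for i in range(SIDE) for j in range(SIDE)]
    # Results sit in bit-reversed order; reorder them to natural order.
    return [flat[bit_reverse(k, STAGES)] for k in range(N)]
```

Summing `shuffle_distance(q)` over the four stages gives 6 unit shuffles, which agrees with the total shuffling time $2(\sqrt{N} - 1)T_s$ of Section II-C for N = 16.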

Fig. 5 shows the internal block diagram of the HBAU, which implements the HBA of (1) using the distributed arithmetic concept [17]. Using W-bit, fixed-point, fractional two's complement arithmetic, (1) is expressed by the distributed arithmetic as follows:

$$\mathrm{Re}\{\mathrm{HBA}\&\} = A_r \;\&\; 2^{-(W-1)} K_r + \sum_{n=0}^{W-1} Q_{rn}\, 2^{-n}$$
$$\mathrm{Im}\{\mathrm{HBA}\&\} = A_i \;\&\; 2^{-(W-1)} K_i + \sum_{n=0}^{W-1} Q_{in}\, 2^{-n} \qquad (2)$$

where

$$A_r = \mathrm{Re}\{f(q-1, k)\}, \quad A_i = \mathrm{Im}\{f(q-1, k)\}$$
$$B_r = \mathrm{Re}\{f(q-1, k + D(q))\}, \quad B_i = \mathrm{Im}\{f(q-1, k + D(q))\}.$$

In (2), $f(q-1, k)$ and $f(q-1, k + D(q))$ represent the intermediate results obtained from the (q - 1)th butterfly stage, and subscript n represents the nth bit of $B_r$ and $B_i$. Also, the $Q_{rn}$ and $Q_{in}$ are coefficients whose values are determined by an EXCLUSIVE-OR (XOR) and EXCLUSIVE-NOR (XNOR) combination of $B_{rn}$ and $B_{in}$. All the values for $Q_{rn}$ and $Q_{in}$ in (2), which vary depending on not only the position of PE but also the butterfly stage q, are precomputed and stored within each PE.

From (2) it is known that the HBA implementation based on the distributed arithmetic requires W addition steps, and the results obtained from the distributed arithmetic are combined with $A_r$ and $A_i$ to complete the HBA.

Fig. 5. Internal structure of the HBAU.

As shown in Fig. 5, the HBAU is implemented with two identical parts: one is responsible for the real part of HBA& and the other is for the imaginary part of HBA&. As a result, the HBAU is simply realized with only two adders. Among various adder structures, we choose a modified transmission gate adder (MTGA) [3] in order to minimize chip area. A 1-b MTGA requires only 16 transistors which have no connections to power or ground. Since it allows all drain and source regions within the adder to be shared between two devices, a dense layout of 90 × 98 μm² is achieved for a 1-b MTGA. Circuit simulation for an 8-b MTGA shows an 18-ns delay.

According to Parseval's theorem, the FFT output signal power is larger than the input signal power. Thus, in implementing the FFT with fixed-point arithmetic, a scaling policy is usually necessary to prevent overflow. For this purpose, we adopt a step-by-step scaling method which scales down data by a factor of 2 at every butterfly stage. This scaling is simply performed by 1-b shift right.

In general, finite register length within an arithmetic unit causes some roundoff errors, and reveals trade-offs between chip area and arithmetic error. In our array processor, a 1-b increase in register length results in an increase in the chip area of about 8%, but reduces total mean-squared error by 12 dB. In this prototype design, we use fixed-point 8-b as the operation data word, which yields a total mean-squared error of about -83 dB.

B. Layout Design

The layout design is accomplished in a hierarchical approach based on bit-slice units (BSU's). The hierarchy adopted is depicted in Fig. 6, and is described as follows:

1) design basic cells in full custom,
2) build a BSU by vertically stacking the basic cells,
3) construct a PE by repeating the BSU's,
4) complete the chip layout by abutting the PE's in 4 × 4 mesh array.

Fig. 6. Layout design hierarchy.
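Returning to the step-by-step scaling policy of Section IV-A, the effect of the 1-b shift right can be illustrated with integer words. The Python sketch below is a simplified model, not the HBAU datapath: it treats a single real add or subtract of two 8-b two's complement words and ignores the twiddle multiplication. The names `scaled_add` and `fits` are hypothetical.

```python
W_BITS = 8                                   # word length of the prototype
LO = -(1 << (W_BITS - 1))                    # most negative 8-b word, -128
HI = (1 << (W_BITS - 1)) - 1                 # most positive 8-b word, 127


def scaled_add(a, b, subtract=False):
    """Add or subtract two words, then scale by 1/2 with a 1-b shift right."""
    total = a - b if subtract else a + b
    return total >> 1                        # arithmetic shift on Python ints


def fits(word):
    """True if the value is representable in W_BITS two's complement."""
    return LO <= word <= HI


a, b = 100, 90
print(fits(a + b))                 # False: the unscaled sum 190 overflows
print(fits(scaled_add(a, b)))      # True: the scaled sum is 95
```

Because the sum of two W-bit words needs at most W + 1 bits, halving it after every stage keeps this simplified add/subtract within the word length, at the cost of one bit of precision per stage.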


The physical layouts of the basic cells are tailored to satisfy the boundary connection requirements. To verify these basic cells, switch-level logic simulation is carried out using the netlist data extracted from layout.

Once all the basic cells have been physically designed and verified, a BSU is assembled from them. The BSU is constructed with seven basic cells stacked vertically, as shown in Fig. 7. The height and width of a BSU are 1468 and 90 μm, respectively.

Fig. 7. Floor plan of a BSU.

Fig. 8 shows the floor plan of a PE, which is built up of two identical half-planes with a central control section. Each half-plane consists of eight BSU's. All interconnections between BSU's are achieved through abutment. Buffers are inserted between ICL and each half-plane to drive the control signals decoded by the ICL. The final layout at the chip level is accomplished by a simple arrangement of the PE's in a 4 × 4 mesh array.

Fig. 8. Floor plan of a PE.

The chip contains about 48 000 transistors in an area of 53.52 mm², and die size including 83 pads is 76.56 mm². The layout area of a BSU containing 150 transistors is 0.132 mm², and a PE composed of 3000 transistors requires 3.345 mm². Some layout statistics are summarized in Table II.

TABLE II
LAYOUT STATISTICS

Block          Area (mm²)    Transistors
BSU            0.132         150
               0.967
PE             3.345         3000
4 × 4 Array    53.52         48 000
Die size including 83 pads: 76.56 mm²

Since all PE's in the chip are synchronized by a global clock signal, a clock skew problem due to different path lengths from clock pad to each PE was eliminated by an H-tree clock distribution scheme [18].

The layout design methodology adopted in this paper features: a) regular layout structure; b) fast design and easy debugging; c) minimum design errors; and d) flexible extension in both array size and data bit width.

V. FABRICATION AND MEASUREMENT RESULTS

The chip has been fabricated with a 2-μm p-well CMOS technology with two metal layers. The effective channel lengths of the PMOS and NMOS transistors are 1.6 and 1.44 μm, respectively. The microphotograph of the fabricated chip is shown in Fig. 9.

Fig. 9. Microphotograph of the fabricated array processor.

Fig. 10 shows the measured waveforms for load signal and data out for the DRU block. The measured delays from load signal to data out are 12, 8, and 6.2 ns at $V_{DD}$ = 4, 5, and 6 V, respectively. This result indicates that each PE within the array can perform data communications with neighboring PE's at speeds up to 65 MHz.

Fig. 10. Measured waveforms for the operation of DRU. Measured delay from load signal LBR to data out shows 8 ns at $V_{DD}$ = 5 V.

Fig. 11. Area estimation of the 2-D mesh array for FFT. Area is normalized by layout area of a PE, $A_p$.

From the measurement results, the HBA computation takes 0.65 μs, and the array processor computes a 16-point FFT every 3 μs, which is equivalently 24.6 million complex multiplications per second.
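The measured figures can be cross-checked with a few lines of arithmetic. The Python sketch below is one plausible reading, not a derivation given in the paper: it assumes one complex multiplication per PE per butterfly stage, and it assumes that the multiplication rate is counted over the HBA computation time alone (four stages of 0.65 μs), while the sample rate is counted over the full 3 μs per transform.

```python
PES = 16            # processing elements in the 4 x 4 array
STAGES = 4          # log2(16) butterfly stages
T_HBA = 0.65e-6     # measured HBA computation time in seconds
T_FFT = 3e-6        # measured time per 16-point FFT in seconds

# One input sample per PE is transformed every T_FFT seconds.
samples_per_second = PES / T_FFT

# ASSUMPTION: one complex multiplication per PE per stage,
# counted over the HBA computation time only.
cmults_per_second = PES * STAGES / (STAGES * T_HBA)

print(round(samples_per_second / 1e6, 1))   # 5.3 (million samples per second)
print(round(cmults_per_second / 1e6, 1))    # 24.6 (million complex multiplications per second)
```

Under these assumptions both quoted rates are reproduced: 16 samples every 3 μs is about 5.3 million samples per second, and 64 half-butterflies in 2.6 μs is about 24.6 million complex multiplications per second.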

VI. DISCUSSION

According to the measurement results of the fabricated chip, the throughput rate (i.e., the number of samples processed per unit time) of our array processor is estimated to be 5.3 MHz, which is much higher than the 40 kHz of the processor computing the discrete Fourier transform (DFT) [21]. It has been reported that the programmable DSP processors execute 512-point complex FFT within 4.5 ms [23] and 1024-point FFT in 1.0-1.33 ms [3], [24], achieving throughput rates in the range of 0.1-1.0 MHz, which are much lower than that of our array processor. Also, it is comparable to 8.8 MHz of the FFT processor based on radix-4 butterfly [4] and 10 MHz of the FFT system built with 27 IC's [5].

Although the array processor described in this paper implements only 16 PE's with a 2-μm CMOS process, it will be possible to integrate more than 16 PE's into a single chip if a submicrometer technology is used. The area $A(N, \lambda)$ required for an array of N PE's can be parameterized as follows:

$$A(N, \lambda) = A_p N \lambda^2 \times 10^6. \qquad (3)$$

In (3), $\lambda$ denotes half of the minimum feature size of a given technology, and $A_p$ represents the physical layout area of a PE. The value of $A_p$ will depend on the physical realization of PE. In our design using double-metal layer, $A_p$ is 3.345 (see Table II).

Fig. 11 shows this area estimation using the $\lambda$ parameter, which is normalized by $A_p$. Based on Fig. 11, it will be possible to implement an array processor for 32 × 32 2-D FFT in an area of about 1 cm² if a submicrometer CMOS technology is used.

VII. CONCLUSIONS

An array processor dedicated to compute 16-point FFT has been implemented using a 2-μm CMOS process. To achieve an area-efficient computation of the FFT, some design considerations have been examined both in array level and in PE level. The novel features in designing the array processor are the half butterfly arithmetic concept realized with the distributed arithmetic, and the hierarchical design methodology based on bit-slice units.

The prototype implementation of this paper leads us to conclude that the 2-D mesh array architecture and algorithm presented in this paper will be one of the most attractive candidates for the real-time computation of the FFT in VLSI.

APPENDIX

Algorithm I: Data Shuffling Procedure Based on HBA

BEGIN
  FOR q = 1 TO 4 DO IN PARALLEL for all i and j (i, j = 0, 1, 2, 3)
  BEGIN
    IF (q ≤ 2) THEN
      D(q) = 2^(2-q)
      IF (i MOD 2^(3-q) < 2^(2-q)) THEN
        move data in PE(i, j) to PE(i + D(q), j)
      ELSE
        move data in PE(i, j) to PE(i - D(q), j)
    ELSE
      D(q) = 2^(4-q)
      IF (j MOD 2^(5-q) < 2^(4-q)) THEN
        move data in PE(i, j) to PE(i, j + D(q))
      ELSE
        move data in PE(i, j) to PE(i, j - D(q))
  END
END.

Algorithm II: Determination of the Arithmetic Type of the HBA&

BEGIN
  FOR q = 1 TO 4 DO IN PARALLEL for all i and j (i, j = 0, 1, 2, 3)
  BEGIN
    IF (q ≤ 2) THEN
      IF (i MOD 2^(3-q) < 2^(2-q)) THEN
        PE(i, j) compute the HBA+
      ELSE
        PE(i, j) compute the HBA-


    ELSE
      IF (j MOD 2^(5-q) < 2^(4-q)) THEN
        PE(i, j) compute the HBA+
      ELSE
        PE(i, j) compute the HBA-
  END
END.

REFERENCES

[1] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[2] K. L. Kloker, B. Lindsley, N. Baron, and G. R. L. Sohie, "Efficient FFT implementation on an IEEE floating-point digital signal processor," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1988, pp. 1399-1402.
[3] R. W. Linderman, P. M. Ahau, W. H. Ku, and P. P. Peusens, "Cusp: A 2-μm CMOS digital signal processor," IEEE J. Solid-State Circuits, vol. SC-20, pp. 761-769, June 1985.
[4] J. O'Brien, J. Mather, and B. Holland, "A 200 MIPS single-chip 1K FFT processor," in ISSCC Dig. Tech. Papers, Feb. 1989, pp. 166-167.
[5] J. Fox, G. Surace, and P. A. Thomas, "A self-testing 2-μm CMOS chip set for FFT applications," IEEE J. Solid-State Circuits, vol. SC-22, pp. 15-19, Feb. 1987.
[6] K. Yamashita et al., "A wafer-scale 170 000-gate FFT processor with built-in test circuits," IEEE J. Solid-State Circuits, vol. 23, pp. 336-341, Apr. 1988.
[7] W. Cochran et al., "What is the fast Fourier transform?," IEEE Trans. Audio Electroacoust., vol. AU-15, pp. 45-55, 1967.
[8] H. Stone, "Parallel processing with the perfect shuffle network," IEEE Trans. Comput., vol. C-20, pp. 153-161, Dec. 1986.
[9] M. K. Lee, "Systolic array for FFT computation," in '87 Multi Project Chip (MPC'87) Development, Dept. Electronic Eng., Yonsei Univ., Seoul, Korea, Aug. 1988.
[10] K. W. Shin, "A VLSI architecture for the parallel computation of FFT," Ph.D. dissertation, Dept. Electronic Eng., Yonsei Univ., Seoul, Korea, Aug. 1990.
[11] K. W. Shin, B. Y. Choi, and M. K. Lee, "A VLSI architecture of systolic array for FFT computation," J. KITE, vol. 25, no. 9, pp. 97-106, 1988.
[12] B. Y. Choi, B. H. Kang, J. K. Lee, K. W. Shin, and M. K. Lee, "VLSI implementation of two-dimensional FFT algorithm on systolic array," in Proc. TENCON87 IEEE Region-10 Conf., 1987, pp. 125-131.
[13] E. E. Swartzlander, W. K. W. Young, and S. J. Joseph, "A radix 4 delay commutator for fast Fourier transform processor implementation," IEEE J. Solid-State Circuits, vol. SC-19, no. 5, pp. 702-709, Oct. 1984.
[14] T. Wiley, T. S. Durrani, and R. Champinin, "An FFT systolic processor and its applications," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1984, vol. 2, pp. 4.1-4.4.
[15] L. W. Chang and M. Y. Chen, "A new systolic array for discrete Fourier transform," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 36, no. 10, pp. 1665-1666, 1988.
[16] K. W. Shin and M. K. Lee, "A VLSI architecture for parallel computation of FFT," Systolic Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1989, pp. 116-125.
[17] S. A. White, "A simple FFT butterfly arithmetic unit," IEEE Trans. Circuit Syst., vol. CAS-28, no. 4, pp. 352-355, Apr. 1981.
[18] A. L. Fisher and H. T. Kung, "Synchronizing large VLSI processor arrays," IEEE Trans. Comput., vol. C-34, no. 8, pp. 734-740, Aug. 1985.
[19] C. D. Thomson, "A complexity theory for VLSI," Ph.D. dissertation, Comput. Sci. Dept., Carnegie-Mellon Univ., Pittsburgh, PA, Aug. 1980.
[20] I. R. Mactaggart and M. A. Jack, "A single chip radix-2 FFT butterfly architecture using parallel data distributed arithmetic," IEEE J. Solid-State Circuits, vol. SC-19, no. 3, pp. 368-373, June 1984.
[21] J. L. V. Meerhergen and F. J. V. Wyk, "A 256-point discrete Fourier transform processor fabricated in 2-μm NMOS technology," IEEE J. Solid-State Circuits, vol. SC-18, no. 5, pp. 604-609, Oct. 1983.
[22] H. S. Lee, H. Mori, and H. Aiso, "Parallel processing FFT for VLSI implementation," Trans. IECE Japan, vol. E68, no. 5, pp. 284-291, May 1985.
[23] Y. Kawakami et al., "A 32b floating point CMOS digital signal processor," in ISSCC Dig. Tech. Papers, Feb. 1986, pp. 86-87.
[24] A. Kanuma et al., "A 20 MHz 32b pipelined CMOS image processor," in ISSCC Dig. Tech. Papers, Feb. 1986, pp. 102-103.

Moon-Key Lee (S'77-M'79) was born in Seoul, Korea, in 1941. He earned the B.S., M.S., and D.E. degrees in electrical engineering from Yonsei University, Seoul, Korea in 1965, 1967, and 1973, respectively. In 1980 he received the Ph.D. degree from the University of Oklahoma.
He was a Lecturer in the Department of Electrical Engineering, Yonsei University, from 1968 to 1970. From 1970 to 1976 he was with Kyunghee University, Seoul, Korea, where he held the positions of Assistant Professor, Associate Professor, and Chairman of the Electronic Engineering Department. He was Director of the IC design division at Korea Institute of Electronic Technology (ETRI, at present), Kumi, Korea, from 1980 to 1982. In 1982 he joined the faculty of Yonsei University, Seoul, Korea, where he is currently Professor and Chairman of the Electronic Engineering Department. He is a founder of the Research Institute of ASIC Design (RIAD), which was established in 1989 and located at Yonsei University. He has published five college textbooks on electronic engineering. He has authored and co-authored over 100 papers on integrated circuit design and computer-aided design. He made pioneering contributions to IC design education by performing in the Multi Project Chip research, which was participated in by eight universities and supported by the Ministry of Science and Technology of Korea from 1985 through 1988, for the first time in Korea. His current research interests include VLSI design and computer-aided design and software.
Dr. Lee was the Chairman of the Chapter Promotion and educational activities in the IEEE Korea Section. He has served as Chairman of the IEEE Circuits and Systems chapter in Korea which was formed in 1989. In 1985, he organized the International Workshop on VLSI and CAD, Seoul, Korea and served as Chairman. He was the Program Chairman of the 1989 International Conference on VLSI and CAD (ICVC), Seoul, Korea. He was the Chairman of the financial committee of TENCON 87, IEEE Region 10 Conference, Seoul, Korea, in 1987. From 1982 through 1987 he was a member of the editorial committee and an Associate Editor of the Journal of Korean Institute of Telematics and Electronics (KITE). He received the distinguished honor award from KITE in 1984. In 1986, he was awarded the 40th anniversary prize for the outstanding paper published in the journal of KITE. Since 1988 he has served on the Board of Directors of the Korean Institute of Telematics and Electronics (KITE).

Kyung-Wook Shin was born in Chongiu, Korea, in 1961. He received the M.S. and Ph.D. degrees in electronic engineering from Yonsei University, Seoul, Korea, in 1986 and 1990, respectively. During his Ph.D. course, he worked on the design of a VLSI array processor for parallel computation of fast Fourier transform (FFT).
In 1990 he joined the Integrated Circuits Technology Department of the Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea. In 1991 he joined the faculty of Kum Oh Institute of Technology, Kyungbuk, Korea. His research interests include VLSI signal processing, algorithm-oriented VLSI array architectures, and digital design.

Jang-Kyu Lee was born in Seoul, Korea, in 1963. He received the B.S. degree in electronic engineering from Sogang University, Seoul, Korea, in 1986, and the M.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1988.
In 1988 he joined the Semiconductor Division of Samsung Electronics, Inc., Kyungki Do, Korea, where he works on memory IC design. His research interests include DRAM design and application-specific memory design.

No ratings yet
Hw-Efficient Reduced-Latency Architecture For Configurable Mixed-Radix FFT Processors
7 pages
A Memory-Based FFT Processor Design With Generalized Efficient Conflict-Free Address Schemes
No ratings yet
A Memory-Based FFT Processor Design With Generalized Efficient Conflict-Free Address Schemes
11 pages
128-Points FFT CORE ARchitectutre
No ratings yet
128-Points FFT CORE ARchitectutre
12 pages
Efficient Cached 64 Point FFT Processor Using Floating Point Arithmetic For OFDM
No ratings yet
Efficient Cached 64 Point FFT Processor Using Floating Point Arithmetic For OFDM
6 pages
2003 A 2048 Complex Point FFT Processor Using A Novel Data Scaling Approach
No ratings yet
2003 A 2048 Complex Point FFT Processor Using A Novel Data Scaling Approach
4 pages
15.pipelined Parallel FFT Architectures Via Folding Transformation
No ratings yet
15.pipelined Parallel FFT Architectures Via Folding Transformation
14 pages
Fpga Implementation of FFT Algorithm For Ieee 802.16E (Mobile Wimax)
No ratings yet
Fpga Implementation of FFT Algorithm For Ieee 802.16E (Mobile Wimax)
7 pages
LiU Tek Lic 2003 23 W - Li
No ratings yet
LiU Tek Lic 2003 23 W - Li
120 pages
VLSI Implementation of Pipelined Fast Fourier Transform
No ratings yet
VLSI Implementation of Pipelined Fast Fourier Transform
6 pages
2020-2
No ratings yet
2020-2
5 pages
PTFFT 2
No ratings yet
PTFFT 2
6 pages
SSRN Id3869494
No ratings yet
SSRN Id3869494
5 pages
Design and Simulation of 32-Point FFT Using Radix-2 Algorithm For FPGA 2012
No ratings yet
Design and Simulation of 32-Point FFT Using Radix-2 Algorithm For FPGA 2012
5 pages
FPGA Implementation of FFT Using Heterogeneous Adder
No ratings yet
FPGA Implementation of FFT Using Heterogeneous Adder
4 pages
Comp Networking - IJCNWMC - Design Approach For Implementation
No ratings yet
Comp Networking - IJCNWMC - Design Approach For Implementation
8 pages
A 128/512/1024/2048-Point Pipeline Fft/Ifft Architecture For Mobile Wimax
No ratings yet
A 128/512/1024/2048-Point Pipeline Fft/Ifft Architecture For Mobile Wimax
2 pages
High-Throughput VLSI Architecture For FFT Computation
No ratings yet
High-Throughput VLSI Architecture For FFT Computation
5 pages
Doc1132 PDF
No ratings yet
Doc1132 PDF
9 pages
CORDIC Based Implementation of Fast Fourier Transform: - CORDIC Is An Iterative Arithmetic Computing
No ratings yet
CORDIC Based Implementation of Fast Fourier Transform: - CORDIC Is An Iterative Arithmetic Computing
6 pages
Loop-shaping Robust Control
From Everand
Loop-shaping Robust Control
Philippe Feyel
No ratings yet
PLC: Programmable Logic Controller – Arktika.: EXPERIMENTAL PRODUCT BASED ON CPLD.
From Everand
PLC: Programmable Logic Controller – Arktika.: EXPERIMENTAL PRODUCT BASED ON CPLD.
MARIO FRANCO
No ratings yet
Computer Aided Visual Materials
No ratings yet
Computer Aided Visual Materials
20 pages
In A Logic Circuit A Hazard Is Independent Form
No ratings yet
In A Logic Circuit A Hazard Is Independent Form
11 pages
Chapter 1
No ratings yet
Chapter 1
8 pages
JIJNASA'24
No ratings yet
JIJNASA'24
117 pages
Intro To Computing
No ratings yet
Intro To Computing
50 pages
Computer System 1.1
No ratings yet
Computer System 1.1
24 pages
A Low Noise Power 45 NM Technology Based Simultaneous Switching Noise (SSN) Reduction Model For Mixed Signal VLSI Circuits
No ratings yet
A Low Noise Power 45 NM Technology Based Simultaneous Switching Noise (SSN) Reduction Model For Mixed Signal VLSI Circuits
7 pages
Generation of Computers
No ratings yet
Generation of Computers
4 pages
15A04802 Low Power VLSI Circuits & Systems
No ratings yet
15A04802 Low Power VLSI Circuits & Systems
1 page
Implementation of Low Power Ternary Logic Gates Using CMOS Technology
No ratings yet
Implementation of Low Power Ternary Logic Gates Using CMOS Technology
4 pages
Interact With IT Book 1 Answers
No ratings yet
Interact With IT Book 1 Answers
77 pages
VDTT 2020 Brochure
No ratings yet
VDTT 2020 Brochure
10 pages
Unit 2
No ratings yet
Unit 2
62 pages
Ayesha Intern Doc Final
No ratings yet
Ayesha Intern Doc Final
35 pages
Fabrication of Mosfet
89% (9)
Fabrication of Mosfet
19 pages
Classification of Computers: Introduction To Computing
No ratings yet
Classification of Computers: Introduction To Computing
21 pages
Overview of Computing and Computer Systems
No ratings yet
Overview of Computing and Computer Systems
68 pages
Assignment On Generation of Computer Send
No ratings yet
Assignment On Generation of Computer Send
4 pages
Vlsi Design: A Training Report
No ratings yet
Vlsi Design: A Training Report
7 pages
Ict Notes-1
No ratings yet
Ict Notes-1
268 pages
Aicte FDP Vlsi Brouchure
No ratings yet
Aicte FDP Vlsi Brouchure
2 pages
Notes MP 001
No ratings yet
Notes MP 001
4 pages
M.Tech MVLSI 2021 22
No ratings yet
M.Tech MVLSI 2021 22
1 page
Grade 7 CBC Computer Science Note Slides
100% (2)
Grade 7 CBC Computer Science Note Slides
73 pages
ASICs by M J Smith
100% (10)
ASICs by M J Smith
1,179 pages
VLSI Objectives Outline 2019
No ratings yet
VLSI Objectives Outline 2019
3 pages
Built-In Fault-Tolerant Computing Paradigm For Resilient Large-Scale Chip Design
No ratings yet
Built-In Fault-Tolerant Computing Paradigm For Resilient Large-Scale Chip Design
318 pages
Preview-9781351838412 A35484080
No ratings yet
Preview-9781351838412 A35484080
90 pages
Course Contents For AI in VLSI Designs
No ratings yet
Course Contents For AI in VLSI Designs
2 pages

