
1286 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 26, NO. 9, SEPTEMBER 1991

A VLSI Array Processor for 16-Point FFT


Moon-Key Lee, Member, IEEE, Kyung-Wook Shin, and Jang-Kyu Lee

Manuscript received May 21, 1990; revised April 22, 1991.
M.-K. Lee is with the VLSI and CAD Laboratory, Department of Electronic Engineering, Yonsei University, Seoul, 120-749, Korea.
K.-W. Shin was with the Department of Electronic Engineering, Yonsei University, Seoul, 120-749, Korea. He is now with the Department of Electronic Engineering, Kum Oh Institute of Technology, 188 Shinpyung-Dong, Kumi City, Kyungbuk 730-701, Korea.
J.-K. Lee was with the Department of Electronic Engineering, Yonsei University, Seoul, 120-749, Korea. He is now with the Semiconductor Division, Samsung Electronics, Inc., Kyungki Do, 449-900, Korea.
IEEE Log Number 9101784.
0018-9200/91/0900-1286$01.00 © 1991 IEEE

Abstract-This paper describes an implementation of a two-dimensional array processor for fast Fourier transform (FFT) using a 2-µm CMOS technology. The array processor, which is dedicated to 16-point FFT, implements a 4 × 4 mesh array of 16 processing elements (PE's) working in parallel. Design considerations both in chip level and in PE level have been examined. A layout design methodology based on bit-slice units (BSU's) results in a very simple design and easy debugging, as well as a regular interconnection scheme through abutment. It contains about 48 000 transistors on an area of 53.52 mm², excluding the 83-pad area, and operation is on a 15-MHz clock. The array processor performs 24.6 million complex multiplications per second, and computes a 16-point FFT in 3 µs.

I. INTRODUCTION

RECENT advances in very large scale integration (VLSI) technology make the implementation of array processors technologically realizable and economically feasible. So, a new approach adopting algorithm-based VLSI array processors is becoming increasingly competitive for real-time digital signal processing (DSP) and image processing applications [1].

The fast Fourier transform (FFT) is probably one of the most important algorithms in DSP applications. There are two approaches for computing this transform: software implemented on a programmable DSP [2], [3], [23], [24], and dedicated FFT processor development [4]-[6]. Real-time DSP favors the use of the latter, which offers parallel processing capability.

In this paper, a dedicated array processor implementation is described. As known, the FFT algorithm requires global data exchanges for butterfly computation, thus the direct implementation of the FFT flow graph onto silicon is considered to be inefficient because of its inherent communication cost in terms of area and time. To achieve area-efficient implementation of the FFT with a high degree of regularity and parallelism, we utilize two-dimensional mesh array of processing elements (PE's).

This paper is organized as follows. Section II describes design considerations. Then an algorithm to compute FFT using half-butterfly arithmetic (HBA) will be presented in Section III. Section IV will describe the circuit and layout design. Finally, fabrication and some measurement results will be discussed in Section V.

II. DESIGN CONSIDERATIONS

This section discusses design considerations for our FFT array processor including chip architecture, the arithmetic function of a PE, and time complexity.

A. Chip Architecture

A number of approaches for FFT implementation are known including an FFT network [7], a perfect-shuffle network [8], a 2-D mesh array [9]-[12], pipelined architectures utilizing delay commutator [13] or systolic elevator [14], and a 1-D systolic DFT array [15]. Table I compares these architectures in terms of area, time, and regularity. From Table I, it is easily found that the FFT network and the perfect-shuffle network are not suitable for VLSI realization since irregular interconnections between PE's dominate the chip area. The pipelined architectures utilizing delay commutator or systolic elevator require additional hardware overhead of $O(N \log N)$ for the data shuffling of N-point FFT.

Our design goal in the architectural level is to determine a parallel architecture satisfying the following criteria: 1) highly parallel processing capability; 2) architectural regularity; 3) local data communications; and 4) no hardware overhead for data shuffling operation. Since the minimization of chip area with architectural regularity is considered much more important than any other factors in VLSI design, we choose the 2-D mesh array shown in Fig. 1(a) as the chip architecture.

B. Arithmetic Function of PE

In the implementation of a 2-D mesh array for FFT, careful examinations into the trade-offs between the number of PE's, the arithmetic function of PE, and control complexity should be made.

When N-point FFT is computed using a 2-D mesh array having M (where $2M \le N$) PE's whose arithmetic functions are the usual radix-2 butterfly [22], each PE requires $P = N/M$ data storages, and each datum should be identified using $O(\log_2 N)$ bits of address in order to control data shuffling operations through $\log_2 N$ butterfly stages. In this case, all PE's participate in butterfly computations. Note that a 2-D mesh array of $N/2$


Authorized licensed use limited to: NED UNIV OF ENGINEERING AND TECHNOLOGY. Downloaded on April 06,2022 at 07:26:28 UTC from IEEE Xplore. Restrictions apply.

PE's for N-point FFT computation is considered to be a special case with $P = 2$.

TABLE I
COMPARISON OF VARIOUS PARALLEL ARCHITECTURES FOR FFT COMPUTATION

Architecture | FFT network [7] | Perfect-shuffle network [8] | 2-D mesh array [9]-[12] | Pipelined, delay commutator [13] | Pipelined, systolic elevator [14] | 1-D systolic DFT array [15]
Area | interconnection | interconnection | PE's | PE's+DC's | PE's+SE's | PE's

PE's: Processing Elements, DC's: Delay Commutators, SE's: Systolic Elevators

Fig. 1. (a) Two-dimensional mesh array for FFT computation. (b) Arithmetic function of a PE. The operator "&" denotes "+" or "-".

On the other hand, a 2-D mesh array composed of N PE's whose arithmetic function is also the usual radix-2 butterfly [19] can perform data shuffling operation without $O(\log_2 N)$ bits of address. It, however, reveals some inefficiencies as follows: a) it requires data reshuffling operations after butterfly computation; and b) only half of the PE's in the array participate in butterfly computation.

In order to resolve these problems inherent in 2-D mesh array implementation for FFT, we make a compromise by adopting half-butterfly arithmetic (HBA) [16] as the arithmetic function of PE (see Fig. 1(b)), rather than the usual radix-2 butterfly. The HBA takes its name from the definition given in (1), which represents half of the conventional radix-2 butterfly:

$$\mathrm{HBA}\& = f(q-1,\,k) \;\&\; f(q-1,\,k+D(q)) \cdot W^{pt} \qquad (1)$$

where

$$q = 1, 2, 3, \ldots, \log_2 N$$
$$W^{pt} = \exp(-j 2\pi p t / N).$$

In (1), the operator "&" denotes either "+" (addition) or "-" (subtraction) and $D(q)$ represents the shuffling distance of the qth butterfly stage. An algorithm for computing FFT using HBA will be described in Section III.

The adoption of HBA as the arithmetic function of PE results in a simple control which does not require $O(\log_2 N)$ bits of address, as well as 100% PE utilization (i.e., no idle PE's during butterfly computation) and no data reshuffling. These facts reveal the merits of using the HBA concept.

C. Time Complexity

The time required to compute N-point FFT can be evaluated in terms of total time for data shuffling and butterfly arithmetic through $\log_2 N$ butterfly stages. A summation of the time contributions of all shuffling distances will give the total shuffling time of $2(\sqrt{N} - 1)T_s$, if a unit shuffling distance takes time $T_s$. Also, the total time for butterfly arithmetic will be given by $T_{hb} \log_2 N$, where $T_{hb}$ represents the time for HBA. Both $T_{hb}$ and $T_s$ can be considered to be a unit clock if the HBA is implemented with parallel arithmetic.

Asymptotically, however, $2(\sqrt{N} - 1)T_s + T_{hb} \log_2 N$ can be reduced to $O(\sqrt{N})$, since $T_s$ and $T_{hb}$ are independent of N. This fact suggests that a bit-parallel communication scheme between PE's is preferable.

III. ALGORITHM FOR FFT COMPUTATION ON THE ARCHITECTURE

This section describes an algorithm for FFT computation using HBA. A procedure for computing a 16-point


FFT is described as follows:

BEGIN
  FOR q = 1 TO 4 DO IN PARALLEL for all PE's
  BEGIN
    data shuffling (ALGORITHM I)
    HBA computation (ALGORITHM II)
  END
END.

Initially, 16 input data are loaded into the array in row-major order. Then, data shuffling and HBA computation are repeated through four (i.e., $\log_2 16$) butterfly stages according to the Algorithms I and II, respectively (see the Appendix). After the fourth butterfly stage has been completed, the array pipes out the results and accepts new input data in parallel and pipelined fashion through each of the four boundary PE's.

Fig. 2 visualizes these procedures. In the first butterfly stage (Fig. 2(a)), PE's in the first row send and receive data to and from the PE's in the third row; at the same time, the PE's in the second row send and receive data to and from the PE's in the fourth row. It takes two units of time to carry out this data shuffling operation because shuffling distance is 2. Then, the PE's in the first and second rows denoted as black circles compute HBA+, and the PE's in the third and fourth rows denoted as white circles compute HBA-. Similar operations are repeated in the second (Fig. 2(b)), third (Fig. 2(c)), and fourth (Fig. 2(d)) butterfly stages.

Fig. 2. Sixteen-point FFT computation using HBA: (a) first butterfly stage (q = 1), (b) second butterfly stage (q = 2), (c) third butterfly stage (q = 3), and (d) fourth butterfly stage (q = 4).

IV. CIRCUIT DESIGN AND LAYOUT

A. Circuit Design

The chip architecture is a mesh array of 16 PE's as shown in Fig. 3. A PE consists of five blocks: two data routing units (DRU's), an HBA unit (HBAU), a coefficient store, and an internal control logic (ICL), as shown in Fig. 4.

Fig. 3. Chip architecture.

Fig. 4. Internal block diagram of PE.

The DRU's perform data communications with four nearest-neighbor PE's to perform data shuffling. Each DRU is composed of two multiplexers (MUX's) and a pipeline register. The MUX in the DRU controls data shuffling directions (i.e., vertical data shuffling or horizontal data shuffling). The pipeline register can be reconfigured depending on control inputs to serve three purposes. In data shuffling mode, it behaves as a parallel-in, parallel-out (PIPO) register to provide bit-parallel data communication with neighboring PE's. During the HBA computation, the pipeline register is reconfigured into a parallel-in, serial-out (PISO) register to carry out arithmetic shift right. Also, in self-testing mode, it is reconfigured into a linear feedback shift register (LFSR) to generate pseudo-random test patterns.
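As an illustration of the self-testing mode, the following minimal Python sketch models an 8-bit linear feedback shift register that generates pseudo-random patterns. The register width matches the 8-b data word used in this design, but the feedback polynomial (x^8 + x^4 + x^3 + x^2 + 1), the seed, and the function names are assumptions chosen for illustration; the feedback taps actually wired into the pipeline register are not given in the text.

```python
def lfsr8_step(state):
    """Advance an 8-bit Fibonacci LFSR by one shift.

    Assumed (hypothetical) feedback polynomial: x^8 + x^4 + x^3 + x^2 + 1.
    """
    # XOR the tapped bits (bit 7, 5, 4, 3) to form the new least significant bit.
    fb = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
    return ((state << 1) | fb) & 0xFF


def lfsr8_patterns(seed=0x01):
    """Return the patterns produced from a nonzero seed until the sequence repeats."""
    patterns = []
    state = seed
    while True:
        patterns.append(state)
        state = lfsr8_step(state)
        if state == seed:
            return patterns


if __name__ == "__main__":
    pats = lfsr8_patterns()
    print(len(pats), [format(p, "08b") for p in pats[:4]])
```

With a primitive feedback polynomial such as the one assumed here, the register steps through all 255 nonzero 8-bit states before repeating, so every nonzero pattern is applied once per period.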


Fig. 5 shows the internal block diagram of the HBAU, which implements the HBA of (1) using the distributed arithmetic concept [17]. Using W-bit, fixed-point, fractional two's complement arithmetic, (1) is expressed by the distributed arithmetic as follows:

$$\mathrm{Re}\{\mathrm{HBA}\&\} = A_r \;\&\; \Big[\, 2^{-(W-1)} K_r + \sum_{n=0}^{W-1} Q_{rn}\, 2^{-n} \Big]$$
$$\mathrm{Im}\{\mathrm{HBA}\&\} = A_i \;\&\; \Big[\, 2^{-(W-1)} K_i + \sum_{n=0}^{W-1} Q_{in}\, 2^{-n} \Big] \qquad (2)$$

where

$$A_r = \mathrm{Re}\{f(q-1,\,k)\}, \quad A_i = \mathrm{Im}\{f(q-1,\,k)\}$$
$$B_r = \mathrm{Re}\{f(q-1,\,k+D(q))\}, \quad B_i = \mathrm{Im}\{f(q-1,\,k+D(q))\}.$$

Fig. 5. Internal structure of the HBAU (labeled blocks: Coefficient Store, DRU-A, DRU-B, MTG Adder, To DRU).

In (2), $f(q-1, k)$ and $f(q-1, k+D(q))$ represent the intermediate results obtained from the (q-1)th butterfly stage, and subscript n represents the nth bit of $B_r$ and $B_i$. Also, the $Q_{rn}$ and $Q_{in}$ are coefficients whose values are determined by an EXCLUSIVE-OR (XOR) and EXCLUSIVE-NOR (XNOR) combination of $B_{rn}$ and $B_{in}$. All the values for $Q_{rn}$ and $Q_{in}$ in (2), which vary depending on not only the position of PE but also the butterfly stage q, are precomputed and stored within each PE.

From (2) it is known that the HBA implementation based on the distributed arithmetic requires W addition steps, and the results obtained from the distributed arithmetic are combined with $A_r$ and $A_i$ to complete the HBA.

As shown in Fig. 5, the HBAU is implemented with two identical parts: one is responsible for the real part of HBA& and the other is for the imaginary part of HBA&. As a result, the HBAU is simply realized with only two adders. Among various adder structures, we choose a modified transmission gate adder (MTGA) [3] in order to minimize chip area. A 1-b MTGA requires only 16 transistors which have no connections to power or ground. Since it allows all drain and source regions within the adder to be shared between two devices, a dense layout of 90 × 98 µm² is achieved for a 1-b MTGA. Circuit simulation for an 8-b MTGA shows an 18-ns delay.

According to Parseval's theorem, the FFT output signal power is larger than the input signal power. Thus, in implementing the FFT with fixed-point arithmetic, a scaling policy is usually necessary to prevent overflow. For this purpose, we adopt a step-by-step scaling method which scales down data by a factor of 2 at every butterfly stage. This scaling is simply performed by 1-b shift right.

In general, finite register length within an arithmetic unit causes some roundoff errors, and reveals trade-offs between chip area and arithmetic error. In our array processor, a 1-b increase in register length results in an increase in the chip area of about 8%, but reduces total mean-squared error by 12 dB. In this prototype design, we use fixed-point 8-b as the operation data word, which yields a total mean-squared error of about -83 dB.

Fig. 6. Layout design hierarchy.

B. Layout Design

The layout design is accomplished in a hierarchical approach based on bit-slice units (BSU's). The hierarchy adopted is depicted in Fig. 6, and is described as follows:

1) design basic cells in full custom,
2) build a BSU by vertically stacking the basic cells,
3) construct a PE by repeating the BSU's,
4) complete the chip layout by abutting the PE's in 4 × 4 mesh array.
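A behavioral reference model is useful for checking what the assembled 4 × 4 array should compute. The sketch below, in plain Python, applies four stages of data shuffling and HBA to 16 row-major inputs, following Algorithms I and II and the HBA definition in (1). The twiddle exponents are an assumption: they are derived here from a standard radix-2 decomposition with natural-order input, not taken from the coefficient store contents, which the text does not list. Under that assumption the results appear in bit-reversed order, and the per-stage 1-b scaling is omitted so that the output can be compared directly with a DFT. All function names are hypothetical.

```python
import cmath

N = 16  # transform length; PE(i, j) holds index p = 4*i + j (row-major loading)


def bitrev4(p):
    """Reverse the four bits of p (0 <= p < 16)."""
    return int(format(p, "04b")[::-1], 2)


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]


def hba_fft16(x):
    """Four stages of shuffle plus half-butterfly arithmetic on 16 values."""
    f = list(x)
    for q in range(1, 5):
        # Partner distance in p: 8, 4, 2, 1, i.e. D(q) = 2, 1 rows then 2, 1 columns.
        d = N >> q
        nxt = [0j] * N
        for p in range(N):
            k = p & ~d  # index of the upper operand f(q-1, k)
            # Assumed twiddle exponent: d times the bit-reversed top (q-1) bits of p.
            top = p >> (5 - q)
            t = int(format(top, "b").zfill(q - 1)[::-1], 2) if q > 1 else 0
            b = f[k + d] * cmath.exp(-2j * cmath.pi * d * t / N)
            # Algorithm II: bit d of p clear -> HBA+, bit d set -> HBA-.
            nxt[p] = f[k] + b if not (p & d) else f[k] - b
        f = nxt
    return f


if __name__ == "__main__":
    x = [complex(k, -k) for k in range(N)]
    y = hba_fft16(x)
    ref = dft(x)
    err = max(abs(y[p] - ref[bitrev4(p)]) for p in range(N))
    print("max deviation from reference DFT:", err)
```

Every position p computes exactly one output per stage, which mirrors the 100% PE utilization claimed for HBA; the test in `bit d of p` reproduces the row and column conditions of Algorithm II for q = 1 to 4.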


The physical layouts of the basic cells are tailored to satisfy the boundary connection requirements. To verify these basic cells, switch-level logic simulation is carried out using the netlist data extracted from layout.

Once all the basic cells have been physically designed and verified, a BSU is assembled from them. The BSU is constructed with seven basic cells stacked vertically, as shown in Fig. 7. The height and width of a BSU are 1468 and 90 µm, respectively.

Fig. 7. Floor plan of a BSU.

Fig. 8 shows the floor plan of a PE, which is built up of two identical half-planes with a central control section. Each half-plane consists of eight BSU's. All interconnections between BSU's are achieved through abutment. Buffers are inserted between ICL and each half-plane to drive the control signals decoded by the ICL. The final layout at the chip level is accomplished by a simple arrangement of the PE's in a 4 × 4 mesh array.

Fig. 8. Floor plan of a PE.

The chip contains about 48 000 transistors in an area of 53.52 mm², and die size including 83 pads is 76.56 mm². The layout area of a BSU containing 150 transistors is 0.132 mm², and a PE composed of 3000 transistors requires 3.345 mm². Some layout statistics are summarized in Table II.

TABLE II
LAYOUT STATISTICS

Block | Area (mm²) | Transistors
BSU | 0.132 | 150
- | 0.967 | -
PE | 3.345 | 3000
4 × 4 Array | 53.52 | 48 000
Die size including 83 pads: 76.56 mm²

Since all PE's in the chip are synchronized by a global clock signal, a clock skew problem due to different path lengths from clock pad to each PE was eliminated by H-tree clock distribution scheme [18].

The layout design methodology adopted in this paper features: a) regular layout structure; b) fast design and easy debugging; c) minimum design errors; and d) flexible extension in both array size and data bit width.

V. FABRICATION AND MEASUREMENT RESULTS

The chip has been fabricated with a 2-µm p-well CMOS technology with two metal layers. The effective channel lengths of the PMOS and NMOS transistors are 1.6 and 1.44 µm, respectively. The microphotograph of the fabricated chip is shown in Fig. 9.

Fig. 9. Microphotograph of the fabricated array processor.

Fig. 10 shows the measured waveforms for load signal and data out for the DRU block. The measured delays from load signal to data out are 12, 8, and 6.2 ns at $V_{DD}$ = 4, 5, and 6 V, respectively. This result says that each PE within the array can perform data communications with neighboring PE's at speeds up to 65 MHz.

Fig. 10. Measured waveforms for the operation of DRU. Measured delay from load signal LBR to data out shows 8 ns at $V_{DD}$ = 5 V.

Fig. 11. Area estimation of the 2-D mesh array for FFT. Area is normalized by layout area of a PE, $A_p$. (Horizontal axis: array size, $\log_2 N$.)

From the measurement results, the HBA computation takes 0.65 µs, and the array processor computes a 16-point FFT every 3 µs, which is equivalently 24.6 million complex multiplications per second.
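The measured figures above can be cross-checked with simple arithmetic. The sketch below assumes that each of the 16 PE's performs one complex multiplication per HBA, that one unit shuffling distance costs one period of the 15-MHz clock, and that the total shuffling distance is 2(√N - 1) as in Section II-C. These assumptions are illustrative readings of the text, not statements taken from the measurements, and the variable names are hypothetical.

```python
import math

N = 16            # FFT length, equal to the number of PE's
STAGES = 4        # log2(16) butterfly stages
CLOCK_HZ = 15e6   # operating clock quoted in the abstract
T_HBA = 0.65e-6   # measured HBA time per butterfly stage, in seconds

# Assumption: every PE performs one complex multiplication per HBA.
mults_per_second = N / T_HBA

# Assumption: one clock period per unit shuffling distance,
# with 2 * (sqrt(N) - 1) = 6 unit distances over the four stages.
shuffle_steps = 2 * (math.isqrt(N) - 1)
t_fft = STAGES * T_HBA + shuffle_steps / CLOCK_HZ

# Throughput rate in samples per second.
samples_per_second = N / t_fft

print(f"{mults_per_second / 1e6:.1f} M complex multiplications/s")
print(f"{t_fft * 1e6:.2f} us per 16-point FFT")
print(f"{samples_per_second / 1e6:.1f} M samples/s")
```

Under these assumptions the arithmetic gives about 24.6 million complex multiplications per second, 3 µs per transform (2.6 µs of HBA plus 0.4 µs of shuffling), and about 5.3 million samples per second, which agrees with the measured values and with the 5.3-MHz throughput rate quoted in Section VI.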

VI. DISCUSSION

According to the measurement results of the fabricated chip, the throughput rate (i.e., the number of samples processed per unit time) of our array processor is estimated to be 5.3 MHz, which is much higher than the 40 kHz of the processor computing the discrete Fourier transform (DFT) [21]. It has been reported that the programmable DSP processors execute 512-point complex FFT within 4.5 ms [23] and 1024-point FFT in 1.0-1.33 ms [3], [24], achieving throughput rates in the range of 0.1-1.0 MHz, which are much lower than that of our array processor. Also, it is comparable to 8.8 MHz of the FFT processor based on radix-4 butterfly [4] and 10 MHz of the FFT system built with 27 IC's [5].

Although the array processor described in this paper implements only 16 PE's with a 2-µm CMOS process, it will be possible to integrate more than 16 PE's into a single chip if a submicrometer technology is used. The area $A(N, \lambda)$ required for an array of N PE's can be parameterized as follows:

$$A(N, \lambda) = A_p N \lambda^2 \times 10^6. \qquad (3)$$

In (3), $\lambda$ denotes half of the minimum feature size of a given technology, and $A_p$ represents the physical layout area of a PE. The value of $A_p$ will depend on the physical realization of PE. In our design using double-metal layer, $A_p$ is 3.345 (see Table II).

Fig. 11 shows this area estimation using the $\lambda$ parameter, which is normalized by $A_p$. Based on Fig. 11, it will be possible to implement an array processor for 32 × 32 2-D FFT in an area of about 1 cm² if a submicrometer CMOS technology is used.

VII. CONCLUSIONS

An array processor dedicated to compute 16-point FFT has been implemented using a 2-µm CMOS process. To achieve an area-efficient computation of the FFT, some design considerations have been examined both in array level and in PE level. The novel features in designing the array processor are the half butterfly arithmetic concept realized with the distributed arithmetic, and the hierarchical design methodology based on bit-slice units.

The prototype implementation of this paper leads us to conclude that the 2-D mesh array architecture and algorithm presented in this paper will be one of the most attractive candidates for the real-time computation of the FFT in VLSI.

APPENDIX

Algorithm I: Data Shuffling Procedure Based on HBA

BEGIN
FOR q = 1 TO 4 DO IN PARALLEL for all i and j (i, j = 0, 1, 2, 3)
BEGIN
  IF (q ≤ 2) THEN
    D(q) = 2^(2-q)
    IF (i MOD 2^(3-q) < 2^(2-q)) THEN
      move data in PE(i, j) to PE(i + D(q), j)
    ELSE
      move data in PE(i, j) to PE(i - D(q), j)
  ELSE
    D(q) = 2^(4-q)
    IF (j MOD 2^(5-q) < 2^(4-q)) THEN
      move data in PE(i, j) to PE(i, j + D(q))
    ELSE
      move data in PE(i, j) to PE(i, j - D(q))
END
END.

Algorithm II: Determination of the Arithmetic Type of the HBA&

BEGIN
FOR q = 1 TO 4 DO IN PARALLEL for all i and j (i, j = 0, 1, 2, 3)
BEGIN
  IF (q ≤ 2) THEN
    IF (i MOD 2^(3-q) < 2^(2-q)) THEN
      PE(i, j) compute the HBA+
    ELSE
      PE(i, j) compute the HBA-


  ELSE
    IF (j MOD 2^(5-q) < 2^(4-q)) THEN
      PE(i, j) compute the HBA+
    ELSE
      PE(i, j) compute the HBA-
END
END.

REFERENCES

[1] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[2] K. L. Kloker, B. Lindsley, N. Baron, and G. R. L. Sohie, "Efficient FFT implementation on an IEEE floating-point digital signal processor," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1988, pp. 1399-1402.
[3] R. W. Linderman, P. M. Chau, W. H. Ku, and P. P. Reusens, "CUSP: A 2-µm CMOS digital signal processor," IEEE J. Solid-State Circuits, vol. SC-20, pp. 761-769, June 1985.
[4] J. O'Brien, J. Mather, and B. Holland, "A 200 MIPS single-chip 1K FFT processor," in ISSCC Dig. Tech. Papers, Feb. 1989, pp. 166-167.
[5] J. Fox, G. Surace, and P. A. Thomas, "A self-testing 2-µm CMOS chip set for FFT applications," IEEE J. Solid-State Circuits, vol. SC-22, pp. 15-19, Feb. 1987.
[6] K. Yamashita et al., "A wafer-scale 170 000-gate FFT processor with built-in test circuits," IEEE J. Solid-State Circuits, vol. 23, pp. 336-341, Apr. 1988.
[7] W. Cochran et al., "What is the fast Fourier transform?," IEEE Trans. Audio Electroacoust., vol. AU-15, pp. 45-55, 1967.
[8] H. Stone, "Parallel processing with the perfect shuffle network," IEEE Trans. Comput., vol. C-20, pp. 153-161, Dec. 1986.
[9] M. K. Lee, "Systolic array for FFT computation," in '87 Multi Project Chip (MPC'87) Development, Dept. Electronic Eng., Yonsei Univ., Seoul, Korea, Aug. 1988.
[10] K. W. Shin, "A VLSI architecture for the parallel computation of FFT," Ph.D. dissertation, Dept. Electronic Eng., Yonsei Univ., Seoul, Korea, Aug. 1990.
[11] K. W. Shin, B. Y. Choi, and M. K. Lee, "A VLSI architecture of systolic array for FFT computation," J. KITE, vol. 25, no. 9, pp. 97-106, 1988.
[12] B. Y. Choi, B. H. Kang, J. K. Lee, K. W. Shin, and M. K. Lee, "VLSI implementation of two-dimensional FFT algorithm on systolic array," in Proc. TENCON 87 IEEE Region 10 Conf., 1987, pp. 125-131.
[13] E. E. Swartzlander, W. K. W. Young, and S. J. Joseph, "A radix 4 delay commutator for fast Fourier transform processor implementation," IEEE J. Solid-State Circuits, vol. SC-19, no. 5, pp. 702-709, Oct. 1984.
[14] T. Wiley, T. S. Durrani, and R. Champinin, "An FFT systolic processor and its applications," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1984, vol. 2, pp. 4.1-4.4.
[15] L. W. Chang and M. Y. Chen, "A new systolic array for discrete Fourier transform," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 36, no. 10, pp. 1665-1666, 1988.
[16] K. W. Shin and M. K. Lee, "A VLSI architecture for parallel computation of FFT," Systolic Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1989, pp. 116-125.
[17] S. A. White, "A simple FFT butterfly arithmetic unit," IEEE Trans. Circuit Syst., vol. CAS-28, no. 4, pp. 352-355, Apr. 1981.
[18] A. L. Fisher and H. T. Kung, "Synchronizing large VLSI processor arrays," IEEE Trans. Comput., vol. C-34, no. 8, pp. 734-740, Aug. 1985.
[19] C. D. Thomson, "A complexity theory for VLSI," Ph.D. dissertation, Comput. Sci. Dept., Carnegie-Mellon Univ., Pittsburgh, PA, Aug. 1980.
[20] I. R. Mactaggart and M. A. Jack, "A single chip radix-2 FFT butterfly architecture using parallel data distributed arithmetic," IEEE J. Solid-State Circuits, vol. SC-19, no. 3, pp. 368-373, June 1984.
[21] J. L. V. Meerbergen and F. J. V. Wyk, "A 256-point discrete Fourier transform processor fabricated in 2-µm NMOS technology," IEEE J. Solid-State Circuits, vol. SC-18, no. 5, pp. 604-609, Oct. 1983.
[22] H. S. Lee, H. Mori, and H. Aiso, "Parallel processing FFT for VLSI implementation," Trans. IECE Japan, vol. E68, no. 5, pp. 284-291, May 1985.
[23] Y. Kawakami et al., "A 32b floating point CMOS digital signal processor," in ISSCC Dig. Tech. Papers, Feb. 1986, pp. 86-87.
[24] A. Kanuma et al., "A 20 MHz 32b pipelined CMOS image processor," in ISSCC Dig. Tech. Papers, Feb. 1986, pp. 102-103.

Moon-Key Lee (S'77-M'79) was born in Seoul, Korea, in 1941. He earned the B.S., M.S., and D.E. degrees in electrical engineering from Yonsei University, Seoul, Korea in 1965, 1967, and 1973, respectively. In 1980 he received the Ph.D. degree from the University of Oklahoma.
He was a Lecturer in the Department of Electrical Engineering, Yonsei University, from 1968 to 1970. From 1970 to 1976 he was with Kyunghee University, Seoul, Korea, where he held the positions of Assistant Professor, Associate Professor, and Chairman of the Electronic Engineering Department. He was Director of the IC design division at Korea Institute of Electronic Technology (ETRI, at present), Kumi, Korea, from 1980 to 1982. In 1982 he joined the faculty of Yonsei University, Seoul, Korea, where he is currently Professor and Chairman of the Electronic Engineering Department. He is a founder of the Research Institute of ASIC Design (RIAD), which was established in 1989 and located at Yonsei University. He has published five college textbooks on electronic engineering. He has authored and co-authored over 100 papers on integrated circuit design and computer-aided design. He made pioneering contributions to IC design education by performing in the Multi Project Chip research, which was participated in by eight universities and supported by the Ministry of Science and Technology of Korea from 1985 through 1988, for the first time in Korea. His current research interests include VLSI design and computer-aided design and software.
Dr. Lee was the Chairman of the Chapter Promotion and educational activities in the IEEE Korea Section. He has served as Chairman of the IEEE Circuits and Systems chapter in Korea which was formed in 1989. In 1985, he organized the International Workshop on VLSI and CAD, Seoul, Korea and served as Chairman. He was the Program Chairman of the 1989 International Conference on VLSI and CAD (ICVC), Seoul, Korea. He was the Chairman of the financial committee of TENCON 87, IEEE Region 10 Conference, Seoul, Korea, in 1987. From 1982 through 1987 he was a member of the editorial committee and an Associate Editor of the Journal of Korean Institute of Telematics and Electronics (KITE). He received the distinguished honor award from KITE in 1984. In 1986, he was awarded the 40th anniversary prize for the outstanding paper published in the journal of KITE. Since 1988 he has served on the Board of Directors of the Korean Institute of Telematics and Electronics (KITE).

Kyung-Wook Shin was born in Chongju, Korea, in 1961. He received the M.S. and Ph.D. degrees in electronic engineering from Yonsei University, Seoul, Korea, in 1986 and 1990, respectively. During his Ph.D. course, he worked on the design of a VLSI array processor for parallel computation of fast Fourier transform (FFT).
In 1990 he joined the Integrated Circuits Technology Department of the Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea. In 1991 he joined the faculty of Kum Oh Institute of Technology, Kyungbuk, Korea. His research interests include VLSI signal processing, algorithm-oriented VLSI array architectures, and digital design.

Jang-Kyu Lee was born in Seoul, Korea, in 1963. He received the B.S. degree in electronic engineering from Sogang University, Seoul, Korea, in 1986, and the M.S. degree in electronic engineering from Yonsei University, Seoul, Korea, in 1988.
In 1988 he joined the Semiconductor Division of Samsung Electronics, Inc., Kyungki Do, Korea, where he works on memory IC design. His research interests include DRAM design and application-specific memory design.
