Texas Instruments Application Report
SPRA654 - March 2000
Autoscaling Radix-4 FFT for TMS320C6000™
Yao-Ting Cheng Taiwan Semiconductor Sales & Marketing
ABSTRACT
Fixed-point digital signal processors (DSPs) have limited dynamic range to deal with digital
data. This application report proposes a scheme to test and scale the result output from each
Fast Fourier Transform (FFT) stage in order to fix the accumulation overflow. The radix-4 FFT
algorithm is selected since it provides fewer stages than the radix-2 algorithm. Thus, the scaling
operations are minimized. This application report is organized as follows:
• Basics of FFT
• Multiplication and addition overflow
• Algorithm to test bit growth and scale the result
• Implementation in C and Linear Assembly on the C6000 DSP
• List of the codes
Contents
1 FFT (Fast Fourier Transform)
2 Multiplication and Additions Overflow
3 Bit-Growth Detection and Scaling Algorithm
4 Example 1 - Main Program
5 Example 2 - Autoscaling Radix-4 FFT With C6000 C Intrinsics
6 Example 3 - Autoscaling Radix-4 FFT With C6000 Linear Assembly
7 References
List of Figures
Figure 1. Radix-2 FFT for N=8
Figure 2. Radix-4 Butterfly
TMS320C6000 is a trademark of Texas Instruments.
1 FFT (Fast Fourier Transform)
Many applications require signals to be processed in the digital domain, that is, digital signal
processing. Because a signal often must be processed according to its frequency characteristics,
it needs to be transformed. The Discrete Fourier Transform (DFT) is one way to convert a signal
from the time domain to the frequency domain. The DFT is the discrete version of the Fourier
Transform and is readily computed on a modern microprocessor. The DFT equation is listed below:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0 \text{ to } N-1, \quad \text{where } W_N = e^{-j2\pi/N}$$
Many calculations are needed: an N-point DFT takes $N^2$ complex multiplications and $N^2$
complex additions. One algorithm that dramatically reduces the number of computations is the
radix-2 FFT, which takes advantage of the periodicity of the twiddle factor $W_N$. For example,
if n=N, then
$$W_N^{n+k} = W_N^{k}$$
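To make the operation count concrete, the following is a minimal sketch of a direct 16-point DFT in Q15 integer arithmetic. It is not part of the original report: the names `DFT_N`, `sin_q15`, `twiddle_cos`, `twiddle_sin`, and `dft16_q15` are hypothetical, and the sine table simply reuses the 16-point twiddle values from Example 1. The modulo on the table index is where the periodicity of the twiddle factor is used.

```c
#include <assert.h>
#include <stdint.h>

#define DFT_N 16  /* assumed transform size for this sketch */

/* 32768*sin(2*pi*m/16), clipped to 32767 (values as in Example 1) */
static const int16_t sin_q15[DFT_N] = {
         0,  12540,  23170,  30274,  32767,  30274,  23170,  12540,
         0, -12540, -23170, -30274, -32767, -30274, -23170, -12540
};

/* Periodicity: W_N^(m+N) == W_N^m, so the table index is taken modulo N. */
static int16_t twiddle_sin(int m) { return sin_q15[m % DFT_N]; }
static int16_t twiddle_cos(int m) { return sin_q15[(m + DFT_N / 4) % DFT_N]; }

/* Direct DFT: X(k) = sum over n of x(n) * W_N^(nk), with W = cos - j*sin.
 * Returns the number of complex multiplications performed. */
long dft16_q15(const int16_t xr[DFT_N], const int16_t xi[DFT_N],
               int32_t Xr[DFT_N], int32_t Xi[DFT_N])
{
    long cmuls = 0;
    for (int k = 0; k < DFT_N; k++) {
        int64_t acc_r = 0, acc_i = 0;      /* wide accumulators, no overflow */
        for (int n = 0; n < DFT_N; n++) {
            int m = (n * k) % DFT_N;
            int32_t c = twiddle_cos(m), s = twiddle_sin(m);
            acc_r += (int64_t)xr[n] * c + (int64_t)xi[n] * s;
            acc_i += (int64_t)xi[n] * c - (int64_t)xr[n] * s;
            cmuls++;
        }
        Xr[k] = (int32_t)(acc_r / 32768);  /* back from Q15 products */
        Xi[k] = (int32_t)(acc_i / 32768);
    }
    assert(cmuls == (long)DFT_N * DFT_N);  /* N^2 complex multiplications */
    return cmuls;
}
```

For N=16 the function performs 256 complex multiplications, which is the $N^2$ cost that the FFT reduces.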
The radix-2 FFT equation is listed below:
$$X(k) = \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^k\, x\!\left(n + \frac{N}{2}\right) \right] W_N^{nk}$$
The radix-2 FFT equation simply divides the DFT into two smaller DFTs. Each of the smaller
DFTs is then further divided into smaller ones, and so on (see Figure 1). The FFT consists of
$\log_2 N$ stages, and each stage consists of N/2 butterflies. Each butterfly consists of two
additions of the input data and one multiplication by the twiddle factor.
[Signal-flow graph, Stage 1 to Stage 3: inputs x(0) through x(7) in natural order; outputs X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7)]
Figure 1. Radix-2 FFT for N=8
The other popular algorithm is the radix-4 FFT, which is even more efficient than the radix-2 FFT.
The radix-4 FFT equation is listed below:
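$$X(k) = \sum_{n=0}^{N/4-1} \left[ x(n) + (-j)^k\, x\!\left(n + \frac{N}{4}\right) + (-1)^k\, x\!\left(n + \frac{N}{2}\right) + (j)^k\, x\!\left(n + \frac{3N}{4}\right) \right] W_N^{nk}$$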
The radix-4 FFT equation essentially combines two stages of a radix-2 FFT into one, so that half
as many stages are required (see Figure 2). Since the radix-4 FFT requires fewer stages and
butterflies than the radix-2 FFT, the computation time of the FFT can be further reduced. For
example, to calculate a 16-point FFT, the radix-2 algorithm takes $\log_2 16 = 4$ stages but the
radix-4 algorithm takes only $\log_4 16 = 2$ stages. Next, we discuss the numerical issue that
arises from finite word length.
Most people use a fixed-point DSP to perform the calculation in their embedded system because
the fixed-point DSP is highly programmable and cost efficient. The drawback is that the
fixed-point DSP has limited dynamic range, which is worsened by the summation overflow
that occurs constantly in an FFT. A scheme is needed to overcome this issue.
Figure 2. Radix-4 Butterfly
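The stage counts quoted above can be checked with a few lines of C. This is only an illustrative sketch; `fft_stages` and `butterflies_per_stage` are hypothetical helper names, and the transform size is assumed to be an exact power of the radix.

```c
#include <assert.h>

/* Number of stages of a radix-r FFT on n points (n must be a power of r). */
int fft_stages(int n, int radix)
{
    int stages = 0;
    assert(n >= 1 && radix >= 2);
    while (n > 1) {
        assert(n % radix == 0);   /* n is assumed to be a power of the radix */
        n /= radix;
        stages++;
    }
    return stages;
}

/* Each stage processes all n points, radix points per butterfly. */
int butterflies_per_stage(int n, int radix)
{
    assert(radix >= 2 && n % radix == 0);
    return n / radix;
}
```

With these helpers, `fft_stages(16, 2)` gives 4 and `fft_stages(16, 4)` gives 2, so an autoscaling scheme that tests and scales once per stage runs half as often with radix-4.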
2 Multiplication and Additions Overflow
FFT is nothing but a bundle of multiplications and summations which may overflow during
multiplication and addition. This application report adopts the radix-4 algorithm developed by C.
S. Burrus and T. W. Parks to explain how to solve these two kinds of overflow on a C6000 DSP.
The radix-4 FFT C equivalent program is listed below:
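void radix4(int n, short x[], short w[])
{
    int n1, n2, ie, ia1, ia2, ia3, i0, i1, i2, i3, j, k;
    short t, r1, r2, s1, s2, co1, co2, co3, si1, si2, si3;

    n2 = n;
    ie = 1;
    for (k = n; k > 1; k >>= 2) {            // number of stages
        n1 = n2;
        n2 >>= 2;
        ia1 = 0;
        for (j = 0; j < n2; j++) {           // number of butterflies
            ia2 = ia1 + ia1;                 // per stage
            ia3 = ia2 + ia1;
            co1 = w[ia1 * 2 + 1];
            si1 = w[ia1 * 2];
            co2 = w[ia2 * 2 + 1];
            si2 = w[ia2 * 2];
            co3 = w[ia3 * 2 + 1];
            si3 = w[ia3 * 2];
            ia1 = ia1 + ie;
            for (i0 = j; i0 < n; i0 += n1) { // butterfly calculations
                i1 = i0 + n2;
                i2 = i1 + n2;
                i3 = i2 + n2;
                r1 = x[2 * i0] + x[2 * i2];
                r2 = x[2 * i0] - x[2 * i2];
                t = x[2 * i1] + x[2 * i3];
                x[2 * i0] = r1 + t;
                r1 = r1 - t;
                s1 = x[2 * i0 + 1] + x[2 * i2 + 1];
                s2 = x[2 * i0 + 1] - x[2 * i2 + 1];
                t = x[2 * i1 + 1] + x[2 * i3 + 1];
                x[2 * i0 + 1] = s1 + t;
                s1 = s1 - t;
                x[2 * i2] = (r1 * co2 + s1 * si2) >> 15;
                x[2 * i2 + 1] = (s1 * co2 - r1 * si2) >> 15;
                t = x[2 * i1 + 1] - x[2 * i3 + 1];
                r1 = r2 + t;
                r2 = r2 - t;
                t = x[2 * i1] - x[2 * i3];
                s1 = s2 - t;
                s2 = s2 + t;
                x[2 * i1] = (r1 * co1 + s1 * si1) >> 15;
                x[2 * i1 + 1] = (s1 * co1 - r1 * si1) >> 15;
                x[2 * i3] = (r2 * co3 + s2 * si3) >> 15;
                x[2 * i3 + 1] = (s2 * co3 - r2 * si3) >> 15;
            }
        }
        ie <<= 2;
    }
}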
To deal with the multiplication overflow, we need to interpret all input samples and twiddle
factors, $W_N$, as fractional numbers, because the magnitude of a fractional number times a
fractional number is always less than or equal to one. For the C6000 DSP, the largest 16-bit
positive fractional binary number is 0.111 1111 1111 1111, which maps to 32767 in the integer
domain (or 0x7FFF in hexadecimal). The smallest negative number is 1.000 0000 0000 0000, which
is -32768 as an integer (or 0x8000 in hexadecimal). The only case in which multiplication still
overflows is -1 times -1, whose result should be positive 1. However, the largest positive
number is 0.111 1111 1111 1111, which is very close to one but not exactly 1. The
C6000 DSP provides saturating multiplication instructions such as SMPY that fix this
problem.
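For readers without the target at hand, the saturating behavior described above can be modeled in portable C. The sketch below is an assumption-laden host-side model, not TI code: `smpy_q15` and `q15_mul` are hypothetical names, and the model covers only the 16-bit by 16-bit case (multiply, shift left by one, saturate the single overflowing result).

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of a saturating Q15 x Q15 -> Q31 multiply.
 * The only product that cannot be represented after the left shift
 * is (-1) * (-1), which is clamped to the largest positive Q31 value. */
int32_t smpy_q15(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;     /* |p| <= 2^30 */
    assert(p <= 0x40000000);
    if (p == 0x40000000)                     /* -32768 * -32768 */
        return INT32_MAX;                    /* saturate to 0x7FFFFFFF */
    return p * 2;                            /* same as p << 1, safe for p < 0 */
}

/* Q15 result taken from the upper half of the Q31 product.
 * Assumes an arithmetic right shift for negative values. */
int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(smpy_q15(a, b) >> 16);
}
```

In this model, 0.5 times 0.5 (`q15_mul(16384, 16384)`) yields 8192, that is 0.25, while -1 times -1 yields 32767 instead of wrapping to a negative value.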
The second overflow comes from additions. According to the algorithm listed above, up to five
additions are needed to calculate an output. For example, one of the FFT outputs is
calculated as
x[2 * i1] = r1 * co1 + s1 * si1
          = (r2 + t) * co1 + (s2 - t) * si1
          = (r2 + (x[2*i1+1] - x[2*i3+1])) * co1 +
            (s2 - (x[2*i1] - x[2*i3])) * si1.
This can contribute up to 3 bits of growth within a butterfly. The easiest fix is to scale
down the input samples by 3 bits at each stage. However, this costs a lot of dynamic range. The
other fix is to detect whether bits have grown at the output of each stage and then scale down
the result by the number of bits grown before feeding it into the next stage.
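The 3-bit figure can be illustrated numerically. The sketch below is not from the original report: `magnitude_bits` and `worst_case_output` are hypothetical helpers, the twiddle factor is assumed to sit at 45 degrees (co1 = si1 = 23170, as in the table of Example 1), and the input signs are chosen so that every term of the expression for x[2 * i1] adds constructively.

```c
#include <assert.h>

/* Number of magnitude bits needed to hold |v| (0 for v == 0). */
int magnitude_bits(int v)
{
    unsigned u = (v < 0) ? (unsigned)(-(long)v) : (unsigned)v;
    int bits = 0;
    while (u) { bits++; u >>= 1; }
    return bits;
}

/* Worst case of x[2*i1] = (r1*co1 + s1*si1) >> 15 when every real and
 * imaginary input part has magnitude m and all signs align. */
int worst_case_output(int m)
{
    const int co1 = 23170, si1 = 23170;  /* about 32768*cos(pi/4) */
    assert(m > 0 && m <= 4095);          /* keeps the products within 32 bits */
    int r2 = m - (-m);                   /* x[2*i0]   - x[2*i2]   */
    int ta = m - (-m);                   /* x[2*i1+1] - x[2*i3+1] */
    int r1 = r2 + ta;
    int s2 = m - (-m);                   /* x[2*i0+1] - x[2*i2+1] */
    int tb = (-m) - m;                   /* x[2*i1]   - x[2*i3]   */
    int s1 = s2 - tb;
    return (r1 * co1 + s1 * si1) >> 15;
}
```

For a full-scale Q12 input of 2047 (11 magnitude bits) the worst-case output is 11579, which needs 14 magnitude bits, so the growth is 3 bits.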
3 Bit-Growth Detection and Scaling Algorithm
The C6000 DSP provides the instruction NORM, which helps detect how many bits have grown after
each addition. For example, assume the content of the 32-bit register A1 is 0000 0000 0000
0000 0010 0010 1100 1111. After performing the NORM operation
NORM .L1 A1, A2
A1 stays unchanged and A2 is 17, which is the number of redundant sign bits (the 17 leading
zeros that follow the sign bit). If the content of A1 grows 3 bits to 0000 0000 0000 0001
0010 0010 0011 0011, the result of NORM is 14 because the number of redundant sign bits has
decreased by 3. Once we have the bit growth, we can scale down the
result by right-shifting the content of the register. One more issue to consider is the input
data format. Generally, the Q15 format is adopted in most systems. It has
one sign bit in the most significant bit (MSB) of the 16-bit data, such as S.XXX XXXX XXXX XXXX,
where S is the sign bit. To prevent addition overflow in the radix-4 FFT, we need three guard bits;
therefore, the data should be Q12, such as SSSS.XXXX XXXX XXXX. This is a reasonable
approach since the resolution of most analog-to-digital converters is less than or equal to
12 bits. The result returned by the NORM instruction for Q12 data is therefore 19. The algorithm
is summarized below:
Step 1: Input data should be in Q12 format to gain three guard bits. Set exp = 19,
which is the number of redundant sign bits of Q12 data.
Step 2: At the end of each butterfly calculation, test the bit growth and record the
maximum as follows:
if ((exp_temp = _norm(x[k])) < 19)
    if (exp_temp < exp) exp = exp_temp;
Step 3: At the end of each stage, test whether the FFT is in the last stage. There is no need
to scale the last output. Otherwise, test whether bits have grown. If so, scale the output
back down to Q12.
if (!last_stage) {
    if (exp < 19) {
        for (i=0; i<(2*N); i++) x[i] >>= (19-exp);
        scale += (19-exp);
        exp = 19;
    }
}
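The three steps above can be tried on a host machine with a portable stand-in for the NORM instruction. This is a hedged sketch rather than the report's implementation: `norm32` and `autoscale_q12` are hypothetical names, `norm32` only models the redundant-sign-bit count described in the text, and the scaling step assumes an arithmetic right shift for negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Model of NORM: number of redundant sign bits of a 32-bit value. */
int norm32(int32_t v)
{
    uint32_t u = (uint32_t)v;
    int n = 0;
    if (v < 0) u = ~u;                 /* turn leading ones into zeros */
    for (int bit = 30; bit >= 0 && !((u >> bit) & 1u); bit--)
        n++;                           /* count sign copies below bit 31 */
    return n;
}

/* Steps 1 to 3 for one stage: find the smallest NORM result over the
 * stage output and shift the whole block back to Q12.
 * Returns the number of bits shifted (0 if no growth was detected). */
int autoscale_q12(int16_t x[], int len)
{
    int exp = 19;                      /* NORM result for Q12 data */
    assert(len > 0);
    for (int k = 0; k < len; k++) {
        int exp_temp = norm32(x[k]);
        if (exp_temp < exp) exp = exp_temp;
    }
    int shift = 19 - exp;
    if (shift > 0)
        for (int i = 0; i < len; i++)
            x[i] = (int16_t)(x[i] >> shift);
    return shift;
}
```

`norm32(0x000022CF)` returns 17 and `norm32(0x00012233)` returns 14, matching the two register examples in the text. The caller would add the returned shift to a running `scale` so the true magnitude of the final spectrum can be recovered.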
Example programs are listed below. Example 1 is the main program that provides the input
samples and the twiddle factors for 16-point FFT. Example 2 is the autoscaling radix-4 FFT
implemented in C with C6000 intrinsics. Example 3 is the FFT subroutine implemented with
C6000 linear assembly.
4 Example 1 - Main Program
#define Q12_SCALE 8
extern int r4_fft(short, short*, short*);

short x[32]={ 0,               0,   // input samples
              4617/Q12_SCALE,  0,   // Scale the data from Q15 to Q12
              9118/Q12_SCALE,  0,
              13389/Q12_SCALE, 0,
              17324/Q12_SCALE, 0,
              20825/Q12_SCALE, 0,
              23804/Q12_SCALE, 0,
              26187/Q12_SCALE, 0,
              27914/Q12_SCALE, 0,
              28941/Q12_SCALE, 0,
              29242/Q12_SCALE, 0,
              28811/Q12_SCALE, 0,
              27658/Q12_SCALE, 0,
              25811/Q12_SCALE, 0,
              23318/Q12_SCALE, 0,
              20241/Q12_SCALE, 0};

// Twiddle Factors
// 32768*sin(2PI*n/N), 32768*cos(2PI*n/N)
short w[32]={      0,  32767,
               12540,  30274,
               23170,  23170,
               30274,  12540,
               32767,      0,
               30274, -12540,
               23170, -23170,
               12540, -30274,
                   0, -32767,
              -12540, -30274,
              -23170, -23170,
              -30274, -12540,
              -32767,      0,
              -30274,  12540,
              -23170,  23170,
              -12540,  30274};

short index[16]={ 0, 4,  8, 12,     // index for 16-points digit reverse
                  1, 5,  9, 13,
                  2, 6, 10, 14,
                  3, 7, 11, 15};

short y[32];                        // outputs

int main(void)
{
    int n=16;
    int i;
    int scale;

    scale = r4_fft(n,x,w);
    for(i=0; i<n; i++) {
        y[2*i] = x[index[i]*2];
        y[2*i+1] = x[index[i]*2+1];
    }
    return 0;
}
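The `index` table in the main program is the base-4 digit reversal of 0 through 15. As an illustration only (the helper name `digit_reverse4` is hypothetical and not part of the report), the table can be generated instead of typed in, assuming the transform size is a power of 4:

```c
#include <assert.h>

/* Reverse the base-4 digits of i for an n-point radix-4 FFT
 * (n is assumed to be a power of 4). */
int digit_reverse4(int i, int n)
{
    int r = 0;
    assert(n >= 1 && i >= 0 && i < n);
    for (; n > 1; n >>= 2) {
        r = (r << 2) | (i & 3);   /* move the lowest digit of i into r */
        i >>= 2;
    }
    return r;
}
```

For n=16, calling `digit_reverse4(i, 16)` for i = 0 to 15 reproduces 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15, the same order as the hand-written table.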
5 Example 2 - Autoscaling Radix-4 FFT With C6000 C Intrinsics
int r4_fft(short n, int x[], const int w[])
{
    int n1, n2, ie, ia1, ia2, ia3, i0, i1, i2, i3, j, k;
    int t0, t1, t2;
    int xtmph, xtmpl;
    int shift, exp=19, scale=0;

    n2 = n;
    ie = 1;
    for (k=n; k>1; k>>=2) {
        n1 = n2;
        n2 >>= 2;
        ia1 = 0;
        for (j=0; j<n2; j++) {
            ia2 = ia1 + ia1;
            ia3 = ia1 + ia2;
            for (i0=j; i0<n; i0+=n1) {
                i1 = i0 + n2;
                i2 = i1 + n2;
                i3 = i2 + n2;
                t0 = _add2(x[i1],x[i3]);
                t1 = _add2(x[i0],x[i2]);
                t2 = _sub2(x[i0],x[i2]);
                x[i0] = _add2(t0,t1);
                t1 = _sub2(t1,t0);
                xtmph = (_smpyh(t1,w[ia2]) - _smpy(t1,w[ia2])) & 0xffff0000;
                xtmpl = ((_smpylh(t1,w[ia2]) + _smpyhl(t1,w[ia2])) >> 16) &
                        0x0000ffff;
                x[i2] = xtmph | xtmpl;
                t0 = _sub2(x[i1],x[i3]);
                t1 = -(t0 << 16);
                t0 = t1 | ((t0 >> 16) & 0x0000ffff);
                t1 = _add2(t2,t0);
                t2 = _sub2(t2,t0);
                xtmph = (_smpyh(t1,w[ia1]) - _smpy(t1,w[ia1])) & 0xffff0000;
                xtmpl = ((_smpylh(t1,w[ia1]) + _smpyhl(t1,w[ia1])) >> 16) &
                        0x0000ffff;
                x[i1] = xtmph | xtmpl;
                xtmph = (_smpyh(t2,w[ia3]) - _smpy(t2,w[ia3])) & 0xffff0000;
                xtmpl = ((_smpylh(t2,w[ia3]) + _smpyhl(t2,w[ia3])) >> 16) &
                        0x0000ffff;
                x[i3] = xtmph | xtmpl;
            }
            ia1 = ia1 + ie;
        }
        if (k > 4) {                          // not the last stage
            ie <<= 2;
            j = 0;
            while ((exp > 16) && (j < n)) {   // test bit growth
                xtmph = _norm(x[j] >> 16);
                xtmpl = _norm(x[j] << 16 >> 16);
                if (xtmph < exp) exp = xtmph;
                if (xtmpl < exp) exp = xtmpl;
                j++;
            }
            if (exp < 19) {                   // scale back to Q12
                shift = 19 - exp;
                exp = 19;
                scale += shift;
                _nassert(n >= 16);
                for (j=0; j<n; j++) {
                    xtmph = (x[j] >> shift) & 0xffff0000;
                    xtmpl = ((x[j] << 16) >> (16+shift)) & 0x0000ffff;
                    x[j] = xtmph | xtmpl;
                }
            }
        }
    }
    return scale;
}
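The listing above relies on the packed 16-bit intrinsics `_add2` and `_sub2`, which operate on the upper and lower halves of a 32-bit word independently. The following host-side model is only a sketch for experimenting off-target; `pack2`, `hi16`, `lo16`, `add2`, `sub2`, and `add2_selfcheck` are hypothetical names, and the conversions to `int16_t` assume the usual two's complement wraparound.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word and unpack them again. */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}
int16_t hi16(uint32_t v) { return (int16_t)(uint16_t)(v >> 16); }
int16_t lo16(uint32_t v) { return (int16_t)(uint16_t)(v & 0xffffu); }

/* Model of a packed add: each half is added separately and a carry
 * out of the lower half never reaches the upper half. */
uint32_t add2(uint32_t a, uint32_t b)
{
    uint32_t hi = (a & 0xffff0000u) + (b & 0xffff0000u);
    uint32_t lo = (a + b) & 0x0000ffffu;
    return (hi & 0xffff0000u) | lo;
}

/* Model of a packed subtract, with the same isolation between halves. */
uint32_t sub2(uint32_t a, uint32_t b)
{
    uint32_t hi = (a & 0xffff0000u) - (b & 0xffff0000u);
    uint32_t lo = (a - b) & 0x0000ffffu;
    return (hi & 0xffff0000u) | lo;
}

/* Small self-check: -1 + 1 in the lower half must not disturb the upper half. */
void add2_selfcheck(void)
{
    uint32_t s = add2(pack2(1, -1), pack2(2, 1));
    assert(hi16(s) == 3 && lo16(s) == 0);
}
```

Because one complex sample fits in one 32-bit word, a single packed add or subtract handles the real and imaginary parts together, which is why the butterfly in Example 2 needs so few additions.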
6 Example 3 - Autoscaling Radix-4 FFT With C6000 Linear Assembly
        .title  "r4_fft.sa"
        .def    _r4_fft
        .text
_r4_fft: .cproc n, p_x, p_w
        .reg    n1, n2, ie, ia1, ia2, ia3, i0, i1, i2, i3, j, k
        .reg    t0, t1, t2, w, x0, x1, x2, x3
        .reg    tmp, mskh, xtmph, xtmpl
        .reg    exp, scale
        add     n, 0, n2
        mvk     1, ie
        zero    mskh
        mvkh    0xffff0000, mskh
        zero    scale
        add     n, 0, k
stage_loop:
        add     n2, 0, n1
        shr     n2, 2, n2
        zero    ia1
        zero    j
group_loop:
        add     ia1, ia1, ia2
        add     ia2, ia1, ia3
        add     j, 0, i0
butterfly_loop:
        add     i0, n2, i1
        add     i1, n2, i2
        add     i2, n2, i3
        ldw     *+p_x[i0], x0
        ldw     *+p_x[i1], x1
        ldw     *+p_x[i2], x2
        ldw     *+p_x[i3], x3
        add2    x1, x3, t0
        add2    x0, x2, t1
        sub2    x0, x2, t2
        add2    t0, t1, x0              ; x0
        sub2    t1, t0, t1
        ldw     *+p_w[ia2], w           ; load twiddle factor w2
        smpyh   t1, w, tmp
        smpy    t1, w, xtmph
        sub     tmp, xtmph, xtmph
        and     xtmph, mskh, xtmph
        smpylh  t1, w, tmp
        smpyhl  t1, w, xtmpl
        add     tmp, xtmpl, xtmpl
        shru    xtmpl, 16, xtmpl
        or      xtmph, xtmpl, x2        ; x2
        sub2    x1, x3, t0
        shl     t0, 16, t1
        neg     t1, t1
        extu    t0, 0, 16, t0
        or      t1, t0, t0
        add2    t2, t0, t1
        sub2    t2, t0, t2
        ldw     *+p_w[ia1], w           ; load twiddle factor w1
        smpyh   t1, w, tmp
        smpy    t1, w, xtmph
        sub     tmp, xtmph, xtmph
        and     xtmph, mskh, xtmph
        smpylh  t1, w, tmp
        smpyhl  t1, w, xtmpl
        add     tmp, xtmpl, xtmpl
        shru    xtmpl, 16, xtmpl
        or      xtmph, xtmpl, x1        ; x1
        ldw     *+p_w[ia3], w           ; load twiddle factor w3
        smpyh   t2, w, tmp
        smpy    t2, w, xtmph
        sub     tmp, xtmph, xtmph
        and     xtmph, mskh, xtmph
        smpylh  t2, w, tmp
        smpyhl  t2, w, xtmpl
        add     tmp, xtmpl, xtmpl
        shru    xtmpl, 16, xtmpl
        or      xtmph, xtmpl, x3        ; x3
        stw     x0, *+p_x[i0]
        stw     x1, *+p_x[i1]
        stw     x2, *+p_x[i2]
        stw     x3, *+p_x[i3]
        add     i0, n1, i0
        cmplt   i0, n, tmp
 [tmp]  b       butterfly_loop          ; branch to butterfly loop
        add     ia1, ie, ia1
        add     j, 1, j
        cmplt   j, n2, tmp
 [tmp]  b       group_loop              ; branch to group loop
        cmpeq   k, 4, tmp               ; test if last stage
 [tmp]  b       end                     ; if true, branch to end
        mvk     2, exp                  ; initialize exponent
        zero    j                       ; initialize index
        mvkl    0x0000ffff, t2          ; mask for masking xtmpl
        mvkh    0x0000ffff, t2
test_bit_growth: .trip 16
        ldw     *+p_x[j], tmp
        norm    tmp, xtmph              ; test for redundant sign bit of HI half
        shl     tmp, 16, xtmpl
        norm    xtmpl, xtmpl            ; test for redundant sign bit of LO half
        cmplt   xtmph, exp, tmp         ; test if bit grow
 [tmp]  add     xtmph, 0, exp
        cmplt   xtmpl, exp, tmp         ; test if bit grow
 [tmp]  add     xtmpl, 0, exp
        cmpgt   exp, 2, tmp             ; if exp>2 then no scaling
 [tmp]  b       no_scale
        cmpeq   exp, 0, tmp             ; compare if bit grow 3 bits
 [tmp]  sub     3, exp, t0              ; calculate shift
 [tmp]  mvk     0x0213, t1              ; csta & cstb to ext xtmpl
 [tmp]  add     scale, t0, scale        ; accumulate scale
 [tmp]  b       scaling
        cmpeq   exp, 1, tmp             ; compare if bit grow 2 bits
 [tmp]  sub     3, exp, t0
 [tmp]  mvk     0x0212, t1              ; csta & cstb to ext xtmpl
 [tmp]  add     scale, t0, scale        ; accumulate scale
 [tmp]  b       scaling
        sub     3, exp, t0              ; grows 1 bit
        mvk     0x0211, t1              ; csta & cstb to ext xtmpl
        add     scale, t0, scale        ; accumulate scale
        b       scaling
no_scale:
        add     j, 1, j
        cmplt   j, n, tmp               ; compare if test all output
 [tmp]  b       test_bit_growth         ; if not, test next output
        b       next_stage              ; else go to next stage
scaling:
        zero    j
scaling_loop: .trip 16
        ldw     *+p_x[j], tmp
        shr     tmp, t0, xtmph          ; scaling HI half
        and     xtmph, mskh, xtmph      ; mask HI half
        ext     tmp, t1, xtmpl          ; scaling LO half
        and     xtmpl, t2, xtmpl        ; mask LO half by 0x0000ffff
        or      xtmph, xtmpl, tmp       ; x[j] = (xtmph | xtmpl)
        stw     tmp, *+p_x[j]
        add     j, 1, j
        cmplt   j, n, tmp
 [tmp]  b       scaling_loop
next_stage:
        shl     ie, 2, ie
        shr     k, 2, k
        b       stage_loop              ; end of stage loop
end:
        .return scale
        .endproc
7 References
1. C. S. Burrus and T. W. Parks, DFT/FFT and Convolution Algorithms and Implementation, John
Wiley & Sons, New York, 1985.
Copyright © 2000, Texas Instruments Incorporated