Programmable Digital Signal Processors: Architecture, Programming, and Applications. Edited by Yu Hen Hu (Signal Processing and Communications 13).
1 INTRODUCTION
2 VLIW ARCHITECTURE
y = a1 x1 + a2 x2 + a3 x3
cycle 1:  load a1
cycle 2:  load x1
cycle 3:  load a2
cycle 4:  load x2
cycle 5:  multiply z1, a1, x1
cycle 6:  multiply z2, a2, x2
cycle 7:  add y, z1, z2
cycle 8:  load a3
cycle 9:  load x3
cycle 10: multiply z1, a3, x3
cycle 11: add y, y, z1
y = Σ_{i=0}^{7} c_i x_i    (1)

y = Σ_{i=0}^{7} |c_i − x_i|    (2)
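In scalar C, these two kernels look as follows (a minimal reference sketch; partitioned instructions evaluate several of these loop iterations at once):

    #include <stdlib.h>  /* abs() */

    /* Eq. (1): 8-tap inner product. */
    int inner_product8(const short c[8], const short x[8])
    {
        int y = 0;
        for (int i = 0; i < 8; i++)
            y += c[i] * x[i];
        return y;
    }

    /* Eq. (2): 8-point sum of absolute differences. */
    int sad8(const short c[8], const short x[8])
    {
        int y = 0;
        for (int i = 0; i < 8; i++)
            y += abs(c[i] - x[i]);
        return y;
    }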
Every VLIW processor tries to utilize both instruction-level and data-level parallelism. They distinguish themselves in the number of banks and the amount of on-chip memory.
if (X > Y)
    S = S + X;
else
    S = S + Y;
Due to the idle instruction slots, it takes either 7 or 10 cycles per data point (depending on the path taken) because we cannot use instruction-level and data-level parallelism effectively. Thus, to overcome this if/then/else barrier in VLIW processors, two methods can be used:
• Use predicated instructions: Most of the instructions can be predicated. A predicated instruction has an additional operand that determines whether or not the instruction should be executed. These conditions are stored either in a separate set of 1-bit registers, called predicate registers, or in regular 32-bit registers. An example using a predicate register to handle the if/then/else statement is shown in Figure 14b. This method requires only 5 cycles (compared to 7 or 10) to execute the same code segment. A disadvantage of this approach is that only one data point is processed at a time; thus it cannot utilize data-level parallelism.
• Use select instruction: select, along with compare, can be utilized to handle the if/then/else statement efficiently. compare, as illustrated in Figure 14c, compares each pair of subwords in two partitioned source registers and stores the result of the test (i.e., TRUE or FALSE) in the respective subword of another partitioned destination register. This partitioned register can then be used as a mask register, as the scalar sketch following this list illustrates.
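A minimal scalar sketch of the mask-based select idiom (plain C for illustration; a real select instruction applies the same choice independently to every subword pair):

    /* Branchless equivalent of: if (X > Y) S += X; else S += Y; */
    int select_accumulate(int S, int X, int Y)
    {
        int mask = -(X > Y);                    /* all ones if X > Y, else all zeros */
        return S + ((X & mask) | (Y & ~mask));  /* pick X or Y through the mask      */
    }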
in Figure 17. The guide table is either given before the program starts (off-line) or generated in an earlier stage of processing. The guided transfer is set up by specifying base address, data size, count, and guide table pointer. data size is the number of bytes that will be accessed for each guide-table entry, and the guide table is pointed to by guide table pointer.
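Viewed from C, these setup parameters might be grouped as in the following sketch (the struct and its field names are hypothetical, not the actual MAP1000 register layout):

    /* Illustrative descriptor for a guided transfer; names are hypothetical. */
    struct guided_transfer {
        void         *base_address;        /* start of the data region             */
        unsigned int  data_size;           /* bytes accessed per guide-table entry */
        unsigned int  count;               /* number of guide-table entries        */
        const int    *guide_table_pointer; /* points to the guide table            */
    };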
For VLIW processors, the scheduling of all instructions is the responsibility of the programmer and/or compiler. Thus, assembly language programmers must understand the underlying architecture intimately to obtain high performance for a given algorithm and/or application. Smart compilers that ease this programming burden are therefore very important for VLIW processors. Tightly coupled with the advancement of compiler technologies, there have been many useful programming techniques, as discussed in Section 4, along with the use of C intrinsics (Faraboschi et al., 1998). C intrinsics can be a good compromise between performance and programming productivity. A C intrinsic is a special C language extension that looks like a function call but directs the compiler to use a certain assembly language instruction. In programming the TMS320C62, for example, the int _add2(int, int) C intrinsic would generate an ADD2 assembly instruction (two 16-bit partitioned additions) using two 32-bit integer arguments.
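As an illustration, the following sketch (assuming a TI C6000-style toolchain, in which _add2 is built into the compiler) lets the compiler emit ADD2 without writing assembly:

    /* Add 2*n 16-bit samples packed two per 32-bit word, via the ADD2 intrinsic. */
    void vec_add16(const int *restrict a, const int *restrict b,
                   int *restrict y, int n_words)
    {
        for (int i = 0; i < n_words; i++)
            y[i] = _add2(a[i], b[i]);  /* two 16-bit partitioned additions */
    }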
5.1 2D Convolution
Convolution plays a central role in many image processing and digital signal
processing applications. In convolution, each output pixel is computed to be a
weighted average of several neighboring input pixels. In the simplest form, gener-
alized 2D convolution of an N × N input image with an M × M convolution
kernel is defined as
b(x, y) = (1/s) Σ_{i=x}^{x+M−1} Σ_{j=y}^{y+M−1} f(i, j) h(x − i, y − j)    (3)
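A direct scalar rendering of Eq. (3) is sketched below (illustrative constants for N and M, image borders skipped, no output clamping, and the kernel assumed stored so it can be indexed with nonnegative offsets):

    #define N 512   /* image size  (illustrative) */
    #define M 8     /* kernel size (illustrative) */

    void conv2d(const unsigned char f[N][N], const short h[M][M],
                unsigned char b[N][N], int s)
    {
        for (int y = 0; y + M <= N; y++)
            for (int x = 0; x + M <= N; x++) {
                int acc = 0;
                for (int i = 0; i < M; i++)         /* accumulate M x M products */
                    for (int j = 0; j < M; j++)
                        acc += f[x + i][y + j] * h[i][j];
                b[x][y] = (unsigned char)(acc / s); /* scale by s, as in Eq. (3) */
            }
    }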
If the kernel width is greater than 8, then the kernel can be subdivided into
several sections and the inner loop is iterated multiple times, while accumulating
the multiplication results.
The MAP1000 has an advanced inner-product instruction, called srshinprod.pu8.ps16, as shown in Figure 18. It can multiply eight 16-bit kernel coefficients (in the partitioned local constant [PLC] register) by eight 8-bit input pixels (in the partitioned local variable [PLV] register) and sum up the multiplication results. This instruction can also shift a new pixel into the 128-bit PLV register. x0 through x23 represent sequential input pixels, and c0 through c7 represent kernel
5.2 FFT
The fast Fourier transform (FFT) has made the computation of discrete Fourier
transform (DFT) feasible, which is an essential function in a wide range of areas
that employ spectral analysis, frequency-domain processing of signals, and image
reconstruction. Figure 19 illustrates the flowgraph for a 1D 8-point FFT. Figure 20a shows the butterfly computation, and Figure 20b shows the detailed operations within a single butterfly. Every butterfly requires a total of 20 basic operations.
X[k, l] = Σ_{n=0}^{N−1} ( Σ_{m=0}^{N−1} x(n, m) W_N^{lm} ) W_N^{kn}    (4)

where x is the input image, W_N denotes the twiddle factors, and X is the FFT output.
This method requires 4N^2 log_2 N real multiplications and 6N^2 log_2 N real additions/subtractions (Dudgeon and Mersereau, 1984), which is 33% more multiplications and 9.1% more additions/subtractions than the direct 2D approach.
However, this separable 2D FFT algorithm has been popular because all of the
data for the rowwise or columnwise 1D FFT being computed can be easily stored
in the on-chip memory or cache. The intermediate image is transposed after the
rowwise 1D FFTs so that another set of rowwise 1D FFTs can be performed.
This is to reduce the number of SDRAM row misses, which otherwise (i.e., if
1D FFTs are performed on columns of the intermediate image) will occur many
times. One more transposition is performed before storing the final result.
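The flow just described can be sketched as follows (fft_1d and transpose are assumed helper routines, not a particular library API):

    #include <complex.h>

    #define N 256  /* illustrative transform size */

    void fft_1d(float complex row[N]);                            /* assumed helper */
    void transpose(float complex a[N][N], float complex b[N][N]); /* assumed helper */

    void fft_2d(float complex img[N][N], float complex tmp[N][N])
    {
        for (int r = 0; r < N; r++) fft_1d(img[r]); /* rowwise 1D FFTs               */
        transpose(img, tmp);                        /* avoid SDRAM row misses        */
        for (int r = 0; r < N; r++) fft_1d(tmp[r]); /* "column" FFTs, now along rows */
        transpose(tmp, img);                        /* restore final orientation     */
    }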
six and four, respectively, because both the real and imaginary parts (16 bits
each) could be loaded or stored with one 32-bit load/store operation.
The software pipelined implementation of the butterfly on the TMS320C80
is shown in Table 2, where the 32-bit add/subtract unit is divided into 2 units
to perform two 16-bit additions/subtractions in parallel. In Table 2, operations
having the same background are part of the same butterfly, whereas operations
within heavy black borders are part of the same tight-loop iteration. With this
implementation, the number of cycles to compute each butterfly per ADSP is
only six cycles (Cycles 5 through 10). All of the add/subtract and load/store
operations are performed in parallel with these six multiply operations. Because
there are 4 ADSPs working in parallel, the ideal number of cycles per 2-point
butterfly is 1.5.
Cycle   Multiply unit   Add/subtract units   Load/store units
  1                                          L1  L2
  2                     A3  A4               L3
  3     M1
  4     M2
  5     M3              A1  A2               L1  L2
  6     M4              A3  A4               L3
  7     M5              A5  A6
  8     M6
  9     M1                                   S1  S2
 10     M2
 11     M3              A1  A2
 12     M4
 13     M5              A5  A6
 14     M6
 15                                          S1  S2
x_ij = Σ_{k=0}^{7} Σ_{l=0}^{7} (c(k)/2)(c(l)/2) F_kl cos[(2i+1)kπ/16] cos[(2j+1)lπ/16]    (5)

c(k) = 1/√2 for k = 0; c(k) = 1 otherwise
c(l) = 1/√2 for l = 0; c(l) = 1 otherwise
where F is the input data, c(⋅) are the scaling terms, and x is the IDCT result. It
can be computed in a separable fashion by using 1D eight-point IDCTs. First,
rowwise eight-point IDCTs are performed on all eight rows, followed by columnwise eight-point IDCTs on all eight columns of the row IDCT result. Instead
of performing columnwise IDCTs, the intermediate data after the computation
of rowwise IDCTs are transposed so that another set of rowwise IDCTs can be
performed. The final result is transposed once more before the results are stored.
Because the implementation of DCT is similar to that of IDCT, only the IDCT
implementation is discussed here.
1 through 7. When implemented with a basic instruction set, the CIDCT algorithm
requires 16 multiplications and 26 additions. Thus, including 16 loads and 8 store
operations, 66 operations are necessary. Table 3 shows the CIDCT algorithm
implemented on the TMS320C80, where operations belonging to different 1D
eight-point IDCTs (similar to FFT) are overlapped to utilize software pipelining.
Variables with a single prime and double primes (such as F′ and F″) are the
intermediate results. In Table 3, we are performing 32-bit additions/subtractions
on the intermediate data because we are allocating 32 bits for the multiplications
of two 16-bit operands to reduce quantization errors. Another description of im-
plementing CIDCT can be found in (Lee, 1997), where 16 bits are utilized for
representing multiplication results, thus performing 16-bit additions/subtractions
on the intermediate data. The coefficients need to be reloaded because of lack
of registers. Because there are 4 ADSPs, the ideal number of cycles per 8-point
IDCT in our implementation is 6.5. Thus, it takes 104 cycles to compute one
8 × 8 2D IDCT.
Cycle   Multiply unit     Add/subtract unit   Load/store unit #1   Load/store unit #2
  1     F1″ = F1 * c1     p0 = P0 + P1        store f4             store f5
  2     F7″ = F7 * c7     p1 = P0 − P1        load c3              load c5
  3     F5′ = F5 * c3     Q1 = F1′ − F7′
  4     F3′ = F3 * c5     S1 = F1″ − F7″
  5     F5″ = F5 * c5     Q0 = F5′ − F3′
  6     F3″ = F3 * c3     q1 = Q1 + Q0        load c2              load F2
  7     F2′ = F2 * c2     S0 = F5″ + F3″      load c6              load F6
  8     F6′ = F6 * c6     q0 = Q1 − Q0
  9     F2″ = F2 * c6     r0 = F2′ + F6′
 10     F6″ = F6 * c2     s0 = S1 − S0        load c4
 11     q0′ = q0 * c4     s1 = S1 + S0
 12     s0′ = s0 * c4     r1 = F2″ − F6″
 13                       g0 = p0 + r0
 14                       h0 = p0 − r0
 15                       g1 = p1 + r1
 16                       h1 = p1 − r1
 17                       g3 = s0′ − q0′
 18                       h3 = s0′ + q0′
 19                       f0 = g0 + s1        load c4
 20                       f7 = g0 − s1        store f0             load F0
 21                       f1 = g1 + h3        store f7             load F4
 22                       f6 = g1 − h3        store f1             load F1
 23     P0 = F0 * c4      f2 = h1 + g3        store f6             load F7
 24     P1 = F4 * c4      f5 = h1 − g3        load c7              store f3
 25     F1′ = F1 * c7     f3 = h0 + q1        load c1              load F5
 26     F7′ = F7 * c1     f4 = h0 − q1        store f2             load F3
[f0]   [A00 A10 A20 A30 A40 A50 A60 A70]   [F0]
[f1]   [A01 A11 A21 A31 A41 A51 A61 A71]   [F1]
[f2]   [A02 A12 A22 A32 A42 A52 A62 A72]   [F2]
[f3] = [A03 A13 A23 A33 A43 A53 A63 A73] × [F3]
[f4]   [A04 A14 A24 A34 A44 A54 A64 A74]   [F4]
[f5]   [A05 A15 A25 A35 A45 A55 A65 A75]   [F5]
[f6]   [A06 A16 A26 A36 A46 A56 A66 A76]   [F6]
[f7]   [A07 A17 A27 A37 A47 A57 A67 A77]   [F7]

that is,

f_x = Σ_{u=0}^{7} A_ux F_u
This instruction utilizes 128-bit PLC and PLV registers. The setplc.128
instruction is necessary to set the 128-bit PLC register with IDCT coefficients.
Because the accumulations of srshinprod.ps16 are performed in 32 bits, to output 16-bit IDCT results the MAP1000 has an instruction called compress2_ps32_rs15 that compresses four 32-bit operands into four 16-bit operands. This instruction also performs a 15-bit right-shift on each 32-bit operand before compressing. Because there is a large number of registers on the MAP1000 (64 32-bit registers per cluster), once the data are loaded into the registers for computing the first 1D 8-point IDCT, they can be retained for computing the subsequent IDCTs, thus eliminating the need for multiple load
operations. Ideally, an 8-point IDCT can be computed in 11 cycles (8 inner-product, 2 compress2_ps32_rs15, 1 setplc.128), and a 2D 8 × 8 IDCT can be computed in 2 × 11 × 8 = 176 cycles. Because there are two clusters, 88 cycles are necessary to compute a 2D 8 × 8 IDCT.
There are two sources of quantization error in IDCT when computed with
a finite number of bits, as discussed in Section 4.3. The quantized multiplication
coefficients are the first source, whereas the second one arises from the need to
have the same number of fractional bits as the input after multiplication. Thus,
to control the quantization error on different decoder implementations, the MPEG
standard specifies that the IDCT implementation used in the MPEG decoder must
comply with the accuracy requirement of IEEE Standard 1180–1990. Simulations have shown that by utilizing 4 bits for representing the fractional part and 12 integer bits, overflow can be avoided while meeting the accuracy requirements (Lee, 1997). MPEG standards also specify that the output x_ij in Eq. (5) must be clamped to 9 bits (−256 to 255). Thus, to meet these MPEG standard
requirements, some preprocessing of the input data and postprocessing of the
IDCT results are necessary.
x_i = a_11 x_o + a_12 y_o + a_13,   0 ≤ x_o < image_width
y_i = a_21 x_o + a_22 y_o + a_23,   0 ≤ y_o < image_height    (6)
For each discrete pixel in the output image, an inverse transformation with
Eq. (6) results in a nondiscrete subpixel location within the input image, from
which the output pixel value is computed. In order to determine the gray-level
output value at this nondiscrete location, some form of interpolation (e.g., bilin-
ear) has to be performed with the pixels around the mapped location.
The main steps in affine warp are (1) geometric transformation, (2) address calculation and coefficient generation, (3) source pixel transfer, and (4) 2 × 2 bilinear interpolation. While discussing each step, the details of how affine warp can be mapped to the TMS320C80 and MAP1000 are also presented; a scalar sketch of the overall inverse-mapping loop follows.
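The sketch below combines the inverse mapping of Eq. (6) with 2 × 2 bilinear interpolation (illustrative only: the coefficient array a[6] holds a11, a12, a13, a21, a22, a23, and out-of-range pixels are simply zeroed):

    void affine_warp(const unsigned char *in, int iw, int ih,
                     unsigned char *out, int ow, int oh, const float a[6])
    {
        for (int yo = 0; yo < oh; yo++)
            for (int xo = 0; xo < ow; xo++) {
                /* Inverse mapping, Eq. (6): output pixel -> input subpixel */
                float xi = a[0] * xo + a[1] * yo + a[2];
                float yi = a[3] * xo + a[4] * yo + a[5];
                if (xi < 0 || yi < 0 || xi >= iw - 1 || yi >= ih - 1) {
                    out[yo * ow + xo] = 0;            /* mapped outside the input */
                    continue;
                }
                int x0 = (int)xi, y0 = (int)yi;       /* top-left neighbor  */
                float dx = xi - x0, dy = yi - y0;     /* subpixel offsets   */
                const unsigned char *p = in + y0 * iw + x0;
                float v = (1 - dx) * (1 - dy) * p[0] + dx * (1 - dy) * p[1]
                        + (1 - dx) * dy * p[iw]      + dx * dy * p[iw + 1];
                out[yo * ow + xo] = (unsigned char)(v + 0.5f);
            }
    }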
6 SUMMARY
To meet the growing computational demand arising from digital media at an affordable cost, new advanced digital signal processor architectures based on VLIW have been emerging. These processors achieve high performance by utilizing both instruction-level and data-level parallelism. Even with such a flexible and powerful architecture, achieving good performance necessitates the careful design of algorithms that can make good use of the newly available parallelism.
In this chapter, various algorithm mapping techniques with real examples on
modern VLIW processors have been presented, which can be utilized to imple-
ment a variety of algorithms and applications on current and future DSP proces-
sors for optimal performance.
Basoglu C, W Lee, Y Kim. An efficient FFT algorithm for superscalar and VLIW proces-
sor architectures. Real-Time Imaging 3:441–453, 1997.
Basoglu C, RJ Gove, K Kojima, J O’Donnell. A single-chip processor for media applica-
tions: The MAP1000. Int J Imaging Syst Technol 10:96–106, 1999.
Basoglu C, D Kim, RJ Gove, Y Kim. High-performance image computing with modern
microprocessors. Int J Imaging Syst Technol 9:407–415, 1998.
Berkeley Design Technology (BDT). DSP processor fundamentals, http://www.bdti.com/products/dsppf.htm, 1996.
Boland K, A Dollas. Predicting and precluding problems with memory latency. IEEE
Micro 14(4):59–67, 1994.
Chamberlain A. Efficient software implementation of affine and perspective image
warping on a VLIW processor. MSEE thesis, University of Washington, Seattle,
1997.
Chen WH, CH Smith, SC Fralick. A fast computational algorithm for the discrete cosine transform. IEEE Trans Commun 25:1004–1009, 1977.
Dudgeon DE, RM Mersereau. Multidimensional Digital Signal Processing. Englewood
Cliffs, NJ: Prentice-Hall, 1984.
Equator Technologies. MAP-CA processor, http://www.equator.com, 2000.
Evans O, Y Kim. Efficient implementation of image warping on a multimedia processor.
Real-Time Imaging 4:417–428, 1998.
Faraboschi P, G Desoli, JA Fisher. The latest world in digital media processing. IEEE
Signal Processing Mag 15:59–85, March 1998.
Fisher JA. The VLIW machine: A multiprocessor from compiling scientific code. Com-
puter 17:45–53, July 1984.
Fujitsu Limited. FR500, http://www.fujitsu.co.jp, 1999.
Greppert L, TS Perry. Transmeta’s magic show. IEEE Spectrum 37(5):26–33, 2000.
Guttag K, RJ Gove, JR VanAken. A single chip multiprocessor for multimedia: The MVP.
IEEE Computer Graphics Applic 12(6):53–64, 1992.
Kim D, RA Managuli, Y Kim. Data cache vs. direct memory access (DMA) in program-
ming mediaprocessors. IEEE Micro, in press.
Lam M. Software pipelining: An effective scheduling technique for VLIW machines. SIG-
PLAN 23:318–328, 1988.
Lee EA. Programmable DSP architecture: Part I. IEEE ASSP Mag 5:4–19, 1988.
Lee EA. Programmable DSP architecture: Part II. IEEE ASSP Mag 6:4–14, 1989.
Lee R. Accelerating multimedia with enhanced microprocessors. IEEE Micro 15(2):22–
32, 1995.
Lee W. Architecture and algorithm for MPEG coding. PhD dissertation, University of Washington, Seattle, 1997.
Managuli R, G York, D Kim, Y Kim. Mapping of 2D convolution on VLIW mediaproces-
sors for real-time performance. J Electron Imaging 9:327–335, 2001.
Managuli RA, C Basoglu, SD Pathak, Y Kim. Fast convolution on a programmable media-
processor and application in unsharp masking. SPIE Med Imaging 3335:675–683,
1998.
1 INTRODUCTION
* Throughout this chapter, the subwords in a register will be indexed from 1 to n, where n is the number of subwords in that register. The first subword (index = 1) will be in the most significant position in a register, whereas the last subword (index = n) will be in the least significant position. In the figures, the subword on the left end of a register will have index = 1 and therefore will be in the most significant position. The subword on the right end of a register will have index = n and therefore be in the least significant position.
Figure 2 MicroSIMD parallelism uses packed data types and a partitionable arithmetic
and logic unit structure.
* 3DNow! may be considered as having two versions. In June 2000, 25 new instructions were added
to the original 3DNow! specification. In this text, we will actually be considering this extended
3DNow! architecture.
ADD Rc,Ra,Rb

Rc is the target register, whereas Ra and Rb are the first and the second source registers, respectively. For AltiVec and IA-64, where some instructions may have one target and three source fields, Rd is used to represent the target register. VSUMMBM, which will be explained in Section 4, is such an AltiVec instruction and it is represented as follows:

VSUMMBM Rd,Ra,Rb,Rc

Rd is the target register, whereas Ra, Rb, and Rc are the first, second, and third source registers, respectively.
Our initial assumption that all the instructions use registers for source and target fields is not always true. MMX and SSE-2 are two important exceptions. Multimedia instructions in these extensions may use a memory location as a source operand. Thus, using our default representation for such instances will not conform to the rules of that particular architecture. However, to keep the notation simple and consistent, this distinction will not be observed, except for being noted here. For the exact instruction formatting and source–target register ordering, the reader is referred to the architecture manuals listed in the references.
* For details on instruction formatting used in this discussion, please refer to the last paragraph of
Section 1.2.
are summed up, and the four sums are written to the target register. Figure 5
shows a packed subtract operation that operates on registers holding four sub-
words each.
Figure 5 PSUB Rc,Ra,Rb: Packed subtract operation. Each register has four subwords.
Figure 6 To get the correct results in this packed add operation, the carry bits are not
propagated into the first and second sums.
* A flag bit is an indicator bit that is set or cleared depending on the outcome of a particular oper-
ation. In the context of this discussion, an overflow flag bit is an indicator that is set when an
add operation generates an overflow. There are occasions where the use of the flag bits is desirable. Consider a loop that iterates many times and, in each iteration, performs many add oper-
ations. In this case, it is not desirable to handle overflows (by taking overflow trap routines) as
soon as they occur, as this would negatively impact the performance by interrupting the execu-
tion of the loop body. Rather, the overflow flag can be set when the overflow occurs, and the
program flow may be resumed as if the overflow did not occur. At the end of each iteration
however, this overflow flag can be checked and the overflow trap can be executed if the flag
turns out to be set. In this way, the program flow would not be interrupted while the loop body
executes.
* Each one of the three registers can be treated as containing either a signed or an unsigned integer, which gives 2^3 = 8 possible combinations.
Clip a to an arbitrary maximum value v_max [v_max < 2^15 − 1]:
    PADD.sss Ra,Ra,Rb    Rb contains the value (2^15 − 1 − v_max). If a > v_max, this operation clips a to 2^15 − 1 on the high end.
    PSUB.sss Ra,Ra,Rb    a is at most v_max.

Clip a to an arbitrary minimum value v_min [v_min > −2^15]:
    PSUB.sss Ra,Ra,Rb    Rb contains the value (−2^15 + v_min). If a < v_min, this operation clips a to −2^15 at the low end.
    PADD.sss Ra,Ra,Rb    a is at least v_min.

Clip a to within the arbitrary range [v_min, v_max] [−2^15 < v_min < v_max < 2^15 − 1]:
    PADD.sss Ra,Ra,Rb    Rb contains the value (2^15 − 1 − v_max). This operation clips a to 2^15 − 1 on the high end.
    PSUB.sss Ra,Ra,Rd    Rd contains the value (2^15 − 1 − v_max + 2^15 − v_min). This operation clips a to −2^15 at the low end.
    PADD.sss Ra,Ra,Re    Re contains the value (−2^15 − v_min). This operation clips a to v_max at the high end and to v_min at the low end.

Clip the signed integer a to an unsigned integer within the range [0, v_max] [0 < v_max < 2^15 − 1]:
    PADD.sss Ra,Ra,Rb    Rb contains the value (2^15 − 1 − v_max). This operation clips a to 2^15 − 1 at the high end.
    PSUB.uus Ra,Ra,Rb    This operation clips a to v_max at the high end and to 0 at the low end.

Clip the signed integer a to an unsigned integer within the range [0, v_max] [(2^15 − 1) < v_max < 2^16 − 1]:
    PADD.uus Ra,Ra,0     If a < 0, then a = 0, else a = a. If a was negative, it gets clipped to 0; otherwise it remains the same.

c = max(a, b). Maximum operation: writes the greater subword to the target register.
    PSUB.uuu Rc,Ra,Rb    If a > b, then c = (a − b), else c = 0.
    PADD Rc,Rb,Rc        If a > b, then c = a, else c = b.

c = |a − b|. Absolute difference operation: writes the absolute value of the difference of the two subwords to the target register.
    PSUB.uuu Re,Ra,Rb    If a > b, then e = (a − b), else e = 0.
    PSUB.uuu Rf,Rb,Ra    If a ≤ b, then f = (b − a), else f = 0.
    PADD Rc,Re,Rf        If a > b, then c = (a − b), else c = (b − a).

Note: This table describes the contents of the registers (e.g., a or b) as if they contained a single value for simplicity. The same description applies to all the subwords in the registers. Initial contents of Ra and Rb are a and b unless otherwise noted.
Note: Not every saturation option indicated is applicable to every subword size. 3DNow! does not
have an entry in this table because it does not have integer microSIMD extensions.
* All of the discussions in this chapter consider IA-64 as the base architecture. Evaluations of the
other architectures are generally carried out by comparisons to IA-64.
Example
The format of the FPMA instruction is FPMA Rd,Ra,Rb,Rc and the operation it performs is Rd = Ra * Rb + Rc. If FR1 is used as the first or the second source operand (FPMA Rd,FR1,Rb,Rc), a packed FP add operation is realized (Rd = FR1 * Rb + Rc = 1.0 * Rb + Rc = Rb + Rc). Similarly, an FPMS instruction can be used to realize a packed FP subtract operation. Using FR0 as the third source operand in FPMA or FPMS (FPMA Rd,Ra,Rb,FR0) results in a packed FP multiply operation (Rd = Ra * Rb + FR0 = Ra * Rb + 0.0 = Ra * Rb).
Note: SP and DP FP numbers are 32 and 64 bits long, respectively, as defined by the IEEE-754 FP
number standard.
3DNow! has two packed subtract instructions for FP numbers. The only
difference between these two instructions is in the order of the operands. The
PFSUB instruction subtracts the second packed source operand from the first,
whereas the PFSUBR instruction subtracts the first packed source operand from
the second.
Table 4 gives a summary of the packed add/subtract operations discussed in this section. In Table 4, the first column contains the description of the operations. The symbols a_i and b_i represent the subwords from the two source registers. The symbol c_i represents the corresponding subword in the target register. A shaded background indicates a packed FP operation.
This section describes variants of the packed add instructions that are generally
designed to further increase performance in multimedia applications.
Figure 11 PAVG Rc,Ra,Rb: Packed average operation using the round away from zero option.
3.2 Accumulate
AltiVec provides an instruction to facilitate the accumulation of streaming data.
This instruction performs an addition of the subwords in the same register and
places the sum in the upper half of the target register, while repeating the same
process for the second source register and using the lower half of the target regis-
ter (Figure 13).
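A scalar model of this accumulation, shown for registers holding two 16-bit subwords each (illustrative only; the actual AltiVec registers are 128 bits wide):

    #include <stdint.h>

    /* Sum the subwords of a into the upper half of the result and the
       subwords of b into the lower half (two-subword illustration). */
    uint32_t accumulate(const uint16_t a[2], const uint16_t b[2])
    {
        uint32_t hi = (uint32_t)a[0] + a[1];
        uint32_t lo = (uint32_t)b[0] + b[1];
        return (hi << 16) | (lo & 0xFFFFu);
    }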
Figure 12 PAVG Rc,Ra,Rb: Packed average operation using the round to odd option.
This sequence can be shortened by combining the shift left instruction and
add instruction into one new shift left and add instruction. The following new
sequence performs the same multiplication in half as many instructions and uses
one less register.
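As a scalar illustration of the idea, multiplying by the constant 10 might be coded as follows (a generic sketch, not the exact sequence in the figure):

    /* Multiply x by the constant 10 (binary 1010) without a multiplier. */
    int times10(int x)
    {
        int t = (x << 2) + x;   /* 5x: one fused shift-left-and-add step */
        return t << 1;          /* 10x */
    }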
Initial values are C = 0.375 = 0.011_2, Ra = [1 | 2 | 3 | 4] = [0001 | 0010 | 0011 | 0100]_2
Figure 15 Packed multiply operation using four 16-bit subwords per register. Not all
of the full-sized products can be accommodated in a single target register.
The other approach is to keep the least significant halves of the products in the target register. Examples of this are MMX's PMULLW and AltiVec's VMUL. These instructions work as shown in Figure 17.
The IA-64 architecture generalizes this with its PMPYSHR instruction, which lifts the restriction that one must choose either the upper or the lower half of the products to put into the target register. The PMPYSHR instruction does a packed multiply followed by a shift right operation. This allows four* possible 16-bit fields (IA-64 has a 64-bit register size) from each of the 32-bit products to be chosen and placed in the target register. The PMPYSHR instruction is shown in Figure 18.
* The right-shift amounts are limited to 0, 7, 15, or 16 bits. This limitation allows a reduction in the
number of bits necessary to encode the instruction.
All of the instructions we have seen thus far performed a full multiplication
on all of the subword pairs of the source operands and then decided how to
handle the large products. However, instead of truncating the products, the source
subwords that will participate in the multiplication may be selected so that the
final product is never larger than can be accommodated in a single target register.
IA-64 has the PMPY instruction, which has two variants. PMPY allows
only half of the pairs of the source subwords to go into multiplication. Either
the odd or the even indexed subwords are multiplied. This makes sure that only
as many full products as can be accommodated in one target register are gen-
erated. The two variants of the PMPY instruction are depicted in Figures 19
and 20.
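A scalar model of the even-indexed variant (illustrative; subword indexing here follows the C array order, not the chapter's left-to-right convention):

    #include <stdint.h>

    /* Multiply only the even-indexed 16-bit subword pairs, producing
       two full 32-bit products that fit one 64-bit target register. */
    void pmpy_even(const int16_t a[4], const int16_t b[4], int32_t c[2])
    {
        c[0] = (int32_t)a[0] * b[0];
        c[1] = (int32_t)a[2] * b[2];
    }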
Figure 22 Multiply high and add Rd,Ra,Rb,Rc: The multiply and add instruction of the AltiVec architecture. In this variant of the instruction, only the high-order bits of the intermediate products are used in the addition.
to generate the sum of products. A third word from the third source operand is
added to this sum of products. The final sum is placed in the corresponding word
field of the target register. This process is repeated for each of the four words
(Fig. 24).
always greater than either of the multiplicands. This does not allow all of the
product terms to be written into the target register. The special format of FP
numbers does not cause such a size problem. The same number of bits* is used
to represent a FP number regardless of how large the number is. In this respect,
multiplication of packed FP registers is similar to the addition of packed FP
* Two numbers a and b can be compared for one of the following 12 possible relations: equal, less-than, less-than-or-equal, greater-than, greater-than-or-equal, unordered, not-equal, not-less-than, not-less-than-or-equal, not-greater-than, not-greater-than-or-equal, ordered. Typical notations for these relations are ==, <, <=, >, >=, ?, !=, !<, !<=, !>, !>=, !?, respectively.
FP operations
c_i = a_i * b_i           ✓   ✓d   ✓
d_i = −a_i * b_i          ✓
d_i = a_i * b_i + c_i     ✓   ✓
d_i = a_i * b_i − c_i     ✓
d_i = −a_i * b_i + c_i    ✓   ✓

a For use in multiplication of a packed register by an integer constant.
b For use in multiplication of a packed register by a fractional constant.
c Shift amounts are limited to 0, 7, 15, or 16 bits.
d Scalar versions of these instructions also exist.
tures allow for a more limited subset of relations. A typical packed compare
instruction is shown in Figure 26 for the case of four subwords.
In the packed maximum/minimum operations, the greater/smaller of the
subwords in the compared pair gets written to the corresponding field in the target
register. Figures 27 and 28 illustrate the packed maximum/minimum operations.
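In scalar C, the packed maximum amounts to the following (shown for four 16-bit subwords; the minimum is the same with the comparison reversed):

    #include <stdint.h>

    /* Packed maximum: the greater subword of each pair is written out. */
    void pmax(const int16_t a[4], const int16_t b[4], int16_t c[4])
    {
        for (int i = 0; i < 4; i++)
            c[i] = (a[i] > b[i]) ? a[i] : b[i];
    }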
tion of the operations. The symbols a i and b i represent the subwords from the
two source registers. The symbol c i represents the corresponding subword in the
target register. A shaded background indicates a packed FP operation.
Most of the architectures have instructions that support packed shift/rotate opera-
tions on packed data types. These instructions prove very useful in multimedia,
arithmetic, and encryption applications. There are usually great differences be-
The instructions for data/subword packing and rearrangement are the most interesting and show the widest variety among different architectures.
two pack instructions are generally in the size of the supported subwords and in
the saturation options that can be used.
unpacked to the target register. The high/low unpack instructions select and un-
pack the high/low-order subwords of the source operands. (See Figs. 35 and 36.)
* This second register needs to be at least 64 bits wide to fully accommodate the 64 control bits
needed for 16 subwords.
Number of subwords    Control bits needed (n log_2 n)
        2                      2
        4                      8
        8                     24
       16                     64
       32                    160
       64                    384
      128                    896
* The bytes are indexed from 0 to 7. Index 0 corresponds to the most significant byte, which is on
the left end of the registers.
register. The remaining bits of the target register are either zeroed or unchanged.
Deposit instructions may be limited to work on subwords instead of arbitrarily
long bit fields and arbitrary patch locations. Figures 43 and 44 show some possi-
ble deposit instructions.
Figure 45 Move mask Rb,Ra: Move mask operation on a register with four subwords. The most significant bit of each subword is written in order to the least significant byte of the target register.
Integer operations
Mix Left                                ✓          ✓          ✓
Mix Right                               ✓          ✓          ✓
MUX.rev                                 ✓
MUX.mix                                 ✓
MUX.shuf                                ✓
MUX.alt                                 ✓
MUX.brcst                               ✓
Arbitrary permutation of n subwords     ✓ (n = 4)  ✓ (n = 4)  ✓ (n = 4)  ✓ (n = 4)  ✓ (n = 16, 32)a
PACK                                    ✓          ✓          ✓
UNPACK (high/low)                       ✓          ✓          ✓          ✓
Move Mask                               ✓

FP operations
Mix Left                                ✓
Mix Right                               ✓
PACK                                    ✓
UNPACK (high/low)                       ✓
Arbitrary permutation of n FP numbers   ✓ (n = 2, 4)

a This is the VPERM instruction, and it has some limitations for n = 32. See text for more details on this instruction.
Example
Assume that register Ra contains the SP FP number a, and we want to calculate 1/a by first approximating the result and then refining this estimate to achieve the desired SP accuracy. The first step requires the use of the PFRCP instruction:

PFRCP Rb,Ra
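The refinement that typically follows such a seed is a Newton-Raphson iteration, sketched below in scalar form (a generic step, not 3DNow!'s exact PFRCPIT1/PFRCPIT2 semantics); each application roughly doubles the number of accurate bits:

    /* One Newton-Raphson refinement step for a reciprocal estimate:
       est' = est * (2 - a * est). */
    float refine_recip(float a, float est)
    {
        return est * (2.0f - a * est);
    }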
c_i = approx(1/a)          ✓
c_i = approx(1/√a)         ✓
c_i = approx(1/a_i)        ✓   ✓a   ✓
c_i = approx(1/√a_i)       ✓   ✓a   ✓
c_i = approx(log_2 a_i)    ✓
c_i = approx(2^a_i)        ✓
c_i = refine(est(1/a))     ✓
c_i = refine(est(1/√a))    ✓

a Scalar versions of these operations also exist.
9 SUMMARY
We have described the latest multimedia instructions that have been added to
current microprocessor instruction set architectures (ISAs) for native signal pro-
cessing, or, more generally, for multimedia processing. We described these in-
structions by broad classes: packed add/subtract operations, packed special
arithmetic operations, packed multiply operations, packed compare operations,
packed shift operations, subword rearrangement operations, and approximation
operations. For each of these instruction classes, we compared the instructions
Table 13  SSE-2 Features a Packed Instruction for Computing the Square Root of an FP Register

FP operation    IA-64   MAX-2   MMX   SSE-2        3DNow!   AltiVec
c_i = √a_i                            ✓, SP, DP
REFERENCES
1. RB Lee, M Smith. Media processing: A new design target. IEEE Micro 16(4):6–
9, 1996.
1 INTRODUCTION
1.1 Definitions
The following definitions are used to describe various attributes related to recon-
figurable computing:
• Reconfigurable or adaptive: In the context of reconfigurable computing,
this term indicates that the logic functionality and interconnect of a
computing system or device can be customized to suit a specific applica-
tion through postfabrication, user-defined programming.
• Run-time (or dynamically) reconfigurable: System logic functionality
and/or interconnect connectivity can be modified during application ex-
ecution. This modification may be either data driven or statically sched-
uled.
• Fine-grained parallelism: Logic functionality and interconnect connec-
tivity is programmable at the bit level. Resources encompassing multi-
ple logic bits may be combined to form parallel functional units.
tween logic blocks is provided via a series of wire segments located in channels
between the blocks. Programmable pass transistors and multiplexers can be used
to provide both block-to-segment connectivity and segment-to-segment connec-
tions.
Much of the recent interest in reconfigurable computing has been spurred
by the development and maturation of field programmable gate arrays. The recent
development of systems based on FPGAs has been greatly enhanced by an expo-
nential growth rate in the gate capacity of reconfigurable devices and improved device performance due to shrinking feature sizes and enhanced fabrication techniques. As shown in Figure 3, reported gate counts [24–26] for look-up table
(LUT)-based FPGAs, from companies such as Xilinx Corporation, have roughly
followed Moore’s law over the past decade.* This increase in capacity has en-
abled complex structures such as multitap filters and small RISC processors to
be implemented directly in a single FPGA chip. Over this same time period, the
system performance of these devices has also improved exponentially. Whereas
in the mid-1980s, system-level FPGA performance of 2–5 MHz was considered
acceptable, today’s LUT-based FPGA designs frequently approach performance
* In practice, usable gate counts for devices are often significantly lower than reported data book
values (by about 20–40%). Generally, the proportion of per-device logic that is usable has remained
roughly constant over the years, as indicated in Figure 3.
local SRAMs. Numerous DSP applications have been mapped to Splash II, includ-
ing audio and video algorithm implementations. These applications are described
in greater detail in Section 5. Recently, FPGA-based systolic architectures based
on the Splash II system have been developed by Annapolis Micro Systems [33].
The company's peripheral component interconnect (PCI) based Wildforce system contains five Xilinx XC4000XL devices aligned in a systolic chain. A similar,
VME-based Wildstar board contains four Xilinx Virtex devices.
As shown in Figure 5, the programmable active memory (PAM) DECPeRLe-1 system [30] contains arrangements of FPGA processors (labeled X) in a two-dimensional mesh with memory devices (labeled M) aligned along the array perimeter. PAMs
were designed to create the architectural appearance of a functional memory for a host microprocessor, and the PAM programming environment reflects this. From
a programming standpoint, the multi-FPGA PAM can be accessed like a memory
through an interface FPGA, XI, with written values treated as inputs and read
values used as results. Designs are generally targeted to PAMs through hand-
crafting of design subtasks, each appropriately sized to fit on an FPGA. The PAM
array and its successor, the Pamette [34], are interfaced to a host workstation
through a backplane bus. Additional discussion of PAMs with regard to DSP
applications appears in Section 5.
4.1 Specialization
As stated in Section 2.1, programmable digital signal processors are optimized
to deliver efficient performance across a set of signal processing tasks. Although
the specific implementation of tasks can be modified through instruction-
configurable software, applications must frequently be customized to meet spe-
cific processor architectural aspects, often at the cost of performance. Currently,
4.2 Reconfigurability
Most reconfigurable devices and systems contain SRAM-programmable memory
to allow full logic and interconnect reconfiguration in the field. Despite a wide
range of system characteristics, most DSP systems have a need for configurability
under a variety of constraints. These constraints include environmental factors
such as changes in signal and noise statistics, channel characteristics, weather, transmission rates, and communication standards. Although factors such as data traffic and
interference often change quite rapidly, other factors such as location and weather
change relatively slowly. Still other factors regarding communication standards
vary infrequently across time and geography, limiting the need for rapid recon-
figuration. Some specific ways that DSP can directly benefit from hardware re-
configuration to support these factors include the following:
4.3 Parallelism
An abundance of programmable logic facilitates the creation of numerous func-
tional units directly in hardware. Many characteristics of FPGA devices, in partic-
ular, make them especially attractive for use in digital signal processing systems.
The fine-grained parallelism found in these devices is well matched to the high
sample rates and distributed computation often required of signal processing ap-
plications in areas such as image, audio, and speech processing. Plentiful FPGA
flip-flops and a desire to achieve accelerated system clock rates have led designers
to focus on heavily pipelined implementations of functional blocks and interblock
communication. Given the highly pipelined and parallel nature of many DSP
tasks, such as image and speech processing, these implementations have exhibited
substantially better performance than standard PDSPs. In general, these systems
have been implemented using both task and functional unit pipelining. Many
DSP systems have featured bit-serial functional unit implementations [60] and
systolic interunit communication [29] that can take advantage of the synchroniza-
tion resources of contemporary FPGAs without the need for software instruction
fetch and decode circuitry. As detailed in Section 5, bit-serial implementations
have been particularly attractive due to their reduced implementation area. How-
ever, as reconfigurable devices increase in size, more nibble-serial and parallel
implementations of functional units have emerged in an effort to take advantage
of data parallelism.
Recent additions to reconfigurable architectures have aided their suitability
for signal processing. Several recent architectures [26,61] have included 2–4-
kbit SRAM banks that can be used to store small amounts of intermediate data.
This allows for parallel access to data for distributed computation. Another im-
portant addition to reconfigurable architectures has been the capability to rapidly
change only small portions of device configuration without disturbing existing
Since the appearance of the first reconfigurable computing systems, DSP applica-
tions have served as important test cases in reconfigurable architecture and soft-
ware development. In this section, a wide range of DSP design approaches and
applications that have been mapped to functioning reconfigurable computing sys-
tems are considered. Unless otherwise stated, the design of complete DSP sys-
tems is stressed, including I/O, memory interfacing, high-level compilation, and
real-time issues rather than the mapping of individual benchmark circuits. For
this reason, a large number of FPGA implementations of basic DSP functions
like filters and transforms that have not been implemented directly in system
hardware have been omitted. Although our consideration of the history of DSP
and reconfigurable computing is roughly chronological, some noted recent trends
were initially investigated a number of years ago. To trace these trends, recent
advancements are directly contrasted with early contributions.
and the amount of hardware needed to achieve desired results. For applications
requiring extended precision, floating point is a necessity. In Ref. 70, an initial
attempt was made to develop basic floating-point approaches for FPGAs that met
IEEE-754 standards for addition and multiplication. Area and performance were
considered for various FPGA implementations, including shift-and-add, carry-
save, and combinational multiplier. Similar work was explored in Ref. 71, which
applied 18-bit-wide floating-point adders/subtractors, multipliers, and dividers to
2D fast Fourier transform (FFT) and systolic FIR (finite impulse response) filters
implemented on Splash II. This work was extended to a full 32-bit floating point
in Ref. 72 for multipliers based on bit-parallel adders and digit-serial multipliers.
More recent work [73] re-examines these issues with an eye toward greater area
efficiency.
Figure 8 Architectural template for a single-chip Pleiades device. (From Ref. 116.)
REFERENCES
1 INTRODUCTION
2 BACKGROUND
words and then sums each pair to produce two 32-bit results. On a P55C, the
execution takes three cycles when fully pipelined. Because multiply-add opera-
tions are critical in many video signal processing algorithms such as DCT, this
feature can improve the performance of some video applications (e.g., JPEG and
MPEG) greatly. The motivation behind the packed compare instructions is a com-
mon video technique known as chroma key, which is used to overlay an object
on another image (e.g., weather person on weather map). In a digital implementa-
tion with MMX, this can be done easily by applying packed logical operations
after packed compare. Up to eight pixels can be processed at a time.
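In scalar form, the compare-then-select sequence looks like this (an illustrative model; MMX performs the compare and the logical operations on whole packed registers at once):

    #include <stdint.h>

    /* Where fg matches the key color, take the background pixel;
       otherwise keep the foreground pixel. */
    void chroma_key(const uint8_t fg[8], const uint8_t bg[8],
                    uint8_t key, uint8_t out[8])
    {
        for (int i = 0; i < 8; i++) {
            uint8_t mask = (fg[i] == key) ? 0xFF : 0x00;            /* packed compare  */
            out[i] = (uint8_t)((bg[i] & mask) | (fg[i] & ~mask));   /* packed logicals */
        }
    }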
Unlike MAX2, MMX instructions do not use general-purpose registers; all
the operations are done in eight new registers (MM0–MM7). This explains why
the four packed logical instructions are needed in the instruction set. The MMX
registers are mapped to the floating-point registers (FP0–FP7) in order to avoid introducing new architectural state. Because of this, floating-point and MMX instructions cannot be executed at the same time. To prevent floating-point instructions from corrupting MMX data, loading any MMX register sets the busy bit of all the FP registers, causing any subsequent floating-point instruction to trap. Consequently, an EMMS instruction must be used at the end of any MMX routine to restore the state of the FP registers. In spite of this awkwardness, MMX has been implemented in several Pentium models and is also inherited by the Pentium II and Pentium III.
For a number of reasons, the visual instruction set (VIS) instructions are implemented in the floating-point unit rather than the integer unit. First, some VIS
instructions (e.g., partitioned multiply and pack) take multiple cycles to execute,
so it is better to send them to the floating-point unit (FPU) which handles multi-
ple-cycle instructions like floating-point add and multiply. Second, video applica-
tions are register-hungry; hence, using FP registers can save integer registers for
address calculation, loop counts, and so forth. Third, the UltraSparc pipeline only
allows up to three integer instructions per cycle to be issued; therefore, using
FPU again saves integer instruction slots for address generation, memory load/
store, and loop control. The drawback of this is that the logical unit has to be
duplicated in the floating-point unit, because VIS data are kept in the FP registers.
The VIS instructions (listed in Table 5) support the following data types:
pixel format for true-color graphics and images, fixed16 format for 8-bit data,
and fixed32 format for 8-, 12-, or 16-bit data. The partitioned add, subtract, and
multiply instructions in VIS function very similarly to those in MAX2 and MMX. In each cycle, the UltraSparc can carry out four 16 × 8 or two 16 × 16 multiplica-
tions. Moreover, the instruction set has quite a few highly specialized instructions.
For example, EDGE instructions compare the address of the edge with that of
the current pixel block, and then generate a mask, which later can be used by
partial store (PST) to store any appropriate bytes back into the memory without
using a sequence of read–modify–write operations. The ARRAY instructions are
specially designed for three-dimensional (3D) visualization. When the 3D dataset
is stored linearly, a 2D slice with arbitrary orientation could yield very poor
locality in cache. The ARRAY instructions convert the 3D fixed-point addresses
into a blocked-byte address, making it possible to move along any line or plane
with good spatial locality. The same operation would require 24 RISC-equivalent
instructions. Another outstanding instruction is PDIST, which calculates the SAD
(sum of absolute difference) of two sets of eight pixels in parallel. This is the
4.4 Commentary
Instruction set extensions increase the processing power of general-purpose pro-
cessors by adding new functional units dedicated for video processing and/or
5 APPLICATION-SPECIFIC PROCESSORS
8×8: 3104 (VCP), 3404 (LVP)
    Applications: H.324 videophone; MPEG-1 I-frame encoder; MPEG-1 decoder
    Architecture: multimedia processor architecture (MPA), DSP-like engine
    Speed: 33 MHz
    Technology: 240-PQFP, 225-BGA, 5 V, 2 W

Analog Devices: ADV601, ADV601LC
    Applications: 4:1 to 350:1 real-time wavelet compression
    Architecture: wavelet kernel, adaptive quantizer, and coder
    Speed: 27–29.5 MHz
    Technology: 120-PQFP, 5 V, low cost

Analog Devices: ADV611, ADV612
    Applications: real-time compression/decompression of CCIR-601 video at up to 7500:1
    Architecture: wavelet kernel plus precise compressed bit-rate control
    Speed: 27 MHz
    Technology: 120-LQFP

C-Cube: DVX 5110, DVX 6210
    Applications: MPEG-2 main profile at main level encoder
    Architecture: DVX multimedia architecture
    Speed: 100 MHz, >10 BOPS
    Technology: 352-BGA, 3.3 V

C-Cube: CLM4440, CLM4725, CLM4740
    Applications: MPEG-2 authoring encoder; MPEG-2 storage encoder; MPEG-2 broadcast encoder
    Architecture: CL4040 Video RISC Processor 3 (VRP-3) loaded with different microcode
    Speed: 60 MHz, 5.7 BOPS
    Technology: 240-MQUAD, 3 W

C-Cube: AViA 500, AViA 502
    Applications: MPEG-2 audio/video decoder
    Architecture: video RISC processor-based architecture
    Technology: 160-PQFP, 3.3 V, 1.6 W

ESS Technology: ES3308
    Applications: MPEG-2 audio/video decoder
    Architecture: RISC processor and MPEG processor
    Speed: 80 MHz
    Technology: 208-PQFP, 3.3 V, <1 W

IBM: MPEGME31, MPEGME30 (chipset)
    Applications: MPEG-2 main profile at main level encoder
    Architecture: RISC-based architecture loaded with different microcode
    Speed: 54 MHz
    Technology: 304-CQFP, 0.5 µm, 3.3 V, 3.0–4.8 W

IBM: MPEGCS22, MPEGCD21
    Applications: MPEG-2 audio/video decoder
    Architecture: RISC-based architecture loaded with different microcode
    Technology: 160-PQFP, 0.4/0.5 µm, 3.3 V, 1.4 W
cluding CCIR-656. It has precise compressed bit-rate control, with a wide range of compression ratios from visually lossless (4:1) to 350:1. The glueless video
and host interfaces greatly reduce system cost while yielding high-quality images.
As shown in Figure 5, the ADV601 consists of four interface blocks and five
processing blocks. The wavelet kernel contains a set of filters and decimators
that process the image in both horizontal and vertical directions. It performs for-
ward and backward biorthogonal 2D separable wavelet transforms on the image.
The transform buffer provides delay line storage, which significantly reduces
bandwidth when calculating wavelet transforms on horizontally scanned images.
Under the control of an external host or digital signal processor (DSP), the adap-
tive quantizer generates quantized wavelet coefficients at a near-constant bit-rate
regardless of scene changes.
Note: No shading = base layer; light shading = enhancement layer 1; dark shading = enhancement layer 2.
Source: Ref. 25.
range of ±202 pixels and vertical range of ±124 pixels. A DSP coprocessor can execute up to 1.6 billion arithmetic pixel-level operations per second. The IPC interface coordinates multiple DVx chips (at the speed of 80 Mbyte/sec) to sup-
port higher quality and resolution. The video interface is a programmable high-
speed input/output (I/O) port which transfers video streams into and out of the
processor. MPEG audio is implemented in a separate processor.
Both the AViA500 and AViA502 support the full MPEG-2 video main pro-
file at the main level and two channels of layer-I and layer-II MPEG-2 audio,
with all the synchronization done automatically on-chip. Their architectures are
shown in Figure 7. In addition, the AViA502 supports Dolby Digital AC-3 sur-
5.10 Commentary
Dedicated video codecs, which are optimized for one or more video-compression
standards, achieve high performance in the application domain. Due to the com-
plexity of the video standards, all of the VSPs have to use microprogrammable
6 PROGRAMMABLE VSP
Chromatic Research: Mpact2 R/6000
    Applications: MPEG-1 encoding; MPEG-1 & -2 decoding; Windows GUI acceleration; 2D/3D graphics; H.320/H.324 videophone; audio, FAX/modem
    Architecture: VLIW (SIMD), 6 ALU groups, 2 Rambus RAC channels, and 5 DMA bus controllers
    Peak performance: 125 MHz, 6 BOPS
    Technology: 0.35 µm, 352-BGA, 3.3 V

MicroUnity: Cronus
    Applications: communicating and processing at broad-band data rates
    Architecture: single instruction group data, 5 threads
    Peak performance: 300 MHz
    Technology: 0.6 µm, 441-BGA, 3.3 V

Mitsubishi: D30V
    Applications: MPEG-1 & -2 decoding; Dolby AC-3 decoding; H.263 codec; 2D/3D graphics; modem
    Architecture: two-way VLIW RISC core
    Peak performance: 250 MHz, 1 BOPS
    Technology: 0.3 µm, 135-PGA, 2 V

NEC: V830R/AV
    Applications: MPEG-1 encoding; MPEG-1 & -2 decoding
    Architecture: superscalar RISC core and SIMD multimedia extension
    Peak performance: 200 MHz, 2 BOPS
    Technology: 0.25 µm, 208-PQFP, 2.5 V, 2 W
Processor(s)   Func. units   Issue slots   Data width (bit)   Register file   Inst. width (bit)   Inst. $/RAM   Data $/RAM   Float. point   Develop. tools
control protocol is used in the multiprocessor system, with several types of hard-
ware support such as low-latency memory-mapped I/O.
The MicroUnity media processor exploits parallelism from two perspec-
tives: group instructions and multithreading. The group instructions specify oper-
ations on four 128-bit register pairs, totaling a bandwidth of 512 bits/instruction.
This architecture, referred to as Single Instruction Group Data (SIGD), is almost
exactly the same as the instruction set extensions introduced in Section 6, and a
very similar concept can also be found in Texas Instruments’ MVP, where the
splittable ALUs can be reconfigured dynamically. There is some difference, how-
ever, in the size and number of operands. Although most other general-purpose
microprocessors work on two 64-bit source operands, MicroUnity’s media pro-
cessor can take up to three source register pairs, each 128 bits long (a register
pair consists of two 64-bit registers and can be used to represent different data
granularities from two 64-bit words to 128 single bits). In order to deal with
unaligned or mixed-precision data, the broad-band media processor also provides
switching instructions which can shift, shuffle, extract, expand, and swizzle op-
erands as well as other kinds of manipulation. These switching instructions are much more powerful than those of any other processor. Other instructions include control instructions, which can effectively reduce branch overhead. For example, the branch-gateway instruction fetches 128 bits from memory into a pair of registers (code and data pointers, respectively) while checking the translation lookaside buffer (TLB) for access control; then it jumps to the code pointer, storing a result link in its space. This is extremely helpful for active messages in message-passing-based multiprocessor systems.
vides extended math operations such as multiply over 8-bit Galois fields [i.e.
Storage (8, 16, 32, 64, or 128 bits) and synchronization (64 bits):
    Load 8, 16, 32, 64, or 128 bits, little- or big-endian; unsigned, aligned, immediate; 2-1-2
    Store 8, 16, 32, 64, or 128 bits, little- or big-endian; aligned, immediate; 4-1-0
    Store add, compare, or multiplex 64 bits; immediate, -and-swap; 8-7-7

Branch (64 bits):
    Branch and-equal, and-not-equal, less, or less-equal-zero; 2-1-1 for pipelined, 2-1-4 for unpipelined
    Branch equal, not-equal, less, or greater-equal; 2-1-1
    Branch floating-point equal, not-equal, less, or greater-equal (16, 32, 64, or 128 bits)
    Branch; immediate, -and-link; 2-1-1
    Branch gateway; immediate; 2-2-1
    Branch down or back; 2-1-1

Fixed point (64 bits) and group (128 × 1, 64 × 2, 32 × 4, 16 × 8, 8 × 16, 4 × 32, or 2 × 64 bits):
    Add or subtract; immediate, overflow; 1-1-1
    Multiply; unsigned, -and-add; 1-5-7 for 32-bit, 1-20-22 for 64-bit multiply, 1-23-25 for 64-bit multiply-and-add, 1-2-4 for others
    Divide; unsigned
such as a variable-length saturation instruction, a join instruction, an add-sign instruction, and so on. The branch unit has a variable number of delay slots and additional conditional branches (e.g., test-zero-and-branch, test-not-zero-and-branch), which enable zero-delay branches and zero-overhead loops. All of the functional
units in the D30V core are fully pipelined using a four-stage pipeline; they are
controlled by a 64-bit VLIW instruction, which contains two short or one long
RISC subinstructions. The dual-issue processor has used some advanced tech-
niques to improve the performance. These techniques include predicated execu-
tion and speculative execution.
The D30V processor core was designed not only to meet computational
requirements and cost but also to provide the flexibility of programmable proces-
sors. However, the dual-issue processor is not powerful enough for computation-
intensive video applications like MPEG-2 encoding. In an implementation of an
MPEG-2 decoder, two D30V cores and several small processing units are re-
quired in addition to a dedicated motion-estimation processor.
can perform horizontal or vertical image filtering and scaling, YUV-to-RGB color space conversion, as well as overlaying live video on a background image. The
variable-length decoder (VLD) coprocessor offloads the VLIW CPU by decoding
Huffman-encoded video bit streams. Due to the characteristics of the algorithm,
this task has little inherent parallelism and, hence, is not suited for VLIW pro-
cessing. The two coprocessors are microprogrammable. They are independent of
the VLIW CPU and are synchronized with it using an interrupt mechanism.
The VLIW processor has a rich instruction set (197 instructions), including
many extensions for handling multimedia data types. Parallelism is achieved by
incorporating 27 functional units in the VLIW engine and feeding them with five
instruction issue slots. The type and number of functional units are listed in Table
12. All of the functional units are pipelined, with a depth ranging from 1 to 17
stages. The five constant units do not perform any calculation except providing
ports for accessing immediate values stored in the instruction word. Like many
other processors, TM-1000 also provides pack/unpack and group instructions,
which can manipulate 4 bytes or two 16-bit words at one time, exploiting subword
parallelism. Other special instructions include me8 for motion estimation, which
is similar to the PDIST instruction in UltraSparc’s visual instruction set. Most
instructions accept a guard register for predicated execution. Although the TM-
1000 processor has 27 functional units, in each cycle it can issue only up to 5
instructions.
The TM-1000 has a dedicated instruction cache and a data cache on-chip,
both of which are eight-way set-associative with LRU (least-recently used) re-
placement and locking mechanism to improve performance. The 16 KB dual-
Name                              Quantity   Latency (cycles)   Recovery (cycles)
Constant                          5          1                  1
Integer ALU                       5          1                  1
Memory load/store                 2          3                  1
Shift                             2          1                  1
DSP ALU                           2          2                  1
DSP multiply                      2          3                  1
Branch                            3          3                  1
Floating-point ALU                2          3                  1
Integer/floating-point multiply   2          3                  1
Floating-point compare            1          1                  1
Floating-point sqrt/divide        1          17                 16
ported data cache allows two simultaneous nonblocking accesses. To save band-
width and storage space, the VLIW instructions are stored and cached using a
2–23-byte compressed format until they are fetched from the instruction cache.
The chip also has a glueless interface to support four banks of external SDRAMs.
The TM-1000 development environment includes a VLIW compiler, a C/
C⫹⫹ software development environment, and the pSOS⫹ real-time operating
system.
The TriMedia CPU64 [49] is a successor to the TM-1000. The CPU64 has
a 64-bit word and uses subword parallelism within the VLIW CPU to increase
parallelism on small data words.
Video processing is performed by the vector processor, because the RISC CPU deals only with general system functions.
The MSP is a fully programmable media processor with a rich instruction
set, including standard ARM RISC instructions for scalar processing, high-
performance SIMD instructions for vector processing, I/O instructions for block
load/store, and special instructions for filtering and MPEG applications. The pro-
gramming model also has macro library instructions such as DCT, CONV, and
MULM. The software development tools include MSP-oriented assembler, com-
piler, linker, debugger, and simulator.
The master processor is powered by 32-bit instructions and can issue, in each cycle, a parallel multiply, an add, and a 64-bit load/store, yielding 100 MFLOPS at 50 MHz. Its floating-point unit can perform both single- and double-precision arithmetic.
Each of the four ADSPs is a 32-bit integer DSP optimized for bit- and
pixel-oriented imaging and graphics applications. Each parallel processor can
issue, in each cycle, a multiply, an ALU operation, and two memory accesses
within a single 64-bit instruction. The parallelism comes from two independent
datapaths. The multiplier datapath includes a three-stage 16 ⫻ 16 multiplier, a
half-word swapper, and rounding hardware. The ALU datapath includes a 32-bit three-input ALU, a barrel rotator, a mask generator, a 1-bit to n-bit expander, leftmost/rightmost bit-detection and bit-change logic, and several multiplexers. The 32-bit three-input ALU can perform all 256 three-input Boolean combinations as well as many other mixed logical and arithmetic operations. Both the multiplier and the ALU are splittable: the 16 × 16 multiplier can be split into two 8 × 8 multipliers, and the 32-bit ALU can be divided into two 16-bit ALUs or four 8-bit ALUs. The large register file contains 8 data registers, 10 address registers, 6 index registers, and 20 other user-visible registers. Three hard-
ware loop controllers enable zero-overhead looping/branching and multiple-loop
end points. The ADSPs provide conditional operation (also referred to as predi-
cated execution).
The video controller handles both video input and output and can simulta-
neously support two independent capture or display systems. The transfer control-
ler combines a memory interface and a DMA engine, handling data movement.
6.8 Commentary
Programmable VSPs represent a new trend in multimedia systems. They tend to
be more versatile in dealing with multimedia applications, including video, audio,
and graphics. VSPs have to be very powerful because the amount of computation
required by video compression is enormous. To meet the performance demands,
all of the VSPs employ parallel processing techniques to some degree: VLIW
(SIMD), multiprocessor-based MIMD, or the concept from vector processing
(SIGD). However, none of these programmable VSPs are able to compete with
dedicated state-of-the-art VSPs—none of them could support real-time MPEG-
2 encoding yet.
It is not surprising to see that many programmable VSPs adopt VLIW archi-
tecture. There are basically two reasons for doing this. First, there is much paral-
lelism in video applications [53]. Second, in VLIW machines, a high degree of
parallelism and high clock rates are made possible by shifting part of the hardware
workload to software. This kind of shift once happened in the microprocessor
evolution from CISC to RISC. By relieving the hardware burden, RISC achieved
a new level that CISC was unable to compete with and the revolution has been
a milestone in microprocessor history. Analogously, we would expect VLIW
to outperform other architectures. Unlike their superscalar counterparts, VLIW processors rely entirely on the compiler to exploit parallelism; static scheduling is performed by sophisticated optimizing compilers. All of this raises challenges for next-generation compilers. More discussion of the VLIW architecture
as well as its associated compiler and coding techniques can be found in Fisher et
al.’s review [54]. Although offering architectural advantages for general-purpose
computing (where unpredictability and irregularity are high), multithreading ar-
chitectures are not as optimal for video processing where regularity and predict-
ability are much higher.
7 RECONFIGURABLE SYSTEMS
The organization of the reconfigurable array is very similar to that of DISC (Fig. 16). Each row of the
array contains 23 logic blocks, each capable of handling 2 bits. In addition, there
is a distributed cache built in the reconfigurable array. Similar to an instruction
cache, the configuration cache holds the most recent configuration so as to expe-
dite dynamic reconfiguration. Simulation results show speedups ranging from 2
to 24 against a 167-MHz Sun UltraSPARC 1/170.
The REMARC reconfigurable array processor [65] also couples some re-
configurable hardware with a MIPS microprocessor. It consists of a global control
unit and an 8 × 8 array of programmable logic blocks called nanoprocessors (Fig. 18). Each nanoprocessor is a small 16-bit processor: it has a 32-entry instruction RAM, a 16-entry data RAM, one ALU, one instruction register, eight 16-bit data registers, four data input registers, and one data output register. The nanoprocessors are interconnected in a mesh structure. In addition, there are eight horizontal buses and eight vertical buses for global communication. All 64 nanoprocessors are controlled by the same program counter, so the array processor is very much like a VLIW
processor. REMARC is not based on FPGA technology, but the authors compared
it with an FPGA-based reconfigurable coprocessor (which is about 10 times larger
than REMARC) and found that both have similar performance, which is 2.3–
7.3 times as fast as the MIPS R3000 microprocessor. Simulation results also
show that both reconfigurable systems outperform Intel MMX instruction set
extensions.
Used as an attached coprocessor, PipeRench [66] exploits parallelism at a coarser granularity. It employs a pipelined, linear reconfiguration to solve the problems of compilability, configuration time, and forward compatibility. Targeting stream-based functions such as finite impulse response (FIR) filtering, PipeRench consists of a sequence of stripes, which are equivalent to pipeline stages. One physical stripe, however, can function as several virtual pipeline stages in a time-multiplexed fashion.
7.3 Commentary
Reconfigurable computing systems combine, to some degree, the speed of ASICs with the flexibility of software. They emerge as a unique approach to
high-performance video signal processing. Quite a few systems have been built
to speed up video applications and they have proven to be more efficient than
systems based on general-purpose microprocessors.
This approach also opens a new research area and raises many challenges
for both hardware and software development. On the hardware side, how to cou-
ple reconfigurable components with microprocessors still remains open, and the
granularity, speed, and portion of reconfiguration as well as routing structures
are also subjects of active research. On the software side, CAD tools need great
improvement to automate or accelerate the process of mapping applications to
reconfigurable systems.
Although reconfigurable systems have shown the ability to speed up video signal processing as well as many other types of applications, they have not yet met the requirements of the marketplace; most of their applications are limited to individual research groups and institutions.
8 CONCLUSIONS
In this chapter, we have discussed four major approaches to digital video signal
processing architectures: Instruction set extensions try to improve the perfor-
mance of modern microprocessors; dedicated codecs seem to offer the most cost-
effective solutions for some specific video applications such as MPEG-2 decod-
ing; programmable VSPs tend to support various video applications efficiently;
and reconfigurable computing compromises flexibility and performance at the
system level. Because the four approaches are targeted at different markets, each
having both advantages and disadvantages, they will continue to coexist in the
future. However, as standards become more complex, programmability will be
important for even highly application-specific architectures. The past several
years have seen limited programmability become commonplace in the design of
application-specific video processors. As just one example, all the major MPEG-2
encoders incorporate at least a RISC core on-chip.
REFERENCES
1 INTRODUCTION
* Not to be confused with instruction scheduling, which is a fine-level optimization technique, dis-
cussed in Section 2.
* The processor architecture assumed by Aho and Johnson [23] has n general-purpose registers and
a finite number of memory locations. All registers are interchangeable, as are all memory locations.
* Means-ends analysis is a well-established artificial intelligence technique that takes into consider-
ation the available resources (means) and the goals of the problem (ends) in specifying a plan of
action.
* In the tree/DAG traversal method, the code-generating routines (or macros) must be handwritten.
In Twig, the assembly instructions associated with a given pattern are also handwritten.
2.2 Optimizations
An optimizing compiler is one that applies optimization techniques such that the
resultant machine code is faster and/or smaller than that produced by conven-
tional compilers. Perhaps the most important tenet about optimizations is that
they aim at improving code, rather than actually attaining optimal code.
* The available memory size is limited. In addition, the assumption that smaller code will also execute faster is often valid.
3 OVERVIEW OF OASIS
* We note the distinction between the terms optimal and optimized. The first refers to the absolute best. The latter is a term frequently used by compiler vendors, implying that optimization techniques have been applied and that the result may or may not be optimal.
* Due to the use of heuristics, an optimal solution cannot be guaranteed. However, for a sufficiently
small DAG and given sufficient computational resources, if an optimal solution falls within the
solution space defined by the heuristics, OASIS will find it.
Example 1
S1: x = b * c;
S2: y = x + b * c + d * e;
In our current implementation, the front end (FE) is adapted from the lcc compiler [61]. The FE takes an ANSI C program as input and generates a linearized
DAG as the intermediate representation (IR). Because the linearized form is quite
difficult to visualize, we will use a graphical representation of the FE’s output
for Example 1, as shown in Figure 6. Nodes of type INDIR represent the fetch
of a variable value from a resource, and nodes of type ASGN write a new value
to the variable stored in a certain resource. The translation from the linearized
form to the graphical form loses some information, namely control and data dependencies. In Example 1, the linearized form implies that statement S1 must be executed prior to S2. However, in Figure 6, there is nothing that indicates that node 1 should
be covered before node 12. In the next subsection, we show how that information
can be recovered, as well as discuss other pertinent issues related to the IR.
1. True dependency (or read-after-write hazard): characterized by
S1: x = ...
S2: ... = x
The value assigned to x in S1 is used in S2; hence, the assignment in S1 must be executed before the use in S2.
2. Antidependency (or write-after-read hazard): characterized by
S1: ... = x
S2: x = ...
A new value cannot be assigned to x until x's current value has been used; hence, the use of x in S1 must be executed before the assignment in S2.
3. Output dependency (or write-after-write hazard): characterized by
S1: x = ...
S2: x = ...
The final value that x assumes depends directly on which statement is exe-
cuted last. The order in which the statements appear in the source program must
be preserved by the scheduling algorithm. In this case, the assignment in S1 must
be executed before the assignment in S2.
As the name implies, true dependencies are the only real dependencies; a variable required by the current statement must have its value evaluated by a previous statement. Both antidependence and output dependence can be eliminated by variable renaming [65,66]. Variable renaming usually increases the register requirements of the generated code.
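For illustration, renaming removes the write-after-read hazard in the antidependency example above (the renamed variable x1 is ours):

S1: ... = x;
S2: x1 = ...;   /* S2 now writes a fresh variable; later reads of x use x1 */

After renaming, S1 and S2 no longer conflict, so the scheduler is free to reorder them.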
AddDataDependencyArcs(DAG) {
  for (every node i in DAG that is of type ASGN or INDIR) {
    for (every node j ≠ i in DAG that is of type ASGN or INDIR) {
      if (no arc exists between nodes i and j) {
        DepType = DetermineDependencyType(DAG, i, j);
        case (DepType) {
          TrueDep:   add arc from node i to node j;
          AntiDep:   add arc from node j to node i;
          OutputDep: if (i > j) add arc from node i to node j;
                     else add arc from node j to node i;
        };
      };
    };
  };
};
As noted, data dependency arcs occur only among INDIR and ASGN nodes, since these are the nodes that read values from and write values to resources. However, not all pairs of INDIR and ASGN nodes need a data dependency arc, only those that fall into the true, anti-, or output dependency categories. Figure 7 illustrates one such case in which no data dependency arc is necessary.
Example 2
S1: if (a) x = 1;
S2: else x = 2;
S3: y = x;
...
node#3  ADDRLP  count = 1  a
node#2  INDIRI  count = 1  #3
node#4  CNSTI   count = 1  0
node'1  EQI     count = 0  #2 #4 2
node#6  ADDRLP  count = 1  x
node#7  CNSTI   count = 1  1
node'5  ASGNI   count = 0  #6 #7
node#9  ADDRGP  count = 1  3
node'8  JUMPV   count = 0  #9
2:
node#12 ADDRLP  count = 1  x
node#13 CNSTI   count = 1  2
node'11 ASGNI   count = 0  #12 #13
3:
node#16 ADDRLP  count = 1  y
node#18 ADDRLP  count = 1  x
node#17 INDIRI  count = 1  #18
node'15 ASGNI   count = 0  #16 #17
and it has arcs pointing to two other nodes, node 6 and node 7.* The same DAG,
in a more legible graphical form is shown in Figure 8.
The lcc compiler transforms if–then–else structures into a one-dimensional
description with GOTOs (JUMPs) to determine ordering. When a two-dimen-
sional DAG is built from such description, the ordering information is lost. For
example, in the linearized DAG, nodes 11 to 13 precede nodes 15 to 18. A JUMP
in node 8 brings us to the subtree labeled 3, but no JUMP is necessary at the
end of the subtree labeled 2 because it is assumed that the subtree labeled 3
follows unconditionally. However, in the graphical representation of the linearized
DAG, shown in Figure 8, subtrees are spread around, with nothing to indicate
the covering order. One way to recover the precedence information is to augment
the DAG with control dependence edges. Such edges are processed by the IR
Preprocessing block just like any other edge. Hence, if a node M must be covered
before a node N, an edge must be added from N to M. This forces a partial
scheduling order more in line with the one meant by lcc. The resultant augmented
DAG is shown in Figure 9.
ASAP(DAG) {
  node[root].schedule = 0;
  N = set of all nodes of the DAG;
  p = 1; /* schedule # */
  while (N not empty) {
    for (every node j in N) {
      if (all parents of node j have been scheduled) {
        temp[j] = p;
        N = N - {j};
      }
      else temp[j] = -1;
    };
    for (i = 1; i <= number of nodes; i++) {
      if (temp[i] == p) node[i].schedule = p; /* commit this pass, so nodes
                                                 scheduled now do not enable
                                                 their children until the
                                                 next pass */
    };
    p = p + 1;
  };
};
ALAP(DAG) {
  Reverse the directions of all edges in DAG;
  Assume the "root" is "output" and vice versa;
  ASAP(DAG);
  n = largest schedule #;
  for (all nodes j) {
    node[j].schedule = n - node[j].schedule + 1;
  };
};
This ALAP initial scheduling has three purposes. First, it provides a basis
for the actual node coverage ordering. At each iteration, the node evaluation
functions (described in Section 4.1.3) evaluate the effective cost of all expandable
nodes. The least expensive node is then selected for coverage. Second, it is com-
mon for more than one node to share the minimum effective cost. In that case,
the decision is made based on the initial scheduling—the node with the lowest
initial schedule is selected. It is still possible, however, that more than one node
share both the minimum effective cost and initial schedule number. In that case,
an arbitrary decision is made: The node with the lowest node number is selected.
As we will see, node evaluation is the single slowest component of the
whole algorithm. OASIS allows the user to specify a limit, B, on the number of
nodes to be evaluated. Again, the initial schedule plays an important role: coverable nodes are sorted in increasing order of initial schedule, and only the first B nodes are evaluated.
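The three-level tie-breaking rule just described can be summarized as a comparator; the structure and field names below are illustrative, not OASIS internals:

/* Returns nonzero if candidate a should be covered before candidate b. */
typedef struct {
    int effective_cost;    /* from the node evaluation functions */
    int initial_schedule;  /* ALAP initial schedule number */
    int node_number;
} candidate_t;

static int covers_first(const candidate_t *a, const candidate_t *b)
{
    if (a->effective_cost != b->effective_cost)
        return a->effective_cost < b->effective_cost;       /* 1. least effective cost */
    if (a->initial_schedule != b->initial_schedule)
        return a->initial_schedule < b->initial_schedule;   /* 2. lowest initial schedule */
    return a->node_number < b->node_number;                 /* 3. lowest node number */
}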
The scheduled DAG for Example 1 is shown in Figure 10. In Example 1,
variable x in S2 depends on the value assigned to it in S1. This data dependency
is indicated by the arc from node 12 to node 1, which causes node 12 to be
scheduled after node 1.
Figure 11 A heuristic search tree with depth K ⫽ 3. At each level p, heuristics are
used to expand a node into its most promising successors. The search engine supports
two kinds of trees: (a) Each successor is associated with a node cost and only the least
expensive successor is selected for expansion at level p ⫹ 1; this kind of tree is used at
the topmost level (see Fig. 4). (b) All promising successor nodes are expanded at level
p ⫹ 1 and the node cost at level p is the sum of the expansion cost at level p and the
minimum node cost at level p ⫹ 1; this kind of tree is used to evaluate the node costs of
the tree in (a).
Example 3
S2: y = x + b * c + d * e;
Subgoal 1:
LAC  x      ;load accumulator with x
Subgoal 2:
SACL temp   ;store x in location temp in memory
PAC         ;move (b * c) to accumulator
Subgoal 2:
PAC         ;move (b * c) to accumulator
SACL y      ;move (b * c) to location y in memory
Subgoal 1:
LAC  x      ;load accumulator with x
The latter ordering is clearly less expensive and is returned as the better
plan.
The second issue comes into play when the plan is executed. Note that after the instruction PAC is applied, one operand of ADD, (b * c), is in the accumulator, whereas the other operand, x, is in data memory. ADD's prerequisite requires exactly the opposite, but because + is a commutative operator, ADD is applicable at this point. Hence, only the first component of the plan is used and the rest is discarded.
Proof. The worst case is when each template covers only one DAG node. From
the search tree in Figure 11a, if a node expands (in the worst case) into B succes-
sors and the time necessary to evaluate the cost of a successor node is T, one
can readily derive that the (worst case) total search time is given by
τ = total search time = S × (time to expand one node) = S(BT)   (5)
Because T is proportional to O(S + B^K) (from Theorem 1), the total search time is proportional to O(BS^2 + B^(K+1)S).
Hence, for reasonably small (fixed) values of K, the proposed algorithm
executes in polynomial time with the DAG size.
* A K-level tree rooted at one node, with each node expanding into B successors, has exactly (B^K + B^(K-1) + . . . + B + 1) nodes. For simplicity, it is common practice to approximate the total number of nodes by O(B^K). This approximation becomes more accurate as K increases.
5 CONTEXT-DEPENDENT ALGORITHM
TRANSFORMATIONS
Even if a code generator could generate optimal code for a given HLL program,
that code may still be inferior to handwritten assembly code. The reason is simple.
Although the IR can uniquely represent a given sequence of HLL instructions
(program), the latter is not a unique implementation of a desired algorithm. In
fact, there are infinite programs that evaluate a given expression. For example,
one could implement (a × b) as [(a + b)^2 − a^2 − b^2]/2. Incidentally, Massalin’s
Superoptimizer [71] can generate optimal instruction sequences given a function
to be performed. Superoptimizer, however, is not a code generator. It takes as
input a machine language program and returns another (smaller) program, in the
same machine language, which computes the same function as the input. It does
not understand the intended function. It simply performs an exhaustive search
over the target machine’s instruction set for all possible equivalent pro-
grams, first of length 1, then of length 2, and so on. For each function, the user
must define a set of test vectors to be used to verify program equivalence and
manually inspect the programs that pass such a test for equivalence with the
original program. Due to its exponential nature, its application is limited to very
small programs. Massalin wrote: ‘‘The current version of Superoptimizer has
generated programs 12 instructions long in several hours running time. . . . There-
fore, the Superoptimizer has limited usefulness as a code generator for a
compiler.’’
The use of HLLs emphasizes the need for algorithm transformation. Unlike
assembly languages, HLLs are, in principle, processor independent. Without tar-
get architecture information, HLL programmers cannot bias the program toward
certain constructs, as experienced assembly programmers often do. Hence, it is
the responsibility of the compiler to perform that task.
The term algorithm transformation has a broad meaning. Research on
Example 4
S1: f = a + (b - c);
S2: g = a - (b + c);
S1′: f = b + (a - c);
S2′: g = (a - c) - b;
Example 5
S1: f = b + (a - c);
S2: g = a - (b + c);
S1′: f = a + (b - c);
S1: f = b + (a - c);
S2′: g = (a - c) - b;
Proof. In the worst case, every search node has Z alternative transformations
and each transformation requires replication of the entire branch rooted at that
search node. Following the rationale in the proof of Theorem 2, the time neces-
sary to expand one node is multiplied by Z, because each node now can have Z
replicas. Similarly, the time necessary to evaluate the cost of a successor node,
T, is also multiplied by Z. Hence,
6 EXPERIMENTAL RESULTS
REFERENCES
1. EA Lee. Programmable DSP architectures: Part I. IEEE ASSP Mag 5(4):4–19, 1988.
2. EA Lee. Programmable DSP architectures: Part II. IEEE ASSP Mag 6(1):4–14, 1989.
3. H Ahmed. Recent advances in DSP systems. IEEE Commun Mag 29(5):32–45, 1991.
4. D Bursky. DSPs expand role as costs drop and speeds increase. Electron Des 39(19):53–65, 1991.
5. D Shear. HLL compilers and DSP run-time libraries make DSP-system programming easy. EDN 33(13):69–74, 1988.
6. D Genin, EV deVelde, JD Moortel, D Desmet. System design, optimization and
intelligent code generation for standard digital signal processors. Proceedings of the
IEEE International Symposium on Circuits and Systems. New York: IEEE, 1989,
pp 565–569.
7. J Hartung, SL Gay, SG Haigh. A practical C language compiler/optimizer for real
time implementations on a family of floating point DSPs. Proceedings of the Interna-
tional Conference on Acoustics, Speech and Signal Processing. New York: IEEE,
1988, pp 1674–1677.
1.2 Motivation
Although MMX technology can offer much higher performance, programming
with MMX technology is easier said than done. First, different algorithms have
different characteristics for parallel processing. Algorithms with repetitive, regu-
* For example, one instruction that we use frequently in DCT/IDCT is PMADDWD, the Packed Multiply and Add instruction. This instruction multiplies four operands by another four operands and then adds the results of the first two multiplications and the results of the other two multiplications, all at once. If the operands are in the right place, four multiplications take the same time as two or three multiplications. In this case, overoptimized algorithms with data shuffling may result in slower performance.
Figure 1 (a) Conventional scalar operation. To add two vectors together, we have to
add each pair of the components sequentially. (b) In processors with SIMD capability,
we can add two vectors in parallel using one instruction.
Category     Operation                  Wraparound                       Saturation
Arithmetic   Addition                   PADDB, PADDW, PADDD              PADDSB, PADDSW, PADDUSB, PADDUSW
             Subtraction                PSUBB, PSUBW, PSUBD              PSUBSB, PSUBSW, PSUBUSB, PSUBUSW
             Multiplication             PMULL, PMULH
             Multiply and Add           PMADD
Comparison   Compare for Equal          PCMPEQB, PCMPEQW, PCMPEQD
             Compare for Greater Than   PCMPGTB, PCMPGTW, PCMPGTD
Conversion   Pack                       PACKSSWB, PACKUSWB, PACKSSDW
             Unpack High                PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ
             Unpack Low                 PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ
* There is a trade-off between precision and parallelism. Although the MMX instructions on Intel
Pentium II processors or earlier can operate on multiple integer operands, the Streaming SIMD
Extensions on Intel Pentium III processors or later can operate on multiple single-precision floating-
point operands. On Intel Pentium 4 processors, we can even operate on multiple double-precision
floating-point operands. If we give more precision to the operands, then we can work on fewer
operands at once.
† Intel Pentium III processors have prefetch instructions to help applications bring operands into caches before the real operations need the data. An application example can be found in Ref. 22. Intel Pentium 4 processors have a hardware prefetch unit, so applications do not have to prefetch data that will be brought into caches.
instructions because MMX registers and state are aliased onto the
floating-point registers and state. This was done so that no new regis-
ters or states were introduced by MMX technology.†
2. MMX instructions that reference memory or integer registers do not
mix well with integer instructions referencing the same memory or
registers.
3. It is important to arrange data in the best way for MMX technology processing (e.g., structure of arrays, array of structures, rowwise or columnwise arrangements). Rowwise processing is in general better than sequential columnwise processing.
The following are two processor-specific rules (on Intel Pentium III or earlier):
4. MMX shift/pack/unpack instructions do not mix well with each other.
In general, two MMX instructions can be executed in the same clock cycle. However, only one MMX shift/pack/unpack instruction can be executed per clock cycle because there is only one shifter unit.
5. MMX multiplication instructions ‘‘pmull/pmulh/pmadd’’ do not mix
well with each other because there is only one multiplication unit.
† MMX code sections should end with ‘‘emms’’ instructions if floating-point operations are to be
used later in the program. In order to maintain operating system compatibility, MMX registers are
aliases to the scalar floating-point registers. As we read or write to an MMX register, we read and
write to one of the floating-point registers and vice versa. Thus, we cannot guarantee what the
contents of the floating-point register will be after we execute an MMX piece of code, or vice
versa. Mixing MMX instructions and floating-point code fragments in the same application is chal-
lenging. To guarantee that no floating-point errors will occur when we switch from MMX to floating
point, we must use the new MMX instruction EMMS (Empty MMX Technology State), which
marks all the floating-point registers as Empty. Using EMMS at the end of every MMX function
may not be the most efficient; we just need to use EMMS before floating-point operations.
void get_dc_image_c(void)
{
    int y, x, j, k, temp;
    for (y = 0; y < height_in_blocks; y++) {
        for (x = 0; x < width_in_blocks; x++) {
            temp = 0;
            for (j = 0; j < block_size; j++) {
                for (k = 0; k < block_size; k++) {
                    temp += image_data[(y * block_size + j) * image_width
                                       + x * block_size + k];
                }
            }
            dc_image_data[y * width_in_blocks + x] =
                temp / (block_size * block_size);
        }
    }
}
The inner loops (loop j and loop k) are simple additions of 8-bit integers.
Because of the extreme regularity, the inner loops of this subroutine can be imple-
mented efficiently in MMX technology. We unrolled the inner loop so that the
operation can be executed in parallel. The following is a high-level, conceptual
code after unrolling. (We define image_data[y, x, j, k] as image_data[(y * block_size + j) * image_width + x * block_size + k].)
for (k = 0; k < block_size; k++) {
    temp[k] = 0;
}
for (j = 0; j < block_size; j++)
{   /* execute the following additions in parallel */
    temp[0] += image_data[y, x, j, 0];
    temp[1] += image_data[y, x, j, 1];
    temp[2] += image_data[y, x, j, 2];
    ...
    temp[block_size-1] += image_data[y, x, j, block_size-1];
}
sum = 0;
for (k = 0; k < block_size; k++) {
    sum += temp[k];
}
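For illustration, the unrolled parallel additions above map directly onto MMX intrinsics. The helper below is a sketch under the assumption that block_size is 8 and pixels are unsigned bytes; the function and variable names are ours:

#include <mmintrin.h>

/* One unrolled row addition: unpack 8 pixels to 16-bit lanes and add them
   to two 4-lane accumulators (8 rows x 255 max = 2040, so 16 bits suffice).
   Assumes 8-byte-aligned rows. */
static void accumulate_row(__m64 *acc_lo, __m64 *acc_hi,
                           const unsigned char *row)
{
    __m64 pixels = *(const __m64 *)row;   /* load 8 pixels at once */
    __m64 zero   = _mm_setzero_si64();
    *acc_lo = _mm_add_pi16(*acc_lo, _mm_unpacklo_pi8(pixels, zero)); /* temp[0..3] */
    *acc_hi = _mm_add_pi16(*acc_hi, _mm_unpackhi_pi8(pixels, zero)); /* temp[4..7] */
}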
Figure 4 Breakdowns of CPU time in our watermark detection scheme: (a) C implemen-
tation and (b) MMX technology optimized implementation. We divide our watermark
detection scheme into five major parts: (1) compute reduced-resolution images, (2) com-
pute image-dependent pseudo-random noises, (3) extract spatial-domain watermark, (4)
extract frequency-domain watermark, and (5) function call overheads. In the C implemen-
tation, about half of the CPU time is spent in the module that calculates low-resolution
images. After we optimize this module with MMX technology, this module is no longer
the major bottleneck. We achieve a 2.3× speedup in this module, which makes the speedup of our whole watermark detection 1.4×.
After this MMX technology optimization, the subroutine that calculates the
reduced-resolution image takes only 27.5% of the CPU time, as shown in Figure
4b. Also, after we optimize this module with MMX technology, the execution
time is distributed more evenly to the other modules so this module is no longer
the major bottleneck. We achieve a 2.3× speedup in this module.
In this section, we will show two more principles of MMX technology optimiza-
tion by using the MPEG-4 pixel padding procedure as an example. The second
rule of MMX technology optimization is to transform conditional executions into
logic operations. The first rule of MMX technology optimization is to execute
multiple identical operations in one instruction, as shown in Section 2.3. How-
ever, there are conditional operations like the following:
MMX instructions can load or store multiple operands if operands are placed in
a row. However, multiple operands in a column are harder to access. For faster
execution, we should arrange data to be processed in a row-major order or change
the algorithm. On the other hand, for some two-dimensional image/video pro-
cessing operations, an algorithm needs to process data in both directions. In Sec-
tion 3.4, a matrix-transpose procedure will be demonstrated to provide the flexi-
bility of choosing row-major or column-major processing.
functionality in MPEG-1 and MPEG-2, the new MPEG-4 video coding standard
supports arbitrary shaped video objects [26]. In MPEG-4, the video input is no
longer considered as a rectangular region. On the other hand, similar to the
MPEG-1 and MPEG-2, the MPEG-4 video coding scheme processes the succes-
sive images of a VOP (video object plane) sequence in a block-based manner (e.g., motion estimation, motion compensation, DCT, IDCT). Therefore, before motion estimation, motion compensation, and DCT, nonobject pixels of contour macroblocks (which contain the shape edge of an object) are filled using the padding technique. The padding operation turns out to be a computationally complex and irregular operation [27]. In the following example, we have created a new procedure using MMX technology to speed up the MPEG-4 padding process by 1.5× to 2.0×.
First, for each arbitrarily shaped object, a minimum-sized rectangular bounding box is defined. The box is divided and expanded into an array of macroblocks, with an integral number of macroblocks in the horizontal and vertical directions. Because of the arbitrary shape of the object, not all pixels inside this
bounding box contain valid object pixel values. There are macroblocks that lie
completely inside the object, macroblocks that lie completely outside, and mac-
roblocks that cover the border of the video object, as shown in Figure 5. Mac-
roblocks that lie inside the object remain untouched. Macroblocks that cover the
object boundary are filled using the repetitive padding algorithm.
Padding is accomplished by copying the pixels that lie on the edge of the mask outward. First, the pixels are padded in the horizontal direction, with bound-
ary pixels propagated both leftward and rightward. On the second pass, pixels
are padded in the vertical direction. In both cases, if a pixel that lies outside of
the mask is bounded by two masked pixels on opposite sides, the unmasked pixel
should be assigned the average of both bounding pixels.
Figure 5 Bounding box and macroblocks of an arbitrary shaped video object: (a) outside
the object, (b) inside the object, and (c) on the boundary.
A (original)      B (after horizontal padding)   C (after vertical padding)
 *  *  *  40       40 40 40 40                    40 40 40 40
 *  *  *   *        *  *  *  *                    29 29 30 36
 * 17 19  32       17 17 19 32                    17 17 19 32
20  *  *  14       20 16 16 14                    20 16 16 14

A shows the original 4 × 4 matrix. Pixels labeled with an asterisk are outside of the pixel mask. B shows the matrix after the horizontal padding stage; pixels in bold are the changed values. C shows the final matrix after the vertical padding stage.
Figure 7 A simplified case of the pixel padding procedure. We assume that we are performing vertical padding without averaging pixel values.
is equivalent to
if (mask[i + 1][j] == 0)
    pixel[i + 1][j] = pixel[i][j];
This can be done easily in MMX technology because the above algorithm can be executed without any knowledge of the pixel or mask values. Note that there are no branch statements. This algorithm can be sped up using MMX instructions by computing all eight pixels in a row concurrently. This simplified pixel padding procedure can be coded as follows.
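A sketch of such a routine using MMX intrinsics is shown below; it is an illustrative reconstruction (all helper and variable names are ours), not the original listing. A byte-wise compare against zero builds a selection mask, and AND/ANDNOT/OR operations then pick, for each of eight pixels at once, either its own value or the value above it, with no branches:

#include <mmintrin.h>

/* Propagate row i into row i+1 wherever the mask of row i+1 is zero.
   Assumes width is a multiple of 8 and 8-byte-aligned rows. */
static void pad_row_down(const unsigned char *pixel_above,
                         unsigned char *pixel_below,
                         const unsigned char *mask_below, int width)
{
    for (int j = 0; j < width; j += 8) {
        __m64 above = *(const __m64 *)(pixel_above + j);
        __m64 below = *(const __m64 *)(pixel_below + j);
        __m64 m     = *(const __m64 *)(mask_below + j);
        /* sel = 0xFF where mask byte == 0, else 0x00 */
        __m64 sel = _mm_cmpeq_pi8(m, _mm_setzero_si64());
        /* keep 'below' where mask != 0, take 'above' where mask == 0 */
        *(__m64 *)(pixel_below + j) =
            _mm_or_si64(_mm_and_si64(sel, above),
                        _mm_andnot_si64(sel, below));
    }
    _mm_empty();   /* clear MMX state before any floating-point code */
}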
Figure 8 Our pixel padding procedure contains four parts: (1) transpose of the block,
(2) vertical padding, (3) transpose of the block, and (4) vertical padding.
extension of the high-order elements of the destination operand. In this case, the
instructions can convert data into a higher-precision representation.
In addition to gathering data from different memory locations, interleaving
planar and duplicating data, the unpacking/packing instructions can transpose
rows and columns of data. Figure 11 illustrates a method for performing 4 × 4
transpositions on 16-bit packed values [28]. The basic idea behind this method
Figure 9 An example of our pixel padding procedure: (a) original block, (b) transpose
of the original block, (c) vertical padding, (d) transpose of the vertically padded transposed
block (which is equivalent to the horizontally padded block), and (e) vertical padding.
follows. First, we collect the higher-order data into a set of registers. Then, collecting the higher-order data from those registers is equivalent to collecting the highest-order data from each row; that is, the data originally in a column now reside in an MMX register. Performing 8 × 8 transpositions on 8-bit packed values is left as an exercise for the reader.
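The 4 × 4 transpose of Figure 11 can be sketched with the unpack intrinsics as follows; r0–r3 hold the four input rows of 16-bit values, and the names are our illustration of the cited method [28]:

#include <mmintrin.h>

static void transpose4x4_pi16(__m64 r0, __m64 r1, __m64 r2, __m64 r3,
                              __m64 *c0, __m64 *c1, __m64 *c2, __m64 *c3)
{
    __m64 t0 = _mm_unpacklo_pi16(r0, r1);  /* a0 b0 a1 b1 */
    __m64 t1 = _mm_unpacklo_pi16(r2, r3);  /* c0 d0 c1 d1 */
    __m64 t2 = _mm_unpackhi_pi16(r0, r1);  /* a2 b2 a3 b3 */
    __m64 t3 = _mm_unpackhi_pi16(r2, r3);  /* c2 d2 c3 d3 */
    *c0 = _mm_unpacklo_pi32(t0, t1);       /* a0 b0 c0 d0 */
    *c1 = _mm_unpackhi_pi32(t0, t1);       /* a1 b1 c1 d1 */
    *c2 = _mm_unpacklo_pi32(t2, t3);       /* a2 b2 c2 d2 */
    *c3 = _mm_unpackhi_pi32(t2, t3);       /* a3 b3 c3 d3 */
}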
We measure the performance of the MMX technology-optimized imple-
mentation. Our simulation results show that the new pixel padding routines run
between 1.5 and 2 times faster than the original scalar-instruction-only version.
The fourth rule of MMX technology optimization is to reduce shuffling and maximize the grouping of operations into one instruction. In this section, we demonstrate this rule by optimizing the SA-DCT (shape-adaptive discrete cosine transform) and the SA-IDCT (shape-adaptive inverse discrete cosine transform). The 8 × 8
SA-DCT applies one-dimensional N-point DCTs, defined by

$$ y_n = c_n \sum_{k=0}^{N-1} x_k \cos\!\left(\frac{n(2k+1)\pi}{2N}\right) $$

where

$$ c_0 = 1/\sqrt{N} \qquad\text{and}\qquad c_n = \sqrt{2/N} \quad\text{for } n = 1, \ldots, N-1. $$
For N = 2, the transform reduces to

$$ \begin{bmatrix} y_0 \\ y_1 \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} $$
where x i is the input data and y i is the transformed data. In conventional algorith-
mic optimization, we minimize the number of additions and multiplications.
Thus, we define
z0 = x0 + x1
z1 = x0 - x1

Then,

y0 = z0/√2
y1 = z1/√2
In this way, we need only two additions and two multiplications instead of
two additions and four multiplications. The following is the C code for this
algorithm.
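A fixed-point C sketch of this algorithm follows (our reconstruction, not the original listing); the Q15 constant 23170 ≈ (1/√2) · 2^15 is inferred from the rounding-and-15-bit-shift sequence in the assembly version below:

void dct2_c(const short *in, short *out)
{
    const int f0 = 23170;                  /* 1/sqrt(2) in Q15 */
    int z0 = in[0] + in[1];
    int z1 = in[0] - in[1];
    out[0] = (short)((z0 * f0 + (1 << 14)) >> 15);   /* y0 = z0 / sqrt(2) */
    out[1] = (short)((z1 * f0 + (1 << 14)) >> 15);   /* y1 = z1 / sqrt(2) */
}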
_asm {
    mov      eax, in
    mov      ecx, out
    movd     mm0, [eax]            // mm0 = xx, xx, i1, i0
    pshufw   mm1, mm0, 01000100b   // mm1 = i1, i0, i1, i0
    pmaddwd  mm1, xstatic1         // mm1 = i0*f0 - i1*f0, i0*f0 + i1*f0
    paddd    mm1, rounding         // do proper rounding
    psrad    mm1, 15
    packssdw mm1, mm7              // mm1 = x, x, o1, o0
    movd     [ecx], mm1
}
}
For N = 4, the transform is

$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix} =
\begin{bmatrix}
\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\
\sqrt{\tfrac{2}{4}}\cos\tfrac{\pi}{8} & \sqrt{\tfrac{2}{4}}\cos\tfrac{3\pi}{8} & \sqrt{\tfrac{2}{4}}\cos\tfrac{5\pi}{8} & \sqrt{\tfrac{2}{4}}\cos\tfrac{7\pi}{8} \\
\sqrt{\tfrac{2}{4}}\cos\tfrac{2\pi}{8} & \sqrt{\tfrac{2}{4}}\cos\tfrac{6\pi}{8} & \sqrt{\tfrac{2}{4}}\cos\tfrac{10\pi}{8} & \sqrt{\tfrac{2}{4}}\cos\tfrac{14\pi}{8} \\
\sqrt{\tfrac{2}{4}}\cos\tfrac{3\pi}{8} & \sqrt{\tfrac{2}{4}}\cos\tfrac{9\pi}{8} & \sqrt{\tfrac{2}{4}}\cos\tfrac{15\pi}{8} & \sqrt{\tfrac{2}{4}}\cos\tfrac{21\pi}{8}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
$$
This product can be factored as

$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix} =
\sqrt{\tfrac{1}{2}}
\begin{bmatrix} 1&0&0&0 \\ 0&0&1&0 \\ 0&1&0&0 \\ 0&0&0&1 \end{bmatrix}
\begin{bmatrix}
\tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} & 0 & 0 \\
\tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} & 0 & 0 \\
0 & 0 & \cos\tfrac{3\pi}{8} & \cos\tfrac{\pi}{8} \\
0 & 0 & -\cos\tfrac{\pi}{8} & \cos\tfrac{3\pi}{8}
\end{bmatrix}
\begin{bmatrix} 1&0&0&1 \\ 0&1&1&0 \\ 0&1&-1&0 \\ 1&0&0&-1 \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
\tag{1}
$$
To save multiplications, we change

$$ \begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \end{bmatrix} \qquad\text{to}\qquad \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. $$
Equation (1) can be expressed as follows:
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix} =
\sqrt{\tfrac{1}{2}}
\begin{bmatrix} 1&0&0&0 \\ 0&0&1&0 \\ 0&1&0&0 \\ 0&0&0&1 \end{bmatrix}
\begin{bmatrix}
\frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} & 0 \\
0 & \begin{bmatrix} \cos\tfrac{3\pi}{8} & \cos\tfrac{\pi}{8} \\ -\cos\tfrac{\pi}{8} & \cos\tfrac{3\pi}{8} \end{bmatrix}
\end{bmatrix}
\begin{bmatrix} 1&0&0&1 \\ 0&1&1&0 \\ 0&1&-1&0 \\ 1&0&0&-1 \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
$$
_asm {
    mov      eax, in
    mov      ecx, out
    movq     mm0, [eax]            // i3, i2, i1, i0
    pshufw   mm1, mm0, 00011011b   // i0, i1, i2, i3
    movq     mm2, mm1
    paddsw   mm2, mm0              // b0, b1, b1, b0
    psubsw   mm0, mm1              // -b3, -b2, b2, b3
    pmaddwd  mm2, xstatic1         // o1 << 15, o0 << 15
    pmaddwd  mm0, xstatic2         // o3 << 15, o2 << 15
    paddd    mm2, rounding         // proper rounding
    paddd    mm0, rounding         // proper rounding
    psrad    mm2, 15
    psrad    mm0, 15
    packssdw mm2, mm0              // o3, o1, o2, o0
    pshufw   mm3, mm2, 11011000b   // o3, o2, o1, o0
    movq     [ecx], mm3
}
}
The rest of the N-point DCTs and IDCTs are left as an exercise for the reader. Our final implementation speeds up the SA-DCT/SA-IDCT process by 1.1–1.5 times in the MPEG-4 VOP-based coding scheme. The MMX technology versions of
the N-point DCTs performed from 1.3 to 3.0 times faster than the fixed-point
versions, as shown in Table 3.
We also compare the performance of our MMX technology-optimized SA-DCT/SA-IDCT implementation with that of an MMX technology-optimized 8 × 8 DCT/IDCT. The SA-DCT is 1.5 times faster than the 8 × 8 DCT, even with the mask-shifting overhead.* The eight-point DCT is slower than the lower-order DCTs.
* Assuming that all N-point routines are called with equal probability.
5 CONCLUSIONS
ACKNOWLEDGMENT
The authors would like to thank James C. Abel, Intel Corporation, for his exten-
sive and precious suggestions in the early stage of this work.
REFERENCES
1. RE Owen, D Martin. A uniform analysis method for DSP architectures and instruc-
tion sets with a comprehensive example. Proceedings of IEEE Workshop on Signal
Processing Systems, 1998, pp 528–537.
2. A Peleg, U Weiser. The MMX technology extension to the Intel architecture. IEEE
Micro, 16(4):42–50, 1996.
3. Intel Corp. Intel Architecture MMX Technology Developer’s Manual. IL: Intel
Corporation, 1996. (Order No. 243006-001.)
4. Intel Corp. Intel Architecture MMX Technology Programmer’s Reference Man-
ual. IL: Intel Corporation, 1996. (Order No. 243007-002.)
5. Intel Corp. Intel Architecture Optimization Manual. IL: Intel Corporation, 1997.
6. Intel Corp. Intel Architecture Software Developer’s Manual. IL: Intel Corporation,
1997. (Order No. 243191.)
7. Intel Corp. Intel Streaming SIMD Extensions Application Notes. IL: Intel Corporation (available on-line: http://developer.intel.com/software/products/itc/strmsimd/sseappnots.htm).
8. S Thakkar, T Huff. Internet Streaming SIMD Extensions. IEEE Computer 32(12):
26–34, 1999.
9. C Dichter, GJ Hinton. Optimizing for the Willamette processor. Intel Developer
UPDATE Mag 6:3–5, March 2000 (http://developer.intel.com/update/departments/initech/it03003.pdf).
10. Intel Corp. IA-32 Intel Architecture Software Developer’s Manual Volume 1: Basic
Architecture. IL: Intel Corporation; 2000. (Order No. 245470.)
11. Intel Corp. IA-32 Intel Architecture Software Developer’s Manual Volume 2: In-
struction Set Reference. IL: Intel Corporation, 2000. (Order No. 245471.)
12. M Atkins, R Subramanism. PC software performance tuning. IEEE Computer 29(9):
47–54, 1996.
13. Y-K Chen, SY Kung. Multimedia signal processors: An architectural platform with
algorithmic compilation. J VLSI Signal Process Syst 20(1/2):183–206, 1998.
14. SY Kung. VLSI Array Processor, Englewood Cliffs, NJ: Prentice-Hall, 1988.
15. M Bierling. Displacement estimation by hierarchical block matching. Proceedings
of SPIE Visual Communications and Image Processing, 1988, vol. 1001, pp 942–
951.
16. J Ju, Y-K Chen, SY Kung. A fast algorithm for very low bit rate video coding. IEEE
Trans Circuits Syst Video Technol 9(7):994–1002, 1999.
17. M Fomitchev. MMX technology code optimization. Dr. Dobb’s J 303:38–48, Sep-
tember 1999.
18. J Khazam, B Bachmayer. Programming strategies for Intel’s MMX. BYTE 21(8):
63–64, 1996.
19. JE Lecky. Using MMX technology to speed up machine vision algorithms. Imag-
ing Technology Tutorials (available on-line: http://www.imaging.com/tutorials/00000009/tutorial.html).
20. R Coelho, M Hawash. DirectX, RDX, RSX, and MMX Technology: A Jumpstart
Guide to High Performance APIs. Reading, MA: Addison-Wesley, 1998.
1 BACKGROUND
Thus, ((a, b)), ((d, f ), ( f, e), (e, f ), ( f, e)), ((b, c), (c, a), (a, b)), and ((a, b), (b,
h)) are examples of paths in Figure 1.
We say that a path p = (e1, e2, . . . , en) originates at the vertex src(e1) and terminates at snk(en), and we write

edges(p) = {e1, e2, . . . , en}   (4)
vertices(p) = {src(e1), src(e2), . . . , src(en), snk(en)}
A cycle is a path that originates and terminates at the same vertex. A cy-
cle (e 1 , e 2 , . . . , e n) is a simple cycle if src(e i) ≠ src(e j ) for all i ≠ j. In Figure 1,
((c, a), (a, b), (b, c)), ((a, b), (b, c), (c, a)), and (( f, e), (e, f )) are examples of
simple cycles. The path ((d, f ), ( f, e), (e, f ), ( f, e), (e, d )) is a cycle that is not
a simple cycle.
By a subgraph of a directed graph G ⫽ (V, E ), we mean the directed graph
formed by any subset V′ ⊆ V together with the set of edges {e ∈ E|(src(e), snk(e) ∈
V′)}. For example, the directed graph
Example 1
A simple example of an SDF abstraction is shown in Figure 2. Here, each edge
is annotated with the numbers of data values produced and consumed by the
source and sink actors, respectively. For example, prd((B, C)) = 1 and cns((B, C)) = 2. The ‘‘2D’’ next to the edge (D, E) represents two units of delay. Thus, del((D, E)) = 2.
Example 2
Consider again the SDF graph of Figure 2. The repetitions vector of this graph
is given by
q(A, B, C, D, E) = (10, 2, 1, 1, 2)   (8)
Additionally, we have TNSE_G((A, D)) = 10 and TNSE_G((B, C)) = 2.
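These numbers can be checked against the SDF balance equation, which must hold for every edge e of a consistent graph:

$$ q(\operatorname{src}(e)) \cdot \operatorname{prd}(e) = q(\operatorname{snk}(e)) \cdot \operatorname{cns}(e) = \operatorname{TNSE}_G(e) $$

For example, for e = (B, C): q(B) · prd((B, C)) = 2 · 1 = 2 and q(C) · cns((B, C)) = 1 · 2 = 2, in agreement with TNSE_G((B, C)) = 2.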
If a repetitions vector exists for an SDF graph but a valid schedule does
not exist, then the graph is deadlocked. Thus, an SDF graph is consistent if and
only if a repetitions vector exists and the graph is not deadlocked. For example,
if we reduce the number of delays on the edge (D, E ) in Figure 2 (without adding
delay to any of the other edges), then the graph will become deadlocked.
In summary, SDF is currently the most widely used data flow model in
commercial and research-oriented DSP design tools. Although SDF has limited
expressive power, the model has proven to be of great practical value in the
domain of signal processing and digital communication. SDF encompasses a
broad and important class of applications, including modems, digital audio broad-
casting systems, video encoders, multirate filter banks, and satellite receiver sys-
tems, just to name a few [2,19–23]. Commercial tools that employ SDF semantics
include Simulink by The MathWorks, SPW by Cadence, and ADS by Hewlett-Packard.
and similarly,

$$ \operatorname{cns}(e') = \sum_{i=1}^{\tau(A)} C_{e',i} \tag{9} $$
Example 4
As an example of increased flexibility in expressing actor interactions, consider
the CSDF specification illustrated in Figure 3. This specification represents a
recursive digital filter computation of the form
A fundamental task in synthesizing hardware and software from a data flow speci-
fication is that of scheduling, which, as described in Section 2.2, refers to the
process of determining the order in which actors will be executed. During co-
synthesis, it is often desirable to obtain efficient, parallel implementations, which
execute multiple actor firings simultaneously on different resources.
For this purpose, the class of ‘‘valid schedules’’ introduced in Section 2.2
is not sufficient; multiprocessor schedules, which consist of multiple firing se-
quences—one for each processing resource—are required. However, the consis-
tency concepts developed in Section 2.2 are inherent to SDF specifications and
apply regardless of whether or not parallel implementation is used. In particular,
when performing static, multiprocessor scheduling of SDF graphs, it is still neces-
sary to first compute the repetitions vector and to verify that the graph is deadlock-
free, and the techniques for accomplishing these objectives are no different for
the multiprocessor case.
Example 5
As a simple example, Figure 5 shows an SDF graph and its associated APG.
$$ \lambda_G(A, f_v, f_e) = \max\!\left(\left\{ \sum_{i=1}^{n} f_v(\operatorname{snk}(e_i)) + \sum_{i=1}^{n} f_e(e_i) + f_v(A) \;\middle|\; (e_1, e_2, \ldots, e_n) \text{ is a path in } G \text{ that originates at } A \right\}\right) $$

Under this formulation, the priority of an actor is taken to be the associated value of λ_G(∗, f_v, f_e); in other words, the priority list for list scheduling is constructed in decreasing order of the metric λ_G(∗, f_v, f_e).
Example 6
If actor execution times are constant, f_v(A) is taken to be the execution time of A, and f_e is taken to be the zero function on E [f_e(e′) = 0 for all e′ ∈ E], then λ_G(A, f_v, f_e) is simply the total execution time along the longest path originating at A.
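In that special case, the metric can be computed without enumerating paths: one pass in reverse topological order suffices. The following C sketch illustrates this; the array-based DAG representation and all names are ours:

/* level[v] = exec_time[v] + max over successors w of level[w].
   rev_topo lists vertices so that every successor appears before
   its predecessors, so level[w] is final when v is visited. */
void static_levels(int n, const int *exec_time,
                   int **succ, const int *nsucc,
                   const int *rev_topo, int *level)
{
    for (int idx = 0; idx < n; idx++) {
        int v = rev_topo[idx];
        int best = 0;
        for (int k = 0; k < nsucc[v]; k++) {
            int w = succ[v][k];
            if (level[w] > best) best = level[w];
        }
        level[v] = exec_time[v] + best;
    }
}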
4.1 GCLP
The global criticality, local phase (GCLP) algorithm [43], developed by Kala-
vade and Lee, gives an approach for combined hardware/software partitioning
and scheduling for minimum latency. Input to the algorithm includes an applica-
tion graph G ⫽ (V, E ), a target platform consisting of a programmable processor
and a fabric for implementing custom hardware, and constraints on the latency
and on the code size of the software component. Each actor A ∈ V is characterized
by its execution time t_h(A) and area a_h(A) if implemented in hardware, and by its execution time t_s(A) and code size a_s(A) if implemented in software. The
GCLP algorithm attempts to compute a mapping of graph actors into hardware
and software and a schedule for the mapped actors. The objective is to minimize
the area of the custom hardware subject to the constraints on latency and software
code size.
At each iteration i of the algorithm, a ready actor is selected for mapping
and scheduling based on a dynamic priority function P_i : V → ℕ that takes into
account the relative difficulty (time criticality) in achieving the latency constraint
based on the partial schedule S i constructed so far. Increasing levels of time criti-
cality translate to increased affinity for hardware implementation in the compu-
tation P i of actor priorities. Because it incorporates the structure of the entire
$$ C_g(i) = \frac{\sum_{A \in H_i} \operatorname{ElemOps}(A)}{\sum_{A \in U_i} \operatorname{ElemOps}(A)} \tag{14} $$
$$ R(\theta) = \sum_{(V,E) \in \operatorname{appset}} \left| \{ A \in V \mid \operatorname{type}(A) = \theta \} \right| \tag{19} $$

and the normalized form of this metric, which we denote R_N, is defined by normalizing to values restricted within [0, 1]:

$$ R_N(\theta) = \frac{R(\theta)}{\max(\{ R(\operatorname{type}(A)) \mid A \in V_{\operatorname{appset}} \})} \tag{20} $$
Performance-area trade-off information is quantified by a metric T that
measures the speedup in moving an actor implementation from software to hard-
ware relative to the required hardware area:
$$ T(A) = \frac{t_s(A) - t_h(A)}{a_h(A)} \quad \text{for each } A \in V_{\operatorname{appset}} \tag{21} $$
The normalized form of this metric, T N , is defined in a fashion analogous to Eq.
(20) to again obtain a value within [0, 1].
$$ \operatorname{criticality}(G_i) = \frac{\sum_{v \in V_i} t_s(v)}{L_i} \tag{24} $$
Intuitively, an application with high criticality requires a large amount of hard-
ware area to satisfy its latency constraint and thus makes it more difficult to meet
the minimization objective of cosynthesis.
Version GCLP-MF-B operates by processing application graphs in decreas-
ing order of their criticality, keeping track of interapplication resource-sharing
possibilities throughout the cosynthesis process, and systematically incorporat-
4.2 COSYN
Optimal or nearly optimal hardware/software cosynthesis solutions are difficult
to achieve because there are numerous relevant implementation considerations
and constraints. The COSYN algorithm [45], developed by Dave et al., takes
numerous elements of this complexity into account. The design considerations
and objectives addressed by the algorithm include allowing arbitrary, possibly
heterogeneous collections of processors and communication links, intraprocessor
concurrency (e.g., in FPGAs and ASICs), pre-emptive versus non-pre-emptive
scheduling, actor duplication on multiple processors to alleviate communication
bottlenecks, memory constraints, average, quiescent and peak power dissipation
in processing elements and communication links, latency (in the form of actor
deadlines), throughput (in the form of subgraph initiation rates), and overall dollar
cost, which is the ultimate minimization objective.
$$ \lambda_G(A, f_t, f_c) \tag{26} $$

$$ f_c(e) = \begin{cases} 0 & \text{if } e \in \text{subsumed} \\ \max(\{ t_c(e, c_i) \mid c_i \in C \text{ and } t_c(e, c_i) < \infty \}) & \text{otherwise} \end{cases} \tag{28} $$
Here, subsumed denotes the set of edges in E that have been ‘‘enclosed’’ by the
clusters created by all previous clustering operations; that is, the set of edges e
such that src(e) and snk(e) have already been clustered and both belong to the
same cluster.
At each clustering step, an unclustered actor A that maximizes λ G (∗, f t , f c)
is selected, and based on certain compatibility criteria, A is first either merged
into the cluster of a predecessor actor or inserted into a new cluster, and then
the resulting cluster may be further expanded to contain a successor of A.
$$ \lambda_G(A, g_t, g_c) \tag{29} $$

where g_t : V → Z and g_c : E → Z are defined by

$$ g_c(e) = \begin{cases} f_c(e) & \text{if } \operatorname{asgn}(e) = \text{NULL} \\ t_c(e, \operatorname{asgn}(e)) & \text{otherwise} \end{cases} \tag{31} $$

$$ t_{\operatorname{best}}(v) = \begin{cases} \min(\{ t_e(v, r_i) \mid r_i \in R \}) & \text{if } \operatorname{asgn}(v) = \text{NULL} \\ t_e(v, \operatorname{asgn}(v)) & \text{otherwise} \end{cases} \tag{32} $$

$$ t_{\operatorname{best}}(e) = \begin{cases} \min(\{ t_c(e, c_i) \mid c_i \in C \}) & \text{if } \operatorname{asgn}(e) = \text{NULL} \\ t_c(e, \operatorname{asgn}(e)) & \text{otherwise} \end{cases} \tag{33} $$
The worst-case latencies, denoted t_worst(v) and t_worst(e), are defined (using the same minor abuse of notation) in a similar fashion.
From these best- and worst-case latencies, allocation-conscious best- and worst-case finish-time estimates F_best and F_worst of each actor and each edge are computed by

F_best(v) = max({F_best(e_in) + t_best(v) | e_in ∈ in(v)}), and   (34)
F_worst(v) = max({F_worst(e_in) + t_worst(v) | e_in ∈ in(v)}) for v ∈ V;   (35)
F_best(e) = F_best(src(e)) + t_best(e), and   (36)
F_worst(e) = F_worst(src(e)) + t_worst(e) for e ∈ E   (37)
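Because Eqs. (34)–(37) reference only predecessors, both estimates follow from a single pass in topological order. Below is a C sketch of the best-case computation, with an illustrative array-based representation (the worst case is identical with t_worst and F_worst):

/* One forward pass computing F_best for every vertex and edge.
   topo_order lists vertices so that src(e) precedes snk(e). */
void finish_times_best(int num_vertices, const int *topo_order,
                       int **in_edges, const int *num_in,
                       int **out_edges, const int *num_out,
                       const int *t_best_v, const int *t_best_e,
                       int *F_best_v, int *F_best_e)
{
    for (int i = 0; i < num_vertices; i++) {
        int v = topo_order[i];
        int max_in = 0;                        /* vertices with no inputs start at 0 */
        for (int k = 0; k < num_in[v]; k++) {  /* Eq. (34) */
            int e = in_edges[v][k];
            if (F_best_e[e] > max_in) max_in = F_best_e[e];
        }
        F_best_v[v] = max_in + t_best_v[v];
        for (int k = 0; k < num_out[v]; k++) { /* Eq. (36) */
            int e = out_edges[v][k];
            F_best_e[e] = F_best_v[v] + t_best_e[e];
        }
    }
}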
The worst-case and best-case finish times, as computed by Eqs. (34–37),
are used in evaluating the quality of a candidate allocation. Let V deadline ⊆ V denote
the subset of actors for which deadlines are specified; let α denote the set of
candidate allocations for a selected cluster; and let α′ ⊆ α be the set of candidate
allocations for which all actors in V deadline have their corresponding deadlines satis-
fied in the best case (i.e., according to {F best (v)}). If α′ ≠ ∅, then an allocation
is chosen from the subset α′ that maximizes the sum
$$ \sum_{v \in V_{\operatorname{deadline}}} F_{\operatorname{worst}}(v) \tag{38} $$
of worst-case finish times over all actors for which prespecified deadlines exist.
On the other hand, if α′ ⫽ ∅, then an allocation is chosen from α that maximizes
the sum
$$ \sum_{v \in V_{\operatorname{deadline}}} F_{\operatorname{best}}(v) \tag{39} $$
of best-case finish times over all actors for which deadlines exist. In both cases,
the maxima over the respective sets of sums are taken because they ultimately
lead to final allocations that have lower overall dollar cost [45].
4.3 CodeSign
As part of the CodeSign project at ETH Zurich, Blickle et al. have developed
a search technique for hardware/software cosynthesis [46] that is based on the
framework of evolutionary algorithms. In evolutionary algorithms, complex
search spaces are explored by encoding candidate solutions as ‘‘chromosomes’’ and evolving ‘‘populations’’ of these chromosomes by applying the principles of reproduction (retention of chromosomes in a population), crossover (derivation of new chromosomes from two or more ‘‘parent’’ chromosomes), and mutation (random alteration of chromosomes).
4.3.1 Specifications
A key innovation in the CodeSign approach is a novel formulation of joint alloca-
tion, assignment, and scheduling as mappings between sequences of graphs and
‘‘activations’’ of vertices and edges in these graphs. This formulation is intu-
itively appealing and provides a natural encoding structure for embedding within
the framework of evolutionary algorithms.
The central data structure that underlies the CodeSign cosynthesis formula-
tion is the specification. A CodeSign specification can be viewed as an ordered pair S = (H_S, M_S), where H_S = {G_1, G_2, . . . , G_N} and M_S = {M_1, M_2, . . . , M_{N-1}}; each G_i is a directed graph (called a ‘‘dependence graph’’) (V_i, E_i), and each M_i is a set of mapping edges that connect vertices in successive dependence graphs (i.e., for each e ∈ M_i, src(e) ∈ V_i and snk(e) ∈ V_{i+1}). If the specification in question is understood, we
write
$$ V_H = \bigcup_{i=1}^{N} V_i, \qquad E_H = \bigcup_{i=1}^{N} E_i, \qquad E_M = \bigcup_{i=1}^{N-1} M_i \tag{43} $$
Thus, V_H and E_H denote the sets of all dependence-graph vertices and edges, respectively, and E_M denotes the set of all mapping edges. The specification graph of S is the graph G_S = (V_S, E_S) obtained by integrating all of the dependence graphs and mapping edges: V_S = V_H and E_S = (E_H ∪ E_M).
The ‘‘top-level’’ dependence graph (the problem graph) G 1 gives a behav-
ioral specification of the application to be implemented. In this sense, it is similar
to the application graph concept defined in Section 3.1. However, it is slightly
different in its incorporation of special communication vertices that explicitly
represent interactor communication and are ultimately mapped onto communica-
tion resources in the target architecture [46].
The remaining dependence graphs G 2, G 3, . . . , G N specify different levels
of abstraction or refinement during implementation. For example, a dependence
graph could specify an architectural description consisting of available resources
for computation and communication (architecture graph) and another depen-
dence graph could specify the decomposition of a target system into integrated
circuits and off-chip buses (chip graph). Due to the general nature of the
CodeSign specification formulation, there is full flexibility to define alternative
or additional levels of abstraction in this manner.
Example 7
Figure 6a provides an illustration of a CodeSign specification for hardware/soft-
ware cosynthesis onto an architecture that consists of a programmable processor
resource P S , a resource for implementing custom hardware P H , and a bidirectional
bus B that connects these two processing resources. The v i’s denote problem
graph actors and the c i’s denote communication vertices. Here, only hardware
implementation is allowed for v 2 and v 5, only software implementation is allowed
for v 3, and v 1 and v 4 may each be mapped to either hardware or software. Thus,
for example, there is no edge connecting v 2 or v 5 with the vertex P S associated
with the programmable processor. In general, communication vertices can be
mapped either to the bus B (if the source and sink vertices are mapped to different
processing resources) or internally to either the hardware (P H ) or software (P S)
resource (if the source and sink are mapped to the same processing resource).
However, mapping restrictions of the problem graph actors may limit the possible
mapping targets of a communication vertex. For example, because v 2 and v 3 are
restricted respectively to hardware and software implementations, communica-
tion vertex c 2 must be mapped to the bus B. Similarly, c 3 can be mapped to P s
or B, but not to P H . The set of mapping edges for this example is given by
E_M = {(v_1, P_H), (v_1, P_S), (v_2, P_H), (v_3, P_S), (v_4, P_H), (v_4, P_S), (v_5, P_H), (c_1, B), (c_1, P_S), (c_2, B), (c_3, B), (c_3, P_S), (c_4, B)}   (44)
where start (v, k) and end (v, k) respectively represent the times at which firing
k of actor v begins execution and completes execution.
Initially, the synchronization graph G_s is identical to G_ipc. However, various transformations can be applied to G_s in order to make the overall synchronization structure more efficient. After all transformations on G_s are complete, G_s and G_ipc can be used to map the given parallel schedule into an implementation on the target architecture. The IPC edges in G_ipc represent buffer activity and are implemented as buffers in shared memory, whereas the synchronization edges of G_s represent synchronization constraints and are implemented by updating and testing flags in shared memory. If there is an IPC edge as well as a synchronization edge between the same pair of actors, then a synchronization protocol is executed before the buffer corresponding to the IPC edge is accessed, to ensure sender–receiver synchronization. On the other hand, if there is an IPC edge between two actors in the IPC graph but no synchronization edge between the two, then no synchronization needs to be done before accessing the shared buffer. If there is a synchronization edge between two actors but no IPC edge, then no shared buffer is allocated between the two actors; only the corresponding synchronization protocol is invoked.
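As a concrete illustration of this division of labor, the following minimal C sketch pairs a shared-memory buffer (for an IPC edge) with a flag-based protocol (for a synchronization edge). The names (ipc_buffer, sync_flag) and the single-flag handshake are illustrative assumptions, not the exact protocol of the cited work.

#include <stdatomic.h>

#define BUF_SIZE 64

int ipc_buffer[BUF_SIZE];      /* shared buffer implementing an IPC edge     */
atomic_int sync_flag = 0;      /* shared flag implementing a sync edge       */

/* Sender side: write the data, then signal via the synchronization flag. */
void send_token(int idx, int value)
{
    ipc_buffer[idx] = value;             /* IPC edge: buffer in shared memory */
    atomic_store(&sync_flag, 1);         /* sync edge: update the flag        */
}

/* Receiver side: test the flag before touching the shared buffer. */
int receive_token(int idx)
{
    while (atomic_load(&sync_flag) == 0)
        ;                                /* busy-wait until the sender signals */
    atomic_store(&sync_flag, 0);
    return ipc_buffer[idx];              /* now safe to access the buffer      */
}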
Any transformation that we perform on the synchronization graph must respect the synchronization constraints implied by G_ipc. If we ensure this, then we only need to implement the synchronization edges of the optimized synchronization graph. If G_1 = (V, E_1) and G_2 = (V, E_2) are synchronization graphs with the same vertex set and the same set of intraprocessor edges (edges that are not synchronization edges), we say that G_1 preserves G_2 if for all e ∈ E_2 such that e ∉ E_1, we have ρ_{G_1}(src(e), snk(e)) ≤ del(e), where ρ_G(x, y) = ∞ if there is no path from x to y in the synchronization graph G and, if there is a path from x to y, ρ_G(x, y) is the minimum over all paths p directed from x to y of the sum of the edge delays on p. The following theorem (developed in Ref. 48)
Example 8
Figure 8 shows an example of a redundant synchronization edge. The dashed edges in this figure are synchronization edges. Here, before executing actor D, the processor that executes {A, B, C, D} does not need to synchronize with the processor that executes {E, F, G, H}: due to the synchronization edge x_1, the corresponding firing of F is guaranteed to complete before each firing of D begins. Thus, x_2 is redundant.
The following result establishes that the order in which we remove redun-
dant synchronization edges is not important.
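The redundancy test behind this result can be sketched as a shortest-path computation over edge delays. The dense-graph Floyd–Warshall formulation and the constants below are illustrative assumptions, not the algorithm of the cited work:

#include <limits.h>

#define NV  16                /* number of vertices (illustrative)           */
#define INF (INT_MAX / 2)

/* delay[i][j] is initialized to the minimum edge delay from i to j with the
   candidate edge removed, INF if there is no edge, and 0 on the diagonal.
   After the call it holds the all-pairs minimum path delays rho(i, j). */
void min_path_delays(int delay[NV][NV])
{
    for (int k = 0; k < NV; k++)                 /* Floyd-Warshall           */
        for (int i = 0; i < NV; i++)
            for (int j = 0; j < NV; j++)
                if (delay[i][k] + delay[k][j] < delay[i][j])
                    delay[i][j] = delay[i][k] + delay[k][j];
}

/* A synchronization edge (s, t) with delay d is redundant if another path
   already enforces the constraint: rho(s, t) <= d. */
int is_redundant(int delay[NV][NV], int s, int t, int d)
{
    return delay[s][t] <= d;
}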
Example 9
Figure 9a shows a synchronization graph that arises from a two-processor schedule for a four-channel multiresolution quadrature mirror filter (QMF) bank, which has applications in signal compression. As in Figure 8, the dashed edges are synchronization edges. If we apply redundant synchronization removal to the synchronization graph of Figure 9a, we obtain the synchronization graph in Figure 9b; the edges (A_1, B_2), (A_3, B_1), (A_4, B_1), (B_2, E_1), and (B_1, E_2) are detected to be redundant, and the number of synchronization edges is reduced from 8 to 3 as a result.
5.2 Resynchronization
The goal of resynchronization is to introduce new synchronizations in such a way that the number of original synchronizations that become redundant exceeds the number of new synchronizations added.
THEOREM 3 Throughout the self-timed execution of an IPC graph G_ipc, the number of tokens on a feedback edge e of G_ipc is bounded.
6 BLOCK PROCESSING
Recall from Section 2.2 that DSP applications are characterized by groups of
operations that are applied repetitively on large, often unbounded, data streams.
Block processing refers to the uninterrupted repetition of the same operation (e.g.,
data flow graph actor) on two or more successive elements from the same data
stream. The scalable synchronous data flow (SSDF) model is an extension of SDF that enables software synthesis of vectorized implementations, which exploit the opportunities for efficient block processing and thus form an important component of the cosynthesis design space. The internal specification of an SSDF actor A assumes that the actor will be executed in groups of N_v(A) successive firings, which operate on (N_v(A) × cns(e))-unit blocks of data at a time from each incoming edge e. Block processing with well-designed SSDF actors reduces the rate of interactor context switching and of context switching between successive code segments within complex actors, and it may improve execution efficiency significantly on deeply pipelined architectures.
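To make this concrete, here is a hedged sketch of an actor written in the SSDF style; the actor body (a simple block sum), the names, and the token rates are illustrative assumptions:

#define CNS 4                 /* tokens consumed per firing, cns(e)          */
#define PRD 1                 /* tokens produced per firing                  */

/* One invocation covers nv = Nv(A) successive firings: per-activation
   overhead (state load/save, buffer pointer setup) is paid once per block
   instead of once per firing. */
void block_sum_actor(const float *in, float *out, int nv)
{
    for (int f = 0; f < nv; f++) {
        float acc = 0.0f;
        for (int k = 0; k < CNS; k++)
            acc += in[f * CNS + k];      /* consume cns(e) tokens            */
        out[f * PRD] = acc;              /* produce prd(e) tokens            */
    }
}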
At the Aachen University of Technology, as part of the COSSAP [27] soft-
ware synthesis environment for DSP (now developed by Synopsys), Ritz et al.
investigated the optimized compilation of SSDF specifications [53]. This work
has targeted the minimization of the context-switch overhead, or the average rate
at which actor activations occur. An actor activation occurs whenever two dis-
tinct actors are invoked in succession. Activation overhead includes saving the
contents of registers that are used by the next actor to invoke, if necessary, and
loading state variables and buffer pointers into registers.
For example, the schedule
(2(2B)(5A))(5C) (50)
results in five activations per schedule period. Parenthesized terms in Eq. (50) represent schedule loops, which are repetitive firing patterns that are to be translated into loops in the target code. More precisely, a parenthesized term of the form (n T_1 T_2 . . . T_m) specifies the successive repetition n times of the subschedule T_1 T_2 . . . T_m. Schedules that contain only one appearance of each actor, such as the schedule of Eq. (50), are referred to as single appearance schedules. Because of their code size optimality and because they have been shown to satisfy a number of useful formal properties [2], single appearance schedules have been the focus of a significant component of work in DSP software synthesis.
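As an illustration, the following is a hedged sketch of the loop structure a synthesis tool might emit for the schedule of Eq. (50); the actor-body functions are placeholders:

void A(void); void B(void); void C(void);   /* actor bodies (placeholders) */

void schedule_period(void)
{
    for (int i = 0; i < 2; i++) {   /* outer schedule loop (2 ...)          */
        for (int j = 0; j < 2; j++)
            B();                    /* inner loop (2B)                      */
        for (int j = 0; j < 5; j++)
            A();                    /* inner loop (5A)                      */
    }
    for (int i = 0; i < 5; i++)
        C();                        /* trailing loop (5C)                   */
}

Each actor appears exactly once in the code (hence single appearance and code size optimality), and the invocation sequence B, A, B, A, C yields the five activations per schedule period noted above.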
Ritz estimates the average rate of activations for a valid schedule S as the number of activations that occur in one iteration of S divided by the blocking factor of S.
Figure 11 This example illustrates that minimizing actor activations does not imply
minimizing actor appearances.
Example 10
Figure 12 depicts a complete hierarchization of an SDF graph. Figure 12a shows the original SDF graph; here, q(A, B, C, D) = (1, 2, 4, 8). Figure 12b shows
7 SUMMARY
REFERENCES
1 INTRODUCTION
Storage technology "takes the center stage" [1] in more and more systems because of the continual push for more complex applications with ever larger and more complicated data types. In addition, the access speed, size, and power consumption associated with this storage form a severe bottleneck in these systems (especially in an embedded context). In this chapter, several building blocks for memory storage will be investigated, with the emphasis on internal architectural organization. After a general classification of the memory hierarchy components in Section 2, cache architecture issues will be treated in Section 3, followed by main memory organization aspects in Section 4. The main emphasis will lie on modern multimedia- and telecom-oriented processors, both of the microprocessor and the DSP type.
Apart from the storage architecture itself, the way data are mapped to these architecture components is just as important for a good overall memory management solution. These issues are gaining in importance in the current age of deep-submicron technologies, where technology and circuit solutions are not sufficient on their own to solve the system design bottlenecks. Therefore, the last three sections are devoted to different aspects of data transfer and storage exploration: source code transformations (Sec. 5), task versus data parallelism exploitation (Sec. 6), and memory data layout organization (Sec. 7). Realistic multimedia
The goal of a storage device is, in general, to store a number of n-bit data words
for a short or long term. These data words are sent to processing units (processors)
at the appropriate point in time (cycle) and the results of the operations are then
written back in the storage device for future use. Due to the different characteris-
tics of the storage and access, different styles of devices have been developed.
magnetic media and tapes, which are intended for slow access to mass data. We will restrict ourselves to the most common case on the chip, namely volatile storage.
3. Address mechanism: Some devices require only sequential addressing, such as the first-in first-out (FIFO) queue, first-in last-out (FILO), or stack structures discussed in Section 2.3, which puts a severe restriction on the order in which the data are read out; still, this restriction is acceptable for many applications (a small FIFO sketch follows this list). A more general but still sequential access order is available in a pointer-addressed memory (PAM). In the PAM, the main limitation is that each data value is written and read exactly once, in some statically predefined order. However, in most cases, the address sequence should be random (including repetition). Usually, this is implemented with a direct addressing scheme (typically called a random-access memory or RAM). An important requirement in this case is that the access time should be independent of the address selected. In many programmable processors, a special case of random-access-based buffering is realized, exploiting comparisons of tags and usually also including (full or partial) associativity (in a so-called cache buffer).
4. Number of independent addresses and corresponding gateways
(buses) for access: This parameter can be one (single port), two (dual
port), or even more (multiport). Any of these ports can be for reading
only, writing only, or R/W. Of course, the area occupied will increase
considerably with the number of ports.
5. Internal organization of the memories: The memory can be designed for capacities that remain small or that can become large. Here, a trade-off is usually involved between speed and area efficiency. The register files in Section 2.2 constitute an example of the fast, small-capacity organizations, which are usually also dual-ported or even multiported. The queues and stacks in Section 2.3 are meant for medium-sized capacities. The RAMs in Section 4 can become extremely large (up to 256 Mbit for the state of the art) but are also much slower in random access.
6. Static or dynamic: For R/W memories, the data can remain valid as
long as VDD is on (static cell) or the data should be refreshed about
every millisecond (dynamic cell). Circuit-level issues are discussed in
overview articles like Ref. 2 for SRAMs and Refs. 3 and 4 for DRAMs.
In the following subsections, the most important read/write-type memories and
their characteristics will be investigated in more detail.
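Referring back to item 3 of the classification, the following minimal C sketch illustrates the sequential-addressing restriction of a FIFO queue; the depth and names are illustrative assumptions:

#define DEPTH 8

static int fifo[DEPTH];
static int wr = 0, rd = 0;        /* sequential write and read pointers      */

void fifo_put(int v)
{
    fifo[wr] = v;                 /* no address operand: the order is fixed  */
    wr = (wr + 1) % DEPTH;
}

int fifo_get(void)
{
    int v = fifo[rd];             /* data must be read in arrival order      */
    rd = (rd + 1) % DEPTH;
    return v;
}

In contrast, a RAM accepts an arbitrary address with every access, at the price of a full decoder and an access time that must remain independent of the address selected.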
In principle, the stack can be made "dynamic" (Fig. 8), where the data are pushed and popped in such a way that all data move (as if a spring were present). This leads to a tremendous waste of power in complementary metal-oxide semiconductor (CMOS) technology and should be used only in other technologies. A better solution in CMOS is to make the stack "static," as in Figure 9. Here, the only moves are made in a shift register (pointer), which can now move in two directions, as opposed to the unidirectional shift in the FIFO case.
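A hedged sketch of the two organizations contrasted above; the depth and names are illustrative assumptions:

#define DEPTH 16

int cells[DEPTH];
int top = 0;                        /* the moving pointer                     */

/* "Static" stack: only the pointer moves; the stored data stay in place. */
void push_static(int v) { cells[top++] = v; }
int  pop_static(void)   { return cells[--top]; }

/* "Dynamic" stack: every cell shifts on each push (the spring analogy),
   which toggles many signals and wastes power in CMOS. */
void push_dynamic(int v)
{
    for (int i = DEPTH - 1; i > 0; i--)
        cells[i] = cells[i - 1];    /* all data move by one position          */
    cells[0] = v;
}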
The objectives of this section are as follows: (1) to discuss the fundamental issues
about how cache memories operate; (2) to discuss the characteristic parameters
Note that for a direct-mapped cache, we have only one tag comparison. Once this process is done, the block offset is used to obtain the particular data element in the chosen cache line. This data element is then transferred to/from the CPU.
In Figure 10, we have used a direct-mapped cache; if we had used an n-way set-associative cache instead, the following differences would be observed: (1) n tag comparisons instead of one and (2) fewer index bits and more tag bits. This is illustrated in Figure 11, which depicts a two-way set-associative cache. We will briefly discuss some of the issues that need to be considered during the design of cache memories in the next subsection.
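The tag/index/offset decomposition described above can be sketched as follows, assuming a byte-addressed direct-mapped cache; all sizes are illustrative assumptions:

#include <stdint.h>

#define LINE_BYTES   32u               /* block (line) size                   */
#define NUM_SETS     256u              /* sets = lines for a direct map       */
#define OFFSET_BITS  5u                /* log2(LINE_BYTES)                    */
#define INDEX_BITS   8u                /* log2(NUM_SETS)                      */

void split_address(uint32_t addr,
                   uint32_t *tag, uint32_t *index, uint32_t *offset)
{
    *offset = addr & (LINE_BYTES - 1);                 /* block offset        */
    *index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* selects the set     */
    *tag    = addr >> (OFFSET_BITS + INDEX_BITS);      /* compared against    */
}                                                      /* the stored tag(s)   */

For a two-way set-associative cache of the same capacity, NUM_SETS is halved: one index bit becomes a tag bit, and two tag comparisons are performed per access.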
Figure 12 Initial algorithm and the corresponding cache states for a fully associative cache. For (i = 3; i < 11; i++), b[i-1] = b[i-3] + a[i] + a[i-3].
reuse in the cache (on-chip). Thus, for the algorithm in our example, we have 10 conflict misses, 2 capacity misses, and 12 compulsory misses.
Table 1 Differences Between Hardware and Software Caches for Current State-of-the-Art Multimedia and DSP Processors

                        Hardware-controlled cache    Software-controlled cache
Basic concepts          Lines and sets               Lines
Data transfer           Hardware                     Partly software
Updating policy         Hardware                     Software
Replacement policy      Hardware                     NA
2. The hardware performs the data transfer based on the execution order of the algorithm at run time using fixed statistical measures, whereas for the software-controlled cache this task is performed either by the compiler or the user. This is currently possible using high-level compile-time directives like "ALLOCATE()" and link-time options like "LOCK" (to lock certain data in part of the cache) through the compiler/linker [11].
3. The most important difference between hardware- and software-con-
trolled cache is in the way the next higher level of memory is updated,
namely the way coherence of data is maintained. For the hardware-
controlled cache, the hardware writes data to the next higher level of
memory either every time a write occurs or when the particular cache
line is evicted, whereas for the software-controlled cache, the compiler
decides when and whether or not to write back a particular data element
[11]. This results in a large reduction in the number of data transfers
between different levels of memory hierarchy, which also contributes
to lower power and reduced bandwidth usage by the algorithm.
4. The hardware-controlled cache needs an extra bit for every cache line
for determining the least recently used data, which will be replaced on
a cache miss. For the software-controlled cache, because the compiler
manages the data transfer, there is no need for additional bits or a
particular replacement policy.
A large variety of possible types of RAM for use as main memories has been proposed in the literature, and research on RAM technology is still very active, as demonstrated by the results in, for example, the proceedings of the recent International Solid-State Circuits Conference (ISSCC) and Custom Integrated Circuits Conference (CICC). Summary articles are available in Refs. 4, 12, and 13. The general organization will depend on the number of ports but, usually, single-port structures are encountered in the large-density memories. This will also be the restriction here. Most other distinguishing characteristics are related to the circuit design (and the technological issues) of the RAM cell, the decoder, and the auxiliary R/W devices. In this section, only the general principles will be discussed. Detailed circuit issues fall outside the scope of this chapter, as mentioned earlier.
The CS signal is also necessary to allow the use of memory banks (on separate chips), as needed for personal computers or workstations (Fig. 15). Note that the CS1 and CS2 control signals can be considered as the most significant part of the 18-bit address. Indeed, in a way, the address space is split up vertically over two separate memory planes. Moreover, every RAM in a horizontal slice contributes only a single data bit.
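A minimal sketch of this vertical address split, assuming (as an illustration) that the most significant bit of the 18-bit address acts as the chip select between the two planes:

#include <stdint.h>

/* 18-bit address space split over two planes: the MSB drives the chip
   selects, the remaining 17 bits address a word within the plane. */
void decode_address(uint32_t addr, int *cs1, int *cs2, uint32_t *word)
{
    int plane = (addr >> 17) & 1;
    *cs1  = (plane == 0);          /* select the first memory plane          */
    *cs2  = (plane == 1);          /* select the second memory plane         */
    *word = addr & 0x1FFFFu;       /* word address within the plane          */
}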
For large-capacity RAMs, the basic floor plan of Figure 14 leads to a very slow realization because of the overly long bit lines. For this purpose, the same principle as in large ROMs, namely postdecoding, is applied. This leads to the use of an X decoder and a Y decoder (Fig. 16), where the flexibility of the floor-plan shape is now used to end up with a near square (dimensions x × y), with x + y = k and x ≈ y + log₂ B.
bit (byte) RAMs are now commonly produced. In the near future, one can expect
other formats to appear.
Such a RAM is illustrated in Figure 19, in which φ1 and φ2 are clock phases. The different pipeline stages in the reading/writing of a value from/to the synchronous RAM are indicated.
can be more drastically reduced still in the future (provided that similar investments are made).
Combined with the advance in process technology, all of this has led to a remarkable reduction of the DRAM-related power: from several watts for the 16–32-MB generation to about 100 mW for 100-MHz operation in a 256-MB DRAM.
Hence, modern stand-alone DRAM chips, which are of the so-called synchronous (SDRAM) type, already offer low-power solutions, but this comes at a price. Internally, they contain banks and a small cache with a (very) wide width connected to the external high-speed bus (see Fig. 20) [15,20]. Thus, the low-power operation per bit is only feasible when they operate in burst mode with large data widths.
This is not directly compatible with the actual use of the data in the proces-
sor data paths; therefore, without a buffer to the processors, most of the bits that
are exchanged would be useless (and discarded). Obviously, the effective energy
consumption per useful bit becomes very high in that case and also the effective
bandwidth is quite low.
Therefore, a hierarchical and typically much more power-hungry intermediate memory organization is needed to match the central DRAM to the data-ordering and bandwidth requirements of the processor data paths. This is also illustrated in Figure 21. The decrease in power consumption of fast random-access memories is not yet as advanced as that of DRAMs, and it is saturating, because many circuit- and technology-level tricks have already been applied in SRAMs as well. As a result, fast SRAMs keep consuming on the order of watts for
very wide DRAMs, and SRAMs with more than two ports (see, e.g., the eight-
port SRAM in Ref. 24) [4].
Code rewriting techniques, consisting of loop and data flow transformations, are
an essential part of modern optimizing and parallelizing compilers. They are
mainly used to enhance the temporal and spatial locality for cache performance
and to expose the inherent parallelism of the algorithm to the outer (for asynchro-
nous parallelism) or inner (for synchronous parallelism) loop nests [25–27].
Other application areas are communication-free data allocation techniques [28]
and optimizing communications in general [29].
It is thus no surprise that these code rewriting techniques are also at the heart of our DTSE methodology. As the first step (after the preprocessing and pruning) in the script, they enable a significant reduction of the required amount of storage and transfers. By themselves, however, they only increase the locality and regularity of the code; this enables later steps in the script [notably the data reuse, memory (hierarchy) assignment, and in-place mapping steps] to arrive at the desired reduction of storage and transfers.
Crucial in our methodology is that these transformations have to be applied globally (i.e., with the entire algorithm as scope). This is in contrast with most existing loop transformation research, where the scope is limited to one procedure or even one loop nest. A local scope can enhance the locality (and parallelization possibilities) within that loop nest, but it does not change the global data flow and the associated buffer space needed between the loop nests or procedures. In this section, we will also illustrate our preprocessing and pruning step, which is essential for applying global transformations.
In Section 5.1, we will first give a very simple example to show how loop
transformations can significantly reduce the data storage and transfer require-
ments of an algorithm. Next, we will demonstrate our approach by applying it
to a cavity-detection application for medical imaging. This application is intro-
duced in Section 5.2 and the code rewriting techniques are applied in Section
5.3. Finally (Sec. 5.4), we will also give a brief overview of how we want to
perform global loop transformations automatically in the DTSE context.
for (i = 1; i <= N; ++i) {
  A[i] = ...;
  B[i] = f(A[i]);
}
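In the fragment above, each A[i] is consumed in the same iteration that produces it. Assuming A is not live after the loop (an illustrative assumption), the array can be contracted to a scalar, which is exactly the kind of storage reduction these transformations target:

for (i = 1; i <= N; ++i) {
  a = ...;        /* was A[i]: the N-element buffer degenerates to a scalar */
  B[i] = f(a);
}

Storage for A drops from O(N) to O(1), and all main-memory transfers for A disappear.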
void GaussBlur (unsigned char image_in[M][N], unsigned char gxy[M][N]) {
  unsigned char gx[M][N]; // intermediate, horizontally blurred image
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      gx[y][x] = ...; // Apply horizontal gaussblurring
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      gxy[y][x] = ...; // Apply vertical gaussblurring
}
void ComputeEdges (unsigned char gxy[M][N], unsigned char ce[M][N]) {
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      ce[y][x] = ...; // Replace pixel with the maximum difference with its neighbors
}
void Reverse (unsigned char ce[M][N], unsigned char ce_rev[M][N]) {
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      maxval = ...; // Compute maximum value of the image
  // Subtract every pixel value from this maximum value
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      ce_rev[y][x] = maxval - ce[y][x];
}
void DetectRoots (unsigned char ce[M][N], unsigned char image_out[M][N]) {
  unsigned char ce_rev[M][N];
  Reverse (ce, ce_rev); // Reverse image
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      image_out[y][x] = ...; // Is true if no neighbors are bigger than current pixel
}
void main () {
  unsigned char image_in[M][N], gxy[M][N], ce[M][N], image_out[M][N];
  // ... (read image)
  GaussBlur (image_in, gxy);
  ComputeEdges (gxy, ce);
  DetectRoots (ce, image_out);
}
void cav_detect (unsigned char image_in[M][N], unsigned char image_out[M][N]) {
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      gx[y][x] = ...; // Apply horizontal gaussblurring
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      gxy[y][x] = ...; // Apply vertical gaussblurring
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      ce[y][x] = ...; // Replace pixel with the maximum difference with its neighbors
  for (y = 0; y < M; ++y)
    for (x = 0; x < N; ++x)
      image_out[y][x] = ...; // Is true if no neighbors are smaller than current pixel
}
Next, another data flow transformation can be performed to reduce the ini-
tializations. In the initial version, these are always done for the entire image
frame. This is not needed; only the borders have to be initialized, which saves
a lot of costly memory accesses. In principle, designers are aware of this, but
we have found that, in practice, the original code usually still contains a large
amount of redundant accesses. By systematically analyzing the code for this
(which is heavily enabled by the preprocessing phase), we can identify all redun-
dancy in a controlled way.
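A hedged sketch of this initialization transformation, using an illustrative buffer gxy and the loop style of the code above:

/* Before: full-frame initialization, M*N memory accesses. */
for (y = 0; y < M; ++y)
  for (x = 0; x < N; ++x)
    gxy[y][x] = 0;

/* After: only the border pixels that the neighborhood operators actually
   read are initialized, roughly 2*(M+N) accesses instead of M*N. */
for (y = 0; y < M; ++y) {
  gxy[y][0] = 0;
  gxy[y][N-1] = 0;
}
for (x = 0; x < N; ++x) {
  gxy[0][x] = 0;
  gxy[M-1][x] = 0;
}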
of a line (thus equivalent to 6 line buffers of full width). Because the accesses are not FIFO compatible in this case, the buffers will have to be organized as SRAMs, which are more expensive than FIFOs.
The results are summarized in Table 3. It is clear that the task-parallel
version is the optimal solution here. Note that the load balancing is less ideal
than for the data-parallel version, but it is important to trade off performance and
DTSE; for example, if we can avoid a buffer of 32 Kbits by using an extra
processor, this can be advantageous even if this processor is idle 90% of the time
(which also means that we have a very bad load balance), because the cost of
this extra processor in terms of area and power is less than the cost of a 32-Kbit
on-chip memory.
stored on-chip. The increase in the number of accesses (from 9900K in the non-parallel solution to 12,860K) is due to the overlapping motion-estimation regions of the blocks at the boundaries of neighboring frame areas.
6.3 Conclusions
As far as the memory size required for the storage of the intermediate array signals is concerned, the results of the partitionings based on the initial description prove that this size is reduced when the partitioning becomes more data oriented. This size is smaller for the first hybrid partitioning (245K), which is more data oriented than the second hybrid partitioning (282K) and the task-level partitioning (287K).
For the reorganized description, the results indicate the opposite. In terms of the number of memory accesses to the intermediate signals, the situation is simpler: This number always decreases as the partitioning becomes more data oriented.
Table 4 shows an overview of the achieved results. The estimated area and power figures were obtained using a proprietary Motorola model, so we can give only relative values. From Table 4, it is clear that the rankings for the different alternatives (initial and reorganized) are distinct. For the reorganized description, the task-level-oriented hybrids are better. This is because this kind of partitioning keeps the balance between double buffers (present in task-level partitioning) and replicates of array signals with the same functionality
In Section 3, we introduced the three types of cache misses and identified that conflict misses are one of the major hurdles to achieving better cache utilization. In the past, source-level program transformations that modify the execution order to enhance cache utilization by improving data locality have been proposed [39–41]. Storage-order optimizations are also helpful in reducing cache misses [42,43]. However, existing approaches do not eliminate the majority of conflict misses. In addition, very little has been done to measure the impact of data layout optimization on cache performance [39,42]. Thus, advanced data layout techniques need to be identified to eliminate conflict misses and improve cache performance.
In this section, we discuss the memory data layout organization (MDO)
technique. This technique allows an application designer to remove most of the
conflict cache misses. Apart from this, MDO also helps in reducing the required
bandwidth between different levels of memory hierarchy due to increased spatial
locality.
First, we will briefly introduce the basic principle behind memory data
layout organization with an example. This is followed by the problem formulation
and a brief discussion of the solution to this problem. Experimental results using
a source-to-source compiler for performing data layout optimization and related
discussions are presented to conclude this section.
are stored in the main memory, as shown in Figure 29. To obtain this modified data layout, the following steps are carried out: (1) the initial arrays are split into subarrays of equal size, where the size of each subarray is called the tile size; (2) different arrays are merged so that the sum of their tile sizes equals the cache size, and the merged array(s) are stored recursively until all of the arrays concerned are mapped completely in the main memory. Thus, we now have a new array which comprises all the arrays, but the constituent arrays are stored in such a way that they get mapped into the cache so as to remove conflict misses and increase spatial locality. This new array is represented by x[ ] in Figure 29.
In Figure 29, two important observations need to be made: (1) there is a recursive allocation of the different array data, with each recursion equal to the cache size, and (2) the generated addressing is used to impose the modified data layout on the linker.
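A minimal sketch of the generated addressing for one array inside the merged array x[ ], under illustrative tile and cache sizes (the names are assumptions for the example):

#define C       1024   /* cache size in elements (one recursion "stripe")   */
#define TILE_A   256   /* tile size chosen for array a                      */
#define BASE_A     0   /* offset of a's tile inside each stripe             */

int mdo_index_a(int i)
{
    return (i / TILE_A) * C          /* skip whole cache-sized stripes      */
         + BASE_A                    /* a's slot inside the stripe          */
         + (i % TILE_A);             /* position within the current tile    */
}
/* Usage: x[mdo_index_a(i)] replaces a[i] after the layout transformation,
   which is the "generated addressing" imposed on the linker. */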
The MDO problem comprises two subproblems: first, the tile size evaluation problem and, second, the array merging/clustering problem.
$$
\begin{aligned}
L_1 &= x_1 + x_2 + x_3 + \cdots + x_n \le C\\
L_2 &= x_1^{(1)} + x_2^{(1)} + x_3^{(1)} + \cdots + x_n^{(1)} \le C\\
&\ \ \vdots\\
L_m &= x_1^{(m-1)} + x_2^{(m-1)} + x_3^{(m-1)} + \cdots + x_n^{(m-1)} \le C
\end{aligned}
\tag{1}
$$

$$
L_{w_k} = \sum_{i=1}^{n} \text{effsize}_i \tag{2}
$$
* In the worst case, one tile size for every loop nest in which the array is alive.
† In the worst case, we could have a different tile size for every array in every loop nest for the
given program.
2. DOECU II: In the second heuristic, the tile sizes are evaluated by a more global method. Here, we first accumulate the effective sizes for every array over all of the loop nests. Next, we perform the proportionate allocation for every loop nest based on the accumulated effective sizes. This results in smaller differences between the tile sizes evaluated for an array across different loop nests, which is necessary because suboptimal tile sizes can result in more self-conflict misses. The merging of different arrays is done in a similar way to that in the first heuristic.
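A hedged sketch of the proportionate allocation step shared by both heuristics; the interface and integer arithmetic are illustrative assumptions:

/* Each array in a loop nest receives a share of the cache capacity C
   proportional to its (accumulated) effective size. */
void allocate_tiles(const int effsize[], int tile[], int n, int cache_c)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += effsize[i];

    for (int i = 0; i < n; i++)          /* proportional share of cache_c   */
        tile[i] = (effsize[i] * cache_c) / total;
}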
because the motion-estimation algorithm has only one (large) loop nest with a depth of six, namely six nested loops with one body.
The main observations from all three tables are as follows. MDO-optimized code has a larger spatial reuse of data in both the L1 and L2 caches. This increase in spatial reuse is due to the recursive allocation of simultaneously alive data for a particular cache size, and it is observed from the L1 and L2 cache line reuse values. The L1 and L2 cache hit rates are consistently greater too, which indicates that the tile sizes evaluated by the tool were nearly optimal, because suboptimal tile sizes would cause more self-conflict cache misses.
Because the spatial reuse of data is increased, the memory access time is reduced by a factor of 2 on average. Similarly, the bandwidth used between the L1 and L2 caches is reduced by 40% to a factor of 2.5, and the
bandwidth between the L2 cache and the main memory is reduced by a factor of 2–20. This indicates that although the initial algorithm had large hit rates, the hardware was still performing many redundant data transfers between the different levels of the memory hierarchy. These redundant transfers are removed by the modified data layout, which heavily decreases the system bus loading. This has a large impact on the global system performance, because most multimedia applications are required to operate with peripheral devices connected using the off-chip bus.
Because we generate complex addressing, we also perform address optimizations [48] to remove the addressing overhead. Our studies have shown that we are able not only to remove the complete addressing overhead but also to gain up to 20% in the final execution time on MIPS R10000 and HP PA-8000 processors, compared to the initial algorithm, apart from obtaining the large gains in the cache and memory hierarchy.
REFERENCES
1. G Lawton. Storage technology takes the center stage. IEEE Computer Mag 32(11):
10–13, 1999.
2. R Evans, P Franzon. Energy consumption modeling and optimization for SRAMs.
IEEE J Solid-State Circuits 30(5):571–579, 1995.
3. K Itoh, Y Nakagome, S Kimura, T Watanabe. Limitations and challenges of multi-
gigabit DRAM chip design. IEEE J Solid-State Circuits 26(10), 1997.
4. B Prince. Memory in the fast lane. IEEE Spectrum 38–41, 1994.
5. R Jolly. A 9ns 1.4GB/s 17-ported CMOS register file. IEEE J Solid-State Circuits
26(10):1407–1412, 1991.
6. N Weste, K Eshraghian. Principles of CMOS VLSI Design. 2nd ed. Reading, MA: Addison-Wesley, 1993.
7. D Patterson, J Hennessy. Computer Architecture: A Quantitative Approach. San Francisco: Morgan Kaufmann, 1996.
8. AJ Smith. Line size choice for CPU cache memories. IEEE Trans Computers 36(9),
1987.
9. CL Su, A Despain. Cache design tradeoffs for power and performance optimization:
a case study. Proc. Int. Conf. on Low Power Electronics and Design (ICLPED),
1995, pp 63–68.
10. U Ko, PT Balsara, A Nanda. Energy optimization of multi-level processor cache
architectures. Proc. Int. Conf. on Low Power Electronics and Design (ICLPED),
1995, pp 63–68.
11. Philips. TriMedia TM1000 Data Book. Sunnyvale, CA: Philips Semiconductors,
1997.
12. R Comerford, G Watson. Memory catches up. IEEE Spectrum 34–57, 1992.
13. Y Oshima, B Sheu, S Jen. High speed memory architectures for multimedia applica-
tions. IEEE Circuits Devices Mag 8–13, 1997.
32. A Darte, Y Robert. Scheduling uniform loop nests. Internal Report 92-10, ENSL/
IMAG, Lyon, France, 1992.
33. A Darte, Y Robert. Affine-by-statement scheduling of uniform loop nests over parametric domains. Internal Report 92-16, ENSL/IMAG, Lyon, France, 1992.
34. M Neeracher, R Ruhl. Automatic parallelisation of LINPACK routines on distributed
memory parallel processors. Proc. IEEE Int. Parallel Proc. Symp. (IPPS), 1993.
35. C Polychronopoulos. Compiler optimizations for enhancing parallelism and their
impact on the architecture design. IEEE Trans Computer 37(8):991–1004, 1988.
36. A Agarwal, D Krantz, V Natarajan. Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors. IEEE Trans Parallel Distrib Syst 6(9):943–962, 1995.
37. K Danckaert, F Catthoor, H De Man. System-level memory management for weakly parallel image processing. In: Proc. Euro-Par Conf. Lecture Notes in Computer Science Vol. 1124. Berlin: Springer-Verlag, 1996.
38. K Danckaert, F Catthoor, H De Man. A loop transformation approach for combined
parallelization and data transfer and storage optimization. Proc. Conf. on Parallel
and Distributed Processing Techniques and Applications, 2000, Volume V,
pp 2591–2597.
39. M Kandemir, J Ramanujam, A Choudhary. Improving cache locality by a combina-
tion of loop and data transformations. IEEE Trans Computers 48(2):159–167, 1999.
40. M Lam, E Rothberg, M Wolf. The cache performance and optimization of blocked
algorithms. Proc. Int. Conf. on Architectural Support for Programming Languages
and Operating Systems, 1991, pp 63–74.
41. D Kulkarni, M Stumm. Linear loop transformations in optimizing compilers for
parallel machines. Austral Computer J 41–50, 1995.
42. PR Panda, ND Dutt, A Nicolau. Memory data organization for improved cache performance in embedded processor applications. Proc. Int. Symp. on System Synthesis,
1996, pp 90–95.
43. E De Greef. Storage size reduction for multimedia applications. PhD thesis, Depart-
ment of Electrical Engineering, Katholieke Universiteit, Leuven, Belgium, 1998.
43. S Ghosh, M Martonosi, S Malik. Cache miss equations: A compiler framework for
analyzing and tuning memory behaviour. ACM Trans Program Lang Syst 21(4):
702–746, 1999.
44. CL Lawson, RJ Hanson. Solving Least-Square Problems. Classics in Applied Mathe-
matics. Philadelphia: SIAM, 1995.
45. GL Nemhauser, LA Wolsey. Integer and Combinatorial Optimizations. New York:
Wiley, 1988.
46. C Kulkarni. Cache conscious data layout organization for embedded multimedia
applications. Internal Report, IMEC-DESICS, Leuven, Belgium, 2000.
47. C Kulkarni, F Catthoor, H De Man. Advanced data layout optimization for multimedia applications. Proc. Workshop on Parallel and Distributed Computing in Image,
Video and Multimedia Processing (PDIVM) of IPDPS 2000. Lecture Notes in Com-
puter Science Vol. 1800, Berlin: Springer-Verlag, 2000, pp 186–193.
48. S Gupta, M Miranda, F Catthoor, R Gupta. Analysis of high-level address code
transformations for programmable processors. Proc. 3rd ACM/IEEE Design and
Test in Europe Conf., 2000.
Editorial Board
Maurice G. Ballanger, Conservatoire National
des Arts et Métiers (CNAM), Paris
Ezio Biglieri, Politecnico di Torino, Italy
Sadaoki Furui, Tokyo Institute of Technology
Yih-Fang Huang, University of Notre Dame
Nikhil Jayant, Georgia Tech University
Aggelos K. Katsaggelos, Northwestern University
Mos Kaveh, University of Minnesota
P. K. Raja Rajasekaran, Texas Instruments
John Aasted Sorenson, IT University of Copenhagen
Since their inception in the late 1970s, programmable digital signal processors
(PDSPs) have gradually expanded into applications such as multimedia signal
processing, communications, and industrial control. PDSPs have always played
a dual role: on the one hand, they are programmable microprocessors; on the
other hand, they are designed specifically for digital signal processing (DSP)
applications. Hence they often contain special instructions and special architec-
ture supports so as to execute computation-intensive DSP algorithms more effi-
ciently. This book addresses various programming issues of PDSPs and features
the contributions of some of the leading experts in the field.
In Chapter 1, Kittitornkun and Hu offer an overview of the various aspects
of PDSPs. Chapter 2, by Managuli and Kim, gives a comprehensive discussion
of programming methods for very-long-instruction-word (VLIW) PDSP architec-
tures; in particular, they focus on mapping DSP algorithms to best match the
underlying VLIW architectures. In Chapter 3, Lee and Fiskiran describe native
signal processing (a technique to enhance the performance of multimedia signal
processing by general-purpose microprocessors) and compare various formats for
multimedia extension (MMX) instructions. Chapter 4, by Tessier and Burleson,
presents a survey of academic research and commercial development in recon-
figurable computing for DSP systems over the past 15 years.
The next three chapters focus on issues in software development. In Chapter
5, Wu and Wolf examine the pros and cons of various options for implementing
video signal processing applications. Chapter 6, by Yu and Hu, details a method-
ology for optimal compiler linear code generation. In Chapter 7, Chen et al. offer
Yu Hen Hu