Architectures for Programmable Digital Signal
Processing Devices
Basic Architectural Features
A programmable DSP device should provide instructions similar to those of a
conventional microprocessor
The instruction set of a typical DSP device should include the
following,
a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY etc
b. Logical operations such as AND, OR, NOT, XOR etc
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation
In addition to the above provisions, the architecture should also
include,
a. On chip registers to store intermediate results
b. On chip memories to store signal samples (RAM)
c. On chip memories to store filter coefficients (ROM)
Problem: Investigate the basic features that should be provided in
the DSP architecture to be used to implement the following Nth
order FIR filter:
y(n) = h(0)x(n) + h(1)x(n-1) + … + h(N-1)x(n-(N-1)) = Σ h(k)·x(n-k), k = 0 to N-1
In order to implement the above operation in a DSP, the architecture
requires the following features
i. A RAM to store the signal samples x (n)
ii. A ROM to store the filter coefficients h (n)
iii. An MAC unit to perform Multiply and Accumulate operation
iv. An accumulator to store the intermediate results of accumulation
v. A signal pointer to point to the signal samples in the memory
vi. A coefficient pointer to point to the filter coefficients in the memory
vii. A counter to keep track of the number of terms accumulated
viii. A shifter to shift the input samples appropriately (as exercised in the sketch below)
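The following minimal C sketch (the tap count N, the Q15 scaling and all names are illustrative assumptions, with 16-bit samples/coefficients and a 32-bit accumulator) shows how these features are exercised when one FIR output sample is computed

    #include <stdint.h>

    #define N 16                         /* number of taps, assumed for illustration */

    static int16_t x[N];                 /* RAM: recent signal samples x(n)          */
    static const int16_t h[N] = {0};     /* ROM: filter coefficients h(k)            */

    int16_t fir_output(void)
    {
        const int16_t *xp = x;           /* signal pointer                           */
        const int16_t *hp = h;           /* coefficient pointer                      */
        int32_t acc = 0;                 /* accumulator                              */

        for (int k = 0; k < N; k++)      /* counter keeps track of the taps          */
            acc += (int32_t)(*xp++) * (*hp++);   /* MAC operation                    */

        return (int16_t)(acc >> 15);     /* shifter: rescale the sum (Q15 assumed)   */
    }

The signal pointer, coefficient pointer, counter and accumulator of the list above map directly onto xp, hp, k and acc in this sketch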
DSP Computational Building Blocks
Each computational block of the DSP should be optimized for
functionality and speed
The design should be sufficiently general so that it can be easily
integrated with other blocks to implement overall DSP systems
Multipliers
The advent of single chip multipliers paved the way for
implementing DSP functions on a VLSI chip
Parallel multipliers replaced the traditional shift and add multipliers
Parallel multipliers take a single processor cycle to fetch the
instruction, execute it and store the result; they are also called array
multipliers
The key features to be considered for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
The number of bits used to represent the operands decides the
accuracy and the dynamic range of the multiplier, whereas the speed
is decided by the architecture employed
If the multipliers are implemented using hardware, the speed of
execution will be very high, but the circuit complexity will also
increase considerably
Thus there should be a trade-off between the speed of execution and
the circuit complexity
Hence the choice of the architecture normally depends on the
application
Parallel Multipliers
Consider the multiplication of two unsigned numbers A and B
Let A be represented using m bits as (Am-1 Am-2 … A1 A0) and B be
represented using n bits as (Bn-1 Bn-2 … B1 B0)
Then the product of these two numbers is given by
P = Σ Σ Ai·Bj·2^(i+j), i = 0 to m-1, j = 0 to n-1
This operation can be implemented in parallel using a Braun
multiplier, whose hardware structure is as shown in the figure below:
Braun Multiplier for a 4x4 Multiplication
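The array structure can be modelled behaviourally in C by generating every partial-product bit Ai·Bj and adding it with weight 2^(i+j); this is only a software sketch of the arithmetic (function name illustrative), since the hardware produces all partial products in parallel

    #include <stdint.h>

    /* Behavioural model of an m x n unsigned array (Braun) multiplier:
     * every partial-product bit Ai*Bj is generated and added with weight 2^(i+j). */
    uint32_t braun_multiply(uint16_t a, uint16_t b, int m, int n)
    {
        uint32_t p = 0;
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                uint32_t bit = ((a >> i) & 1u) & ((b >> j) & 1u);
                p += bit << (i + j);     /* add Ai*Bj at bit position i+j */
            }
        }
        return p;                        /* product is at most m+n bits long */
    }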
Multipliers for Signed Numbers
Generally signed numbers are represented in 2’s complement form
In the Braun multiplier, the signs of the numbers are not taken into
account
In order to implement a multiplier for signed numbers, additional
hardware is required to modify the Braun multiplier
The modified multiplier is called the Baugh-Wooley multiplier
Consider two signed numbers A and B whose product is P = Pm+n-1 … P1 P0
If we need to multiply two 4-bit numbers A3A2A1A0 x B3B2B1B0, with
A3 and B3 as the sign bits
Steps are as follows:
Step 1: Multiply A2 A1 A0 x B2 B1 B0 to give product p5-p0
Step 2: Multiply sign bits A3 B3 and add them at p6 position
Step 3: Multiply the sign bit A3 with B2 B1 B0 to get A3B2 A3B1 A3 B0 and add
them at positions p5, p4 and p3
The addition is performed after obtaining the 2's complement of A3B2 A3B1 A3B0,
which is equivalent to subtracting A3B2 A3B1 A3B0
Step 4: Multiply the sign bit B3 with A2 A1 A0 to get B3A2 B3A1 B3A0 and add
them at positions p5, p4 and p3, again as a 2's complement subtraction
Steps 1 to 4 give the total product of the two signed numbers
Using 2's complement, the subtractions can be expressed as additions, and
hence the resulting multiplier consists of only adders
Example: Multiply 2 numbers -3 and 6
Step 1: -3 = A3A2A1A0 = 1101 and 6 = B3B2B1B0 = 0110
Multiplication of A2A1A0 x B2B1B0 → 101 x 110:
        101
      x 110
      -----
        000
       101
      101
     ------
p5p4p3p2p1p0 → 011110
Step 2: Sign bits: A3B3 = 1 x 0 = 0; add this at p6 (the sum of the subscripts
of A3 and B3 is 3 + 3 = 6) to get p6p5p4p3p2p1p0 = 0011110
Step 3:
Multiplication of A3 with B2B1B0 → 1 x 110 = 110
Subtract this at positions p5p4p3: the current p5p4p3 = 011, and 011 - 110 = 101 with a borrow of 1
The borrow propagates through p6 and beyond, so p6 = 1 and p7, p8, etc. are all 1 (sign extension)
This gives p7p6p5p4p3p2p1p0 = 11101110
Step 4: Multiply B3 with A2A1A0 → 0 x 101 = 000
Subtracting this (zero) at p5p4p3 leaves the result unchanged
Hence
p7p6p5p4p3p2p1p0 = 1110 1110 = EEh = -18 (decimal), the product of -3 and 6
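The four steps can be checked with a small C sketch (illustrative only; the operands are passed as 4-bit two's complement patterns and the helper name is hypothetical)

    #include <stdio.h>

    /* Behavioural check of the four Baugh-Wooley steps for a 4x4 signed multiply. */
    int baugh_wooley_4x4(unsigned a, unsigned b)
    {
        unsigned a3 = (a >> 3) & 1u, b3 = (b >> 3) & 1u;  /* sign bits      */
        unsigned al = a & 7u,        bl = b & 7u;         /* magnitude bits */

        int p = (int)(al * bl);            /* step 1: p5..p0                       */
        p += (int)(a3 * b3) << 6;          /* step 2: add A3*B3 at p6              */
        p -= (int)(a3 * bl) << 3;          /* step 3: subtract A3*(B2B1B0) at p3   */
        p -= (int)(b3 * al) << 3;          /* step 4: subtract B3*(A2A1A0) at p3   */
        return p;                          /* two's complement product             */
    }

    int main(void)
    {
        /* -3 (1101) x 6 (0110) should give -18, i.e. EEh as an 8-bit pattern. */
        int p = baugh_wooley_4x4(0xDu, 0x6u);
        printf("product = %d (0x%02X)\n", p, (unsigned)p & 0xFFu);
        return 0;
    }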
The Baugh Wooley multiplier is as shown below:
Speed
The conventional shift-and-add technique of multiplication requires n
cycles to perform the multiplication of two n-bit numbers, whereas in
parallel multipliers the time required is the longest path delay in the
combinational circuit used
As DSP applications generally require very high speed, it is desirable
to have multipliers operating at the highest possible speed by having
parallel implementation
Bus Widths
Consider the multiplication of two n bit numbers X and Y
The product Z can be at most 2n bits long
In order to perform the whole operation in a single execution cycle,
we require two buses of width n bits each to fetch the operands X
and Y and a bus of width 2n bits to store the result Z to the memory
Although this performs the operation faster, it is not an efficient way
of implementation as it is expensive
Many alternatives for the above method have been proposed
One such method is to use the program bus itself to fetch one of the
operands after fetching the instruction, thus requiring only one bus to
fetch the operands
And the result Z can be stored back to the memory using the same
operand bus
But the problem with this is that the result Z is 2n bits long, whereas
the operand bus is only n bits long
We have two alternatives to solve this problem,
a. Use the n bits operand bus and save Z at two successive memory
locations
Although it stores the exact value of Z in the memory, it takes two
cycles to store the result
b. Discard the lower n bits of the result Z and store only the higher
order n bits into the memory
This is not applicable for applications where an accurate result is
required
Another alternative can be used for applications where speed is
not a major concern, in which latches are used for the inputs and the
output, thus requiring only a single bus to fetch the operands and to
store the result (shown below)
A Multiplier with Input and Output Latches
Shifters
Shifters are used to either scale down or scale up operands or the
results
The following scenarios give the necessity of a shifter
a. While performing the addition of N numbers, each n bits long, the
sum can grow up to n + log2 N bits long
If the accumulator is n bits long, an overflow error will occur
This can be overcome by using a shifter to scale down the operands
by log2 N bits
b. Similarly while calculating the product of two n bit numbers, the
product can grow up to 2n bits long
Generally the lower n bits get neglected and the sign bit is shifted to
save the sign of the product
c. Finally in case of addition of two floating-point numbers, one of the
operands has to be shifted appropriately to make the exponents of
two numbers equal
From the above cases it is clear that, a shifter is required in the
architecture of a DSP
Problem 1: It is required to find the sum of 64, 16 bit numbers.
How many bits should the accumulator have so that the sum can
be computed without the occurrence of overflow error or loss of
accuracy?
Solution: The sum of 64 16-bit numbers can grow up to (16 + log2 64)
= 22 bits long. Hence the accumulator should be 22 bits long in order
to prevent an overflow error from occurring
Problem 2: In the previous problem, it is decided to have an
accumulator with only 16 bits but shift the numbers before the
addition to prevent overflow, by how many bits should each
number be shifted?
Solution: As the length of the accumulator is fixed, the operands have
to be shifted right by log2 64 = 6 bits prior to the addition, in order to
avoid overflow
Problem 3: If all the numbers in the previous problem are fixed
point integers, what is the actual sum of the numbers?
Solution: The actual sum can be obtained by shifting the accumulator
contents left by 6 bits after the sum has been computed. Therefore,
Actual sum = Accumulator contents x 2^6
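A C sketch of Problems 1-3 (int16_t stands in for the 16-bit accumulator and the test data is arbitrary): each operand is shifted right by 6 bits before accumulation, and the true sum is then approximated by shifting the accumulator contents back left by 6 bits

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int16_t x[64];
        int16_t acc16 = 0;               /* 16-bit accumulator with pre-scaling          */
        int32_t acc22 = 0;               /* wide accumulator (>= 22 bits) for reference  */

        for (int i = 0; i < 64; i++)
            x[i] = 1000;                 /* arbitrary test data                          */

        for (int i = 0; i < 64; i++) {
            acc22 += x[i];               /* exact sum needs 16 + log2(64) = 22 bits      */
            acc16 += x[i] >> 6;          /* scale down by log2(64) = 6 bits before adding*/
        }

        /* Actual sum ~= accumulator contents x 2^6 (low 6 bits of each operand lost). */
        printf("exact = %ld, scaled estimate = %ld\n",
               (long)acc22, (long)acc16 << 6);
        return 0;
    }

Note that the scaled accumulation discards the low 6 bits of each operand, which is the loss of accuracy that the guard-bit approach described later avoids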
Barrel Shifters
In conventional microprocessors, normal shift registers are used for
shift operation
As it requires one clock cycle for each shift, it is not desirable for
DSP applications, which generally involve many shifts
In other words, for DSP applications as speed is the crucial issue,
several shifts are to be accomplished in a single execution cycle
This can be accomplished using a barrel shifter, which connects the
input lines representing a word to a group of output lines with the
required shifts determined by its control inputs
For an input of length n, log2 n control lines are required
And an additional control line is required to indicate the direction of
the shift
The block diagram of a typical barrel shifter is as shown in figure
below
A Barrel Shifter
Implementation of a 4 bit Shift Right Barrel Shifter
Figure above depicts the implementation of a 4 bit shift right barrel
shifter
Shift to right by 0, 1, 2 or 3 bit positions can be controlled by setting
the control inputs appropriately
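A behavioural C sketch of such a 4-bit shift-right barrel shifter (function name illustrative): the two control lines s1 and s0 select a shift of 0 to 3 positions, and each stage corresponds to a row of 2-to-1 multiplexers

    #include <stdint.h>

    /* 4-bit shift-right barrel shifter modelled as two multiplexer stages.
     * s0 selects a shift of 1, s1 selects a shift of 2, giving shifts of 0..3
     * with log2(4) = 2 control lines. */
    uint8_t barrel_shift_right4(uint8_t in, unsigned s1, unsigned s0)
    {
        uint8_t stage1 = s0 ? (uint8_t)((in & 0x0Fu) >> 1) : (uint8_t)(in & 0x0Fu);
        uint8_t stage2 = s1 ? (uint8_t)(stage1 >> 2)       : stage1;
        return stage2;   /* the whole shift completes in one pass (one cycle in hardware) */
    }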
Problem 4: A Barrel Shifter is to be designed with 16 inputs for left
shifts from 0 to 15 bits. How many control lines are required to
implement the shifter?
Solution: As the number of bits used to represent the input is 16,
log2 16 = 4 control inputs are required
Multiply and Accumulate Unit
Most of the DSP applications require the computation of the sum of
the products of a series of successive multiplications
In order to implement such functions a special unit called a Multiply
and Accumulate (MAC) unit is required
A MAC consists of a multiplier and a special register called
Accumulator
MACs are used to implement the functions of the type A+BC
A typical MAC unit is as shown in the figure below
A MAC Unit
Although addition and multiplication are two different operations,
they can be performed in parallel
While the multiplier is computing the current product, the accumulator
can add the product of the previous multiplication to the running sum
Thus if N products are to be accumulated, N-1 multiplications can
overlap with N-1 additions
During the very first multiplication, accumulator will be idle and
during the last accumulation, multiplier will be idle
Thus N+1 clock cycles are required to compute the sum of N
products
Problem 5: If a sum of 256 products is to be computed using a
pipelined MAC unit, and if the MAC execution time of the unit is
100nsec, what will be the total time required to complete the
operation?
Solution: As N = 256 in this case, the MAC unit requires
N + 1 = 257 execution cycles
As a single MAC execution takes 100 nsec, the total time required
will be 257 x 100 nsec = 25.7 μsec
Overflow and Underflow
While designing a MAC unit, attention has to be paid to the word
sizes encountered at the input of the multiplier and the sizes of the
add/subtract unit and the accumulator, as there is a possibility of
overflow and underflows
Overflow/underflow can be avoided by using any of the following
methods viz
a. Using shifters at the input and the output of the MAC
b. Providing guard bits in the accumulator
c. Using saturation logic
a. Shifters
Shifters can be provided at the input of the MAC to normalize the
data and at the output to denormalize the same
b. Guard bits
As the normalization process does not yield an accurate result, it is not
desirable for some applications
In such cases we have another alternative by providing additional
bits called guard bits in the accumulator so that there will not be
any overflow error
Here the add/subtract unit also has to be modified appropriately to
manage the additional bits of the accumulator
Problem 6: Consider a MAC unit whose inputs are 16 bit numbers.
If 256 products are to be summed up in this MAC, how many
guard bits should be provided for the accumulator to prevent
overflow condition from occurring?
Solution: As it is required to calculate the sum of 256 16-bit numbers,
the sum can be as long as (16 + log2 256) = 24 bits
Hence the accumulator should be capable of handling these 24 bits
Thus the guard bits required will be 24 - 16 = 8 bits
The block diagram of the modified MAC after considering the guard
or extension bits is as shown in the figure below
MAC Unit with Guard Bits
Problem 7: What should be the minimum width of the accumulator
in a DSP device that receives 10 bit A/D samples and is required
to add 64 of them without causing an overflow?
Solution: As it is required to calculate the sum of 64 10-bit numbers,
the sum can be as long as (10 + log2 64) = 16 bits
Hence the accumulator should be capable of handling these 16 bits
Thus the guard bits required will be 16 - 10 = 6 bits
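A C sketch of Problem 7 (the sample value is chosen as the worst case): 64 ten-bit samples are accumulated in a 16-bit accumulator, i.e. 10 data bits plus log2 64 = 6 guard bits, without overflow

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int16_t acc = 0;                 /* 16-bit accumulator: 10 data bits + 6 guard bits */

        for (int i = 0; i < 64; i++) {
            int16_t sample = 511;        /* worst-case positive 10-bit A/D sample           */
            acc += sample;               /* the guard bits absorb the growth, no overflow   */
        }

        printf("sum = %d\n", acc);       /* 64 x 511 = 32704 still fits in 16 bits          */
        return 0;
    }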
c. Saturation Logic
Overflow/underflow will occur if the result goes beyond the most
positive number or below the least negative number that the
accumulator can handle
It can be resolved by loading the accumulator with the most positive
number it can handle at the time of overflow, or with the least negative
number it can handle at the time of underflow
This method is called saturation logic
A schematic diagram of saturation logic is as shown in figure below
In saturation logic, as soon as an overflow or underflow condition is
satisfied the accumulator will be loaded with the most positive or
least negative number overriding the result computed by the MAC
unit
A Schematic Diagram of the Saturation Logic
Examples of overflow and underflow are illustrated below
Overflow
 +3    0011
 +2    0010
 +5    0101    Here Cin = 0 = Cout, hence no overflow

 +5    0101
 +4    0100
 +9    1001    Here Cin = 1, Cout = 0, Cin ≠ Cout, hence overflow

Underflow
 -3    1101
 -2    1110
 -5  1 1011    Here Cin = 1, Cout = 1, Cin = Cout, hence no underflow

 -5    1011
 +2    0010
 -3    1101    Here Cin = 0 = Cout, hence no underflow

 -5    1011
 -4    1100
 -9  1 0111    Here Cin = 0, Cout = 1, Cin ≠ Cout, hence underflow
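The saturation behaviour described above can be sketched in C for a 16-bit accumulator (an illustration, not any particular device's logic)

    #include <stdint.h>

    #define ACC_MAX  INT16_MAX          /* most positive value the accumulator can hold  */
    #define ACC_MIN  INT16_MIN          /* least negative value the accumulator can hold */

    /* Saturating 16-bit addition: the raw sum is computed in a wider type and
     * checked against the 16-bit range; the hardware equivalent is the
     * Cin != Cout test at the sign-bit position shown above. */
    int16_t sat_add16(int16_t a, int16_t b)
    {
        int32_t sum = (int32_t)a + (int32_t)b;   /* raw result                   */

        if (sum > ACC_MAX) return ACC_MAX;       /* overflow  -> most positive   */
        if (sum < ACC_MIN) return ACC_MIN;       /* underflow -> least negative  */
        return (int16_t)sum;                     /* in range  -> keep MAC result */
    }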
Arithmetic and Logic Unit
A typical DSP device should be capable of handling arithmetic
instructions like ADD, SUB, INC, DEC etc and logical operations
like AND, OR, NOT, XOR etc
The block diagram of a typical ALU for a DSP is as shown in the
figure below
It consists of status flag register, register file and multiplexers
Status Flags: ALU includes circuitry to generate status flags after
arithmetic and logic operations - flags include sign, zero, carry and
overflow
Arithmetic Logic Unit of a DSP
Overflow Management: Depending on the status of overflow and
sign flags, the saturation logic can be used to limit the accumulator
content
Register File: Instead of moving data in and out of the memory
during the operation, for better speed, a large set of general purpose
registers are provided to store the intermediate results
Bus Architecture and Memory
Conventional microprocessors use Von Neumann architecture for
memory management wherein the same memory is used to store both
the program and data (Fig below)
Von Neumann Architecture
Although this architecture is simple, it takes a larger number of
processor cycles to execute a single instruction, as the
same bus is used for both data and program
In order to increase the speed of operation, separate memories are
used to store the program and data, and a separate set of data and address
buses is provided for each memory; this architecture is called the
Harvard architecture
It is as shown in figure below
Harvard Architecture
Although the use of separate memories for data and instructions
speeds up the processing, it will not completely solve the problem
As many DSP instructions require more than one operand, the use
of a single data memory leads to fetching the operands one after the
other, thus increasing the processing delay
This problem can be overcome by using two separate data memories
for storing operands separately, thus in a single clock cycle both the
operands can be fetched together
Although the architecture improves the speed of operation, it
requires more hardware and interconnections, thus increasing the
cost and complexity of the system
Harvard Architecture with Dual Data Memory
Therefore there should be a trade-off between cost and speed
while selecting the memory architecture for a DSP
On-chip Memories
In order to have a faster execution of the DSP functions, it is
desirable to have some memory located on chip
As dedicated buses are used to access the memory, on chip memories
are faster
Speed and size are the two key parameters to be considered with
respect to the on-chip memories
Speed
On-chip memories should match the speeds of the ALU operations in
order to maintain the single cycle instruction execution of the DSP
Size
In a given area of the DSP chip, it is desirable to implement as many
DSP functions as possible
Thus the area occupied by the on-chip memory should be minimal
so that there is scope for implementing more DSP functions
on-chip
Organization of On-chip Memories
Ideally, the whole memory required for the implementation of any DSP
algorithm has to reside on-chip so that the whole processing can be
completed in a single execution cycle
Although this looks like a better solution, it consumes more space on the
chip, reducing the scope for implementing other functional blocks on-chip,
which in turn reduces the speed of execution; hence some
other alternatives are required
The following are some other ways in which the on-chip memory
can be organized
a. As many DSP algorithms require instructions to be executed
repeatedly, the instructions can be stored in external memory and, once
fetched, can reside in an on-chip instruction cache
b. The access times of on-chip memories should be sufficiently small
so that they can be accessed more than once in every execution cycle
c. On-chip memories can be configured dynamically so that they can
serve different purposes at different times
Data Addressing Capabilities
Data accessing capability of a programmable DSP device is
configured by means of its addressing modes
The summary of the addressing modes used in DSP is as shown in
the table below
Table 2.1 DSP Addressing Modes

Addressing Mode    | Operand                                    | Sample Format | Operation
Immediate          | Immediate value                            | ADD #imm      | #imm + A → A
Register           | Register contents                          | ADD reg       | reg + A → A
Direct             | Memory address                             | ADD mem       | mem + A → A
Register Indirect  | Memory contents, with address in register  | ADD *addreg   | *addreg + A → A
1. Immediate Addressing Mode: In this addressing mode, data is
included in the instruction itself
2. Register Addressing Mode: In this mode, one of the registers will be
holding the data and the register has to be specified in the instruction
3. Direct Addressing Mode: In this addressing mode, instruction holds
the memory location of the operand
4. Indirect Addressing Mode: In this addressing mode, the operand is
accessed using a pointer. A pointer is generally a register, which holds
the address of the location where the operand resides
Indirect addressing mode can be extended to incorporate automatic
increment or decrement capabilities, which has led to the following
addressing modes
Table: Indirect Addressing Modes

Addressing Mode  | Sample Format            | Operation
Post Increment   | ADD *addreg+             | A ← A + *addreg;  addreg ← addreg + 1
Post Decrement   | ADD *addreg-             | A ← A + *addreg;  addreg ← addreg - 1
Pre Increment    | ADD +*addreg             | addreg ← addreg + 1;  A ← A + *addreg
Pre Decrement    | ADD -*addreg             | addreg ← addreg - 1;  A ← A + *addreg
Post_Add_Offset  | ADD *addreg, offsetreg+  | A ← A + *addreg;  addreg ← addreg + offsetreg
Post_Sub_Offset  | ADD *addreg, offsetreg-  | A ← A + *addreg;  addreg ← addreg - offsetreg
Pre_Add_Offset   | ADD offsetreg+, *addreg  | addreg ← addreg + offsetreg;  A ← A + *addreg
Pre_Sub_Offset   | ADD offsetreg-, *addreg  | addreg ← addreg - offsetreg;  A ← A + *addreg
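The register-indirect modes map naturally onto C pointer arithmetic; for example, the post-increment mode ADD *addreg+ behaves like the sketch below (function name illustrative)

    #include <stdint.h>

    /* Sum of a block of samples using post-increment register-indirect
     * addressing: the pointer plays the role of addreg, and *p++ reads the
     * operand and then increments the address, as in ADD *addreg+. */
    int32_t block_sum(const int16_t *p, int count)
    {
        int32_t a = 0;                  /* accumulator A                  */
        while (count-- > 0) {
            a += *p++;                  /* A ← A + *addreg;  addreg ← addreg + 1 */
        }
        return a;
    }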
Problem 8: What are the memory addresses of the operands in
each of the following cases of indirect addressing modes? In each
case, what will be the content of the addreg after the memory
access? Assume that the initial contents of the addreg and the
offsetreg are 0200h and 0010h, respectively
a. ADD *addreg-
b. ADD +*addreg
c. ADD offsetreg+,*addreg
d. ADD *addreg,offsetreg-
Solution:

Instruction             | Addressing Mode | Operand Address       | addreg Content after Access
ADD *addreg-            | Post Decrement  | 0200h                 | 0200h - 01h = 01FFh
ADD +*addreg            | Pre Increment   | 0200h + 01h = 0201h   | 0201h
ADD offsetreg+,*addreg  | Pre_Add_Offset  | 0200h + 0010h = 0210h | 0210h
ADD *addreg,offsetreg-  | Post_Sub_Offset | 0200h                 | 0200h - 0010h = 01F0h
Problem 9: Identify the addressing modes of the operands in each of
the following instructions
a. ADD #1234h
b. ADD 1234h
c. ADD *AR+
d. ADD offsetreg-,*AR
Solution:

Instruction            | Addressing Mode
ADD #1234h             | Immediate addressing mode
ADD 1234h              | Direct addressing mode
ADD *AR+               | Indirect addressing mode (post increment)
ADD offsetreg-,*AR     | Indirect addressing mode (Pre_Sub_Offset)
Special Addressing Modes
For the implementation of some real time applications in DSP,
normal addressing modes will not completely serve the purpose
Thus some special addressing modes are required for such
applications
1. Circular Addressing Mode
While processing the data samples coming continuously in a
sequential manner, circular buffers are used
In a circular buffer the data samples are stored sequentially from the
initial location till the buffer gets filled up
Once the buffer gets filled up, the next data samples will get stored
once again from the initial location
This process can go on forever as long as the data samples are processed
at a rate faster than the incoming data rate
Circular Addressing mode requires three registers viz
a. Pointer register to hold the current location (PNTR)
b. Start Address Register to hold the starting address of the buffer
(SAR)
c. End Address Register to hold the ending address of the buffer (EAR)
There are four special cases in this addressing mode
They are
a. SAR < EAR & updated PNTR > EAR
b. SAR < EAR & updated PNTR < SAR
c. SAR >EAR & updated PNTR > SAR
d. SAR > EAR & updated PNTR < EAR
The buffer length in the first two cases will be (EAR - SAR + 1),
whereas for the last two cases it will be (SAR - EAR + 1)
The pointer updating algorithm for the circular addressing mode is as
shown below
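Since the algorithm itself is not reproduced here, a minimal C sketch consistent with the four cases above is given (register names follow the text; it assumes the updated pointer overshoots the buffer by at most one buffer length)

    #include <stdint.h>

    /* Circular-buffer pointer update. SAR and EAR hold the start and end
     * addresses of the buffer; pntr is the pointer after the normal (linear)
     * address computation. */
    uint16_t circ_update(uint16_t pntr, uint16_t sar, uint16_t ear)
    {
        uint16_t len = (sar < ear) ? (uint16_t)(ear - sar + 1)   /* cases a, b */
                                   : (uint16_t)(sar - ear + 1);  /* cases c, d */

        uint16_t lo = (sar < ear) ? sar : ear;   /* lowest address of the buffer  */
        uint16_t hi = (sar < ear) ? ear : sar;   /* highest address of the buffer */

        if (pntr > hi) return (uint16_t)(pntr - len);   /* wrapped past the top     */
        if (pntr < lo) return (uint16_t)(pntr + len);   /* wrapped below the bottom */
        return pntr;                                    /* still inside the buffer  */
    }

Applying this to Problems 10 and 11 below reproduces the stated answers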
Four cases explained earlier are as shown in the figure below:
Case i) SAR < EAR and updated PNTR > EAR; Case ii) SAR < EAR and updated PNTR < SAR; Case iii) EAR < SAR and updated PNTR > SAR; Case iv) EAR < SAR and updated PNTR < EAR
Problem 10: A DSP has a circular buffer with the start and the end
addresses as 0200h and 020Fh respectively. What would be the
new values of the address pointer of the buffer if, in the course of
address computation, it gets updated to
a. 0212h
b. 01FCh
Solution: Buffer length = EAR - SAR + 1 = 020Fh - 0200h + 1 = 10h
a. New address pointer = updated pointer - buffer length = 0212h - 10h = 0202h
b. New address pointer = updated pointer + buffer length = 01FCh + 10h = 020Ch
Problem 11: Repeat the previous problem for SAR= 0210h and
EAR=0201h
Solution: Buffer length = SAR - EAR + 1 = 0210h - 0201h + 1 = 10h
a. New address pointer = updated pointer - buffer length = 0212h - 10h = 0202h
b. New address pointer = updated pointer + buffer length = 01FCh + 10h = 020Ch
2. Bit Reversed Addressing Mode
To implement FFT algorithms we need to access the data in a bit
reversed manner
Hence a special addressing mode called bit reversed addressing
mode is used to calculate the index of the next data to be fetched
It works as follows. Start with index 0. The present index can be
calculated by adding half the FFT length to the previous index in a
bit reversed manner, carry being propagated from MSB to LSB
Current index = Previous index +B (1/2 FFT size), where +B denotes bit-reversed addition (carry propagated from MSB to LSB)
Compute the indices for an 8-point FFT using Bit reversed
Addressing Mode
Start with index 0. Therefore the first index would be (000)
Next index can be calculated by adding half the FFT length, in this
case it is (100) to the previous index
i.e. Present Index= (000)+B (100)= (100)
Similarly the next index can be calculated as
Present Index= (100)+B (100)= (010)
The process continues till all the indices are calculated
The following table summarizes the calculation
Index in binary | Decimal value | Bit-reversed index | Decimal value
000             | 0             | 000                | 0
001             | 1             | 100                | 4
010             | 2             | 010                | 2
011             | 3             | 110                | 6
100             | 4             | 001                | 1
101             | 5             | 101                | 5
110             | 6             | 011                | 3
111             | 7             | 111                | 7
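A C sketch of the bit-reversed increment (nbits = log2 of the FFT size is an added parameter): half the FFT length is added with the carry propagating from the MSB towards the LSB, reproducing the sequence in the table

    #include <stdio.h>

    /* Add "half the FFT length" to idx with the carry propagating from MSB
     * towards LSB, generating the bit-reversed index sequence. */
    unsigned bit_reversed_increment(unsigned idx, unsigned nbits)
    {
        unsigned mask = 1u << (nbits - 1);   /* half the FFT length (MSB weight) */
        while (idx & mask) {                 /* propagate the carry towards LSB  */
            idx &= ~mask;
            mask >>= 1;
        }
        return idx | mask;                   /* set the first clear bit found    */
    }

    int main(void)
    {
        unsigned idx = 0;
        for (int i = 0; i < 8; i++) {        /* prints 0 4 2 6 1 5 3 7           */
            printf("%u ", idx);
            idx = bit_reversed_increment(idx, 3);
        }
        printf("\n");
        return 0;
    }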
Address Generation Unit
The address generation unit generates the addresses of the operands required to
carry out the operation
They have to work fast in order to satisfy the timing constraints
As the address generation unit has to perform some mathematical operations
in order to calculate the operand address, it is provided with a separate ALU
Address generation typically involves one of the following
operations
a. Getting value from immediate operand, register or a memory
location
b. Incrementing/ decrementing the current address
c. Adding/subtracting the offset from the current address
d. Adding/subtracting the offset from the current address and
generating new address according to circular addressing mode
e. Generating new address using bit reversed addressing mode
The block diagram of a typical address generation unit is as shown in
figure below
Address Generation Unit
Programmability and Program Control
A programmable DSP device should provide the programming
capability involving branching, looping and subroutines
The implementation of repeat capability should be hardware based so
that it can be programmed with minimal or zero overhead
A dedicated register can be used as a counter
In a normal subroutine call, the return address has to be stored on a stack,
thus requiring memory accesses for storing and retrieving the return
address, which in turn reduces the speed of operation
Hence a LIFO memory can be directly interfaced with the program
counter
1. Program Control
Like microprocessors, DSP also requires a control unit to provide
necessary control and timing signals for the proper execution of the
instructions
In microprocessors, the control is microcode-based: each
instruction is divided into microinstructions, which are stored as
microcode in a micro memory called the microstore
This microstore and microcode are not accessible to the programmer
The processor is designed such that only the control unit can access them
during the execution of instructions
Advantage: Easy to design and implement as less hardware is required
Disadvantage: Low speed of instruction execution, as executing
each instruction involves fetching the microcode from the microstore and
decoding it
As this mechanism is slower, it is not applicable for DSP
applications
Hence in DSPs the control is hardwired
Each instruction of the DSP acts as an input to the hardwired
control unit, which generates the control and timing signals
required to complete the instruction decode and execution
Advantage: Instruction execution time is much faster as compared to
microcoded design
Disadvantage: High hardware complexity; not easy to change the
design of control unit to incorporate additional features
2. Program Sequencer
It is a part of the control unit that generates, in sequence, the addresses
of the instructions to be fetched
It calculates the address of the next instruction to be fetched
The next address can be from one of the following sources
a. Program Counter
b. Instruction register in case of branching, looping and subroutine calls
c. Interrupt Vector table
d. Stack which holds the return address
The block diagram of a program sequencer is as shown in figure
below
Program Sequencer
The program sequencer should have the following circuitry:
a. Logic to update the PC after every fetch
b. A counter to hold the count in case of looping
c. A logic block to check the conditions for conditional jump instructions
d. Condition logic based on the status flags
Speed Issues
Fast execution of algorithms is an essential requirement of a digital
signal processing architecture
DSP architecture must include features that facilitate high speed of
operation and large throughputs
Some of the features that can increase the execution speed of the DSP
architecture are
1. Hardware Architecture
2. Parallelism
3. Pipelining
Hardware Architecture
The Harvard architecture, which separates the program and data
memories and provides separate buses for each, increases the speed of
program execution considerably
Dual data memories with individual buses for each help in accessing
two operands simultaneously
Multiple external memories require multiple buses external to the
DSP; in addition to being expensive, external buses are slow for
program access and execution
By providing on-chip memories and an instruction cache, program
execution is speeded up considerably
There are many techniques used in DSP architectures to increase
their speed of operation
Two of these techniques are – Parallelism and Pipelining
Parallelism
High speed of operation can be achieved in DSP by the provision of
Parallelism
Parallelism means doing several things at a time
Functional units may operate in parallel and increase the throughput
For example, instead of the same arithmetic unit being used to do
computations on data and address, a separate address arithmetic unit
can be provided to take care of address computations
This frees up the main arithmetic unit to concentrate on data
computations alone and thereby increases the throughput
Another example may be the provision of multiple memories and
multiple buses to fetch an instruction and operands simultaneously
Algorithms can perform more than one operation at the same time,
such as adding while carrying out a multiply, shifting while reading
data from memory etc
The architecture should be such that instructions and data required
for a computation are fetched from the memory simultaneously
With parallelism, the following operations can be performed in a
single clock cycle
1. Fetch instructions and multiple data required for the computation
2. Shift data as they are fetched in order to accomplish scaling
3. Carryout a multiplication operation on the fetched data
4. Add the product to the previously computed result in the
accumulator
5. Save the accumulator contents in the memory storage, if required,
and
6. Compute new addresses for the instruction and data required for
the next operation
Pipelining
An architectural feature to increase the speed of the DSP algorithm is
pipelining
In a pipelined architecture, an instruction to be executed is broken
into a number of steps
A separate unit of the architecture performs each of these steps
While the first of these units performs the first step on the current
instruction, the second unit performs the second step on the previous
instruction, the third unit performs the third step on the instruction
prior to that, and so on
If p steps were required to complete the execution of each
instruction, it would take p units of time for the complete execution
of each instruction
Since all the units will work all the time, one output will flow out of
the architecture at the end of each time unit
The throughput is maintained as one instruction per unit time
Limitations
1. Each instruction has to be divided into steps that take equal amounts of
time, and a separate unit of the architecture has to be designed for each step
2. The extra time required at the start of the algorithm execution results in
pipeline latency
For example, Let us assume that the execution of an instruction can
be broken into 5 steps:
1. Instruction fetch
2. Instruction decode
3. Operand fetch
4. Execute
5. Save the result
Figure shows how a pipelined processor will handle this
For simplicity, we will assume that all the steps take equal amounts
of time
From the figure, we observe that the output corresponding to the
first instruction is available after 5 units of time
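With the stated assumption that all steps take equal time units, the overlap can be tabulated as follows (IF = instruction fetch, ID = instruction decode, OF = operand fetch, EX = execute, SR = save result):

Time unit      1    2    3    4    5    6    7
Instruction 1  IF   ID   OF   EX   SR
Instruction 2       IF   ID   OF   EX   SR
Instruction 3            IF   ID   OF   EX   SR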
System level parallelism and pipelining
The parallelism and pipelining concepts can be extended to the
implementation of DSP algorithms
Consider the example of an 8-tap (8 coefficients) FIR filter given by
y(n) = h(0)x(n) + h(1)x(n-1) + … + h(7)x(n-7)
It can be realized using different methods
1. Implementation using a single MAC unit
2. Pipelined Implementation using eight multipliers and eight
accumulators
3. Parallel Implementation using two MAC units
Implementation using a single MAC unit
If only one multiplier and accumulator is available, it must be used
8 times to compute the eight product terms and find their sum
Each input sample is delayed from the previous sample by 8T
T is the time taken by the multiplier and accumulator to compute
one product term and add it to the previously accumulated sum in
the accumulator
Input samples and the filter coefficients are fed to the multiplier
through multiplexers
As each product term is generated, it is added to the previously
accumulated sum in the MAC unit
Pipelined Implementation using eight multipliers and
eight accumulators
The implementation of the FIR filter of Eqn can be speeded up if
more multipliers and accumulators are available
Let us assume that there are eight multipliers and eight
accumulators connected in a pipelined structure, as shown in
Figure
Each multiplier computes one product term and passes it on to the
corresponding accumulator, which in turn adds it to the summation
passed on from the previous accumulator
Since all the multipliers and accumulators work all the time, a new
output sample is generated once every T units of time
This is the time required by the multiplier and accumulator to
compute one product term and add it to the sum passed on from
the previous stage of the pipeline
This implementation can take in a new input sample once every T
units of time and generate an output sample at the same rate
In other words, this filter implementation works 8 times faster than
the simple one MAC implementation
Parallel Implementation using two MAC units
A third implementation of the FIR filter of Eqn is shown in Figure
This implementation uses two MAC units and an adder at the output
Each MAC computes four of the eight product terms
Input samples and the filter coefficients are fed to the MACs using
multiplexers that are controlled such that correct combinations of
samples and the corresponding filter coefficients are fed to the two
MACs at any given time
If T time units are required to compute one pair of products and add
them to the previously accumulated sum in the MAC units, it will
require 4T units of time to generate the final output by adding the
outputs of the two MACs
At this time, a new input sample can be applied to the filter for
computation of the next output sample
The speed of this implementation is 2 times that of one MAC
implementation of Figure and one fourth of that of the pipelined
eight- multiplier, eight-accumulator implementation
The maximum rate at which input samples can be applied to this
filter implementation is 2 times that of the first implementation and
one fourth that of the second
The table below summarizes the performance of the three implementations
described

Implementation                             | Hardware                      | Time per output sample
Single MAC unit                            | 1 multiplier, 1 accumulator   | 8T
Pipelined (8 multipliers, 8 accumulators)  | 8 multipliers, 8 accumulators | T
Parallel (2 MAC units)                     | 2 MAC units and an adder      | 4T
It is possible to achieve a higher speed of implementation by the use of
parallelism and/or pipelining, but this increases the circuit complexity
Features for external interfacing
It is important for a DSP device to be able to communicate with the
outside world
The outside world provides the signal to be processed and receives
the processed signal
Most of the peripherals used in conventional microprocessors are
also needed in a DSP system
Peripherals include interfaces for interrupts, direct memory access,
serial I/O, and parallel I/O