Floating and Fixed Point Arithmetic
[Figure: system design flow. Blocks shown: S/W or H/W partition, fixed-point conversion, fixed-point S/W implementation, RTL Verilog H/W implementation, S/W or H/W co-verification, functional verification, synthesis, timing and functional verification, layout, system integration.]
Digital Design of Signal Processing Systems, John Wiley & Sons by Dr. Shoab A. Khan 4
System Design Flow
The requirements and specifications of the application are captured
The algorithms are then developed in double-precision floating-point format in Matlab or C/C++
Digital Signal Processors
2's Complement Arithmetic
[Figure: numbers are classified as positive numbers and negative numbers.]
Example
Bit weights: -2^3  2^2  2^1  2^0
1011 = -8 + 2 + 1 = -5
Equivalent Representation
Many design tools do not display numbers as 2's complement signed numbers
A signed number is instead displayed as its equivalent unsigned number
The equivalent unsigned value of an N-bit negative number a is
$2^N - |a|$
Example
-5 = 1011
N = 4
a = -5
$2^4 - |-5| = 16 - 5 = +11$
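The rule above can be sketched in C as follows. The function name `equivalent_unsigned` and its argument checks are assumptions made for illustration; they are not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent unsigned value of an N-bit 2's complement number a.
   A non-negative a is returned unchanged; a negative a maps to 2^N - |a|.
   Assumes 1 <= n_bits <= 31 and that a fits in n_bits. */
uint32_t equivalent_unsigned(int32_t a, unsigned n_bits)
{
    assert(n_bits >= 1 && n_bits <= 31);
    int32_t hi = (int32_t)(1u << (n_bits - 1)) - 1;   /* largest positive value */
    int32_t lo = -hi - 1;                             /* most negative value */
    assert(a >= lo && a <= hi);

    if (a >= 0)
        return (uint32_t)a;
    return (1u << n_bits) - (uint32_t)(-a);           /* 2^N - |a| */
}
```

With N = 4, the call `equivalent_unsigned(-5, 4)` reproduces the value +11 from the example.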
Four-bit representation of 2's complement and equivalent unsigned numbers
Decimal    -2^3  2^2  2^1  2^0    Equivalent unsigned
number                            number
   0         0    0    0    0        0
  +1         0    0    0    1        1
  +2         0    0    1    0        2
  +3         0    0    1    1        3
  +4         0    1    0    0        4
  +5         0    1    0    1        5
  +6         0    1    1    0        6
  +7         0    1    1    1        7
  -8         1    0    0    0        8
  -7         1    0    0    1        9
  -6         1    0    1    0       10
  -5         1    0    1    1       11
  -4         1    1    0    0       12
  -3         1    1    0    1       13
  -2         1    1    1    0       14
  -1         1    1    1    1       15
Computing the 2's Complement of a Signed Number
Scaling and Sign Extension
Sign Extension
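A number is extended to more bits by replicating its sign bit into the added MSBs: 0s for a positive number, 1s for a negative number
Signed value remains the same
Equivalent unsigned value is changed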
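As a minimal sketch of sign extension, the C function below widens an N-bit 2's complement pattern to 32 bits. The name `sign_extend` is hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret the low n_bits of raw as a 2's complement number and
   sign-extend it to 32 bits. Assumes 1 <= n_bits <= 32. */
int32_t sign_extend(uint32_t raw, unsigned n_bits)
{
    assert(n_bits >= 1 && n_bits <= 32);
    uint32_t sign = 1u << (n_bits - 1);
    uint32_t mask = (n_bits == 32) ? 0xFFFFFFFFu : ((1u << n_bits) - 1u);
    raw &= mask;
    /* Flipping the sign bit and subtracting its weight replicates the
       sign into all upper bits; 64-bit arithmetic avoids any overflow. */
    return (int32_t)((int64_t)(raw ^ sign) - (int64_t)sign);
}
```

The 4-bit pattern 1011 stays -5 after extension, while its equivalent unsigned value changes from 11 (4 bits) to 251 (8 bits, 1111_1011).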
Dropping Redundant Sign bits
Floating Point Format
$x = (-1)^s \times 1.m \times 2^{e-b}$
where s is the sign bit, e the biased exponent, m the mantissa and b the bias
[Figure: bit fields of the format in order: s, e, m.]
Example Floating Point Representation
0 10000010 11010000_00000000_0000000
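To check the bit pattern above, the fields can be packed into a 32-bit word and reinterpreted as a `float`. This is only a sketch: the helper name `float_from_fields` is hypothetical, and the code assumes the platform's `float` is the IEEE 754 single-precision format (1 sign bit, 8 exponent bits with bias 127, 23 mantissa bits).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a float from its sign, biased exponent and mantissa fields.
   Assumption: float is IEEE 754 binary32 on this platform. */
float float_from_fields(uint32_t s, uint32_t e, uint32_t m)
{
    assert(sizeof(float) == sizeof(uint32_t));
    assert(s <= 1u && e <= 0xFFu && m <= 0x7FFFFFu);
    uint32_t bits = (s << 31) | (e << 23) | m;
    float x;
    memcpy(&x, &bits, sizeof x);   /* reinterpret the bits without aliasing issues */
    return x;
}
```

For s = 0, e = 1000_0010 (130) and m = 1101 followed by zeros, the formula gives $1.1101_2 \times 2^{130-127} = 1.8125 \times 8 = 14.5$, which is what `float_from_fields(0, 0x82, 0x680000)` returns.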
Floating-point Arithmetic Addition
Example: Floating-point Addition
Floating-point Multiplication
S0: Add the two exponents e1 and e2 and subtract the bias once
Fixed-point format
Algorithms developed in double-precision floating-point format are converted
into fixed-point format for mapping onto low-cost DSPs and
application-specific H/W designs
Qn.m Format for Fixed-point Arithmetic
A Qn.m number has n integer bits, including the sign bit, and m fraction bits
[Figure: bit layout of a Qn.m number, marking the sign bit, the integer bits and the fraction bits.]
Qn.m Positive Number
Conversion to Qn.m
(assume 10 bits)
[Figure: a 10-bit word with the sign bit, the integer bit and the fractional bits marked.]
Example
-2^1  2^0 . 2^-1  2^-2  2^-3  2^-4  2^-5  2^-6  2^-7
0x1D0
Q2.10
Fixed-point numbers in COTS DSPs
Commercial off-the-shelf (COTS) processors usually
have 16 bits to represent a number
In C, a programmer can typically define 8-, 16- or 32-bit numbers as
char, short and int/long respectively (the exact widths depend on the compiler)
If a variable requires a precision that cannot
be defined in C, the number is defined in the next higher precision
For example, an 18-bit number should be defined as a 32-bit
integer
High-precision arithmetic is performed using low-precision
arithmetic operations
32-bit addition requires two 16-bit additions
32-bit multiplication requires four 16-bit multiplications
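The last two points can be sketched in C for unsigned operands. The function names are hypothetical, and the code models the 16-bit operations with wider C types rather than reproducing any particular DSP's instructions.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit addition built from two 16-bit additions and a carry. */
uint32_t add32_using_16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a & 0xFFFFu) + (b & 0xFFFFu);      /* first 16-bit addition */
    uint32_t carry = lo >> 16;                        /* carry out of the low half */
    assert(carry <= 1u);
    uint32_t hi = (a >> 16) + (b >> 16) + carry;      /* second 16-bit addition */
    return ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);   /* result wraps modulo 2^32 */
}

/* 32-bit by 32-bit multiplication built from four 16 x 16 products. */
uint64_t mul32_using_16(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint64_t ll = (uint64_t)(a_lo * b_lo);            /* weight 2^0  */
    uint64_t lh = (uint64_t)(a_lo * b_hi);            /* weight 2^16 */
    uint64_t hl = (uint64_t)(a_hi * b_lo);            /* weight 2^16 */
    uint64_t hh = (uint64_t)(a_hi * b_hi);            /* weight 2^32 */

    return ll + ((lh + hl) << 16) + (hh << 32);
}
```

Signed operands need extra care with the sign of the high halves, which this sketch leaves out.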
Floating Point to Fixed Point Conversion
Serialize the floating-point code to separate all atomic computations
and assignments to variables
Insert range directives after each serial floating-point computation
Run the serialized implementation with range directives for the full set
of possible inputs
Convert all floating-point variables to fixed-point format using the
maximum and minimum values for each variable
The integer part for var_i is determined from these maximum and minimum values
Flow: Floating-Point Algorithm -> Range Estimation -> Integer part determination -> Fixed-Point Algorithm -> HW-SW Implementation -> Target System
Range Determination for Qn.m Format
Note the Min and Max values each
variable takes, for Qn.m format
determination
[Figure: table of variable ranges with MIN and MAX columns.]
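A range directive can be sketched as a small min/max tracker called after each computation. Everything here is an assumption made for illustration: the type `range_t`, the functions `range_note` and `integer_bits`, and the rule used for n (the smallest n, sign bit included, whose range $[-2^{n-1}, 2^{n-1})$ covers the observed values). The slides' own formula for the integer part may differ.

```c
#include <assert.h>

/* Observed range of one variable. */
typedef struct {
    double min;
    double max;
    int seen;      /* 0 until the first value has been noted */
} range_t;

/* Range directive: call after every assignment to the variable. */
void range_note(range_t *r, double v)
{
    assert(r != 0);
    if (!r->seen) {
        r->min = r->max = v;
        r->seen = 1;
        return;
    }
    if (v < r->min) r->min = v;
    if (v > r->max) r->max = v;
}

/* Smallest n (sign bit included) with -2^(n-1) <= min and max < 2^(n-1).
   The loop is capped at 64 bits so that infinite inputs cannot hang it. */
unsigned integer_bits(const range_t *r)
{
    assert(r != 0 && r->seen);
    unsigned n = 1;
    double limit = 1.0;                     /* 2^(n-1) */
    while (n < 64 && (r->min < -limit || r->max >= limit)) {
        limit *= 2.0;
        n++;
    }
    return n;
}
```

A variable observed between -1.7 and 1.2 would need n = 2 under this rule, that is, a Q2.m format.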
Floating-point to Fixed-point Conversion
Multiply the floating-point number by 2^m, which shifts m fractional bits to the integer part, then truncate or round the
result
In Matlab, the conversion is done as
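A minimal C sketch of the same scale-and-round step is given here; it is not the slide's Matlab listing. The names `to_fixed` and `from_fixed` are hypothetical, rounding is half away from zero, and no saturation is applied, so the caller is assumed to pass values that fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Convert x to a fixed-point integer with m fraction bits, rounding to nearest. */
int32_t to_fixed(double x, unsigned m)
{
    assert(m <= 30);
    double scaled = x * (double)(1u << m);       /* shift m fraction bits into the integer part */
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;      /* round half away from zero */
    assert(scaled > -2147483648.0 && scaled < 2147483648.0);
    return (int32_t)scaled;                      /* the cast truncates toward zero */
}

/* Value represented by the fixed-point integer q with m fraction bits. */
double from_fixed(int32_t q, unsigned m)
{
    assert(m <= 30);
    return (double)q / (double)(1u << m);
}
```

For example, `to_fixed(3.14159265358979, 5)` gives 101, which represents 101/32 = 3.15625.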
Matlab Supports Fixed Point Arithmetic
All the attributes of a fixed-point variable can be
set for fixed-point arithmetic
These attributes are shown here
[Data]
PI =
3.1563
[Numeric type]
DataType: Fixed
Scaling: Binary Point
Signed: true
WordLength: 8
FractionLength: 5
[Fixed-point math]
RoundMode: round
OverflowMode: saturate
ProductMode: Full Precision
MaxProductWordLength: 128
SumMode: Full Precision
MaxSumWordLength: 128
CastBeforeSum: true
Arithmetic: Addition in Q Format
Addition of two fixed-point numbers a and b of Qn1.m1 and
Qn2.m2 formats, respectively, results in a Qn.m format
number, where n is the larger of n1 and n2 and m is the
larger of m1 and m2.
[Example figure: the two operands are aligned at the implied decimal point before they are added.]
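The alignment step can be sketched in C for 16-bit operands held as raw integers. The function `q_add` and its interface are assumptions for illustration. The sum is returned in a 32-bit container, which also leaves room for the extra integer bit that an addition can produce.

```c
#include <assert.h>
#include <stdint.h>

/* Add a (m1 fraction bits) and b (m2 fraction bits).
   The result has m = max(m1, m2) fraction bits, reported through m_out. */
int32_t q_add(int16_t a, unsigned m1, int16_t b, unsigned m2, unsigned *m_out)
{
    assert(m_out != 0 && m1 <= 15 && m2 <= 15);
    unsigned m = (m1 > m2) ? m1 : m2;

    /* Align the implied binary points by appending zeros on the right.
       Multiplying by a power of two avoids left-shifting negative values. */
    int32_t a_aligned = (int32_t)a * (int32_t)(1 << (m - m1));
    int32_t b_aligned = (int32_t)b * (int32_t)(1 << (m - m2));

    *m_out = m;
    return a_aligned + b_aligned;
}
```

Adding 3.25 in Q3.2 (raw 13) to 1.125 in Q2.3 (raw 9) gives raw 35 with 3 fraction bits, that is, 35/8 = 4.375.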
Multiplication in Q Format
Unsigned by Unsigned
Signed by Unsigned
Unsigned by Signed
The unsigned multiplicand is changed to a signed positive number by extending it with a 0 sign bit
Signed by Signed
Sign extend all partial products
Take the 2's complement of the last partial product if the multiplier is a
negative number
The MSB of the product is a redundant sign bit
Remove this bit by shifting the product to the left; the product is then in
Q(n1 + n2 - 1).(m1 + m2 + 1) format
Corner Case:
Signed-Signed Fractional Multiplication
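Both operands are 1 0 0 = Q1.2 = -1
0 0 0 0 0 0
0 0 0 0 0 X
0 1 0 0 X X   (2's complement of the last partial product)
The sum 0 1 0 0 0 0 = Q2.4 = +1 is correct, but dropping the redundant sign bit gives 1 0 0 0 0 0 = Q1.5 = -1, an overflow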
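The corner case can be handled by saturating the one product that overflows. The sketch below does this for Q1.15 operands; the function name `q15_mul_to_q31` is hypothetical, and the code is an illustration of the idea rather than the routine shown on the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Signed fractional multiply: Q1.15 x Q1.15 -> Q1.31.
   The raw product is Q2.30 with a redundant sign bit, which is removed
   by a left shift of one. Only (-1) x (-1) cannot be represented after
   the shift, so that case saturates to the largest positive value. */
int32_t q15_mul_to_q31(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;      /* Q2.30 product */
    if (p == 0x40000000) {                    /* +1.0: occurs only for -1 x -1 */
        assert(a == INT16_MIN && b == INT16_MIN);
        return INT32_MAX;                     /* saturate just below +1.0 */
    }
    return p * 2;                             /* drop the redundant sign bit */
}
```

For example, 0.5 x 0.5 (raw 16384 each) gives 0x20000000, which is 0.25 in Q1.31.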
Fixed Point Multiplication
return(L_var_out);
}
Fractional and Integer Arithmetic and truncation
The product of a 4x4-bit multiplication can easily be trimmed to 4-bit precision for
fractional multiplication, but for integer multiplication the trimming may cause overflow
Unified Multiplier
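The multiplier supports integer and fractional multiplication
All types of operands: signed and unsigned multiplication
[Figure: unified multiplier. Each 16-bit operand a and b is extended to a 17-bit operand (Op1, Op2) by a mux that, under a Signed/Unsigned control, selects either 0 or the operand's MSB (a[15], b[15]) as the extra bit.]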
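The operand extension of the unified multiplier can be modelled in C. This is a behavioural sketch of the signed/unsigned handling only; the function name `unified_mul` and its flag arguments are assumptions, and the fractional mode is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 17 x 17 signed multiplier that serves all operand types.
   Each 16-bit operand gets a 17th bit: its MSB if it is signed, 0 if unsigned. */
int64_t unified_mul(uint16_t a, int a_signed, uint16_t b, int b_signed)
{
    /* Extending with the MSB equals subtracting 2^16 when the MSB is set. */
    int32_t op1 = (int32_t)a - ((a_signed && (a & 0x8000u)) ? 65536 : 0);
    int32_t op2 = (int32_t)b - ((b_signed && (b & 0x8000u)) ? 65536 : 0);
    assert(op1 >= -32768 && op1 <= 65535);
    assert(op2 >= -32768 && op2 <= 65535);
    return (int64_t)op1 * (int64_t)op2;       /* one signed multiply covers every case */
}
```

The bit pattern 0xFFFF is treated as -1 when flagged signed and as 65535 when flagged unsigned, so the same multiplier handles all four operand combinations.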
Bit Growth in Fixed-point Arithmetic
y[n] = a y[n-1] + x[n]
Bit Growth in Q-format arithmetic
For an LTI system, Schwarz's inequality is used for estimating the bit
growth
$|y[n]| \le \sqrt{\sum_n h^2[n] \sum_n x^2[n]}$
Truncation
In multiplication of two Q-format numbers, the number
of bits in the product increases
We sacrifice precision by discarding some low-order
bits of the product
Qn1.m1 is truncated to Qn1.m2 where m2 < m1
8'b0111_0110 (Q4.4)
4 + 2 + 1 + 0.25 + 0.125 = 7.375
Truncating it to Q4.2 results in
6'b0111_01 = 7.25
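Truncation amounts to discarding the m1 - m2 least significant bits. A minimal C sketch, with the hypothetical name `q_truncate`, uses floor division so that negative values behave as they do when bits are dropped from a 2's complement word.

```c
#include <assert.h>
#include <stdint.h>

/* Truncate x from m1 fraction bits to m2 fraction bits (m2 < m1). */
int32_t q_truncate(int32_t x, unsigned m1, unsigned m2)
{
    assert(m2 < m1 && m1 <= 30);
    int32_t step = (int32_t)1 << (m1 - m2);   /* weight of the dropped bits */
    int32_t q = x / step;                     /* C division rounds toward zero */
    if (x % step < 0)
        q -= 1;                               /* adjust to floor for negative x */
    return q;
}
```

`q_truncate(0x76, 4, 2)` returns 0x1D, which is 0111_01 = 7.25 in Q4.2.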
Rounding with Truncation
  0 1 1 1_0 1 1 1   in Q4.4 is 7.4375
+ 0 0 0 0_0 0 1 0   add 1 to the right of the truncation point for rounding
= 0 1 1 1_1 0 0 1
  0 1 1 1_1 0       = 7.5 in Q4.2 format
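The add-then-truncate procedure can be sketched in C as below. The function name `q_round` is hypothetical, and possible overflow of the rounded result into a wider integer part is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Round x from m1 fraction bits to m2 fraction bits (m2 < m1) by adding 1
   at the bit just right of the truncation point and then truncating. */
int32_t q_round(int32_t x, unsigned m1, unsigned m2)
{
    assert(m2 < m1 && m1 <= 30);
    int64_t step = (int64_t)1 << (m1 - m2);
    int64_t y = (int64_t)x + step / 2;        /* the added 1 has weight step/2 */
    int64_t q = y / step;
    if (y % step < 0)
        q -= 1;                               /* floor, as in 2's complement truncation */
    return (int32_t)q;
}
```

`q_round(0x77, 4, 2)` returns 0x1E, which is 0111_10 = 7.5 in Q4.2.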
Equivalent Q formats
In many cases the Q format of a number is to be changed
Example
Overflow introduces an error equal to the dynamic range of
the number
[Figure: overflow characteristic Q(x) of a 3-bit 2's complement number. The output levels run from -4 (100) to 3 (011); an input x outside this range wraps around, so x = 4 maps back to 100.]
Saturation clamps the value to a maximum positive or
minimum negative level
[Figure: saturation characteristic Q(x) of a 3-bit 2's complement number. The output levels run from -4 (100) to 3 (011); an input x above 3 stays at 011 and an input below -4 stays at 100.]
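The two behaviours can be compared with a short C sketch that maps a wide result onto an n-bit 2's complement range. The function names `wrap_to_bits` and `saturate_to_bits` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Overflow: keep only the low n bits, so the value wraps around. */
int32_t wrap_to_bits(int32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    int64_t range = (int64_t)1 << n;          /* dynamic range, 2^n */
    int64_t half = range / 2;
    int64_t r = ((int64_t)x + half) % range;
    if (r < 0)
        r += range;                           /* make the remainder non-negative */
    return (int32_t)(r - half);
}

/* Saturation: clamp to the largest positive or most negative level. */
int32_t saturate_to_bits(int32_t x, unsigned n)
{
    assert(n >= 1 && n <= 31);
    int32_t hi = (int32_t)(((int64_t)1 << (n - 1)) - 1);
    int32_t lo = -hi - 1;
    if (x > hi) return hi;
    if (x < lo) return lo;
    return x;
}
```

For a 3-bit number and x = 4, wrapping gives -4, an error of 8 (the full dynamic range), whereas saturation gives 3, an error of 1.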
Summary
Capturing requirements and specifications is the first step in system design
A digital system usually consists of hybrid technologies and uses GPPs,
DSPs, FPGAs and ASICs
The algorithms are developed in double-precision floating-point format
Floating-point H/W and DSPs are expensive; for DSP applications
fixed-point arithmetic is preferred
The floating-point code is converted to fixed-point format and all variables
are defined as Qn.m format numbers
Fixed-point arithmetic results in bit growth; the results of arithmetic
computations are rounded and truncated
In IIR filter implementation, 2nd-order and parallel forms minimize
the effects of coefficient quantization
Implementation of linear-phase (LP) FIR filters should use the symmetry or anti-symmetry of
the coefficients
Block floating point can improve the precision for block processing of data