2015 International Conference on VLSI Systems, Architecture, Technology and Applications (VLSI-SATA)
DESIGN AND IMPLEMENTATION OF FAST
FLOATING POINT MULTIPLIER UNIT
Sunesh N.V.        Sathishkumar P., Assistant Professor
Department of Electronics and Communication Engineering
Amrita Vishwa Vidhyapeetham, Bengaluru Campus
[email protected]        [email protected]

Abstract- Floating point numbers are quantities that cannot be represented by integers, either because they contain fractional values or because they lie outside the range representable within the system's bit width. Multiplication of two floating point numbers is very important for processors. An architecture for a fast floating point multiplier complying with the single-precision IEEE 754-2008 standard has been used in this project. Floating point representation can preserve resolution and accuracy compared to fixed point. Pipelining is a technique in which multiple instructions are overlapped in execution; performing multiple operations at the same time increases the instruction throughput. Multipliers are key components of several high performance computing systems such as digital signal processors, FIR filters and microprocessors. The most important aim of the design is to make the multiplier quicker by decreasing delay, and delay is decreased by handling the carry propagation with adders that have the smallest power-delay constant.

Keywords- Floating point number, radix-4 Booth encoder, Kogge-Stone adder, Wallace tree structure, pipeline, Synopsys Design Compiler, FPGA.

I. INTRODUCTION

The change in the level of integration brought about by up-to-date VLSI trends has made it possible to combine many complex components in a single device. Such a device makes systems work faster and makes them useful for applications such as portable devices like mobile phones, multimedia and scientific computation. As technology advances, the need for high speed keeps rising. The multiplier unit uses the most time and power of the arithmetic units, so efficient multiplier units are needed to reduce the computation time. The speed, power and size of a multiplier have therefore been key issues, and many researchers focus on innovative algorithms and circuit analysis to decrease delay, power, etc. With the ever-increasing need for portable devices, this study looks into the design of adders that work faster.

In arithmetic computing, the floating point representation of real numbers covers an extensive range of values. The floating point unit is widely used in various applications, which pushes developers to work on faster floating point multiplier units. Floating-point representation can keep its resolution and accuracy when compared to fixed-point representations, but floating point operators require excessive area (or time) in ordinary implementations, so system designers have to consider area and throughput constraints for floating point representations. Floating-point numbers have been used in arithmetic units for many years; for the last decade the commonly used representation has been the IEEE Standard for Floating-Point Arithmetic, IEEE 754-2008 [5].

II. IEEE 754 FLOATING POINT STANDARD

The representation of floating point numbers brought out by the IEEE is known as IEEE 754 [5] and is used in all CPU implementations. It defines the representation of floating-point numbers, including negative numbers and de-normal numbers, together with a set of floating-point operations that operate on them. It has four rounding modes (round to nearest even, round toward +infinity, round toward -infinity and round toward zero) and five exceptions, with an indication of when each exception occurs. The usability of a processor is limited when it deals only with fixed point arithmetic [3]. If operations on numbers with fractions, very tiny numbers (e.g. 0.000004) or very huge numbers (e.g. 62.445x10^5) are required, then a different form of representation is needed, and this is provided by floating-point arithmetic. To get some notation out of the way, consider the floating-point number -6.24x10^3. The negative symbol represents the sign part of the number, '624' gives the significant digits, and the 3 gives the scale factor. The string of significant digits is formally termed the mantissa of the number, while the scale factor is called the exponent. The broad representation of a floating point number is

(-1)^S * M * 2^E        (1)

where S stands for the sign bit, M for the mantissa bits and E for the exponent bits.

Single Precision Floating Point Numbers

A single-precision number is represented in 32 bits and has three main fields: S, M and E. The 24-bit mantissa can represent about 7 significant decimal digits, while the 8-bit exponent to an implied base of 2 provides a scaling factor with a suitable range, giving a total of 32 bits for one single-precision number. To obtain the stored exponent, a bias of 2^(n-1) - 1 is added to the real exponent; this bias is 127 for the 8-bit exponent of the single-precision format. Adding the bias allows real exponents in the range -127 to +128 to be stored as unsigned values in the range 0 to 255.
The single precision format thus offers a range of values from 2^-127 to 2^+127, which is approximately 10^-38 to 10^+38. If the sign bit is 0 the number is positive, otherwise it is negative; the exponent field is an 8-bit signed exponent in excess-127 representation, and the mantissa field is the 23-bit fractional component.

Fig 1. Single Precision Floating Point format (1-bit sign, 8-bit exponent, 23-bit mantissa)

The relative size of two floating point numbers can be compared directly in the excess-127 representation. Instead of storing the exponent E as a signed number, the unsigned integer representation E' = E + 127 is stored. This stored exponent varies in the range 0 <= E' <= 255. Since the values 0 and 255 are used to represent particular numbers (exact zero, infinity and denormal numbers), the operating range of E' becomes 1 <= E' <= 254, which limits the range of E to -126 <= E <= 127. Table 1 shows the range of values of floating point numbers.

Table 1. Ranges of Floating-Point Numbers

          Binary                       Decimal
Single    ±(2 - 2^-23) x 2^127         ≈ ±10^38
Double    ±(2 - 2^-52) x 2^1023        ≈ ±10^308
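To make the field layout and the excess-127 exponent concrete, the following Verilog fragment is a minimal sketch that unpacks a 32-bit single-precision word into its sign, biased exponent and significand. The module and signal names (sp_unpack, exp_b, signif) are illustrative only and are not taken from the proposed design.

// Minimal sketch: unpack an IEEE 754 single-precision word.
// Module and signal names are illustrative, not from the paper's RTL.
module sp_unpack (
    input  wire [31:0]       a,        // IEEE 754 single-precision operand
    output wire              sign,     // S: 1 bit
    output wire [7:0]        exp_b,    // E': biased (excess-127) exponent
    output wire signed [8:0] exp_real, // E = E' - 127
    output wire [23:0]       signif    // {hidden 1, 23-bit mantissa} for normal numbers
);
    assign sign     = a[31];
    assign exp_b    = a[30:23];
    assign exp_real = {1'b0, exp_b} - 9'd127;     // remove the bias
    // E' = 0 encodes zero/denormals, so the hidden bit is 0 in that case.
    assign signif   = {(exp_b != 8'd0), a[22:0]};
endmodule

Because the exponent is stored in excess-127 form, the magnitudes of two normal numbers can be compared simply by comparing bits [30:0] as unsigned integers, which is the comparison property mentioned above.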
III. FLOATING POINT MULTIPLIER UNIT

1) Algorithm

Figure 2 shows the algorithm flow chart for the multiplier: the mantissas of the two numbers are multiplied and the exponents are added. For floating-point multiplication, a simple algorithm is used:

1. Add the exponent portions and subtract the bias.
2. Multiply the mantissa portions and calculate the sign bit.
3. Normalize the output to the preferred number of bits.

Fig 2. Algorithm for Multiplication (add the biased exponents and subtract the bias, EZ = EX + EY - BIAS; multiply the significands and set the sign, MZ = MX * MY and SZ = SX xor SY, the result being positive if the operands have the same sign and negative otherwise; normalize the product and round the significand to the appropriate number of bits, renormalizing if rounding produces a carry)

2) Multiplier Flow

Multiplication of floating point numbers is carried out in three parts [5]. In the 1st part, the sign of the product is obtained by an XOR operation on the sign bits. In the 2nd part, the exponent bits of the operands are passed to an adder stage and the bias of 127 is subtracted from the output; an 8-bit Kogge-Stone adder is used to implement the addition, and 2's complement addition is used for the subtraction.
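Before looking at the individual blocks, the overall flow of Fig. 2 can be summarized in a few lines of behavioral Verilog. The sketch below is illustrative only (the module name fp32_mul_behav is ours, and rounding, denormals, infinities and NaNs are omitted); it shows the three parts: the sign XOR, the exponent addition with one bias removed, and the 24x24-bit significand multiplication followed by a one-bit normalization.

// Behavioral sketch of the single-precision multiply flow of Fig. 2.
// Illustrative only: no rounding, denormals, infinities or NaNs.
module fp32_mul_behav (
    input  wire [31:0] x,
    input  wire [31:0] y,
    output reg  [31:0] z
);
    reg        sz;
    reg [9:0]  ez;     // wide enough for Ex + Ey - 127 plus normalization
    reg [47:0] prod;   // 24 x 24-bit significand product
    always @* begin
        sz   = x[31] ^ y[31];                                  // part 1: sign
        ez   = {2'b0, x[30:23]} + {2'b0, y[30:23]} - 10'd127;  // part 2: exponents, remove one bias
        prod = {1'b1, x[22:0]} * {1'b1, y[22:0]};              // part 3: significands with hidden 1 restored
        if (prod[47]) begin                                    // product in [2,4): shift right, bump exponent
            ez   = ez + 10'd1;
            prod = prod >> 1;
        end
        z = {sz, ez[7:0], prod[45:23]};                        // truncate to 23 mantissa bits (no rounding)
    end
endmodule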
Kogge-Stone adder

The Kogge-Stone adder [7] is a parallel prefix form of carry look-ahead adder. It creates the carries in logarithmic order; because of this it is the fastest of the adders compared here, and although it takes extra area it has a lower fan-out at each stage, which improves the performance of the adder. The order of the Kogge-Stone adder is O(log n) [8]. Figure 3 shows the different blocks used in parallel prefix adders and Figure 4 shows the structure of the Kogge-Stone adder.

Fig 3. Different Blocks of Parallel Prefix Adders (propagate/generate cells)

Fig 4. Kogge-Stone Adder
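Since the paper does not reproduce its RTL, the following module is a generic 8-bit Kogge-Stone sketch written for illustration (the names ks_adder8, g0/p0 and so on are ours): generate and propagate pairs are combined over distances 1, 2 and 4, i.e. log2(8) = 3 prefix stages, and the carry-in is folded in at the end.

// Sketch of an 8-bit Kogge-Stone adder (illustrative module, not the paper's RTL).
module ks_adder8 (
    input  wire [7:0] a, b,
    input  wire       cin,
    output wire [7:0] sum,
    output wire       cout
);
    // Bitwise generate and propagate
    wire [7:0] g0 = a & b;
    wire [7:0] p0 = a ^ b;

    // Parallel-prefix tree: after stage k, (g,p)[i] spans bits i down to max(i-2^k+1, 0)
    wire [7:0] g1, p1, g2, p2, g3, p3;
    genvar i;
    generate
        for (i = 0; i < 8; i = i + 1) begin : prefix
            // distance-1 combine
            if (i >= 1) begin
                assign g1[i] = g0[i] | (p0[i] & g0[i-1]);
                assign p1[i] = p0[i] & p0[i-1];
            end else begin
                assign g1[i] = g0[i];
                assign p1[i] = p0[i];
            end
            // distance-2 combine
            if (i >= 2) begin
                assign g2[i] = g1[i] | (p1[i] & g1[i-2]);
                assign p2[i] = p1[i] & p1[i-2];
            end else begin
                assign g2[i] = g1[i];
                assign p2[i] = p1[i];
            end
            // distance-4 combine
            if (i >= 4) begin
                assign g3[i] = g2[i] | (p2[i] & g2[i-4]);
                assign p3[i] = p2[i] & p2[i-4];
            end else begin
                assign g3[i] = g2[i];
                assign p3[i] = p2[i];
            end
        end
    endgenerate

    // (g3[i], p3[i]) now spans bits i..0, so the carry into bit i+1 is G | (P & cin).
    wire [8:0] c;
    assign c[0] = cin;
    generate
        for (i = 0; i < 8; i = i + 1) begin : carries
            assign c[i+1] = g3[i] | (p3[i] & cin);
        end
    endgenerate

    assign sum  = p0 ^ c[7:0];
    assign cout = c[8];
endmodule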
In the 3rd part, the product of the mantissa portions is found; the multiplication of the mantissa portions is carried out in the following steps.

A. Partial product generator: For a given multiplier [6] there are many ways to generate the partial products. Radix-4 Booth recoding was found to be the quickest of the schemes we examined, so it is used in the final multiplier architecture. Twelve partial products are the output of this stage.

Radix 4 Booth encoder

To recode the multiplier, it is divided into blocks of three bits, where each block overlaps the prior block by one bit. The bits are grouped from the LSB; the least significant block takes only two bits of the multiplier (there is no prior block to overlap) and assumes a 0 for the third bit. Table 2 gives the partial product selected for each block value.

Table 2. Radix-4 Booth Recoding Block Values

Block    Partial product
000       0
001      +1 x Multiplicand
010      +1 x Multiplicand
011      +2 x Multiplicand
100      -2 x Multiplicand
101      -1 x Multiplicand
110      -1 x Multiplicand
111       0
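As an illustration of Table 2, the fragment below sketches one radix-4 Booth recoding slice in Verilog (the module booth_r4_slice and its ports are our own names, not the paper's): a 3-bit overlapping group of the multiplier selects 0, +/-M or +/-2M of the multiplicand as a two's-complement partial product. Sign extension and the assembly of the twelve partial products are left out of the sketch.

// Sketch of one radix-4 Booth recoding slice (illustrative, not the paper's RTL).
// 'blk' is an overlapping group {y[i+1], y[i], y[i-1]} of multiplier bits and
// 'm' is the multiplicand; the slice returns the selected partial product per Table 2.
module booth_r4_slice #(
    parameter W = 24                     // multiplicand width (e.g. the 24-bit significand)
) (
    input  wire [2:0]          blk,
    input  wire [W-1:0]        m,
    output reg  signed [W+1:0] pp        // two extra bits cover the 2M and sign cases
);
    wire signed [W+1:0] m_s  = {2'b00, m};       // +M, zero-extended
    wire signed [W+1:0] m2_s = {1'b0, m, 1'b0};  // +2M
    always @* begin
        case (blk)
            3'b000, 3'b111: pp =  {(W+2){1'b0}}; //  0
            3'b001, 3'b010: pp =  m_s;           // +1 * Multiplicand
            3'b011:         pp =  m2_s;          // +2 * Multiplicand
            3'b100:         pp = -m2_s;          // -2 * Multiplicand
            3'b101, 3'b110: pp = -m_s;           // -1 * Multiplicand
            default:        pp =  {(W+2){1'b0}};
        endcase
    end
endmodule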
B. Partial result accumulator: The partial products obtained from the proposed generator are accumulated in a Wallace tree structure built from 4:2 compressors; the carry propagation time of the 4:2 compressor technique used in the carry save adders is small.

Wallace tree structure

The partial results are added in a Wallace tree, which propagates the carries using carry save adders [4]. The 4:2 compressor structure compresses five partial product bits into three. Figure 5 shows the block diagram of the data distribution in a tree architecture that utilizes 4:2 compressors.

Fig 5. Data distribution among the tree architecture (Stage 0 and Stage 1 compressor arrays followed by a final adder and accumulator)

Each packet consists of the bits that are fed into one 4:2 compressor group. The number of partial product rows can be reduced by a ratio of 2:1 with two stages of 4:2 compressors; Figure 5 shows the reduction tree of 8 partial products down to two operands, which are then added to form the final product by a fast carry propagate adder.
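A 4:2 compressor can be viewed as two cascaded full adders, taking five bits of the same weight (four operand bits plus a horizontal carry-in) and producing three output bits. The cell below is a generic sketch of such a compressor (the module name compressor_4to2 is our own); the horizontal carry-out is independent of the carry-in, which is what allows the cells to be chained without a ripple path.

// Sketch of a 4:2 compressor cell (illustrative, not the paper's RTL).
// Five same-weight input bits (x1..x4 plus cin) are reduced to three output bits:
// sum (same weight) plus carry and cout (next weight).
module compressor_4to2 (
    input  wire x1, x2, x3, x4,
    input  wire cin,    // horizontal carry from the neighbouring cell
    output wire sum,
    output wire carry,  // vertical carry, weight 2
    output wire cout    // horizontal carry to the next cell, weight 2
);
    // First full adder: x1 + x2 + x3
    wire s1;
    assign s1   = x1 ^ x2 ^ x3;
    assign cout = (x1 & x2) | (x2 & x3) | (x1 & x3);   // independent of cin
    // Second full adder: s1 + x4 + cin
    assign sum   = s1 ^ x4 ^ cin;
    assign carry = (s1 & x4) | (x4 & cin) | (s1 & cin);
endmodule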
C. Final stage adder: The product of the mantissas is given by the 48-bit sum and carry outputs obtained from the partial product reduction, and these are added in the final stage adder. This adder should have a small delay and high speed. After researching, comparing and implementing the power and delay characteristics of various adders, we found that the Kogge-Stone adder is the fastest of them all.

D. Normalization and rounding: The product of the mantissas is normalized and rounded off. The excess one is detected and the exponent is adjusted for normalization. The implied leading one bit is removed [3], and the left-over bits are reduced to a 26-bit value; a few extra bits are kept in the reduced value for precision. The reduced value is finally rounded off, using rounding to the nearest value, to give the 23-bit mantissa of the product. A zero detect block is used in the multiplier architecture to avoid unnecessary calculations when an input is zero.
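The normalization and rounding step can be sketched as follows. The module below is illustrative only (fp_norm_round and its port names are ours); instead of the 26-bit intermediate value described above it keeps the 23 fraction bits plus one round bit and a sticky OR of the remaining bits, and it rounds to nearest with ties resolved toward even.

// Sketch of the normalization and rounding step (illustrative, not the paper's RTL).
module fp_norm_round (
    input  wire [47:0] prod,     // 24x24-bit significand product, value in [1,4)
    input  wire [9:0]  exp_in,   // biased exponent after Ex + Ey - 127
    output wire [22:0] mant_out, // 23-bit mantissa of the rounded product
    output wire [9:0]  exp_out   // adjusted biased exponent
);
    // Detect the "excess one": a product in [2,4) has its leading one at bit 47,
    // so shift right by one and adjust the exponent.
    wire        shift = prod[47];
    wire [47:0] norm  = shift ? (prod >> 1) : prod;   // hidden one now at bit 46

    // Drop the hidden one; norm[45:23] are the 23 fraction bits, the bits below
    // act as round and sticky bits.
    wire [22:0] frac     = norm[45:23];
    wire        lsb      = norm[23];
    wire        rnd      = norm[22];
    wire        sticky   = |norm[21:0];
    wire        round_up = rnd & (sticky | lsb);      // round to nearest, ties to even

    wire [23:0] rounded  = {1'b0, frac} + {23'd0, round_up};

    // If rounding overflows 1.111...1 to 10.000...0, renormalize once more.
    assign mant_out = rounded[23] ? 23'd0 : rounded[22:0];
    assign exp_out  = exp_in + {9'd0, shift} + {9'd0, rounded[23]};
endmodule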
IV. PIPELINING

Pipelining is a method in which multiple instructions are overlapped. The operation is separated into segments, and each segment carries out its work concurrently with the other segments. After one step of an operation finishes, its outcome is passed to the next segment in the channel, which carries out the next operation, and the final outcome of each instruction is produced at the end. Pipelining techniques are widely used to improve the performance of digital circuits: the overall performance can be increased by adding pipeline stages, which decreases the path delay in each stage.

a) 4-Stage Pipelining

The multiplier structure is arranged as a four-stage pipeline. This arrangement allows one result to be generated in each clock cycle once the first three values have been entered into the unit. To increase the performance of the multiplier its operating frequency is raised by inserting pipeline stages in the critical path; the pipeline stages introduce a latency of four clocks at the output [8]. The pipeline stages perform the following steps:

i. Pre-processing of the input data.
ii. Addition of the partial products.
iii. Subtracting the bias and compressing the partial results in the Wallace tree.
iv. Normalization and the final carry propagate addition.

Fig 6. Proposed Multiplier unit (Pipeline 2: sign calculation, EA+EB, Booth processor; Pipeline 3: Etemp - bias, compression of the partial products in the Wallace tree; Pipeline 4: normalization, final adder)

The pipeline registers are implemented with the least number of registers, and they are placed inside the Kogge-Stone adder in order to get high speed. The Kogge-Stone adder has three pipeline stages, as shown in Figure 7: the first stage is where the two inputs of the adder are registered, the next stage is after the evaluation of prefix stage 1 of the Kogge-Stone adder, and the final stage is after stage 2 of the adder.

Fig 7. Pipelined Kogge-Stone Adder
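Pipeline-register insertion itself is simple to express in Verilog. The toy module below (pipelined_add2, our own illustrative name, not part of the proposed design) registers the operands, the adder result and the output in three consecutive clocked stages; once the pipeline fills, one new result emerges every clock at the cost of a fixed latency, which is the same trade-off made in the pipelined Kogge-Stone adder of Figure 7.

// Sketch of pipeline-register insertion around a combinational stage (illustrative).
// Each always @(posedge clk) block is one pipeline cut.
module pipelined_add2 #(
    parameter W = 8
) (
    input  wire         clk,
    input  wire [W-1:0] a, b,
    output reg  [W:0]   sum
);
    // Stage 1: register the operands (pre-processing cut).
    reg [W-1:0] a_r, b_r;
    always @(posedge clk) begin
        a_r <= a;
        b_r <= b;
    end

    // Stage 2: combinational addition, result registered at the stage boundary.
    reg [W:0] sum_r;
    always @(posedge clk)
        sum_r <= a_r + b_r;

    // Stage 3: output register (normalization would sit here in the FP unit).
    always @(posedge clk)
        sum <= sum_r;
endmodule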
V. EXPERIMENTAL RESULTS

The proposed multiplier unit has been implemented on a Spartan XC3S500-4FG320 device and synthesized using a 90 nm technology in Synopsys Design Compiler. Table 3 shows the results for the various stages of the proposed multiplier unit using the DC Compiler, and Table 4 shows the device utilization of the various stages on the Spartan XC3S500-4FG320.

Table 3. Synthesis report of the sub-modules in the multiplier unit using DC Compiler

                              Power (uW)   Area (um^2)   Delay (ns)
Modified Booth Encoder        21.72        355.35        4.84
Kogge-Stone Adder             27.05        360           3.72
Pipelined Kogge-Stone Adder   37.25        420           3.92
Wallace Tree                  44.25        524.2         7.92

Table 4. Synthesis report of the sub-modules in the multiplier unit on Spartan XC3S500-4FG320

                              No. of LUTs   No. of Slices
Modified Booth Encoder        125/9312      72/4656
Kogge-Stone Adder             355/9312      196/4656
Pipelined Kogge-Stone Adder   358/9312      208/4656
Wallace Tree                  895/9312      515/4656
Table 5 shows the synthesis result of the complete proposed multiplier unit using DC Compiler, and Table 6 shows the number of look-up tables and slices used by the proposed multiplier unit in the Xilinx synthesis tool.

Table 5. Synthesis report of the final multiplier using DC Compiler

                            Power (uW)   Area (um^2)   Delay (ns)
Proposed Multiplier Unit    173.72       13325.45      29.72

Table 6. Synthesis report of the final multiplier on Spartan XC3S500-4FG320

                            No. of LUTs   No. of Slices
Proposed Multiplier Unit    2270/9312     1208/4656
Figure 8 shows the functional verification of the multiplier: two 32-bit floating point operands produce their 32-bit product.

Fig 8. Simulation result of the proposed 32-bit multiplier unit
VI. CONCLUSION
A fast floating point multiplier architecture based on the IEEE 754 standard single precision format has been studied. The modules are written in Verilog HDL to optimize the performance of the architecture. The proposed design can be effectively interfaced with any DSP processor. Pipelining is the most widely used technique for improving the performance of digital circuits, and the proposed high-speed floating point multiplier is implemented with the pipeline method.
VII. REFERENCES

[1] Anna Jain, Baisakhy Dash, Ajit Kumar Panda, "FPGA Design of a Fast 32-bit Floating Point Multiplier Unit", International Conference on Devices, Circuits and Systems (ICDCS), 2012.
[2] B. Jeevan, S. Narender, C. V. Krishna Reddy, K. Sivani, "A High Speed Binary Floating Point Multiplier Using Dadda Algorithm", International Multi-Conference on Automation, Computing, Communication, Control and Compressed Sensing, 2013.
[3] Mohammed Al Ashrafy, Ashraf Salem, Wagdy Anis, "An Efficient Implementation of Floating Point Multiplier", Electronics, Communications and Photonics Conference (SIECPC), 2012.
[4] Jung-Yup Kang, Jean-Luc Gaudiot, "A Simple High Speed Multiplier", IEEE Transactions on Computers, Vol. 55, No. 10, October 2006.
[5] S. Paschalakis, P. Lee, "Double Precision Floating-Point Arithmetic on FPGAs", Proc. 2nd IEEE International Conference on Field Programmable Technology (FPT '03), Tokyo, Japan, Dec. 15-17, 2003, pp. 352-358.
[6] Carl Hamacher, Zvonko Vranesic, Safwat Zaky, "Computer Organization", Fifth Edition, pp. 367-390.
[7] Neil H. Weste, David Harris, Ayan Banerjee, "CMOS VLSI Design", Pearson Education, 3rd Edition, 2009.
[8] Gong Renxi, Zhang Hainan, Meng Xiaobi, Gong Wenying, "Hardware Implementation of a High Speed Floating Point Multiplier Based on FPGA", 2009 4th International Conference on Computer Science & Education.
[9] Vojin G. Oklobdzija, Bart R. Zeydel, Hoang Q. Dao, Sanu Mathew, Ram Krishnamurthy, "Comparison of High-Performance VLSI Adders in the Energy-Delay Space", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 13, No. 6, June 2005, pp. 754-758.
[10] IEEE Standards Board, IEEE Standard for Floating-Point Arithmetic, IEEE 754-2008, 2008.