


Computers and Electrical Engineering 116 (2024) 109217

Contents lists available at ScienceDirect

Computers and Electrical Engineering

journal homepage: www.elsevier.com/locate/compeleceng

Efficient implementation of signed multipliers on FPGAs

Fanny Spagnolo (a), Pasquale Corsonello (a), Fabio Frustaci (a), Stefania Perri (b,*)
(a) Department of Informatics, Modelling, Electronics and Systems Engineering - University of Calabria, 87036 Arcavacata di Rende, Italy
(b) Department of Mechanical, Energy and Management Engineering - University of Calabria, 87036 Arcavacata di Rende, Italy

ARTICLE INFO

Keywords: Hardware accelerators; Multipliers; FPGAs; Energy reduction; High performance computing

ABSTRACT

This paper presents a simple but effective strategy to implement signed binary multipliers on Field Programmable Gate Arrays (FPGAs). It is based on the radix-4 Booth’s encoding logic but adopts an unconventional sequence of logic operations that allows hardware resources to be exploited more efficiently. The main idea consists in preliminarily generating incorrect partial products, which allows the encoding logic to be simplified, and then correcting them in the subsequent computational steps, without requiring additional logic resources. The adopted approach uses both the look-up tables (LUTs) and the fast carry-chains (CCs) available within FPGA devices. When implemented on a Xilinx xc7v585tffg1157-3 device, a 32 × 32 multiplier designed as proposed here achieves a maximum running frequency of ~217 MHz, consuming only ~216 pJ per operation and using 792 LUTs and 147 CCs. When adopted in the realization of a 3 × 3 Multiply-Accumulate unit, in comparison with the lowest energy consuming competitor, the proposed approach leads to an energy-delay product 5.2 % lower and reduces the number of utilized LUTs and carry-chains by 25.2 % and 41.4 %, respectively.

1. Introduction

Modern Field Programmable Gate Arrays (FPGAs) are widely recognized as well-suited hardware implementation platforms for applications that demand high computational speed, significant energy efficiency and flexibility, such as Computer Vision (CV), object detection and classification, Internet of Things (IoT), Deep Learning (DL), and many others [1–3]. In all these fields, complex data-paths carry out their processing through elementary adders and multipliers [4,5] designed to perform both exact [6] and inexact computations [7,8]. For this reason, the efficient implementation of such arithmetic circuits on FPGAs receives a great deal of attention [9–19].
Typically, when high computational capability and the highest speed are mandatory, massive parallelism is exploited and Digital Signal Processing slices (DSPs) are used greedily. However, softcore adders and multipliers realized on programmable logic cells are of particular interest, both to cope with the limited number of DSPs available on-chip [20,21] and to avoid underutilizing them when operating on small bit-width operands (i.e. lower than 8-bit [13]), which also leads to poor delay and energy behavior. Furthermore, as demonstrated in [14], due to their higher routing delays, DSP-based implementations of relatively complex architectures do not always outperform analogous modules realized exclusively with Look Up Tables (LUTs) and fast Carry-Chains (CCs).
For these reasons, several multipliers optimized to use dual-output LUTs and CCs efficiently have recently been proposed [9–14].

* Corresponding author.
E-mail address: [email protected] (S. Perri).

https://doi.org/10.1016/j.compeleceng.2024.109217
Received 7 November 2023; Received in revised form 19 March 2024; Accepted 24 March 2024
Available online 2 April 2024
0045-7906/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
F. Spagnolo et al. Computers and Electrical Engineering 116 (2024) 109217

Table 1
The Booth’s Algorithm as implemented in [22] and in [9,12,13], where PP_x = PP’_x + c_x.

                                 [22]                 [9,12,13]
b_{2x+1}  b_{2x}  b_{2x-1}       D_x    PP_x          z_x  c_x  s_x  PP’_x
0         0       0               0     0             1    0    0    0
0         0       1              +1     A             0    0    0    A
0         1       0              +1     A             0    0    0    A
0         1       1              +2     2×A           0    0    1    2×A
1         0       0              −2     −2×A          0    1    1    2×(not A)
1         0       1              −1     −A            0    1    0    not A
1         1       0              −1     −A            0    1    0    not A
1         1       1               0     0             1    0    0    0

The designs presented in [9,10,12,13] achieve area-efficient multipliers with reduced LUT utilization by exploiting the radix-4 Booth’s encoding logic [22]. Conversely, those presented in [11] and [14] are particularly suitable for reaching high speed. The former restructures a conventional multiplier to reduce the levels of adders required to provide the resulting product, whereas the latter adopts a modular approach that processes wide operands by using smaller bit-width multipliers.
This work presents an alternative method that exploits the radix-4 Booth’s encoding while reducing hardware resource utilization, energy consumption and computational delay in comparison to state-of-the-art softcore multipliers. To achieve this result, the multiplication between two n-bit operands A and B is performed in three basic steps: the encoding of the multiplier B, the generation of the partial products (PPs), and the accumulation of the PPs. The overall operation is simplified by preliminarily computing incorrect but simpler PPs that are then accumulated. The proposed strategy allows the final correction of the result to be performed without requiring any additional logic resource.
For purposes of comparison with efficient multipliers known in the literature, the novel method has been applied to design signed n × n multipliers with n ranging from 4 to 32 bits. Results obtained using a Xilinx Virtex-7 device demonstrate that a 32 × 32 multiplier designed as proposed here achieves the lowest computational delay and energy-delay product, which are, respectively, 12.2 % and 17.4 % lower than those of the fastest competing architecture. Furthermore, the novel 4 × 4 multiplier shows an energy-delay product 29.3 % lower than [13], which represents the best competitor for such a word length. Finally, in comparison to the cheapest design [12], the proposed 32 × 32 multiplier, while utilizing 45.6 % more LUTs and 2.1 % more CCs, reduces the energy-delay product by a factor of 3.025.
The rest of the paper is organized as follows. Section 2 provides a brief background and an overview of the state of the art. The proposed approach is introduced in Section 3. Hardware implementations and comparison results are presented in Section 4. Finally, conclusions are drawn in Section 5.

2. Related works

In order to briefly discuss known approaches relevant to this work, let us consider the multiplication between 2’s complement n-bit numbers: the multiplicand $A = a_{n-1} \ldots a_0$ and the multiplier $B = b_{n-1} \ldots b_0$, represented as given in (1a) and (1b), respectively. When the basic shift-and-add algorithm [23] is adopted, the 2n-bit product $P = p_{2n-1} \ldots p_0$ is calculated as reported in (2): n PPs are computed as the bitwise ANDs between A and the bits of B; then, the j-th PP, associated with the bit $b_j$, is left shifted by j bit positions, sign extended, and accumulated with the other PPs.
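$$A = -a_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} a_i \cdot 2^i \tag{1a}$$

$$B = -b_{n-1} \cdot 2^{n-1} + \sum_{j=0}^{n-2} b_j \cdot 2^j \tag{1b}$$

$$P = -b_{n-1} \cdot 2^{n-1} \cdot A + \sum_{j=0}^{n-2} \left( A \cdot b_j \cdot 2^j \right) \tag{2}$$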
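As an illustration of (2), the following is a minimal behavioral sketch in Python, not the hardware description used in the paper. The function names `to_bits` and `shift_and_add` are hypothetical, and arbitrary-precision integers stand in for the sign-extended 2n-bit datapath.

```python
def to_bits(value, n):
    """Return the n two's-complement bits of value, least significant bit first."""
    return [(value >> k) & 1 for k in range(n)]


def shift_and_add(a, b, n):
    """Signed n x n product following Eq. (2).

    The sign bit b_{n-1} carries the weight -2^(n-1); every other bit b_j
    contributes the partial product A * b_j shifted left by j positions.
    """
    b_bits = to_bits(b, n)
    product = -b_bits[n - 1] * (a << (n - 1))
    for j in range(n - 1):
        product += (a * b_bits[j]) << j
    return product


print(shift_and_add(-128, -46, 8))
```

The loop issues one partial product per bit of B, which is the cost that the encodings discussed in the following aim to reduce.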

To handle the operands signs more efficiently, the Baugh-Wooley’s algorithm [24] computes P as reported in (3): instead of
subtracting the negative PPs, their negations are added.
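$$P = a_{n-1} \cdot b_{n-1} \cdot 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} \left( a_i \cdot b_j \cdot 2^{i+j} \right) + 2^{n-1} \left( \sum_{j=0}^{n-2} \overline{a_{n-1} \cdot b_j} \cdot 2^j + \sum_{i=0}^{n-2} \overline{b_{n-1} \cdot a_i} \cdot 2^i \right) + 2^n + 2^{2n-1} \tag{3}$$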
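The Baugh-Wooley formulation can be checked numerically with a short Python sketch. It assumes the standard form of the algorithm, in which the two cross terms involving a sign bit are added in complemented form and the result is taken modulo 2^(2n); the function name `baugh_wooley` is hypothetical.

```python
def baugh_wooley(a, b, n):
    """Signed n x n product where negated cross terms are added, not subtracted."""
    a_bits = [(a >> k) & 1 for k in range(n)]
    b_bits = [(b >> k) & 1 for k in range(n)]

    # Product of the two sign bits, weight 2^(2n-2).
    total = (a_bits[n - 1] & b_bits[n - 1]) << (2 * n - 2)

    # Positive partial products of the magnitude bits.
    for i in range(n - 1):
        for j in range(n - 1):
            total += (a_bits[i] & b_bits[j]) << (i + j)

    # Complemented cross terms (sign bit of one operand times the other's bits).
    cross = 0
    for j in range(n - 1):
        cross += (1 - (a_bits[n - 1] & b_bits[j])) << j
    for i in range(n - 1):
        cross += (1 - (b_bits[n - 1] & a_bits[i])) << i
    total += cross << (n - 1)

    # Constants that complete the 2's complementation.
    total += (1 << n) + (1 << (2 * n - 1))

    # Keep 2n bits and reinterpret them as a signed value.
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


print(baugh_wooley(-128, -46, 8))
```

Only additions appear in the accumulation, which is what makes this form attractive for adder-based hardware.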

The high-performance design recently proposed in [14] for FPGA-based accelerators benefits from this approach by implementing a modular n × n multiplier that utilizes four smaller n/2 × n/2 multipliers and performs ternary and binary additions to compute the final product P.
Fig. 1. The generation of PP’_x and c_x as performed in [9,12,13].

Table 2
The proposed encoding.

3-bit group                      New
b_{2x+1}  b_{2x}  b_{2x-1}       ID_x   IPP_x
0         0       0              0      0
0         0       1              1      A
0         1       0              1      A
0         1       1              2      2×A
1         0       0              2      2×A
1         0       1              1      A
1         1       0              1      A
1         1       1              0      0

In order to reduce the levels of additions required in the two previous methods, instead of computing one PP for each bit of the operand B, the radix-4 Booth’s algorithm [22] splits B into n/2 3-bit groups (each overlapping its neighbor by one bit) that are encoded into the digits $D_x$, with $x = 0, \ldots, n/2 - 1$. Then, one PP is calculated for each digit, as ruled in Table 1. The x-th PP is left shifted by 2x bit positions, sign extended, and accumulated with the other PPs, as shown in (4).
$$P = \sum_{x=0}^{\frac{n}{2}-1} \left( A \cdot D_x \cdot 2^{2x} \right) \tag{4}$$
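The digit encoding of Table 1 and the accumulation in (4) can be sketched behaviorally as follows. This is an illustrative Python model under the assumption that n is even and that the implicit bit b_{-1} is 0; `BOOTH_DIGIT`, `booth_digits` and `booth_multiply` are hypothetical names.

```python
# Table 1, columns b_{2x+1}, b_{2x}, b_{2x-1} -> D_x
BOOTH_DIGIT = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_digits(b, n):
    """Encode the n-bit multiplier b into n/2 radix-4 Booth digits D_x."""
    # bits[0] models b_{-1} = 0, so bits[k + 1] holds b_k.
    bits = [0] + [(b >> k) & 1 for k in range(n)]
    return [
        BOOTH_DIGIT[(bits[2 * x + 2], bits[2 * x + 1], bits[2 * x])]
        for x in range(n // 2)
    ]


def booth_multiply(a, b, n):
    """Accumulate the partial products A * D_x * 2^(2x) of Eq. (4)."""
    return sum((a * d) << (2 * x) for x, d in enumerate(booth_digits(b, n)))


print(booth_digits(-46, 8))
print(booth_multiply(-128, -46, 8))
```

Compared with the shift-and-add scheme, the number of partial products drops from n to n/2, at the price of digits that may be negative.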

FPGA-based multipliers proposed in [9–13] rely on the radix-4 Booth’s algorithm and efficiently map the multiplication operation using the LUTs and CCs available on-chip. In particular, as summarized in Table 1, the designs presented in [9,12] and [13] encode the digit $D_x$ by means of the auxiliary signals $z_x$, $c_x$ and $s_x$, which are asserted when $PP_x$ must be zeroed, 2’s complemented or left shifted, respectively. The multiplexing circuitry illustrated in Fig. 1 is adopted to compute the generic $PP'_x$ that, taking into account also the signal $c_x$, will be accumulated through either sequential [9,12] or parallel [13] adder architectures, while an optimized sign extension logic is used to limit the extension bits of a negative PP to two. Conversely, the implementation strategy presented in [10,11] incorporates the generation of PPs and their accumulation within the same resources. In such a case, an array of n × n/2 generate-add units is used to perform the product A × B, which reduces the number of LUTs with respect to the conventional radix-4 Booth’s algorithm, at the cost of a detrimental effect on the critical path delay.
Due to the different mapping strategies described above, the designs presented in [9–14] exhibit significantly different computational speeds, resource utilization and energy consumption. Nevertheless, recently published results clearly demonstrate that the approaches presented in [12–14] are more efficient than [9–11]. For these reasons, [12–14] have been selected in this paper as the direct competitors to compare with. It is worth noting that, in such designs, to introduce low-level optimizations, LUTs are instantiated as primitives and configured through their INIT parameters. In our view, this could limit the design portability, making the adaptation to different FPGA families/vendors more difficult. Therefore, we avoid the usage of such a technique.
As discussed in the following, the novel encoding strategy adopted here is efficiently mapped into LUTs and CCs and complies with both area-efficiency and high-speed requirements, without employing low-level optimizations.

3. The proposed methodology

The method proposed here computes the product P between the two n-bit signed operands A and B as shown in (5). It exploits the concept of the Booth’s encoding to preliminarily compute the digits $ID_x$ that, as summarized in Table 2, lead to potentially incorrect PPs, $IPP_x = A \cdot ID_x$. Then, each $IPP_x$ is XOR-ed with the corresponding $b_{2x+1}$ bit, so that, when a correction is needed, the negated value of $IPP_x$ is accumulated first. Finally, the additive term CO introduced in (5) completes the 2’s complementation of all PPs that need the correction.


Fig. 2. The proposed n × n multiplier.

Fig. 3. The generation of IPPx.

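$$P = \sum_{x=0}^{\frac{n}{2}-1} \left\{ \left[ b_{2x+1} \oplus \left( A \cdot ID_x \right) \right] \cdot 2^{2x} \right\} + CO, \qquad CO = \sum_{x=0}^{\frac{n}{2}-1} \left( b_{2x+1} \cdot 2^{2x} \right) \tag{5}$$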
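A behavioral reading of (5) and Table 2 can be sketched in Python as follows. It is an illustrative model, not the authors’ VHDL: XOR-ing every bit of a sign-extended IPP with b_{2x+1} is modeled with the bitwise NOT operator on arbitrary-precision integers, CO is assumed to place b_{2x+1} at bit position 2x, and the names `ID_DIGIT` and `proposed_multiply` are hypothetical.

```python
# Table 2, columns b_{2x+1}, b_{2x}, b_{2x-1} -> ID_x (always non-negative)
ID_DIGIT = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): 2, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 0,
}


def proposed_multiply(a, b, n):
    """Signed n x n product via incorrect partial products plus the term CO."""
    bits = [0] + [(b >> k) & 1 for k in range(n)]  # bits[0] models b_{-1} = 0
    accumulated = 0
    co = 0
    for x in range(n // 2):
        group = (bits[2 * x + 2], bits[2 * x + 1], bits[2 * x])
        ipp = a * ID_DIGIT[group]      # potentially incorrect partial product
        negate = group[0]              # b_{2x+1} flags the needed correction
        xp = ~ipp if negate else ipp   # bitwise inversion, i.e. -ipp - 1
        accumulated += xp << (2 * x)
        co += negate << (2 * x)        # the missing +1 of each 2's complement
    return accumulated + co


print(proposed_multiply(-128, -46, 8))
```

The group 111 shows why the scheme works without a zeroing signal: its IPP is 0, the inversion yields −1, and the matching bit of CO brings the contribution back to 0.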

The top-level architecture of an n × n multiplier designed as proposed here is depicted in Fig. 2. The generic module GenIPP receives the 3-bit group $b_{2x+1} b_{2x} b_{2x-1}$ as input and implements the adopted encoding logic, as schematized in Fig. 3. The subsequent accumulation step is performed through an adder tree consisting of $\log_2\left(\frac{n}{2}\right) - 1$ levels of two-operand (n + 2)-bit Ripple Carry Adders (RCAs), which operates taking into account that each IPP is left shifted by two bit positions relative to the previous one. The RCA topology was chosen because, as is widely known, the dedicated hardware available on FPGAs makes it the best solution for exploiting the CCs efficiently. The first level of the adder tree is provided with XOR gates that invert the IPPs when needed, and with Sign Extension (SE) modules, responsible for introducing two sign extension bits on the appropriate IPPs. It is important to underline that each RCA belonging to the i-th level of the adder tree receives one operand left shifted by $2^i$ bit positions relative to the other, which is sign extended by $2^i$ bit positions. Due to this, only (n + 2)-bit RCAs are necessary across the whole adder tree, which finally furnishes two partial results: the 2n-bit OA and the $\left(3 \times \frac{n}{2}\right)$-bit OB.
To better explain how the PPs accumulation is performed, Fig. 4 schematizes the internal structure of the three-level adder tree used in the proposed 32 × 32 multiplier. From Fig. 4a, it can be seen that the j-th RCA, with j = 0,…,7, within the first level receives the operand $XP^1_{2j+1}$ left shifted by 2 bit positions relative to $SE^1_{2j}$, which is extended by 2 bit positions. Similarly, as shown in Fig. 4b, within the second level, the operands $XP^2_{2j+1}$ and $SE^2_{2j}$, with j = 0,…,3, are left shifted and sign extended by 4 bit positions, respectively. Finally, as illustrated in Fig. 4c, by adding $SE^3_0$ with $XP^3_1$ and $SE^3_2$ with $XP^3_3$, which are sign extended and left shifted by 8 bit positions, the third addition level furnishes the 64-bit OA and the 48-bit OB partial results.
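The accumulation order described above can be modeled at the integer level with the sketch below. It is a simplified Python illustration that assumes n is a power of two not smaller than 4 and ignores the fixed (n + 2)-bit adder width and the sign extension modules; `inverted_ipps`, `adder_tree` and `multiply_with_tree` are hypothetical names.

```python
# Table 2 encoding: (b_{2x+1}, b_{2x}, b_{2x-1}) -> ID_x
ID_DIGIT = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): 2, (1, 0, 1): 1, (1, 1, 0): 1, (1, 1, 1): 0,
}


def inverted_ipps(a, b, n):
    """Return the XOR-ed partial products and the correction operand CO."""
    bits = [0] + [(b >> k) & 1 for k in range(n)]
    xps, co = [], 0
    for x in range(n // 2):
        group = (bits[2 * x + 2], bits[2 * x + 1], bits[2 * x])
        ipp = a * ID_DIGIT[group]
        xps.append(~ipp if group[0] else ipp)
        co |= group[0] << (2 * x)
    return xps, co


def adder_tree(xps):
    """Add neighbors pairwise until two partial results (OA, OB) remain.

    At level i the odd-indexed operand is shifted left by 2^i positions.
    """
    values, level = list(xps), 1
    while len(values) > 2:
        shift = 2 ** level
        values = [values[k] + (values[k + 1] << shift)
                  for k in range(0, len(values), 2)]
        level += 1
    return values[0], values[1]


def multiply_with_tree(a, b, n):
    """Final 3-input addition: OA, OB weighted by 2^(n/2), and CO."""
    xps, co = inverted_ipps(a, b, n)
    oa, ob = adder_tree(xps)
    return oa + (ob << (n // 2)) + co


print(multiply_with_tree(-2**31, 2**31 - 1, 32) == -2**31 * (2**31 - 1))
```

For n = 32 the loop runs three times with shifts of 2, 4 and 8 positions, matching the three levels of Fig. 4, and OB enters the final addition 16 positions to the left of OA.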
The final addition level visible in Fig. 2 is implemented by a 3-input adder that, apart from the two results furnished by the adder tree, also receives the correction operand CO as input. The latter is a positive n-bit number having the bits $CO_{2x}$, with $x = 0, \ldots, \frac{n}{2} - 1$, equal to $b_{2x+1}$, and all the other bits zeroed. It is worth noting that all the two-operand RCAs used in the adder tree of the new multiplier are uniformly sized at the (n + 2) bit-width. It can be easily verified that, as an alternative approach, the correction bits may be appended to the IPPs instead of moving them into the additive term CO. However, this would require $\sum_{j=1}^{\log_2\left(\frac{n}{2}\right) - 1} \frac{n}{2^j}$ additional FAs across the adder tree and a further (n + 4)-bit addition level to process the bit $CO_{2\left(\frac{n}{2} - 1\right)}$, which could not be appended to any IPP. Moreover, by introducing the additive term CO, the final addition needed to furnish the product P can be implemented by exploiting a ternary adder.

Fig. 4. A schematic view of the adder tree: a) the first level; b) the second level; c) the third level with the produced partial results.

Fig. 5. The 3-operands RCA used in the 32 × 32 multiplier proposed here.

Fig. 6. An example of computation with 8-bit operands.
The final adder utilized in the n × n multiplier proposed here consists of three cascaded sub-adders: an $\frac{n}{2}$-bit RCA, responsible for the computation of the LSB portion $P_{\frac{n}{2}-1:0}$ of the product; an $\left(\frac{n}{2} + 2\right)$-bit ternary adder that computes the sub-word $P_{n+1:\frac{n}{2}}$; and, finally, an (n − 2)-bit RCA that furnishes the remaining MSB sub-word of the result. The architecture illustrated in Fig. 5 shows how the 3-operand RCA used in the new 32 × 32 multiplier is mapped on FPGAs, efficiently using dual-output LUTs and CCs. It can be seen that each LUT adds three input bits through a Full-Adder (FA). The carry-out and the sum bits output by the FAs are then aligned and added through the cascaded multiplexers and the XOR gates available within the CCs.
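As a quick consistency check on the three sub-adder widths stated above, they cover the 2n bits of the product exactly; the n = 32 instance is worked out only as an illustration:

$$\underbrace{\frac{n}{2}}_{\text{RCA}} + \underbrace{\left(\frac{n}{2} + 2\right)}_{\text{ternary adder}} + \underbrace{(n - 2)}_{\text{RCA}} = 2n, \qquad n = 32:\; 16 + 18 + 30 = 64.$$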
To better explain how the proposed method operates, Fig. 6 reports an example of an 8 × 8 multiplication. In such a case, four IPPs are computed and accumulated, after their inversion, through one level of RCAs. The two results OA and OB obtained in this way are finally summed with CO, thus furnishing the expected product, i.e. (−128) × (−46) = 5888.
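The arithmetic of this example can be retraced by hand. The derivation below is a sketch that assumes A = −128 is the multiplicand and B = −46 = 11010010₂ is the multiplier (the assignment of the two values to A and B is an assumption), with inversion written as $\overline{v} = -v - 1$:

$$\begin{aligned}
x = 0 &: (b_1 b_0 b_{-1}) = 100,\; ID_0 = 2,\; XP_0 = \overline{2A} = 255 \\
x = 1 &: (b_3 b_2 b_1) = 001,\; ID_1 = 1,\; XP_1 = A = -128 \\
x = 2 &: (b_5 b_4 b_3) = 010,\; ID_2 = 1,\; XP_2 = A = -128 \\
x = 3 &: (b_7 b_6 b_5) = 110,\; ID_3 = 1,\; XP_3 = \overline{A} = 127 \\
OA &= XP_0 + 2^2 \cdot XP_1 = 255 - 512 = -257 \\
OB &= XP_2 + 2^2 \cdot XP_3 = -128 + 508 = 380 \\
CO &= b_1 \cdot 2^0 + b_7 \cdot 2^6 = 65 \\
P &= OA + 2^4 \cdot OB + CO = -257 + 6080 + 65 = 5888
\end{aligned}$$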
The above-described approach has been applied to multipliers with operand word-lengths ranging from 4 to 32 bits. All the designed architectures have been described at the Register Transfer Level (RTL) of abstraction using the Very High Speed Integrated Circuit Hardware Description Language (VHDL). Then they have been implemented within a Xilinx Virtex-7 device using the Vivado 2021.2 Design Suite. It is worth pointing out that the strategy proposed here allows the logic enclosed in the highlighted box of Fig. 2 to be efficiently mapped within the LUTs used to control the single positions of the fast carry-chains required to implement the RCAs of the adder tree. Therefore, neither the adopted encoding nor the subsequent correction logic implies the utilization of further logic resources.


Table 3
Post-implementation results and comparison with competitors. In bold, winners for each parameter. Best competitors for each configuration are highlighted in gray. The percentage gain/loss shown by the proposed circuits with respect to the best competitor is reported in brackets. Gains are underlined.

Design       Size      #LUTs          #CCs           #Slices        D [ns]          E [pJ]           EDP [pJ×ns]          Throughput/LUT [MHz]
Proposed     4×4       13             2              8              1.4             1.5              2.1                  54.95
             8×8       46             10             18             2.4             12.5             30                   9.06
             16×16     197            39             57             3.47            49               170.03               1.46
             32×32     792            147            232            4.6             216              993.6                0.27
[12]         4×4       12 (+8.3 %)    4 (−50 %)      10             2.15            2.3              4.945                38.76 (+41.8 %)
             8×8       40 (+15 %)     12 (−16.7 %)   21             4.25            13.8             58.65                5.88 (+54.1 %)
             16×16     144 (+36.8 %)  40 (−2.5 %)    62 (−8 %)      7.64            48 (+2 %)        366.72               0.91
             32×32     544 (+45.6 %)  144 (+2.1 %)   180 (+28.9 %)  15.18           198 (+9.1 %)     3005.64              0.12
[13]         4×4       18             5              9 (−11.1 %)    1.65 (−15.1 %)  1.8 (−16.7 %)    2.97 (−29.3 %)       33.67
             8×8       66             19             30             2.8             15               42                   5.41
             16×16     243            70             84             4.13            63               260.19               1.00
             32×32     928            269            301            6.21            262.3            1628.88              0.17
[14]         4×4       14             5              9              2.59            2.7              6.99                 27.58
             8×8       54             16             19 (−5.3 %)    4.37            18               78.66                4.24
             16×16     208            58             66             5.28            75.6             399.17               0.91
             32×32     803            216            250            7.35            288              2116.8               0.17
AO IP        4×4       25             4              10             2.9             2.9              8.41                 13.79
             8×8       85             14             33             3.5             14.4             50.4                 3.36
             16×16     317            50             94             5.1             61.2             312.12               0.62
             32×32     1040           156            309            6.8             252              1713.6               0.14
SO IP        4×4       18             6              9              1.8             2                3.6                  30.86
             8×8       72             21             27             2.6 (−7.7 %)    13.5 (−7.4 %)    35.1 (−14.5 %)       5.34
             16×16     280            77             81             3.5 (−0.85 %)   53.2             186.2 (−8.7 %)       1.02 (+43.1 %)
             32×32     1089           288            294            5.24 (−12.2 %)  229.6            1203.104 (−17.4 %)   0.18 (+50 %)
Naïve Booth  16×16     487            71             146            4.6             145.6            669.8                0.45
             32×32     2292           275            641            6.1             434              2690.8               0.07

4. Implementation results and comparison with state-of-the-art

For purposes of comparison with their direct counterparts recently presented in the literature [12–14], the proposed multipliers have been characterized using the Virtex-7 xc7v585tffg1157-3 device. Moreover, in order to analyze the behavior of the compared designs under equal operating conditions, the competitors have been re-implemented in this work by using the original VHDL source files made available at https://esim-project.eu/pd-downloads by the authors of [12–14].
Table 3 summarizes the results obtained in terms of: LUTs, CCs and Slices utilization; computational delay (D); energy consumption (E); energy-delay product (EDP); and area efficiency, evaluated as the throughput/LUTs ratio. It is important to highlight that all the compared designs, including the naïve (i.e. without adopting any kind of optimization) implementation of the 32 × 32 and the 16 × 16 Booth’s multipliers, the Area Optimized (AO) and the Speed Optimized (SO) Xilinx IP cores, were synthesized at their minimum delay constraint, achieved by inserting registers as the driving and the loading logic. Moreover, their energy dissipation was analyzed using the Switching Activity Interchange Format (SAIF) files extracted for 100,000 random inputs.
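The derived columns of Table 3 follow from the measured ones, as the short Python sketch below illustrates. It assumes that the throughput equals the maximum frequency 1/D, an inference that is consistent with the rows of Table 3 but is not stated explicitly in the text; the function name `derived_metrics` is hypothetical.

```python
def derived_metrics(luts, delay_ns, energy_pj):
    """Compute fmax [MHz], EDP [pJ x ns] and Throughput/LUT [MHz] for one row."""
    fmax_mhz = 1000.0 / delay_ns  # one result every D nanoseconds (assumption)
    return {
        "fmax_mhz": fmax_mhz,
        "edp": energy_pj * delay_ns,
        "throughput_per_lut": fmax_mhz / luts,
    }


# Proposed 32 x 32 multiplier, values taken from Table 3.
row = derived_metrics(luts=792, delay_ns=4.6, energy_pj=216)
print(row)
```

With D = 4.6 ns the sketch gives a frequency of about 217 MHz, in line with the figure quoted in the abstract.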
In Table 3, for each input bit-width, the best value achieved for each metric is highlighted. Moreover, for each analyzed metric and multiplier size, the results in brackets report the percentage variation exhibited by the proposed circuits with respect to the best competitor. As an example, the proposed 4 × 4 multiplier achieves a delay 15.1 % lower than [13], which is the fastest among the competitors for that size.
From Table 3, it can be seen that, as expected, the naïve implementation of Booth multipliers does not exploit the hardware resources available on FPGAs as efficiently as the others, thus showing relatively poor characteristics. Among the remaining competitors, the SO IP exhibits the best delay characteristics at multiplier sizes higher than 4 × 4, whereas [12] shows the lowest area occupancy. The latter minimizes LUT utilization thanks to its low-level optimization strategy exploiting three efficient configurations of LUTs; moreover, it shows lower energy dissipation when n is higher than 8. However, such a strategy leads to a significantly lower speed. Therefore, the area efficiency and the EDP exhibited by [12] are generally worse than those achieved by the proposed scheme. Indeed, the proposed 32 × 32 multiplier achieves an EDP 3.025 times lower than [12]. Overall, the proposed designs always achieve the minimum computational delay and reach the best EDP, with an advantage that becomes more evident as the operand word-length increases. In the case of 32 × 32 implementations, the delay and EDP exhibited by the proposed multiplier are, respectively, 12.2 % and 17.4 % lower than those of the fastest competing architecture, the SO IP.
The examined designs have been characterized and compared also in terms of the normalized energy-delay-resources utilization trade-off (NEnDeRes). The latter is obtained by normalizing the cost function EnDeRes, here introduced and defined in (6), to the value obtained for the proposed multipliers, which is the minimum among those achieved by the compared designs.

$$\mathrm{EnDeRes} = E \cdot D \cdot \#\mathrm{Slices} \tag{6}$$

Fig. 7. NEnDeRes at: (a) n = 4; (b) n = 8; (c) n = 16; (d) n = 32.

Table 4
Post-implementation results obtained with the Virtex UltraScale+ technology.

Design       Size      #LUTs   #CCs   #Slices   D [ns]   E [pJ]   EDP [pJ×ns]   Throughput/LUT [MHz]
Proposed     4×4       12      1      4         1.15     1.15     1.323         72.46
             8×8       47      6      8         1.6      9.6      15.36         13.29
             16×16     196     22     31        2.5      40       100           2.04
             32×32     793     79     118       3.5      168      588           0.36

Fig. 8. The implemented MAC unit.
Table 5
Post-implementation characteristics of the MAC units. In bold, winners for each parameter. Best competitors for each configuration are highlighted in gray. The percentage gain/loss shown by the proposed circuits with respect to the best competitor is reported in brackets. Gains are underlined.

Design       #LUT            #CC             #Slices        D [ns]       E [pJ]
Proposed     694             140             243            3.5          165.7
[12]         640 (+8.4 %)    158 (−11.4 %)   270            4.25         177.4
[13]         874             221             351            3.5 (0 %)    188.2
[14]         766             194             252 (−3.6 %)   4.37         215.2
AO IP        1045            176             324            3.5 (0 %)    190.8
SO IP        928             239             378            3.5 (0 %)    174.7 (−5.2 %)

Fig. 9. Percentage variations in terms of NEnDeRes with respect to the MAC unit based on the proposed approach.

Comparison results obtained in terms of NEnDeRes are summarized in Fig. 7. The normalized cost function values of 10.09 and 7.36, related to the 16 × 16 and 32 × 32 naïve Booth’s multipliers, are significantly higher than those of the other designs and therefore do not appear in Fig. 7. The figure confirms that the multipliers designed as proposed here offer the best energy-delay-resources utilization trade-off over a wide range of operand word-lengths, with an advantage over their counterparts that spans between 34.8 % (achieved over the SO IP when n = 32) and 80 % (obtained with respect to the AO IP with n = 4).
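The cost function in (6) and its normalization can be reproduced from the Table 3 entries with the Python sketch below. The names `TABLE3`, `en_de_res`, `normalized` and `advantage_percent` are hypothetical, and reading the quoted advantage as 1 − 1/NEnDeRes is an assumption that happens to match the 34.8 % and 80 % figures.

```python
# (E [pJ], D [ns], #Slices) copied from Table 3 for a few designs.
TABLE3 = {
    (4, "Proposed"): (1.5, 1.4, 8),
    (4, "AO IP"): (2.9, 2.9, 10),
    (16, "Proposed"): (49.0, 3.47, 57),
    (16, "Naive Booth"): (145.6, 4.6, 146),
    (32, "Proposed"): (216.0, 4.6, 232),
    (32, "SO IP"): (229.6, 5.24, 294),
    (32, "Naive Booth"): (434.0, 6.1, 641),
}


def en_de_res(energy_pj, delay_ns, slices):
    """Cost function of Eq. (6)."""
    return energy_pj * delay_ns * slices


def normalized(n, design):
    """NEnDeRes: cost of a design divided by that of the proposed multiplier."""
    return en_de_res(*TABLE3[(n, design)]) / en_de_res(*TABLE3[(n, "Proposed")])


def advantage_percent(n, design):
    """Reduction of the cost function achieved by the proposed design."""
    return 100.0 * (1.0 - 1.0 / normalized(n, design))


print(round(normalized(16, "Naive Booth"), 2), round(normalized(32, "Naive Booth"), 2))
print(round(advantage_percent(32, "SO IP"), 1), round(advantage_percent(4, "AO IP"), 1))
```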
In order to provide the reader with a broader picture of the state of the art and to point out the advantages offered by different implementation strategies, DSP-based IP cores were also evaluated. The 32 × 32 (16 × 16) IP core uses four (one) DSPs, achieves a computational delay of 5.7 ns (2.5 ns) and dissipates 57 pJ (17.5 pJ) per operation. However, such IP cores use completely different hardware resources and cannot be directly compared to the LUT-based solutions referenced in Table 3.
Finally, to show how the proposed multiplication method scales with more advanced FPGA devices, the new multipliers were characterized also using the Virtex UltraScale+ xcvu3pffc1517–3-e device. The obtained results are summarized in Table 4. It is worth noting that the 8-bit CCs available in the used device lead to considerably lower computational delays and fewer utilized CCs with respect to the implementations evaluated in Table 3.

4.1. An example of application

As a case study, the multipliers compared above have been used in the design of a MAC unit structured as illustrated in Fig. 8. It consists of nine 8 × 8 multipliers and an adder tree that accumulates the nine computed products two-by-two and furnishes a 20-bit output. The selected circuit exploits a two-stage pipeline architecture, with registers located on the input and output ports, and at the interface between the multipliers and the adder tree. Post-implementation characterizations reported in Table 5 show that the MAC unit based on the proposed multiplier reaches the lowest energy consumption while also achieving the highest speed.
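The dataflow of such a MAC unit can be sketched behaviorally in Python. This is an illustrative model only: the operand names `pixels` and `weights` are hypothetical, pipeline registers are omitted, and the way the odd operand is forwarded between tree levels is an assumption, since Fig. 8 is not reproduced here.

```python
def mac_3x3(pixels, weights):
    """Multiply nine signed 8-bit pairs and reduce the products two-by-two."""
    assert len(pixels) == len(weights) == 9
    assert all(-128 <= v <= 127 for v in list(pixels) + list(weights))

    # Stage 1: nine independent 8 x 8 signed multiplications.
    values = [p * w for p, w in zip(pixels, weights)]

    # Stage 2: adder tree; an unpaired operand is forwarded to the next level.
    while len(values) > 1:
        paired = [values[k] + values[k + 1]
                  for k in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])
        values = paired

    result = values[0]
    # Nine 16-bit signed products always fit the 20-bit signed output.
    assert -(1 << 19) <= result < (1 << 19)
    return result


print(mac_3x3([-128] * 9, [-128] * 9))
```

The worst case, nine products of (−128) × (−128), sums to 147456, which is below 2^19 and therefore representable on the 20-bit output mentioned above.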
The designs based on the multiplier presented in [13], the AO IP and the SO IP are as fast as that using the approach proposed here. This happens because their worst computational paths occur in the adder tree. On the contrary, the critical path of MACs using the multipliers presented in [12] and [14] lies in the multiplication operations. Each MAC can be optimized by properly introducing further pipeline stages, but at the expense of increased energy consumption and resource utilization.
Table 5 shows that the proposed multiplier leads to an appreciable reduction in energy consumption over all the alternative designs. In comparison with the most energy efficient competitor (SO IP), the proposed circuit uses 25.2 % (41.4 %) fewer LUTs (CCs) and dissipates 5.2 % less energy. Moreover, at equal speed, it allows ~12 % energy saving with respect to [13]. Finally, as visible in Fig. 9, in the referred case study, the designs presented in [12], [13] and [14] lead to NEnDeRes values ~44 %, 64 % and 68 % higher than the new multiplier, respectively. Moreover, the NEnDeRes achieved by using SO and AO IPs is 53 % and 64 % higher.

5. Conclusions

This paper presented a novel approach that efficiently exploits the Booth’s encoding principle to design area-efficient and high-speed signed multipliers on FPGAs. The proposed 32 × 32 multiplier achieves a computational delay and an energy-delay product 12.2 % and 17.4 % lower than the fastest competing architecture. When exploited in the design of a MAC unit, at equal computational delay, the approach presented here dissipates 5.2 % less energy than the lowest energy consuming competitor (i.e. that based on SO IPs), while reducing the number of CCs by 11.4 % with respect to the design based on the area-optimized multiplier [12].

Availability of data and materials

Not applicable.

CRediT authorship contribution statement

Fanny Spagnolo: Conceptualization, Methodology, Investigation, Writing – review & editing. Pasquale Corsonello: Resources,
Conceptualization, Writing – review & editing. Fabio Frustaci: Methodology, Investigation, Writing – review & editing. Stefania
Perri: Supervision, Conceptualization, Methodology, Investigation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Author agreement statement

We the undersigned declare that this manuscript is original, has not been published before and is not currently being considered for
publication elsewhere.
We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied
the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by
all of us.
We understand that the Corresponding Author is the sole contact for the Editorial process. He/she is responsible for communicating
with the other authors about progress, submissions of revisions and final approval of proofs.
Signed by all authors as follows: Fanny Spagnolo, Pasquale Corsonello, Fabio Frustaci, Stefania Perri.

Funding

This work was supported by PON Ricerca & Innovazione – Ministero dell’Università e della Ricerca under Grant
1062_R24_INNOVAZIONE and by ICSC National Research Centre for High Performance Computing, Big Data and Quantum Computing
within the Next Generation EU program.

References

[1] Zou Z, Chen K, Shi Z, Guo Y, Ye J. Object Detection in 20 years: a survey. In: Proceedings of the IEEE. 111; March 2023. p. 257–76. https://doi.org/10.1109/
JPROC.2023.3238524.
[2] Porkodi R, Bhuvaneswari V. The internet of things (IoT) applications and communication enabling technology standards: an overview. In: Proc. IEEE Int. Conf.
on Intelligent Computing Applications; 2014. p. 324–9.
[3] Sze V, Chen Y-H, Yang T-J, Emer JS. Efficient processing of deep neural networks: a tutorial and survey. In: Proceedings of the IEEE. 105; Dec. 2017.
p. 2295–329. https://doi.org/10.1109/JPROC.2017.2761740.
[4] Frustaci F, Perri S, Corsonello P, Alioto M. Energy-quality scalable adders based on nonzeroing bit truncation. IEEE Trans. on Very Large Scale Integration (VLSI)
Systems April 2019;27(4):964–8. https://doi.org/10.1109/TVLSI.2018.2881326.
[5] Perri S, Frustaci F, Spagnolo F, Corsonello P. Design of real-time FPGA-based embedded system for stereo vision. In: Proc. IEEE Inter. Symp. on Circuits and
Systems (ISCAS); 2018. p. 1–5.
[6] Mounica Y, Kumar KN, Veeramachaneni S, Mahammad S N. Energy efficient signed and unsigned radix 16 booth multiplier design. Comput Electr Eng 2021;90
(106892):1–8.
[7] Uppugunduru AK, Bharadwaj SV, Ahmed SE. Compressor based hybrid approximate multiplier architectures with efficient error correction logic. Comput Electr
Eng 2022;104(108407):1–10.
[8] Guturu S, Kumar UA, Bharadwaj SV, Ahmed SE. Design methodology for highly accurate approximate multipliers for error resilient applications. Comput Electr
Eng 2023;110(108798):1–11.
[9] Kumm M, et al. An efficient softcore multiplier architecture for Xilinx FPGAs. In: Proc. IEEE 22nd Symp. Comp. Arith; 2015. p. 18–25.
[10] Walters EG. Partial-product generation and addition for multiplication in FPGAs with 6-input LUTs. In: Proc. IEEE 48th Asilomar Conf. on Sig., Sys. and
Computers; 2014. p. 1247–51.
[11] Walters EG. Array multipliers for high throughput in Xilinx FPGAs with 6-Input LUTs. Computers Sept. 2016;5(4):1–25. https://doi.org/10.3390/computers5040020.
[12] Ullah S, Schmidl H, Sahoo SS, Rehman S, Kumar A. Area-optimized accurate and approximate softcore signed multiplier architectures. IEEE Trans Computers March 2021;70(3):384–92. https://doi.org/10.1109/TC.2020.2988404.


[13] Ullah S, Nguyen TDA, Kumar A. Energy-efficient low-latency signed multiplier for FPGA-based hardware accelerators. IEEE Embed Syst Lett June 2021;13(2):41–4. https://doi.org/10.1109/LES.2020.2995053.
[14] Ullah S, Rehman S, Shafique M, Kumar A. High-performance accurate and approximate multipliers for FPGA-based hardware accelerators. IEEE Trans. Comput-Aided Des Integr Circuits Sys Feb. 2022;41(2):211–24. https://doi.org/10.1109/TCAD.2021.3056337.
[15] Langhammer M, Baeckler G. High density and performance multiplication for FPGA. In: Symposium on Computer Arithmetic (ARITH); June 2018. p. 5–12.
[16] Kumm M, Hardieck M, Willkomm J, Zipf P, Meyer-Baese U. Multiple constant multiplication with ternary adders. In: International conference on field
programmable logic and applications (FPL); October 2013. p. 1–8.
[17] Ullah S, Kumar A. Approximate arithmetic circuit architectures for FPGA-based systems. Cham: Springer International Publishing; 2023. p. 1–26. https://doi.org/10.1007/978-3-031-21294-9_1.
[18] Awais M, Zahir A, Shah SAA, Reviriego P, Ullah A, Ullah N, Khan A, Ali H. Toward optimal softcore carry-aware approximate multipliers on Xilinx FPGAs. ACM Trans. Embed Comput Syst 2023;22(4):1–19.
[19] Pedada RR, Gade VSR. A low latency power effective signed multiplier for hardware boosters using FPGA. AIP Conf Proc Oct. 2023;2794(1).
[20] 7 Series DSP48E1 Slice, Xilinx, San Jose, CA, USA, 2018. [Online]. Available: https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf.
[21] Integer Arithmetic IP Cores User Guide. 2020. [Online]. Available: https://www.altera.com/en_US/pdfs/literature/ug/ug_lpm_alt_mfug.pdf.
[22] Booth AD. A signed binary multiplication technique. Quart J Mech Appl Math 1951;4(2):236–40.
[23] Parhami B. Computer arithmetic algorithms and hardware designs. New York, NY, USA: Oxford Univ. Press; 2000.
[24] Baugh CR, Wooley BA. A two’s complement parallel array multiplication algorithm. IEEE Trans. Comput. Dec. 1973;C-22(12):1045–7. https://doi.org/10.1109/T-C.1973.223648.



