
Computers and Electrical Engineering 116 (2024) 109217

Contents lists available at ScienceDirect

Computers and Electrical Engineering


journal homepage: www.elsevier.com/locate/compeleceng

Efficient implementation of signed multipliers on FPGAs


Fanny Spagnolo a, Pasquale Corsonello a, Fabio Frustaci a, Stefania Perri b,*
a Department of Informatics, Modelling, Electronics and Systems Engineering - University of Calabria, 87036 Arcavacata di Rende, Italy
b Department of Mechanical, Energy and Management Engineering - University of Calabria, 87036 Arcavacata di Rende, Italy

A R T I C L E I N F O

Keywords: Hardware accelerators; Multipliers; FPGAs; Energy reduction; High performance computing

A B S T R A C T

This paper presents a simple but effective strategy to implement signed binary multipliers on Field Programmable Gate Arrays (FPGAs). It is based on the radix-4 Booth’s encoding logic but adopts an unconventional sequence of logic operations that allows hardware resources to be more efficiently exploited. The main idea consists in preliminarily generating incorrect partial products, which allows the encoding logic to be simplified, and then correcting them in the subsequent computational steps, without requiring additional logic resources. The adopted approach uses both the look up tables (LUTs) and the fast carry-chains (CCs) available within FPGA devices. When implemented on a Xilinx xc7v585tffg1157-3 device, a 32 × 32 multiplier designed as proposed here achieves a maximum running frequency of ~217 MHz, consuming only ~216 pJ per operation and using 792 LUTs and 147 CCs. When adopted in the realization of a 3 × 3 Multiply-Accumulate unit, in comparison with the lowest energy consuming competitor, the proposed approach leads to an energy-delay product 5.2 % lower and reduces the number of utilized LUTs and carry-chains by 25.2 % and 41.4 %, respectively.

1. Introduction

Modern Field Programmable Gate Arrays (FPGAs) are widely recognized as well-suited hardware implementation platforms for applications that demand high computational speed, significant energy efficiency and flexibility, such as Computer Vision (CV), object detection and classification, Internet of Things (IoT), Deep Learning (DL), and many others [1–3]. In all these fields, complex data-paths carry out their processing by exploiting elementary adders and multipliers [4,5] designed to perform both exact [6] and inexact computations [7,8]. For this reason, the efficient implementation of such arithmetic circuits on FPGAs receives a great deal of attention [9–19].
Typically, when high computational capability and the highest speed performance are mandatory, massive levels of parallelism are exploited and Digital Signal Processing slices (DSPs) are greedily utilized. However, softcore adders and multipliers realized on programmable logic cells are of particular interest to cope with the limited number of DSPs available on-chip [20,21] and/or to avoid their underutilization when operating on small bit-width operands (i.e. lower than 8-bit [13]), which also leads to poor delay and energy behavior. Furthermore, as demonstrated in [14], due to their higher routing delays, DSP-based implementations of relatively complex architectures do not always outperform analogous modules realized by exclusively exploiting Look Up Tables (LUTs) and the fast Carry-Chains (CCs).
For these reasons, several multipliers optimized to efficiently use dual-output LUTs and CCs have been recently proposed [9–14].

* Corresponding author.
E-mail address: [email protected] (S. Perri).

https://doi.org/10.1016/j.compeleceng.2024.109217
Received 7 November 2023; Received in revised form 19 March 2024; Accepted 24 March 2024
Available online 2 April 2024
0045-7906/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
F. Spagnolo et al. Computers and Electrical Engineering 116 (2024) 109217

Table 1
The Booth’s Algorithm as implemented in [9,12,13,22].

                        [22]              [9,12,13]: PPx = PP’x + cx
b2x+1  b2x  b2x-1    Dx    PPx         zx   cx   sx   PP’x
0      0    0         0    0           1    0    0    0
0      0    1        +1    A           0    0    0    A
0      1    0        +1    A           0    0    0    A
0      1    1        +2    2×A         0    0    1    2×A
1      0    0        −2    −2×A        0    1    1    2×(not A)
1      0    1        −1    −A          0    1    0    not A
1      1    0        −1    −A          0    1    0    not A
1      1    1         0    0           1    0    0    0

The designs presented in [9,10,12,13] achieve area-efficient multipliers with reduced LUT utilization by exploiting the radix-4 Booth’s encoding logic [22]. Conversely, those presented in [11] and [14] are particularly suitable for reaching high-speed performance. The former restructures a conventional multiplier to reduce the levels of adders required to provide the resulting product, whereas the latter adopts a modular approach that allows wide operands to be processed using smaller bit-width multipliers.
This work presents an alternative method that exploits the radix-4 Booth’s encoding, whilst reducing hardware resource utilization, energy consumption and computational delay in comparison to state-of-the-art softcore multipliers. To achieve this result, the multiplication between two n-bit operands A and B is performed in three basic steps: the encoding of the multiplier B; the partial products (PPs) generation; and the PPs accumulation. The overall operation is simplified by preliminarily computing incorrect but simpler PPs that are then accumulated. The proposed strategy allows the final correction of the result to be performed without requiring any additional logic resource.
For purposes of comparison with efficient multipliers known in the literature, the novel method has been applied to design signed n × n multipliers with n ranging between 4 and 32 bits. Results, obtained using a Xilinx Virtex-7 device, demonstrate that a 32 × 32 multiplier designed as proposed here achieves the lowest computational delay and energy-delay product, which are, respectively, 12.2 % and 17.4 % lower than those of the fastest competing architecture. Furthermore, the novel 4 × 4 multiplier shows an energy-delay product 29.3 % lower than [13], which represents the best competitor for such a word length. Finally, in comparison to the cheapest design [12], the proposed 32 × 32 multiplier, while utilizing 45.6 % more LUTs and 2.1 % more CCs, reduces the energy-delay product by a factor of 3.025.
The rest of the paper is organized as follows. Section 2 provides a brief background and overviews the state of the art. The proposed approach is introduced in Section 3. Hardware implementations and comparison results are presented in Section 4. Finally, conclusions are drawn in Section 5.

2. Related works

In order to briefly discuss known approaches relevant to this work, let us consider the multiplication between 2′s complement n-bit numbers: the multiplicand $A = a_{n-1} \ldots a_0$ and the multiplier $B = b_{n-1} \ldots b_0$, represented as given in (1a) and (1b), respectively. When the basic shift-and-add algorithm [23] is adopted, the 2n-bit product $P = p_{2n-1} \ldots p_0$ is calculated as reported in (2): n PPs are computed as the bitwise ANDs between A and the bits of B; then, the j-th PP, associated with the bit $b_j$, is left shifted by j bit positions, sign extended, and accumulated to the other PPs.

$$A = -a_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} a_i \cdot 2^i \tag{1a}$$

$$B = -b_{n-1} \cdot 2^{n-1} + \sum_{j=0}^{n-2} b_j \cdot 2^j \tag{1b}$$

$$P = -b_{n-1} \cdot 2^{n-1} \cdot A + \sum_{j=0}^{n-2} \left( A \cdot b_j \cdot 2^j \right) \tag{2}$$
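The shift-and-add rule in (2) can be sketched in a few lines of Python. This is only a behavioral model under stated assumptions: the function and helper names are hypothetical, Python's unbounded integers stand in for the sign-extended 2n-bit datapath, and no hardware mapping is implied.

```python
def to_bits(value, n):
    """Return the n two's-complement bits of value, LSB first."""
    return [(value >> k) & 1 for k in range(n)]


def shift_and_add(a, b, n):
    """Signed n x n product following (2).

    The PPs A*b_j*2^j are added for j = 0..n-2, while the PP tied to the
    sign bit b_{n-1} carries a negative weight and is subtracted.
    """
    b_bits = to_bits(b, n)
    product = 0
    for j in range(n - 1):
        product += (a * b_bits[j]) << j          # j-th PP, left shifted by j
    product -= (a * b_bits[n - 1]) << (n - 1)    # sign-bit PP is subtracted
    return product
```

For instance, `shift_and_add(-128, -46, 8)` evaluates the same operands used in the 8-bit example discussed later in the paper.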

To handle the signs of the operands more efficiently, the Baugh-Wooley’s algorithm [24] computes P as reported in (3): instead of subtracting the negative PPs, their negations (denoted by the overbars) are added.

$$P = a_{n-1} \cdot b_{n-1} \cdot 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} \left( a_i \cdot b_j \cdot 2^{i+j} \right) + 2^{n-1} \left( \sum_{j=0}^{n-2} \overline{a_{n-1} \cdot b_j} \cdot 2^j + \sum_{i=0}^{n-2} \overline{b_{n-1} \cdot a_i} \cdot 2^i \right) + 2^n + 2^{2n-1} \tag{3}$$
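As a sanity check on (3), the following Python sketch evaluates the Baugh-Wooley sum term by term and keeps only the 2n least significant bits. It assumes that the two bracketed groups of PPs are the complemented ones, as the text describes; the function name is hypothetical.

```python
def baugh_wooley(a, b, n):
    """Evaluate (3) for n-bit two's-complement a and b; returns a signed int."""
    def bit(v, k):
        return (v >> k) & 1

    a_s, b_s = bit(a, n - 1), bit(b, n - 1)
    total = (a_s & b_s) << (2 * n - 2)
    # Positive PPs: a_i * b_j for i, j = 0..n-2
    for i in range(n - 1):
        for j in range(n - 1):
            total += (bit(a, i) & bit(b, j)) << (i + j)
    # Negative PPs are replaced by their complements
    complemented = 0
    for j in range(n - 1):
        complemented += (1 - (a_s & bit(b, j))) << j
    for i in range(n - 1):
        complemented += (1 - (b_s & bit(a, i))) << i
    total += complemented << (n - 1)
    # Constant terms of (3)
    total += (1 << n) + (1 << (2 * n - 1))
    # Keep 2n bits and reinterpret them as a signed number
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

The result is taken modulo 2^(2n), which is what a 2n-bit hardware accumulator would do implicitly.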

The high-performance design recently proposed in [14] for FPGA-based accelerators takes advantage of this approach by implementing a modular n × n multiplier that utilizes four smaller n/2 × n/2 multipliers and performs ternary and binary additions to compute the final product P.
In order to reduce the levels of additions required in the two previous methods, instead of computing one PP for each bit of the


Fig. 1. The generation of PP’x and cx as performed in [9,12,13].

Table 2
The proposed encoding.

3-bit group             New
b2x+1  b2x  b2x-1    IDx   IPPx
0      0    0        0     0
0      0    1        1     A
0      1    0        1     A
0      1    1        2     2×A
1      0    0        2     2×A
1      0    1        1     A
1      1    0        1     A
1      1    1        0     0

operand B, the radix-4 Booth’s algorithm [22] splits B into n/2 3-bit groups (overlapping each other by one bit) that are encoded into the digits $D_x$, with x = 0,…, n/2 − 1. Then, one PP is calculated for each digit, as ruled in Table 1. The x-th PP is left shifted by 2x bit positions, sign extended, and accumulated to the other PPs, as shown in (4).

$$P = \sum_{x=0}^{n/2-1} \left( A \cdot D_x \cdot 2^{2x} \right) \tag{4}$$
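A minimal Python sketch of the conventional radix-4 Booth recoding of Table 1 and of the sum in (4) is given below. The closed form D_x = b_2x + b_2x-1 − 2·b_2x+1 used here reproduces the Dx column of Table 1, with b_−1 taken as 0; the function names are hypothetical and n is assumed to be even.

```python
def booth_digits(b, n):
    """Radix-4 Booth digits D_x (x = 0..n/2-1) of an n-bit two's-complement b."""
    def bit(k):
        # b_{-1} is the implicit 0 appended below the LSB
        return (b >> k) & 1 if k >= 0 else 0

    return [bit(2 * x) + bit(2 * x - 1) - 2 * bit(2 * x + 1)
            for x in range(n // 2)]


def booth_multiply(a, b, n):
    """Accumulate the PPs A * D_x, each left shifted by 2x positions, as in (4)."""
    return sum((a * d) << (2 * x) for x, d in enumerate(booth_digits(b, n)))
```

For example, the 8-bit multiplier −46 is recoded into the digits [−2, +1, +1, −1], since −2 + 1·4 + 1·16 − 1·64 = −46.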

FPGA-based multipliers proposed in [9–13] rely on the radix-4 Booth’s algorithm and efficiently map the multiplication operation using the LUTs and CCs available on-chip. In particular, as summarized in Table 1, the designs presented in [9,12] and [13] encode the digit Dx by means of the auxiliary signals zx, cx and sx, which are properly asserted when PPx must be zeroed, 2′s complemented or left shifted. The multiplexing circuitry illustrated in Fig. 1 is adopted to compute the generic PP’x that, taking into account also the signal cx, will be accumulated through either sequential [9,12] or parallel [13] adder architectures, while an optimized sign extension logic is used to limit the extension bits of a negative PP to two. Conversely, the implementation strategy presented in [10,11] incorporates the generation of PPs and their accumulation within the same resources. In such a case, an array of n × n/2 generate-add units is used to perform the product A × B, which reduces the number of LUTs over the conventional radix-4 Booth’s algorithm, at the cost of a detrimental effect on the critical path delay.
Due to the different mapping strategies described above, the designs presented in [9–14] exhibit significantly different computational speeds, resource utilization and energy consumption. Nevertheless, recently published results clearly demonstrate that the approaches presented in [12–14] are more efficient than [9–11]. For these reasons, in this paper, [12–14] have been selected as the direct competitors to compare with. It is worth noting that, in such designs, to introduce low-level optimizations, LUTs are instantiated as primitives and configured through their INIT parameters. In our view, this could limit the design portability, making the adaptation to different FPGA families/vendors more difficult. Therefore, we avoid the usage of such a technique.
As discussed in the following, the novel encoding strategy adopted here is efficiently mapped into LUTs and CCs and complies with both area-efficiency and high-speed requirements, without employing low-level optimizations.

3. The proposed methodology

The method proposed here computes the product P between the two n-bit signed operands A and B as shown in (5). It exploits the concept of the Booth’s encoding to preliminarily compute the digits IDx that, as summarized in Table 2, lead to potentially incorrect PPs, IPPx = A ⋅ IDx. Then, each IPPx is XOR-ed with the corresponding b2x+1 bit, so that, when a correction is needed, the bitwise negated value of IPPx is accumulated first. Finally, the additive term CO introduced in (5) completes the 2′s complementation of all PPs that need the correction.


Fig. 2. The proposed n × n multiplier.

Fig. 3. The generation of IPPx.

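$$P = \sum_{x=0}^{n/2-1} \left\{ \left[ b_{2x+1} \oplus \left( A \cdot ID_x \right) \right] \cdot 2^{2x} \right\} + CO, \qquad CO = \sum_{x=0}^{n/2-1} \left( b_{2x+1} \cdot 2^{2x} \right) \tag{5}$$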
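The correction mechanism in (5) can be checked with a short behavioral model in Python. This is a sketch under stated assumptions, not the authors' VHDL: the function name is hypothetical, Python's unbounded integers replace the sign-extended datapath (so `~ipp` plays the role of the bitwise XOR with b_2x+1 = 1), and n is assumed to be even.

```python
def proposed_multiply(a, b, n):
    """Behavioral model of (5): XOR-ed incorrect PPs plus the correction term CO."""
    def bit(k):
        return (b >> k) & 1 if k >= 0 else 0  # b_{-1} = 0

    total, co = 0, 0
    for x in range(n // 2):
        hi, mid, lo = bit(2 * x + 1), bit(2 * x), bit(2 * x - 1)
        # Table 2: the digit ID_x is always non-negative (0, 1 or 2)
        id_x = (2 - mid - lo) if hi else (mid + lo)
        ipp = a * id_x                    # potentially incorrect PP
        xored = ~ipp if hi else ipp       # bitwise inversion when b_{2x+1} = 1
        total += xored << (2 * x)
        co += hi << (2 * x)               # CO bit at position 2x equals b_{2x+1}
    return total + co
```

Since `~v` equals `−v − 1`, adding the bit b_2x+1 at position 2x turns each inverted IPP into its 2′s complement, which is exactly the negative PP that the conventional encoding of Table 1 would have produced.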

The top-level architecture of an n × n multiplier designed as proposed here is depicted in Fig. 2. The generic module GenIPP receives the 3-bit group $b_{2x+1} b_{2x} b_{2x-1}$ as input and implements the adopted encoding logic, as schematized in Fig. 3. The subsequent accumulation step is performed through an adder tree, consisting of $\log_2(n/2) - 1$ levels of two-operand (n + 2)-bit Ripple Carry Adders (RCAs), which operates taking into account that each IPP is left shifted by two bit positions relative to the previous one. The RCA topology was chosen since, as is widely known, the dedicated hardware available on FPGAs makes it the best solution to exploit the CCs efficiently. The first level of the adder tree is provided with XOR gates that invert the IPPs when needed, and Sign Extension (SE) modules, responsible for introducing two sign extension bits on the appropriate IPPs. It is important to underline that each RCA belonging to the i-th level of the adder tree receives an operand left shifted by $2^i$ bit positions relative to the other, which is sign extended by $2^i$ bit positions. Due to this, only (n + 2)-bit RCAs are necessary across the whole adder tree, which finally furnishes two partial results: the 2n-bit OA and the (3 × n/2)-bit OB.
To better explain how the PPs accumulation is performed, Fig. 4 schematizes the internal structure of the three-level adder tree used in the proposed 32 × 32 multiplier. From Fig. 4a, it can be seen that the j-th RCA, with j = 0,…,7, within the first level receives the operand $XP^1_{2j+1}$ left shifted by 2 bit positions relative to $SE^1_{2j}$, which is extended by 2 bit positions. Similarly, as shown in Fig. 4b, within the second level, the operands $XP^2_{2j+1}$ and $SE^2_{2j}$, with j = 0,…,3, are left shifted and sign extended by 4 bit positions, respectively. Finally, as illustrated in Fig. 4c, by adding $SE^3_0$ with $XP^3_1$ and $SE^3_2$ with $XP^3_3$, which are sign extended and left shifted by 8 bit positions, the third addition level furnishes the 64-bit OA and the 48-bit OB partial results.
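The alignment of the operands across the tree levels can be illustrated with the numeric sketch below. It is an assumption-laden model: the function names are hypothetical, bit widths and sign-extension hardware are not modeled (Python integers are unbounded), n is assumed to be a power of two not smaller than 4, and the final recombination P = OA + OB·2^(n/2) + CO is inferred from the shifts described above rather than quoted from the paper.

```python
def xored_ipps(a, b, n):
    """Return the XOR-ed IPPs (Table 2 encoding) and the correction operand CO."""
    def bit(k):
        return (b >> k) & 1 if k >= 0 else 0  # b_{-1} = 0

    terms, co = [], 0
    for x in range(n // 2):
        hi, mid, lo = bit(2 * x + 1), bit(2 * x), bit(2 * x - 1)
        id_x = (2 - mid - lo) if hi else (mid + lo)
        ipp = a * id_x
        terms.append(~ipp if hi else ipp)
        co |= hi << (2 * x)
    return terms, co


def adder_tree(terms):
    """Pairwise reduction: at level i the odd operand is left shifted by 2**i."""
    level = 1
    while len(terms) > 2:
        shift = 2 ** level
        terms = [terms[k] + (terms[k + 1] << shift)
                 for k in range(0, len(terms), 2)]
        level += 1
    return terms[0], terms[1]  # the two partial results OA and OB


def tree_multiply(a, b, n):
    terms, co = xored_ipps(a, b, n)
    oa, ob = adder_tree(terms)
    # Final 3-input addition; OB is assumed to sit n/2 positions above OA
    return oa + (ob << (n // 2)) + co
```

With n = 32 the loop in `adder_tree` runs three times (shifts of 2, 4 and 8 positions), matching the three levels of Fig. 4, while with n = 8 it runs once.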
The final addition level visible in Fig. 2 is implemented by a 3-input adder that, apart from the two results furnished by the adder tree, also receives the correction operand CO as input. The latter is a positive n-bit number having the bits $CO_{2x}$, with x = 0,…, n/2 − 1, equal to $b_{2x+1}$, and all the other bits zeroed. It is worth noting that all the two-operand RCAs used in the adder tree of the new multiplier are uniformly sized at the (n + 2) bit-width. It can be easily verified that, as an alternative approach, the correction bits may be appended


Fig. 4. A schematic view of the adder tree: a) the first level; b) the second level; c) the third level with the produced partial results.

Fig. 5. The 3-operands RCA used in 32 × 32 multiplier proposed here.


Fig. 6. An example of computation with 8-bit operands.

to the IPPs instead of moving them into the additive term CO. However, this would require $\sum_{j=1}^{\log_2(n/2)-1} \left( n / 2^j \right)$ additional FAs across the adder tree and a further (n + 4)-bit addition level to process the bit $CO_{2(n/2-1)}$, which could not be appended to any IPP. Moreover, by introducing the additive term CO, the final addition needed to furnish the product P can be implemented by exploiting a ternary adder.
The final adder utilized in the n × n multiplier proposed here consists of three cascaded sub-adders: an n/2-bit RCA, responsible for the computation of the LSB portion $P_{n/2-1:0}$ of the product; an (n/2 + 2)-bit ternary adder that computes the sub-word $P_{n+1:n/2}$; and, finally, an (n − 2)-bit RCA that furnishes the remaining MSB sub-word of the result. The architecture illustrated in Fig. 5 shows how the 3-operand RCA used in the new 32 × 32 multiplier is mapped on FPGAs, efficiently using dual-output LUTs and CCs. It can be seen that each LUT adds three input bits through a Full-Adder (FA). The carry-out and the sum bits output by the FAs are then aligned and added through the cascaded multiplexers and the XOR gates available within the CCs.
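As a quick consistency check on the three sub-adder widths, they add up to the 2n bits of the product. The instance for n = 32 below is a worked example derived from those widths; the bit ranges are inferred from them and are not quoted from Fig. 5.

```latex
\[
\frac{n}{2} + \left( \frac{n}{2} + 2 \right) + (n - 2) = 2n,
\qquad
n = 32:\quad
\underbrace{16}_{P_{15:0}} + \underbrace{18}_{P_{33:16}} + \underbrace{30}_{P_{63:34}} = 64 .
\]
```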
To better explain how the proposed method works, Fig. 6 reports an example of the 8 × 8 multiplication. In such a case, four IPPs are computed and accumulated, after their inversion, through one level of RCAs. The two results OA and OB obtained in this way are finally summed to CO, thus furnishing the expected product, i.e. (−128) × (−46) = 5888.
The above-described approach has been applied to multipliers with operands word-length ranging from 4 to 32. All the designed
architectures have been described at the Register Transfer Level (RTL) of abstraction using the Very High Speed Integrated Circuit
Hardware Description Language (VHDL). Then they have been implemented within a Xilinx Virtex-7 device using the Vivado 2021.2
Design Suite. It is worth pointing out that the original strategy here proposed allows the logic enclosed in the highlighted box of Fig. 2
to be efficiently mapped within the LUTs used to control the single positions of the fast carry-chains required to implement the RCAs of
the adder tree. Therefore, neither the adopted encoding nor the subsequent correction logic implies the utilization of further logic
resources.


Table 3
Post-implementation results and comparison with competitors. In bold, winners for each parameter. Best competitor for each configuration are highlighted in gray. The percentage gain/loss shown by the proposed circuits with respect to the best competitor are reported in brackets. Gains are underlined.

Design       Size     #LUTs          #CCs           #Slices        D [ns]           E [pJ]           EDP [pJ × ns]       Throughput/LUT [MHz]
Proposed     4×4      13             2              8              1.4              1.5              2.1                 54.95
             8×8      46             10             18             2.4              12.5             30                  9.06
             16×16    197            39             57             3.47             49               170.03              1.46
             32×32    792            147            232            4.6              216              993.6               0.27
[12]         4×4      12 (+8.3 %)    4 (−50 %)      10             2.15             2.3              4.945               38.76 (+41.8 %)
             8×8      40 (+15 %)     12 (−16.7 %)   21             4.25             13.8             58.65               5.88 (+54.1 %)
             16×16    144 (+36.8 %)  40 (−2.5 %)    62 (−8 %)      7.64             48 (+2 %)        366.72              0.91
             32×32    544 (+45.6 %)  144 (+2.1 %)   180 (+28.9 %)  15.18            198 (+9.1 %)     3005.64             0.12
[13]         4×4      18             5              9 (−11.1 %)    1.65 (−15.1 %)   1.8 (−16.7 %)    2.97 (−29.3 %)      33.67
             8×8      66             19             30             2.8              15               42                  5.41
             16×16    243            70             84             4.13             63               260.19              1.00
             32×32    928            269            301            6.21             262.3            1628.88             0.17
[14]         4×4      14             5              9              2.59             2.7              6.99                27.58
             8×8      54             16             19 (−5.3 %)    4.37             18               78.66               4.24
             16×16    208            58             66             5.28             75.6             399.17              0.91
             32×32    803            216            250            7.35             288              2116.8              0.17
AO IP        4×4      25             4              10             2.9              2.9              8.41                13.79
             8×8      85             14             33             3.5              14.4             50.4                3.36
             16×16    317            50             94             5.1              61.2             312.12              0.62
             32×32    1040           156            309            6.8              252              1713.6              0.14
SO IP        4×4      18             6              9              1.8              2                3.6                 30.86
             8×8      72             21             27             2.6 (−7.7 %)     13.5 (−7.4 %)    35.1 (−14.5 %)      5.34
             16×16    280            77             81             3.5 (−0.85 %)    53.2             186.2 (−8.7 %)      1.02 (+43.1 %)
             32×32    1089           288            294            5.24 (−12.2 %)   229.6            1203.104 (−17.4 %)  0.18 (+50 %)
Naïve Booth  16×16    487            71             146            4.6              145.6            669.8               0.45
             32×32    2292           275            641            6.1              434              2690.8              0.07

4. Implementation results and comparison with state-of-the-art

For purposes of comparison with their direct counterparts recently presented in the literature [12–14], the proposed multipliers have been characterized using the Virtex-7 xc7v585tffg1157-3 device. Moreover, in order to analyze the behavior of the compared designs under the same operating conditions, the competitors have been re-implemented in this work using the original VHDL source files made available at https://esim-project.eu/pd-downloads by the authors of [12–14].
Table 3 summarizes results obtained in terms of: LUTs, CCs and Slices utilization; computational delay (D); energy consumption
(E); energy-delay product (EDP); and area efficiency, evaluated as the throughput/LUTs ratio. It is important to highlight that all the
compared designs, including the naïve (i.e. without adopting any kind of optimization) implementation of the 32 × 32 and the 16 × 16
Booth’s multipliers, the Area Optimized (AO) and Speed Optimized (SO) Xilinx IP cores, were synthesized at their minimum delay
constraint achieved inserting registers as the driving and the loading logic. Moreover, their energy dissipation was analyzed using the
Switching Activity Interchange Format (SAIF) files extracted for 100,000 random inputs.
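The derived metrics of Table 3 can be reproduced from the raw delay, energy and LUT figures. The sketch below assumes that the throughput is one result per delay period D, i.e. 1000/D MHz with D in ns; this definition is an assumption that happens to reproduce the tabulated values, and the function names are hypothetical.

```python
# (#LUTs, D [ns], E [pJ]) of the proposed multipliers, copied from Table 3
PROPOSED = {
    4: (13, 1.4, 1.5),
    8: (46, 2.4, 12.5),
    16: (197, 3.47, 49.0),
    32: (792, 4.6, 216.0),
}


def edp(delay_ns, energy_pj):
    """Energy-delay product in pJ x ns."""
    return energy_pj * delay_ns


def throughput_per_lut(delay_ns, luts):
    """Area efficiency in MHz per LUT, assuming one result every delay_ns."""
    return 1000.0 / delay_ns / luts


for n, (luts, d, e) in PROPOSED.items():
    print(f"{n}x{n}: EDP = {edp(d, e):.2f} pJ*ns, "
          f"throughput/LUT = {throughput_per_lut(d, luts):.2f} MHz")
```

Under the same assumption, the 4.6 ns delay of the 32 × 32 design corresponds to 1000/4.6 ≈ 217 MHz, in line with the maximum running frequency quoted in the abstract.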
In Table 3, for each input bit-width, the best value achieved for each metric is highlighted. Moreover, for each analyzed metric and
multiplier size, results in brackets report the percentage variation exhibited by the proposed circuits with respect to the best
competitor. As an example, the proposed 4 × 4 multiplier achieves a delay 15.1 % lower than [13], which is the fastest among the
competitors for that size.
From Table 3, it can be seen that, as expected, the naïve implementation of Booth multipliers does not exploit the hardware resources available on FPGAs as efficiently as the others, thus showing relatively poor characteristics. Among the remaining competitors, the SO IP exhibits the best delay characteristics at multiplier sizes higher than 4 × 4, whereas [12] shows the lowest area occupancy. Such an approach minimizes the LUT utilization thanks to its low-level optimization strategy exploiting three efficient configurations of LUTs; moreover, it shows lower energy dissipation when n is higher than 8. However, such a strategy leads to a significantly lower speed performance. Therefore, the area efficiency and the EDP exhibited by [12] are generally worse than those achieved by the proposed scheme. Indeed, the proposed 32 × 32 multiplier achieves an EDP 3.025 times lower than [12]. Overall, the proposed designs always achieve the minimum computational delay and reach the best EDP, with an advantage that becomes more evident as the operand word-length increases. In the case of 32 × 32 implementations, the delay and EDP exhibited by the proposed multiplier are, respectively, 12.2 % and 17.4 % lower than those of the fastest competing architecture, the SO IP.
The examined designs have been characterized and compared also in terms of the normalized energy-delay-resources utilization
trade-off (NEnDeRes). The latter is obtained by normalizing the cost function EnDeRes, here introduced and defined in (6), versus the


Fig. 7. NEnDeRes at: (a) n = 4; (b) n = 8; (c) n = 16; (d) n = 32.

Table 4
Post-implementation results obtained with the Virtex UltraScale+ technology.

Design     Size     #LUTs   #CCs   #Slices   D [ns]   E [pJ]   EDP [pJ × ns]   Throughput/LUT [MHz]
Proposed   4×4      12      1      4         1.15     1.15     1.323           72.46
           8×8      47      6      8         1.6      9.6      15.36           13.29
           16×16    196     22     31        2.5      40       100             2.04
           32×32    793     79     118       3.5      168      588             0.36

Fig. 8. The implemented MAC unit.

proposed multipliers that lead to the minimum value among those obtained by the compared designs.

$$EnDeRes = E \cdot D \cdot \#Slices \tag{6}$$
Comparison results obtained in terms of NEnDeRes are summarized in Fig. 7. The normalized cost function values of 10.09 and 7.36,
related to the 16 × 16 and 32 × 32 naïve Booth’s multipliers are significantly higher than other designs and therefore they do not
appear in Fig. 7. The latter confirms that the multipliers designed as proposed here offer the best energy-delay-resources utilization


Table 5
Post-implementation characteristics of the MAC units. In bold, winners for each parameter. Best competitor for each configuration are highlighted in gray. The percentage gain/loss shown by the proposed circuits with respect to the best competitor are reported in brackets. Gains are underlined.

Design     #LUT            #CC             #Slices        D [ns]      E [pJ]
Proposed   694             140             243            3.5         165.7
[12]       640 (+8.4 %)    158 (−11.4 %)   270            4.25        177.4
[13]       874             221             351            3.5 (0 %)   188.2
[14]       766             194             252 (−3.6 %)   4.37        215.2
AO IP      1045            176             324            3.5 (0 %)   190.8
SO IP      928             239             378            3.5 (0 %)   174.7 (−5.2 %)

Fig. 9. Percentage variations in terms of NEnDeRes with respect to the MAC unit based on the proposed approach.

trade-off, for a wide-range of operands word-lengths, with an advantage, with respect to their counterparts, that spans between 34.8 %
(achieved over the SO IP when n = 32) and 80 % (obtained with respect to the AO IP with n = 4).
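The cost function (6) and its normalization can be recomputed directly from the E, D and #Slices columns of Table 3. In the sketch below the names are hypothetical, and the "advantage" is interpreted as the relative reduction of EnDeRes with respect to the competitor, which is an assumption that reproduces the percentages quoted above.

```python
# (E [pJ], D [ns], #Slices) copied from Table 3
PROPOSED = {4: (1.5, 1.4, 8), 16: (49.0, 3.47, 57), 32: (216.0, 4.6, 232)}
NAIVE_BOOTH = {16: (145.6, 4.6, 146), 32: (434.0, 6.1, 641)}
SO_IP = {32: (229.6, 5.24, 294)}
AO_IP = {4: (2.9, 2.9, 10)}


def endres(energy_pj, delay_ns, slices):
    """Cost function (6): EnDeRes = E * D * #Slices."""
    return energy_pj * delay_ns * slices


def n_endres(design, n):
    """EnDeRes of a design normalized to the proposed multiplier of the same size."""
    return endres(*design[n]) / endres(*PROPOSED[n])


def advantage_percent(design, n):
    """Relative EnDeRes reduction offered by the proposed design, in percent."""
    return 100.0 * (1.0 - 1.0 / n_endres(design, n))


print(round(n_endres(NAIVE_BOOTH, 16), 2), round(n_endres(NAIVE_BOOTH, 32), 2))
print(round(advantage_percent(SO_IP, 32), 1), round(advantage_percent(AO_IP, 4), 1))
```

With these inputs the normalized values of the two naïve Booth multipliers and the two extreme advantages agree with the figures reported in the text.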
In order to provide the reader with a broader picture of the state of the art and to point out the advantages offered by different implementation strategies, DSP-based IP cores were also evaluated. The 32 × 32 (16 × 16) IP core uses four (one) DSPs, achieves a computational delay of 5.7 ns (2.5 ns) and dissipates 57 pJ (17.5 pJ) per operation. However, such IP cores use completely different hardware resources and cannot be directly compared to the LUT-based solutions referenced in Table 3.
Finally, to show how the proposed multiplication method scales with more advanced FPGA devices, the new multipliers were also characterized using the Virtex Ultrascale+ xcvu3pffc1517–3-e device. The results obtained are summarized in Table 4. It is worth noting that the 8-bit CCs available in the used device lead to noticeably lower computational delays and fewer utilized CCs with respect to the implementations evaluated in Table 3.

4.1. An example of application

As a case study, the above compared multipliers have been used in the design of a MAC unit structured as illustrated in Fig. 8. It
consists of nine 8 × 8 multipliers and an adder tree that accumulates the nine computed products two-by-two and furnishes a 20-bit
output. The selected circuit exploits a two-stage pipeline architecture, with registers located on the input and output ports, and at the
interface between the multipliers and the adder tree. Post-implementation characterizations reported in Table 5 show that the MAC
unit based on the proposed multiplier reaches the lowest energy consumption, achieving the highest speed performance.
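A behavioral sketch of such a MAC unit is shown below: nine signed 8 × 8 products reduced two-by-two by an adder tree. The names are hypothetical, the pipeline registers are not modeled, and the handling of the odd ninth operand (passed unchanged to the next level) is an assumption, since the exact tree of Fig. 8 is not reproduced here.

```python
def mac_9(a_ops, b_ops):
    """Sum of nine signed 8 x 8 products, accumulated two-by-two."""
    assert len(a_ops) == 9 and len(b_ops) == 9
    assert all(-128 <= v <= 127 for v in list(a_ops) + list(b_ops))
    terms = [a * b for a, b in zip(a_ops, b_ops)]   # nine 16-bit products
    while len(terms) > 1:
        nxt = [terms[k] + terms[k + 1] for k in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            nxt.append(terms[-1])                   # odd operand moves up a level
        terms = nxt
    result = terms[0]
    # The result always fits the 20-bit signed output mentioned in the text
    assert -(1 << 19) <= result < (1 << 19)
    return result
```

The largest magnitude, obtained when all eighteen operands equal −128, is 9 × 16384 = 147456, which is well within the 20-bit signed range.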
The designs based on the multiplier presented in [13], the AO IP and the SO IP, are as fast as that using the approach proposed here.
This happens because their worst computational paths occur in the adder tree. On the contrary, the critical path of MACs using the
multipliers presented in [12] and [14] takes place over the multiplication operations. Obviously, each MAC can be optimized by
properly introducing further pipeline stages, but at the expense of increased energy consumption and resources utilization.
Table 5 shows that the proposed multiplier leads to an appreciable reduction in energy consumption over all the alternative designs. In comparison with the most energy-efficient competitor (SO IP), the proposed circuit uses 25.2 % (41.4 %) fewer LUTs (CCs) and dissipates 5.2 % less energy. Moreover, at parity of speed performance, it allows ~12 % energy saving with respect to [13]. Finally, as visible in Fig. 9, in the referred case study, the designs presented in [12], [13] and [14] lead to NEnDeRes values ~44 %, 64 % and 68 % higher than the new multiplier, respectively. Moreover, the NEnDeRes achieved by using the AO and SO IPs is, respectively, 53 % and 64 % higher.
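The NEnDeRes percentages of the MAC case study can be recomputed from the E, D and #Slices columns of Table 5 using (6). The sketch below uses hypothetical names and reports how much higher each competitor's EnDeRes is than that of the MAC unit based on the proposed multiplier.

```python
# (E [pJ], D [ns], #Slices) per MAC design, copied from Table 5
MAC_RESULTS = {
    "Proposed": (165.7, 3.5, 243),
    "[12]": (177.4, 4.25, 270),
    "[13]": (188.2, 3.5, 351),
    "[14]": (215.2, 4.37, 252),
    "AO IP": (190.8, 3.5, 324),
    "SO IP": (174.7, 3.5, 378),
}


def endres(energy_pj, delay_ns, slices):
    """Cost function (6): EnDeRes = E * D * #Slices."""
    return energy_pj * delay_ns * slices


def percent_higher(design, reference="Proposed"):
    """How much higher (in %) the EnDeRes of `design` is than the reference."""
    ratio = endres(*MAC_RESULTS[design]) / endres(*MAC_RESULTS[reference])
    return 100.0 * (ratio - 1.0)


for name in MAC_RESULTS:
    if name != "Proposed":
        print(f"{name}: {percent_higher(name):.1f} % higher")
```

Truncated to integers, the values come out at roughly 44 %, 64 % and 68 % for [12], [13] and [14], and at roughly 53 % and 64 % for the AO and SO IP cores.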

5. Conclusions

This paper presented a novel approach to efficiently exploit the Booth’s encoding principle to design area-efficient and high-speed
signed multipliers on FPGAs. The proposed 32 × 32 multiplier achieves a computational delay and an energy-delay product 12.2 % and


17.4 % lower than the fastest competing architecture. When exploited in the design of a MAC unit, at a parity of computational delay,
the approach presented here dissipates 5.2 % less than the lowest energy consuming competitor (i.e. that based on SO IPs), while
reducing the amount of CCs by 11.4 % with respect to the design based on the area-optimized multiplier [12].

Availability of data and materials

Not applicable.

CRediT authorship contribution statement

Fanny Spagnolo: Conceptualization, Methodology, Investigation, Writing – review & editing. Pasquale Corsonello: Resources,
Conceptualization, Writing – review & editing. Fabio Frustaci: Methodology, Investigation, Writing – review & editing. Stefania
Perri: Supervision, Conceptualization, Methodology, Investigation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Author agreement statement

We the undersigned declare that this manuscript is original, has not been published before and is not currently being considered for
publication elsewhere.
We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied
the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by
all of us.
We understand that the Corresponding Author is the sole contact for the Editorial process. He/she is responsible for communicating
with the other authors about progress, submissions of revisions and final approval of proofs.
Signed by all authors as follows: Fanny Spagnolo, Pasquale Corsonello, Fabio Frustaci, Stefania Perri.

Funding

This work was supported by PON Ricerca & Innovazione – Ministero dell’Università e della Ricerca under Grant
1062_R24_INNOVAZIONE and by ICSC National Research Centre for High Performance Computing, Big Data and Quantum Computing
within the Next Generation EU program.

References

[1] Zou Z, Chen K, Shi Z, Guo Y, Ye J. Object Detection in 20 years: a survey. In: Proceedings of the IEEE. 111; March 2023. p. 257–76. https://doi.org/10.1109/
JPROC.2023.3238524.
[2] Porkodi R, Bhuvaneswari V. The internet of things (IoT) applications and communication enabling technology standards: an overview. In: Proc. IEEE Int. Conf.
on Intelligent Computing Applications; 2014. p. 324–9.
[3] Sze V, Chen Y-H, Yang T-J, Emer JS. Efficient processing of deep neural networks: a tutorial and survey. In: Proceedings of the IEEE. 105; Dec. 2017.
p. 2295–329. https://doi.org/10.1109/JPROC.2017.2761740.
[4] Frustaci F, Perri S, Corsonello P, Alioto M. Energy-quality scalable adders based on nonzeroing bit truncation. IEEE Trans. on Very Large Scale Integration (VLSI)
Systems April 2019;27(4):964–8. https://doi.org/10.1109/TVLSI.2018.2881326.
[5] Perri S, Frustaci F, Spagnolo F, Corsonello P. Design of real-time FPGA-based embedded system for stereo vision. In: Proc. IEEE Inter. Symp. on Circuits and
Systems (ISCAS); 2018. p. 1–5.
[6] Mounica Y, Kumar KN, Veeramachaneni S, Mahammad S N. Energy efficient signed and unsigned radix 16 booth multiplier design. Comput Electr Eng 2021;90
(106892):1–8.
[7] Uppugunduru AK, Bharadwaj SV, Ahmed SE. Compressor based hybrid approximate multiplier architectures with efficient error correction logic. Comput Electr
Eng 2022;104(108407):1–10.
[8] Guturu S, Kumar UA, Bharadwaj SV, Ahmed SE. Design methodology for highly accurate approximate multipliers for error resilient applications. Comput Electr
Eng 2023;110(108798):1–11.
[9] Kumm M, et al. An efficient softcore multiplier architecture for Xilinx FPGAs. In: Proc. IEEE 22nd Symp. Comp. Arith; 2015. p. 18–25.
[10] Walters EG. Partial-product generation and addition for multiplication in FPGAs with 6-input LUTs. In: Proc. IEEE 48th Asilomar Conf. on Sig., Sys. and
Computers; 2014. p. 1247–51.
[11] Walters EG. Array multipliers for high throughput in Xilinx FPGAs with 6-Input LUTs. Computers Sept. 2016;5(4):1–25.
https://doi.org/10.3390/computers5040020.
[12] Ullah S, Schmidl H, Sahoo SS, Rehman S, Kumar A. Area-optimized accurate and approximate softcore signed multiplier architectures. IEEE Trans Computers
March 2021;70(3):384–92. https://doi.org/10.1109/TC.2020.2988404.

[13] Ullah S, Nguyen TDA, Kumar A. Energy-efficient low-latency signed multiplier for FPGA-based hardware accelerators. IEEE Embed Syst Lett June 2021;13(2):
41–4. https://doi.org/10.1109/LES.2020.2995053.
[14] Ullah S, Rehman S, Shafique M, Kumar A. High-performance accurate and approximate multipliers for FPGA-based hardware accelerators.
IEEE Trans. Comput-Aided Des Integr Circuits Sys Feb. 2022;41(2):211–24. https://doi.org/10.1109/TCAD.2021.3056337.
[15] Langhammer M, Baeckler G. High density and performance multiplication for FPGA. In: Symposium on Computer Arithmetic (ARITH); June 2018. p. 5–12.
[16] Kumm M, Hardieck M, Willkomm J, Zipf P, Meyer-Baese U. Multiple constant multiplication with ternary adders. In: International conference on field
programmable logic and applications (FPL); October 2013. p. 1–8.
[17] Ullah S, Kumar A. Approximate arithmetic circuit architectures for FPGA-based systems. Cham: Springer International Publishing; 2023. p. 1–26.
https://doi.org/10.1007/978-3-031-21294-9_1.
[18] Awais M, Zahir A, Shah SAA, Reviriego P, Ullah A, Ullah N, Khan A, Ali H. Toward optimal softcore carry-aware approximate multipliers on Xilinx FPGAs. ACM
Trans. Embed Comput Syst 2023;22(4):1–19.
[19] Pedada RR, Gade VSR. A low latency power effective signed multiplier for hardware boosters using FPGA. AIP Conf Proc Oct. 2023;2794(1).
[20] 7 Series DSP48E1 Slice, Xilinx, San Jose, CA, USA, 2018. [Online]. Available:
https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf.
[21] Integer Arithmetic IP Cores User Guide. 2020. [Online]. Available: https://www.altera.com/en_US/pdfs/literature/ug/ug_lpm_alt_mfug.pdf.
[22] Booth AD. A signed binary multiplication technique. Quart J Mech Appl Math 1951;4(2):236–40.
[23] Parhami B. Computer arithmetic algorithms and hardware designs. New York, NY, USA: Oxford Univ. Press; 2000.
[24] Baugh CR, Wooley BA. A two’s complement parallel array multiplication algorithm. IEEE Trans. Comput. Dec. 1973;C-22(12):1045–7.
https://doi.org/10.1109/T-C.1973.223648.
