FPGA-Based Multi-Level Approximate Multipliers For High-Performance Error-Resilient Applications

This paper presents approximate multipliers deployed on FPGAs using approximate logic compressors at different accuracy levels. The proposed multipliers offer higher power-delay-area product gains than state-of-the-art works at comparable accuracies. In particular, 8-, 16-, and 32-bit multipliers can deliver gains up to 7.1x, 8.3x, and 5.0x respectively. Experiments show the multipliers enable power savings for image processing with minimal quality degradation.


Received December 30, 2019, accepted January 27, 2020, date of publication February 3, 2020, date of current version February 10, 2020.


Digital Object Identifier 10.1109/ACCESS.2020.2970968

FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Resilient Applications
NGUYEN VAN TOAN, (Member, IEEE), AND JEONG-GUN LEE, (Member, IEEE)
Smart Computing Laboratory, Department of Computer Engineering, Hallym University, Chuncheon 24252, South Korea
Corresponding author: Jeong-Gun Lee ([email protected])
This work was supported by the National Research Foundation through the Basic Science Research Program under Grant
2018R1D1A1B07043399.

ABSTRACT This paper presents approximate multipliers that are efficiently deployed on Field Programmable Gate Arrays (FPGAs) by using newly proposed approximate logic compressors at different levels of accuracy. Our approximate multiplier designs offer higher gains in power-delay-area product (PDAP) than state-of-the-art works at comparable accuracies. Furthermore, in terms of delay, occupied area, and dynamic power dissipation, our designs considerably outperform the lookup table (LUT)-based multiplier Intellectual Properties (IPs) that are available on an FPGA. In particular, our proposed 8-, 16-, and 32-bit multipliers can deliver PDAP gains of up to 7.1×, 8.3×, and 5.0×, respectively. The effectiveness and applicability of our designs are also demonstrated by image processing applications such as image multiplication and sharpening. The experiments show that, for image sharpening, our 8 × 8 multipliers can deliver a peak signal-to-noise ratio (PSNR) of 46.81 dB, a structural similarity index metric (SSIM) of 0.9989, and a dynamic power saving of up to 36.7% with respect to the exact multiplier. For image multiplication, approximate 16 × 16 multipliers can offer a PSNR of 80.25 dB, an SSIM of 1.0, and a dynamic power saving of up to 58.15%. Based on these demonstrations, the proposed multipliers are expected to be suitable for high-performance and low-power error-resilient applications.

INDEX TERMS Approximate computing, approximate compressor, error-resilient application, multiplier, power-delay-area product.

The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020

I. INTRODUCTION
In many error-resilient applications, such as multimedia, data mining, image processing, and machine learning, precise computations are not always necessary [1]–[4]. Results with some loss of accuracy can be acceptable and meaningful enough for these applications [5]. This property lets us trade accuracy against the electrical performance of a circuit. That is, we can sacrifice some accuracy for gains in power dissipation, occupied area, and delay.
Voltage over-scaling is one way to reduce the power dissipation of a circuit [6]–[8]. However, when a circuit operates below the nominal voltage level, timing-induced failures can lead to large and unpredictable errors because the most significant bits fail. Approximate computing is an efficient paradigm for lowering the power consumption and enhancing the performance of an embedded system. Some errors are allowed at the outputs of a complex circuit in order to simplify its logic expressions, which in turn reduces the logic count. As a result, savings in area and dynamic power dissipation, as well as shorter circuit delays, can be achieved.
In the past few years, intensive research has focused on approximation techniques for arithmetic units such as adders and multipliers [2], [5], [9]–[12]. Approximations can be implemented at the transistor level as well as at the logic gate level [4], [13]. Due to the high complexity of multipliers, applying approximate computing to these circuits can offer power and performance benefits. In principle, approximation techniques can be applied to any stage of a multiplier. However, most of the hardware resources and computation time
are occupied by the partial product accumulation. Therefore, optimizing the partial product accumulation can achieve larger gains.
Removing partial products from the partial product matrix (PPM) reduces circuit area and delay. Truncation is one such method: it ignores the least significant partial products in the partial product accumulation [14]. In [15], the authors proposed a broken-array multiplier whose omitted carry-save adder cells were specified by horizontal-break and vertical-break parameters. These methods cause biased errors; that is, the approximate multipliers can produce zero outputs for non-zero inputs. Zervakis et al. [16] proposed skipping the generation of arbitrary partial product rows in the PPM, which they called partial product perforation. Nevertheless, the more partial products are removed, the more errors are produced.
OR-gate based logic compression has also been applied to reduce the partial product tree. In [17], the authors used a 2-input OR gate to combine two partial products in the same column into a new one, and applied this procedure to all columns in the PPM. In [18], the authors proposed combining two or more partial products in the same column at the first reduction stage by using OR gates with two or more inputs, taking the significance of the bit position into account. Exact full adders were then employed for the subsequent reduction stages. However, these methods produced high errors in the final results. To recover the accuracy, error compensation vectors were intentionally generated and added at the final stage or at the first reduction stage [17]–[19]. The generation of error vectors incurs area and power dissipation overheads. Moreover, these vectors increase the height of the partial product tree, which can cost extra logic compression. Another limitation of [18] is that it used full adders to compress the partial product tree at the reduction stages following the first stage. This limits the area and power reductions since the logic compression ratio is only 1.5 (= 3/2).
To obtain larger area and power savings, a higher logic compression ratio has been used to reduce the partial product tree, and approximate 4-2 compressors are typically utilized at some loss of accuracy [5], [14], [19], [20]. Even higher compression ratios can be used, such as approximate 5-2 (5 inputs and 2 outputs) and 6-2 compressors [20]. However, they showed higher errors in the final products. The authors of [5] proposed approximate 4-2 compressors with simple logic equations to reduce the logic count. However, their approximate compressors still occupied substantial hardware resources when synthesized on an FPGA. Besides, their proposed multipliers utilizing that compressor produced non-zero outputs for zero-valued inputs, which costs extra logic to detect this case. Similarly, the authors of [21] used both their proposed approximate compressors and exact compressors to form so-called configurable dual-quality multipliers. Nevertheless, their method cannot be deployed on FPGAs since existing commercial FPGAs do not feature power gating. Besides, their proposed 4-2 compressors still consume substantial hardware resources and power, with moderate error rates, when synthesized on FPGAs.
As discussed above, employing these approximation techniques to design different types of multipliers can achieve performance gains, energy savings, and area reductions. In general, FPGA-based and application-specific integrated circuit (ASIC)-based design architectures are totally different from each other. For ASIC-based designs, we can fully optimize and customize our circuits to reduce the logic count. In contrast, for FPGA-based designs, the logic functions are implemented by prefabricated look-up tables (LUTs), which have a fixed number of inputs and a fixed drive strength. It is not easy to fully control their utilization. Thus, approximation techniques that have been efficiently deployed in ASIC-based designs might offer limited benefits when ported to FPGA-based designs.
Xilinx and Intel FPGAs provide fast DSP-based multipliers for digital signal processing applications demanding low power. However, using these multiplier Intellectual Properties (IPs) can incur a large routing delay since they are available only at certain locations on an FPGA. Meanwhile, LUTs are scattered all over the device, which can make the routing delay much shorter. Note that in modern devices, the routing delay dominates the data-path delay. Besides, if an application occupies all dedicated IPs, there are no IPs left for other applications that require low-power and high-speed computations. Consequently, LUT-based multiplier IPs are still necessary [22]. Ullah et al. [23], [24] have recently proposed optimizing their multipliers by truncating a least significant partial product of a 4 × 2 multiplier to save an LUT. Larger operand size multipliers (e.g., 8 × 8, 16 × 16) were then built from this building block. Their approach provided only a limited benefit in area and power savings.
To tackle the aforementioned problems, our work designs and analyzes a compact approximate 3-2 compressor and approximate 4-2 compressors. Different approximate multipliers are then proposed and realized by using those logic compressors at different levels of accuracy. In this work, our goal is to optimize the approximate multipliers so that they can be efficiently implemented on FPGAs with high electrical performance (i.e., low power consumption, small area, low latency) and low accuracy loss. Thus, these multipliers can be suitable for digital signal processing applications such as image processing and multimedia. The efficiency of our multiplier designs is analyzed and assessed by comparison with state-of-the-art works in terms of delay, power, area, power-delay product (PDP), and PDAP. Furthermore, their accuracy losses are also analyzed and compared to those of prior works. Our work is summarized as follows:
1) Approximate multipliers are proposed to be efficiently implemented on FPGAs at low losses of accuracy by using the compact approximate 3-2 compressor and approximate 4-2 compressors.
2) Design-time accuracy configurations can be achieved by using different proposed multiplier designs.
3) Quality metrics (QM) such as mean error distance (MED), mean relative error distance (MRED), and normalized mean error distance (NMED) are used to rigorously analyze the quality of the proposed approximate multipliers.
4) Exhaustive simulations are carried out to obtain exact QMs for approximate 8 × 8 and 16 × 16 multipliers.
5) Formulas are derived for estimating the important metrics (MED and NMED) for large operand size multipliers being scaled up from smaller operand size ones.
6) Our proposed approximate multipliers are prototyped on an FPGA to measure electrical performances such as delays, areas, and dynamic power dissipation.
7) The trade-offs between the accuracies and PDAPs are also presented.
8) Image processing applications such as image multiplication and image sharpening are implemented and evaluated on an FPGA board to demonstrate the applicability and effectiveness of our proposed multipliers.

FIGURE 1. 4-2 compressors: (a) a conventional exact 4-2 compressor, (b) an approximate 4-2 compressor.
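The quality metrics named in item 3 and the exhaustive evaluation in item 4 can be illustrated with a short script. The following is a minimal Python sketch, not the authors' implementation. It assumes the usual definitions: MED is the mean absolute error distance over all 2^(2N) operand pairs, MRED is the mean error distance relative to the exact product, and NMED is MED divided by the maximum product (2^N − 1)^2. The function `truncated_multiply` is a hypothetical stand-in for an approximate design (it simply drops the partial products in the lowest columns), and skipping operand pairs whose exact product is zero when accumulating MRED is likewise an assumption.

```python
def truncated_multiply(a: int, b: int, n_bits: int, dropped_cols: int) -> int:
    """Hypothetical approximate multiplier: drop every partial product
    a_i * b_j whose column index i + j is below dropped_cols."""
    product = 0
    for i in range(n_bits):
        if not (a >> i) & 1:
            continue
        for j in range(n_bits):
            if (b >> j) & 1 and i + j >= dropped_cols:
                product += 1 << (i + j)
    return product


def quality_metrics(approx_mul, n_bits: int):
    """Exhaustively evaluate (MED, MRED, NMED) over all 2^(2N) operand pairs."""
    count = 1 << (2 * n_bits)
    total_ed = 0      # sum of |ED_i|
    total_red = 0.0   # sum of |ED_i| / S_i
    for a in range(1 << n_bits):
        for b in range(1 << n_bits):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))
            total_ed += ed
            if exact != 0:  # assumption: pairs with a zero exact product are skipped
                total_red += ed / exact
    med = total_ed / count
    mred = total_red / count
    nmed = med / ((1 << n_bits) - 1) ** 2
    return med, mred, nmed


# Example: an 8 x 8 multiplier with the four lowest columns truncated.
med, mred, nmed = quality_metrics(lambda a, b: truncated_multiply(a, b, 8, 4), 8)
print(f"MED = {med:.4f}, MRED = {mred:.6f}, NMED = {nmed:.3e}")
```

Passing the multiplier as a callable keeps the metric code independent of the design under test, so the same loop could in principle be reused for any bit-accurate model of an approximate multiplier. Exhaustive evaluation is cheap at 8 bits (65,536 pairs) but grows as 2^(2N), which is why closed-form estimates become attractive for larger operand sizes.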
Compared to our previous study [25], this work contains several significant extensions: (i) a newly designed approximate 4-2 compressor (CP2) to improve power consumption and area, (ii) a formulation of the error analysis of large operand size multipliers scaled up from smaller ones, (iii) large size multipliers (16 × 16, 32 × 32), (iv) an extension of the proposed approach to approximate signed multipliers, and (v) an image sharpening application to evaluate the effectiveness of approximate multipliers.
This paper is structured as follows. An overview of exact and approximate compressors and error metrics is presented in Section II. The error analysis for large operand size multipliers is formulated in Section III. The approximate compressors are designed and analyzed in Section IV. Our proposed approximate multipliers are designed and implemented in Section V. Error analysis and performance evaluation are shown in Section VI. The effectiveness and applicability of our approximate multipliers in image applications are demonstrated in Section VII. Finally, Section VIII concludes our work.

II. PRELIMINARIES
A. EXACT AND APPROXIMATE 4-2 COMPRESSORS
A conventional exact 4-2 compressor consists of two full adders as shown in Fig. 1(a). It has 5 inputs and 3 outputs. The output sum has the same weight of 1 as the inputs, while the outputs cout and carry have a weight of 2. The reduction tree of an 8 × 8 multiplier is illustrated in Fig. 2, in which the dashed shapes represent half and full adders and the solid shapes are the exact 4-2 compressors.

FIGURE 2. The partial product reduction tree of an 8 × 8 multiplier using conventional 4-2 compressors (solid), half-, and full-adders (dashed).

An approximate 4-2 compressor has two outputs which approximately count the 1's at its four inputs. The block diagram of an approximate 4-2 compressor is illustrated in Fig. 1(b). The two outputs of the approximate 4-2 compressor can have the same or different weights. Compared to the conventional exact 4-2 compressor, the approximate one has neither the carry input (received from the preceding compressor) nor the carry output (passed to the next compressor). This shortens the delay and reduces the occupied area and power consumption.

B. ERROR METRICS
Let us consider an N × N multiplier. Error metrics such as MED, MRED, and NMED are used to evaluate the quality of the multiplier. Let $ED_i$ (error distance) be the arithmetic distance between the i-th accurate and approximate products, and $S_i$ be the i-th accurate product. The MED, MRED, and NMED are defined as (1), (2), and (3), respectively [21], [26].

$$ \mathrm{MED} = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} |ED_i| \tag{1} $$

$$ \mathrm{MRED} = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} \frac{|ED_i|}{S_i} \tag{2} $$

$$ \mathrm{NMED} = \frac{1}{(2^N - 1)^2} \times \frac{\sum_{i=1}^{2^{2N}} |ED_i|}{2^{2N}} \tag{3} $$

III. A FORMULATION OF ERROR ANALYSIS
FIGURE 3. An N × N multiplier is constructed from m × m ones (N = 2 × m).

Let us consider an N × N multiplier built from m × m multipliers (N = 2 × m) as illustrated in Fig. 3. The mean error distance of the N × N multiplier ($\mathrm{MED}_N$) is given as (4).


$$ \mathrm{MED}_N = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} |ED_i| \tag{4} $$

The error distance of the N × N multiplier is obtained as (5), where the subscripts e and a denote exact and approximate terms, respectively, and (AH, BH) and (AL, BL) are the high and low halves of the operands A and B.

$$ \begin{aligned} ED_i &= (A_i \times B_i)_e - (A_i \times B_i)_a \\ &= 2^{2m}\left[(AH_i \times BH_i)_e - (AH_i \times BH_i)_a\right] \\ &\quad + 2^{m}\left[(AH_i \times BL_i)_e - (AH_i \times BL_i)_a\right] \\ &\quad + 2^{m}\left[(AL_i \times BH_i)_e - (AL_i \times BH_i)_a\right] \\ &\quad + \left[(AL_i \times BL_i)_e - (AL_i \times BL_i)_a\right] \end{aligned} \tag{5} $$

From (5), we have the following inequality, where $ED_{HH,i}$, $ED_{HL,i}$, $ED_{LH,i}$, and $ED_{LL,i}$ are the i-th error distances caused by the four approximate m × m multipliers.

$$ |ED_i| \le 2^{2m}|ED_{HH,i}| + 2^{m}|ED_{HL,i}| + 2^{m}|ED_{LH,i}| + |ED_{LL,i}| \tag{6} $$

A. CASE 1
Equality holds in (6). In this case, the error distances of all approximate m × m multipliers are greater than or equal to zero, or all of them are less than or equal to zero.
Substituting $|ED_i|$ from (6) into (4), we have:

$$ \mathrm{MED}_N = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} \left(2^{2m}|ED_{HH,i}| + 2^{m}|ED_{HL,i}| + 2^{m}|ED_{LH,i}| + |ED_{LL,i}|\right) \tag{7} $$

It is important to note that we can swap the values of AH, AL, BH, and BL so that their occurrences are uniform over the range $[0, 2^{2m} - 1]$, as illustrated in Fig. 4. Swapping the values of AH, BH, AL, and BL does not affect the final summation of the error distances $ED_i$. Additionally, $\sum_{i=k \cdot 2^{2m}+1}^{(k+1) \cdot 2^{2m}} ED_{XY,i}$ (k is an integer, k = 0 → N/m + 1; X and Y can be H or L) is a constant over each interval $[0, 2^{2m} - 1]$. Therefore, we have:

$$ \mathrm{MED}_N = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2m}} \left(2^{2m}|ED_{HH,i}| + 2^{m}|ED_{HL,i}| + 2^{m}|ED_{LH,i}| + |ED_{LL,i}|\right) \times 2^{2N-2m} \tag{8} $$

$$ \mathrm{MED}_N = \frac{1}{2^{2m}} \sum_{i=1}^{2^{2m}} \left(2^{2m}|ED_{HH,i}| + 2^{m}|ED_{HL,i}| + 2^{m}|ED_{LH,i}| + |ED_{LL,i}|\right) \tag{9} $$

FIGURE 4. An illustration of swapping values of AH, BH, AL, and BL. Without loss of generality, consider 2-bit operands A and B. Operands A and B create 16 values (≡ $2^{2 \times 2}$) in which the sub-operand pairs (AH, BH), (AH, BL), (AL, BH), and (AL, BL) periodically repeat their values after every 4 values (≡ $2^2$). Swapping the values of AH, BH, AL, and BL does not affect the final summation of the error distances $ED_i$.

Finally, the MED of an N × N multiplier is a weighted sum of the MEDs of the m × m multipliers, as expressed in (10).

$$ \mathrm{MED}_N = 2^{2m}\,\mathrm{MED}_{HH} + 2^{m}\,\mathrm{MED}_{HL} + 2^{m}\,\mathrm{MED}_{LH} + \mathrm{MED}_{LL} \tag{10} $$

If the m × m multipliers have the same $\mathrm{MED}_m$, the $\mathrm{MED}_N$ of an N × N multiplier is given as:

$$ \mathrm{MED}_N = (2^m + 1)^2 \times \mathrm{MED}_m \tag{11} $$

The normalized mean error distance of the N × N multiplier ($\mathrm{NMED}_N$) can be calculated from (10) and (3). Moreover, if the m × m multipliers have the same $\mathrm{MED}_m$, the $\mathrm{NMED}_N$ can be approximated as (12), using N = 2m and $2^m \gg 1$:

$$ \mathrm{NMED}_N = \frac{(2^m + 1)^2}{(2^N - 1)^2} \times \mathrm{MED}_m \approx \frac{\mathrm{MED}_m}{2^{2m}} \tag{12} $$

B. CASE 2
Strict inequality holds in (6). The error distances of some approximate m × m multipliers are greater than or equal to zero, and some of them are less than zero. In this case, expression (6) can be rewritten as (13).

$$ |ED_i| < 2^{2m}|ED_{HH,i}| + 2^{m}|ED_{HL,i}| + 2^{m}|ED_{LH,i}| + |ED_{LL,i}| \tag{13} $$

Therefore, the worst case (upper bound) of the mean error distance ($\mathrm{MED}_N$) can be obtained as (14).

$$ \mathrm{MED}_N < 2^{2m}\,\mathrm{MED}_{HH} + 2^{m}\,\mathrm{MED}_{HL} + 2^{m}\,\mathrm{MED}_{LH} + \mathrm{MED}_{LL} \tag{14} $$

IV. DESIGN AND ANALYSIS OF APPROXIMATE COMPRESSORS
In this section, we present two kinds of compressors that are efficiently implemented on an FPGA: an approximate 3-2 compressor and approximate 4-2 compressors. We also analyze the error probabilities that occur in these compressors.

FIGURE 5. Karnaugh map of the approximate 3-2 compressor: the outputs y2 and y1 have the same weight; the error cases are highlighted [25].
FIGURE 6. Karnaugh map of the approximate 4-2 compressor CP1: an error occurs when (x4, x3, x2, x1) = (1, 1, 1, 1).
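Before turning to the compressors, the Case 1 scaling rule (11) of Section III can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not part of the original work. `floor_multiply` is a hypothetical m × m approximate multiplier that clears the t least significant bits of the exact product, so its error distance is never negative (the Case 1 condition), and `compose` builds the N × N multiplier from four m × m ones in the manner of Fig. 3. Exact rational arithmetic is used so that the equality can be asserted without rounding concerns.

```python
from fractions import Fraction


def floor_multiply(a: int, b: int, t: int) -> int:
    """Hypothetical approximate multiplier: exact product with its t low bits cleared."""
    return ((a * b) >> t) << t


def compose(mul_m, m: int):
    """Build a 2m x 2m multiplier from four m x m multipliers (N = 2m)."""
    mask = (1 << m) - 1

    def mul_n(a: int, b: int) -> int:
        ah, al = a >> m, a & mask
        bh, bl = b >> m, b & mask
        return ((mul_m(ah, bh) << (2 * m))
                + ((mul_m(ah, bl) + mul_m(al, bh)) << m)
                + mul_m(al, bl))

    return mul_n


def mean_error_distance(mul, n_bits: int) -> Fraction:
    """Exact MED over all 2^(2 * n_bits) operand pairs."""
    size = 1 << n_bits
    total = sum(abs(a * b - mul(a, b)) for a in range(size) for b in range(size))
    return Fraction(total, size * size)


m, t = 4, 3  # illustrative choices: 4 x 4 blocks, 3 low product bits cleared
mul_m = lambda a, b: floor_multiply(a, b, t)
med_m = mean_error_distance(mul_m, m)
med_n = mean_error_distance(compose(mul_m, m), 2 * m)

# Case 1: all sub-multiplier errors share a sign, so (11) holds with equality.
assert med_n == (2 ** m + 1) ** 2 * med_m
print(f"MED_m = {med_m}, MED_N = {med_n}")
```

If the four sub-multipliers were allowed to err in opposite directions (Case 2), the same script would be expected to show the composed MED falling below the weighted sum, in line with the upper bound in (14).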

A. APPROXIMATE 3-2 COMPRESSOR
An exact 3-2 compressor converts three inputs into a value (the number of 1's at the inputs) represented by two outputs. To cover all possible input combinations, one output has a weight of 1, like the inputs, while the other output has a weight of 2. However, by accepting some errors in the final count, both outputs can be given the same weight as the inputs, which forms an approximate 3-2 compressor. The Karnaugh map is shown in Fig. 5. The error cases are marked by circles and highlighted in red. The outputs y1 and y2 are simplified as (15) and (16).

$$ y_1 = x_1 + x_2 \tag{15} $$

$$ y_2 = x_3 \tag{16} $$

If the compressor is applied at reduction stage 1 (see Fig. 2), it occupies two 6-input LUTs (including the partial product generation circuit). The probability of a partial product bit ($a_i \cdot b_j$) being 0 is 3/4 ($= \sum p(a_i b_j = 0) = (1/2) \times (1/2) + (1/2) \times (1/2) + (1/2) \times (1/2)$), while its probability of being 1 is 1/4 ($= \sum p(a_i b_j = 1) = (1/2) \times (1/2)$). From Fig. 5, the error probability $P_1(Err)$ is given as (17):

$$ P_1(Err) = p_1(x_3 = 0, x_2 = 1, x_1 = 1) + p_1(x_3 = 1, x_2 = 1, x_1 = 1) \tag{17} $$

Therefore, $P_1(Err) = (3/4) \times (1/4) \times (1/4) + (1/4) \times (1/4) \times (1/4) = 1/16$. If the compressor is applied at reduction stage n (n > 1), it uses just 3 pins of a 6-input LUT for y1 and y2 (compression ratio = 1.5). As the occurrence probabilities of the inputs are redistributed through the reduction stages, the error probability $P_n(Err)$ is given as (18), where $p_n(x_3, x_2, x_1)$ is the occurrence probability of a specific pattern $(x_3, x_2, x_1)$ at reduction stage n (n > 1).

$$ P_n(Err) = p_n(0, 1, 1) + p_n(1, 1, 1) \tag{18} $$

B. APPROXIMATE 4-2 COMPRESSORS
In this section, we design and analyze two classes of approximate 4-2 compressors. In the first class, the approximate compressors CP1 and CP2 have outputs with different weights: one output has a weight of 1 whereas the other has a weight of 2. In the second class, the outputs of the approximate compressors CP3 and CP4 have the same weight as their inputs.

1) APPROXIMATE 4-2 COMPRESSOR 1 (CP1) – CLASS 1
As observed in [27], the two outputs of this compressor have different weights, so they can represent the number of 1's for most input combinations. An error occurs when its inputs are all 1's. The Karnaugh map of this compressor is illustrated in Fig. 6. The outputs C and S of this compressor are expressed as (19) and (20).

$$ C = (x_4 + x_3) \cdot (x_2 + x_1) + (x_4 \cdot x_3) + (x_2 \cdot x_1) \tag{19} $$

$$ S = (x_4 \oplus x_3 \oplus x_2 \oplus x_1) + x_4 \cdot x_3 \cdot x_2 \cdot x_1 \tag{20} $$

The error probability $P_1(Err)$ is 1/256 ($= (1/4)^4$) when this compressor is applied at reduction stage 1. On the other hand, if this compressor is used at reduction stage n (n > 1), the error probability $P_n(Err)$ is $p_n(1, 1, 1, 1)$. In that case, each variable in (19) and (20) is a single signal, so the compressor fits into one 6-input LUT as illustrated in Fig. 7(b), and its delay is the delay of one 6-input LUT. However, if this compressor is used to reduce the PPM at the first stage, it costs three 6-input LUTs (including the partial product bit generation circuitry).

FIGURE 7. Approximate 4-2 compressor CP1: (a) a schematic, (b) an LUT-based circuit for a logic compression at the reduction stage n (n > 1) [25].

2) APPROXIMATE 4-2 COMPRESSOR 2 (CP2) – CLASS 1
A new approximate 4-2 compressor is proposed by modifying the Karnaugh map in Fig. 6 (i.e., introducing some more errors into its outputs). The Karnaugh map of this compressor is illustrated in Fig. 8. The main goal of this compressor is to reduce switching activity as well as glitches when its inputs change. Therefore, a reduction in power consumption can
be achieved. The logic expressions of the outputs C and S are described in (21) and (22).

$$ C = (x_4 + x_3)(x_2 + x_1) + x_4 \cdot x_3 \tag{21} $$

$$ S = (x_2 + x_1) + (x_4 \oplus x_3) \tag{22} $$

If this compressor is used to implement the logic compression at reduction stage n (n > 1), it consumes one 6-input LUT. The error probability is given by (23), where E is the set of error patterns shown in Fig. 8, and $p_n(x_4, x_3, x_2, x_1)$ is the occurrence probability of a specific pattern $(x_4, x_3, x_2, x_1)$ in set E.

$$ P_n(Err) = \sum_{E} p_n(x_4, x_3, x_2, x_1) \tag{23} $$

FIGURE 8. Karnaugh map of the approximate 4-2 compressor CP2: the number of switching events and glitches is reduced as the inputs change in the shaded region.
FIGURE 9. Approximate 4-2 compressor CP2: (a) a schematic, (b) an LUT-based circuit for a logic compression at the reduction stage n (n > 1).

3) APPROXIMATE 4-2 COMPRESSOR 3 (CP3) – CLASS 2
The two outputs of this compressor have the same weight and together represent the number of 1's at the inputs. The Karnaugh maps are shown in Fig. 10. The outputs y1 and y2 are expressed as (24) and (25), respectively, which are OR-gate based compressions as introduced in [18].

$$ y_1 = x_2 + x_1 \tag{24} $$

$$ y_2 = x_4 + x_3 \tag{25} $$

FIGURE 10. Karnaugh map of the approximate 4-2 compressor CP3.

If it is used to compress the PPM height at the first reduction stage, the error probability is given as (26), where E is the set of error patterns in Fig. 10 and $(x_4, x_3, x_2, x_1) \in E$.

$$ P_1(Err) = \sum_{E} p_1(x_4, x_3, x_2, x_1) = 4 \times \frac{3}{256} + 2 \times \frac{9}{256} + \frac{1}{256} \approx 12.11\% \tag{26} $$

To optimize the area and power dissipation, the FPGA synthesis tool normally combines the partial product compressor and the partial product generation circuitry. Thus, when this compressor is used to compress the PPM height at the first reduction stage, it consumes only two 6-input LUTs as shown in Fig. 11 (including the partial product generation circuit). One remaining input (I_0) of each 6-input LUT can be used for other purposes. The circuit delay of this compressor is equivalent to the delay of one 6-input LUT. Note that the maximum error distance of this compressor is 2. Therefore, compressors belonging to class 2 cause large errors if two or more of them are cascaded in series.

FIGURE 11. (a) partial product generation circuit, (b) approximate 4-2 compressor, and (c) LUT-based circuits for implementing a logic compression at the first reduction stage (I_0 pins can be used for other purposes) [25].

4) APPROXIMATE 4-2 COMPRESSOR 4 (CP4) – CLASS 2
This compressor is proposed to improve the accuracy with moderate timing and very small area overheads. The corresponding Karnaugh map is shown in Fig. 12. The logic expressions of its outputs y1 and y2 are given in (27) and (28).

$$ y_2 = x_4 + x_3 + x_2 \cdot x_1 \tag{27} $$

$$ y_1 = x_2 + x_1 \tag{28} $$

If this compressor is applied at the first reduction stage, the total error probability $P_1(Err)$ is given by (29) and
$(x_4, x_3, x_2, x_1) \in E$.

$$ P_1(Err) = \sum_{E} p_1(x_4, x_3, x_2, x_1) = 4 \times \frac{3}{256} + \frac{9}{256} + \frac{1}{256} \approx 8.59\% \tag{29} $$

FIGURE 12. Karnaugh map of the approximate 4-2 compressor CP4.

When the compressor CP4 is used to compress the PPM height at the first reduction stage, it still consumes two 6-input LUTs as illustrated in Fig. 13 (including the partial product generation circuitry). Compared to the compressor CP3, this compressor fully uses the hardware resources of the two 6-input LUTs. Its circuit delay is about the delay of two 6-input LUTs.

FIGURE 13. (a) partial product generation circuit, (b) approximate 4-2 compressor CP4, and (c) LUT-based circuits for a logic compression at the first reduction stage (pack the PPM generation and the compressor into LUTs) [25].

C. COMPARISON OF APPROXIMATE 4-2 COMPRESSORS
This section compares the approximate 4-2 compressors in terms of power dissipation, occupied area, and error rate. Based on the comparison, we determine which compressor is suitable for which stage of the reduction tree.
As mentioned above, to optimize the power dissipation and the number of occupied LUTs, the 4-2 compressors are normally merged with the partial product generation circuits when they are used to compress the partial product tree at the first reduction stage. They therefore occupy two 6-input LUTs. Table 1 shows the performance comparison between the presented approximate 4-2 compressors and the ones in previous works. As we can see, the compressors CP3 and CP4 show higher maximum error distances (i.e., 2) than the other compressors; however, they can save up to 50% of the power dissipation and approximately 33.3% of the occupied area. Therefore, the compressors CP3 and CP4 are suitable for reduction stage 1.

TABLE 1. Performance comparison between approximate 4-2 compressors (at 1.2 V, 100 MHz; reduction stage 1).

Table 2 shows the performance comparison between the presented approximate 4-2 compressors and the ones in previous works when they are used to compress the PPM height at reduction stage n (n > 1). That is, their inputs are not partial products ($p_{ij}$). Hence, they occupy only one LUT on an FPGA. The compressors CP2, CP3, and CP4 are much better than the others in terms of power dissipation. The maximum error distances of the compressors CP3 and CP4 are higher than those of the compressors CP1 and CP2. As pointed out in Section IV-B.3, we should not cascade the compressors CP3 and CP4 in consecutive reduction stages since they produce high error rates and large errors. So, the compressor CP2 should be selected to implement the logic compression at reduction stage n. Moreover, the compressor CP1 can also be used for higher accuracy multipliers.

TABLE 2. Performance comparison between approximate 4-2 compressors (at 1.2 V, 100 MHz; reduction stage 2).

V. PROPOSED APPROXIMATE MULTIPLIERS
A. APPROXIMATE 8 × 8 MULTIPLIERS
Our proposed approximate 8 × 8 multiplier implementations consist of several steps: (i) partial product generation, (ii) PPM reduction at the first stage by using approximate 4-2 compressors in class 2, (iii) PPM reduction at the following stages n (n > 1) by using approximate 4-2 compressors in class 1, and (iv) generation of the final result by using a ripple carry adder (RCA). Note that the partial product generation and the PPM reduction at the first stage are combined together
TABLE 3. Approximate 8 × 8 multiplier configurations.

TABLE 4. Approximate 16 × 16 multiplier configurations.

(using approximate 4-2 compressors in class 2) to save hardware resources and, thereby, power consumption. This is a considerable difference between our FPGA-based implementations and ASIC-based ones. We analyzed the benefit of this implementation in Section IV-B.

FIGURE 14. Dot diagram of the proposed approximate 8 × 8 multiplier. Approximate 4-2 compressors CP3 (CP4) are used to reduce the PPM at the first reduction stage while the compressors CP1 (CP2) are utilized for the reduction stage 2 [25].

A dot diagram of the proposed 8 × 8 approximate multiplier is illustrated in Fig. 14. We apply different kinds of approximate compressors to compress the PPM at different reduction stages. In terms of hardware resources and accuracy, each kind of approximate compressor should be used properly for the PPM reduction at different stages. For example, the approximate 4-2 compressors in class 2 are efficiently applied to compress the PPM at the first stage since they cost small hardware resources and dissipate less power. The approximate 4-2 compressors in class 1 are appropriate for reducing the PPM height at stages n (n > 1) because they show higher accuracies and consume less dynamic power. This is also analyzed in depth in the compressor implementations in Sections IV-B and IV-C.

Moreover, the proposed approximate multiplier is clustered into two parts with different accuracies. The most significant partial products are accumulated by using accurate compressors and adders while the least significant PPs are accumulated by utilizing the presented approximate compressors. In this work, we implement different configurations of approximate multipliers with various accuracies. For example, in Fig. 14, the approximate multiplier is implemented with the 5 leftmost columns of the partial products compressed by using accurate compressors and adders. An RCA is used to add the last two rows of the final stage to generate the final result since the dedicated RCA is known as one of the fastest adders on FPGAs.

The proposed multi-level approximate architecture improves the accuracy, shortens the delay, and reduces the power dissipation and the occupied area of the approximate multiplier. Finally, all the proposed approximate 8 × 8 multipliers with different accuracy configurations are summarized in Table 3. The M8_CP13_k group includes approximate 8 × 8 multipliers using the approximate 4-2 compressor CP3 at the first reduction stage and the compressor CP1 at the second reduction stage. k is the number of leftmost partial product columns (the most significant PP is not counted) compressed by accurate compressors, full adders, and half adders. In this work, the approximate 4-2 compressor CP2 is not used to construct approximate 8 × 8 multipliers. Since only a small number of approximate 4-2 compressors are used for the reduction stage 2 as in Fig. 14, the reduction in dynamic power dissipation is small while the loss of accuracy could be considerable. The effectiveness of the approximate 4-2 compressor CP2 on the dynamic power reduction is evaluated on larger operand size multipliers (i.e., 16 × 16 and 32 × 32 multipliers).

B. APPROXIMATE 16 × 16 MULTIPLIERS
In FPGAs, carry chains (each CARRY4 consists of 4 carry cells as described in [28]) are used to implement the fast RCA that generates the final result in approximate 8 × 8 multipliers. There are some extra delay costs due to the inputs and outputs of carry chains (at the boundary of carry chains). These delays are mainly incurred by the LUTs used to generate control signals for the carry chains. This introduces some delay overhead to 16-bit multipliers if they are built by using 8 × 8 ones as shown in Fig. 3. Importantly, in order to further optimize the performance of 16-bit multipliers, all the proposed 4-2 compressors are employed. Thus, 16 × 16 multipliers should be built in the same way as the 8 × 8 ones. Table 4 shows all the 16 × 16 multiplier configurations presented in this work; k is the number of leftmost partial product columns (the most significant PP is not counted) compressed by accurate compressors, full adders, and half adders.

C. APPROXIMATE 2M × 2M MULTIPLIERS
It is beneficial to design 32 × 32 multipliers by using 16 × 16 ones since the boundary delays of the dedicated carry chains are small compared to the delay of 32 × 32 multipliers. Another reason is that it seems impossible to do functional simulation for 32 × 32 multipliers to cover all the possible combinations of input values because the simulation time is extremely long. Thus, we cannot guarantee their qualities. In contrast, if we build large operand


size multipliers from the smaller ones, we can assess their qualities through the error metrics of those building blocks as shown in Section III. Fig. 3 shows how to build the larger operand size multipliers from the smaller ones. In this way, we can build many approximate 32 × 32 multiplier configurations. However, we just show some typical configurations as described in Table 5. Note that the M32_CP13_h_m is built by the M16_CP13_h for the higher part while the middle and lower parts are built by the M16_CP13_m.

TABLE 5. Approximate 32 × 32 multiplier configurations.

FIGURE 15. The NMED errors between the estimated and simulation results.

VI. EXPERIMENTAL RESULTS
In this section, we present the accuracy analysis of the proposed approximate 8-, 16-, and 32-bit multipliers. Then, we describe the experimental conditions for evaluating and measuring the circuit delay, occupied area, and dynamic power dissipation of the proposed approximate multipliers.

A. ACCURACY ANALYSIS
1) UNSIGNED MULTIPLIERS
For approximate 8 × 8 and 16 × 16 multipliers, error metrics such as MED, NMED, and MRED are evaluated based on exhaustive simulations with all the possible combinations of inputs. Since the 32 × 32 multipliers are built of 16 × 16 ones, their most important error metrics, MED and NMED, are obtained from (12) and (14).

The accuracy metrics of the proposed approximate 8 × 8, 16 × 16, and 32 × 32 multipliers and the prior works are presented in Table 6. The approximate multiplier M8 [14] has the lowest quality while the multiplier M8-Ca [23] shows the highest accuracy. Among our proposed multipliers, the M8_CP14_6 has the highest quality, and its accuracy is comparable with that of the M8-Ca [23]. The proposed multipliers using the compressor CP3 have lower qualities than the multipliers with the compressor CP4 since the compressor CP4 is more accurate than the compressor CP3. The accuracy increases with the number of most significant partial product columns (k) compressed by using the accurate compressors and adders. In terms of MED and NMED, the highest quality multipliers (M8_CP14_6 and M8_CP13_6) are over 7× more accurate than the lowest quality multipliers (M8_CP14_2 and M8_CP13_2).

For 16 × 16 multipliers, all approximate 4-2 compressors (CP1 to CP4) are employed to develop different approximate multipliers with different accuracies. The proposed multipliers using compressors CP1 and CP4 (M16_CP14_k, k is the index) have the best qualities while the qualities of the multipliers M16_CP23_k are the lowest. In the multiplier M16 [14], twelve leftmost partial product columns are compressed by exact compressors and eight less significant partial product columns are truncated. As a result, our multiplier M16_CP14_12 and the M16 [14] show comparable qualities. Our proposed multipliers outperform other existing works in terms of quality (except for the M16_CPxy_4; x = 1, 2; y = 3, 4). In general, the multipliers M16_CP24_k have better accuracies than the multipliers M16_CP13_k (except for M16_CPxy_4; x = 1, 2; y = 3, 4). As shown in Table 6, our multipliers are classified into 4 groups. In each group, the highest quality multiplier is nearly 100× more precise than the lowest one (for MED and NMED). The accuracy increases with the number of most significant partial product columns (k) compressed by accurate compressors. Among the existing multipliers, the M16-Ca [23] demonstrates the best quality (except for the M16 [14]). Nevertheless, its accuracy is only better than those of our multipliers M16_CPxy_4 (x = 1, 2; y = 3, 4), and lower than those of all our remaining multipliers.

Since the accuracies of approximate 32 × 32 multipliers are estimated by using the formulas in Section III, we first validate this estimation method. Fig. 15 plots the NMED errors between the simulation and estimated results of approximate 16 × 16 multipliers which are built of 8 × 8 multipliers. The actual NMEDs are bounded by the estimated ones. The errors between the estimated and simulation results are extremely small, at most 0.6% for the multiplier M16 in [30]. For our multipliers, there is no error in NMED since the approximate products are always less than or equal to the exact results, which means that the equality in (6) is satisfied.

As the proposed 32 × 32 multipliers are built by using approximate 16 × 16 ones, their accuracies have the same trend as those of the 16-bit multipliers. Besides, hybrid 32-bit multipliers are also built by using different 16-bit multipliers with different accuracies as shown at the bottom of Table 6.


The accuracies of the multipliers M32_CPxy_12_4 (x = 1, 2; y = 3, 4; see Table 5) are strongly dependent on those of the 16-bit multipliers used to build the higher part (as in Fig. 3).

TABLE 6. Accuracy comparisons of approximate 8 × 8, 16 × 16, and 32 × 32 multipliers.

FIGURE 16. Partial product matrix of a signed 8 × 8 multiplier using Baugh-Wooley algorithm [31].

2) SIGNED MULTIPLIERS
Booth encoding is one of the algorithms used to implement signed multipliers. Booth encoding reduces the height of the partial product matrix at the expense of the complexity of partial product generation (encoding). Section V implemented unsigned approximate multipliers. Our proposed approach can be extended to implement signed multipliers by using the Baugh-Wooley algorithm [31] as shown in Fig. 16. In the diagram, there is a small change in the partial product matrix in comparison with that of the unsigned multiplier.

The approximate signed 8 × 8 and 16 × 16 multipliers are implemented by using the Baugh-Wooley algorithm. Their accuracy metrics are exhaustively analyzed and compared with prior works as shown in Tables 7 and 8. Note that p is the number of least significant partial product columns approximated. Our proposed approximate signed 8 × 8 multipliers are named M8s_CPxy_k (x = 1; y = 3, 4; k = 2 to 6). The approximate signed 16 × 16 multipliers are named M16s_CPxy_k (x = 1, 2; y = 3, 4; k = 2 to 12).

As we can see in Table 7, the MREDs of the proposed approximate signed 8 × 8 multipliers M8s_CP13_k and M8s_CP14_k (k = 2 to 6) are lower than those of the multipliers R4ABM2 (except for M8s_CP13_6), and higher than those of the R4ABM1. The NMEDs of M8s_CP14_k (k = 2 to 6) are comparable to those of the R4ABM2, and the NMEDs of M8s_CP13_k (k = 2 to 6) are higher than those of the multipliers R4ABM2. However, our multipliers show better power savings, which is demonstrated in Section VII-B.2 (case study). The multipliers M8s_CP14_k show significant advantages in accuracy compared to M8s_CP13_k (k = 2 to 6). For 16-bit operands, the multipliers M16s_CP14_k and M16s_CP24_k are more accurate than M16s_CP13_k and M16s_CP23_k, respectively (k = 2, 4, 6, 8, 10, 12). In terms of MRED, our signed 16-bit multipliers have better accuracies compared to the signed multipliers R4ARBM1 and R4ARBM2 proposed by Liu et al. [33] as shown in Table 8.

Previous studies [21], [32], [33] assessed the quality of 32-bit multipliers by simulations with randomized input values since it is impossible to cover all possible combinations. This approach does not guarantee the qualities of the developed approximate multipliers. Therefore, we still build approximate signed 32-bit multipliers by using small ones. However, it is complicated to develop such multipliers


with 2's complement input operands. Instead, we still use small-size approximate unsigned multipliers to implement large-size signed multipliers (with some modifications for Fig. 3) as shown in Fig. 17. The most significant bits (MSBs) of the operands are the sign bits. The remaining bits are encoded in the format of unsigned values. That means we do not use 2's complement values. For example, the signed 17-bit decimal value −65535 is encoded as 1_FFFF in hexadecimal, in which the MSB is the sign bit. For an N-bit number, the maximum negative value −2^(N−1) is rounded off to −2^(N−1) + 1 since we can only represent the values within [−2^(N−1) + 1, +2^(N−1) − 1]. The input value encoding is performed at the application level. Thus, the signed multiplication is transformed into the multiplication of two unsigned values. The sign of the final product is determined by an XOR operation of the input operands' signs. The final result (P_f) is obtained by inverting the unsigned product (P) if the sign bit (S) is 1, which is given as (30).

$$P_f = \begin{cases} \bar{P} + 1 \approx \bar{P} & \text{if } S = 1 \\ P & \text{if } S = 0 \end{cases} \tag{30}$$

Compared to the approximate unsigned multipliers, the approximate signed ones produce one more error. Besides, an error also occurs in the case of the maximum negative value. However, these errors are very small. According to this implementation, the accuracy metrics (e.g., MED, MRED, and NMED) of approximate signed multipliers are nearly equal to those of the unsigned counterparts. That is, the accuracy of a signed 33-bit multiplier is nearly equal to that of the unsigned 32-bit multiplier. We take signed 17-bit multiplication as an example to validate our approach as shown in Table 9. The differences between the qualities of signed 17-bit and unsigned 16-bit multiplications are very small, on the order of 10^−7. The area overhead of approximate signed multipliers is caused by a sign bit detector (an XOR gate) and 63 multiplexers (2-to-1), which occupy approximately 33 LUTs. Compared to the area of 32-bit multipliers, this overhead is negligible.

TABLE 7. Accuracy comparisons of approximate signed 8 × 8 multipliers.

TABLE 8. Accuracy comparisons of approximate signed 16 × 16 multipliers.

FIGURE 17. An implementation of approximate signed N-bit multiplier by using the smaller ones. Operands A and B are encoded in unsigned values (e.g., the signed 17-bit decimal value −65535 is encoded as 1_FFFF in hexadecimal; the MSB is the sign bit). The final product can be kept in the format of operands A and B or 2's complement, which is controlled by the signal 'mode'.

TABLE 9. Accuracy validation (NMED) of approximate 17-bit signed multipliers.

TABLE 10. Performance comparison of approximate 8 × 8 multipliers.

TABLE 11. Performance comparison of approximate 16 × 16 multipliers.

FIGURE 18. The setup model for power measurement.

FIGURE 19. The trade-offs between the accuracy and PDAP of 8 × 8 multipliers.

B. DELAY, POWER AND AREA RESULTS
To evaluate the efficiency of our designs, the proposed multipliers are implemented with different accuracies by varying the number of leftmost partial product columns compressed by exact compressors, full adders, and half adders. These proposed multipliers are compared with the exact LUT-based and DSP-based multipliers and the prior works in terms of delay, area, power consumption, PDP, and PDAP. Our multipliers and the prior works are implemented by using Verilog-HDL and evaluated on a Xilinx Spartan-6 FPGA (XC6SLX9-2TQG144).
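As a quick sanity check on how the reported PDP and PDAP gains relate to the individual savings, the short Python sketch below recomputes them. It assumes the usual definitions PDP = power × delay and PDAP = power × delay × area, and uses the M8_CP13_2 savings quoted in this section relative to the exact LUT-based multiplier (28.2% in delay, 54.3% in area, 67.8% in dynamic power). The helper `gain` is a hypothetical name introduced only for this illustration.

```python
from typing import Iterable


def gain(savings: Iterable[float]) -> float:
    """Gain factor of a product metric given fractional savings of its factors.

    Each saving s leaves a fraction (1 - s) of the baseline value, so the
    product metric shrinks by the product of those fractions.
    """
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s
    return 1.0 / remaining


# Savings of M8_CP13_2 versus the exact LUT-based multiplier (from the text).
delay_saving, area_saving, power_saving = 0.282, 0.543, 0.678

pdp_gain = gain([power_saving, delay_saving])                 # power x delay
pdap_gain = gain([power_saving, delay_saving, area_saving])   # power x delay x area

print(f"PDP gain  = {pdp_gain:.1f}x")
print(f"PDAP gain = {pdap_gain:.1f}x")
```

Under these assumptions the script reproduces the 4.3× PDP and 9.5× PDAP gains stated for M8_CP13_2, and the same calculation with the M16_CP23_4 savings (27.9%, 43.2%, 70.7%) gives the quoted 4.7× and 8.3×.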
For the power consumption measurements, these circuits operate at a relaxed frequency of 100 MHz (for 8 × 8 and 16 × 16 multipliers) and 50 MHz (for 32 × 32 multipliers). Fig. 18 shows the setup model where a high-precision multimeter is used to measure the current dissipation of the multipliers, which is more precise and faster than the power estimation using the PowerAnalyzer in the ISE software. The dynamic power dissipation of 8 × 8 and 16 × 16 multipliers is evaluated with all the possible combinations of input data.
It is impossible to cover all the cases of 32-bit input data for 32 × 32 multipliers, so their dynamic power dissipation is measured by feeding the inputs with random patterns (generated by a pseudo-random number generator) [34].

Table 10 summarizes the comparison between our 8 × 8 multipliers and the prior works in terms of electrical performances (i.e., delay, power, and area), PDP, and PDAP. The electrical performances of the proposed multipliers using approximate 4-2 compressors CP1 and CP3 are better than those of the multipliers built by employing the approximate 4-2 compressors CP1 and CP4. Our proposed 8 × 8 multipliers outperform the prior works, the exact LUT-based, and DSP-based multipliers. Compared to the DSP-based multiplier, the proposed multipliers M8_CP13_2 and M8_CP14_2 achieve delay improvements of 40.8% and 37.1%, respectively. Compared to the exact LUT-based multiplier, the multipliers M8_CP13_2 and M8_CP14_2 have shorter delays, smaller occupied areas (LUTs and CARRY4s), and consume less dynamic power. They save (28.2%, 23.7%) in delay, (54.3%, 47.8%) in area, and (67.8%, 64.4%) in dynamic power dissipation, respectively. They offer PDP and PDAP gains of up to (4.3× and 3.7×) and (9.5× and 7.1×), respectively, with regard to the exact multiplier. The M8-Ca and M8-Cc in [23] show worse performances than the exact multiplier. The reason is that their multipliers were built from customized components which cannot be optimized by the synthesis tools. Fig. 19 plots the trade-offs between the accuracies and PDAPs of the presented approximate multipliers. Our proposed multipliers offer a good range of PDAP benefits at different levels of accuracy, which is bounded by the two ends (M8_CP14_6 and M8_CP13_2). This enables a search space exploration to meet user-defined constraints on both power dissipation and accuracy.

The comparison of electrical performances, PDPs, and PDAPs between our approximate 16 × 16 multipliers and previous works is summarized in Table 11. All four approximate 4-2 compressors (CP1 to CP4) are deployed to develop four classes of approximate multipliers. Among them, the multipliers M16_CP23_k deliver better performances than our other multipliers. The multiplier M16_CP23_4 dominates the M16_CP14_4 in terms of electrical performances. Generally, the multipliers M16_CP24_k deliver better electrical performances than the multipliers M16_CP13_k. Compared to the exact LUT-based multipliers, the M16_CP23_4 can save 27.9% in delay, 70.7% in dynamic power dissipation, and 43.2% in area. In particular, it improves the PDP and PDAP by 4.7× and 8.3× with respect to the exact LUT-based multiplier. The prior works show worse PDAPs than the exact LUT-based multipliers except for the 16-bit multipliers in [17] and [14]. Even though the multipliers M16 in [17] and [14] are better than the exact LUT-based multiplier, their electrical performances are not better than those of our highest accuracy multiplier M16_CP14_12. Compared to the DSP-based multiplier, our multipliers show better electrical performances (delay, power consumption, and PDP) except for the multipliers M16_CP14_k (k = 4 to 12), M16_CP13_12, and M16_CP24_12. In terms of delay (latency), the proposed multipliers M16_CPxy_k (x = 1, 2; y = 3, 4; k = 4 to 10) are better than the DSP-based multiplier. The multiplier M16_CP23_4 can improve the latency by up to 16%. Similar to the 8-bit multipliers, Fig. 20 depicts the trade-offs between the accuracies and PDAPs of our proposed 16-bit multipliers as well as the prior works.

FIGURE 20. The trade-offs between the accuracy and PDAP of 16 × 16 multipliers.

As explained above, our 32 × 32 multipliers are built by using approximate 16 × 16 multipliers. Their electrical performances are summarized in Table 12. Compared to the prior works and the exact LUT-based multiplier, all of our proposed multipliers have shorter delays. They consume less dynamic power except for the multipliers M32_CPxy_12 (x = 1, 2; y = 3, 4). They cost smaller hardware resources than the prior works and the exact LUT-based multiplier except for the M32_CP14_12 (it is a little bigger than M32 in [30]). In terms of PDAP, most of our multipliers are better than the existing works and the exact LUT-based multiplier (except for M32_CP14_12 and M32_CP24_12). The multipliers M32_CP23_4 and M32_CP23_12_4 offer good benefits in PDP and PDAP. They can improve PDPs and PDAPs by up to (2.9×, 2.0×) and (5.0×, 3.2×), respectively, compared to the exact LUT-based multipliers. Our multipliers M32_CP13_4, M32_CP24_4, M32_CP23_4, M32_CP23_12_4, and M32_CP24_12_4 demonstrate better performances than the DSP-based multiplier. In particular, the delays of all the proposed multipliers are almost half of the delay of the DSP-based multiplier. Note that the 32-bit DSP-based multiplier is built by using 18-bit DSP-based multipliers since the used FPGA only supports the 18-bit multiplier. The procedure to build the large size multiplier from the smaller ones is described in Fig. 3.

TABLE 12. Performance comparison of approximate 32 × 32 multipliers.

Liu et al. [33] stated that large size approximate multipliers can be used in applications demanding high dynamic range computation. As an example, 32-bit approximate multipliers were applied to high dynamic range (HDR) image processing [33]. Besides, they can also be applied to machine learning applications as reported in [35].

VII. IMAGE PROCESSING APPLICATIONS: CASE STUDIES
In this section, two image processing applications, image sharpening and image multiplication, are developed by using the proposed approximate multipliers. The dynamic power dissipation of the approximate multipliers and the quality of output images are measured and assessed.

A. IMAGE SHARPENING
We apply the proposed approximate multipliers to implement image sharpening, which has been widely used to evaluate the effectiveness and quality of approximate multipliers. The sharpened image is obtained by (31). In this design example, an 8-bit image is convolved with the filter kernel M as described in (32).

$$Y(i, j) = 2X(i, j) - \frac{1}{1115} \sum_{m=-3}^{3} \sum_{n=-3}^{3} X(i + m, j + n) \times M(m + 3, n + 3) \tag{31}$$

The mask M (7 × 7) [36] is given as (32):

$$M = \begin{bmatrix}
1 & 4 & 7 & 10 & 7 & 4 & 1 \\
4 & 12 & 26 & 33 & 26 & 12 & 4 \\
7 & 26 & 55 & 71 & 55 & 26 & 7 \\
10 & 33 & 71 & 91 & 71 & 33 & 10 \\
7 & 26 & 55 & 71 & 55 & 26 & 7 \\
4 & 12 & 26 & 33 & 26 & 12 & 4 \\
1 & 4 & 7 & 10 & 7 & 4 & 1
\end{bmatrix} \tag{32}$$

Peak signal-to-noise ratio (PSNR) is one of the metrics used for investigating the quality of approximate images. Additionally, the structural similarity index metric (SSIM), which is consistent with the human perception of an image, is also used to assess the quality of the processed image. This metric works by measuring the structural similarity of the exact and approximate images.

All the designs are implemented by using Verilog-HDL on the Xilinx Spartan-6 FPGA (XC6SLX9-2TQG144). The FPGA is powered with a supply voltage of 1.2 V and runs at a clock frequency of 100 MHz to measure the dynamic power consumption of the multipliers.

TABLE 13. Quality and power savings comparison for 8 × 8 multipliers (at 100MHz, 1.2V).

Table 13 shows the comparison between the proposed approximate multipliers and the prior works in terms of quality (PSNR and SSIM) and power savings. As we can see, the prior works show limited benefits in power dissipation compared to the exact LUT-based multiplier. Some prior works consume more power than the exact LUT-based multiplier. These results confirm that the approximation methods which are successfully applied to ASIC-based designs present a limited (4.2%) or even no benefit (−48.3%) when directly applied to FPGA-based designs. Our proposed design M8_CP13_6 shows a good PSNR (45.87 dB) and SSIM (0.9987). It can offer a power saving of 18.3% with respect to the exact LUT-based multiplier. The design M8_CP13_2 shows a high quality with a PSNR of 39.1 dB, an SSIM of 0.9978, and a high power saving of 36.7%. Compared to the proposed multipliers using the 4-2 compressor CP3, the multipliers employing the 4-2 compressor CP4 offer smaller power savings, but higher qualities. Our multipliers M8_CP13_k (k = 2 to 5) offer better power savings than the exact DSP-based multiplier. Though the other proposed multipliers are not better than the DSP-based multiplier in terms of power consumption, they deliver good latencies. Fig. 21 shows some of the output images, their qualities (PSNR, SSIM), and the percentage of power savings.

B. IMAGE MULTIPLICATION
1) UNSIGNED MULTIPLIERS
In this design example, image multiplication is developed by using the proposed approximate 16 × 16 multipliers. Two 16-bit images are multiplied with each other pixel by pixel, which blends the two images into a single output image. The block diagram of the image multiplication is illustrated in Fig. 22. Note that the 16-bit Lena image is converted from the original 8-bit image by using Matlab. Input images are read into input registers and then processed by the approximate multipliers. The output values are then stored into output registers. The metrics PSNR and SSIM are used to evaluate the quality of the output images.
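To make the sharpening case study of (31) and (32) and the PSNR metric concrete, a minimal pure-Python sketch follows. It is an illustration under stated assumptions rather than the authors' hardware flow: the `mul` argument is a hypothetical hook where an approximate 8 × 8 multiplier model would be plugged in (the exact product is the default), border pixels are handled by edge replication, the division by 1115 is an integer division, and outputs are clipped to [0, 255]. None of these details are specified in the text. PSNR uses the standard definition 10·log10(peak² / MSE) with a peak of 255 for 8-bit images.

```python
import math
from typing import Callable, List

Image = List[List[int]]

# Sharpening mask M from (32); its entries sum to the normalizer 1115 in (31).
MASK = [
    [1, 4, 7, 10, 7, 4, 1],
    [4, 12, 26, 33, 26, 12, 4],
    [7, 26, 55, 71, 55, 26, 7],
    [10, 33, 71, 91, 71, 33, 10],
    [7, 26, 55, 71, 55, 26, 7],
    [4, 12, 26, 33, 26, 12, 4],
    [1, 4, 7, 10, 7, 4, 1],
]
MASK_SUM = sum(map(sum, MASK))


def sharpen(image: Image,
            mul: Callable[[int, int], int] = lambda a, b: a * b) -> Image:
    """Apply (31) to an 8-bit grayscale image given as a list of rows."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0
            for m in range(-3, 4):
                for n in range(-3, 4):
                    # Assumption: replicate edge pixels outside the image.
                    r = min(max(i + m, 0), h - 1)
                    c = min(max(j + n, 0), w - 1)
                    acc += mul(image[r][c], MASK[m + 3][n + 3])
            value = 2 * image[i][j] - acc // MASK_SUM  # assumption: integer division
            out[i][j] = min(max(value, 0), 255)        # assumption: clip to 8 bits
    return out


def psnr(reference: Image, test: Image, peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    h, w = len(reference), len(reference[0])
    mse = sum((reference[i][j] - test[i][j]) ** 2
              for i in range(h) for j in range(w)) / (h * w)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

A typical use is `psnr(sharpen(img), sharpen(img, mul=approx))`, which compares the output of an approximate multiplier model `approx` against the exact result, mirroring how the quality figures in Table 13 are framed.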


TABLE 14. Quality and power savings comparison for 16 × 16 multipliers (at 100MHz, 1.2V).

FIGURE 21. Image sharpening: (a) by using exact multiplier; PSNR (dB), SSIM and power savings (%) by using multipliers (b) M8_CP13_2, (c) M8_CP14_2, (d) [19], (e) [14], (f) [23]-Ca.

FIGURE 22. Image processing application: image multiplication.

FIGURE 23. Image multiplication: (a) and (b) input images, (c) output image by using the exact multiplier; output images, PSNR (dB), SSIM and power savings (%) by using multipliers (d) M16_CP13_4, (e) M16_CP14_4, (f) M16_CP23_4, (g) M16_CP24_4, (h) [19], (i) [30].

Table 14 shows the comparison between the proposed approximate multipliers and the prior works in terms of quality (PSNR and SSIM) and power savings. The previous works offer a limited benefit in power saving (9.85% for M16 [30]) or even no power benefit (−38.46% for M16-Ca [23]). Our multipliers provide the best quality with a PSNR of up to 80.25 dB and an SSIM of up to 1.0. The qualities of the proposed multipliers M16_CP23_k (k = 4 to 12) are lower than those of the multipliers M16_CP14_k, but they save more dynamic power. Their PSNRs are greater than 40 dB, which implies that the quality of the output images is good enough for human perception as demonstrated in Fig. 23. Compared to the exact LUT-based multiplier, the M16_CP23_4 offers an impressive power saving of up to 58.15%. Compared to the exact DSP-based multiplier, our multipliers M16_CP24_4, M16_CP23_4, and M16_CP23_6 offer comparable power savings and better latencies. The other proposed multipliers can be used when the DSP-based multipliers are exhausted by some application as reported in [22], or when the timing constraint of the design is critical.

2) SIGNED MULTIPLIERS
Similar to Section VII-B.1, approximate signed 8 × 8 multipliers are applied to image multiplication in order to assess their qualities and applicability. In this demonstration, since signed multipliers are used, we need to translate the pixel values of the image (8-bit cameraman.png, 256 × 256 pixels) from the range [0, 255] to [−128, 127]. Two identical images are multiplied with each other pixel by pixel [32]. The PSNRs, SSIMs, power consumption, and power savings are summarized in Table 15. Note that the PSNRs of


R4ABM1 and R4ABM2 are higher than those reported by Liu et al. [32] since the image source that was used for the experiments may be different. The PSNRs of output images that are processed by our approximate signed multipliers M8s_CP14_2 and M8s_CP13_2 are lower than those of the R4ABM1 and R4ABM2 (p = 12). However, the SSIMs of our output images are higher than those of the images processed by the multipliers R4ABM1 and R4ABM2 (p = 12). Importantly, our approximate signed multipliers offer much higher power savings. The approximate multipliers R4ABM1 and R4ABM2 in [32] consume high power because they still use exact compressors to accumulate partial products. Furthermore, our proposed multipliers show higher benefits in dynamic power dissipation compared to the DSP multiplier (except for M8s_CP1y_6 (y = 3, 4)). Fig. 24 shows the output images processed by our multipliers.

TABLE 15. Quality and power savings comparison for signed 8 × 8 multipliers (at 100MHz, 1.2V).

FIGURE 24. Image multiplication: (a) output image by using the exact signed multiplier; output images, PSNR (dB), SSIM and power savings (%) by using multipliers (b) M8s_CP13_2, (c) M8s_CP14_2, (d) M8s_CP13_5.

VIII. CONCLUSION
In this work, we proposed approximate 8 × 8, 16 × 16, and 32 × 32 multipliers with different accuracies, which were efficiently implemented on FPGAs by using the proposed approximate compressors.

Our proposed approximate multipliers outperform the prior works and the exact LUT-based multiplier in terms of electrical performances (delay, power, area), PDPs, and PDAPs. We also presented the trade-offs between the accuracies and PDAPs, which guide designers in selecting an appropriate multiplier to meet the design requirements. The expressions of accuracy analysis for large operand size multipliers are also formulated. Besides, our proposed multipliers outperform the DSP-based multipliers in terms of delay (latency). One more disadvantage of the DSP-based multiplier is that it still uses 4 DSP primitives for arbitrary operand sizes from 19 bits to 32 bits (which causes higher routing delays as shown for the 32-bit multiplier in Table 12), while our proposed method can optimize these multipliers more easily.

Our proposed approximate multipliers were evaluated at both the circuit and application levels to demonstrate their effectiveness and applicability. Two image processing applications, image sharpening and image multiplication, were implemented to measure the dynamic power dissipation savings. Besides, the quality metrics (PSNR, SSIM) of the output images were also evaluated. To the best of our knowledge, this is the first work to implement approximate multipliers and measure their dynamic power consumption on an FPGA board. Finally, our proposed multipliers are suitable for high-performance and low-power error-resilient applications.

REFERENCES
[1] Y. Wang, J. Lin, and Z. Wang, "An energy-efficient architecture for binary weight convolutional neural networks," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 2, pp. 280–293, Feb. 2018.
[2] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, "Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010.
[3] S. Narayanamoorthy, H. A. Moghaddam, Z. Liu, T. Park, and N. S. Kim, "Energy-efficient approximate multiplication for digital signal processing and classification applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 6, pp. 1180–1184, Jun. 2015.
[4] C. K. Jha, S. N. Ved, I. Anand, and J. Mekie, "Energy and error analysis framework for approximate computing in mobile applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 2, pp. 385–389, Feb. 2020, doi: 10.1109/tcsii.2019.2910137.
[5] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, "Design and analysis of approximate compressors for multiplication," IEEE Trans. Comput., vol. 64, no. 4, pp. 984–994, Apr. 2015.
[6] Y. Liu, T. Zhang, and K. K. Parhi, "Computation error analysis in digital signal processing systems with overscaled supply voltage," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 4, pp. 517–526, Apr. 2010.
[7] D. Jeon, M. Seok, Z. Zhang, D. Blaauw, and D. Sylvester, "Design methodology for voltage-overscaled ultra-low-power systems," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 12, pp. 952–956, Dec. 2012.
[8] R. Liu and K. K. Parhi, "Power reduction in frequency-selective FIR filters under voltage overscaling," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 1, no. 3, pp. 343–356, Sep. 2011.
[9] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, "EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods," in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), Lausanne, Switzerland, Mar. 2017, pp. 258–261.
[10] A. Dalloo, A. Najafi, and A. Garcia-Ortiz, "Systematic design of an approximate adder: The optimized lower part constant-OR adder," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 8, pp. 1595–1599, Aug. 2018.
[11] S. Dutt, S. Dash, S. Nandi, and G. Trivedi, "Analysis, modeling and optimization of equal segment based approximate adders," IEEE Trans. Comput., vol. 68, no. 3, pp. 314–330, Mar. 2019.
[12] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "RAP-CLA: A reconfigurable approximate carry look-ahead adder," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 8, pp. 1089–1093, Aug. 2018.
[13] S. Xu and B. C. Schafer, "Exposing approximate computing optimizations at different levels: From behavioral to gate-level," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 11, pp. 3077–3088, Nov. 2017.


[14] M. Ha and S. Lee, "Multipliers with approximate 4-2 compressors and error recovery modules," IEEE Embedded Syst. Lett., vol. 10, no. 1, pp. 6–9, Mar. 2018.
[15] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, "Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 4, pp. 850–862, Apr. 2010.
[16] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, "Design-efficient approximate multiplication circuits through partial product perforation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 10, pp. 3105–3117, Oct. 2016.
[17] T. Yang, T. Ukezono, and T. Sato, "Low-power and high-speed approximate multiplier design with a tree compressor," in Proc. 35th Int. Conf. Comput. Design, Boston, MA, USA, Nov. 2017, pp. 89–96.
[18] I. Qiqieh, R. Shafik, G. Tarawneh, D. Sokolov, S. Das, and A. Yakovlev, "Significance-driven logic compression for energy-efficient multiplier design," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 417–430, Sep. 2018.
[19] S. Venkatachalam and S.-B. Ko, "Design of power and area efficient approximate multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 5, pp. 1782–1786, May 2017.
[20] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, "Approximate multipliers based on new approximate compressors," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.
[21] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Dual-quality 4:2 compressors for utilizing in dynamic accuracy configurable multipliers," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 4, pp. 1352–1361, Apr. 2017.
[22] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, "Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA," in Proc. 28th Int. Conf. Field-Program. Logic Appl., Dublin, Ireland, Aug. 2018, pp. 163–169.
[23] S. Ullah, "Area-optimized low-latency approximate multipliers for FPGA-based hardware accelerators," in Proc. 55th Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2018, pp. 1–6, doi: 10.1109/DAC.2018.8465781.
[24] S. Ullah, S. S. Murthy, and A. Kumar, "SMApproxLib: Library of FPGA-based approximate multipliers," in Proc. 55th Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2018, pp. 1–6, doi: 10.1109/DAC.2018.8465845.
[25] N. V. Toan and J. G. Lee, "Energy-area-efficient approximate multipliers for error-tolerant applications on FPGAs," in Proc. 32nd IEEE Int. Syst.-Chip Conf., Singapore, Sep. 2019, pp. 336–341.
[26] J. Liang, J. Han, and F. Lombardi, "New metrics for the reliability of approximate and probabilistic adders," IEEE Trans. Comput., vol. 62, no. 9, pp. 1760–1771, Sep. 2013.
[27] Z. Yang, J. Han, and F. Lombardi, "Approximate compressors for error-resilient multiplier design," in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst., Amherst, MA, USA, Oct. 2015, pp. 183–186.
[28] Xilinx Inc. (Dec. 2009). Spartan-6 Libraries Guide for HDL Designs (Ver.11.4). [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx11/spartan6_hdl.pdf
[29] C. Liu, J. Han, and F. Lombardi, "A low-power, high-performance approximate multiplier with configurable partial error recovery," in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), Dresden, Germany, Mar. 2014, pp. 1–4, doi: 10.7873/DATE.2014.108.
[30] M. S. Ansari, H. Jiang, B. F. Cockburn, and J. Han, "Low-power approximate multipliers using encoded partial products and approximate compressors," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 404–416, Sep. 2018.
[31] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. Comput., vol. C-22, no. 12, pp. 1045–1047, Dec. 1973.
[32] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han, and F. Lombardi, "Design of approximate radix-4 booth multipliers for error-tolerant computing," IEEE Trans. Comput., vol. 66, no. 8, pp. 1435–1441, Aug. 2017.
[33] W. Liu, T. Cao, P. Yin, Y. Zhu, C. Wang, E. E. Swartzlander, and F. Lombardi, "Design and analysis of approximate redundant binary multipliers," IEEE Trans. Comput., vol. 68, no. 6, pp. 804–819, Jun. 2019.
[34] R. Nelson. (Mar. 2019). Pseudo Random Number Generator With Linear Feedback Shift Registers (Verilog). [Online]. Available: https://www.digikey.com/eewiki/pages/viewpage.action?pageId=16351401
[35] D. Kim, J. Kung, and S. Mukhopadhyay, "A power-aware digital multilayer perceptron accelerator with on-chip training based on approximate computing," IEEE Trans. Emerg. Topics Comput., vol. 5, no. 2, pp. 164–178, Apr. 2017.
[36] R. Jain, R. Casturi, and B. G. Schunck, "Image filtering," in Machine Vision, 1st ed. New York, NY, USA: McGraw-Hill, 1995, ch. 4, pp. 112–139.

NGUYEN VAN TOAN (Member, IEEE) received the B.S. degree in electronics and telecommunications engineering from the Da Nang University of Technology, Viet Nam, in 2007, and the M.S. degree from the HCM University of Science, Ho Chi Minh City, Viet Nam, in 2014, and the Ph.D. degree in computer engineering from Hallym University, South Korea, in 2018. He is currently working as a Postdoctoral Researcher with Hallym University. His research interests include FPGA-based system designs, low EMI circuit designs, and approximate computing.

JEONG-GUN LEE (Member, IEEE) received the B.S. degree in computer engineering from Hallym University, in 1996, and the M.S. and Ph.D. degrees from the Gwangju Institute of Science and Technology (GIST), South Korea, in 1998 and 2005, respectively. He is currently a Full Professor with the Department of Computer Engineering, Hallym University. Prior to joining the faculty of Hallym University, in 2008, he was a visiting Postdoctoral Researcher with the Computer Laboratory, University of Cambridge, and a Research Professor with GIST. In 2014, he was a Visiting Scholar with the Computer Laboratory, University of Cambridge. His research interests focus on low EMI asynchronous circuit designs, FPGA-based reconfigurable system designs, energy efficient heterogeneous computing, and GPU-based parallel computing.
