FPGA-Based Multi-Level Approximate Multipliers For High-Performance Error-Resilient Applications
ABSTRACT This paper presents approximate multipliers which are efficiently deployed on Field Programmable
Gate Arrays (FPGAs) by using newly proposed approximate logic compressors at different levels
of accuracy. Our approximate multiplier designs offer higher gains of power-delay-area products (PDAP)
than those of the state-of-the-art works at comparable accuracies. Furthermore, in terms of delay, occupied
area, and dynamic power dissipation, our designs are much better than Lookup Table based multiplier
Intellectual Properties that are available on an FPGA. Particularly, our proposed 8-, 16-, and 32-bit multipliers
can deliver PDAP gains up to 7.1 ×, 8.3 ×, and 5.0 ×, respectively. The effectiveness and applicability
of our designs are also demonstrated by image processing applications such as image multiplication and
sharpening. The experiments show that for the image sharpening, our 8 × 8 multipliers can deliver a good
peak signal-to-noise ratio (PSNR) of 46.81 dB, a structural similarity index metric (SSIM) of 0.9989, and
a dynamic power saving of up to 36.7% with regard to the exact multiplier. For the image multiplication,
approximate 16 × 16 multipliers can offer a high PSNR of 80.25 dB, an SSIM of 1.0, and a dynamic power
saving of up to 58.15%. From these demonstrations, the proposed multipliers are expected to be well suited
to high-performance and low-power error-resilient applications.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 25481
N. V. Toan, J.-G. Lee: FPGA-Based Multi-Level Approximate Multipliers for High-Performance Error-Resilient Applications
are occupied by the partial product accumulation. Therefore, the optimization of partial product accumulation can achieve more gains.
The removal of partial products from the partial product matrix (PPM) leads to reductions in circuit area and delay. Truncation is one such method; it ignores the least significant partial products in the partial product accumulation [14]. In [15], the authors proposed a broken-array multiplier whose omitted carry-save adder cells were specified by the horizontal break and vertical break parameters. These methods cause biased errors; that is, the approximate multipliers can produce zero outputs for non-zero inputs. Zervakis et al. [16] proposed to skip the generation of arbitrary partial product rows in the PPM, which was called partial product perforation. Nevertheless, the more partial products are removed, the more errors are produced.
OR-gate based logic compressions were applied to reduce the partial product tree. In [17], the authors used a 2-input OR gate to combine two partial products in the same column into a new one, and this procedure was applied to all columns in the PPM. In [18], the authors proposed to combine two or more partial products in the same columns at the first reduction stage by using OR gates with two or more inputs, taking bit position significance into consideration. Then, exact full adders were employed for the next reduction stages. However, these methods presented high errors in the final results. To recover the accuracy, error compensation vectors were intentionally generated and added to the final stage or the first reduction stage [17]–[19]. The generation of error vectors incurs area and power dissipation overheads. Moreover, they also increase the height of the partial product tree, which could cost some extra logic compressions. Another limitation of the work [18] is that it used full adders to compress the partial product tree at the reduction stages following the first stage. This limits the benefit of area and power reductions since the logic compression ratio is only 1.5 (= 3/2).
In order to get more benefit of area saving and power reduction, a higher ratio of logic compression was used to reduce the partial product tree, and approximate 4-2 compressors are typically utilized with some losses of accuracy [5], [14], [19], [20]. Even higher ratios of logic compression can be used, such as approximate 5-2 (5 inputs and 2 outputs) and 6-2 compressors [20]. However, they showed higher errors in the final products. The authors in [5] proposed approximate 4-2 compressors with simple logic equations to reduce logic counts. However, their approximate compressors still occupied high hardware resources when synthesized on an FPGA. Besides, their proposed multipliers utilizing that compressor produced non-zero values with zero-valued inputs, which costs extra logic to detect this case. Similarly, the authors in [21] used both their proposed approximate compressors and exact compressors to form the so-called configurable dual-quality multipliers. Nevertheless, their method cannot be deployed on FPGAs since the existing commercial FPGAs do not feature power gating. Besides, their proposed 4-2 compressors still consume high hardware resources and power with moderate error rates when synthesized on FPGAs.
As the above discussion shows, employing the presented approximation techniques for designing different types of multipliers can achieve performance gains, energy savings, and area reductions. Generally, FPGA-based and application-specific integrated circuit (ASIC)-based design architectures are totally different from each other. For ASIC-based designs, we can fully optimize and customize our circuits to reduce logic counts. In contrast, for FPGA-based designs, the logic functions are implemented by fabricated look-up tables (LUTs) which have a fixed number of inputs and fixed drive strength. It is not easy to fully control their utilization. Thus, the approximation techniques that have been efficiently deployed to ASIC-based designs might offer limited benefits when porting to FPGA-based designs.
Xilinx and Intel FPGAs provide fast DSP-based multipliers for digital signal processing applications demanding low power. However, using these multiplier Intellectual Properties (IPs) can incur a large routing delay since they are available only at some locations on an FPGA. Meanwhile, LUTs are scattered all over the device, which can make the routing delay much shorter. Note that in modern devices, the routing delay becomes dominant in the data-path delay. Besides, if an application exhaustively occupies all dedicated IPs, there are no IPs left for other applications that require low-power and high-speed computations. Therefore, LUT-based multiplier IPs are still necessary [22]. Ullah et al. [23], [24] have recently proposed to optimize their multipliers by truncating off a least significant partial product of a 4 × 2 multiplier to save an LUT. Then, larger operand size multipliers (e.g., 8 × 8, 16 × 16) were built by using this building block. Their approach only provided a limited benefit in area and power savings.
To tackle the aforementioned problems, our work has designed and analyzed a compact approximate 3-2 compressor and approximate 4-2 compressors. Then, different approximate multipliers are proposed and realized by using those logic compressors at different levels of accuracy. In this work, our goal is to optimize the approximate multipliers so that they can be efficiently implemented on FPGAs with high electrical performances (i.e., low power consumption, small area, low latency) and low accuracy losses. Thus, these multipliers can be suitable for digital signal processing applications such as image processing, multimedia, etc. The efficiencies of our multiplier designs will be analyzed and assessed by comparing to the state-of-the-art works in terms of delay, power, area, power-delay product (PDP), and PDAPs. Furthermore, their accuracy losses are also analyzed and compared to those of the prior works. Our work is summarized as follows:
1) Approximate multipliers are proposed to be efficiently implemented on FPGAs at low losses of accuracy by using the compact approximate 3-2 compressor and approximate 4-2 compressors.
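As a rough illustration of the block-based construction discussed above for [23], [24], the following Python sketch assembles an 8 × 8 multiplier from four 4 × 4 sub-multipliers and measures the mean error distance exhaustively. The 4 × 4 block used here, which simply drops the least significant partial product, is a hypothetical stand-in chosen for brevity; it is not the 4 × 2 block of [23], [24] nor one of the compressors proposed in this paper. The function names are likewise assumptions. The sub-products are weighted by 2^8, 2^4, 2^4, and 1, the same powers of two (2^{2m}, 2^m, 2^m, 1 with m = 4) that weight the sub-multiplier error terms in Eq. (8).

```python
def approx_mul4(a, b):
    """Hypothetical 4x4 block: exact product minus the dropped LSB partial product (a0 AND b0)."""
    return a * b - ((a & 1) & (b & 1))


def approx_mul8(a, b):
    """8x8 product assembled from four 4x4 sub-products of the high and low nibbles."""
    ah, al = a >> 4, a & 0xF
    bh, bl = b >> 4, b & 0xF
    return ((approx_mul4(ah, bh) << 8)
            + ((approx_mul4(ah, bl) + approx_mul4(al, bh)) << 4)
            + approx_mul4(al, bl))


def mean_error_distance(mul, bits):
    """Exhaustive mean of |exact - approximate| over all unsigned operand pairs."""
    n = 1 << bits
    total = sum(abs(a * b - mul(a, b)) for a in range(n) for b in range(n))
    return total / (n * n)


med4 = mean_error_distance(approx_mul4, 4)
med8 = mean_error_distance(approx_mul8, 8)
print(med4)  # 0.25
print(med8)  # 72.25, i.e. (2**8 + 2 * 2**4 + 1) * med4
```

Because this particular block always underestimates the product, the four sub-errors never cancel and the composite error is exactly the weighted sum of the block errors. For blocks whose errors can take either sign, the weighted sum is only an upper bound on the composite mean error distance.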
$$\mathrm{MED}_N = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2m}} \left( 2^{2m}\,|ED_{HH_i}| + 2^{m}\,|ED_{HL_i}| + 2^{m}\,|ED_{LH_i}| + |ED_{LL_i}| \right) \times 2^{2N-2m} \tag{8}$$
IV. DESIGN AND ANALYSIS OF APPROXIMATE COMPRESSORS
In this section, we present two kinds of compressor that are efficiently implemented on an FPGA: an approximate
FIGURE 8. Karnaugh map of the approximate 4-2 compressor CP2: the number of switching and glitches are reduced as the inputs change in the shaded region.
FIGURE 10. Karnaugh map of the approximate 4-2 compressor CP3.
TABLE 9. Accuracy validation (NMED) of approximate 17-bit signed multipliers.
TABLE 10. Performance comparison of approximate 8 × 8 multipliers.
TABLE 12. Performance comparison of approximate 32 × 32 multipliers.
TABLE 13. Quality and power savings comparison for 8 × 8 multipliers (at 100MHz, 1.2V).
FIGURE 21. Image sharpening: (a) by using exact multiplier; PSNR (dB), SSIM and power savings (%) by using multipliers (b) M8_CP13_2, (c) M8_CP14_2, (d) [19], (e) [14], (f) [23]-Ca.
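The PSNR values quoted in these captions compare an image produced with an approximate multiplier against the image produced with the exact multiplier. A minimal sketch of that metric is given below, assuming the standard definition PSNR = 10 log10(peak^2 / MSE) with a peak value of 255 for 8-bit images. The function name and the four sample pixel values are illustrative assumptions and do not come from the experiments reported here.

```python
import math


def psnr(reference, test, peak=255):
    """PSNR in dB between two equally sized images given as flat lists of pixel values."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error between corresponding pixels.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Illustrative pixel values only: output of an exact and an approximate datapath.
exact_pixels = [10, 50, 100, 200]
approx_pixels = [10, 52, 100, 198]
quality_db = psnr(exact_pixels, approx_pixels)
print(round(quality_db, 2))
```

A higher PSNR means the approximate output is closer to the exact one; identical images give an infinite PSNR under this definition.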
FIGURE 23. Image multiplication: (a) and (b) input images, (c) output image by using the exact multiplier; output images, PSNR (dB), SSIM and power savings (%) by using multipliers (d) M16_CP13_4, (e) M16_CP14_4, (f) M16_CP23_4, (g) M16_CP24_4, (h) [19], (i) [30].
Table 14 shows the comparison between the proposed approximate multipliers and the prior works in terms of quality (PSNR and SSIM) and power savings. The previous works offer a limited benefit in power saving (9.85% for
2) SIGNED MULTIPLIERS
Similar to Section VII-B.1, approximate signed 8 × 8 multipliers are applied to image multiplication in order to assess their qualities and applicability. In this demonstration, since signed multipliers are used, we need to translate the pixel values of the image (8-bit cameraman.png, 256 × 256 pixels) from the range [0, 255] to [−128, 127]. Two identical images are multiplied with each other pixel by pixel [32]. The PSNRs, SSIMs, power consumption, and power savings are summarized in Table 15. Note that the PSNRs of R4ABM1 and R4ABM2 are higher than those reported by Liu et al. [32] since the image source that was used for experiments can be different. The PSNRs of output images that are processed by our approximate signed multipliers M8s_CP14_2 and M8s_CP13_2 are lower than those of the R4ABM1 and R4ABM2 (p = 12). However, the SSIMs of our output images are higher than those of the images processed by the multipliers R4ABM1 and R4ABM2 (p = 12). Importantly, our approximate signed multipliers offer much higher power savings. The approximate multipliers R4ABM1 and R4ABM2 in [32] consume high power because they still used exact compressors to accumulate partial products. Furthermore, our proposed multipliers show higher benefits in dynamic power dissipation compared to the DSP multiplier (except for M8s_CP1y_6 (y = 3, 4)). Fig. 24 shows the output images processed by our multipliers.
TABLE 15. Quality and power savings comparison for signed 8 × 8 multipliers (at 100MHz, 1.2V).
FIGURE 24. Image multiplication: (a) output image by using the exact signed multiplier; output images, PSNR (dB), SSIM and power savings (%) by using multipliers (b) M8s_CP13_2, (c) M8s_CP14_2, (d) M8s_CP13_5.
VIII. CONCLUSION
In this work, we proposed approximate 8 × 8, 16 × 16, and 32 × 32 multipliers with different accuracies, which were efficiently implemented on FPGAs by using proposed approximate compressors.
Our proposed approximate multipliers outperform the prior works and the exact LUT-based multiplier in terms of electrical performances (delay, power, area), PDPs, and PDAPs. We also presented the trade-offs between the accuracies and PDAPs, which guide designers in selecting an appropriate multiplier to meet the design requirements. The expressions of accuracy analysis for large operand size multipliers are also formulated. Besides, our proposed multipliers outperform the DSP-based multipliers in terms of delays (latencies). One more disadvantage of the DSP-based multiplier is that it still uses 4 DSP primitives for arbitrary operand sizes from 19 bits to 32 bits (which causes higher routing delays, as shown for the 32-bit multiplier in Table 12), while our proposed method can optimize these multipliers more easily.
Our proposed approximate multipliers were evaluated on both circuit and application levels to demonstrate their effectiveness and applicability. Two image processing applications, image sharpening and image multiplication, were implemented to measure the dynamic power dissipation savings. Besides, the quality metrics (PSNR, SSIM) of output images are also evaluated. To the best of our knowledge, this is the first work to implement the approximate multipliers and measure their dynamic power consumption on an FPGA board. Finally, our proposed multipliers are suitable for high-performance and low-power error-resilient applications.
REFERENCES
[1] Y. Wang, J. Lin, and Z. Wang, “An energy-efficient architecture for binary weight convolutional neural networks,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 2, pp. 280–293, Feb. 2018.
[2] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, “Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010.
[3] S. Narayanamoorthy, H. A. Moghaddam, Z. Liu, T. Park, and N. S. Kim, “Energy-efficient approximate multiplication for digital signal processing and classification applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 6, pp. 1180–1184, Jun. 2015.
[4] C. K. Jha, S. N. Ved, I. Anand, and J. Mekie, “Energy and error analysis framework for approximate computing in mobile applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 2, pp. 385–389, Feb. 2020, doi: 10.1109/tcsii.2019.2910137.
[5] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, “Design and analysis of approximate compressors for multiplication,” IEEE Trans. Comput., vol. 64, no. 4, pp. 984–994, Apr. 2015.
[6] Y. Liu, T. Zhang, and K. K. Parhi, “Computation error analysis in digital signal processing systems with overscaled supply voltage,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 4, pp. 517–526, Apr. 2010.
[7] D. Jeon, M. Seok, Z. Zhang, D. Blaauw, and D. Sylvester, “Design methodology for voltage-overscaled ultra-low-power systems,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 12, pp. 952–956, Dec. 2012.
[8] R. Liu and K. K. Parhi, “Power reduction in frequency-selective FIR filters under voltage overscaling,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 1, no. 3, pp. 343–356, Sep. 2011.
[9] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, “EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods,” in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), Lausanne, Switzerland, Mar. 2017, pp. 258–261.
[10] A. Dalloo, A. Najafi, and A. Garcia-Ortiz, “Systematic design of an approximate adder: The optimized lower part constant-OR adder,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 8, pp. 1595–1599, Aug. 2018.
[11] S. Dutt, S. Dash, S. Nandi, and G. Trivedi, “Analysis, modeling and optimization of equal segment based approximate adders,” IEEE Trans. Comput., vol. 68, no. 3, pp. 314–330, Mar. 2019.
[12] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, “RAP-CLA: A reconfigurable approximate carry look-ahead adder,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 8, pp. 1089–1093, Aug. 2018.
[13] S. Xu and B. C. Schafer, “Exposing approximate computing optimizations at different levels: From behavioral to gate-level,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 11, pp. 3077–3088, Nov. 2017.
[14] M. Ha and S. Lee, “Multipliers with approximate 4-2 compressors and error recovery modules,” IEEE Embedded Syst. Lett., vol. 10, no. 1, pp. 6–9, Mar. 2018.
[15] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, “Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 4, pp. 850–862, Apr. 2010.
[16] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, “Design-efficient approximate multiplication circuits through partial product perforation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 10, pp. 3105–3117, Oct. 2016.
[17] T. Yang, T. Ukezono, and T. Sato, “Low-power and high-speed approximate multiplier design with a tree compressor,” in Proc. 35th Int. Conf. Comput. Design, Boston, MA, USA, Nov. 2017, pp. 89–96.
[18] I. Qiqieh, R. Shafik, G. Tarawneh, D. Sokolov, S. Das, and A. Yakovlev, “Significance-driven logic compression for energy-efficient multiplier design,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 417–430, Sep. 2018.
[19] S. Venkatachalam and S.-B. Ko, “Design of power and area efficient approximate multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 5, pp. 1782–1786, May 2017.
[20] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, “Approximate multipliers based on new approximate compressors,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.
[21] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, “Dual-quality 4:2 compressors for utilizing in dynamic accuracy configurable multipliers,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 4, pp. 1352–1361, Apr. 2017.
[22] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, “Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA,” in Proc. 28th Int. Conf. Field-Program. Logic Appl., Dublin, Ireland, Aug. 2018, pp. 163–169.
[23] S. Ullah, “Area-optimized low-latency approximate multipliers for FPGA-based hardware accelerators,” in Proc. 55th Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2018, pp. 1–6, doi: 10.1109/DAC.2018.8465781.
[24] S. Ullah, S. S. Murthy, and A. Kumar, “SMApproxLib: Library of FPGA-based approximate multipliers,” in Proc. 55th Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2018, pp. 1–6, doi: 10.1109/DAC.2018.8465845.
[25] N. V. Toan and J. G. Lee, “Energy-area-efficient approximate multipliers for error-tolerant applications on FPGAs,” in Proc. 32nd IEEE Int. Syst.-Chip Conf., Singapore, Sep. 2019, pp. 336–341.
[26] J. Liang, J. Han, and F. Lombardi, “New metrics for the reliability of approximate and probabilistic adders,” IEEE Trans. Comput., vol. 62, no. 9, pp. 1760–1771, Sep. 2013.
[27] Z. Yang, J. Han, and F. Lombardi, “Approximate compressors for error-resilient multiplier design,” in Proc. IEEE Int. Symp. Defect Fault Tolerance VLSI Nanotechnol. Syst., Amherst, MA, USA, Oct. 2015, pp. 183–186.
[28] Xilinx Inc. (Dec. 2009). Spartan-6 Libraries Guide for HDL Designs (Ver.11.4). [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx11/spartan6_hdl.pdf
[29] C. Liu, J. Han, and F. Lombardi, “A low-power, high-performance approximate multiplier with configurable partial error recovery,” in Proc. Design, Automat. Test Eur. Conf. Exhib. (DATE), Dresden, Germany, Mar. 2014, pp. 1–4, doi: 10.7873/DATE.2014.108.
[30] M. S. Ansari, H. Jiang, B. F. Cockburn, and J. Han, “Low-power approximate multipliers using encoded partial products and approximate compressors,” IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 3, pp. 404–416, Sep. 2018.
[31] C. R. Baugh and B. A. Wooley, “A two’s complement parallel array multiplication algorithm,” IEEE Trans. Comput., vol. C-22, no. 12, pp. 1045–1047, Dec. 1973.
[32] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han, and F. Lombardi, “Design of approximate radix-4 booth multipliers for error-tolerant computing,” IEEE Trans. Comput., vol. 66, no. 8, pp. 1435–1441, Aug. 2017.
[33] W. Liu, T. Cao, P. Yin, Y. Zhu, C. Wang, E. E. Swartzlander, and F. Lombardi, “Design and analysis of approximate redundant binary multipliers,” IEEE Trans. Comput., vol. 68, no. 6, pp. 804–819, Jun. 2019.
[34] R. Nelson. (Mar. 2019). Pseudo Random Number Generator With Linear Feedback Shift Registers (Verilog). [Online]. Available: https://www.digikey.com/eewiki/pages/viewpage.action?pageId=16351401
[35] D. Kim, J. Kung, and S. Mukhopadhyay, “A power-aware digital multilayer perceptron accelerator with on-chip training based on approximate computing,” IEEE Trans. Emerg. Topics Comput., vol. 5, no. 2, pp. 164–178, Apr. 2017.
[36] R. Jain, R. Kasturi, and B. G. Schunck, “Image filtering,” in Machine Vision, 1st ed. New York, NY, USA: McGraw-Hill, 1995, ch. 4, pp. 112–139.
NGUYEN VAN TOAN (Member, IEEE) received the B.S. degree in electronics and telecommunications engineering from the Da Nang University of Technology, Viet Nam, in 2007, and the M.S. degree from the HCM University of Science, Ho Chi Minh City, Viet Nam, in 2014, and the Ph.D. degree in computer engineering from Hallym University, South Korea, in 2018. He is currently working as a Postdoctoral Researcher with Hallym University. His research interests include FPGA-based system designs, low EMI circuit designs, and approximate computing.
JEONG-GUN LEE (Member, IEEE) received the B.S. degree in computer engineering from Hallym University, in 1996, and the M.S. and Ph.D. degrees from the Gwangju Institute of Science and Technology (GIST), South Korea, in 1998 and 2005, respectively. He is currently a Full Professor with the Department of Computer Engineering, Hallym University. Prior to joining the faculty of Hallym University, in 2008, he was a visiting Postdoctoral Researcher with the Computer Laboratory, University of Cambridge, and a Research Professor with GIST. In 2014, he was a Visiting Scholar with the Computer Laboratory, University of Cambridge. His research interests focus on low EMI asynchronous circuit designs, FPGA-based reconfigurable system designs, energy efficient heterogeneous computing, and GPU-based parallel computing.