An Energy-Efficient GeMM-Based Convolution Accelerator With On-the-Fly Im2col
manager extracts the partially accumulated results from the array and,
if needed, inserts preload data used to initialize the partial sums.
TABLE I
CONVOLUTION ACCELERATOR SPECIFICATIONS (POSTLAYOUT)
is defined as the word offset, and it indicates to each feed lane the
location of its first interest element.
For kernels larger than 1 × 1, each feed lane generally needs to take several elements from the data bus in a single cycle. These
elements need not be contiguous to the first interest element, since the
accelerator supports dilated convolutions. To identify which positions
must be selected after the first one (which is pointed to by the
word offset), we define the Kernel Pattern, a configurable bit vector
that defines the location of interest elements in contiguous memory
positions, based on the convolution kernel. With this signal, we fully describe the horizontal kernel size and dilation rate. Fig. 3 shows an example for a 3 × 3 kernel with a dilation rate of 2.
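To make the encoding concrete, the Python sketch below builds such a kernel pattern for a horizontal kernel size Kx and dilation rate d, reproducing the 3 × 3, d = 2 case of Fig. 3. The bit ordering and the helper name are illustrative assumptions, not the exact hardware encoding.

```python
def kernel_pattern(kx: int, d: int) -> int:
    """Bit vector marking the interest elements of one kernel row.

    Bit i is set if the element i positions after the first interest
    element belongs to the kernel row (illustrative bit ordering).
    """
    pattern = 0
    for k in range(kx):          # one interest element per kernel column
        pattern |= 1 << (k * d)  # columns sit d elements apart (dilation)
    return pattern

# 3 x 3 kernel with a dilation rate of 2: interest elements at offsets 0, 2, 4
print(f"{kernel_pattern(3, 2):05b}")  # -> 10101
```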
By right-shifting the kernel pattern by the word offset, we align it with the correct location of the interest elements on the data bus, as depicted in the lower part of Fig. 3. Once aligned, each bit of the shifted pattern points to the data elements that must be taken. If several contiguous reads are needed to cover all elements of interest, the bus width (in data elements) is subtracted from the shift amount after every read. A negative shift amount in this case denotes a left shift. With this, the shifted kernel pattern is aligned with the interest elements in subsequent reads. When the readout of the current row of the interest region is complete, the shift amount is reset and the sequence is repeated. After all positions of the region of interest have been covered, we move to a new context.
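The Python sketch below mimics this alignment sequence for the Fig. 3 example (3 × 3 kernel, d = 2), assuming a hypothetical 8-element bus and a word offset of 5; the list-based pattern and the printed selections are illustrative only, since the hardware operates directly on bit vectors.

```python
def aligned_mask(pattern, shift, bus_width):
    """Selection mask for one bus word.

    pattern   : list of 0/1; pattern[i] marks the element i positions after
                the first interest element (illustrative ordering, not RTL).
    shift     : current shift amount; positive on the first read (the word
                offset), negative on later reads (denotes a left shift).
    bus_width : bus width in data elements.
    """
    return [1 if 0 <= j - shift < len(pattern) and pattern[j - shift] else 0
            for j in range(bus_width)]

# 3 x 3 kernel, dilation 2 -> pattern 1 0 1 0 1; hypothetical 8-element bus,
# first interest element at word offset 5.
pattern, bus_width, shift = [1, 0, 1, 0, 1], 8, 5
for read in range(2):                              # two reads cover the kernel row
    mask = aligned_mask(pattern, shift, bus_width)
    picks = [j for j, bit in enumerate(mask) if bit]
    print(f"read {read}: take bus elements {picks}")
    shift -= bus_width                             # negative value = left shift
# read 0: take bus elements [5, 7]
# read 1: take bus elements [1]
```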
A simple multiplexer-based selection circuit takes the data elements pointed to by the shifted kernel pattern and stores them in a set of intermediate registers. When all registers are full, their contents are pushed to a first-in, first-out (FIFO) memory that holds the values until they are consumed by the array. The number of intermediate registers (M) determines how many values can be taken in parallel on each feed lane. If there are more interest elements on the bus than free registers, a stall signal is raised by the feed lane, indicating that the index counting and data readout pipeline must stop until all lanes have been able to gather the required data. Finally, the FIFO output passes through the feed registers, a stage of concatenated registers that enforces the staggered latency between array rows: for example, if the first array row starts getting data at cycle 0, the second one will start at cycle 1, and so on.
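The behavior of a single feed lane over one cycle can be summarized as follows; this is a simplified Python model that assumes the lane takes nothing during a stalled cycle, which may differ from the actual RTL.

```python
from collections import deque

def feed_lane_cycle(bus_word, mask, regs, fifo, m=3):
    """One feed-lane cycle (behavioral sketch, not the actual RTL).

    bus_word : data elements read from the bus this cycle
    mask     : aligned kernel-pattern mask (1 marks an interest element)
    regs     : intermediate registers currently holding values (list)
    fifo     : FIFO toward the systolic array (deque)
    m        : number of intermediate registers per lane (M = 3 here)
    Returns True when the lane must raise its stall signal.
    """
    selected = [v for v, bit in zip(bus_word, mask) if bit]
    if len(selected) > m - len(regs):
        # More interest elements on the bus than free registers:
        # stall the index counting and readout pipeline (simplified:
        # the lane takes nothing during a stalled cycle).
        return True
    regs.extend(selected)
    if len(regs) == m:                 # all registers full: push to the FIFO
        fifo.append(tuple(regs))
        regs.clear()
    return False

# One cycle with the mask from the previous sketch (read 0)
fifo, regs = deque(), []
stall = feed_lane_cycle(list(range(10, 18)), [0, 0, 0, 0, 0, 1, 0, 1], regs, fifo)
print(stall, regs, list(fifo))         # False [15, 17] []
```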
III. IMPLEMENTATION AND RESULTS

We base our evaluation on an ASIC implementation of the accelerator described in Section II, using a 16 × 16 systolic array. Three double-buffered 32-kB single-port SRAMs are used to store the ifmaps, weights, and partial sums, each with a 256-bit data bus. We implement the PE arithmetic units in 16-bit floating point to compute state-of-the-art CNNs with high accuracy and without retraining. The accumulation of partial sums is also performed using FP16 precision. Note that the accelerator architecture is agnostic to the arithmetic representation and could also be implemented using integer operators. To implement the FP16 MAC operator, we used a simplified version of the FPnew floating-point unit from the open-source parallel ultra low power (PULP) platform [16].

The implemented data feeders use three intermediate registers per lane (M = 3) to optimize the performance of 3 × 3 kernels, which are very common. The kernel pattern width is limited to 64 bits, so the data feeders can support any kernel shape with a horizontal dimension that fits in this vector. For convolutions without dilation (d = 1), this means that the maximum kernel shape is (64, Ky), with the vertical dimension Ky limited only by the total capacity of the SRAMs. For dilated 3 × 3 convolutions, the maximum dilation coefficient is d = 31.

The architecture was synthesized in 22 nm using Cadence Genus, and the physical design was performed with Cadence Innovus. Table I summarizes the most important specifications of the implementation. The throughput and power metrics are obtained from postlayout simulations of the design using this setup. We benchmark the performance of our design with four CNNs of different sizes: the atrous region proposal network (ARPN) used in [17], ResNet-50 [12], Visual Geometry Group (VGG)-16 [18], and YOLOv3 [19], using input resolutions of 64 × 64, 256 × 256, 224 × 224, and 512 × 512, respectively.

Fig. 5. Accelerator floorplan (left). Area and power breakdown of the accelerator (right). Power is estimated with dense random inputs.

Fig. 5 shows the floorplan of the physical design of the accelerator, as well as a breakdown of the area and power consumption. The total area of the system is 0.897 mm², which we fit in a 1.09-mm² floorplan. The systolic array PEs make up about 83% of the logic area excluding the SRAM macros, and the area overhead of the data feeder is less than 7% of the logic area (4% of the total).

The reserve registers we include in the PEs to extract the partial sums have a total area and power overhead of 0.76% and 0.90%, respectively. The zero-gating strategy we implement in the array PEs allows us to save power when the input values are zero, decreasing the total power by about 6% when both weights and feature maps contain 10% zeros. Taking the worst-case scenario as a baseline (all values are random and the tensors are completely dense), the system consumes 258 mW during computation, with the PEs accounting for 75% of the power and the data feeder for about 7% (see Fig. 5).
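As a back-of-the-envelope check, the breakdown quoted above implies the following absolute figures for the data feeder and the PE array (derived only from the percentages and totals reported here):

```python
total_area_mm2 = 0.897      # logic + SRAM macros
total_power_mw = 258        # worst case: dense random inputs

feeder_area_mm2 = 0.04 * total_area_mm2   # data feeder, ~4% of total area
feeder_power_mw = 0.07 * total_power_mw   # data feeder, ~7% of total power
pe_power_mw = 0.75 * total_power_mw       # PE array, ~75% of total power

print(f"data feeder: ~{feeder_area_mm2:.3f} mm^2, ~{feeder_power_mw:.0f} mW")
print(f"PE array   : ~{pe_power_mw:.0f} mW")
# data feeder: ~0.036 mm^2, ~18 mW
# PE array   : ~194 mW
```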
By converting the tensors to data streams on the fly, the data feeder decreases the overall memory traffic and bandwidth requirements of the system with respect to software-based explicit im2col, which would require fetching the lowered matrices from the off-chip memory. We find that the number of bytes per operation decreases by more than 50% for most of our benchmarks (see Fig. 6). This results in a significant reduction of the total data exchanged with the DRAM, as well as of the average bandwidth required to feed the accelerator without stalls (see Table II).
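To see where the savings come from, the sketch below computes the duplication factor of the lowered input matrix that explicit im2col would have to materialize; the layer shape is a generic example, and this is not the paper's exact traffic model (which also counts weights and outputs).

```python
def lowered_ifmap_elems(h, w, c, kx, ky, stride=1):
    """Input elements after explicit im2col lowering (no padding)."""
    h_out = (h - ky) // stride + 1
    w_out = (w - kx) // stride + 1
    return h_out * w_out * kx * ky * c   # every output pixel stores its own patch

h, w, c, kx, ky = 56, 56, 64, 3, 3       # generic mid-network 3 x 3 layer
original = h * w * c
lowered = lowered_ifmap_elems(h, w, c, kx, ky)
print(f"lowered ifmap is {lowered / original:.1f}x larger than the original")  # ~8.4x
```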
Fewer memory transfers also enable energy savings in the DRAM, which typically accounts for a large portion of the system-level power consumption. Using the DRAMPower tool [20], we estimate that off-chip memory transactions using an LPDDR3 memory have an energy cost of about 120 pJ/byte. Taking YOLOv3 as an example (see Table II), using our data feeder decreases the total energy consumed by the DRAM during inference by 236 mJ. In comparison, the energy overhead of the data feeder is 6.9 mJ, 34× smaller than the system-level energy reduction it enables.
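These figures are mutually consistent; a two-line estimate from the numbers quoted above recovers both the implied volume of avoided DRAM traffic and the 34× ratio:

```python
energy_per_byte_pj = 120        # LPDDR3 estimate from DRAMPower [20]
dram_energy_saved_mj = 236      # YOLOv3 inference (Table II)
feeder_energy_mj = 6.9          # energy overhead of the data feeder

bytes_saved = dram_energy_saved_mj * 1e-3 / (energy_per_byte_pj * 1e-12)
print(f"~{bytes_saved / 1e9:.1f} GB of DRAM traffic avoided per inference")   # ~2.0 GB
print(f"DRAM saving is {dram_energy_saved_mj / feeder_energy_mj:.0f}x the "
      f"feeder's energy cost")                                                # 34x
```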
IV. CONCLUSION

We have presented an energy-efficient convolution accelerator for CNNs built upon a GeMM-based systolic array, coupled with our novel data feeder architecture, which enables on-chip, on-the-fly convolution lowering and supports dilated (or atrous) convolutions. Our accelerator reaches a peak throughput of 284 GFLOP/s and an energy efficiency of 1.10 TFLOP/s/W using FP16 arithmetic and a 16 × 16 array. By formulating lowering as a selection task and leveraging the regularity of the 2-D convolution, our data feeder design achieves better area and energy efficiency than any other proposed im2col solution, with area and power overheads of 4% and 7%, respectively. We have demonstrated that an ASIC implementation of our design can reduce memory transactions by more than a factor of two when accelerating state-of-the-art CNNs, compared to software-based im2col, significantly improving the energy efficiency of the overall system. Additionally, the reduction in memory transactions relaxes the off-chip memory bandwidth requirements of the system, boosting the overall performance and energy efficiency of the accelerator.

REFERENCES

[1] M. Capra, B. Bussolino, A. Marchisio, G. Masera, M. Martina, and M. Shafique, "Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead," IEEE Access, vol. 8, pp. 225134–225180, 2020.
[2] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. Int. Symp. Comput. Archit., Jun. 2017, pp. 1–12.
[3] Z. Du et al., "ShiDianNao: Shifting vision processing closer to the sensor," in Proc. Int. Symp. Comput. Archit., Jun. 2015, pp. 92–104.
[4] R. Xu, S. Ma, Y. Wang, and Y. Guo, "HeSA: Heterogeneous systolic array architecture for compact CNNs hardware accelerators," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Feb. 2021, pp. 657–662.
[5] Y. Chen et al., "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. 43rd Int. Symp. Comput. Archit. (ISCA), Aug. 2016, pp. 367–379.
[6] Y. Chen, T. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[7] S. Venkataramani et al., "RaPiD: AI accelerator for ultra-low precision training and inference," in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 153–166.
[8] S. Chetlur et al., "cuDNN: Efficient primitives for deep learning," 2014, arXiv:1410.0759.
[9] H. Genc et al., "Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration," in Proc. 58th ACM/IEEE Design Automat. Conf. (DAC), Dec. 2021, pp. 769–774.
[10] M. Soltaniyeh et al., "SPOTS: An accelerator for sparse CNNs leveraging general matrix-matrix multiplication," 2021, arXiv:2107.13386.
[11] W. Liu, J. Lin, and Z. Wang, "USCA: A unified systolic convolution array architecture for accelerating sparse neural network," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2019, pp. 1–5.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[13] A. Vaswani et al., "Attention is all you need," 2017, arXiv:1706.03762.
[14] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, "Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks," 2021, arXiv:2102.00554.
[15] L. Ye, J. Ye, M. Yanagisawa, and Y. Shi, "Power-efficient deep convolutional neural network design through zero-gating PEs and partial-sum reuse centric dataflow," IEEE Access, vol. 9, pp. 17411–17420, 2021.
[16] S. Mach, F. Schuiki, F. Zaruba, and L. Benini, "FPnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 4, pp. 774–787, Apr. 2021.
[17] T. Guan and H. Zhu, "Atrous faster R-CNN for small scale object detection," in Proc. 2nd Int. Conf. Multimedia Image Process. (ICMIP), Mar. 2017, pp. 16–21.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), Sep. 2015, pp. 1–14.
[19] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[20] K. Chandrasekar. DRAMPower: Open-Source DRAM Power & Energy Estimation Tool. Accessed: Apr. 25, 2023. [Online]. Available: http://www.drampower.info
[21] A. Stillmaker and B. Baas, "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm," Integration, vol. 58, pp. 74–81, Jun. 2017.