

An Energy-Efficient GeMM-Based Convolution Accelerator With On-the-Fly im2col

Jordi Fornt, Pau Fontova-Musté, Martí Caro, Jaume Abella, Francesc Moll, Josep Altet, and Christoph Studer
Abstract— Systolic array architectures have recently emerged as successful accelerators for deep convolutional neural network (CNN) inference. Such architectures can be used to efficiently execute general matrix–matrix multiplications (GeMMs), but computing convolutions with this primitive involves transforming the 3-D input tensor into an equivalent matrix, which can inflate the input data and increase the OFF-chip memory traffic that is critical for energy efficiency. In this work, we propose a GeMM-based systolic array accelerator that uses a novel data feeder architecture to perform ON-chip, on-the-fly convolution lowering (also known as im2col), supporting arbitrary tensor and kernel sizes as well as strided and dilated (or atrous) convolutions. By using our data feeder, we reduce memory transactions and required bandwidth on state-of-the-art CNNs by a factor of two, while adding an area and power overhead of only 4% and 7%, respectively. An application-specific integrated circuit (ASIC) implementation of our accelerator in 22-nm technology fits in less than 1.1 mm² and reaches an energy efficiency of 1.10 TFLOP/sW with 16-bit floating-point arithmetic.

Index Terms— Convolution lowering, convolutional neural network (CNN) accelerators, energy efficiency, im2col, systolic arrays.

Manuscript received 28 February 2023; revised 22 May 2023; accepted 9 June 2023. Date of publication 27 June 2023; date of current version 24 October 2023. This work was supported in part by MCIN/AEI/10.13039/501100011033 under Project PCI2020-134984-2, in part by the European Union NextGenerationEU/PRTR, in part by the European Union's Horizon Europe Program under the Key Digital Technologies (KDT) Joint Undertaking (JU) under Grant 101097224, and in part by the Spanish Ministry of Science and Innovation through MCIN/AEI/10.13039/501100011033 under Grant PID2019-107255GB-C21. (Corresponding author: Jordi Fornt.)
Jordi Fornt was with the Eidgenössische Technische Hochschule (ETH) Zürich, 8092 Zürich, Switzerland. He is now with the Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC), 08034 Barcelona, Spain (e-mail: [email protected]).
Pau Fontova-Musté and Jaume Abella are with the BSC, 08034 Barcelona, Spain (e-mail: [email protected]; [email protected]).
Martí Caro and Francesc Moll are with the BSC, UPC, 08034 Barcelona, Spain (e-mail: [email protected]; [email protected]).
Josep Altet is with the High-Performance Integrated Circuits and Systems (HIPICS) Group, UPC, 08034 Barcelona, Spain (e-mail: [email protected]).
Christoph Studer is with the ETH Zürich, 8092 Zürich, Switzerland (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2023.3286122.
Digital Object Identifier 10.1109/TVLSI.2023.3286122

I. INTRODUCTION

Deep convolutional neural network (CNN) models achieve high inference accuracy at the cost of computational complexity, so the use of hardware accelerators is key for their deployment in real-world applications. Powerful computing engines, such as graphics processing units (GPUs), have been extensively used for CNN training and inference [1], but for many applications they prove to be too power-hungry.

Systolic array-based accelerators, specialized toward deep learning, have been shown to provide higher energy efficiency than GPUs, and many different systolic architectures have been proposed in [2], [3], [4], [5], [6], and [7]. These accelerators often focus on efficiently executing general matrix–matrix multiplications (GeMMs) [2], [3], [4]. Computing convolution operations in this setting requires transforming the input 3-D tensor into an equivalent matrix, using a technique known as convolution lowering [8] (a.k.a. the im2col algorithm).

Lowering the convolution helps make computations efficient, but the resulting matrix can be much larger than its tensor equivalent, greatly increasing the amount of data to be transferred to the array. For this reason, many accelerators manage the lowering step internally by including im2col units; for example, the systolic array generator Gemmini [9] contains an optional im2col module to accelerate the lowering task. This relieves the CPU from the burden of managing the im2col, but it still suffers from the overhead in memory accesses, as the inflated matrices are fetched from the L2 cache or OFF-chip memory. SPOTS [10] implements a GeMM-based systolic accelerator with an im2col unit for on-the-fly convolution lowering and sparsity support. It avoids the data overhead introduced in the lowering step by first moving the tensors into its internal static random access memory (SRAM) and then managing the im2col task internally. However, it requires very large intermediate buffers (13% of its total area) that compromise the area efficiency of the system, and its im2col unit cannot support dilated convolutions.
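To make the data inflation caused by lowering concrete, the following minimal NumPy sketch (our own illustration: the layer shape, the valid-padding assumption, and the patch layout are not taken from the paper) builds the im2col matrix of a small layer and reports the size blow-up, which approaches the kernel area Kh · Kw:

```python
import numpy as np

def im2col(ifmap, kh, kw, stride=1, dilation=1):
    """Lower a (C, H, W) tensor into a (C*kh*kw, out_h*out_w) matrix (valid padding)."""
    c, h, w = ifmap.shape
    eff_kh = (kh - 1) * dilation + 1          # effective (dilated) kernel extent
    eff_kw = (kw - 1) * dilation + 1
    out_h = (h - eff_kh) // stride + 1
    out_w = (w - eff_kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=ifmap.dtype)
    for oy in range(out_h):
        for ox in range(out_w):
            patch = ifmap[:,
                          oy * stride: oy * stride + eff_kh: dilation,
                          ox * stride: ox * stride + eff_kw: dilation]
            cols[:, oy * out_w + ox] = patch.ravel()
    return cols

x = np.random.rand(64, 56, 56).astype(np.float32)   # (C, H, W) input tensor
lowered = im2col(x, kh=3, kw=3)
print(f"{x.nbytes / 2**20:.2f} MiB tensor -> {lowered.nbytes / 2**20:.2f} MiB matrix")
```

Fetching the tensor in its original shape and lowering it ON-chip, as proposed in this work, avoids ever transferring the larger matrix over the memory interface.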
To the best of our knowledge, the unified systolic convolution array (USCA) [11] is the only accelerator that currently provides hardware support for on-the-fly convolution lowering as well as dilated convolutions, but it focuses on accelerating convolutions on sparse tensors. While some classic networks like ResNet [12] can be sparsified to large degrees without much accuracy degradation, modern architectures like transformer networks [13] tend to struggle to achieve high sparsity [14]. Hence, basing the energy efficiency strategy of an accelerator on sparsity puts it at risk of obsolescence unless this trend is reversed. Furthermore, USCA only reports synthesis results, lacking a full physical design, which is critical for accurate power estimation.

In this work, we develop a CNN accelerator that efficiently computes convolutions with dense tensors, based on a systolic array architecture and a novel data feeder that performs on-the-fly im2col transformation and supports dilated convolutions. Furthermore, we perform the full synthesis and physical design process for an application-specific integrated circuit (ASIC) implementation of the accelerator, providing accurate area and power results.

II. ARCHITECTURE

The architecture of the proposed accelerator, depicted in Fig. 1, is built around an output stationary (OS) systolic array. Three independent SRAM memories with double buffering hold the tensors involved in the convolution (input feature maps, weights, and outputs/partial sums). Our novel Data Feeder module is used to lower the input tensors stored in the ifmap SRAM and stream the resulting matrices to the array rows. Since we fetch the data in tensor shape from the dynamic random access memory (DRAM), we avoid the inflation of memory transactions caused by the im2col step. During the convolution lowering process, the data feeder generates the data streams according to the dilation rate of the convolution. In parallel, the weight fetcher block streams the weight data, and the partial sums manager extracts the partially accumulated results from the array and, if needed, inserts preload data used to initialize the partial sums.

Fig. 1. Proposed convolution accelerator architecture using a 3 × 3 systolic array and our data feeder.

Fig. 2. PE design. Zero gating highlighted in purple, and partial sum management in green.

A. Systolic Array GeMM Engine


We implement an OS array as our GeMM engine, in which each
processing element (PE) is assigned to a pixel of the output tensor.
The horizontal spatial dimension of the tensor is mapped to the array
rows, and the output channels dimension to the columns. The systolic
array is composed of a mesh of PEs, the design of which is depicted
in Fig. 2. We kept the PEs as simple as possible to maximize area
efficiency. A zero detection and gating circuit, inspired by works
like [7], [15], is included before the multiplier to avoid unnecessary switching when any input value is zero. Even though our work focuses on dense CNNs rather than sparse networks, this technique is inexpensive and can save power even at low sparsity levels (see Section III).

The multiply-accumulate (MAC) operation results are accumulated in an internal register (the accumulator). A secondary register, called the reserve register, is used to hold partial sum data during the extraction of results or the insertion of preload data. When the execution of a computation context finishes, the accumulator and reserve register values are swapped, and the first MAC of the next context is performed. This solution has two benefits: first, it eliminates the need to access all partial sums from outside of the array concurrently (which does not scale well), since the reserve registers are connected together and the values are shifted in and out. Second, it allows concatenating computation contexts without stalling the PEs, improving the overall performance and utilization, while adding only a small power and area overhead (see Section III).
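The two PE mechanisms described above can be condensed into a small behavioral model (a sketch with invented names and plain Python floats instead of FP16; it is not the actual RTL): the multiplier is skipped whenever an operand is zero, and swapping the accumulator with the reserve register starts the next context while the finished one is drained.

```python
class ProcessingElement:
    """Behavioral model of one output-stationary PE (illustrative only)."""

    def __init__(self):
        self.accumulator = 0.0   # partial sum of the context being computed
        self.reserve = 0.0       # holds the previous context while it is shifted out

    def mac(self, a, w):
        # Zero detection and gating: skip the multiply when either operand is zero,
        # which in hardware avoids toggling the multiplier inputs.
        if a != 0.0 and w != 0.0:
            self.accumulator += a * w

    def swap_context(self, preload=0.0):
        # Move the finished context into the reserve register (drained through the
        # chained reserve registers) and start the next one, optionally preloaded.
        self.reserve, self.accumulator = self.accumulator, preload
        return self.reserve


pe = ProcessingElement()
for a, w in [(1.5, 2.0), (0.0, 3.0), (0.5, -4.0)]:   # one toy context
    pe.mac(a, w)
print(pe.swap_context())   # 1.0, extracted while the next context already runs
```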
B. Data Feeder for On-the-Fly im2col

Fig. 3. Data feeder design and an example of feature map selection from an eight-element memory bus.

Our novel data feeder (see Fig. 3) is composed of the index counters, a set of configuration registers, and an array of feed lanes, replicated for all array rows as shown in Fig. 1. The main idea of our feeder architecture is to read all SRAM addresses of a context in a single pass and distribute the data values to the PEs that need them. The ifmap tensor stored in the corresponding SRAM has its dimensions flattened in the order (C, Y, X), with the x-axis contiguous in memory.

To perform the readout, we define the interest region of a context as the union of the convolution kernels centered on the output pixels currently being computed, as depicted in Fig. 4 (left part). All memory locations corresponding to the interest region are read sequentially once, and each feed lane selects the elements it needs from every data word coming out of the SRAM. Once all positions of the current region of interest have been read, the region is relocated to a new context. The readout sequence is controlled by the index counters module, which comprises five independent counters with configurable step size and limit: three are used to traverse the interest region (through the x, y, and channel axes), and the other two move the whole region to a new context (only for the x and y axes, since the input channel is fully reduced on each context). The sum of all the counter outputs generates a pointer that is used to traverse the ifmap tensor.

The index pointer initially points to the first interest element in the current row of the region, as illustrated in Fig. 4. Subsequent reads on the same tensor row are performed by incrementing the pointer by the bus width (in data elements). When the index reaches the end of the current interest region row, it moves to the first position of the next row that must be read. The pointer value is split to generate the SRAM read address (higher bits) and the global word offset (lower bits), which indicates the location of the first element of interest within the SRAM data word.

The global word offset fully describes how the first feed lane (first array row) accesses its first element of interest inside the SRAM data bus. For all remaining lanes, we obtain this location by adding a local offset to the global word offset, representing the difference in kernel locations (i.e., output pixel locations) between rows of the array. Contiguous lanes represent contiguous output values, so a convolution is configured by setting the local offsets to [s, 2s, ..., (Y − 1)s], with s being the stride and Y the number of array rows. The result of this addition is defined as the word offset, and it indicates to each feed lane the location of its first interest element.
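The address generation just described can be paraphrased in a few lines (a simplified sketch: the eight-element bus, the counter values, and the helper names are illustrative assumptions, while the real index counters module implements this in hardware):

```python
BUS_ELEMS = 8   # data elements per SRAM word, as in the example of Fig. 3

def split_pointer(pointer):
    """Split the flat element pointer into an SRAM address and a word offset."""
    sram_address = pointer // BUS_ELEMS        # higher bits
    global_word_offset = pointer % BUS_ELEMS   # lower bits
    return sram_address, global_word_offset

def lane_word_offsets(global_word_offset, stride, num_rows):
    """Per-lane word offsets: local offsets [0, s, 2s, ...] added to the global one."""
    return [global_word_offset + row * stride for row in range(num_rows)]

# The flat pointer is the sum of the five index counters: three traverse the
# interest region (x, y, channel) and two relocate it to the next context (x, y).
pointer = 5 + 2 + 0 + 16 + 0                         # arbitrary counter values
print(split_pointer(pointer))                        # (2, 7): SRAM word 2, offset 7
print(lane_word_offsets(7, stride=1, num_rows=4))    # [7, 8, 9, 10]
```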


Fig. 4. Example and illustration of the index counters sequence, assuming an eight-element memory bus.
For kernels larger than 1 × 1, each feed lane generally needs
to take several elements from the data bus on a single cycle. These
elements need not be contiguous to the first interest element, since the
accelerator supports dilated convolutions. To identify which positions
must be selected after the first one (which is pointed to by the
word offset), we define the Kernel Pattern, a configurable bit vector
that defines the location of interest elements in contiguous memory
positions, based on the convolution kernel. With this signal, we fully describe the horizontal kernel size and dilation rate. Fig. 3 shows an example for a 3 × 3 kernel with a dilation rate of 2.

By right-shifting the kernel pattern by the word offset, we align it with the correct location of the interest elements on the data bus, as depicted in the lower part of Fig. 3. Once aligned, each bit of the shifted pattern points to the data elements that must be taken. If several contiguous reads are needed to cover all elements of interest, the bus width (in data elements) is subtracted from the shift amount after every read; a negative shift amount in this case denotes a left shift. With this, the shifted kernel pattern stays aligned with the interest elements in subsequent reads. When the readout of the current row of the interest region is complete, the shift amount is reset and the sequence is repeated. After all positions of the region of interest have been covered, we move to a new context.
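The kernel pattern mechanism can be mimicked in software as follows (an illustrative sketch: the bit-ordering convention and function names are ours, so the alignment is expressed with a signed shift amount instead of the paper's left/right terminology):

```python
def kernel_pattern(kw, dilation):
    """Bit i is set if relative element i (from the first interest element) is needed."""
    pattern = 0
    for k in range(kw):
        pattern |= 1 << (k * dilation)
    return pattern

def selection_masks(pattern, word_offset, bus_elems=8):
    """Per-read selection masks over the data bus, aligning the pattern to the offset."""
    masks = []
    shift = word_offset                        # position of the first interest element
    span = pattern.bit_length() + word_offset  # total element positions to cover
    for _ in range((span + bus_elems - 1) // bus_elems):
        aligned = pattern << shift if shift >= 0 else pattern >> -shift
        masks.append(aligned & ((1 << bus_elems) - 1))
        shift -= bus_elems                     # move on to the next data word
    return masks

# 3-tap kernel row with dilation 2 (pattern 0b10101), first element at bus offset 5.
for mask in selection_masks(kernel_pattern(3, 2), word_offset=5):
    print(format(mask, "08b"))   # bit i (from the right) selects bus element i
```

With an eight-element bus this prints 10100000 and 00000010: two elements are taken from the first data word and the remaining one from the next.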
A simple multiplexer-based selection circuit takes the data elements pointed to by the shifted kernel pattern and stores them in a set of intermediate registers. When all registers are full, their contents are pushed to a first-input, first-output (FIFO) memory that holds the values until they are consumed by the array. The number of intermediate registers (M) determines how many values can be taken in parallel on each feed lane. If there are more interest elements in the bus than free registers, the feed lane raises a stall signal, indicating that the index counting and data readout pipeline must stop until all lanes have been able to gather the required data. Finally, the FIFO output passes through the feed registers, a stage of concatenated registers that enforces the staggered latency between array rows: for example, if the first array row starts getting data at cycle 0, the second one starts at cycle 1, and so on.
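A rough model of one feed lane (illustrative only; the single-cycle push policy and the all-or-nothing stall check are our simplifications of the behavior described above):

```python
from collections import deque

class FeedLane:
    """Sketch of a feed lane: M intermediate registers feeding a FIFO."""

    def __init__(self, m=3):
        self.m = m                 # intermediate registers per lane (M = 3 here)
        self.registers = []
        self.fifo = deque()        # holds values until the array consumes them

    def absorb(self, selected_elements):
        """Latch the selected bus elements; return True if the pipeline must stall."""
        free = self.m - len(self.registers)
        if len(selected_elements) > free:
            return True            # stall: index counting and readout must pause
        self.registers.extend(selected_elements)
        if len(self.registers) == self.m:
            self.fifo.append(tuple(self.registers))   # push a full group downstream
            self.registers.clear()
        return False

lane = FeedLane()
print(lane.absorb([1.0, 2.0]))   # False: two of the three registers are now used
print(lane.absorb([3.0, 4.0]))   # True: only one register is free, so the lane stalls
```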
III. IMPLEMENTATION AND RESULTS

We base our evaluation on an ASIC implementation of the accelerator described in Section II, using a 16 × 16 systolic array. Three double-buffered 32-kB single-port SRAMs are used to store the ifmaps, weights, and partial sums, each with a 256-bit data bus. We implement the PE arithmetic units in 16-bit floating point (FP16) to compute state-of-the-art CNNs with high accuracy and without retraining. The accumulation of partial sums is also performed in FP16 precision. Note that the accelerator architecture is agnostic to the arithmetic representation and could also be implemented using integer operators. To implement the FP16 MAC operator, we used a simplified version of the FPnew floating-point unit from the open-source parallel ultra low power (PULP) platform [16].

The implemented data feeders use three intermediate registers per lane (M = 3) to optimize the performance of 3 × 3 kernels, which are very common. The kernel pattern width is limited to 64 bits, so the data feeders can support any kernel shape with a horizontal dimension that fits in this vector. For convolutions without dilation (d = 1), this means that the maximum kernel shape is (64, Ky), with the vertical dimension Ky limited only by the total capacity of the SRAMs. For dilated 3 × 3 convolutions, the maximum dilation coefficient is d = 31.
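These limits follow from the 64-bit kernel pattern: the dilated horizontal extent (Kx − 1) · d + 1 must fit in the pattern vector. A two-line check (our restatement of that constraint) reproduces the reported maximum:

```python
def max_dilation(kx, pattern_bits=64):
    """Largest dilation d such that (kx - 1) * d + 1 fits in the kernel pattern."""
    return (pattern_bits - 1) // (kx - 1)

print(max_dilation(3))   # 31, the maximum dilation reported for 3 x 3 kernels
```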
TABLE I
Convolution Accelerator Specifications (Postlayout)

The architecture was synthesized in 22 nm using Cadence Genus, and the physical design was performed with Cadence Innovus. Table I summarizes the most important specifications of the implementation. The throughput and power metrics are obtained from postlayout simulations of the design using this setup. We benchmark the performance of our design with four CNNs of different sizes: the atrous region proposal network (ARPN) used in [17], ResNet-50 [12], Visual Geometry Group (VGG)-16 [18], and YOLOv3 [19], using input resolutions of 64 × 64, 256 × 256, 224 × 224, and 512 × 512, respectively.

Fig. 5. Accelerator floorplan (left). Area and power breakdown of the accelerator (right). Power is estimated with dense random inputs.

Fig. 5 shows the floorplan of the physical design of the accelerator, as well as a breakdown of the area and power consumption. The total area of the system is 0.897 mm², which we fit in a 1.09-mm² floorplan. The systolic array PEs make up about 83% of the logic area excluding the SRAM macros, and the area overhead of the data feeder is less than 7% of the logic area (4% of the total).

The reserve registers we include in the PEs to extract the partial sums have a total area and power overhead of 0.76% and 0.90%, respectively. The zero gating strategy we implement in the array PEs allows us to save power when the input values are zero, decreasing the total power by about 6% when both the weights and feature maps contain 10% zeros. Taking the worst-case scenario as a baseline (all values are random and the tensors are completely dense), the system consumes 258 mW during computation, with the PEs accounting for 75% of the power and the data feeder for about 7% (see Fig. 5).
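As a quick consistency check on the headline figures (the clock frequency below is inferred from the reported peak throughput rather than read from Table I):

```python
PEAK_GFLOPS = 284.0    # peak throughput reported in Section IV
POWER_W = 0.258        # worst-case compute power from the postlayout results
NUM_PES = 16 * 16      # systolic array size
FLOPS_PER_MAC = 2      # one multiply plus one accumulate per PE and cycle

implied_clock_mhz = PEAK_GFLOPS * 1e9 / (NUM_PES * FLOPS_PER_MAC) / 1e6
efficiency_tflops_w = PEAK_GFLOPS / 1e3 / POWER_W

print(f"{implied_clock_mhz:.0f} MHz")          # ~555 MHz (inferred)
print(f"{efficiency_tflops_w:.2f} TFLOP/sW")   # 1.10, matching the reported efficiency
```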


By converting the tensors to data streams on-the-fly, the data feeder decreases the overall memory traffic and bandwidth requirements of the system with respect to software-based explicit im2col, which would require fetching the lowered matrices from the OFF-chip memory. We find that the number of bytes per operation decreases by more than 50% for most of our benchmarks (see Fig. 6). This results in a significant reduction of the total data exchanged with the DRAM memory, as well as of the average bandwidth required to feed the accelerator without stalls (see Table II).

TABLE II
Memory Transactions and Bandwidth Summary, Using Explicit Lowering (expl.) and Our Data Feeder (feed)

Fig. 6. Required bytes per operation during the execution of the benchmarks.

Fewer memory transfers also enable energy savings on the DRAM, which typically accounts for a large portion of the system-level power consumption. Using the DRAMPower tool [20], we estimate that OFF-chip memory transactions using an LPDDR3 memory have an energy cost of about 120 pJ/byte. Taking YOLOv3 as an example (see Table II), using our data feeder decreases the total energy consumed by the DRAM during inference by 236 mJ. In comparison, the energy overhead of the data feeder is 6.9 mJ, 34× smaller than the system-level energy reduction it enables.
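The arithmetic behind these numbers (the traffic estimate in the first line is derived from the paper's own figures and is not reported explicitly):

```python
ENERGY_PER_BYTE_PJ = 120.0    # LPDDR3 access cost estimated with DRAMPower [20]
DRAM_SAVINGS_MJ = 236.0       # DRAM energy saved on YOLOv3 with the data feeder
FEEDER_OVERHEAD_MJ = 6.9      # energy spent by the data feeder itself

saved_bytes = DRAM_SAVINGS_MJ * 1e-3 / (ENERGY_PER_BYTE_PJ * 1e-12)
print(f"{saved_bytes / 2**30:.1f} GiB of DRAM traffic avoided")    # ~1.8 GiB (derived)
print(f"{DRAM_SAVINGS_MJ / FEEDER_OVERHEAD_MJ:.0f}x net benefit")  # ~34x, as reported
```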
TABLE III
Compute Time and Proportion of DRAM Waiting Time Assuming 6.4 GB/s of DRAM Bandwidth

To assess the accelerator performance for our benchmark CNNs, we assume that a commercial 32-bit-wide LPDDR3 DRAM memory operating at 800 MHz is used, with a maximum bandwidth of 6.4 GB/s. As summarized in Table III, the use of our data feeder also improves the throughput of the accelerator under these conditions. The performance difference comes mainly from the twofold decrease in the amount of data transactions, which helps avoid accelerator stalls. In YOLOv3, computation time is reduced by almost 20% when using our data feeder, even if we neglect the overhead of the CPU lowering the convolution in the im2col case.
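The assumed peak bandwidth follows directly from the interface parameters (the double-data-rate factor is our reading of how the 6.4 GB/s figure is obtained):

```python
BUS_BYTES = 4            # 32-bit-wide LPDDR3 interface
CLOCK_HZ = 800e6         # 800 MHz I/O clock
TRANSFERS_PER_CYCLE = 2  # double data rate

print(f"{BUS_BYTES * CLOCK_HZ * TRANSFERS_PER_CYCLE / 1e9:.1f} GB/s")   # 6.4 GB/s
```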
TABLE IV
Comparison With Similar State-of-the-Art Designs

Table IV compares our design with SPOTS [10] and USCA [11], the GeMM-based systolic accelerators in the literature most similar to this work, as well as Eyeriss [6], a successful systolic accelerator commonly used as a benchmark. Since these designs are implemented in different technology nodes, we also report the efficiency values scaled to 22 nm for a better comparison, using the equations defined in [21]. In the case of SPOTS, a comparison in terms of energy efficiency cannot be established, since its power consumption is unreported. Similarly, USCA seems to present a higher energy efficiency, but the arithmetic used in its PEs is unreported, so a fair comparison cannot be established, since this choice greatly impacts the overall energy efficiency as well as the accelerator accuracy. It should also be noted that the authors of USCA do not perform the full physical design of the accelerator, so its reported power may be an optimistic estimate. Finally, compared to Eyeriss v2, our accelerator has a slightly lower energy efficiency in absolute terms when considering dense tensor convolutions. However, it is important to note that Eyeriss uses 8-bit fixed-point arithmetic, which is expected to consume much less power than the 16-bit floating-point arithmetic we support to execute any state-of-the-art CNN with high accuracy. From our experiments with MAC-based PEs in 22-nm technology, we have seen that the difference between FP16 and int8 is more than enough to cover the energy efficiency gap between our accelerator and Eyeriss v2. Hence, when accounting for this difference, our proposed accelerator surpasses Eyeriss v2 in terms of energy efficiency when dealing with dense tensors, while also supporting dilated convolutions.

IV. CONCLUSION

We have presented an energy-efficient convolution accelerator for CNNs built upon a GeMM-based systolic array, coupled with our novel data feeder architecture, which enables ON-chip, on-the-fly convolution lowering and supports dilated (or atrous) convolutions. Our accelerator reaches a peak throughput of 284 GFLOP/s and an energy efficiency of 1.10 TFLOP/sW using FP16 arithmetic and a 16 × 16 array. By formulating lowering as a selection task and leveraging the regularity of the 2-D convolution, our data feeder design achieves higher area and energy efficiency than any other proposed im2col solution, with area and power overheads of 4% and 7%, respectively. We have demonstrated how an ASIC implementation of our design can achieve reductions of more than a factor of two in memory transactions when accelerating state-of-the-art CNNs, compared to software-based im2col, significantly improving the energy efficiency of the overall system. Additionally, the reduction in memory transactions relaxes the OFF-chip memory bandwidth requirements of the system, boosting the overall performance and energy efficiency of the accelerator.
REFERENCES

[1] M. Capra, B. Bussolino, A. Marchisio, G. Masera, M. Martina, and M. Shafique, "Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead," IEEE Access, vol. 8, pp. 225134–225180, 2020.


[2] N. P. Jouppi, "In-datacenter performance analysis of a tensor processing unit," in Proc. Int. Symp. Comput. Archit. Piscataway, NJ, USA: Institute of Electrical and Electronics Engineers, Jun. 2017, pp. 1–12.
[3] Z. Du, "ShiDianNao: Shifting vision processing closer to the sensor," in Proc. Int. Symp. Comput. Archit., vols. 13–17. Piscataway, NJ, USA: Institute of Electrical and Electronics Engineers, Jun. 2015, pp. 92–104.
[4] R. Xu, S. Ma, Y. Wang, and Y. Guo, "HeSA: Heterogeneous systolic array architecture for compact CNNs hardware accelerators," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Feb. 2021, pp. 657–662.
[5] Y. Chen, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. 43rd Int. Symp. Comput. Archit. (ISCA). Piscataway, NJ, USA: Institute of Electrical and Electronics Engineers, Aug. 2016, pp. 367–379.
[6] Y. Chen, T. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292–308, Jun. 2019.
[7] S. Venkataramani et al., "RaPiD: AI accelerator for ultra-low precision training and inference," in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 153–166.
[8] S. Chetlur, "cuDNN: Efficient primitives for deep learning," 2014, arXiv:1410.0759.
[9] H. Genc et al., "Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration," in Proc. 58th ACM/IEEE Design Automat. Conf. (DAC), Dec. 2021, pp. 769–774.
[10] M. Soltaniyeh et al., "SPOTS: An accelerator for sparse CNNs leveraging general matrix-matrix multiplication," 2021, arXiv:2107.13386.
[11] W. Liu, J. Lin, and Z. Wang, "USCA: A unified systolic convolution array architecture for accelerating sparse neural network," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2019, pp. 1–5.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[13] A. Vaswani, "Attention is all you need," 2017, arXiv:1706.03762.
[14] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, "Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks," 2021, arXiv:2102.00554.
[15] L. Ye, J. Ye, M. Yanagisawa, and Y. Shi, "Power-efficient deep convolutional neural network design through zero-gating PEs and partial-sum reuse centric dataflow," IEEE Access, vol. 9, pp. 17411–17420, 2021.
[16] S. Mach, F. Schuiki, F. Zaruba, and L. Benini, "FPnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 4, pp. 774–787, Apr. 2021.
[17] T. Guan and H. Zhu, "Atrous faster R-CNN for small scale object detection," in Proc. 2nd Int. Conf. Multimedia Image Process. (ICMIP), Mar. 2017, pp. 16–21.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), Sep. 2015, pp. 1–14.
[19] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," 2018, arXiv:1804.02767.
[20] K. Chandrasekar. DRAMPower: Open-Source DRAM Power & Energy Estimation Tool. Accessed: Apr. 25, 2023. [Online]. Available: http://www.drampower.info
[21] A. Stillmaker and B. Baas, "Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm," Integration, vol. 58, pp. 74–81, Jun. 2017.
