Enhanced Accelerator Design for Efficient CNN Processing with Improved Row-Stationary Dataflow

F. Lesniak, A. Gutermann, T. Harbaum, J. Becker
Proceedings of the Great Lakes Symposium on VLSI 2024. dl.acm.org
Efficient on-device inference of convolutional neural networks (CNNs) is becoming one of the key challenges for embedded systems, leading to the integration of specialized hardware accelerators in System-on-Chips (SoCs). Due to the memory-bound nature of convolution workloads, it is essential to optimize CNN accelerators for maximum data re-use to reduce memory bandwidth requirements. The row-stationary (RS) dataflow enhances data re-use in CNN processing by storing a subset of input activations, weights and partial sums locally within the Processing Elements (PEs). However, designs of RS accelerators are not publicly available, and many implementation details remain undisclosed. This paper introduces an open-source implementation of a CNN accelerator with RS dataflow. The complete VHDL source code is provided as well as a simulation environment that enables in-depth analysis of different workloads. We contribute an exploration of various design parameters and evaluate their impact on performance. Furthermore, we present an enhanced dataflow that is optimized for parallel processing of convolutions with a high number of channels. Our optimizations yield a performance improvement of up to 2.3x for convolutional layers of common neural networks. An FPGA prototype of the accelerator design, featuring 70 PEs on the Xilinx UltraScale+ ZCU104 platform, achieves 4.012 GOPS at 100 MHz.
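To make the row-stationary idea concrete, the sketch below shows a minimal NumPy model of the dataflow the abstract describes: each (hypothetical) PE keeps one filter row and one input-activation row stationary, computes a row of 1-D partial sums, and partial sums are accumulated across PEs to form one output row. This is an illustrative reading of the RS dataflow (as popularized by Eyeriss), not the paper's VHDL implementation; the function name, stride/padding assumptions, and loop structure are the author's own simplifications.

```python
import numpy as np

def rs_conv2d(ifmap, weights):
    """Illustrative row-stationary 2-D convolution (stride 1, no padding).

    Each conceptual PE holds one filter row (stationary) and one ifmap row,
    producing a row of 1-D partial sums; the partial sums are then summed
    vertically, mirroring the per-PE accumulation of the RS dataflow.
    This is a behavioral sketch, not the accelerator's actual hardware.
    """
    R, S = weights.shape          # filter height, width
    H, W = ifmap.shape            # input height, width
    E, F = H - R + 1, W - S + 1   # output height, width
    ofmap = np.zeros((E, F))

    for e in range(E):                    # one group of PEs per output row
        for r in range(R):                # PE r holds filter row r ...
            w_row = weights[r]            # ... kept stationary for the whole pass
            i_row = ifmap[e + r]          # ifmap rows are reused diagonally
            # 1-D convolution inside the PE: sliding MACs over the row
            psum = np.array([i_row[f:f + S] @ w_row for f in range(F)])
            ofmap[e] += psum              # vertical partial-sum accumulation
    return ofmap
```

For example, `rs_conv2d(np.random.rand(8, 8), np.random.rand(3, 3))` returns the same 6x6 result as a direct 2-D cross-correlation; the point of the RS schedule is that each filter row and each ifmap row is fetched once and reused across many MACs, which is the data re-use the abstract argues reduces memory bandwidth.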