Implementation of CNN On Zynq Based FPGA For Real-Time Object Detection
Abstract—The aim of this work is to implement a Convolutional Neural Network (CNN) using a Python framework on a Xilinx® Zynq® based Field Programmable Gate Array (FPGA). The prototype is used to tackle the challenging problem of real-time object detection in computer vision. Pre-trained CNN models are implemented using the Tensorflow Application Programming Interface (API) and are then deployed on the Xilinx® Zynq® based FPGA using Python productivity for Zynq (PYNQ). The versatility of this approach is tested on four state-of-the-art object detectors built from two classifiers, MobileNet v1 and Inception v2, and two meta-architectures, SSD and Faster R-CNN, all trained on the MS-COCO dataset. System functionality is compared on the basis of system latency, detection accuracy and ease of implementation on ARM® embedded mobile platforms. From the results and the achieved real-time frame rate, the SSD with Inception v2 model is found suitable for the intended application. This hardware-software co-design approach forms the basis of FPGA based hardware accelerators.

Keywords—PYNQ, Tensorflow, CNN, Zynq, Object detection.

I. INTRODUCTION

Software design in AI has undergone a paradigm shift with the exponential growth in available data and computational power. Contrary to manual programming, the focus now is on 'learning to learn'. This ability to deduce information from massive amounts of data is enabled by the class of algorithms that constitute deep learning, which are successfully applied to computer vision, natural language processing, speech recognition etc. The ability of deep learning algorithms to train complex feature extraction systems and learn from huge datasets has been augmented by DL frameworks such as Tensorflow, Caffe, PyTorch etc. The democratised application of these open source frameworks has led to their rapid, transparent and tractable development, which has further boosted the growth of this field. These open source frameworks support the Python programming language, which was ranked #4 in 2014 in IEEE Spectrum's survey of top programming languages and is now #1 as per the 2018 survey. A quantitative comparison of a few open source Python based frameworks is carried out using 'Github metrics' [1] and summarised in Table I.

TABLE I. COMPARISON OF DL FRAMEWORKS THAT SUPPORT PYTHON, BASED ON GITHUB METRICS.

Framework    License    Language      APIs               Score
Tensorflow   Apache 2   C++, Python   C++, Python, Go    100
Keras        MIT        Python        Python, R          46.1
Caffe        BSD        C++           Python, MATLAB     38.1
Theano       BSD        Python        Python             19.3
PyTorch      BSD        C++, Python   Python             14.3

However, this growth is constrained by the need for improved hardware acceleration to scale present-day models beyond the limits of contemporary data and model sizes. GPUs and FPGAs are therefore employed to provide the desired speedup, but they depend on High Level Synthesis (HLS) design tools, and hence it is cumbersome to deploy sophisticated algorithms like object detectors on hardware development platforms such as FPGAs. This issue may be resolved by an amalgamation of software and hardware design, i.e. using Python for FPGA design.

A major challenge in such embedded system designs is the variable dimension of the real-time output, which is high when the number of objects in the image is large and low when few observable entities are present. This aspect makes it difficult to trade off architectural complexity against size. Existing highly complex architectures cannot be employed in embedded applications due to scarce resources and size constraints, which affects the overall accuracy of the system. Hence application specific CNN architectures must be designed for embedded systems; there is no 'one-size-fits-all' architecture. This issue can be addressed by quick implementation of CNNs on FPGAs, custom-modified to the system specifications using HLS design tools. Most of the prior work focusses on accelerator architecture design, but in this work a system level analysis is carried out, demonstrating the flow for CNN implementation without the HLS design flow.

The literature reveals other projects with similar exploratory goals. S. I. Venieris et al. [14] presented 'fpgaConvNet', a framework to map CNN architectures onto FPGAs. Samjabrahams et al. [10] deployed Tensorflow on a Raspberry Pi®, demonstrating successful integration of Python and micro-controllers. Schmidt et al. [11] implemented edge detection algorithms such as the Canny edge detector using Python on the Xilinx® PYNQ® board and similar hardware platforms; their performance comparison revealed that approximately 30x speedup can be achieved using PYNQ® based hardware acceleration. E. Wang et al. [12] presented an open source framework for fast prototyping on FPGAs, which enables rapid deployment of CNNs on FPGA. To evaluate its performance, three CNN prototypes were tested: the first based on LeNet-5 for recognition of hand-written digits on the MNIST dataset, the second for image classification on the CIFAR-10 dataset, and the third an implementation of Network in Network (NIN) for image classification on the CIFAR-10 dataset.

CNNs were applied to image processing even before 2012, but AlexNet [13] caught the attention of the world in 2012 with a top-5 error rate of 15.3%. As observed in [4], the CNN is an efficient algorithm for computer vision applications such as semantic segmentation, image classification, object detection and instance segmentation. The important characteristics of CNNs required for fast real-time object detection …
II. PYNQ® BASED HLS APPROACH TO FPGA DESIGN

Heterogeneous configurable platforms must be complemented with high level synthesis design tools, so that higher abstraction applications can be developed on the low level fabric of an FPGA. An introductory overview of various HLS design tools is given in [2]. The most widely used HLS tools involve complex tool flows, and the common steps involved are summarised in the flow graph shown in Figure 1.

Fig. 1. Steps involved in the high level synthesis of hardware designs using HLS tools.

In 2017 Xilinx® launched PYNQ® [3], an open source project that eases the process of embedded system design using Xilinx® Zynq® SoCs. It enables heterogeneous programmable development platforms to be programmed directly in Python. Prior to it, different approaches had been adopted to enhance the programmability of FPGAs and improve their participation in mainstream computing. The first is the development of Hardware Description Language (HDL)-like languages; in this direction SystemVerilog was a game changer, bringing together the best of HDL languages and the object oriented programming paradigm, and another such example is Bluespec® SystemVerilog (BSV). The second approach is the introduction of C-based frameworks, which use a subset of the C/C++ programming languages to program hardware devices through automatic RTL development. Such tools launched by major EDA companies include Xilinx® Vivado HLS, Synopsys® Synphony C Compiler, Intel® High Level Synthesis Compiler, Mentor Graphics® Catapult C etc. Recently, LegUp® was launched, which is the only HLS tool that can program Intel®, Xilinx®, Lattice®, Achronix® and Microsemi® FPGAs without any re-implementation of the designs. The third approach is the OpenCL based framework, such as the Intel® FPGA SDK for OpenCL, which is designed for FPGA-based heterogeneous systems (FPGA + processor).

A. An Overview of PYNQ Architecture

Xilinx® PYNQ® provides an end to end development flow for the acceleration of Python applications on Xilinx® Zynq® SoC based FPGAs. It offers a dynamic implementation of Python applications on the FPGA's processing system (PS) or programmable logic (PL) as per the design. In PYNQ®, PL bitstreams are packaged as pre-compiled overlays.

The PYNQ® architecture is complementary to Xilinx® Zynq® All Programmable System on Chip (APSoC) based FPGAs. Figure 3 shows a highly simplified overview of the Xilinx® Zynq® SoC architecture, which is powered by a quad-core ARM® Cortex® A53 processor (PS) and a dual-core ARM® Cortex® R5 processor (RPU), intertwined with the 'Programmable Logic' (PL).

Fig. 3. Highly simplified overview of the Zynq® architecture: the processing system (PS) running the Python applications and DMA drivers, and the programmable logic (PL) hosting the DNN kernel, DMA and BRAM, with shared DRAM and peripheral devices on both sides.

Jupyter notebook is an open source project which provides an interactive environment to explore algorithms and dynamically prototype complex applications in Python. PYNQ® leverages the same approach in its FPGA based embedded system development by embedding a Jupyter (IPython) kernel and notebook web server onto the Xilinx® Zynq® SoC's ARM® processors.
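This notebook workflow makes the PL directly scriptable from ordinary Python code. The listing below is a minimal sketch of that flow, not the exact design used in this work: the bitstream name "design.bit" and the IP instance name "axi_dma_0" are placeholders for illustration, and pynq.allocate assumes a PYNQ image of version 2.4 or later.

    # Load a pre-compiled overlay onto the PL and move data through an AXI DMA.
    # Sketch only: "design.bit" and "axi_dma_0" are hypothetical names.
    import numpy as np
    from pynq import Overlay, allocate   # allocate requires PYNQ >= 2.4

    overlay = Overlay("design.bit")      # downloads the bitstream to the PL
    print(overlay.ip_dict.keys())        # list the IP cores exposed by this overlay

    dma = overlay.axi_dma_0              # hypothetical AXI DMA instance in the PL

    # Physically contiguous buffers shared between the ARM PS and the PL fabric.
    in_buf = allocate(shape=(1024,), dtype=np.uint8)
    out_buf = allocate(shape=(1024,), dtype=np.uint8)
    in_buf[:] = np.arange(1024) % 256    # fill the input buffer with test data

    dma.sendchannel.transfer(in_buf)     # PS -> PL
    dma.recvchannel.transfer(out_buf)    # PL -> PS
    dma.sendchannel.wait()
    dma.recvchannel.wait()

The same cells can be edited and re-run interactively from the browser, which is what makes rapid prototyping on the Zynq® platform practical.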
III. METHODOLOGY

The focus of this work is the fast implementation of a CNN for real-time object detection, and the analysis of its application in the design of FPGA based hardware accelerators. So, before delving into details, a primer of the key concepts is given below.

A. Real-time Object Detection

Object detection involves detecting an instance of an object in a two-dimensional image based on discriminative features such as colour, intensity, edge and contour derived from the pixel map of the image. Real-time object detection is a challenging problem because the system response has to be generated in real time.
Fig. 6. (a) Original images; (b) SSD + MobileNet v1 results; (c) SSD + Inception v2 results; (d) Faster R-CNN + Inception v2 results; (e) SSD + MobileNet v1 w/FPN results. The last row shows images captured from the live video feed input to the system.
SSD + MobileNet v1 w/FPN, with an inference time of 56 ms and 32 mAP, is the slowest model, and its accuracy is much less than the maximum obtained by the Faster R-CNN based model.

It can be seen that the input images were captured directly with a USB camera under improper lighting to further test system robustness. The images chosen vary in colour, texture, contrast, print and object instance to test CNN inference effectively. The implemented CNN models are inherently good at feature extraction under different conditions such as occlusion, illumination and scale because of their invariance to translation, which is induced by the pooling layers, and their invariance to different illuminations, which results from pre-processing the CNN inputs through mean subtraction and normalisation. However, as can be observed from the output images, there is still a significant effect of illumination on the detected objects, owing to the simplicity of the CNN architectures used. Hence, the robustness of a CNN architecture also depends upon how diverse the training data is and how complex the contextual information annotated into the training images is. Therefore, the CNN models have been trained on a large dataset like MS-COCO, which has objects in various orientations, present at different distances and viewed under different lighting conditions.
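As a concrete illustration of this capture-and-detect loop, the sketch below assumes a frozen inference graph exported from the Tensorflow Object Detection API (for example one of the SSD or Faster R-CNN models above) and a USB camera at index 0. The file path is hypothetical, and the mean subtraction and normalisation described above are carried out inside the exported graph itself, so the frame is fed in as a raw uint8 RGB tensor.

    # Minimal capture-and-detect loop (illustrative sketch, TF 1.x style).
    # Assumes a frozen graph exported from the Tensorflow Object Detection API;
    # the path below is a placeholder, not the authors' actual file.
    import cv2
    import numpy as np
    import tensorflow as tf

    GRAPH_PB = "frozen_inference_graph.pb"   # hypothetical path to the model

    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(GRAPH_PB, "rb") as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")

    cap = cv2.VideoCapture(0)                # USB camera
    with tf.Session(graph=graph) as sess:
        image_tensor = graph.get_tensor_by_name("image_tensor:0")
        fetches = [graph.get_tensor_by_name(n + ":0")
                   for n in ("detection_boxes", "detection_scores",
                             "detection_classes", "num_detections")]
        while True:
            ok, frame_bgr = cap.read()
            if not ok:
                break
            # OD API graphs expect a batched uint8 RGB tensor; mean subtraction
            # and normalisation happen inside the exported graph.
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            boxes, scores, classes, num = sess.run(
                fetches, feed_dict={image_tensor: frame_rgb[np.newaxis, ...]})
            # ... draw boxes whose scores exceed a confidence threshold, display, etc.
        cap.release()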
… object to maintain the speed requirements, thereby decreasing the overall accuracy of the CNN model. Furthermore, accurate object detectors show many bounding boxes in the same image, leading to over-prediction of objects, which in turn slows the system response and is not suitable for real-time object detection applications.

In the case of real-time object detection on embedded systems, the three performance parameters are accuracy, inference speed and architecture size. With constrained resources on embedded platforms, inference speed takes precedence over accuracy, and hence a faster model is preferred over an accurate but slower one. Therefore, a system specification constrained by scarce on-board resources leads to a sub-par accuracy level. Moreover, since the CNN models are programmed in Python, which has a lower execution speed than C++, the inference speed is also reduced. Thus, an optimum trade-off among the system parameters is necessary to obtain the desired system performance.
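Per-frame latency and frame-rate figures of the kind discussed above can be obtained with a simple wall-clock measurement around the inference call. The helper below is an illustrative sketch only, with sess, fetches and image_tensor as in the earlier capture-and-detect sketch.

    # Rough per-frame latency / FPS measurement around the inference call.
    import time

    def timed_inference(sess, fetches, image_tensor, frame, warmup=2, runs=10):
        """Return mean latency (s) and FPS for repeated inference on one frame."""
        feed = {image_tensor: frame[None, ...]}
        for _ in range(warmup):                 # discard graph warm-up runs
            sess.run(fetches, feed_dict=feed)
        start = time.perf_counter()
        for _ in range(runs):
            sess.run(fetches, feed_dict=feed)
        mean_latency = (time.perf_counter() - start) / runs
        return mean_latency, 1.0 / mean_latency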
Besides the qualitative analysis, a statistical analysis is also performed to support the observations made from the obtained results. A detailed quantitative analysis is provided in the next sub-section.
Fig. 7. Accuracy levels of various object instances for all four CNN models used for object detection: Model A - SSD + MobileNet v1, Model B - SSD + Inception v2, Model C - Faster R-CNN + Inception v2 and Model D - SSD + MobileNet v1 w/FPN.
Fig. 8. Comparative analysis of confidence scores (Table VI) for the four CNN models (Models A-D) across the ten object classes C1-C10, on a 0-100 scale.
TABLE VI. CONFIDENCE SCORES FOR ALL 10 OBJECT CLASSES.
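Per-class confidence scores of the kind reported in Table VI can be tabulated directly from the detector outputs. The sketch below is illustrative only: it assumes the Object Detection API output arrays from the earlier capture-and-detect sketch, and LABELS is a hypothetical class-id to class-name map rather than the actual MS-COCO label map.

    # Accumulate the best confidence score per object class from detector outputs.
    # Sketch only: classes/scores/num follow the Object Detection API layout.
    from collections import defaultdict

    LABELS = {1: "C1", 2: "C2", 3: "C3"}     # placeholder mapping, not the paper's

    def best_scores_per_class(classes, scores, num, min_score=0.0):
        """Return {class_name: best confidence in %} for one image."""
        best = defaultdict(float)
        n = int(num[0])
        for cls, score in zip(classes[0][:n], scores[0][:n]):
            name = LABELS.get(int(cls))
            if name is not None and score >= min_score:
                best[name] = max(best[name], float(score) * 100.0)
        return dict(best)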