

Implementation of CNN on Zynq based FPGA for Real-time Object Detection

Aman Sharma, Vijander Singh, Asha Rani
I.C.E Division, Netaji Subhas Institute of Technology, University of Delhi
New Delhi, India
[email protected]

Abstract—The aim of this work is to implement a Convolutional Neural Network (CNN) using a Python framework on a Xilinx® Zynq® based Field Programmable Gate Array (FPGA). The prototype is used to tackle the challenging problem of real-time object detection in computer vision. Pre-trained CNN models are implemented using the Tensorflow Application Programming Interface (API) and are then deployed on the Xilinx® Zynq® based FPGA using Python productivity for Zynq (PYNQ). The versatility of this approach is tested on four state-of-the-art object detectors built from the classifiers MobileNet v1 and Inception v2 and the meta-architectures SSD and Faster R-CNN, which have been trained on the MS-COCO dataset. System functionality is compared on the basis of system latency, detection accuracy and ease of implementation on ARM® embedded mobile platforms. It is observed from the results and the real-time frame rate that the SSD with Inception v2 model is suitable for the intended application. This hardware-software co-design approach forms the basis of FPGA based hardware accelerators.

Keywords—PYNQ, Tensorflow, CNN, Zynq, Object detection.

I. INTRODUCTION

The software design in AI has undergone a paradigm shift with an exponential growth in the available data and computational power. Contrary to manual programming, the focus now is on 'learning to learn'. This ability to deduce information from massive amounts of data is enabled by the class of algorithms that constitute deep learning, which are successfully applied to computer vision, natural language processing, speech recognition etc. The ability of deep learning algorithms to train complex feature extraction systems and learn from huge datasets has been augmented by DL frameworks such as Tensorflow, Caffe, PyTorch etc. A democratised application of these open source frameworks has led to their rapid, transparent and tractable development, which has further boosted the growth of this field. These open source frameworks support the python programming language, which was ranked #4 in 2014 in IEEE Spectrum's survey of top programming languages and is #1 as per the 2018 survey. A quantitative comparison of a few open source python based frameworks is carried out using 'Github metrics' [1] and summarised in Table I.

TABLE I. COMPARISON OF DL FRAMEWORKS THAT SUPPORT PYTHON, BASED ON GITHUB METRICS.

Framework  | License  | Language    | APIs            | Score
Tensorflow | Apache 2 | C++, python | C++, python, Go | 100
Keras      | MIT      | Python      | Python, R       | 46.1
Caffe      | BSD      | C++         | Python, MATLAB  | 38.1
Theano     | BSD      | python      | python          | 19.3
PyTorch    | BSD      | C++, python | python          | 14.3

However, this growth is constrained by the need for improved hardware acceleration to scale present day models beyond the limits of contemporary data and model sizes. GPUs and FPGAs are therefore employed to provide the desired speedup, but they depend on High Level Synthesis (HLS) design tools. Hence it is cumbersome to deploy sophisticated algorithms like object detectors on hardware development platforms like FPGAs. This issue may be resolved by an amalgamation of software and hardware design, i.e. using python for FPGA design.

A major challenge in such embedded system designs is the variable dimension of the real-time output, which is high when the number of objects in the image is large and low for scarce observable entities in the image. This aspect makes it difficult to trade off architectural complexity against size. Existing highly complex architectures cannot be employed in embedded applications due to scarce resources and size constraints, which affects the overall accuracy of the system. Hence application specific CNN architectures for embedded systems must be designed; there is no 'one-size-fits-all' architecture. This issue can be resolved by quick implementation of CNN on FPGA, custom-modified based on system specifications using HLS design tools. Most of the prior work is focussed on accelerator architecture design, but in this work a system level analysis is carried out, demonstrating the complete flow for CNN implementation without an HLS design flow.

Literature has revealed other projects with similar exploratory goals. S. I. Venieris et al. [14] presented 'fpgaConvNet', a framework to map CNN architectures on FPGA. Samjabrahams et al. [10] deployed Tensorflow on the Raspberry-Pi®, which demonstrated successful integration of python and micro-controllers. Schmidt et al. [11] implemented edge detection algorithms such as the Canny edge detector using python on the Xilinx® PYNQ® board and similar hardware platforms. Their performance comparison revealed that approximately 30x speedup can be achieved using PYNQ® based hardware acceleration. E. Wang et al. [12] presented an open source framework for fast prototyping on an FPGA, which enables rapid deployment of CNN on FPGA. To evaluate its performance, three CNN prototypes were tested: first, one based on LeNet-5 for recognition of hand-written digits using the MNIST dataset; second, one for image classification using the CIFAR-10 dataset; and third, an implementation of Network in Network (NIN), also for image classification using the CIFAR-10 dataset.

CNNs were applied to image processing even before 2012, but Alexnet [13] received the attention of the world in 2012 with a top-5 error rate of 15.3%. As observed in [4], CNN is an efficient algorithm for computer vision applications such as semantic segmentation, image classification, object detection and instance segmentation.

The important characteristics of CNNs required for fast real-time object detection in embedded applications are a high accuracy level, small inference time and minimum architecture size. Real-time object detection enables machine vision, which has varied applications such as UAVs, autonomous driving, surveillance, machine inspection, pose detection, personal security etc. Although there are several highly complex and efficient neural networks with hundreds of convolutional layers and millions of parameters, such huge CNNs are difficult to implement on embedded platforms. The aim of this work is to demonstrate the implementation of CNN architectures on resource scarce embedded platforms like FPGA using PYNQ for fast real-time object detection.

This paper is organised as follows: Section II explains the PYNQ architecture used for the implementation of CNN on FPGA. In Section III the methodology of this work is explained. Section IV covers the experimental environment for hardware implementation of deep learning algorithms. The performance analysis of the system based on the obtained results is given in Section V. Section VI provides a conclusion of the work with an insight into possible future work.

II. PYNQ® BASED HLS APPROACH TO FPGA DESIGN

Heterogeneous configurable platforms must be complemented with high level synthesis design tools, so that higher abstraction applications can be developed on the low level fabric of an FPGA. An introductory overview of various HLS design tools is given in [2]. The most widely used HLS tools involve complex tool flows, and the common steps involved are summarised in the flow graph shown in Figure 1.

Fig. 1. Steps involved in the high level synthesis of hardware designs using HLS tools: top-level design (behavioural model), profile analysis (throughput and latency), bitstream generation (hardware for software), middleware generation (software for hardware) and IP core generation using HLS, leading to the implemented FPGA design.

In 2017 Xilinx® launched PYNQ® [3], an open source project that eases the process of embedded system design using Xilinx® Zynq® SoCs. It enables heterogeneous programmable development platforms to be programmed using python directly. Prior to it, different approaches had been adopted to enhance the programmability of FPGAs and improve their participation in mainstream computing. First is the development of Hardware Description Language (HDL)-like languages. In this direction SystemVerilog was a game changer which brought together the best of HDL languages and the object oriented programming paradigm. Another such example is Bluespec® SystemVerilog (BSV). The second approach is the introduction of C-based frameworks which use a subset of the C/C++ programming languages to program hardware devices through automatic RTL development. Various such tools launched by major EDA companies are Xilinx® Vivado HLS, Synopsys® Synphony C Compiler, Intel® High Level Synthesis Compiler, Mentor Graphics® Catapult C etc. Recently, LegUp® was launched, which is the only HLS tool that can program Intel®, Xilinx®, Lattice®, Achronix® and Microsemi® FPGAs without any re-implementation of the designs. The third approach is the OpenCL based framework, such as the Intel® FPGA SDK for OpenCL, which is designed for FPGA-based heterogeneous systems (FPGA+processor).

A. An Overview of PYNQ Architecture

Xilinx® PYNQ® provides an end to end development flow for acceleration of python applications on Xilinx® Zynq® SoC based FPGAs. It offers a dynamic implementation of python applications on the FPGA's processing system (PS) or programmable logic (PL) as per the design. In PYNQ® PL bitstreams are packaged as pre-defined hardware libraries called 'Overlays', analogous to software libraries, which can be directly plugged into the development environment. Hardware libraries or 'Overlays' provide a virtual programmable space on top of the low-level FPGA fabric without any need for RTL development. Figure 2 shows a simplified overview of the PYNQ® development architecture.

Fig. 2. A simplified overview of PYNQ® architecture: applications in Jupyter/IPython and PYNQ notebooks, PYNQ libraries and python APIs over Linux kernel drivers (DMA, axi_intc) on the PS, with overlays, user designs and PYNQ IPs as bitstreams on the FPGA fabric.

The PYNQ® architecture is complementary to Xilinx® Zynq® All Programmable System on Chip (APSoC) based FPGAs. Figure 3 shows a highly simplified overview of the Xilinx® Zynq® SoC architecture, which is powered by a quad-core ARM® Cortex® A53 processor (PS) and a dual-core ARM® Cortex® R5 processor (RPU), intertwined with the 'Programmable Logic' (PL).

Fig. 3. Highly simplified overview of ZYNQ® architecture: python applications and DMA drivers on the processing system (PS), a DNN kernel with DMA and BRAM in the programmable logic (PL), connected to DRAM and peripheral devices.

Jupyter notebook is an open source project which provides an interactive environment to explore algorithms and dynamically prototype complex applications in python. PYNQ® leverages the same approach in its FPGA based embedded system development by embedding the Jupyter (IPython) kernel and notebook web server onto the Xilinx® Zynq® SoC's ARM® processors.
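From a notebook running on the PS, a hardware library is loaded through the pynq package. The following is a minimal sketch, assuming the stock 'base.bit' overlay shipped with the PYNQ image is available on the board:

```python
from pynq import Overlay

# Programs the PL with the named bitstream; 'base.bit' is the stock
# overlay distributed with the PYNQ image for supported boards.
overlay = Overlay("base.bit")

# IP blocks exposed by the loaded overlay, addressable from python.
print(overlay.ip_dict.keys())
```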
III. METHODOLOGY

The focus of this work is fast implementation of CNN for real-time object detection, and analysis of its application in the design of FPGA based hardware accelerators. So, before delving into details, a primer of the key concepts is given below.

A. Real-time Object Detection

Object detection involves detection of an instance of an object in a two-dimensional image based on discriminative features such as colour, intensity, edge, contour etc derived from the pixel map of the image. Real-time object detection is a challenging problem because the system response has to

be obtained within a time frame, else the system fails in the task. Therefore, the performance metrics of a real time system are system latency, system throughput and system bandwidth, emphasising that parallel operations are critical to CNN inference in a real time system. Since parallelism is inherent in FPGAs, object detection is a suitable application for FPGA based acceleration.
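On any of the platforms considered here, per-frame latency and the effective frame rate can be estimated by timing repeated inference calls; a minimal sketch is given below, where `detect` is a hypothetical stand-in for a detector callable:

```python
import time

def measure(detect, frame, runs=10):
    """Average per-frame inference latency and effective frame rate
    of a detector callable `detect` (hypothetical stand-in)."""
    start = time.perf_counter()
    for _ in range(runs):
        detect(frame)
    latency = (time.perf_counter() - start) / runs
    print(f"latency: {latency * 1e3:.1f} ms  ->  {1.0 / latency:.1f} FPS")
```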
B. Convolutional Neural Network Models

CNN is a type of feed-forward artificial neural network. Typically a CNN is a sequence of different types of layers. The input to the CNN is the pixel array of an input image. The intermediate results obtained are sets of feature maps. The outputs obtained are in the form of conditional probabilities for a given set of inputs. The highest probability for a choice depicts the confidence of the network in that output.

Consider a CNN having $S$ images as input fed to $L$ network layers. $\mathcal{X}$ are the input feature maps with dimensions $[S \times C \times B \times A]$, $\mathcal{Y}$ are the output feature maps with dimensions $[S \times R \times Q \times P]$, $\mathcal{W}$ are the learned weights associated with each layer, with dimensions $[R \times C \times V \times U]$, and the biases $\mathcal{B}$ associated with each neuron have dimension $[R]$. Here $S$ is the batch size of the input feature maps; $A$ is the width, $B$ the height and $C$ the depth of the input feature maps; $P$ is the width, $Q$ the height and $R$ the depth of the output feature maps; $U$ is the horizontal and $V$ the vertical kernel size of the learned weights.

• The convolutional layer is used to extract features from the input pixel array. A convolutional kernel (filter) slides over the entire pixel array to perform convolution. Mathematically,

$$\mathcal{Y}_{CONV}[s,r,q,p] = \sum_{c=1}^{C}\sum_{v=1}^{V}\sum_{u=1}^{U} \mathcal{W}_{CONV}[r,c,v,u] \cdot \mathcal{X}_{CONV}[s,c,q+v,p+u] + \mathcal{B}_{CONV}[r] \qquad (1)$$

• The activation layer implements a non-linear function $A(\cdot)$ on the feature maps obtained from the convolutional layer. It is essential to introduce non-linearity in the CNN architecture. Mathematically,

$$\mathcal{Y}_{ACTV}[s,r,b,a] = A(\mathcal{X}_{ACTV}[s,r,b,a]) \qquad (2)$$

• The pooling layer is inserted between other layers intermittently. It progressively reduces the spatial dimensions of its input feature maps. It extracts local information, reduces the spatial dimensions of the feature maps and decreases the overall number of parameters in the network. Mathematically, for a pooling window of size $K$,

$$\mathcal{Y}_{POOL}[s,r,q,p] = \max_{m,n \in [1:K]} \mathcal{X}_{POOL}[s,r,q+m,p+n] \qquad (3)$$
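For illustration, equations (1)-(3) can be written directly as an (unoptimised) NumPy loop nest that mirrors the index convention above; ReLU is assumed as the non-linearity $A(\cdot)$ in equation (2), and a stride equal to the window size $K$ is assumed in equation (3):

```python
import numpy as np

def conv_layer(X, W, bias):
    """Loop-nest form of equation (1), 'valid' padding and unit stride.
    X: inputs [S, C, B, A], W: weights [R, C, V, U], bias: [R]."""
    S, C, H, A = X.shape
    R, _, V, U = W.shape
    Q, P = H - V + 1, A - U + 1                  # output height and width
    Y = np.zeros((S, R, Q, P))
    for s in range(S):
        for r in range(R):
            for q in range(Q):
                for p in range(P):
                    # sum over c, v, u of W[r,c,v,u] * X[s,c,q+v,p+u], plus bias
                    Y[s, r, q, p] = np.sum(W[r] * X[s, :, q:q+V, p:p+U]) + bias[r]
    return Y

def activation_layer(X):
    """Equation (2), with ReLU assumed as the non-linearity A(.)."""
    return np.maximum(X, 0.0)

def pool_layer(X, K):
    """Equation (3): K x K max pooling (stride K assumed)."""
    S, R, H, A = X.shape
    Y = np.zeros((S, R, H // K, A // K))
    for q in range(H // K):
        for p in range(A // K):
            Y[:, :, q, p] = X[:, :, q*K:(q+1)*K, p*K:(p+1)*K].max(axis=(2, 3))
    return Y

# Example: a batch of two 3-channel 8x8 maps through one 3x3 convolution,
# ReLU and 2x2 max pooling.
X = np.random.rand(2, 3, 8, 8)
Y = pool_layer(activation_layer(conv_layer(X, np.random.rand(4, 3, 3, 3), np.zeros(4))), 2)
print(Y.shape)  # (2, 4, 3, 3)
```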

If object detection is modelled as a classification problem, the success of a CNN model depends on the accuracy of classification. Examples of such architectures are the Region-based Convolutional Neural Network (R-CNN), Fast R-CNN, Faster R-CNN etc. If object detection is instead considered a regression problem, two of the most famous architectures are You Only Look Once (YOLO) and the Single Shot Detector (SSD). Object detection involves two steps, i.e. classification and localisation of an object in the image. In this work the MobileNet_v1 [5] and Inception_v2 [6] models are used for feature extraction from positive and negative classes of object instances. The extracted features are further used to train the model as well as for classification of objects in the image. The classifiers used in this work are summarised in Table II.

TABLE II. KEY SPECIFICATIONS OF THE CLASSIFIERS USED IN THE PROJECT (MEASURED ON IMAGENET).

Model        | Top-1 Accuracy | Million parameters
MobileNet v1 | 71.1%          | 3.19
Inception v2 | 73.9%          | 10.17

Meta-architectures such as SSD300 [7] and Faster R-CNN [8] are used for localisation of the classified object. Faster R-CNN utilises two networks, one called the region proposal network (RPN) and the other used for detecting objects in the image from the proposals, whereas SSD300 classifies and localises the object in an image using just one network. The meta-architectures used in this work are shown in Table III.

TABLE III. KEY SPECIFICATIONS OF META-ARCHITECTURES USED IN THE WORK (BASED ON PASCAL VOC2007).

Meta-architecture | Frame Rate | mAP   | # Boxes
Faster R-CNN      | 7 FPS      | 73.2% | ~6000
SSD               | 59 FPS     | 74.3% | 8732

These complex CNN architectures can be optimised for embedded applications, e.g. MobileNet_v1 is optimised using depth wise separable convolutions [4]. In order to ease and expedite the implementation of CNNs on FPGA, pre-trained CNN models can be obtained from the open source APIs of various python frameworks and modified as per design specifications. In this project the Google Tensorflow API [9] is used. The work in [8] reveals that Faster R-CNN is the most accurate and the fastest among all R-CNN architectures. The test-time speed comparison is shown in Figure 4.

Fig. 4. Test-time speed comparison of R-CNN architectures [15]: R-CNN 49 s, SPP-Net 4.3 s, Fast R-CNN 2.3 s and Faster R-CNN 0.2 s per image.

The analysis in Tables II and III shows that SSD is much faster, though less accurate, than Faster R-CNN, which allows an optimum trade-off between inference speed and accuracy level among the CNN models. Four CNN models are chosen for further analysis, as summarised in Table IV.

TABLE IV. KEY SPECIFICATIONS OF THE CNN MODELS CHOSEN FOR FURTHER ANALYSIS.

CNN model (Meta-architecture + Classifier) | mAP | Inference time
SSD + MobileNet v1                         | 21  | 30 ms
SSD + Inception v2                         | 24  | 42 ms
Faster RCNN + Inception v2                 | 28  | 58 ms
SSD + MobileNet v1 w/fpn                   | 32  | 56 ms

C. Google Tensorflow API

Due to the sophistication, community support and industry acceptance of Tensorflow, as reflected by the Github metrics summarised in Table I, it is prudent to leverage it for CNN implementation on FPGA.
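As an illustration of this flow, the sketch below loads a frozen detection graph exported by the object detection API and runs one inference through the TF1-era session interface; the model directory name is illustrative, and the tensor names are those conventionally exposed by the API's exported models:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API, as used at the time of this work

# Illustrative path: a frozen model exported by the object detection API.
PB_PATH = "ssd_inception_v2_coco/frozen_inference_graph.pb"

# Load the frozen detection graph.
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PB_PATH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

# Run one inference on a stand-in camera frame.
with tf.Session(graph=detection_graph) as sess:
    frame = np.zeros((1, 300, 300, 3), dtype=np.uint8)
    boxes, scores, classes = sess.run(
        ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
        feed_dict={"image_tensor:0": frame})
```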

Tensorflow requires a powerful 64 bit OS for its deployment and is thus implemented on the Xilinx® Zynq® ZCU104 FPGA, which can support both 32 bit and 64 bit operating modes.

IV. EXPERIMENTAL SETUP

In this project a CNN is modelled in python using the Tensorflow framework, tested on an Intel® CPU platform, deployed on a Xilinx® Zynq® FPGA (in this case the ZCU104) and validated with a jupyter notebook deployed on the FPGA. Leveraging Tensorflow's object detection API, four pre-trained CNN models are obtained and optimised for real-time embedded vision applications. Because of Tensorflow's cross platform support, the necessary support files are installed and the CNN models are implemented on an Intel® CPU to verify their performance and ascertain the optimum trade-off between inference time and accuracy level required for real-time object detection.

A. Data-sets

The MS-COCO [16] dataset (training+validation) is used for training and testing the classifiers and meta-architectures of the CNN models. It is the largest dataset of common object interactions observed daily in the environment. It is a highly complex dataset: it contains over 330k images (of which >220k are labelled), 1.5 million instances of different objects and 80 different object classes, with 5 captions per image.

Before the training images are input to the system, they are pre-processed. This involves two steps: pre-processing of both the training and testing data, and pre-processing of only the training data incorporating data augmentation. Input pre-processing of images involves various techniques such as mean subtraction, normalisation etc.
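A minimal sketch of such input pre-processing is shown below; the exact statistics (fixed dataset means versus per-image statistics) depend on the training pipeline, so per-image values are assumed here:

```python
import numpy as np

def preprocess(image):
    """Sketch of the input pre-processing mentioned above:
    per-channel mean subtraction followed by normalisation.
    Per-image statistics are assumed for illustration."""
    image = image.astype(np.float32)
    image -= image.mean(axis=(0, 1))          # mean subtraction per channel
    image /= (image.std(axis=(0, 1)) + 1e-8)  # normalise to unit variance
    return image
```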
B. CNN Implementation on FPGA

After the optimisation of the CNN models, hardware is set up for their deployment on the FPGA. The Xilinx® Zynq® ZCU104 is an embedded vision low cost (EVLC) SoC based development platform that facilitates machine vision, autonomous driving, AR/VR etc. The ZCU104 is based on the Xilinx® Zynq® MPSoC XCZU7EV-2FFVC1156. The ZU7EV has an integrated quad core ARM® Cortex-A53 (PS) @ 1.5 GHz and a dual-core ARM® Cortex-R5 (RPU) @ 600 MHz. Such computational power on an FPGA provides a huge boost to heterogeneous multiprocessing in complex computer vision applications. The Xilinx ZCU104 along with the experimental setup is shown in Figure 5.

Fig. 5. Xilinx® Zynq® ZCU104 FPGA with the experimental setup.

The video input to the system is obtained using an e-con Systems® See3CAM USB camera. It is a 3.4 MP, super speed colour camera, supported by USB 3.0 with a reversible plug and play (UVC compliant) Type C connector. Key specifications of the camera are summarised in Table V.

TABLE V. KEY SPECIFICATIONS OF SEE3CAM USB CAMERA.

Parameter                                 | Value
Operating voltage and current             | 5V +/- 5%, 433 mA
OS supported                              | Linux and Windows
Supported video output formats            | UYVY, MJPEG
Streaming max power (1920x1080 at 60 FPS) | 2.165 W
Streaming min power (1920x1080 at 15 FPS) | 1.455 W
Responsivity                              | 2 V/Lux-sec
Operating temperature range               | 0 to 45 °C
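Since the See3CAM is UVC compliant, frames can be grabbed on the board with a standard V4L2-backed capture interface; a minimal OpenCV sketch is given below, assuming the camera enumerates as device 0:

```python
import cv2  # OpenCV, assumed available on the PYNQ image

cap = cv2.VideoCapture(0)                  # See3CAM as UVC device 0 (assumed)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)    # 1080p, per Table V
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)

ok, frame = cap.read()                     # one BGR frame for the detector
cap.release()
```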
Next, the Xilinx® Zynq® ZCU104 is booted with the PYNQ® image, and the necessary board files and other required dependencies are installed to support the execution of CNN on the ZCU104. First, install the downloadable PYNQ® image for the ZCU104 (ZCU104 v2.4 PYNQ image); this package contains the board specific overlays that support PYNQ®. Alternatively, the image file can be downloaded from www.pynq.io. The image is then written to a micro SD card, which is used to boot the FPGA board. Next, set the ZCU104 board jumpers to enable SD card boot mode, insert the SD card and connect the power cable to the board. Then connect the ZCU104 to a router; this will be used by the PC to connect to and program the board. Finally, open a web browser on the computer, go to https://pynq:9090/ and, from the Jupyter terminal, open the Jupyter notebook to be executed. In this case, the CNN is dynamically inferred and the results obtained are discussed in the next section.

V. RESULTS AND DISCUSSION

In this work four state-of-the-art object detectors are investigated, using pre-trained CNN models implemented with Tensorflow, trained on the MS-COCO dataset and deployed on the Xilinx® Zynq® ZCU104 FPGA using Xilinx® PYNQ®, to perform real-time object detection. Both image and video inputs are fed to the system. Different instances of object classes are chosen, viz. Sports (frisbee), Animal (dog, giraffe, zebra), Vehicle (bicycle), Person, Electronics (laptop, mouse) and Accessory (backpack), based on the label hierarchy of the MS-COCO dataset. In all, for each of the four object detectors, three different types of images and a video feed have been sampled 10 times each, giving a master sample set of 160 images in total.

A. Qualitative Analysis

The performance of the CNN models on the Xilinx® Zynq® ZCU104 exhibits an inference time comparable to an Intel® Core i7 CPU. Figure 6 shows the detection results of various object instances representing the best accuracy level for a given set of object classes, tested on three different types of images and a video feed for all four object detectors. Column (a) shows the reference images used in the study. Column (b) shows the results observed for the SSD + MobileNet v1 model, which has an inference time of 30 ms and 21 mAP. It is the fastest of the four models, but its accuracy is less than the maximum observed. Column (c) shows the images captured for the SSD + Inception v2 model, which has an inference time of 42 ms and 24 mAP. It is a fast model with an accuracy at par with the maximum of these four models. Column (d) shows the image results for the Faster R-CNN + Inception v2 model, with an inference time of 58 ms and 28 mAP. It is the most accurate model among the four under consideration, but due to the highest inference time it is also the slowest model.


Fig. 6. (a) Original images, (b) SSD + MobileNet v1 results, (c) SSD + Inception v2 results, (d) Faster RCNN + Inception v2 results, (e) SSD + MobileNet v1 w/fpn results. The last row shows images captured for the live video feed input to the system.

Column (e) shows the image results for SSD + MobileNet v1 w/fpn, with an inference time of 56 ms and 32 mAP. It is the slowest model and its accuracy is much less than the maximum obtained by the Faster R-CNN based model.

It can be seen that the input images have been directly captured with a USB camera in improper lighting to further test system robustness. The images chosen vary in colour, texture, contrast, print and object instance to test CNN inference effectively. The implemented CNN models are inherently good at feature extraction under different conditions such as occlusion, illumination, scale etc because of their invariance to translation, which is induced by the pooling layers, and their invariance to different illuminations, which is the result of pre-processing the CNN inputs through mean subtraction and normalisation. However, as can be observed from the output images, there is still a significant effect of illumination on the detected objects, because of the simplicity of the CNN architectures used. Hence, the robustness of a CNN architecture also depends upon how diverse the training data is and how complex the contextual information annotated into the training images is. Therefore, the CNN models have been trained on a large dataset like MS-COCO, which has objects in various orientations, present at different distances and viewed under different lighting conditions.

It is revealed from the results that despite having a higher mAP, SSD + MobileNet v1 w/fpn shows a lower accuracy level. The object detector algorithms are compared based on frame-level analysis, either in terms of mean average precision (mAP) or frames per second (FPS). These metrics measure the efficiency of CNN models applied to static image detection but provide only a partial analysis for real-time object detection applications. In other words, the complexity of real-time object detection is not fully captured by mAP and FPS.

It is also observed that the most accurate algorithms are slower, whereas faster algorithms are less accurate. This is due to the fact that in real-time object detection the algorithm may skip frames or detect a wrong instance of an object to maintain the speed requirements, thereby decreasing the overall accuracy of the CNN model. Furthermore, accurate object detectors show many bounding boxes in the same image, leading to over-prediction of objects, which in turn leads to a slow system response and is not suitable for real-time object detection applications.

In the case of real-time object detection on embedded systems, the three performance parameters are accuracy level, inference speed and architecture size. With constrained resources on embedded platforms, inference speed gains precedence over accuracy; hence a faster model is preferred over an accurate but slower one. A constrained system specification due to scarce on-board resources therefore leads to a sub-par accuracy level. Moreover, since the CNN models are programmed in python, which has a lower execution speed compared to C++, the inference speed is further decreased. Thus, an optimum trade-off among the system parameters is necessary to obtain the desired system performance.

Besides the qualitative analysis, a statistical analysis is also performed to support the observations made from the obtained results. A detailed quantitative analysis is provided in the next sub-section.

B. Quantitative Analysis

The maximum accuracy obtained for various class instances in all the test images for the four CNN object detectors is plotted in Figure 7. A comparison of the confidence scores of all the object classes for the four CNN models is shown in Table VI and plotted in Figure 8. A visual and statistical analysis of the obtained results reveals that Faster RCNN + Inception v2 is the most accurate model and the SSD + Inception v2 model has the second best accuracy along with a near real-time frame rate. Considering an optimised performance for real-time embedded applications, based on a trade-off between the inference speed and accuracy level of the CNN model, the SSD + Inception v2 model is most suitable for the intended application.


Fig. 7. Accuracy levels of various object instances for all 4 CNN models used for object detection, i.e. Model A - SSD + MobileNet v1, Model B - SSD + Inception v2, Model C - Faster RCNN + Inception v2 and Model D - SSD + MobileNet v1 w/fpn: (a) image 1 (Person, Dog, Frisbee), (b) image 2 (Zebra, Giraffe 1, Giraffe 2), (c) image 3 (Person, Bicycle), (d) video input (Laptop, Mouse, Backpack).

Fig. 8. Comparative analysis of the confidence scores (Table VI) of classes C1-C10 for the 4 CNN models.

TABLE VI. CONFIDENCE SCORES FOR ALL 10 OBJECT CLASSES.

Object Instance - Class (Source) | Model A | Model B | Model C | Model D
C1  - Person (Img 1)             | 71      | 66      | 98      | 75
C2  - Dog (Img 1)                | 81      | 76      | 99      | 63
C3  - Frisbee (Img 1)            | 0       | 62      | 95      | 64
C4  - Zebra (Img 2)              | 79      | 98      | 99      | 87
C5  - Giraffe (Img 2)            | 71      | 84      | 99      | 77
C6  - Person (Img 3)             | 82      | 98      | 98      | 53
C7  - Bicycle (Img 3)            | 71      | 82      | 91      | 74
C8  - Laptop (Video)             | 87      | 94      | 99      | 67
C9  - Mouse (Video)              | 63      | 60      | 97      | 67
C10 - Backpack (Video)           | 62      | 61      | 53      | 0
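As a simple aggregate view of Table VI, the sketch below averages the tabulated confidence scores per model; the values are copied from the table and reproduce the ranking discussed above (Model C first, Model B second):

```python
# Confidence scores copied from Table VI (classes C1..C10).
scores = {
    "A (SSD + MobileNet v1)":         [71, 81,  0, 79, 71, 82, 71, 87, 63, 62],
    "B (SSD + Inception v2)":         [66, 76, 62, 98, 84, 98, 82, 94, 60, 61],
    "C (Faster RCNN + Inception v2)": [98, 99, 95, 99, 99, 98, 91, 99, 97, 53],
    "D (SSD + MobileNet v1 w/fpn)":   [75, 63, 64, 87, 77, 53, 74, 67, 67,  0],
}
for model, s in scores.items():
    print(f"Model {model}: mean confidence = {sum(s) / len(s):.1f}")
```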
VI. CONCLUSION

This work successfully demonstrates an end-to-end deployment of CNN on an ARM® embedded Xilinx® Zynq® based FPGA using the Xilinx® PYNQ® framework. A detailed analysis of software-hardware co-design is carried out, highlighting the significance of the inference speed and accuracy level trade-off for real-time applications implemented on embedded platforms. The evaluation and validation of the four object detectors implemented on the FPGA is successful, generating the expected results for different image and video feed inputs. Object detector algorithms are analysed using mean average precision (mAP) and frames per second (FPS); however, these metrics provide a reasonable prediction for CNN models applied to static images but are inadequate for real-time object detection applications. It is concluded from the results that the SSD + Inception v2 model is optimally accurate with a near real-time frame rate, and thus suitable for real-time applications on embedded platforms. The suggested setup has exhibited reasonably efficient performance, which is essential for rapid embedded design development.

REFERENCES
[1] J. Zacharias, M. Barz and D. Sonntag, "A survey on deep learning toolkits and libraries for intelligent user interfaces," arXiv preprint arXiv:1803.04818, Mar. 2018.
[2] D. F. Bacon, R. M. Rabbah and S. Shukla, "FPGA Programming for the Masses," Commun. ACM, vol. 56, no. 4, pp. 56-63, Feb. 2013.
[3] PYNQ Homepage. Available at: http://www.pynq.io/.
[4] H. Mao, S. Yao, T. Tang, B. Li, J. Yao and Y. Wang, "Towards Real-Time Object Detection on Embedded Systems," IEEE Transactions on Emerging Topics in Computing, vol. 6, no. 3, pp. 417-431, July-Sept. 2018. doi: 10.1109/TETC.2016.2593643.
[5] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, Apr. 2017.
[6] C. Szegedy et al., "Going deeper with convolutions," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1-9. doi: 10.1109/CVPR.2015.7298594.
[7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu and A. C. Berg, "SSD: Single Shot MultiBox Detector," in European Conference on Computer Vision, Oct. 2016, pp. 21-37. Springer, Cham.
[8] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, June 2017. doi: 10.1109/TPAMI.2016.2577031.
[9] Google TensorFlow API Homepage. Available at: https://www.tensorflow.org/api_docs.
[10] Tensorflow-on-raspberry-pi Homepage. Available at: https://github.com/samjabrahams/tensorflow-on-raspberry-pi.
[11] A. G. Schmidt, G. Weisz and M. French, "Evaluating Rapid Application Development with Python for Heterogeneous Processor-Based FPGAs," 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, 2017, pp. 121-124. doi: 10.1109/FCCM.2017.45.
[12] E. Wang, J. J. Davis and P. Y. K. Cheung, "A PYNQ-Based Framework for Rapid CNN Prototyping," 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Boulder, CO, 2018, pp. 223-223. doi: 10.1109/FCCM.2018.00057.
[13] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[14] S. I. Venieris and C. Bouganis, "fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs," 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Washington, DC, 2016, pp. 40-47. doi: 10.1109/FCCM.2016.22.
[15] F. F. Li, J. Johnson and S. Yeung, "Lecture 11: Detection and Segmentation," 2017. Available at: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf.
[16] MS-COCO dataset Homepage. Available at: http://cocodataset.org/#home.
[17] CS231n: Convolutional Neural Networks for Visual Recognition. Lectures from Stanford University. Available at: http://cs231n.stanford.edu/.
[18] Deep Learning for Visual Computing. Lectures from NPTEL. Available at: https://nptel.ac.in/courses/108105103/26.
