
Convolutional Neural Network Layers Implementation on
Low-cost Reconfigurable Edge Computing Platforms

Project Report to be submitted in partial fulfillment of
the requirements for the degree
of
Bachelor of Technology in Electronics and Electrical Communication Engineering

by

Sarthak Goyal
19EC37004

Under the guidance of

Professor Indrajit Chakrabarti

ELECTRONICS AND ELECTRICAL COMMUNICATION ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
Department of Electronics and
Electrical Communication Engineering
Indian Institute of Technology,
Kharagpur
India - 721302

CERTIFICATE

This is to certify that we have examined the thesis entitled Convolutional Neural
Network Layers Implementation on Low-cost Reconfigurable Edge Computing
Platforms, submitted by Sarthak Goyal (Roll Number: 19EC37004), an
undergraduate student of the Department of Electronics and Electrical
Communication Engineering, in partial fulfillment of the requirements for the award
of the degree of Bachelor of Technology in Electronics and Electrical Communication
Engineering. We hereby accord our approval of it as a study carried out and presented
in a manner required for its acceptance in partial fulfillment of the degree for which
it has been submitted. The thesis has fulfilled all the requirements as per the
regulations of the Institute and has reached the standard needed for submission.

Supervisor
Department of Electronics and
Electrical Communication
Engineering
Indian Institute of Technology,
Kharagpur

Place: Kharagpur
Date:

ACKNOWLEDGEMENTS

I would sincerely like to thank my thesis supervisor, Professor Indrajit Chakrabarti.
He presented me with an opportunity to work on a problem of my own design and
even went out of his way to assist me in tackling issues that do not directly overlap
with his ongoing research.

I would like to thank my parents, without whose consistent support I would
perhaps not have been a part of this institution. They have provided continuous
encouragement throughout my years of study at IIT Kharagpur. I also thank my
friends who have provided pointers and helped me refine my work.

Sarthak Goyal
IIT Kharagpur
Date:

ABSTRACT

Deep learning, a branch of machine learning, has become increasingly popular.
Computer vision and image recognition workloads in particular are well suited to
deep learning. Layer by layer, the Convolutional Neural Networks (CNNs) used in
deep learning learn a set of weights and biases that recognise important features in
an image. As CNN performance demands continue to rise, custom hardware
accelerators offer a way to meet them. Field Programmable Gate Arrays (FPGAs), a
type of specialised hardware accelerator, are a potentially excellent choice for
powering CNNs.

The widespread use of Internet of Things (IoT) enabled applications offers low-cost
FPGA devices a new opportunity to function as edge computing neural network nodes.
Although neural network development environments are offered by FPGA vendors,
they frequently focus on high-end devices, and these development platforms are less
user-friendly than their software counterparts.

We intend to implement a Deep Neural Network design on an FPGA and then compare
the on-board resources it uses for different data widths and activation functions,
followed by timing analysis. Implementation results show that the DNNs generated
by the platform achieve accuracy very close to software implementations while giving
throughput an order of magnitude higher than other edge computing devices at a
lower energy footprint. Further, we implement the convolution layer of CNNs and
analyse its resource utilization. Once implemented, these two operation layers can be
used together for much bigger models in the future.

Keywords: Deep Neural Networks, Convolutional Neural Networks, Image
Classification, Field Programmable Gate Arrays (FPGA), Parallelism and Pipelining,
Hardware-Software Co-Design

CONTENTS

1. Introduction to CNN and Hardware Acceleration . . . . . . . . . . . . . . . . . . . 6


1.1 Machine Learning and Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Convolution Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Hardware Accelerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2. Deep Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Rectified Linear Units (ReLU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Sigmoid Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3. Hardware Implementation of Fully Connected Layer . . . . . . . . . . . . . . . 14
3.1 Number Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 Neuron Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Layer Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4. Hardware Implementation of Convolution Operation . . . . . . . . . . . . . . 18
4.1 Line Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Multiply and Accumulate Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3 Control Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.4 Top Module and Test Bench . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1 Fully Connected Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.1 Dataset used and its Software implementation Accuracy . . . . . 23
5.1.2 Resource Utilization and Timing analysis for a single neuron . . 24
5.1.3 Resource Utilization and Timing analysis for the whole DNN . . 26
5.2 Convolution Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6. Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Chapter 1

Introduction to Convolutional Neural Networks and Hardware Acceleration

1.1 Machine Learning and Deep Learning

Machine learning, an application of Artificial Intelligence (AI), gives a system the
capacity to learn from experience and improve over time without being explicitly
programmed. Machine learning uses data to train and produce accurate results. The
goal of machine learning is to create computer software that can access data and use
it to learn on its own.

A neural network is a network or circuit of biological neurons or, in the modern
sense, an artificial neural network made of artificial neurons or nodes. A neural
network can therefore be either a biological neural network consisting of biological
neurons or an artificial neural network intended to address artificial intelligence
(AI) problems. Artificial neural networks model biological neuron connections as
weights between nodes.

Deep Learning [9] is a subclass of Machine Learning built on artificial neural
networks, including convolutional and recurrent variants. Although such networks
contain many more layers of algorithms, they are trained in much the same way as
other machine learning models. The term "artificial neural network" refers to the
entire collection of these algorithmic network structures. In basic terms, deep
learning mimics how the human brain functions, in which all of the neurons are
interconnected, and this is exactly the idea behind it. With the aid of these
procedures and algorithms, it can resolve complicated problems.

Figure 1.1: Illustration of Deep Learning Neural Network Layers

1.2 Image Classification

Contextual image classification, a topic of pattern recognition in computer vision, is
an approach to classification based on contextual information in images.
"Contextual" means that this approach focuses on the relationship between nearby
pixels, which is also called the neighborhood. The goal of this approach is to classify
images by using their contextual information (or to give a probability of the image
being part of a 'class'). A class is essentially a label, for instance 'car', 'animal',
'building' and so on. [3]

You might enter a picture of a sheep, for instance. Image classification is the process
of having a computer analyse an image and tell you that it is a sheep (or give the
likelihood that a sheep is present). Classifying photos is nothing new to humans, but
when it comes to machines it is the ideal illustration of Moravec's paradox: what we
find simple is complex for AI.

Raw pixel data was the foundation of early image classification. This implied that
computers would dissect images into their component pixels. The issue is that the
same subject can appear substantially different in two distinct photographs: they
may have various backdrops, perspectives, poses, and so on. This made it very
difficult for computers to accurately "see" and classify images. [2]

Figure 1.2: Illustration of Image Classification

1.3 Convolution Neural Networks

1.3.1 Layers of Convolution Neural Networks

Convolutional Neural Networks are composed of multiple layers. The first part
consists of convolutional and max-pooling layers, which act as the feature extractor.
The second part consists of the fully connected layer, which performs non-linear
transformations of the extracted features and acts as the classifier. A brief overview
of the layers is presented. [3]

Convolution Layer: A convolution layer takes a certain number of input feature
maps, slides a kernel of size K × K over them and produces output feature maps. The
kernel depth is the same as the number of input feature maps, and the number of
kernels (also known as filters) decides the number of output feature maps. The
kernel is moved across the input in steps of a shifting window referred to as the
stride S. There are multiple convolution layers in any DCNN.
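The spatial size of the output feature maps follows directly from the kernel size K and the stride S. As a standard reference (assuming a zero-padding of P pixels, which is not discussed above), the output width and height are

$W_{out} = \left\lfloor \frac{W_{in} - K + 2P}{S} \right\rfloor + 1, \qquad H_{out} = \left\lfloor \frac{H_{in} - K + 2P}{S} \right\rfloor + 1$

For example, a 512 × 512 input convolved with a 3 × 3 kernel at stride 1 and no padding yields a 510 × 510 output feature map.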

Figure 1.3: Illustration of Convolution Operation

Pooling Layer: The purpose of pooling is to create spatial invariance by subsampling
adjacent pixels. Fig. 1.4 shows the commonly used pooling schemes, i.e. average
pooling and max pooling.

Figure 1.4: Illustration of Max Pooling

Fully Connected Layer: The fully connected layer (FC) operates on a flattened input
where each input is connected to all neurons. If present, FC layers are usually found
towards the end of CNN architectures and can be used to optimize objectives such
as class scores.

Figure 1.5: Illustration of Fully Connected Layer

1.4 Hardware Accelerator

Hardware acceleration combines the flexibility of general-purpose processors, such
as CPUs, with the efficiency of fully tailored hardware, such as GPUs and ASICs, in
order to increase efficiency beyond what is possible when an application runs higher
up the hierarchy of digital computing systems. For instance, visualisation tasks
could be delegated to a graphics card to speed up and improve the playback of videos
and games while freeing up the CPU for other work. The most common hardware
used for acceleration includes:

Graphics Processing Units (GPUs): originally designed for rendering and
manipulating images, GPUs are now used for calculations involving massive
amounts of data, accelerating portions of an application while the rest continues to
run on the CPU. The massive parallelism of modern GPUs allows users to process
billions of records in very little time.

Application-Specific Integrated Circuits (ASICs): an integrated circuit customized
specifically for a particular purpose or application, improving overall speed as it
focuses solely on performing its one function. Maximum complexity in modern
ASICs has grown to over 100 million logic gates.

Field Programmable Gate Arrays (FPGAs): a semiconductor integrated circuit,
specified in a hardware description language (HDL), designed to allow the user to
configure a large majority of its electrical functionality. FPGAs can be used to
accelerate parts of an algorithm, sharing part of the computation between the FPGA
and a general-purpose processor.

Figure 1.6: FPGA layout - main blocks of modern FPGAs

1.5 Problem Statement

Among the various operations involved in modelling Convolutional Neural
Networks, the convolution operation and the fully connected layer play a key role.
The idea is to first implement the fully connected layer using different activation
functions and compare them; we have then also implemented the convolution layer.
Both of these layers together can be used for much bigger models.

Chapter 2

Deep Neural Network


The fully connected layers in a convolutional network are practically a multilayer
perceptron (generally a two- or three-layer MLP) that aims to map the
$m_1^{(l-1)} \times m_2^{(l-1)} \times m_3^{(l-1)}$ activation volume from the
combination of the previous layers into a class probability distribution. Thus, the
output layer of the multilayer perceptron will have $m_1^{(l-i)}$ outputs, i.e. output
neurons, where $i$ denotes the number of layers in the multilayer perceptron.

The key difference from a standard multilayer perceptron is the input layer, where
an activation volume is taken as the input instead of a vector. As a result, the fully
connected layer is defined as:

$Y_i^{(l)} = f\!\left(z_i^{(l)}\right)$ with $z_i^{(l)} = \sum_{j=1}^{m_1^{(l-1)}} w_{i,j}^{(l)}\, Y_j^{(l-1)}$, if $l-1$ is a fully connected layer;

$Y_i^{(l)} = f\!\left(z_i^{(l)}\right)$ with $z_i^{(l)} = \sum_{j=1}^{m_1^{(l-1)}} \sum_{r=1}^{m_2^{(l-1)}} \sum_{s=1}^{m_3^{(l-1)}} w_{i,j,r,s}^{(l)}\, \left(Y_j^{(l-1)}\right)_{r,s}$, otherwise.

The goal of the complete fully connected structure is to tune the weight parameters
$w_{i,j}^{(l)}$ or $w_{i,j,r,s}^{(l)}$ to create a stochastic likelihood representation of
each class based on the activation maps generated by the concatenation of
convolutional, non-linearity, rectification and pooling layers. Individual fully
connected layers operate identically to the layers of the multilayer perceptron, with
the only exception being the input layer.
It is noteworthy that the function $f$ once again represents the non-linearity;
however, in a fully connected structure the non-linearity is built into the neurons
and is not a separate layer.

2.1 Rectified Linear Units (ReLU)

The rectified linear units (ReLUs) are a special implementation that combines the
non-linearity and rectification layers in convolutional neural networks. A rectified
linear unit (i.e. thresholding at zero) is a piecewise linear function defined as:

$Y_i^{(l)} = \max\!\left(0,\, Y_i^{(l-1)}\right)$
The rectified linear units come with three significant advantages in convolutional
neural networks compared to the traditional logistic or hyperbolic tangent
activation functions:

• Rectified linear units propagate the gradient efficiently and therefore reduce the
likelihood of the vanishing gradient problem that is common in deep neural
architectures.
• Rectified linear units threshold negative values to zero and therefore solve the
cancellation problem, as well as resulting in a much sparser activation volume at
their output. The sparsity is useful for multiple reasons but mainly provides
robustness to small changes in input such as noise.
• Rectified linear units consist of only simple operations in terms of computation
(mainly comparisons) and are therefore much more efficient to implement in
convolutional neural networks.

As a result of its advantages and performance, most of the recent architectures of
convolutional neural networks utilize only rectified linear unit layers (or derivatives
such as noisy or leaky ReLUs) as their non-linearity layers instead of traditional
non-linearity and rectification layers.

Figure 2.1: Rectified Linear Unit

2.2 Sigmoid Function

The sigmoid function is a fundamental component of artificial neural networks and
is crucial in many machine-learning applications. The sigmoid function is defined
mathematically as 1/(1+e^(-x)), where x is the input value and e is the mathematical
constant approximately equal to 2.718. The function maps any input value to a value
between 0 and 1, making it useful for binary classification and logistic regression
problems. The range of the function is (0, 1) and its domain is (-infinity, +infinity).
The sigmoid function is commonly used as an activation function in artificial
neural networks. In feedforward neural networks, the sigmoid function is applied
to each neuron’s output, allowing the network to introduce non-linearity into the
model. This nonlinearity is important because it allows the neural network to
learn more complex decision boundaries, which can improve its performance on
specific tasks.
An advantage of the sigmoid function is that it produces output values between 0
and 1, which can be helpful for binary classification and logistic regression
problems. It is also differentiable, meaning that its derivative can be calculated, so
it is easy to optimize the network by adjusting the weights and biases of the neurons.
A disadvantage is that it saturates at output values close to 0 or 1, which can cause
problems for the optimization algorithm: the gradient of the sigmoid function
becomes very small near output values of 0 or 1, which makes it difficult for the
optimization algorithm to adjust the weights and biases of the neurons.
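This saturation behaviour follows directly from the derivative of the sigmoid, which is a standard result (not specific to this report):

$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\left(1 - \sigma(x)\right)$

The derivative attains its maximum value of only 0.25 at x = 0 and approaches zero as the output approaches 0 or 1, which is exactly the small gradient that slows down the adjustment of weights and biases near saturated outputs.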

Figure 2.2: Sigmoid Function

Chapter 3

Hardware Implementation of Fully Connected Layer

3.1 Number Representation

Floating point representation helps in representing large numbers and may provide
better precision, but its implementation and manipulation in hardware are difficult.
Moreover, it consumes so many resources that we would not be able to implement
more than a few tens of neurons on a platform like the ZedBoard.

So, we will be going with fixed point representation. Since the input values used in
neural networks are generally normalized (between 0 and 1, or -1 and 1), there will
not be an issue of being unable to represent large numbers. There may be a slight
degradation in accuracy, but if no overflow occurs, a 32-bit fixed point
representation will give better performance than a 32-bit floating point
representation. This representation is highly flexible and can be parameterized
depending upon the target application. In this representation we have to specify the
total number of bits, the number of bits representing the integer part and the
number of bits representing the fractional part. So, by fixing the number of bits used
for the representation, we trade off accuracy against resource utilization.
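As an illustration of this idea, the following sketch shows a parameterized signed fixed-point multiplier in Verilog. The module and parameter names (DATA_WIDTH, FRAC_WIDTH) are placeholders chosen for this example and are not necessarily those used in the actual design.

// Illustrative sketch: parameterized signed fixed-point multiplication.
// DATA_WIDTH and FRAC_WIDTH are example names, not the design's own.
module fixed_mult #(
    parameter DATA_WIDTH = 16,   // total bits per operand
    parameter FRAC_WIDTH = 12    // bits used for the fractional part
) (
    input  signed [DATA_WIDTH-1:0] a,
    input  signed [DATA_WIDTH-1:0] b,
    output signed [DATA_WIDTH-1:0] p
);
    // The full product has 2*DATA_WIDTH bits and 2*FRAC_WIDTH fractional
    // bits; shifting right by FRAC_WIDTH re-aligns it to the original
    // Q(DATA_WIDTH-FRAC_WIDTH).FRAC_WIDTH format (overflow is ignored here).
    wire signed [2*DATA_WIDTH-1:0] full = a * b;
    assign p = full >>> FRAC_WIDTH;   // arithmetic shift preserves the sign
endmodule

With DATA_WIDTH = 16 and FRAC_WIDTH = 12, for instance, 1.5 is stored as 16'h1800 and 0.25 as 16'h0400, and their product comes out as 16'h0600, i.e. 0.375, provided the integer part does not overflow.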

3.2 Activation Functions

Many neural networks use non-linear functions such as the sigmoid (1/(1+e^(-x)))
or the hyperbolic tangent ((e^x - e^(-x))/(e^x + e^(-x))) as activation functions.
Building digital circuits that generate these functions is very challenging, and such
circuits would be very resource intensive. Hence, we pre-calculate their values
(since we know the range of the input x) and store them in a ROM. These ROMs are
also called Look Up Tables (LUTs). These LUT ROMs are built using either block
RAMs or distributed RAMs (FFs and LUTs).

We will not instantiate Xilinx IP cores (RAMs, DSP slices etc.) when building the
neural network, as this brings more flexibility to the design. If we used an IP core,
for example, to design a ROM for the activation function and later decided to change
the size of the ROM, we would not be able to do so by just changing a parameter; we
would have to go through the Xilinx IP core generator. These cores are, however,
highly efficient in implementation. For example, block RAMs are much more
efficient than distributed RAMs when making ROMs. Hence, we will use a specific
coding style which will infer the appropriate IPs instead of directly instantiating
them.
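As an illustration of this coding style, the sketch below describes a sigmoid look-up table as a plain Verilog array initialised from a file, which Vivado can infer as a block RAM (for a registered read) or as distributed RAM (for a combinational read) without instantiating an IP core. The module name, parameter names and initialisation file are placeholders for this example, not the exact ones used in the design.

// Sketch of an inferred ROM holding pre-computed sigmoid samples.
module sigmoid_lut #(
    parameter ADDR_WIDTH = 10,   // log2 of the LUT depth
    parameter DATA_WIDTH = 16
) (
    input                       clk,
    input      [ADDR_WIDTH-1:0] addr,   // quantised pre-activation value
    output reg [DATA_WIDTH-1:0] dout    // stored sigmoid sample
);
    reg [DATA_WIDTH-1:0] mem [0:(1<<ADDR_WIDTH)-1];

    initial begin
        // Samples generated offline (for example by a small script).
        $readmemb("sigmoid_contents.mem", mem);
    end

    // A registered (synchronous) read lets the tool map the array onto a
    // block RAM; reading combinationally would infer distributed RAM instead.
    always @(posedge clk)
        dout <= mem[addr];
endmodule

Changing the depth or width of this ROM is then just a matter of changing a parameter, which is exactly the flexibility argued for above.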

3.3 System Architecture

We will be using the ZedBoard Zynq Evaluation and Development Kit
(xc7z020clg484-1) to run our implementation. Fig. 3.1 depicts the complete system
architecture generated when targeting the Zynq platform. The DNN is packaged as
an IP core (called ZyNet) and automatically integrated with the other peripherals.
The AXI4-Lite interface of ZyNet is connected to the GP0 (General Purpose 0)
interface of the Zynq PS (processing system). This interface is used for configuration
in the case of untrained networks, and for reading out the final network output in
the case of both pre-trained and on-board trained networks.

Figure 3.1: System Architecture

The AXI4-Stream interface of ZyNet is connected to a DMA controller, which in turn
is connected to the external memory through the Zynq HP0 (high performance 0)
interface. This enables training and test data to be directly streamed from external
memory to ZyNet. The DMA controller is also interfaced with the Zynq GP0 port for
configuration. Interrupt signals from both ZyNet and the DMA controller are
connected to the PS interrupt interface.
3.4 Neuron Architecture
The architecture of a single artificial neuron used by ZyNet is as shown in
Fig. 3.2. Each neuron has independent interfaces for configuration (weight, bias
etc.) and data. Irrespective of number of predecessors, each neuron has a single
interface for accepting data. This enables scalability of the network and improves
clock performance by compromising on latency.
An internal memory, whose size is decided by the number of inputs to the
neuron, is used to store the weight values corresponding to each input.
Depending on whether the network is configured as pre-trained or not, either a RAM
with read and write interfaces or a ROM initialized with the weight values is
instantiated.
As inputs are streamed into the neuron, a control logic reads the
corresponding weight value from the memory. Inputs and corresponding weight
values are multiplied and accumulated (MAC) and finally added with the bias
value. Like weights, bias values are stored in registers at implementation time if
the network is pre-trained or configured at runtime from software.
The output from the MAC unit is finally applied to the activation unit. Based
on the type of activation function configured (Sigmoid, ReLU, hardMax etc.),
either a look-up-table based (for Sigmoid) or a circuit-based function is
implemented by the tool. The type of function chosen has a direct impact on the
accuracy of the network and the total resource utilization and clock performance.
The depth of the LUT for Sigmoid function can be optionally specified by the user
or the tool can automatically determine it.
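A heavily simplified sketch of such a neuron is given below for a pre-trained network with a ReLU activation. All names, widths and the handshake are illustrative, and the configuration interface, the sigmoid LUT option and the fixed-point re-alignment described in Section 3.1 are omitted.

// Simplified streaming neuron (illustrative only): one input per clock,
// weights in an inferred ROM, bias added at the end, ReLU activation.
module neuron #(
    parameter NUM_INPUTS = 784,
    parameter DATA_WIDTH = 16
) (
    input                              clk,
    input                              rst,
    input                              in_valid,
    input  signed [DATA_WIDTH-1:0]     in_data,
    input  signed [DATA_WIDTH-1:0]     bias,
    output reg                         out_valid,
    output reg signed [DATA_WIDTH-1:0] out_data
);
    // Weight memory, one entry per predecessor neuron (pre-trained values).
    reg signed [DATA_WIDTH-1:0] weights [0:NUM_INPUTS-1];
    initial $readmemb("weights.mem", weights);

    // Truncating ReLU: negative sums become zero (fixed-point re-alignment
    // and saturation are omitted for brevity).
    function signed [DATA_WIDTH-1:0] relu(input signed [2*DATA_WIDTH-1:0] x);
        relu = (x < 0) ? {DATA_WIDTH{1'b0}} : x[DATA_WIDTH-1:0];
    endfunction

    reg [$clog2(NUM_INPUTS)-1:0]  idx;   // index of the input being processed
    reg signed [2*DATA_WIDTH-1:0] acc;   // running multiply-accumulate value

    wire signed [2*DATA_WIDTH-1:0] sum_next = acc + in_data * weights[idx];

    always @(posedge clk) begin
        out_valid <= 1'b0;
        if (rst) begin
            idx <= 0;
            acc <= 0;
        end else if (in_valid) begin
            if (idx == NUM_INPUTS-1) begin
                // Last input of the frame: add the bias, apply the activation.
                out_data  <= relu(sum_next + bias);
                out_valid <= 1'b1;
                idx       <= 0;
                acc       <= 0;
            end else begin
                acc <= sum_next;
                idx <= idx + 1;
            end
        end
    end
endmodule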

Figure 3.2: Neuron Architecture

3.5 Layer Architecture

Each layer instantiates a user-specified number of neurons and manages data
movement between the layers. Since each neuron has a single data interface and a
fully connected layer requires a connection to every neuron of the previous layer,
the data from each layer is initially stored in a shift register. It is then shifted into
the next layer one value per clock cycle, as shown in Fig. 3.3. Connections between
layers and integration with the input and output AXI interfaces are automatically
implemented by the tool.
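A minimal sketch of that data movement is shown below: the outputs of one layer are captured in parallel and then shifted out one value per clock cycle onto the single data input shared by the neurons of the next layer. The names and widths are illustrative only.

// Illustrative output shift register for a layer with N neurons: the N
// outputs are loaded in parallel and shifted out one per clock cycle.
module layer_shift #(
    parameter N          = 30,
    parameter DATA_WIDTH = 16
) (
    input                     clk,
    input                     load,                 // all neuron outputs ready
    input  [N*DATA_WIDTH-1:0] layer_out,            // concatenated outputs
    output                    data_valid,
    output [DATA_WIDTH-1:0]   data_to_next_layer
);
    reg [N*DATA_WIDTH-1:0] shift;
    reg [$clog2(N+1)-1:0]  count;

    always @(posedge clk) begin
        if (load) begin
            shift <= layer_out;
            count <= N;
        end else if (count != 0) begin
            shift <= shift >> DATA_WIDTH;   // expose the next value
            count <= count - 1;
        end
    end

    assign data_valid         = (count != 0);
    assign data_to_next_layer = shift[DATA_WIDTH-1:0];
endmodule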

Figure 3.3: Layer Architecture

Chapter 4

Hardware Implementation of
Convolution Operation
Convolutional Neural Network implementation on a Field Programmable Gate
Array is the ultimate goal. For that, we need a functioning implementation of the
convolution operation. Since hardware is handled differently than in a normal
software implementation, the convolution operation is designed differently. This
design employs a module-wise behavioural implementation to accomplish the
desired operation and was created using the Verilog Hardware Description
Language and the Xilinx Vivado suite. [1][10]

Figure 4.1: Flow Diagram of Convolution Operation

To perform the convolution process, the structure has four modules that are
implemented in Vivado:

1. Line Buffer
2. Multiply and Accumulate Unit
3. Control Unit
4. Top Module

4.1 Line Buffer

The Line Buffer is basically a region of memory that is used to store data temporarily
between operations. In this case, it holds an entire row of image data at once before
it is sent to the Multiply and Accumulate Unit.

The output width of this line buffer is chosen based on the size of the kernel being
implemented. The most important thing to keep in mind is that, while the output is
a continuous stream of bits, the internal structure is made up of registers that store
colour values, ordered so as to hold the necessary number of pixels in a given row.

For this particular implementation, the parameters have been decided and are
listed here:

Image Width: 512
Image Color Encoding: 8 bits
Kernel Size: 3 × 3

Based on this, the output width is Output = 3 × ColorEncoding = 24 bits.

The other inputs are the following signals:


• Input Clock Signal: To synchronize line buffer operation
• Reset Signal: Reset buffer state to 0 at the start of the operation
• Input Data Valid Signal: Flag Signal to allow the line buffer to start reading
from Image Pixel Data
• Read Data Signal: Control the output of the line buffer
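A minimal sketch of such a line buffer, using the parameters listed above (512-pixel rows, 8-bit pixels, 3 × 3 kernel), is given below. The signal and module names are illustrative rather than the exact ones used in the project, and boundary handling at the row edges is not shown.

// Illustrative line buffer: stores one image row and exposes a 3-pixel
// (24-bit) window column for the 3x3 kernel.
module line_buffer #(
    parameter IMG_WIDTH   = 512,
    parameter PIXEL_WIDTH = 8
) (
    input                          clk,
    input                          rst,          // clears the pointers
    input                          i_data_valid, // incoming pixel is valid
    input      [PIXEL_WIDTH-1:0]   i_data,       // incoming pixel value
    input                          i_rd_data,    // advance the read position
    output     [3*PIXEL_WIDTH-1:0] o_data        // 3 adjacent pixels (24 bits)
);
    // Registers holding one full row of the image.
    reg [PIXEL_WIDTH-1:0] line [0:IMG_WIDTH-1];

    reg [$clog2(IMG_WIDTH)-1:0] wr_ptr;
    reg [$clog2(IMG_WIDTH)-1:0] rd_ptr;

    // Write side: store the streamed pixels of the current row.
    always @(posedge clk) begin
        if (rst)
            wr_ptr <= 0;
        else if (i_data_valid) begin
            line[wr_ptr] <= i_data;
            wr_ptr <= wr_ptr + 1;
        end
    end

    // Read side: step through the row as the kernel window advances.
    always @(posedge clk) begin
        if (rst)
            rd_ptr <= 0;
        else if (i_rd_data)
            rd_ptr <= rd_ptr + 1;
    end

    // Three horizontally adjacent pixels, packed into one 24-bit word
    // (end-of-row boundary handling is not shown in this sketch).
    assign o_data = {line[rd_ptr], line[rd_ptr+1], line[rd_ptr+2]};
endmodule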

To validate the correctness of the implemented design, a simulation of this module
was performed using Vivado with the target set to a Zynq FPGA board. The
simulation results are shown in Fig. 4.2 below.

Figure 4.2: Implementation of Line Buffer in Vivado

4.2 Multiply and Accumulate Unit

The multiply-accumulate (MAC) operation in computer architecture computes the
product of two numbers and adds that product to an accumulator; it is used
particularly in the area of digital signal processing. The hardware component that
executes the operation is called a multiplier-accumulator (MAC unit). A MAC unit is
employed in a CNN's convolution procedure.

The advantage of performing a MAC operation using a Hardware Description
Language is that all the multiplication and addition operations occur in parallel. The
inbuilt adder and multiplier IPs are used for performing the signed multiplications
and the subsequent additions. The speed of the convolution operation hence
depends on the speed of the MAC unit. This provides the user a unique opportunity
to try a variety of algorithms.

Once again, the width of the multiplier and adder depends on the kernel size. For
this particular implementation, since we are using a 3 × 3 kernel, we require 9
parallel data paths. [8]

The inputs/outputs of this MAC module are as follows:

• Input Clock Signal: To synchronize the MAC operation
• Input Pixel Data: The 72-bit continuous pixel data corresponding to 9 image
pixels, which are sent to the different data paths as required
• Input Data Valid Signal: Flag Signal to allow the MAC unit to start its operation
• Output Convolved Data: A single 8-bit pixel value calculated at the end of the MAC
operation
• Output Convolved Data Valid Signal: Flag Signal to let the Output Buffer know to
latch onto the output
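A condensed sketch of such a MAC unit for the 3 × 3 case is shown below: nine signed multiplications are performed in parallel and summed, and the result is registered. The kernel values, names and the simple truncation at the output are illustrative; the additional processing needed for the Sobel kernel, mentioned next, is not included.

// Illustrative 3x3 multiply-and-accumulate unit: nine parallel signed
// multiplications followed by an adder tree, producing one output pixel.
module mac_3x3 #(
    parameter PIXEL_WIDTH = 8
) (
    input                        clk,
    input                        i_data_valid,
    input  [9*PIXEL_WIDTH-1:0]   i_pixels,   // 72-bit window: 9 pixels
    output reg [PIXEL_WIDTH-1:0] o_data,     // convolved output pixel
    output reg                   o_valid
);
    // Example kernel with all coefficients set to 1 (a Sobel or other kernel,
    // with signed coefficients, would be substituted here).
    wire signed [PIXEL_WIDTH:0] kernel [0:8];
    genvar k;
    generate
        for (k = 0; k < 9; k = k + 1) begin : GEN_KERNEL
            assign kernel[k] = 1;
        end
    endgenerate

    // Nine parallel signed multiplications summed combinationally.
    reg signed [2*PIXEL_WIDTH+4:0] sum;
    integer i;
    always @(*) begin
        sum = 0;
        for (i = 0; i < 9; i = i + 1)
            // The unsigned pixel is extended with a leading 0 so that the
            // signed multiplication interprets it correctly.
            sum = sum + $signed({1'b0, i_pixels[i*PIXEL_WIDTH +: PIXEL_WIDTH]})
                        * kernel[i];
    end

    // Register the result; normalisation and saturation are design-specific
    // and are only hinted at here by simple truncation.
    always @(posedge clk) begin
        o_data  <= sum[PIXEL_WIDTH-1:0];
        o_valid <= i_data_valid;
    end
endmodule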

The output of the MAC unit can undergo further processing in the case of the Sobel
kernel, which is used as the test case in this report. The FPGA is capable of
performing certain mathematical operations, such as the square root, in an
optimized manner using the DSP cores present.

Figure 4.3: Structural Description of MAC Unit

4.3 Control Unit

This module is responsible for controlling the operation of the line buffers and
maintaining the synchronous operation of the circuit.

The inputs/outputs of the control module are as follows:

• Input Clock Signal: To synchronize the operation of the module
• Input Reset Signal: External signal which is asserted before starting the operation
• Input Pixel Data: Data flowing from the image pixels to the line buffers
• Output Pixel Data: Data flowing from the line buffers to the MAC unit
• Output Pixel Data Valid: Flag Signal to let the MAC unit know to operate on the
data
• Output Interrupt: Decides the order in which the line buffers are written

The control module is responsible for maintaining the order in which the line
buffers are filled. The first step is that three line buffers must be filled before we can
start operating the MAC unit. Simultaneously the fourth line buffer gets filled, to
exploit parallelism. After that, as we complete the convolution operation for a single
row, the control module switches to the next group of three line buffers while the
first one gets filled. The control module does this by generating the relevant select
signals and output signals that let the MAC unit start its operation. Pipelining is used
at various stages in the code, that is, there are multiple always blocks executing in
parallel.
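The rotation described above reduces to a small amount of counter and multiplexer logic. The fragment below is only a sketch of that idea (a column counter and a 2-bit select producing one-hot write enables for the four line buffers); the actual control module's signal names and state machine differ.

// Sketch of the line-buffer rotation inside the control unit (names are
// illustrative). The selected buffer receives the incoming pixel stream;
// the remaining three feed the 3x3 MAC window.
module lb_rotate #(
    parameter IMG_WIDTH = 512
) (
    input        clk,
    input        rst,
    input        i_pixel_valid,
    output [3:0] lb_wr_en        // one-hot write enable for the four buffers
);
    reg [$clog2(IMG_WIDTH)-1:0] wr_col;  // column within the row being written
    reg [1:0]                   wr_sel;  // which line buffer is being written

    always @(posedge clk) begin
        if (rst) begin
            wr_col <= 0;
            wr_sel <= 0;
        end else if (i_pixel_valid) begin
            if (wr_col == IMG_WIDTH-1) begin
                wr_col <= 0;
                wr_sel <= wr_sel + 1;    // rotate to the next line buffer
            end else
                wr_col <= wr_col + 1;
        end
    end

    assign lb_wr_en = 4'b0001 << wr_sel;
endmodule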

4.4 Top Module and Test Bench

The top module is defined for simulation purposes and is also used while packaging
the IP for deployment on the FPGA. The top module instantiates the Control Module,
the MAC Unit as well as the Output Buffers; the Line Buffers are instantiated within
the Control Module itself. A test bench is written in which the Top Module is the
device under test.

The test bench reads a 512 × 512 greyscale image file in BMP format, excludes the
header information and then provides the binary pixel data to the Device Under
Test (DUT). The output of the DUT is then fetched from the output buffer and
written back into a BMP file with the same header information.
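A skeleton of such a test bench is sketched below. The file names and DUT port names are placeholders, and the header size of 1078 bytes is an assumption for an 8-bit greyscale BMP (54 bytes of headers plus a 1024-byte palette); the true pixel-data offset should be taken from the BMP file header.

// Illustrative test bench skeleton: reads a greyscale BMP, skips the header,
// streams the pixels into the DUT and writes the convolved image back out.
`timescale 1ns/1ps
module tb_top;
    localparam HEADER_BYTES = 1078;   // assumed 8-bit BMP: 54-byte headers
                                      // plus a 1024-byte greyscale palette
    localparam NUM_PIXELS   = 512*512;

    reg        clk = 1'b0;
    reg        rst = 1'b1;
    reg        pixel_valid = 1'b0;
    reg  [7:0] pixel_in;
    wire [7:0] pixel_out;
    wire       out_valid;

    integer fin, fout, i, c;

    // Device under test; these port names are placeholders for the real ones.
    top dut (.i_clk(clk), .i_rst(rst), .i_pixel_valid(pixel_valid),
             .i_pixel(pixel_in), .o_pixel(pixel_out), .o_valid(out_valid));

    always #5 clk = ~clk;             // 100 MHz clock

    // Write every valid output pixel straight into the result file.
    always @(posedge clk)
        if (out_valid) $fwrite(fout, "%c", pixel_out);

    initial begin
        fin  = $fopen("input_gray.bmp",  "rb");
        fout = $fopen("output_conv.bmp", "wb");
        // Copy the header unchanged so the output remains a valid BMP file.
        for (i = 0; i < HEADER_BYTES; i = i + 1) begin
            c = $fgetc(fin);
            $fwrite(fout, "%c", c);
        end
        repeat (4) @(posedge clk);
        rst = 1'b0;
        // Stream one pixel per clock cycle into the DUT.
        for (i = 0; i < NUM_PIXELS; i = i + 1) begin
            @(posedge clk);
            pixel_in    <= $fgetc(fin);
            pixel_valid <= 1'b1;
        end
        @(posedge clk);
        pixel_valid <= 1'b0;
        repeat (2000) @(posedge clk);  // let the pipeline drain
        $fclose(fin);
        $fclose(fout);
        $finish;
    end
endmodule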

Chapter 5

Results and Discussion

5.1 Fully Connected Layer

5.1.1 Dataset used and its Software implementation Accuracy

In this section we discuss the implementation results and performance of
ZyNet-based DNNs. Multiple DNNs targeting the popular MNIST dataset are
implemented and evaluated through both simulation and hardware validation. The
MNIST dataset for handwritten digit recognition uses 30000 images for training and
10000 images for testing. The weights and biases for the network were initially
determined from the software implementation and used for the pre-trained
hardware implementation.
The software implementation, after 30 epochs of training, provides 96.52%
detection accuracy on the testing set. All implementations follow a 5-layer
architecture with 784 neurons in the input layer, two hidden layers with 30 neurons
each, one hidden layer with 10 neurons, and an output layer with 10 neurons. The
output layer is connected to a hardmax module to detect the neuron with the
maximum output value. All designs are simulated and implemented with Xilinx
Vivado 2018.3 and hardware validated on a ZedBoard with an xc7z020clg484-1 SoC
and 512 MB of external DDR3 memory.

5.1.2 Resource Utilization and Timing analysis for a single neuron

The data width of the input values is 16 bits for the following results. We compare
the resource utilisation, maximum frequency and power consumption on the
ZedBoard for a single neuron in the following graphs.

[Plots: LUTs used, FFs used, BRAMs used and DSP slices used by a single neuron, each plotted against Log2(Sigmoid Depth).]
Figure 5.1: Resource Utilisation for Sigmoid Function with varying Sigmoid depths

[Plot: Maximum Frequency (MHz) against Log2(Sigmoid Depth), decreasing from 218.38 MHz at a depth of 2^5 to 206.52 MHz at a depth of 2^12.]

Figure 5.2: Maximum Frequency for Sigmoid Function with varying Sigmoid depths

[Plot: Power Consumed (W) against Log2(Sigmoid Depth), with values ranging roughly from 0.154 W to 0.168 W.]
Figure 5.3: Power Consumed for Sigmoid Function with varying Sigmoid depths

The total resources available on the board are as follows: LUT: 53200, FF: 106400,
BRAM: 140, DSP: 220. The ReLU implementation of the activation function consumes
68 LUTs, 81 FFs, 0.5 BRAM and 2 DSP slices.

5.1.3 Resource Utilization and Timing analysis for the whole DNN

All implementations use Vivado default settings and do not apply any optimization
(timing, area or power). The DNN implementation was evaluated for both Sigmoid
and ReLU activation functions at varying data widths. Fig. 5.4 shows the relation
between the detection accuracy and the data width.

Figure 5.4: Comparison of accuracy for Sigmoid and ReLU activation functions

It can be seen that for very small data widths (such as 4 and 8 bits), the
Sigmoid-based implementation outperforms the ReLU-based implementation. As
the width increases, ReLU has a slight advantage over the Sigmoid implementation,
and the accuracy of both implementations becomes constant beyond 12 bits. The
Sigmoid implementation gives a maximum of 94.86% detection accuracy and ReLU
gives a maximum of 95.87%.
The degradation in the result compared to the software implementation can be
attributed to the error introduced by the fixed-point representation of weights,
biases and input data. Still, the approximation causes less than 1% loss in accuracy
while giving a considerable advantage in terms of resource utilization and clock
performance.

Figure 5.5: (a) Resource utilization for the Sigmoid activation function (b) Resource utilization for
the ReLU activation function

Figures 5.5(a) and 5.5(b) compare the resource utilization of the DNNs for different
data widths in terms of LUTs, flip-flops, Block RAMs (BRAMs) and DSP slices for the
two different activation functions. Since the RTL code generated by ZyNet does not
explicitly instantiate any IP cores, for smaller designs the implementation tool
(Vivado) automatically maps the multipliers and weight memory blocks onto LUTs
and flip-flops. For larger data sizes, the lookup tables used for implementing the
Sigmoid function are mapped to Block RAMs, which considerably increases the
BRAM utilization.
For example, for the 32-bit implementation, the Sigmoid-based version requires
50350 LUTs, 15544 flip-flops, 70 BRAMs and 220 DSP slices. At the same time, the
ReLU-based implementation requires 54559 LUTs, 18074 flip-flops, 30 BRAMs and
220 DSP slices. These numbers roughly map to 94.6% of the LUTs, 17% of the
flip-flops, 21.4% of the BRAMs and 100% of the DSP slices of the chip. It should be
noted that the 16-bit implementation also consumes 220 DSP slices, which means
that for larger networks the tool automatically maps the multipliers onto LUTs and
flip-flops. Thus the largest network size is constrained by the number of LUTs
available in the device.

Figure 5.6: (a) Detection accuracy with varying sigmoid memory depth (b) Data width vs
maximum frequency and power consumption

5.2 Convolution Layer

We exploit parallelism by using the data of three line buffers for the MAC operation
while simultaneously filling the fourth line buffer with new data, using four line
buffers in total. If we were to use only three line buffers in total, we would first need
to wait for a line buffer to be filled before we could perform the MAC operation,
roughly doubling the total time taken to perform the convolution. This is shown in
the waveforms in Figures 5.7(a) and 5.7(b) below.

Figure 5.7(a): Waveform of the test bench when three line buffers are used

Figure 5.7(b): Waveform of the test bench when four line buffers are used

Hence, we can conclude that by increasing the number of line buffers, or by adding
further hardware components, more parallelism can be achieved and the
computation time can be reduced.

The Sobel kernel's output is shown below. The edge detection output is currently
produced via thresholding in the implementation.

Figure 5.8: Edge Detection Output Image

Chapter 6

Conclusion and Future Work

6.1 Conclusion

We have discussed the fundamental concepts behind a Convolutional Neural
Network as well as Field Programmable Gate Arrays. We have further discussed
related work on the deployment of DNNs on FPGAs and analysed the results for
resource utilisation, maximum frequency and power.
We then moved on to the hardware implementation of the basic convolution
operation and implemented a circuit-level design using Verilog HDL. The
circuit-level design forces us to think about the operation differently, and the details
of the module design and simulations have been discussed extensively in the report.
The hardware allows us to pipeline and parallelise the operation to a large extent.
The Sobel kernel has been implemented as a test case, and the convolution
operation has been performed in approximately 2.6 milliseconds on a system
running an Intel Core i5 8th-generation processor with 8 GB of RAM.

6.2 Future Work

Further, we can extend the 2D convolution operation to 3D convolutions. After that,
we can implement a whole CNN model such as AlexNet, VGGNet, ResNet, etc. on an
FPGA and optimize it according to the use case. The implementation of a whole CNN
would not be possible on the FPGA board used here due to resource constraints; for
this we would need to either further optimize the layers or use an alternative board.

Bibliography

1) Yuchen Yao, Qinghua Duan, Zhiqian Zhang, Jiabao Gao, Jian Wang, Meng Yang,
Xinxuan Tao, Jinmei Lai (2018), 'A FPGA-based Hardware Accelerator for Multiple
Convolutional Neural Networks'.

2) Vikas Gupta, Anastasia (2017), 'Image Classification using Feedforward Neural
Network in Keras'.

3) Abhinav Sagar (2019), 'Deep Learning, Image Classification, Neural Networks,
Small Data'.

4) Lacey, G., Taylor, G. W. & Areibi, S. (2016), 'Deep learning on FPGAs: Past, present,
and future', arXiv preprint arXiv:1602.04283.

5) Zilic, Zeljko (2009), 'Designing and Using FPGAs beyond Classical Binary Logic:
Opportunities in Nano-Scale Integration Age', Proceedings of the International
Symposium on Multiple-Valued Logic, 268-273. 10.1109/ISMVL.2009.51.

6) Mittal, S. (2020), 'A survey of FPGA-based accelerators for convolutional neural
networks', Neural Computing and Applications 32(4), 1109-1139.

7) Ghosh, Aniruddha & Sinha, Amitabha (2018), 'FPGA Implementation of MAC Unit
for Double Base Ternary Number System (DBTNS) and its Performance Analysis',
International Journal of Computer Applications 181, 9-22. 10.5120/ijca2018917785.

8) Murnane, K. (2016), 'What is deep learning and how is it useful?'.

9) Wang, C., Gong, L., Yu, Q., Li, X., Xie, Y. & Zhou, X. (2016), 'DLAU: A scalable deep
learning accelerator unit on FPGA', IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems 36(3), 513-517.

10) 'Automating Deep Neural Network Implementation on Low-cost Reconfigurable
Edge Computing Platforms'.
