Convolutional Neural Network Layers Implementation On Low-Cost Reconfigurable Edge Computing Platforms
by
Sarthak Goyal
19EC37004
Department of Electronics and
Electrical Communication Engineering
Indian Institute of Technology,
Kharagpur
India - 721302
CERTIFICATE
This is to certify that we have examined the thesis entitled Convolutional Neural Network Layers Implementation on Low-Cost Reconfigurable Edge Computing Platforms, submitted by Sarthak Goyal (Roll Number: 19EC37004), an undergraduate student of the Department of Electronics and Electrical Communication Engineering, in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Electronics and Electrical Communication Engineering. We hereby accord our approval of it as a study carried out and presented in a manner required for its acceptance in partial fulfilment of the degree for which it has been submitted. The thesis has fulfilled all the requirements as per the regulations of the Institute and has reached the standard needed for submission.
Supervisor
Department of Electronics and
Electrical Communication
Engineering
Indian Institute of Technology,
Kharagpur
Place: Kharagpur
Date:
ACKNOWLEDGEMENTS
Sarthak Goyal
IIT Kharagpur
Date:
ABSTRACT
The widespread use of Internet of Things (IoT) enabled applications offers low-cost FPGA devices a new possibility to function as edge computing neural network nodes. Although neural network development environments are offered by FPGA vendors, they frequently focus on high-end devices, and in contrast to their software counterparts, these development platforms are less user-friendly.
We implement an FPGA design of a Deep Neural Network and compare the resources it uses on the board for different data widths and activation functions, followed by timing analysis. Implementation results show that the DNNs generated by the platform achieve accuracy very close to software implementations while delivering an order of magnitude higher throughput than other edge computing devices at a lower energy footprint. Further, we implement the convolution layer of CNNs and analyse its resource utilization. Once implemented, these two operation layers can together be used to build much larger models in the future.
CONTENTS
Bibliography
Chapter 1
Introduction to Convolutional
Neural Network and Hardware
Acceleration
Figure 1.1: Illustration of Deep Learning Neural Network Layers
You might input a picture of a sheep, for instance. Image classification is the task of having a computer analyse an image and tell you that it shows a sheep (or the likelihood that a sheep is present). Classifying photos is nothing new to humans, but for machines it is the ideal illustration of Moravec's paradox: what we find simple is complex for AI.
Early image classification was based on raw pixel data. This meant that computers would dissect images into their component pixels. The issue is that the same subject can appear substantially different in two distinct photographs: they may have different backdrops, perspectives, poses, and so on. This made it very difficult for computers to accurately "see" and classify images. [2]
Figure 1.2: Illustration of Image Classification
Pooling Layer: The purpose of pooling is to create spatial invariance by subsampling adjacent pixels. Fig. 1.4 shows the pooling schemes used most commonly, i.e., average pooling and max pooling.
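Both schemes reduce a small neighbourhood of pixels to a single value. As an illustration of how simple this is in hardware, the following Verilog fragment is a minimal sketch of a 2 × 2 max-pooling unit; the module name and the 8-bit pixel width are assumptions made here for illustration, not part of the design described later in this thesis.

module max_pool_2x2 #(
    parameter PIXEL_WIDTH = 8                            // assumed greyscale pixel width
)(
    input  wire [PIXEL_WIDTH-1:0] p00, p01, p10, p11,    // 2 x 2 input window
    output wire [PIXEL_WIDTH-1:0] max_out
);
    // Max pooling keeps the largest of the four neighbouring pixels.
    // Average pooling would instead add the four values and drop two LSBs.
    wire [PIXEL_WIDTH-1:0] max_top = (p00 > p01) ? p00 : p01;
    wire [PIXEL_WIDTH-1:0] max_bot = (p10 > p11) ? p10 : p11;
    assign max_out = (max_top > max_bot) ? max_top : max_bot;
endmodule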
Graphics Processing Units (GPUs): originally designed for rendering images and video, GPUs are now used for calculations involving massive amounts of data, accelerating portions of an application while the rest continues to run on the CPU. The massive parallelism of modern GPUs allows users to process billions of records in very little time.
Chapter 2
The key difference from a standard multilayer perceptron is the input layer: instead of a vector, an activation volume is taken as the input. As a result, when the input is a vector, the fully connected layer is defined as

y_j^{(l)} = f\left( \sum_{i} w_{i,j}^{(l)} \, x_i^{(l-1)} \right).

Otherwise, when the input is an activation volume, it is defined as

y_j^{(l)} = f\left( \sum_{i} \sum_{r} \sum_{s} w_{i,j,r,s}^{(l)} \, x_{i,r,s}^{(l-1)} \right).

The goal of the complete fully connected structure is to tune the weight parameters w_{i,j}^{(l)} or w_{i,j,r,s}^{(l)} to create a stochastic likelihood representation of each class based on the activation maps generated by the concatenation of convolutional, non-linearity, rectification and pooling layers. Individual fully connected layers operate identically to the layers of the multilayer perceptron, with the only exception being the input layer.
It is noteworthy that the function f once again represents the non-linearity; however, in a fully connected structure the non-linearity is built into the neurons and is not a separate layer.
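In hardware, each such neuron reduces to a multiply-accumulate operation over its inputs followed by the activation function. The Verilog below is a minimal sketch of a serial fully connected neuron under assumed fixed-point widths and port names; it illustrates the idea and is not the exact neuron generated by the platform used later in this thesis.

module fc_neuron #(
    parameter DATA_WIDTH = 16,                    // assumed fixed-point data width
    parameter ACC_WIDTH  = 2*DATA_WIDTH + 8       // headroom for the accumulation
)(
    input  wire                          clk,
    input  wire                          rst,      // also the point to preload the bias
    input  wire                          in_valid,
    input  wire signed [DATA_WIDTH-1:0]  x,        // activation from the previous layer
    input  wire signed [DATA_WIDTH-1:0]  w,        // matching weight from local memory
    output reg  signed [ACC_WIDTH-1:0]   acc       // passed to sigmoid/ReLU when the layer finishes
);
    // One input value and one weight are consumed per clock cycle.
    always @(posedge clk) begin
        if (rst)
            acc <= 0;
        else if (in_valid)
            acc <= acc + x * w;
    end
endmodule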
2.2 Sigmoid Function
Chapter 3
3.5 Layer Architecture
Each layer instantiates a user-specified number of neurons and manages data movement between the layers. Since each neuron has a single data interface and a fully connected layer requires a connection to every neuron of the previous layer, the data from each layer is initially stored in a shift register. It is then shifted to the next layer one value per clock cycle, as shown in Fig. 3.3. Connections between layers and integration with the input and output AXI interfaces are automatically implemented by the tool.
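The Verilog below is a minimal sketch of such an inter-layer shift register; the port names, handshake and widths are assumptions made for illustration and differ from the generated RTL.

module layer_shift_reg #(
    parameter DATA_WIDTH  = 16,
    parameter NUM_NEURONS = 30                          // neurons in the source layer
)(
    input  wire                              clk,
    input  wire                              load,      // layer finished: capture all outputs
    input  wire [NUM_NEURONS*DATA_WIDTH-1:0] layer_out, // parallel outputs of this layer
    output wire [DATA_WIDTH-1:0]             serial_out // one activation per clock to the next layer
);
    reg [NUM_NEURONS*DATA_WIDTH-1:0] shift;

    // Capture all neuron outputs in parallel, then shift them out one value
    // per clock cycle so every neuron of the next layer sees them in turn.
    always @(posedge clk) begin
        if (load)
            shift <= layer_out;
        else
            shift <= shift << DATA_WIDTH;
    end

    assign serial_out = shift[NUM_NEURONS*DATA_WIDTH-1 -: DATA_WIDTH];
endmodule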
Chapter 4
Hardware Implementation of
Convolution Operation
Convolutional Neural Network implementation on a Field Programmable Gate Array is the ultimate goal. For that, we need a functioning implementation of the convolution operation. Since hardware is handled differently than in a normal software implementation, the convolution operation is designed differently. This design employs a module-wise behavioural implementation to accomplish the desired operation and was created using the Verilog Hardware Description Language and the Xilinx Vivado suite. [1][10]
To perform the convolution process, the structure has four modules that are
implemented in Vivado:
1. Line Buffer
2. Multiply and Accumulate Unit
3. Control Unit
4. Top Module
The Line Buffer is basically a region of memory used to store data temporarily between operations. In this case, it holds an entire row of image data at once before the data is sent to the Multiply Accumulate Unit.
The output size of this line buffer is chosen based on the size of the kernel that is being implemented. The most important thing to keep in mind is that while the output is a continuous stream of bits, the internal structure is made up of registers that store colour values, ordered so as to hold the necessary number of pixels in a given row.
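The following Verilog fragment is a minimal sketch of such a line buffer for a 512-pixel row and a 3 × 3 kernel, with 8-bit greyscale pixels assumed; port names, boundary handling and sizing are illustrative and simplified relative to the actual module.

module line_buffer #(
    parameter PIXEL_WIDTH = 8,
    parameter ROW_LENGTH  = 512                        // one full image row
)(
    input  wire                     clk,
    input  wire                     wr_en,             // push one incoming pixel
    input  wire [PIXEL_WIDTH-1:0]   pixel_in,
    input  wire                     rd_en,             // advance the read window
    output wire [3*PIXEL_WIDTH-1:0] window_out         // 3 adjacent pixels for a 3x3 kernel
);
    reg [PIXEL_WIDTH-1:0] mem [0:ROW_LENGTH-1];
    reg [$clog2(ROW_LENGTH)-1:0] wr_ptr = 0;
    reg [$clog2(ROW_LENGTH)-1:0] rd_ptr = 0;

    // Pixels are written sequentially until a full row is stored; reads then
    // expose a sliding window of three horizontally adjacent pixels.
    always @(posedge clk) begin
        if (wr_en) begin
            mem[wr_ptr] <= pixel_in;
            wr_ptr      <= wr_ptr + 1'b1;              // wraps after ROW_LENGTH pixels
        end
        if (rd_en)
            rd_ptr <= rd_ptr + 1'b1;
    end

    assign window_out = {mem[rd_ptr], mem[rd_ptr+1], mem[rd_ptr+2]};
endmodule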
For this particular implementation, the parameters have been decided and are
listed here:
Figure 4.2: Implementation of Line Buffer in Vivado
Once again, the width of the multiplier and adder depends on the kernel size. For this particular implementation, since we are using a 3 × 3 kernel, we require 9 parallel data paths. [8]
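A minimal sketch of this 3 × 3 multiply-and-accumulate stage is given below: nine products are formed in parallel from the three line-buffer windows and summed in an adder tree. The widths, port names and unsigned-pixel/signed-coefficient convention are assumptions for illustration.

module mac_3x3 #(
    parameter PIXEL_WIDTH = 8,
    parameter COEFF_WIDTH = 8,
    parameter OUT_WIDTH   = PIXEL_WIDTH + COEFF_WIDTH + 4   // headroom for the 9-term sum
)(
    input  wire                        clk,
    input  wire [9*PIXEL_WIDTH-1:0]    window,    // 3 pixels from each of the 3 line buffers
    input  wire [9*COEFF_WIDTH-1:0]    kernel,    // 9 signed kernel coefficients (e.g. Sobel)
    output reg  signed [OUT_WIDTH-1:0] mac_out
);
    integer i;
    reg signed [OUT_WIDTH-1:0] sum;

    // Nine parallel multiplications followed by an adder tree; the loop is
    // fully unrolled by synthesis into parallel hardware.
    always @(posedge clk) begin
        sum = 0;
        for (i = 0; i < 9; i = i + 1)
            sum = sum + $signed({1'b0, window[i*PIXEL_WIDTH +: PIXEL_WIDTH]})
                      * $signed(kernel[i*COEFF_WIDTH +: COEFF_WIDTH]);
        mac_out <= sum;
    end
endmodule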
The output of the MAC unit can undergo further processing, as in the case of the Sobel kernel, which is used as the test case in this report. The FPGA is capable of performing certain mathematical operations, such as the square root, in an optimised manner using the DSP cores present on the device.
The control unit is responsible for controlling the operation of the line buffers and for maintaining the synchronous operation of the circuit.
The control module is responsible for maintaining the order in which the line buffers are filled. First, all three line buffers must be filled before operation with the MAC unit can start. Simultaneously, the fourth line buffer is filled, to exploit parallelism. Then, as the convolution operation for one row completes, the control module switches the MAC unit to the next group of three line buffers while the first one is refilled. The control module does this by generating the relevant select signals and output signals that allow the MAC unit to start operation. Pipelining is used at various stages in the code, in the sense that multiple always blocks execute in parallel.
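The rotation among the four line buffers can be sketched as a small piece of control logic that generates the select signal; the Verilog below is an illustrative simplification of the control module, with assumed signal names and handshakes.

module lb_select (
    input  wire       clk,
    input  wire       rst,
    input  wire       row_done,        // a full image row has been written to a line buffer
    output reg  [1:0] fill_sel,        // line buffer currently being filled
    output reg        mac_start        // the other three buffers hold valid rows for the MAC
);
    // A 2-bit counter marks which of the four line buffers is being filled;
    // the MAC unit is enabled once the first three rows are in place.
    always @(posedge clk) begin
        if (rst) begin
            fill_sel  <= 2'd0;
            mac_start <= 1'b0;
        end else if (row_done) begin
            fill_sel  <= fill_sel + 2'd1;
            mac_start <= mac_start | (fill_sel >= 2'd2);
        end
    end
endmodule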
The top module is defined for simulation purposes and is also used while packaging the IP for deployment on the FPGA. The top module instantiates the Control Module, the MAC Unit as well as the Output Buffers; the Line Buffers are instantiated within the Control Module itself. A test bench is written in which the Top Module is the device under test.
The test bench reads a 512 × 512 greyscale image file in BMP format, excludes the header information and then provides the binary pixel data to the Device Under Test (DUT). The output of the DUT is then fetched from the output buffer and written back into a BMP file with the same header information.
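A minimal sketch of such a test bench is shown below. The file names, DUT port names and header size are assumptions for illustration; 1078 bytes is the typical header length of an 8-bit greyscale BMP including its 256-entry palette.

`timescale 1ns / 1ps
module conv_tb;
    localparam HEADER_BYTES = 1078;           // assumed 8-bit BMP header + palette
    localparam IMG_BYTES    = 512 * 512;

    reg        clk = 0;
    reg  [7:0] pixel_in;
    reg        pixel_valid = 0;
    wire [7:0] pixel_out;
    wire       out_valid;

    integer fin, fout, i, j;
    reg [7:0] header [0:HEADER_BYTES-1];
    reg [7:0] result [0:IMG_BYTES-1];

    // top_module is the device under test described in the text (ports assumed).
    top_module dut (.clk(clk), .pixel_in(pixel_in), .pixel_valid(pixel_valid),
                    .pixel_out(pixel_out), .out_valid(out_valid));

    always #5 clk = ~clk;                     // 100 MHz clock

    // Read the input image, keep its header, and stream the pixels to the DUT.
    initial begin
        fin = $fopen("input_gray.bmp", "rb");
        for (i = 0; i < HEADER_BYTES; i = i + 1)
            header[i] = $fgetc(fin);
        for (i = 0; i < IMG_BYTES; i = i + 1) begin
            @(posedge clk);
            pixel_in    <= $fgetc(fin);
            pixel_valid <= 1'b1;
        end
        @(posedge clk) pixel_valid <= 1'b0;
        $fclose(fin);
    end

    // Collect the DUT output and write it back under the original header.
    initial begin
        j = 0;
        while (j < IMG_BYTES) begin
            @(posedge clk);
            if (out_valid) begin
                result[j] = pixel_out;
                j = j + 1;
            end
        end
        fout = $fopen("edges_out.bmp", "wb");
        for (j = 0; j < HEADER_BYTES; j = j + 1) $fwrite(fout, "%c", header[j]);
        for (j = 0; j < IMG_BYTES;    j = j + 1) $fwrite(fout, "%c", result[j]);
        $fclose(fout);
        $finish;
    end
endmodule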
Chapter 5
5.1.2 Resource Utilization and Timing analysis for a single neuron
The data width of the input values is 16 for the following results. We compare the resource utilisation, maximum frequency and power consumption on the ZedBoard for a single neuron in the following graphs.
Figure 5.1: Resource utilisation for the Sigmoid function with varying sigmoid depths (LUTs used, FFs used, DSP slices used and BRAMs used, each plotted against Log2(Sigmoid Depth))
Figure 5.2: Maximum frequency (MHz) for the Sigmoid function with varying sigmoid depths, plotted against Log2(Sigmoid Depth)
Figure 5.3: Power consumed (W) for the Sigmoid function with varying sigmoid depths, plotted against Log2(Sigmoid Depth)
The total numbers of resources on the board are as follows: LUT: 53200, FF: 106400, BRAM: 140, DSP: 220. The ReLU implementation of the activation function consumes 68 LUTs, 81 FFs, 0.5 BRAM and 2 DSP slices.
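The drop in BRAM usage for the ReLU variant reflects the fact that, unlike the lookup-table based sigmoid, ReLU in fixed point is essentially a sign check. A minimal sketch, assuming a 16-bit signed representation, is:

module relu #(
    parameter DATA_WIDTH = 16
)(
    input  wire signed [DATA_WIDTH-1:0] x,
    output wire signed [DATA_WIDTH-1:0] y
);
    // Pass the value through when it is non-negative, otherwise output zero;
    // the sign bit alone decides the multiplexer.
    assign y = x[DATA_WIDTH-1] ? {DATA_WIDTH{1'b0}} : x;
endmodule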
5.1.3 Resource Utilization and Timing analysis for the whole DNN
All implementations use Vivado default settings and do not apply any optimization (timing, area or power). The DNN implementation was evaluated for both Sigmoid and ReLU activation functions for varying data widths. Fig. 5.4 shows the relation between the detection accuracy and the data width.
Figure 5.4: Comparison of accuracy for Sigmoid and ReLU activation functions
It can be seen that for very small data widths (such as 4 and 8 bits), the Sigmoid-based implementation outperforms the ReLU-based implementation. As the width increases, ReLU gains a slight advantage over the Sigmoid implementation, and the accuracy of both implementations becomes constant beyond 12 bits. The Sigmoid implementation gives a maximum of 94.86% detection accuracy and ReLU gives a maximum of 95.87%.
The degradation in the result compared to the software implementation can be attributed to the error introduced by the fixed-point representation of the weights, biases and input data. Still, the approximation causes less than 1% error while giving a considerable advantage in terms of resource utilization and clock performance.
Figure 5.5: (a) Resource utilization for the Sigmoid activation function (b) Resource utilization for the ReLU activation function
Figures 5.5(a) and 5.5(b) compare the resource utilization of the DNNs for different data widths in terms of LUTs, flip-flops, Block RAMs (BRAMs) and DSP slices for the two different activation functions. Since the RTL code generated by ZyNet does not explicitly instantiate any IP cores, for smaller designs the implementation tool (Vivado) automatically maps the multipliers and weight memory blocks onto LUTs and flip-flops. For larger data sizes, the lookup tables used for implementing the Sigmoid function are mapped to Block RAMs, which considerably increases the BRAM utilization.
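The Verilog below is a minimal sketch of such a lookup-table based sigmoid, assuming a 16-bit fixed-point input whose upper bits address a table of pre-computed sigmoid samples; the memory file name and the address mapping are illustrative, and the table depth corresponds to the sigmoid depth swept in Section 5.1.2.

module sigmoid_lut #(
    parameter DATA_WIDTH = 16,
    parameter DEPTH_LOG2 = 10                           // log2 of the sigmoid depth
)(
    input  wire                         clk,
    input  wire signed [DATA_WIDTH-1:0] x,
    output reg         [DATA_WIDTH-1:0] y
);
    // Table of 2**DEPTH_LOG2 pre-computed sigmoid samples in fixed point,
    // generated offline; for large depths this memory is mapped to BRAM.
    reg [DATA_WIDTH-1:0] lut [0:(1<<DEPTH_LOG2)-1];

    initial $readmemb("sigmoid_table.mem", lut);        // illustrative file name

    // Offset the signed input so that the most negative value maps to address 0,
    // then keep the top DEPTH_LOG2 bits as the table address.
    wire [DEPTH_LOG2-1:0] addr =
        {~x[DATA_WIDTH-1], x[DATA_WIDTH-2 -: DEPTH_LOG2-1]};

    always @(posedge clk)
        y <= lut[addr];                                 // one-cycle table read
endmodule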
For example, for the 32-bit implementation, the Sigmoid-based design requires 50350 LUTs, 15544 flip-flops, 70 BRAMs and 220 DSP slices, while the ReLU-based design requires 54559 LUTs, 18074 flip-flops, 30 BRAMs and 220 DSP slices. These numbers roughly map to 94.6% of the LUTs, 17% of the flip-flops, 21.4% of the BRAMs and 100% of the DSP slices of the chip. It should be noted that the 16-bit implementation also consumes 220 DSP slices, which means that for larger networks the tool automatically maps the multipliers onto LUTs and flip-flops. Thus the largest network size is constrained by the number of LUTs available in the device.
Figure 5.6: (a) Detection accuracy for varying sigmoid memory depth (b) Data width versus maximum frequency and power consumption
5.2 Convolution Layer
We exploited parallelism by using the data from three line buffers for the MAC operation while simultaneously filling the fourth line buffer with new data, using four line buffers in total. If only three line buffers are used in total, we must first wait for a line buffer to be filled before the MAC operation can start, which roughly doubles the total time needed to perform the convolution. This is shown in the waveforms in Figures 5.7(a) and 5.7(b) below.
Figure 5.7(a): Waveform of the test bench when three line buffers are used
Figure 5.7(b): Waveform of the test bench when four line buffers are used
Hence, we can conclude that by increasing the number of line buffers, or by adding further hardware components, more parallelism can be achieved and the computation time can be reduced further.
The output of the Sobel kernel is displayed here. In the current implementation, the edge detection output is produced via thresholding.
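A minimal sketch of this thresholding stage, with an assumed 8-bit magnitude and an illustrative threshold value, is:

module threshold #(
    parameter DATA_WIDTH = 8,
    parameter THRESH     = 8'd100                 // illustrative threshold value
)(
    input  wire [DATA_WIDTH-1:0] mag,             // Sobel gradient magnitude
    output wire [DATA_WIDTH-1:0] pix_out
);
    // Pixels above the threshold become white, all others black.
    assign pix_out = (mag > THRESH) ? {DATA_WIDTH{1'b1}} : {DATA_WIDTH{1'b0}};
endmodule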
Chapter 6
6.1 Conclusion
Bibliography
1) Yao, Y., Duan, Q., Zhang, Z., Gao, J., Wang, J., Yang, M., Tao, X. & Lai, J. (2018), 'A FPGA-based Hardware Accelerator for Multiple Convolutional Neural Networks'.
4) Lacey, G., Taylor, G. W. & Areibi, S. (2016), 'Deep learning on FPGAs: Past, present, and future', arXiv preprint arXiv:1602.04283.
5) Zilic, Z. (2009), 'Designing and Using FPGAs beyond Classical Binary Logic: Opportunities in Nano-Scale Integration Age', Proceedings of the International Symposium on Multiple-Valued Logic, 268-273, doi:10.1109/ISMVL.2009.51.
9) Wang, C., Gong, L., Yu, Q., Li, X., Xie, Y. & Zhou, X. (2016), 'DLAU: A scalable deep learning accelerator unit on FPGA', IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 36(3), 513-517.