2018 3rd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT-2018), May 18th & 19th, 2018

YOLO based Detection and Classification of Objects in video records
Arka Prava Jana, Abhiraj Biswas, Mohana
Department of Telecommunication, R. V. College of Engineering, Bengaluru, India

Abstract— Earlier machine learning algorithms break each problem down into small modules and solve them individually. Detection algorithms are now required to work end to end and to take less time to compute. Real-time detection and classification of objects from video records provides the foundation for many kinds of analysis, such as the amount of traffic in a particular area over the years or the total population in an area. In practice, the task usually suffers from slow classification and detection, or from erroneous detections caused by training on small, lightweight datasets. To overcome these issues, a YOLO (You Only Look Once) based detection and classification approach (YOLOv2) is used to improve computation and processing speed while still identifying the objects in the video records efficiently. The classification algorithm creates a bounding box for every class of objects for which it is trained, and generates an annotation describing the particular class of object. The YOLO based detection and classification approach (YOLOv2) uses the GPU (Graphics Processing Unit) to increase the computation speed and processes video at 40 frames per second.

Keywords—Object detection, Classification, YOLOv2, video records, performance, object movement

I. INTRODUCTION

Humans look at an image and instantly process the objects in it and determine their locations, thanks to the interlinked neurons of the brain. The human brain is very accurate in performing complex tasks, such as identifying objects with similar attributes, in a very small amount of time. Like human interpretation, the world today requires fast and accurate algorithms to classify and detect objects for various applications. These applications include pedestrian detection, vehicle counting, motion tracking, cancer cell detection and many more [6]. The human visual system perceives visual information with apparent ease. In artificial intelligence, we face a huge amount of visual information and few useful techniques to process, understand and classify it. Object detection and tracking algorithms for security applications work by extracting features from images and video [10] [11]. Features are extracted using convolutional neural networks and deep learning [12]. Classifiers are used for image classification and counting [8] [9]. The object classification and detection workflow aims to classify objects based on their features and attributes. Over time, many approaches have been introduced to obtain better results. Object detection approaches have progressed from sliding window-based methods to single shot detection frameworks. The Convolutional Neural Network (CNN), in particular, has numerous applications such as facial recognition, as it achieved a large decrease in error rate, but at the expense of speed and computation time. The Region based Convolutional Network (R-CNN) uses Selective Search in order to detect the objects. The descendants of R-CNN, Fast R-CNN and Faster R-CNN, fixed the slow nature of CNN and R-CNN by making the training process end to end [2].

Fig. 1. YOLOv2 model regression.

YOLOv2 (You Only Look Once version 2) is an object detection technique in which detection is treated as a single regression problem that takes an input image and generates the confidence level of each object in the image [1]. It is the descendant of the original YOLO algorithm. Fig. 1 depicts a regression model in which the input image is divided into grids, bounding boxes are then formed on all objects, and finally the required objects are detected. The YOLOv2 detection algorithm has its origin in the open source deep learning framework known as Darknet. Darknet is based on the GoogLeNet architecture. YOLOv2 is extremely fast and makes fewer background errors than traditional R-CNN approaches [2]. YOLOv2 divides each image into several grid boxes, and each grid box predicts a number of bounding boxes and associated confidence levels. The confidence levels reflect the precision of localization of the objects, regardless of the class. Most of the grid boxes and bounding boxes are removed because their confidence falls below the threshold value, leaving behind only the objects of the classes that the model is trained to detect.
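The confidence-based filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Darknet implementation: the function name `filter_detections`, the tuple layout of each prediction and the 0.5 threshold are all hypothetical choices made for the example. Each predicted box carries a box confidence and a list of class probabilities; their product is treated as the class-specific score, and boxes that score below the threshold are discarded.

```python
from typing import List, Sequence, Tuple


def filter_detections(
    boxes: Sequence[Tuple[float, Sequence[float]]],
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    """Keep only boxes whose class-specific score reaches the threshold.

    Each entry of `boxes` is (box_confidence, class_probabilities).
    Returns a list of (box_index, class_index, score).
    """
    kept = []
    for index, (confidence, class_probs) in enumerate(boxes):
        # Pick the most likely class for this box.
        best_class = max(range(len(class_probs)), key=lambda k: class_probs[k])
        # Class-specific score: box confidence times class probability.
        score = confidence * class_probs[best_class]
        if score >= threshold:
            kept.append((index, best_class, score))
    return kept


# Hypothetical predictions for three boxes over three classes.
predictions = [
    (0.9, [0.1, 0.8, 0.1]),  # score 0.72, kept
    (0.3, [0.6, 0.3, 0.1]),  # score 0.18, dropped
    (0.8, [0.5, 0.4, 0.1]),  # score 0.40, dropped
]

for box_index, class_index, score in filter_detections(predictions):
    print(f"box {box_index}: class {class_index}, score {score:.2f}")
```

Raising the threshold removes more boxes and lowers the number of false detections, at the cost of missing objects that the model is less sure about.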

978-1-5386-2440-1/18/$31.00 ©2018 IEEE


Authorized licensed use limited to: J.R.D. Tata Memorial Library Indian Institute of Science Bengaluru. Downloaded on April 14,2024 at 13:11:45 UTC from IEEE Xplore. Restrictions apply.
Fig. 2. Speed v/s Accuracy of various detection algorithms.

The proposed work is mainly motivated by two issues present in traditional CNN algorithms: a low accuracy rate and slow computation speed due to the absence of a GPU. Fig. 2 shows the graph of speed versus accuracy for various detection algorithms. This paper focuses on the working and implementation of the YOLOv2 detection algorithm with the YOLO9000 detection system, and runs it on video records to predict bounding boxes along with annotations on the objects. The algorithm is implemented mainly using the OpenCV library.

II. DESIGN AND IMPLEMENTATION

This section describes the overall design requirements and the implementation of the YOLO model on input images. It also describes how the model efficiently and accurately detects and classifies objects by using Anchor Boxes and the CUDA environment.

A. Flow diagram

Fig. 3. Flow diagram of YOLO Model

Fig. 3 shows the flow diagram of the YOLO model. The model follows a fixed flow to analyze and detect the objects quickly. Firstly, it follows a regression model wherein it takes the input and derives the class probabilities. Secondly, it calculates the class-specific confidence scores. Lastly, it compares each confidence score with the predefined threshold value to detect and classify the objects. If the confidence score is less than the threshold value, the algorithm does not detect that particular object.

B. Intersection over Union (IoU)

Fig. 4. Predicted and ground truth bounding boxes

Intersection over Union is a metric used to measure the precision of an object detector and classifier on a particular dataset. It is computed from two bounding boxes [7]:
The ground truth bounding box: the hand-labeled bounding box of a particular object in an image.
The predicted bounding box: the bounding box predicted by the detection and classification algorithm.
Fig. 4 depicts the hand-labeled and the predicted bounding boxes. These help in determining the closest bounding box for a particular object.
IoU can also be defined as:

$$\mathrm{IoU} = \frac{\text{area of overlap}}{\text{area of union}}$$

C. Anchor Box

The YOLOv2 model segments the input image into N×N grid cells. Each grid cell is responsible for localizing an object if the midpoint of that object falls in that grid cell. However, the grid cell approach can predict only a single object at a time. If the midpoints of two objects coincide, the detection algorithm will simply pick one of the objects. To solve this issue, the concept of Anchor Boxes is used [7].

Fig. 5. Generation of grid cells and bounding boxes
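Two pieces of the preceding subsections are easy to make concrete: the IoU of Section B and the midpoint-to-grid-cell assignment of Section C. The sketch below is illustrative only. The function names `iou` and `grid_cell`, the corner-based box format `(x_min, y_min, x_max, y_max)` and the pixel coordinates are assumptions made for the example and are not taken from the YOLOv2 source.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grid_cell(mid_x: float, mid_y: float,
              img_w: float, img_h: float, n: int) -> Tuple[int, int]:
    """Return (row, col) of the N x N grid cell containing a midpoint."""
    # min(..., n - 1) keeps points on the right or bottom edge inside the grid.
    col = min(int(mid_x / img_w * n), n - 1)
    row = min(int(mid_y / img_h * n), n - 1)
    return row, col


ground_truth = (0.0, 0.0, 2.0, 2.0)
predicted = (1.0, 1.0, 3.0, 3.0)
print(f"IoU = {iou(ground_truth, predicted):.3f}")

# Midpoint (150, 250) in a 300 x 300 image split into a 3 x 3 grid.
print("grid cell:", grid_cell(150, 250, 300, 300, 3))
```

A predicted box that matches the hand-labeled box exactly gives an IoU of 1, and boxes that do not overlap give 0, so a higher IoU means a closer match.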

In Fig. 5, the image contains two objects (cars) and, for ease of explanation, N = 3 is chosen for the number of grids, so the image is divided into 3 × 3 grid cells. If the classification and localization algorithm is trained to classify three classes, namely person, car and motorcycle, then the output vector $Y$ of the neural network can be defined as a vector of 8 elements.

$$Y = \begin{bmatrix} P & b_x & b_y & b_h & b_w & c_1 & c_2 & c_3 \end{bmatrix}^{T} \qquad (1)$$

Equation (1) describes the attributes of an object in an input image. $P$ indicates the presence of an object in the grid cell and takes the value 0 or 1. $b_x$ and $b_y$ are the coordinates of the midpoint of the object within the grid cell. $b_h$ is the height of the bounding box as a fraction of the total height of the grid cell, and $b_w$ is its width as a fraction of the total width of the grid cell. $c_1$, $c_2$ and $c_3$ denote the classes person, car and motorcycle respectively. The target output volume is of order 3 × 3 × 8, where 8 is the number of dimensions defined for this particular classification example. Consider now a case where two objects share the same midpoint. In this situation, the Anchor Box approach is used.

Fig. 6. Implementation of Anchor Boxes

In Fig. 6, two objects share the same midpoint. The system of objects now generates an output variable $Y$ as a vector of 16 elements.

$$Y = \begin{bmatrix} P & b_x & b_y & b_h & b_w & c_1 & c_2 & c_3 & P & b_x' & b_y' & b_h' & b_w' & c_1 & c_2 & c_3 \end{bmatrix}^{T} \qquad (2)$$

Equation (2) depicts the possible attributes of the system of objects, where the new parameters $b_x'$, $b_y'$, $b_h'$, $b_w'$ are the bounding box parameters of the second object. The ground truth bounding box associated with a particular object is compared with the Anchor Boxes, and the IoU is determined. The object is encoded in, and detected through, the Anchor Box with which its IoU is maximum. For instance, if the car in Fig. 5 is to be detected, the output variable $Y$ takes the values shown in (3).

$$Y = \begin{bmatrix} 0 & \text{D.C} & \text{D.C} & \text{D.C} & \text{D.C} & \text{D.C} & \text{D.C} & \text{D.C} & 1 & b_x' & b_y' & b_h' & b_w' & 0 & 1 & 0 \end{bmatrix}^{T} \qquad (3)$$

In (3), D.C stands for "don't care": the first eight entries belong to the Anchor Box that holds no object, and the last eight entries encode the detection of the car ($c_2 = 1$).

The YOLOv2 algorithm requires a specific set of platforms because it makes extensive use of the system's GPU. These platforms increase the speed and performance of the algorithm. On a Windows based system, the algorithm requires the Microsoft Visual C++ platform.

D. CUDA (Compute Unified Device Architecture)

CUDA is a computing platform created by Nvidia to perform general purpose computing on the Graphics Processing Unit (GPU). It works extensively with programming languages like C/C++. The slow speed of a CPU is a serious hindrance to productivity for any image processing computation. CUDA has built-in features that enable it to associate a series of threads with each pixel so that speed is not compromised. The CUDA environment supports heterogeneous programming that involves a host, which primarily works on the CPU, and a device, which consists of the graphics card interfaced with the GPU [4]. The host and the device work hand in hand to improve the workflow and computation speed. The host is responsible for allocating memory for the program variables, and the device improves the speed of the computation. In YOLOv2, performance is the main criterion, and the CUDA environment plays a vital role in achieving the required output. In a real-time scenario, reducing noise and redundancy in the objects being detected and classified is important [4]. The CUDA environment incorporates the cuDNN library, which provides GPU-accelerated functionality for Deep Neural Networks. This environment speeds up smoothing (noise reduction) and edge detection manyfold, owing to the associated thread approach and the device-host paradigm.

E. YOLO 9000

YOLO 9000 is a significant improvement on the original object detection system (YOLO). The earlier version of YOLO ran at 45 FPS, or 22 ms/image. It also lagged behind in terms of accuracy when compared to other methods like RCNN and

Fast RCNN. There is, therefore, a need to make the original YOLO version better and faster, and to improve its accuracy. General object detectors pre-train on the ImageNet dataset at 224×224 and then resize the network to 448×448. Later, they fine-tune the model on detection. This version, however, trains a bit longer after resizing and before fine-tuning. This increases the mean average precision (mAP) by 3.5%. The earlier YOLO version used the bounding box technique, in which the X and Y coordinates, width and height were predicted directly. This version uses the anchor box technique, which instead calculates offsets. The offsets are calculated from candidate boxes or reference points. YOLO 9000 has also brought the concept of multi-scale training into the limelight. General object detection systems train at a single input resolution (448×448). In contrast, the new version resizes the network randomly during training across a number of different scales. The detector is trained on image scales from 320×320 to 608×608. We, therefore, get a network that can be resized at test time to a number of different sizes without changing the weights that have been trained. We can run the detector at different scales, which gives a trade-off between performance (speed) and accuracy [5].

There is a dearth of training data for detection models. Hence, the detection data needs to be supplemented with classification data; this is called the joint training method. The Common Objects in Context (COCO) dataset is used as the detection dataset. Although the COCO dataset labels multiple objects in an image accurately, it is confined to only 80 classes. We use ImageNet, which has 22000 classes, as our classification dataset. ImageNet, however, labels only the one object that is in focus and not the multiple objects in the image. We, therefore, combine the detection and classification datasets and then use backpropagation to determine the exact location and class of the objects in the image [5]. This results in better performance, more accuracy, and lower latency compared to the earlier version of YOLO.

III. SIMULATION RESULTS AND ANALYSIS

To execute the YOLOv2 algorithm and detect the objects, we employed Microsoft Visual C++ 2017 to build the .exe file. We used the pre-trained yolo9000 weights and configuration. Our system has an NVIDIA GeForce 940MX GPU. The figures in this section depict the performance of the YOLOv2 model on both still images and video records. Fig. 7 and Fig. 8 portray the detection and labeling of the objects in a single image.

Fig. 7. Detection and labeling of two objects

Fig. 8. Confidence levels of the two objects

In Fig. 7, there are two objects, and the model detects them with the confidence levels shown in Fig. 8. The detection algorithm took close to 0.49 s to process this image.

Fig. 9. Detection and labelling of multiple objects

Fig. 10. Confidence levels of multiple objects

As the number of objects in an image increases, the speed of execution does not drop noticeably. The YOLOv2 model detects the majority of the objects with a high confidence level. This is portrayed in Fig. 9 and Fig. 10, which contain more objects than Fig. 7. The execution time for this image was close to 0.5 s.

Fig. 11. Detection and Labeling of multiple instances of a single object
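The per-image times reported in this section can be turned into an approximate throughput figure with a one-line conversion. The sketch below is only arithmetic on the two timings quoted above (0.49 s and 0.5 s); the helper name `frames_per_second` is hypothetical, and the result ignores video decoding and display overhead.

```python
def frames_per_second(seconds_per_frame: float) -> float:
    """Convert a per-frame processing time into frames per second."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    return 1.0 / seconds_per_frame


# Timings quoted in the text for the two still-image experiments.
timings = [
    ("two objects (Fig. 7)", 0.49),
    ("multiple objects (Fig. 9)", 0.50),
]

for label, seconds in timings:
    print(f"{label}: {seconds:.2f} s/image is about "
          f"{frames_per_second(seconds):.2f} frames per second")
```

Both measurements work out to roughly 2 frames per second on the GPU described in this section, and the near-identical values reflect the observation that execution time barely changes as the number of objects grows.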

Fig. 12. Detection and Labeling of multiple instances of multiple objects

Fig. 13. Real time object detection and labeling of various objects on a road

As we move from images to video inputs, the scenario changes completely. The objects in a video continuously change their coordinates. Here, the YOLOv2 algorithm continuously detects and labels the objects with a high confidence level. We have taken some still images from the video record that we gave as input. Fig. 11, 12 and 13 depict the variation in the number of objects in the video. As the number of objects kept increasing, it did not affect the detection of neighboring objects. The model gives good detection and classification performance.

IV. APPLICATIONS AND CHALLENGES

This section describes some of the applications and challenges.

A. Applications
The fast object detection model, YOLOv2, is used to obtain accurate and efficient results. This model is used in pedestrian detection, vehicle detection [3], identifying anomalies in a scene such as explosives, people counting and many more.

B. Challenges
The system requirements for running the YOLO model are quite high, and it relies heavily on GPU functionality to execute. The correct version of the CUDA environment (v9.0) plays a key role in setting up the bridge between the algorithm and the GPU. The most challenging part was building the executable file for the YOLO algorithm, as it requires many libraries and configuration files to be added.

V. CONCLUSION

In this paper, we introduced the YOLOv2 model and YOLO9000, real-time detection systems for detecting and classifying objects in video records. YOLOv2 is agile and efficient in detecting and classifying objects. The speed and accuracy were achieved with the aid of GPU functionality and the Anchor Box technique respectively. Furthermore, YOLOv2 can detect object movement in video records with good accuracy. YOLO9000 is a real-time framework that is able to optimize detection and classification and bridge the gap between them. Together, the YOLOv2 model and the YOLO 9000 detection system are able to detect and classify objects ranging from multiple instances of a single object to multiple instances of multiple objects.

REFERENCES

[1] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, "You Only Look Once: Unified Real-Time Object Detection", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779-788, 2016.
[2] S. Ren, K. He, R. Girshick, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", 2017 IEEE Transactions on Pattern Analysis & Machine Intelligence.
[3] Jing Tao, Hongbo Wang, Xinyu Zhang, Xiaoyu Li, "An object detection system based on YOLO in traffic scene", 2017 6th International Conference on Computer Science and Network Technology (ICCSNT).
[4] Shunji Funasaka, Koji Nakano, Yasuaki Ito, "Single Kernel Soft Synchronization Technique for Task Arrays on CUDA-enabled GPUs, with Applications", 2017 Fifth International Symposium on Computing and Networking (CANDAR).
[5] Joseph Redmon, Ali Farhadi, "YOLO9000: Better, Faster, Stronger", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Ipek Baris, Yalin Bastanlar, "Classification and tracking of traffic scene objects with hybrid camera systems", 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC).
[7] Arsalan Mousavian, Dragomir Anguelov, John Flynn, Jana Košecká, "3D Bounding Box Estimation Using Deep Learning and Geometry", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Mohana and H. V. R. Aradhya, "Elegant and efficient algorithms for real time object detection, counting and classification for video surveillance applications from single fixed camera", 2016 International Conference on Circuits, Controls, Communications and Computing (I4C), Bangalore, 2016, pp. 1-7.
[9] H. V. Ravish Aradhya, Mohana and Kiran Anil Chikodi, "Real time objects detection and positioning in multiple regions using single fixed camera view for video surveillance applications", 2015 International Conference on Electrical, Electronics, Signals, Communication and Optimization (EESCO), Visakhapatnam, 2015, pp. 1-6.
[10] Akshay Mangawati, Mohana, Mohammed Leesan, H. V. Ravish Aradhya, "Object Tracking Algorithms for video surveillance applications", International Conference on Communication and Signal Processing (ICCSP), India, 2018, pp. 0676-0680.
[11] Apoorva Raghunandan, Mohana, Pakala Raghav and H. V. Ravish Aradhya, "Object Detection Algorithms for video surveillance applications", International Conference on Communication and Signal Processing (ICCSP), India, 2018, pp. 0570-0575.
[12] Manjunath Jogin, Mohana, "Feature extraction using Convolution Neural Networks (CNN) and Deep Learning", 2018 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), 2018, India.
