YOLO Based Detection and Classification of Objects in Video Records
Abstract— The primitive machine learning algorithms at present break each problem down into small modules and solve the modules individually. Nowadays, the requirement of a detection algorithm is to work end to end and to take less time to compute. Real-time detection and classification of objects from video records provide the foundation for generating many kinds of analytical aspects, such as the amount of traffic in a particular area over the years or the total population in an area. In practice, the task usually encounters slow processing of classification and detection, or the occurrence of erroneous detections due to the incorporation of small and lightweight datasets. To overcome these issues, a YOLO (You Only Look Once) based detection and classification approach (YOLOv2) is presented for improving the computation and processing speed while efficiently identifying the objects in the video records. The classification algorithm creates a bounding box for every class of objects for which it is trained, and generates an annotation describing the particular class of object. The YOLO based detection and classification approach (YOLOv2) makes use of the GPU (Graphics Processing Unit) to increase the computation speed and processes video at 40 frames per second.

... that are incorporated from time to time for obtaining better results. Object detection approaches have progressed from sliding window-based methods to single-shot detection frameworks. The Convolutional Neural Network (CNN), in particular, has numerous applications, such as facial recognition, as it achieved a large decrease in error rate, but at the expense of speed and computation time. The Region based Convolutional Network (R-CNN) uses the Selective Search process in order to detect the objects. The descendants of R-CNN, namely Fast R-CNN and Faster R-CNN, fixed the slow nature of CNN and R-CNN by making the training process end to end [2].
C. Anchor Box
The YOLOv2 model segments the input image into N×N grid cells. Each grid cell has the task of localizing an object if the midpoint of that object falls within the cell. However, this grid cell approach can predict only a single object at a time: if the midpoints of two objects coincide, the detection algorithm will simply pick one of the objects. To solve this issue, the concept of Anchor Boxes is used [7].
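The two ingredients just described, assigning an object to the grid cell that contains its midpoint, and matching a ground-truth box to the Anchor Box with the highest IoU (detailed below), can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the function names and the pixel-coordinate box format (x1, y1, x2, y2) are assumptions of this example.

```python
def grid_cell_of(midpoint, image_w, image_h, n=3):
    """Return the (row, col) of the n x n grid cell that contains an
    object's midpoint (x, y), given in pixels."""
    x, y = midpoint
    col = min(int(x * n / image_w), n - 1)
    row = min(int(y * n / image_h), n - 1)
    return row, col

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping region (may be empty, hence the max(0, ...) clamps).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

def best_anchor(ground_truth, anchors):
    """Index of the anchor box with the highest IoU against the ground truth."""
    return max(range(len(anchors)), key=lambda i: iou(ground_truth, anchors[i]))
```

For example, a midpoint at (150, 250) in a 300×300 image with n = 3 falls in grid cell (2, 1), i.e. the bottom row, middle column.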
Authorized licensed use limited to: J.R.D. Tata Memorial Library Indian Institute of Science Bengaluru. Downloaded on April 14,2024 at 13:11:45 UTC from IEEE Xplore. Restrictions apply.
In Fig. 5, it is observed that the image consists of two objects (cars); for ease of explanation, N=3 is chosen for the number of grids, so the image is divided into 3×3 grid cells. If the classification and localization algorithm is trained to classify three classes, namely car, person, and motorcycle, then the output vector 'Y' of the neural net can be defined as a matrix of 8 elements:

Y = [P, bx, by, bh, bw, c1, c2, c3]^T --- (1)

Equation (1) describes the attributes of an object in an input image. 'P' defines the presence of an object in the grid cell and can take the value 0 or 1; 'bx' and 'by' define the coordinates of the midpoint of the object within a particular grid cell; 'bh' defines the height of the bounding box as a percentage of the total height of the grid cell; 'bw' defines the width as a percentage of the total width of the grid cell; and c1, c2, and c3 denote the classes, namely person, car, and motorcycle. The target volume output will be of order 3×3×8, where 8 is the number of dimensions defined for this particular classification example.

Consider now a case where two objects share the same midpoint. In this situation, the approach of Anchor Boxes is implemented.

Fig. 6. Implementation of Anchor Boxes

In Fig. 6, it is seen that two objects share the same midpoint. The system of objects now generates an output variable 'Y' as a matrix of 16 elements:

Y = [P, bx, by, bh, bw, c1, c2, c3, P, bx', by', bh', bw', c1, c2, c3]^T --- (2)

Equation (2) depicts the possible attributes of the system of objects, where the new parameters bx', by', bh', and bw' are the bounding box parameters of the second object. The ground truth bounding box associated with a particular object is compared with the Anchor Boxes, and the IoU is determined. The Anchor Box whose IoU metric is maximum will be used to code and detect the object. For instance, in Fig. 5, if a car is to be detected, then the output variable Y will take the values seen in (3):

Y = [0, D.C., D.C., D.C., D.C., D.C., D.C., D.C., 1, bx', by', bh', bw', 0, 1, 0]^T --- (3)

Here the first eight entries correspond to the anchor with no object (P = 0, with the remaining entries marked "don't care", D.C.), and the last eight entries encode the detection of a car (P = 1, with class values c1 = 0, c2 = 1, c3 = 0).

The YOLOv2 algorithm requires a specific set of platforms, as it extensively uses the GPU of the system. These platforms readily increase the speed and performance of the algorithm. For a Windows based system, the algorithm requires the help of the Microsoft Visual C++ platform.

D. CUDA (Compute Unified Device Architecture)

CUDA is a computing platform created by Nvidia in order to perform general purpose computing on the Graphics Processing Unit (GPU). It works extensively with programming languages like C/C++. The slow speed of a CPU is a serious hindrance to productivity for any image processing computation. CUDA has built-in features that enable it to associate a series of threads with each pixel so that speed is uncompromised. The CUDA environment supports heterogeneous programming that involves a host, which primarily works on the CPU, and a device, which consists of the graphics card interfaced with the GPU [4]. The host and the device work hand in hand to improve the workflow and computation speed. The host is responsible for allocating shared memory for the program variables, and the device improves the speed of the computation. In YOLOv2, performance is the main criterion, and to achieve a non-dispensable output, the CUDA environment plays a vital role. In a real-time scenario, the reduction of noise and redundancy from the objects that are being detected and classified is important [4]. The CUDA environment incorporates the cuDNN library, which provides GPU-accelerated functionality for Deep Neural Networks. This environment speeds up the processes of smoothening (reduction of noise) and edge detection manifold, due to the associated-thread approach and the device-host paradigm.

E. YOLO 9000

YOLO 9000 is a significant improvement to the original object detection system (YOLO). The earlier version of YOLO gave a speed of 45 FPS, or about 22 ms/image. It was also behind in terms of accuracy when compared to other methods like R-CNN and
Fast R-CNN. There is, therefore, a need to make the original YOLO version better and faster, and also to improve its accuracy. General object detectors pre-train on the ImageNet dataset at 224×224 and then resize the network to 448×448 before fine-tuning the model on detection. This version, however, trains a bit longer at the higher resolution before fine-tuning, which increases the mean average precision (mAP) by 3.5%. The earlier YOLO version used the bounding box technique, wherein the X and Y coordinates, width, and height were obtained directly. This version uses the anchor box technique, which calculates offsets for images; the offsets are calculated from candidate boxes or reference points.

Fig. 8. Confidence levels of the two objects

In Fig. 7, there are two objects, and the model detects them with comprehensive confidence levels, as shown in Fig. 8. The time for computing the detection algorithm for this image was close to 0.49 s.
YOLO 9000 has also brought the concept of multi-scale training into the limelight. General object detection systems train at a single resolution (448×448). On the contrary, the new version resizes the network randomly during the training process over a range of different scales. The detector is trained on image scales from 320×320 to 608×608. We therefore get a network that can be resized at test time to a number of different sizes without changing the trained weights. We can run the detector at different scales, which gives a trade-off between performance (speed) and accuracy [5].
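The multi-scale schedule above can be sketched in a few lines. Because the network downsamples its input by a factor of 32, the usable scales are the multiples of 32 between 320 and 608, and a new scale is drawn at random every few batches (every 10 batches in the original YOLO9000 description). The helper name below is an assumption of this sketch.

```python
import random

# Multiples of 32 from 320 to 608: the network's total stride is 32,
# so the input side length must be divisible by 32.
SCALES = list(range(320, 609, 32))  # [320, 352, ..., 608]

def pick_training_scale(rng=random):
    """Randomly pick the square input resolution for the next batches."""
    side = rng.choice(SCALES)
    return side, side
```

At test time the same trained weights run at any of these scales: a small input (320×320) trades accuracy for speed, while a large one (608×608) does the opposite.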
There is a dearth of training data for detection models; hence, there is a need for data augmentation. This is also called the joint training method. The Common Objects in Context (COCO) dataset is used as the detection dataset. Although the COCO dataset allows multiple objects in an image to be detected accurately, it is confined to only 80 classes. We use ImageNet, which has 22000 classes, as the classification dataset for the model. ImageNet, however, labels only the one object that is in focus, and not multiple objects in the image. We therefore combine the detection and classification datasets and then use the backpropagation technique to determine the exact location and class of the objects [5]. This results in better performance, more accuracy, and lower latency compared to the earlier version of YOLO.

III. SIMULATION RESULTS AND ANALYSIS

For the YOLOv2 algorithm to execute and detect the objects, we have employed Microsoft Visual C++ 2017 to build the .exe file. We have implemented the pre-trained yolo9000 weights and their configurations. Our system consists of an NVIDIA GEFORCE 940MX enabled GPU. The figures in this section depict the performance of the YOLOv2 model on both still images and video records. Fig. 7 and Fig. 8 portray the detection and labelling of the objects in a single image.

Fig. 9. Detection and labelling of multiple objects

Fig. 10. Confidence levels of multiple objects

As the number of objects in an image increases further, the speed of execution does not drop. The YOLOv2 model detects the majority of the objects with a proficient confidence level. This is portrayed in Fig. 9 and Fig. 10, which contain more objects than Fig. 7. The time of execution for this image was close to 0.5 s.
V. CONCLUSION
In this paper, we introduced the YOLOv2 model and YOLO9000, real-time detection systems for detecting and classifying objects in video records. YOLOv2 is agile and efficient in detecting and classifying objects. The speed and accuracy were achieved with the aid of GPU functionalities and the Anchor Box technique, respectively. Furthermore, YOLOv2 can detect object movement in video records with a proficient accuracy. YOLO9000 is a real-time framework which is able to optimize detection and classification and bridge the gap between them.

Fig. 12. Detection and labeling of multiple instances of multiple objects

The YOLOv2 model and the YOLO 9000 detection system collectively are able to detect and classify objects varying from multiple instances of single objects to multiple instances of multiple objects.
REFERENCES