The Basics of Object Detection: YOLO, SSD, R-CNN
Hari Devanathan, 10/12/2022
Note: This assumes you have a general idea of what convolutional neural networks are. If you need a
refresher, this IBM post is excellent.
You know how convolutional neural networks (or CNNs) classify images. You can extend that kind of network
to detect individual objects within an image.
Say you have an image of 4 cows. Image recognition looks at the image as a whole and classifies it as a
cow.
However, image recognition can’t tell where in the image the cow is. Furthermore, it won’t be able to tell
that there are 4 cows instead of 1.
On a side note, object detection is NOT meant for counting. There are other computer vision techniques
like density estimation and extracting patches that are meant for counting. Object detection just detects if
there are multiple objects in the same image.
Why do I need to learn this? How is this useful in the real world?
There are a lot of scammers. Some get by using fake cashier’s checks or dollar bills. Image
recognition can only say whether the replica looks like a check or a dollar bill. Object detection can
help locate the individual components that determine whether the check or bill is fake.
If you’re fascinated by medicine, then object detection can identify shadows or abnormalities in X-rays.
Even if it can’t diagnose the disease, object detection can assist doctors in finding abnormalities in lung
tissue. This can help doctors catch lung cancer in patients before it becomes malignant.
If you’re fascinated by drones and self-driving cars, then object detection is very useful for detecting
obstacles and boundaries in a video stream. It helps keep the car within its driving lane and avoid accidents.
Images carry a lot of information. CNNs aim to reduce the image data while keeping the information that is
important. They do this with a stack of convolutional, pooling, and dense layers.
At the end of these layers, the image is reduced enough to make predictions on. For image recognition,
the final step is an activation function like Softmax that classifies the image. For object detection, the
final step is an algorithm that predicts bounding boxes around detected objects and then classifies each object.
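To make the classification step concrete, here is a minimal sketch of Softmax in plain Python. The class names and raw scores are made up for illustration and do not come from a real model.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from the last dense layer, one per class.
class_names = ["cow", "horse", "sheep"]
raw_scores = [4.0, 1.0, 0.5]

probs = softmax(raw_scores)
predicted = class_names[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Image recognition stops here with one label for the whole image. Object detection repeats this kind of classification for every predicted box.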
There are many object detection algorithms, but we’ll cover three main ones.
YOLO (You Only Look Once) is the simplest object detection architecture. It predicts bounding boxes through
a grid-based approach after the image goes through the CNN. It divides each image into an SxS grid, with each
grid cell predicting N boxes that may contain an object. For each of those SxSxN boxes, it scores every
class and keeps the class with the highest probability.
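The grid idea can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not a real detector: S = 7 and N = 2 are example values, the class list is hypothetical, and the class probabilities are random stand-ins for what a trained network would output.

```python
import random

S, N = 7, 2  # grid size and boxes per grid cell (illustrative values)
classes = ["cow", "horse", "sheep"]  # hypothetical class list

random.seed(0)

def fake_class_probs():
    """Stand-in for the class probabilities a network would predict for one box."""
    scores = [random.random() for _ in classes]
    total = sum(scores)
    return [s / total for s in scores]

# predictions[row][col] holds the N boxes predicted by that grid cell.
predictions = [
    [[fake_class_probs() for _ in range(N)] for _ in range(S)]
    for _ in range(S)
]

def best_class(class_probs):
    """Keep the class with the highest probability for one box."""
    idx = max(range(len(classes)), key=lambda i: class_probs[i])
    return classes[idx], class_probs[idx]

all_boxes = [box for row in predictions for cell in row for box in cell]
labeled = [best_class(box) for box in all_boxes]
total_boxes = len(all_boxes)
print(total_boxes)  # S * S * N boxes in total
```

A real model would also filter out low-confidence and overlapping boxes before reporting detections.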
SSD (Single Shot Detector) is similar to YOLO, but it predicts bounding boxes from the feature maps of
several convolutional layers (the outputs of those layers) rather than from the final layer alone. It runs a
3x3 convolutional kernel over each of these feature maps to predict bounding boxes and class probabilities,
then combines the results. SSD is often treated as a family of algorithms, with the popular choice being
RetinaNet.
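To see why using several feature maps matters, it helps to count the boxes. The sketch below uses the feature map sizes and boxes per location from the SSD300 configuration described in the original SSD paper; other SSD variants use different numbers, so treat these values as one example.

```python
# (feature map side length, default boxes per location) for SSD300
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

def boxes_per_map(side, boxes_per_location):
    """Each location on a side x side feature map predicts a fixed number of boxes."""
    return side * side * boxes_per_location

counts = [boxes_per_map(side, k) for side, k in feature_maps]
total = sum(counts)

for (side, _), count in zip(feature_maps, counts):
    print(f"{side}x{side} map: {count} boxes")
print("total:", total)
```

The large 38x38 map contributes most of the boxes and covers small objects, while the 1x1 map covers objects that fill the whole image.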
The R-CNN family (region-based CNNs) takes a different, two-stage approach. It uses one neural network to
suggest potential locations for objects to be detected (a region proposal network). It uses a second neural
network to classify and detect objects based on those proposed regions. In the Mask R-CNN variant, this
second network also adds a pixel mask that gives shape to the object inside each bounding box.
Note that researchers categorize these algorithms differently. Some consider YOLO part of
the SSD family because both process each image exactly once. Some keep YOLO separate from the
SSD family. Some say that region-based networks like Mask R-CNN are instance segmentation methods as
opposed to object detection methods.
Which algorithm you should use depends heavily on your data, your goals, and your compute budget.
YOLO is blazing fast and uses little memory. While YOLOv1 was less accurate than SSD,
YOLOv3 and YOLOv5 have surpassed SSD in accuracy and speed. However, YOLO can predict only 1
class per grid cell, so if multiple objects fall in the same cell, it misses some of them. YOLO also
struggles to detect small objects.
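The grid limitation is easy to picture with a small sketch. The object centers below are hypothetical, and the 448x448 image size and S = 7 are example values. The code only maps centers to grid cells and does not run a detector.

```python
from collections import defaultdict

def grid_cell(cx, cy, img_w, img_h, S):
    """Return the (row, col) of the grid cell containing the point (cx, cy)."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

# Hypothetical object centers in a 448x448 image, as (label, cx, cy).
objects = [("cow", 100, 200), ("cow", 110, 210), ("cow", 400, 50)]

S = 7
cells = defaultdict(list)
for label, cx, cy in objects:
    cells[grid_cell(cx, cy, 448, 448, S)].append(label)

# Cells that hold more than one object center are where a grid-based model struggles.
crowded = {cell: labels for cell, labels in cells.items() if len(labels) > 1}
print(crowded)  # {(3, 1): ['cow', 'cow']}
```

Two of the three cows share cell (3, 1), so a model limited to one class per cell would report at most one of them there.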
SSD can handle objects of various scales. It uses feature maps from multiple convolutional layers, and each
layer operates at a different scale. It is also not computationally heavy. However, SSD also struggles to
detect small objects. Furthermore, SSD becomes slower as it uses more convolutional layers.
R-CNN is the most accurate. However, it is computationally expensive. It needs a lot of storage and
processing power for detection. It’s also slower than YOLO and SSD.
There are tradeoffs for each. If accuracy isn’t a huge concern, YOLO is the best bet. If your images are
black-and-white or have easily identifiable objects on a clear background, YOLO can be very accurate
in those scenarios. If you have complex images and care about accuracy (such as cancer detection from
X-rays), then R-CNN would be the best fit.
You don’t need to build these models from scratch. They are open source and pre-trained, and they can all
be loaded through OpenCV. OpenCV documentation and samples exist for YOLO, SSD, and Mask R-CNN.
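As a rough sketch of what loading a pre-trained model through OpenCV can look like, the functions below use OpenCV's dnn module with a Darknet-format YOLO model. The file names in the usage note are hypothetical local paths to files you would download yourself, and the 416x416 input size is an assumption that depends on the model's config. The helper `to_pixel_box` assumes the network outputs normalized center-format boxes, which is how Darknet YOLOv3 models are commonly read in OpenCV.

```python
def load_yolo(cfg_path, weights_path):
    """Load a Darknet-format YOLO model with OpenCV's dnn module."""
    import cv2  # imported lazily so the helper below works without OpenCV
    return cv2.dnn.readNetFromDarknet(cfg_path, weights_path)

def run_yolo(net, image, input_size=416):
    """Run one forward pass and return the raw output arrays."""
    import cv2
    blob = cv2.dnn.blobFromImage(
        image, 1 / 255.0, (input_size, input_size), swapRB=True, crop=False
    )
    net.setInput(blob)
    return net.forward(net.getUnconnectedOutLayersNames())

def to_pixel_box(cx, cy, w, h, img_w, img_h):
    """Convert a normalized center-format box to pixel (x, y, width, height)."""
    box_w, box_h = w * img_w, h * img_h
    x = cx * img_w - box_w / 2
    y = cy * img_h - box_h / 2
    return round(x), round(y), round(box_w), round(box_h)

print(to_pixel_box(0.5, 0.5, 0.5, 0.25, 640, 480))  # (160, 180, 320, 120)
```

With the model files in place, usage would look like `net = load_yolo("yolov3.cfg", "yolov3.weights")` followed by `outputs = run_yolo(net, cv2.imread("cows.jpg"))`. You would then filter the output rows by confidence before drawing boxes.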
Wait, if these models are pre-trained, then how do I use them on new data that is specific to
my use case?
You train them further on your own data. For that, you need:
- a custom training dataset that includes the images and the labels/annotations of the objects in each
  image
- a custom test dataset for the model to predict on, with labels/annotations to verify the accuracy of
  its predictions
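One common way to compare a predicted box against an annotated box is intersection over union (IoU). The article does not prescribe a specific metric, so take this as one reasonable choice. The boxes below are made-up examples in (x_min, y_min, x_max, y_max) pixel coordinates.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height of the overlap are clamped at 0 when the boxes do not intersect.
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

annotated = (50, 50, 150, 150)    # hypothetical label from the test dataset
predicted = (100, 100, 200, 200)  # hypothetical model output

score = iou(annotated, predicted)
print(round(score, 3))  # 0.143
```

An IoU of 1.0 means the prediction matches the annotation exactly, and 0.0 means the boxes do not overlap. A threshold such as 0.5 is often used to decide whether a prediction counts as correct.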
This sounds like a lot of work! How would I go about doing this?
You use a tool to add the annotations. After you’re done, you can export the dataset in the format
your model expects.
My favorite tool is Label Studio. You can annotate image objects for computer vision, text for natural
language processing, audio for transcription, and more. I’ve used it only for annotating objects, and it’s
excellent.
You can output the datasets in various formats, including CSV, JSON, XML, and Pascal VOC XML. There’s even a
format specifically for YOLO.
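For a sense of what the YOLO format contains, the sketch below converts one pixel bounding box into a YOLO-style label line. It assumes the commonly used layout of one line per object, `class_id x_center y_center width height`, with all four box values normalized by the image size. The box and image size are made up, and the exact files a given tool exports may differ.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Format one pixel bounding box as a YOLO-style label line."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Hypothetical annotation: class 0 ("cow") in a 640x480 image.
line = to_yolo_line(0, 160, 180, 480, 300, 640, 480)
print(line)  # 0 0.500000 0.500000 0.500000 0.250000
```

Because the values are normalized, the same label stays valid if the image is resized.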
You can read more on how to get started with Label Studio at labelstud.io. It’s super easy to set up. You
install it with the command pip install -U label-studio , and you then launch it with the command
label-studio . The UI is intuitive enough to figure out as you go.
If you need help figuring it out, some other contributors from Towards Data Science have written up their
own methods for annotating images (see towardsdatascience.com).
Now, you have the tools needed to train existing object detection models on your own custom datasets.
References
Practical Machine Learning for Computer Vision — O’Reilly Media (specifically Chapter 4: Object
Detection and Image Segmentation)