Chapter 7. Object Recognition
Hà Nội, 2021 1
Chapter 7. Object Recognition
❖1. Introduction
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 2
1. Introduction
▪ Object recognition: localizing and classifying objects in images.
▪ General concept:
➢ training datasets contain images with known, labelled objects;
➢ the chosen algorithm extracts different types of information (colours, edges, geometric forms) from these images;
➢ for any new image, the same information is gathered and compared to the training dataset to find the most suitable classification.
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 3
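A minimal sketch of this train-then-compare idea, assuming the images have already been reduced to feature vectors; the vectors, labels and the use of scikit-learn's k-nearest-neighbour classifier are illustrative, not part of the original slides.

```python
# Minimal sketch of the train-then-compare idea (hypothetical feature vectors).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Pretend each training image has already been reduced to a feature vector
# (colours, edge statistics, geometric measurements, ...).
train_features = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
train_labels = ["car", "car", "pedestrian", "pedestrian"]

classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_features, train_labels)

# For a new image, gather the same kind of information and pick the closest class.
new_feature = np.array([[0.85, 0.15]])
print(classifier.predict(new_feature))   # -> ['car']
```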
1. Introduction
▪ Applications:
➢ robots in industrial environments,
➢ face or handwriting recognition
➢ autonomous systems such as modern cars, which use object recognition for pedestrian detection, emergency brake assist and so on,
➢…
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 4
1. Introduction
▪ General Object Recognition Strategies: Appearance-based Method
➢ Typical applications: face or handwriting recognition
➢ Uses reference training images
➢ This dataset is compressed to obtain a lower-dimensional subspace, also called the eigenspace.
➢ Parts of new input images are projected onto the eigenspace and the correspondence is then examined.
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 6
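A small sketch of the eigenspace idea, assuming flattened grayscale training images and a plain SVD; the image size, the number of kept eigenvectors and the random data are illustrative only.

```python
# Sketch of the eigenspace (subspace) idea on hypothetical flattened images.
import numpy as np

train = np.random.rand(100, 64 * 64)          # 100 hypothetical 64x64 training images
mean = train.mean(axis=0)
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
eigenspace = Vt[:20]                          # keep the 20 strongest eigenvectors

def project(image_vec):
    """Project an image (or patch) onto the low-dimensional eigenspace."""
    return eigenspace @ (image_vec - mean)

# A new image is projected and compared to the projected training set.
coords_train = (train - mean) @ eigenspace.T
new_image = np.random.rand(64 * 64)
coords_new = project(new_image)
best_match = np.argmin(np.linalg.norm(coords_train - coords_new, axis=1))
```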
1. Introduction
▪ General Object Recognition Strategies: Feature-based Method
➢ Uses features that are characteristic of each object:
➢ colours, contour lines, geometric forms or edges.
➢ The basic concept of feature-based object recognition strategies is the following:
• every input image is searched for a specific type of feature;
• the detected features are then compared to a database containing models of the objects, in order to verify whether any objects are recognised.
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 7
1. Introduction
A neural network containing one input layer, two hidden layers and one output layer.
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 14
1. Introduction
▪ Performance Analysis
➢ Invariances and Robustness
➢ Complexity
➢ Reliability and Accuracy
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 18
1. Introduction
❖ Performance Analysis: Invariances and Robustness
▪ First, the algorithms are analysed to check which invariances they offer and what level of robustness they have.
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 19
1. Introduction
❖ Performance Analysis: Complexity
▪ Second, the algorithms are compared with regard to complexity, especially in terms of computational load and memory usage.
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 22
1. Introduction
❖ Performance Analysis: Reliability and Accuracy
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 26
Chapter 7. Object Recognition
❖2. Pattern Matching
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 28
2. Pattern Matching
❖ Template matching is a technique for finding the areas of an image that match (are similar to) a template image (patch).
❖ How does it work?
▪ We need two primary components:
▪ Source image (I): the image in which we expect to find a match to the template image.
▪ Template image (T): the patch image which will be compared to the source image.
▪ Our goal is to detect the highest-matching area.
https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 29
2. Pattern Matching
❖ Template matching
https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 30
2. Pattern Matching
❖ Template matching
▪ To identify the matching area, we have to compare the template image against the source
image by sliding it:
https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 31
2. Pattern Matching
❖ Template matching
▪ By sliding, we mean moving the patch one pixel at a time (left to right, top to bottom). At each location, a metric is calculated that represents how "good" or "bad" the match at that location is (or how similar the patch is to that particular area of the source image).
▪ For each location of T over I, you store the metric in the result matrix R. Each location (x,y) in R contains the match metric.
https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 32
2. Pattern Matching
❖ Template matching
https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 33
2. Pattern Matching
❖ Template matching
▪ The image above shows the result R of sliding the patch with the metric TM_CCORR_NORMED. The brightest locations indicate the highest matches. As you can see, the location marked by the red circle is probably the one with the highest value, so that location (the rectangle formed by that point as a corner, with width and height equal to the patch image) is considered the match.
https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 34
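A short sketch of this workflow with OpenCV's matchTemplate, using the TM_CCORR_NORMED metric mentioned above; the file names are placeholders.

```python
# Sketch of OpenCV template matching (file names are placeholders).
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)       # source image I
templ = cv2.imread("patch.png", cv2.IMREAD_GRAYSCALE)     # template image T

# Slide T over I and store the match metric at every location in R.
R = cv2.matchTemplate(img, templ, cv2.TM_CCORR_NORMED)

# For TM_CCORR_NORMED the best match is the maximum of R
# (for TM_SQDIFF-style metrics it would be the minimum).
_, max_val, _, max_loc = cv2.minMaxLoc(R)
h, w = templ.shape
top_left = max_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
cv2.rectangle(img, top_left, bottom_right, 255, 2)
cv2.imwrite("result.png", img)
```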
2. Pattern Matching
❖ Template matching
▪ Which matching methods are available in OpenCV? There are six: TM_SQDIFF, TM_CCORR and TM_CCOEFF, each in a plain and a _NORMED (normalised) variant.
https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 35
2. Pattern Matching
❖ Template matching
▪ Which are the matching methods available in OpenCV?
https://docs.opencv.org/4.3.0/de/da9/tutorial_template_matching.html 36
Chapter 7. Object Recognition
❖3. Feature-based Methods
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 37
3. Feature-based Methods
▪ Feature Detectors
▪ Feature Descriptors
▪ Feature Matching
38
3. Feature-based Methods
❖ Feature detectors
Image pairs with extracted patches below. Notice how some patches
can be localized or matched with higher accuracy than others.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 39
3. Feature-based Methods
❖ Feature detectors
▪ The simplest possible matching criterion for comparing two image patches is the weighted summed square difference,
$E_{\mathrm{WSSD}}(\mathbf{u}) = \sum_i w(\mathbf{x}_i)\,[I_1(\mathbf{x}_i + \mathbf{u}) - I_0(\mathbf{x}_i)]^2$,
where $I_0$ and $I_1$ are the two images being compared, $\mathbf{u} = (u, v)$ is the displacement vector, $w(\mathbf{x})$ is a spatially varying weighting (or window) function, and the summation $i$ is over all the pixels in the patch.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 40
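A small numpy sketch of this criterion, assuming grayscale arrays, an odd-sized window w centred at (x0, y0), and a patch that stays inside both images.

```python
import numpy as np

def wssd(I0, I1, x0, y0, u, v, w):
    """E_WSSD = sum_i w(x_i) * (I1(x_i + u) - I0(x_i))**2 over a patch centred at (x0, y0)."""
    r = w.shape[0] // 2                      # patch "radius" from the (odd) window size
    p0 = I0[y0 - r:y0 + r + 1, x0 - r:x0 + r + 1]
    p1 = I1[y0 + v - r:y0 + v + r + 1, x0 + u - r:x0 + u + r + 1]
    return np.sum(w * (p1 - p0) ** 2)

# Example: a flat 5x5 weighting window, comparing a patch at (20, 20) under displacement (1, 0).
I0 = np.random.rand(40, 40)
I1 = np.roll(I0, 1, axis=1)                  # I1 is I0 shifted right by one pixel
w = np.ones((5, 5)) / 25.0
print(wssd(I0, I1, 20, 20, 1, 0, w))         # close to 0 for the correct displacement
```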
3. Feature-based Methods
❖ Feature detectors
Aperture problems for different image patches: (a) stable (“corner-like”) flow; (b) classic aperture problem
(barber-pole illusion); (c) textureless region. The two images I0 (yellow) and I1 (red) are overlaid. The red
vector u indicates the displacement between the patch centers and the w(xi) weighting function (patch
window) is shown as a dark circle.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 41
3. Feature-based Methods
❖ Feature detectors
▪ auto-correlation function or surface
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 42
3. Feature-based Methods
❖ Feature detectors
▪ auto-correlation function or surface
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 43
3. Feature-based Methods
❖ Feature detectors
▪ Förstner–Harris
Interest operator responses: (a) Sample image, (b) Harris response, and (c) DoG response. The circle sizes
and colors indicate the scale at which each interest point was detected. Notice how the two detectors
tend to respond at complementary locations.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 44
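A brief sketch of computing the Harris interest operator response with OpenCV's cornerHarris; the parameter values and the threshold are typical choices, not prescribed by the slides.

```python
# Sketch: Harris corner response with OpenCV (file name is a placeholder).
import cv2
import numpy as np

gray = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# blockSize: neighbourhood for the auto-correlation matrix, ksize: Sobel aperture,
# k: Harris detector free parameter.
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep strong local responses as interest points.
corners = np.argwhere(response > 0.01 * response.max())
```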
3. Feature-based Methods
❖ Feature detectors
▪ Adaptive non-maximal suppression (ANMS)
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 45
3. Feature-based Methods
❖ Feature detectors
▪ Scale invariance
Multi-scale oriented patches (MOPS) extracted at five pyramid levels (Brown, Szeliski, and
Winder 2005). The boxes show the feature orientation and the region from which the
descriptor vectors are sampled.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 46
3. Feature-based Methods
❖ Feature detectors
▪ Scale invariance
Scale-space feature detection using a sub-octave Difference of Gaussian pyramid (Lowe 2004): (a) Adjacent
levels of a sub-octave Gaussian pyramid are subtracted to produce Difference of Gaussian images; (b) extrema
(maxima and minima) in the resulting 3D volume are detected by comparing a pixel to its 26 neighbors.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 47
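A rough sketch of building a sub-octave Difference-of-Gaussian stack with OpenCV; the base sigma, scale step and number of levels are illustrative, not Lowe's exact settings.

```python
# Sketch of a Difference-of-Gaussian scale space (sigma schedule is illustrative).
import cv2
import numpy as np

gray = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

k = 2 ** 0.5                                   # sub-octave scale step
sigmas = [1.6 * k ** i for i in range(5)]
gaussians = [cv2.GaussianBlur(gray, (0, 0), s) for s in sigmas]

# Adjacent Gaussian levels are subtracted to form the DoG images;
# extrema are then sought across space and scale (26 neighbours per pixel).
dogs = [g1 - g0 for g0, g1 in zip(gaussians[:-1], gaussians[1:])]
```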
3. Feature-based Methods
❖ Feature detectors
▪ Rotational invariance and orientation estimation
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 48
3. Feature-based Methods
❖ Feature detectors
▪ Rotational invariance and orientation estimation
Affine region detectors used to match two images taken from dramatically different viewpoints
(Mikolajczyk and Schmid 2004)
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 49
3. Feature-based Methods
❖ Feature detectors
▪ Affine invariance
Affine normalization using the second moment matrices, as described by Mikolajczyk, Tuytelaars, Schmid et al. (2005): after image coordinates are transformed using the matrices $A_0^{-1/2}$ and $A_1^{-1/2}$, they are related by a pure rotation R, which can be estimated using a dominant orientation technique.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 50
3. Feature-based Methods
❖ Feature detectors
▪ Affine invariance
Maximally stable extremal regions (MSERs) extracted and matched from a number of images
(Matas, Chum, Urban et al. 2004)
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 51
3. Feature-based Methods
▪ Feature Detectors
▪ Feature Descriptors
▪ Feature Matching
52
3. Feature-based Methods
❖ Feature descriptors
Feature matching: how can we extract local descriptors that are invariant to inter-image variations and yet
still discriminative enough to establish correct correspondences?
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 53
3. Feature-based Methods
❖ Feature descriptors
▪ Bias and gain normalization (MOPS)
MOPS descriptors are formed using an 8×8 sampling of bias and gain normalized intensity values, with a
sample spacing of five pixels relative to the detection scale (Brown, Szeliski, and Winder 2005). This low
frequency sampling gives the features some robustness to interest point location error and is achieved by
sampling at a higher pyramid level than the detection scale.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 54
3. Feature-based Methods
❖ Feature descriptors
▪ Scale invariant feature transform (SIFT)
A schematic representation of Lowe’s
(2004) scale invariant feature transform
(SIFT): (a) Gradient orientations and
magnitudes are computed at each pixel
and weighted by a Gaussian fall-off
function (blue circle). (b) A weighted
gradient orientation histogram is then
computed in each subregion, using trilinear
interpolation. While this figure shows an 8
× 8 pixel patch and a 2 × 2 descriptor array,
Lowe’s actual implementation uses 16 × 16
patches and a 4 × 4 array of eight-bin
histograms.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 55
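A minimal sketch of extracting SIFT keypoints and descriptors with OpenCV; it assumes OpenCV ≥ 4.4 (where SIFT_create is available) and a placeholder file name.

```python
# Sketch: detecting SIFT keypoints and computing descriptors with OpenCV.
import cv2

gray = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
# Each descriptor is a 128-dimensional vector (4 x 4 subregions x 8 orientation bins).
print(len(keypoints), descriptors.shape)
```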
3. Feature-based Methods
❖ Feature descriptors
▪ Gradient location-orientation histogram (GLOH)
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 56
3. Feature-based Methods
❖ Feature descriptors
Spatial summation blocks for SIFT, GLOH, and some newly developed feature descriptors (Winder and Brown 2007): (a)
The parameters for the new features, e.g., their Gaussian weights, are learned from a training database of (b) matched
real-world image patches obtained from robust structure from motion applied to Internet photo collections (Hua, Brown,
and Winder 2007).
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 57
3. Feature-based Methods
▪ Feature Detectors
▪ Feature Descriptors
▪ Feature Matching
58
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates
Recognizing objects in a cluttered scene (Lowe 2004). Two of the training images in the database are shown on the left.
These are matched to the cluttered scene in the middle using SIFT features, shown as small squares in the right image.
The affine warp of each recognized database image onto the scene is shown as a larger parallelogram in the right image.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 59
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 60
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates
The number of matches correctly and incorrectly estimated by a feature matching algorithm, showing the number of
true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The columns sum up to the actual
number of positives (P) and negatives (N), while the rows sum up to the predicted number of positives (P’) and
negatives (N’). The formulas for the true positive rate (TPR), the false positive rate (FPR), the positive predictive value
(PPV), and the accuracy (ACC) are given in the text.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 61
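A tiny sketch of the rates named in the caption, computed from raw confusion counts; the example numbers are made up.

```python
# The standard definitions of the rates described above.
def matching_rates(tp, fp, fn, tn):
    tpr = tp / (tp + fn)                    # true positive rate (recall)
    fpr = fp / (fp + tn)                    # false positive rate
    ppv = tp / (tp + fp)                    # positive predictive value (precision)
    acc = (tp + tn) / (tp + fp + fn + tn)   # accuracy
    return tpr, fpr, ppv, acc

print(matching_rates(tp=18, fp=4, fn=2, tn=76))
```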
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 62
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates
ROC curve and its related rates: (a) The ROC curve plots the true positive rate against the false positive rate for a particular combination of feature extraction and matching algorithms. Ideally, the true positive rate should be close to 1, while the false positive rate is close to 0. The area under the ROC curve (AUC) is often used as a single (scalar) measure of algorithm performance. Alternatively, the equal error rate is sometimes used. (b) The distribution of positives (matches) and negatives (non-matches) as a function of inter-feature distance d. As the threshold θ is increased, the number of true positives (TP) and false positives (FP) increases.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 63
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 64
3. Feature-based Methods
❖ Feature matching
▪ Matching strategy and error rates
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 65
3. Feature-based Methods
❖ Feature matching
▪ Nearest neighbor distance ratio:
$\mathrm{NNDR} = d_1 / d_2 = \|D_A - D_B\| \,/\, \|D_A - D_C\|$,
where $d_1$ and $d_2$ are the nearest and second nearest neighbor distances, $D_A$ is the target descriptor, and $D_B$ and $D_C$ are its closest two neighbors.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 66
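A short sketch of applying this ratio test with OpenCV's brute-force matcher; the file names are placeholders and the 0.8 threshold is Lowe's commonly quoted suggestion, not a value taken from the slides.

```python
# Sketch: NNDR (ratio test) matching of SIFT descriptors between two images.
import cv2

img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
_, des1 = sift.detectAndCompute(img1, None)
_, des2 = sift.detectAndCompute(img2, None)

bf = cv2.BFMatcher(cv2.NORM_L2)
candidates = bf.knnMatch(des1, des2, k=2)      # d1 and d2 for each target descriptor D_A

# Keep a match only when the nearest neighbour is clearly better than the second nearest.
good = [m1 for m1, m2 in candidates if m1.distance / m2.distance < 0.8]
```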
3. Feature-based Methods
❖ Feature matching
Performance of the feature descriptors evaluated by Mikolajczyk and Schmid (2005), shown for three matching
strategies: (a) fixed threshold; (b) nearest neighbor; (c) nearest neighbor distance ratio (NNDR). Note how the
ordering of the algorithms does not change that much, but the overall performance varies significantly between the
different matching strategies.
Richard Szeliski, Computer Vision Algorithms and Applications, Springer-Verlag London Limited 2011. 67
Chapter 7. Object Recognition
❖4. Artificial Neural Networks
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 68
4. Artificial Neural Networks
❖ CNN - Convolutional Neural Network
▪ (Deep) convolutional neural networks (CNNs): the term deep means that there is at least one hidden layer, and convolutional implies the use of convolution layers. The basic principles of CNNs are inspired by the biological visual cortex of humans.
▪ The architecture of an example CNN can be seen on Slide 70. Input images with 28x28 pixels are convolved with a filter to obtain 3D feature maps. The succeeding sub-sampling (often called pooling) layer further reduces the amount of data. This procedure is continued until a one-dimensional vector, which represents the different classes, is obtained.
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 69
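A compact sketch of such a 28x28 pipeline, written in PyTorch as one possible framework; the number of filters and layer sizes are illustrative and not the exact architecture shown on Slide 70.

```python
# Sketch: convolution + pooling stages ending in a 1D class-score vector.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1x28x28 -> 8x28x28 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 8x28x28 -> 8x14x14 (sub-sampling)
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # 8x14x14 -> 16x14x14
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16x14x14 -> 16x7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # one-dimensional vector of class scores
)

scores = model(torch.randn(1, 1, 28, 28))        # -> shape (1, 10)
```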
4. Artificial Neural Networks
❖ CNN - Convolutional Neural Network
One example architecture of a convolutional neural network using subsampling and convolution hidden layers.
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 70
4. Artificial Neural Networks
❖ CNN - Convolutional Neural Network
Intermediate results from hidden layers. From left to right: low-level, mid-level and high-level features.
Simon Achatz, State of the art of object recognition techniques, Technische Universität München. 72
4. Artificial Neural Networks
A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers.
Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this
example, the red input layer holds the image, so its width and height would be the dimensions of the image,
and the depth would be 3 (Red, Green, Blue channels).
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 75
4. Artificial Neural Networks
➢ Convolutional Layer
▪ Stride
Stride of 1 with a 3x3 filter on a 7x7 image; stride of 2 with a 3x3 filter on a 7x7 image.
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 86
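The output width follows (N − F)/stride + 1 for an NxN image and FxF filter with no padding; a quick check of the 7x7 example above.

```python
# Output size of a convolution without padding.
def conv_output_size(n, f, stride):
    assert (n - f) % stride == 0, "filter does not fit evenly"
    return (n - f) // stride + 1

print(conv_output_size(7, 3, 1))   # 5: stride 1, 3x3 filter on a 7x7 image
print(conv_output_size(7, 3, 2))   # 3: stride 2, 3x3 filter on a 7x7 image
```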
4. Artificial Neural Networks
➢ Convolutional Layer
▪ Stride
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 87
4. Artificial Neural Networks
➢ Convolutional Layer
▪ Padding (Zero-padding)
• same convolution: preserves the dimensions of the image
• wide convolution: adds zero-padding
• narrow convolution: uses no zero-padding
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 89
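A small sketch of "same" zero-padding, assuming an odd filter size F so that P = (F − 1)/2 is an integer.

```python
# "Same" convolution: pad with (F - 1) / 2 zeros so the spatial size is preserved.
import numpy as np

def pad_same(image, f):
    p = (f - 1) // 2                       # assumes an odd filter size F
    return np.pad(image, p, mode="constant")

img = np.random.rand(7, 7)
print(pad_same(img, 3).shape)              # (9, 9): a 3x3 filter then gives 7x7 back
```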
4. Artificial Neural Networks
➢ Convolutional Layer
▪ Number of filter (depth of next layer)
• Example: a 6x6x3 image with four 3x3 filters.
• After convolving, we get a 4x4xn output, where n depends on the number of filters (in other words, the number of feature detectors) used. In this case, n is 4.
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 90
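The same 6x6x3, four-filter example checked with PyTorch shapes; PyTorch is used here only as a convenient way to verify the dimensions.

```python
# Four 3x3 filters over a 3-channel 6x6 image give a 4x4x4 output volume.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3)  # four 3x3(x3) filters
x = torch.randn(1, 3, 6, 6)                                     # one 6x6x3 image
print(conv(x).shape)                                            # torch.Size([1, 4, 4, 4])
```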
4. Artificial Neural Networks
➢ Convolutional Layer
▪ Size of the filter
• The filter size is usually an odd number, so that the filter has a "central pixel"/"central vision" that marks its position.
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 91
4. Artificial Neural Networks
➢ Activation Function
▪ ReLU Activation Function
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 96
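A one-line sketch of ReLU, f(x) = max(0, x), in numpy.

```python
# ReLU simply clamps negative activations to zero.
import numpy as np

def relu(x):
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [0.  0.  0.  1.5]
```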
4. Artificial Neural Networks
Max-pooling on a 2D image.
Stanford University CS231n: Convolutional Neural Networks for Visual Recognition, 2020 97
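A pure-numpy sketch of 2x2 max-pooling with stride 2 on a made-up 4x4 feature map.

```python
# 2x2 max-pooling with stride 2: keep the maximum of each 2x2 block.
import numpy as np

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 9, 4, 3],
              [6, 5, 2, 1]])

pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 7]
                #  [9 4]]
```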