Object Detection Using CNN
Fei-Fei Li, Jiajun Wu, Ruohan Gao Lecture 9 - 1 April 26, 2022
Image Classification: A core task in Computer Vision
cat
Computer Vision Tasks
Classification: No spatial extent
Semantic Segmentation: No objects, just pixels
Object Detection: Multiple objects
Instance Segmentation: Multiple objects
This image is CC0 public domain
Semantic Segmentation
Semantic Segmentation: The Problem
Semantic Segmentation Idea: Sliding Window
Full image: extract a patch, then classify its center pixel with a CNN (e.g. Cow, Cow, Grass)
Problem: Very inefficient! Not reusing shared features between overlapping patches
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
Semantic Segmentation Idea: Convolution
Full image
An intuitive idea: encode the entire image with a conv net, and do semantic segmentation on top.
Problem: classification architectures often reduce feature spatial sizes to go deeper, but semantic segmentation requires the output size to be the same as the input size.
Semantic Segmentation Idea: Fully Convolutional
Design a network with only convolutional layers without downsampling operators to make predictions for pixels all at once!
Input: 3 x H x W
Convolutions: D x H x W
Scores: C x H x W
Predictions: H x W
Problem: convolutions at original image resolution will be very expensive ...
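The last step of this pipeline, going from per-class scores (C x H x W) to a label map (H x W), amounts to a per-pixel argmax over the class axis. The following is a minimal pure-Python sketch of that step only. It assumes the scores are stored as nested lists indexed [class][row][col]; the function name is illustrative and not taken from the lecture.

```python
def predict_labels(scores):
    """Turn C x H x W class scores into an H x W map of class indices."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    labels = []
    for i in range(height):
        row = []
        for j in range(width):
            # Pick the class whose score is highest at pixel (i, j).
            best = max(range(num_classes), key=lambda c: scores[c][i][j])
            row.append(best)
        labels.append(row)
    return labels
```

In a real network this argmax is applied only at prediction time; training would compare the scores against per-pixel labels with a classification loss.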
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Downsampling: Pooling, strided convolution
Upsampling: ???
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/8 x W/8
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Scores: C x H x W
Predictions: H x W
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
In-Network upsampling: “Unpooling”
Nearest Neighbor
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4
“Bed of Nails”
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
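Two fixed (non-learned) ways to unpool a 2 x 2 input to a 4 x 4 output are nearest-neighbor copying and the “bed of nails” scheme, which writes each value into the top-left corner of its block and fills the rest with zeros. A small pure-Python sketch of both follows; the function names and the `factor` parameter are assumptions made for illustration.

```python
def nearest_neighbor_unpool(x, factor=2):
    """Repeat every element factor times along both axes."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out


def bed_of_nails_unpool(x, factor=2):
    """Place each element at the top-left of its block; zeros elsewhere."""
    h, w = len(x), len(x[0])
    out = [[0] * (w * factor) for _ in range(h * factor)]
    for i in range(h):
        for j in range(w):
            out[i * factor][j * factor] = x[i][j]
    return out
```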
In-Network upsampling: “Max Unpooling”
Max Pooling: Remember which element was max!
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
Output: 2 x 2
5 6
7 8
… Rest of the network …
Max Unpooling: Use positions from pooling layer
Input: 2 x 2
1 2
3 4
Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4
Corresponding pairs of downsampling and upsampling layers
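Max unpooling can be sketched as a pooling step that records where each maximum came from, paired with an unpooling step that writes values back to those recorded positions. The pure-Python version below is a sketch under simplifying assumptions: a single channel, 2 x 2 windows with stride 2, and even input dimensions. The function names are hypothetical.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2; also returns the (row, col) of each max."""
    pooled, positions = [], []
    for i in range(0, len(x), 2):
        pooled_row, pos_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in range(2) for dj in range(2)]
            value, pos = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            pos_row.append(pos)
        pooled.append(pooled_row)
        positions.append(pos_row)
    return pooled, positions


def max_unpool_2x2(y, positions, out_shape):
    """Write each value of y back to the position remembered during pooling."""
    h, w = out_shape
    out = [[0] * w for _ in range(h)]
    for y_row, pos_row in zip(y, positions):
        for value, (r, c) in zip(y_row, pos_row):
            out[r][c] = value
    return out
```

Running the pooling step on the 4 x 4 input from the slide and then unpooling the 2 x 2 input with the remembered positions reproduces the 4 x 4 output shown there.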
Learnable Upsampling
Recall: Normal 3 x 3 convolution, stride 1 pad 1
Dot product between filter and input
Input: 4 x 4 Output: 4 x 4
Recall: Normal 3 x 3 convolution, stride 2 pad 1
Dot product between filter and input
Input: 4 x 4 Output: 2 x 2
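The two output sizes quoted above follow from the standard output-size formula for a convolution along one axis. The sketch below assumes floor division when the stride does not divide evenly, which is the common convention; the function name is illustrative.

```python
def conv_output_size(n_in, kernel, stride, pad):
    """Spatial output size of a convolution along one axis."""
    return (n_in + 2 * pad - kernel) // stride + 1
```

For a 4 x 4 input and a 3 x 3 filter with pad 1, stride 1 gives 4 and stride 2 gives 2, matching the slides.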
Learnable Upsampling: Transposed Convolution
3 x 3 transposed convolution, stride 2 pad 1
Input gives weight for filter
Sum where output overlaps
Input: 2 x 2 Output: 4 x 4
Q: Why is it called transposed convolution?
Learnable Upsampling: 1D Example
Input: [a, b]
Filter: [x, y, z]
Output: [ax, ay, az + bx, by, bz]
Output contains copies of the filter weighted by the input, summing where they overlap in the output
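The 1D example can be reproduced with a few lines of Python: each input value scales a copy of the filter, copies are placed `stride` positions apart, and overlapping entries are summed. This is a sketch that assumes stride 2 and no padding or output cropping; the function name is hypothetical.

```python
def transposed_conv1d(inputs, kernel, stride=2):
    """1D transposed convolution: sum of filter copies weighted by the input."""
    out_len = (len(inputs) - 1) * stride + len(kernel)
    out = [0] * out_len
    for i, a in enumerate(inputs):
        for k, w in enumerate(kernel):
            # Copy i of the filter starts at output position i * stride.
            out[i * stride + k] += a * w
    return out
```

With input [a, b] = [1, 2] and filter [x, y, z] = [10, 20, 30], the result is [10, 20, 50, 40, 60], where the middle entry is az + bx = 30 + 20.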
Convolution as Matrix Multiplication (1D Example)
We can express convolution in terms of a matrix multiplication
Transposed convolution multiplies by the transpose of the same matrix
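To make the matrix view concrete, the sketch below builds the matrix X for a 1D convolution (implemented, as is usual in deep learning, as cross-correlation) so that multiplying X by the input gives the convolution output, and multiplying the transpose of X by a shorter vector gives the transposed convolution. It is written in pure Python with hypothetical helper names, and assumes zero padding on both sides.

```python
def conv_matrix(kernel, n_in, stride=1, pad=1):
    """Build X (n_out x n_in) so that X times a equals the 1D convolution of a."""
    k = len(kernel)
    n_out = (n_in + 2 * pad - k) // stride + 1
    rows = []
    for o in range(n_out):
        row = [0] * n_in
        for t, w in enumerate(kernel):
            idx = o * stride + t - pad  # position in the unpadded input
            if 0 <= idx < n_in:         # taps landing in the zero padding add nothing
                row[idx] = w
        rows.append(row)
    return rows


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*m)]
```

With stride 2, X maps a length-4 input to a length-2 output, while its transpose maps a length-2 input back to length 4, which is the upsampling behavior described above.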
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Downsampling: Pooling, strided convolution
Upsampling: Unpooling or strided transposed convolution
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/8 x W/8
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Semantic Segmentation: Summary
Semantic Segmentation
Labels in the example images: Sky, Trees, Cat, Cow, Grass
Don’t differentiate instances, only care about pixels
This image is CC0 public domain
Object Detection
Object Detection: Single Object
(Classification + Localization)
Vector: 4096
Fully Connected: 4096 to 1000, giving Class Scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...)
Softmax Loss against the correct label: Cat
Fully Connected: 4096 to 4, giving Box Coordinates (x, y, w, h)
L2 Loss against the correct box: (x’, y’, w’, h’)
Treat localization as a regression problem!
This image is CC0 public domain
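The two heads above each produce a loss: a softmax loss on the class scores and an L2 loss on the box coordinates. A minimal sketch of combining them is shown below. It assumes the network outputs raw (unnormalized) class scores and that the two losses are added with a weight; `box_weight` is a hypothetical hyperparameter, not a value from the lecture.

```python
import math


def softmax_cross_entropy(scores, correct_class):
    """Softmax loss for one example, given raw class scores."""
    shift = max(scores)  # subtract the max for numerical stability
    log_sum = math.log(sum(math.exp(s - shift) for s in scores))
    return -(scores[correct_class] - shift - log_sum)


def l2_loss(pred_box, true_box):
    """Squared L2 distance between predicted and correct (x, y, w, h)."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))


def multitask_loss(scores, correct_class, pred_box, true_box, box_weight=1.0):
    """Classification loss plus weighted localization loss (assumed weighting)."""
    return (softmax_cross_entropy(scores, correct_class)
            + box_weight * l2_loss(pred_box, true_box))
```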
Object Detection: Multiple Objects
Each image needs a different number of outputs!
Image 1: CAT: (x, y, w, h): 4 numbers
Image 2: DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h): 12 numbers
Image 3: DUCK: (x, y, w, h), DUCK: (x, y, w, h), ….
Object Detection: Multiple Objects
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
Example crops:
Dog? NO, Cat? NO, Background? YES
Dog? YES, Cat? NO, Background? NO
Dog? NO, Cat? YES, Background? NO
Region Proposals: Selective Search
● Find “blobby” image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 2000 region
proposals in a few seconds on CPU
R-CNN (“Slow” R-CNN)
Input image
Regions of Interest (RoI) from a proposal method (~2k)
Warped image regions (224x224 pixels)
Forward each region through ConvNet (ImageNet-pretrained)
Classify regions with SVMs
Bbox reg: Predict “corrections” to the RoI: 4 numbers: (dx, dy, dw, dh)
Problem: Very slow! Need to do ~2k independent forward passes for each image!
Idea: Pass the image through convnet before cropping! Crop the conv feature instead!
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
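The slides do not spell out how the four “corrections” (dx, dy, dw, dh) modify an RoI. The sketch below assumes the parametrization used in the R-CNN paper: the center is shifted by a fraction of the box size, and the width and height are scaled in log space. The function name and the center-based (x, y, w, h) box format are assumptions for illustration.

```python
import math


def apply_box_corrections(roi, deltas):
    """Apply (dx, dy, dw, dh) to an RoI given as (x, y, w, h), (x, y) = center."""
    x, y, w, h = roi
    dx, dy, dw, dh = deltas
    return (x + w * dx,        # shift center by a fraction of the width
            y + h * dy,        # shift center by a fraction of the height
            w * math.exp(dw),  # scale width in log space
            h * math.exp(dh))  # scale height in log space
```

With all corrections equal to zero the RoI is returned unchanged, which makes zero a natural starting point for the regressor.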
Fast R-CNN
Input image
“conv5” features for the whole image
Regions of Interest (RoIs) from a proposal method
Crop + Resize features
Per-Region Network: CNN
Object category: Linear + softmax
Box offset: Linear
Shown side by side with “Slow” R-CNN for comparison
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Cropping Features: RoI Pool
Input Image (e.g. 3 x 640 x 480)
CNN
Image features: C x H x W (e.g. 512 x 20 x 15)
Project proposal onto features
“Snap” to grid cells
Divide into 2x2 grid of (roughly) equal subregions
Max-pool within each subregion
Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)
Region features always the same size even if input regions have different sizes!
Problem: Region features slightly misaligned
Girshick, “Fast R-CNN”, ICCV 2015.
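The RoI Pool steps (snap, divide, max-pool) can be sketched for a single feature channel as follows. This is an illustration under stated assumptions rather than the reference implementation: the proposal is already projected into feature-map coordinates as (x0, y0, x1, y1) and lies inside the map, snapping uses floor and ceil (real implementations may round differently), and the function name is hypothetical.

```python
import math


def roi_pool(features, roi, out_size=2):
    """features: H x W nested list (one channel); roi: (x0, y0, x1, y1) in
    feature-map coordinates, possibly fractional. Returns out_size x out_size."""
    height, width = len(features), len(features[0])
    # "Snap" the projected proposal to whole grid cells.
    x0 = max(0, math.floor(roi[0]))
    y0 = max(0, math.floor(roi[1]))
    x1 = min(width, max(x0 + 1, math.ceil(roi[2])))
    y1 = min(height, max(y0 + 1, math.ceil(roi[3])))
    pooled = []
    for by in range(out_size):
        # Integer bin edges give (roughly) equal subregions.
        ys = y0 + (by * (y1 - y0)) // out_size
        ye = max(ys + 1, y0 + ((by + 1) * (y1 - y0)) // out_size)
        row = []
        for bx in range(out_size):
            xs = x0 + (bx * (x1 - x0)) // out_size
            xe = max(xs + 1, x0 + ((bx + 1) * (x1 - x0)) // out_size)
            # Max-pool within the subregion.
            row.append(max(features[y][x]
                           for y in range(ys, min(ye, y1))
                           for x in range(xs, min(xe, x1))))
        pooled.append(row)
    return pooled
```

Regions of different sizes produce outputs of the same out_size x out_size shape, which is what lets the per-region network use fixed-size layers.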
Cropping Features: RoI Align
Input Image (e.g. 3 x 640 x 480)
CNN
Image features: C x H x W (e.g. 512 x 20 x 15)
Project proposal onto features
No “snapping”!
Sample at regular points in each subregion using bilinear interpolation
The feature at a sample point $(x, y)$ is interpolated from its four neighboring grid cells: $f_{11} \in \mathbb{R}^{512}$ at $(x_1, y_1)$, $f_{21} \in \mathbb{R}^{512}$ at $(x_2, y_1)$, $f_{12} \in \mathbb{R}^{512}$ at $(x_1, y_2)$, $f_{22} \in \mathbb{R}^{512}$ at $(x_2, y_2)$
Max-pool within each subregion
Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)
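Bilinear interpolation at a fractional sample point (x, y) blends the four surrounding feature values f11, f21, f12, f22 according to how close the point is to each. The sketch below handles one channel; for 512 channels the same weights would be applied to every channel. Clamping at the border and the function name are assumptions made for this illustration.

```python
import math


def bilinear_sample(features, x, y):
    """features: H x W nested list (one channel); (x, y) is a fractional position
    in feature-map coordinates, with x along columns and y along rows."""
    height, width = len(features), len(features[0])
    x1 = min(max(int(math.floor(x)), 0), width - 1)
    y1 = min(max(int(math.floor(y)), 0), height - 1)
    x2 = min(x1 + 1, width - 1)
    y2 = min(y1 + 1, height - 1)
    wx = min(max(x - x1, 0.0), 1.0)  # horizontal weight toward x2
    wy = min(max(y - y1, 0.0), 1.0)  # vertical weight toward y2
    f11, f21 = features[y1][x1], features[y1][x2]
    f12, f22 = features[y2][x1], features[y2][x2]
    top = f11 * (1 - wx) + f21 * wx
    bottom = f12 * (1 - wx) + f22 * wx
    return top * (1 - wy) + bottom * wy
```

Because the output is a smooth function of (x, y), no snapping is needed and the sampled features stay aligned with the proposal.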
R-CNN vs Fast R-CNN
Problem: Runtime dominated by region proposals!
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Faster R-CNN:
Make CNN do proposals!
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
Figure copyright 2015, Ross Girshick; reproduced with permission
Region Proposal Network
Input Image (e.g. 3 x 640 x 480)
CNN
Image features (e.g. 512 x 20 x 15)
Imagine an anchor box of fixed size at each point in the feature map
Conv on the image features predicts, for each anchor:
Anchor is an object? 1 x 20 x 15
Box corrections: 4 x 20 x 15
In practice use K different anchor boxes of different size / scale at each point
Anchor is an object? K x 20 x 15
Box transforms: 4K x 20 x 15
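The anchor bookkeeping can be sketched as follows: K anchors are placed at every point of the 20 x 15 feature map, and the RPN outputs one objectness score and four box transform values per anchor. The stride of 32 is inferred from the example sizes (640 / 20 and 480 / 15); the anchor sizes, the choice of cell centers as anchor centers, and the function names are assumptions made for illustration.

```python
def make_anchors(feat_h, feat_w, stride, sizes):
    """Return one (cx, cy, w, h) anchor per size at every feature-map point."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Center of feature cell (i, j) in input-image pixels (assumed).
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            for w, h in sizes:
                anchors.append((cx, cy, w, h))
    return anchors


def rpn_output_shapes(feat_h, feat_w, num_anchors):
    """Shapes of the objectness and box-transform outputs for K anchors."""
    return (num_anchors, feat_h, feat_w), (4 * num_anchors, feat_h, feat_w)
```

For K = 3 on a 20 x 15 feature map this gives 900 anchors, with outputs of shape 3 x 20 x 15 and 12 x 20 x 15.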
Faster R-CNN:
Make CNN do proposals!
Faster R-CNN is a two-stage object detector
Do we really need the second stage?
Single-Stage Object Detectors: YOLO / SSD / RetinaNet
Input image: 3 x H x W
Divide image into grid: 7 x 7
Imagine a set of base boxes centered at each grid cell. Here B = 3
Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)
- Looks a lot like RPN, but category-specific!
Output: 7 x 7 x (5 * B + C)
Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
Lin et al, “Focal Loss for Dense Object Detection”, ICCV 2017
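The output size 7 x 7 x (5 * B + C) can be checked with a short sketch that also splits one grid cell's prediction vector into its B box entries and C class scores. The memory layout (all boxes first, then class scores), the default of C = 20, and the function names are assumptions chosen for illustration; actual detectors may order the values differently.

```python
def single_stage_output_shape(grid=7, num_boxes=3, num_classes=20):
    """Output tensor shape: grid x grid x (5 * B + C)."""
    return (grid, grid, 5 * num_boxes + num_classes)


def split_cell_prediction(cell, num_boxes, num_classes):
    """Split one cell's vector into B (dx, dy, dh, dw, confidence) tuples
    followed by C class scores (assumed layout)."""
    boxes = [tuple(cell[5 * b: 5 * b + 5]) for b in range(num_boxes)]
    class_scores = list(cell[5 * num_boxes:])
    assert len(class_scores) == num_classes
    return boxes, class_scores
```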
Object Detection: Lots of variables ...
Backbone Network: VGG16, ResNet-101, Inception V2, Inception V3, Inception ResNet, MobileNet
“Meta-Architecture”: Two-stage: Faster R-CNN; Single-stage: YOLO / SSD; Hybrid: R-FCN
Image Size
# Region Proposals
…
Takeaways:
Faster R-CNN is slower but more accurate
SSD is much faster but not as accurate
Bigger / Deeper backbones work better
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Zou et al, “Object Detection in 20 Years: A Survey”, arXiv 2019
R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017
Instance Segmentation
Object Detection:
Faster R-CNN
Instance Segmentation: Mask Prediction
Mask R-CNN
Mask R-CNN
Classification Scores: C
Box coordinates (per class): 4 * C
Mask prediction (per class): C x 28 x 28
He et al, “Mask R-CNN”, arXiv 2017
Mask R-CNN: Example Mask Training Targets
Mask R-CNN: Very Good Results!
Mask R-CNN
Also does pose
Open Source Frameworks
Lots of good implementations on GitHub!
Detectron2 (PyTorch)
https://github.com/facebookresearch/detectron2
Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN, ...
Beyond 2D Object Detection...
Object Detection + Captioning = Dense Captioning
Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.
Dense Video Captioning
Objects + Relationships = Scene Graphs
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen et al. "Visual genome: Connecting language and vision using
crowdsourced dense image annotations." International Journal of Computer Vision 123,
no. 1 (2017): 32-73.
Scene Graph Prediction
Xu, Zhu, Choy, and Fei-Fei, “Scene Graph Generation by Iterative Message Passing”, CVPR 2017
Figure copyright IEEE, 2018. Reproduced for educational purposes.
3D Object Detection
2D Object Detection:
2D bounding box
(x, y, w, h)
3D Object Detection:
3D oriented bounding box
(x, y, z, w, h, l, r, p, y), where r, p, y are roll, pitch, and yaw
3D Object Detection: Simple Camera Model
A point on the image plane corresponds to a ray in 3D space
3D Object Detection: Monocular Camera
Faster R-CNN
3D Shape Prediction: Mesh R-CNN
Recap: Lots of computer vision tasks!
Classification: No spatial extent
Semantic Segmentation: No objects, just pixels
Object Detection: Multiple objects
Instance Segmentation: Multiple objects
This image is CC0 public domain
Next time: Recurrent Neural Networks