❑ Convolutional Neural Networks (CNNs):
▪ CNN architecture
▪ Different layers
▪ Progress
❑ CNN Architectures for Classification:
▪ AlexNet, VGG, GoogLeNet, ResNet
❑ CNN Architectures for Object Detection:
▪ R-CNN, Fast R-CNN, Faster R-CNN, YOLO
❑ CNN Architectures for Segmentation:
▪ U-Net, SegNet, Mask R-CNN
❑ CNN Architectures for Generation:
▪ Pix2Pix, CycleGAN
Most material is from the CS231n: Convolutional Neural Networks for Visual Recognition course at Stanford
What is Deep Learning (DL)?
• A machine learning subfield concerned with learning representations of data. Exceptionally effective at learning patterns.
• Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.
• If you provide the system tons of information, it begins to understand it and respond in useful ways.
[Link]
How to apply a NN over an image?
Stretch the pixels into a single column vector.
Problem: high dimensionality.
Solution: Convolutional Neural Network.
▪ Also known as CNN, ConvNet, DCN
▪ CNN = a multi-layer neural network with
1. Local connectivity
2. Weight sharing
Global connectivity vs. local connectivity (input layer -> hidden layer)
▪ # input units (neurons): 7
▪ # hidden units: 3
▪ Number of parameters
▪ Global connectivity: 3 x 7 = 21
▪ Local connectivity: 3 x 3 = 9
Without weight sharing vs. with weight sharing (input layer -> hidden layer, each hidden unit sees a 3-input window)
• # input units (neurons): 7
• # hidden units: 3
• Number of parameters
– Without weight sharing: 3 x 3 = 9
– With weight sharing: 3 x 1 = 3
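Local connectivity plus weight sharing is exactly a convolution. Below is a minimal numpy sketch of the 7-input / 3-hidden example above (the filter width of 3 matches the figure; the stride of 2 is an assumption chosen so that exactly 3 hidden units are produced):

```python
import numpy as np

x = np.arange(7, dtype=float)        # 7 input units
w = np.array([0.5, -1.0, 0.25])      # 3 shared weights: the whole parameter set

# Local connectivity: each hidden unit sees only a 3-input window.
# Weight sharing: every window is filtered with the SAME 3 weights.
hidden = np.array([w @ x[i:i + 3] for i in range(0, 5, 2)])  # stride 2 -> 3 hidden units
print(hidden.shape)                  # (3,)
```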
Source: cs231n, Stanford University
Input Layer (input image)
Convolutional Layer
Non-linearity Layer (such as Sigmoid, Tanh, ReLU, PReLU, ELU, Swish, etc.)
Pooling Layer (such as Max Pooling, Average Pooling, etc.)
Fully-Connected Layer
Classification Layer (Softmax, etc.)
32×32×3 image (width 32, height 32, depth 3) -> preserve the spatial structure.
5×5×3 filter: convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”.
Handling multiple input channels: filters always extend the full depth of the input volume.
32×32×3 image, 5×5×3 filter (weight mask).
A single output value is the result of taking a dot product between the filter and a small 5×5×3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product plus a bias): wᵀx + b
Activation map: convolve (slide) the filter over all spatial locations to produce a 28×28×1 activation map.
Handling multiple output maps (32×32×3 image -> activation maps): each 5×5×3 filter produces its own 28×28×1 activation map. A second filter yields a second map, a third filter a third, and so on. With 96 filters in total, the maps stack into an output volume of depth 96 (28×28×96).
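A quick shape check of the slide's numbers in PyTorch (a sketch, not from the slides):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=96, kernel_size=5)  # 96 filters, each 5x5x3
x = torch.randn(1, 3, 32, 32)     # one 32x32x3 image (NCHW layout)
print(conv(x).shape)              # torch.Size([1, 96, 28, 28]) -> depth of output volume: 96
```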
CONVOLUTION AND TRADITIONAL FEATURE EXTRACTION
A bank of K filters applied to the image produces K feature maps.
Image Source: cs231n, Oxford
Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions.
E.g. a 32×32×3 image passed through a CONV layer with 96 5×5×3 filters gives 28×28×96 activation maps. A single 5×5×96 filter applied to these maps again produces one number per spatial location, so a second CONV layer with 128 5×5×96 filters, convolved (slid) over all spatial locations, gives a deeper 24×24×128 stack of activation maps:
32×32×3 -> CONV (96 5×5×3 filters) -> 28×28×96 -> CONV (128 5×5×96 filters) -> 24×24×128 -> ...
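The same two-layer sequence as a runnable PyTorch sketch (layer sizes taken from the slide):

```python
import torch
import torch.nn as nn

# Two CONV layers interspersed with ReLU, matching the slide's dimensions.
net = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=5),    # 32x32x3  -> 28x28x96
    nn.ReLU(),
    nn.Conv2d(96, 128, kernel_size=5),  # 28x28x96 -> 24x24x128
    nn.ReLU(),
)
print(net(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 128, 24, 24])
```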
▪ Local connectivity
▪ Weight sharing
▪ Handling multiple input channels (# input channels)
▪ Handling multiple output maps (# output/activation maps)
Image credit: A. Karpathy
Output size for an N×N input and an F×F filter: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 (not an integer: the filter does not fit)
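A small helper encoding this formula, extended with the zero-padding term (N - F + 2P)/stride + 1 used later in the slides:

```python
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size of a conv layer: (N - F + 2P) / stride + 1."""
    out = (n - f + 2 * pad) / stride + 1
    assert out == int(out), f"filter does not fit: ({n} - {f} + 2*{pad}) / {stride} + 1 = {out}"
    return int(out)

print(conv_output_size(7, 3, 1))  # 5
print(conv_output_size(7, 3, 2))  # 3
# conv_output_size(7, 3, 3) raises: 2.33 is not an integer, stride 3 does not fit
```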
Stacking CONV layers: 32×32×3 -> CONV (96 5×5×3 filters) -> 28×28×96 -> CONV (128 5×5×96 filters) -> 24×24×128 -> ...
E.g. a 32×32 input convolved repeatedly with 5×5 filters shrinks the volumes spatially (32 -> 28 -> 24 ...). Shrinking too fast is not good; it doesn’t work well.
Source: cs231n, Stanford University
Zero padding: e.g. a 7×7 input (spatially) with a 3×3 filter applied at stride 1, padded with a 1-pixel border of zeros => 7×7 output.
In general, it is common to see CONV layers with stride 1, filters of size F×F, and zero-padding with (F - 1)/2, which preserves the spatial size. e.g.
F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
Source: cs231n, Stanford University
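The size-preserving case checked in PyTorch (a sketch; the channel counts are arbitrary):

```python
import torch
import torch.nn as nn

# 7x7 input, 3x3 filter, stride 1, zero-padding (F-1)/2 = 1: spatial size is preserved.
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)
print(conv(torch.randn(1, 1, 7, 7)).shape)  # torch.Size([1, 1, 7, 7])
```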
POOLING LAYER
- makes the representations smaller and more manageable
- operates over each activation map independently
Source: cs231n, Stanford University
MAX POOLING
Backward pass: the upstream gradient is passed back only to the unit with the max value.
Source: cs231n, Stanford University
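A minimal numpy sketch of this gradient routing for a single 2×2 max-pooling window (not from the slides):

```python
import numpy as np

window = np.array([[1., 3.],
                   [2., 4.]])
out = window.max()                      # forward pass: 4.0

# backward pass: the upstream gradient goes only to the position of the max
upstream_grad = 1.0
grad = np.zeros_like(window)
grad[np.unravel_index(window.argmax(), window.shape)] = upstream_grad
print(grad)                             # [[0. 0.], [0. 1.]]
```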
FULLY-CONNECTED LAYER
• Connect every neuron in one layer to every neuron in another layer
• Same as the traditional multi-layer perceptron neural network
• No. of neurons in the last FC layer = no. of classes
Image Source: [Link]
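A minimal PyTorch sketch of the final FC stage (the feature-volume shape and the 10 classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

num_classes = 10                        # one neuron per class in the last FC layer
feat = torch.randn(1, 128, 24, 24)      # feature volume from the conv stack
fc = nn.Linear(128 * 24 * 24, num_classes)
scores = fc(feat.flatten(start_dim=1))  # every input connected to every output neuron
print(scores.shape)                     # torch.Size([1, 10])
```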
NON-LINEARITY LAYER
Source: cs231n, Stanford University
Activation Functions
Non-linearities are needed to learn complex (non-linear) representations of data; otherwise the NN would be just a linear function.
More layers and neurons can approximate more complex functions.
[Link] Full list: [Link]
Activation Function: Sigmoid
Takes a real-valued number and “squashes” it into the range between 0 and 1: ℝⁿ → (0, 1)
+ Nice interpretation as the firing rate of a neuron
• 0 = not firing at all
• 1 = fully firing
- Sigmoid neurons saturate and kill gradients, so the NN will barely learn:
🙁 when the neuron’s activations are 0 or 1, it saturates
🙁 the gradient in these regions is almost zero
🙁 almost no signal will flow to its weights
🙁 if the initial weights are too large, most neurons will saturate
[Link]
Activation Function: Tanh
Takes a real-valued number and “squashes” it into the range between -1 and 1: ℝⁿ → (-1, 1)
- Like sigmoid, tanh neurons saturate
- Unlike sigmoid, the output is zero-centered
- Tanh is a scaled sigmoid: tanh(x) = 2·sigm(2x) − 1
- Shares the other drawbacks of sigmoid
[Link]
Activation Function: ReLU
Takes a real-valued number and thresholds it at zero: f(x) = max(0, x), ℝⁿ → ℝ₊ⁿ
🙂 Most deep networks use ReLU nowadays
🙂 Trains much faster
• accelerates the convergence of SGD
• due to its linear, non-saturating form
🙂 Less expensive operations
• compared to sigmoid/tanh (exponentials etc.)
• implemented by simply thresholding a matrix at zero
🙂 More expressive
🙂 Prevents the vanishing gradient problem for positive inputs
[Link]
Activation Function: Swish
- f(x) = x · sigmoid(βx)
- ReLU is a special (limiting) case of Swish, recovered as β → ∞
Ramachandran et al. "Swish: a self-gated activation function." ICLR Workshops, 2018.
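Minimal numpy definitions of the four activations discussed above, for side-by-side comparison (a sketch, not from the slides):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(0.0, x)
def swish(x, beta=1.0): return x * sigmoid(beta * x)  # ReLU recovered as beta -> infinity

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, swish):
    print(f.__name__, f(x).round(3))
```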
Regularization: Dropout
• Randomly drop units (along with their connections) during training
• Each unit is retained with a fixed probability p, independent of the other units
• Hyper-parameter p to be chosen (tuned)
Srivastava, Nitish, et al. “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research (2014)
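A common “inverted dropout” sketch of this idea (scaling the survivors by 1/p at training time is a standard implementation choice, not the paper's exact pseudocode):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, p=0.5):
    """Inverted dropout: keep each unit with probability p and scale the
    survivors by 1/p, so no extra rescaling is needed at test time."""
    mask = (rng.random(x.shape) < p) / p
    return x * mask

x = np.ones(8)
print(dropout_train(x))  # some units zeroed, survivors scaled to 1/p = 2.0
```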
Batch Normalization
“We want zero-mean unit-variance activations? Let’s make them so.”
Consider a batch of activations at some layer. To make each dimension zero-mean unit-variance, apply:
x̂ = (x − E[x]) / √(Var[x])
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [Ioffe and Szegedy 2015]. Source: cs231n
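A numpy sketch of the normalization above (the ε term and the learnable scale/shift γ, β follow the paper; the exact values here are illustrative):

```python
import numpy as np

def batchnorm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature dimension over the batch to zero mean / unit
    variance, then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(128, 64) * 3.0 + 5.0   # batch of 128 activations, 64 features
y = batchnorm_forward(x)
print(y.mean(axis=0)[:3].round(4), y.var(axis=0)[:3].round(4))  # ~0, ~1
```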
▪ SVM Classifier
▪ SVM Loss/Hinge Loss/Max-margin Loss
▪ Softmax Classifier
▪ Softmax Loss/Cross-entropy Loss
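A minimal numpy sketch of the softmax classifier and its cross-entropy loss (the example scores are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([3.2, 5.1, -1.7])    # unnormalized class scores
probs = softmax(scores)
loss = -np.log(probs[0])               # cross-entropy loss if class 0 is correct
print(probs.round(3), round(loss, 2))  # [0.13  0.869 0.001] 2.04
```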
SUMMARY: CNN PIPELINE
Input image -> Convolution (learned) -> Non-linearity -> Spatial pooling (max or average) -> Feature maps
Source: R. Fergus, Y. LeCun; Stanford cs231n
Softmax layer: softmax(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ), converting class scores into probabilities.
IMAGENET CHALLENGE
• ~14 million labeled images, 20k classes
• Images gathered from the Internet
• Human labels via Amazon MTurk
• Challenge: 1.2 million training images, 1000 classes
[Link]/challenges/LSVRC/
[Chart] ImageNet image classification top-5 error (%) of successive winning entries: 16.4 -> 11.7 -> 7.3 -> 6.7 -> 3.57 -> 3.06 -> 2.251
Best non-ConvNet in 2012: 26.2%
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- batch size 128
- SGD momentum 0.9
- learning rate 0.01, reduced manually when val accuracy saturates
Krizhevsky, et al. Imagenet classification with deep convolutional neural networks. NIPS 2012. Source: cs231n
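As a sanity check, CONV1's spatial size and parameter count follow from the output-size formula given earlier:

```python
def conv_out(n, f, stride, pad):
    """Spatial output size: (N - F + 2P) / stride + 1."""
    return (n - f + 2 * pad) // stride + 1

print(conv_out(227, 11, stride=4, pad=0))  # 55 -> [55x55x96]
# CONV1 parameters: 96 filters of size 11x11x3, plus 96 biases
print(96 * (11 * 11 * 3) + 96)             # 34944
```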
VGGNet: small filters, deeper networks
8 layers (AlexNet) -> 16-19 layers (VGGNet)
Only 3x3 CONV at stride 1, pad 1, and 2x2 MAX POOL at stride 2
VGG16 (source)
Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR 2015.
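A quick parameter comparison behind the small-filter design (a standard argument from the VGG paper: three stacked 3×3 conv layers have the same effective receptive field as one 7×7 layer, with fewer parameters and more non-linearities in between):

```python
# For C input channels and C output channels per layer (biases ignored):
C = 256
three_3x3 = 3 * (3 * 3 * C * C)   # three stacked 3x3 conv layers
one_7x7 = 7 * 7 * C * C           # one 7x7 conv layer, same receptive field
print(three_3x3, one_7x7)         # 1769472 vs 3211264
```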
GoogLeNet: “Inception module”: design a good local network topology and then stack these modules on top of each other.
Szegedy, Christian, et al. “Going deeper with convolutions.” CVPR 2015. Source: cs231n
Full ResNet architecture:
- Stack residual blocks; every residual block has two 3x3 conv layers
- Periodically, double the # of filters and downsample spatially using stride 2 (/2 in each dimension), e.g. 3x3 conv with 64 filters -> 3x3 conv with 128 filters, /2 spatially with stride 2
- Additional conv layer at the beginning
- Global average pooling layer after the last conv layer
- No FC layers at the end, besides FC 1000 to output class scores
He et al. Deep Residual Learning for Image Recognition, IEEE CVPR 2016. Source: cs231n
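A simplified PyTorch sketch of one residual block (batch-norm placement follows the common conv-BN-ReLU convention; this is an illustration, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convs plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)          # F(x) + x: the shortcut connection

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```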
R-CNN:
- Input image -> region proposals -> warped image regions
- Forward each region through a ConvNet
- Classify the regions with SVMs
R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014. Source: R. Girshick
Fast R-CNN:
- Forward the whole image through the ConvNet to get a “conv5” feature map of the image
- Region proposals are pooled from the feature map by an “RoI Pooling” layer
- Fully-connected layers (FCs)
- Softmax classifier + linear bounding-box regressors
[Link]
Faster R-CNN: ONE NETWORK, FOUR LOSSES
- The image goes through a CNN to produce a feature map
- A Region Proposal Network generates proposals from the feature map (classification loss + bounding-box regression loss)
- RoI pooling over the proposals feeds the final heads, which add the object classification loss and bounding-box regression loss
Source: R. Girshick, K. He
▪ YOLO: very efficient, but lower accuracy
Redmon et al. You only look once: Unified, real-time object detection. CVPR 2016.
Semantic segmentation: design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network.
Downsampling: pooling, strided convolution. Upsampling: unpooling or strided transpose convolution.
Long et al. Fully Convolutional Networks for Semantic Segmentation, CVPR 2015; Noh et al. Learning Deconvolution Network for Semantic Segmentation, ICCV 2015. Slide credit: cs231n
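A shape-only PyTorch sketch of the downsample/upsample pair (kernel sizes and channel counts are illustrative):

```python
import torch
import torch.nn as nn

# Downsampling with a strided conv, then upsampling back with a strided
# transpose convolution.
down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(16, 3, kernel_size=2, stride=2)

x = torch.randn(1, 3, 64, 64)
h = down(x)
print(h.shape)        # torch.Size([1, 16, 32, 32])
print(up(h).shape)    # torch.Size([1, 3, 64, 64])
```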
U-Net: O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015
SegNet: drop the FC layers, get better results.
V. Badrinarayanan, A. Kendall and R. Cipolla, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation, PAMI 2017
Instance Segmentation: Mask R-CNN (He et al., ICCV 2017)
- Outputs per region: classification scores (C) and box coordinates (4 × C, per class)
- Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding-box recognition.
- Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
Ian Goodfellow et al., “Generative Adversarial Nets”, NIPS 2014. Fake and real images copyright Emily Denton et al. 2015. Credit: cs231n, Stanford
[Link]