
10707

Deep Learning: Spring 2021


Andrej Risteski
Machine Learning Department

Lecture 4: Convolutional architectures
Used Resources

Disclaimer: Some material in this lecture was borrowed from:

Hugo Larochelle’s class on Neural Networks:


https://sites.google.com/site/deeplearningsummerschool2016/

Rob Fergus’ CIFAR/MLSS tutorial on ConvNets:


http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf

Marc'Aurelio Ranzato’s CVPR 2014 tutorial on Convolutional Nets


https://sites.google.com/site/lsvrtutorialcvpr14/home/deeplearning
Neural networks for vision
The prototypical task in vision is object recognition: given an input
image, identify what kind of object it contains.

Are feedforward networks the right architecture for this?


Desiderata for networks for vision
֍ Inputs are very high-dimensional: 150 x 150 pixels = 22500 inputs,
or 3 x 22500 if RGB pixels instead of grayscale.
֍ Should leverage the spatial locality (in the pixel sense) of data
֍ Build in invariance to natural variations: translation, illumination,
etc.

Convolutional architectures are designed for this:


֍ Local connectivity (reflects spatial locality and decreases # params)
֍ Parameter sharing (further decreases # params)
֍ Convolution
֍ Pooling / subsampling hidden units
Local Connectivity
Use local connectivity of hidden units
֍ Each hidden unit is connected only to a sub-region (patch) of the input image.

֍ It is connected to all channels: 1 if grayscale, 3 (R, G, B) if a color image.

Why this is a good idea:

֍ A fully connected layer has a lot of parameters to fit, which requires a lot of training data.

֍ Image data isn’t arbitrary: neighboring pixels are “meaningfully related” – e.g., a unit meant to be a “dog nose” detector only needs to look at a small patch of pixels.
Decrease in # of parameters
Fully connected: 200x200 image, 40K hidden units → ~2B parameters! (40,000 inputs x 40,000 units)
Convolutional: 200x200 image, 40K hidden units, window size 10x10 → ~4M parameters! (40,000 units x 100 weights each)
Parameter Sharing
The prior approach makes the weights sensitive to translations: e.g., we might learn weights for a nose detector at one image location, but not learn weights for a nose detector at another.
Parameter Sharing
Share a matrix of parameters across some units
֍ Units that are organized into the same “feature map” share parameters

֍ Hidden units within a feature map cover different positions in the image

Wij is the matrix connecting the ith input channel with the jth feature map (in the figure, same color = same matrix of connections).
Desiderata for networks for vision
Our goal is to design neural networks that are specifically adapted
for such problems

֍ Must deal with very high-dimensional inputs: 150 x 150 pixels =


22500 inputs, or 3 x 22500 if RGB pixels
֍ Can exploit the 2D topology of pixels (or 3D for video data)
֍ Can build in invariance to certain variations: translation,
illumination, etc.

Convolutional networks leverage these ideas


֍ Local connectivity
֍ Parameter sharing
֍ Convolution
֍ Pooling / subsampling hidden units
Discrete Convolution
Each feature map forms a 2D grid of features.
It can be computed with a discrete convolution (∗) of a kernel matrix kij with the input channels:

    y_j = g_j tanh( Σ_i k_ij ∗ x_i )

- xi is the ith channel of the input
- kij is the convolution kernel
- gj is a learned scaling factor
- yj is the hidden layer

A bias term can also be added.

Jarrett et al. 2009
Discrete Convolution

Example: a 3x3 input convolved with a 2x2 kernel. The kernel is applied with rows and columns flipped, then slid across the input:

1 x 0 + 0.5 x 80 + 0.25 x 20 + 0 x 40 = 45
1 x 80 + 0.5 x 40 + 0.25 x 40 + 0 x 0 = 110
1 x 20 + 0.5 x 40 + 0.25 x 0 + 0 x 0 = 40
1 x 40 + 0.5 x 0 + 0.25 x 0 + 0 x 40 = 40
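As a minimal NumPy sketch (the input and kernel values are reconstructed from the arithmetic above, so treat them as illustrative):

```python
import numpy as np

def conv2d_valid(x, k):
    """Discrete 2D convolution: correlate with the row/column-flipped kernel."""
    kf = np.flip(k)  # flip rows and columns
    H = x.shape[0] - k.shape[0] + 1
    W = x.shape[1] - k.shape[1] + 1
    y = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            y[r, c] = np.sum(x[r:r + k.shape[0], c:c + k.shape[1]] * kf)
    return y

x = np.array([[0., 80., 40.], [20., 40., 0.], [0., 0., 40.]])  # input (reconstructed)
k = np.array([[0., 0.25], [0.5, 1.0]])                         # kernel, before flipping
print(conv2d_valid(x, k))  # [[ 45. 110.] [ 40.  40.]]
```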


Example of a convolution
Adding non-linearity
With a non-linearity, we get a detector of a feature at any position in
the image:
Example of ReLU non-linearity

From Rob Fergus tutorial (http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf)
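Concretely, a minimal sketch (NumPy; the bias value is an illustrative threshold) of how a ReLU turns the convolution output from the worked example into a position-wise feature detector:

```python
import numpy as np

y = np.array([[45., 110.], [40., 40.]])  # convolution output from the worked example
bias = -50.0                              # illustrative threshold
h = np.maximum(0.0, y + bias)             # ReLU
print(h)  # [[ 0. 60.] [ 0.  0.]] -> the detector fires at exactly one position
```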


Padding
Can use “zero padding” to allow the kernel to go over the borders of the image.
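A minimal sketch (NumPy/SciPy assumed) of zero padding applied to the earlier example before convolving:

```python
import numpy as np
from scipy.signal import convolve2d

x = np.array([[0., 80., 40.], [20., 40., 0.], [0., 0., 40.]])
k = np.array([[0., 0.25], [0.5, 1.0]])

x_pad = np.pad(x, 1)                    # one border of zeros on every side
y = convolve2d(x_pad, k, mode="valid")  # 4x4 output instead of the unpadded 2x2
print(y.shape)                          # (4, 4)
```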
The picture so far

From Rob Fergus tutorial (http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus_1.pdf)


Desiderata for networks for vision
Our goal is to design neural networks that are specifically adapted
for such problems

֍ Must deal with very high-dimensional inputs: 150 x 150 pixels =


22500 inputs, or 3 x 22500 if RGB pixels
֍ Can exploit the 2D topology of pixels (or 3D for video data)
֍ Can build in invariance to certain variations: translation,
illumination, etc.

Convolutional networks leverage these ideas


֍ Local connectivity
֍ Parameter sharing
֍ Convolution
֍ Pooling / subsampling hidden units
Pooling
Pool hidden units in the same neighborhood.
Pooling is performed in non-overlapping neighborhoods (subsampling):

    y_ijk = max_(p,q) x_i,(j+p),(k+q)

- xi is the ith channel of the input
- xi,j,k is the value of the ith feature map at position j,k
- p is the vertical index in the local neighborhood
- q is the horizontal index in the local neighborhood
- yijk is the pooled / subsampled layer

Jarrett et al. 2009
Pooling
Pool hidden units in the same neighborhood.
An alternative to “max” pooling is “average” pooling:

    y_ijk = (1/m²) Σ_(p,q) x_i,(j+p),(k+q)

- xi is the ith channel of the input
- xi,j,k is the value of the ith feature map at position j,k
- p is the vertical index in the local neighborhood
- q is the horizontal index in the local neighborhood
- yijk is the pooled / subsampled layer
- m is the neighborhood height/width

Jarrett et al. 2009
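A minimal NumPy sketch of both operations (non-overlapping m x m neighborhoods; the reshape trick assumes the map dimensions are divisible by m):

```python
import numpy as np

def pool2d(x, m, mode="max"):
    """Non-overlapping m x m pooling of a 2D feature map."""
    H, W = x.shape
    blocks = x.reshape(H // m, m, W // m, m)  # split into m x m neighborhoods
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))           # "average" pooling

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, 2))           # [[ 5.  7.] [13. 15.]]
print(pool2d(x, 2, "mean"))   # [[ 2.5  4.5] [10.5 12.5]]
```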
Example: Pooling
Illustration of pooling/subsampling operation

Why pooling?
֍ Introduces invariance to local translations
֍ Reduces the number of hidden units in hidden layer
Example: Pooling

Can we make the detection robust to the exact location of the eye?
Example: Pooling
By “pooling” (e.g., taking max) filter
responses at different locations we gain
robustness to the exact spatial location of
features.
Translation Invariance
Illustration of local translation invariance

Both images result in the same feature map after pooling/subsampling


Convolutional Network
Convolutional neural network alternates between
convolutional and pooling layers

From Yann LeCun’s slides


Generating Additional Examples
Elastic Distortions
Can add “elastic” deformations (useful in character recognition).

We can do this by applying a “distortion field” to the image: a distortion field specifies where to displace each pixel value.

Bishop’s book
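A minimal sketch of such a distortion field in the style of Simard et al. (SciPy assumed; alpha and sigma are illustrative knobs for displacement strength and smoothness):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_distort(img, alpha=8.0, sigma=3.0, seed=0):
    """Apply a random, smoothed displacement field to a 2D image."""
    rng = np.random.default_rng(seed)
    # Random per-pixel displacements, smoothed so nearby pixels move together.
    dx = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, img.shape), sigma) * alpha
    rows, cols = np.meshgrid(np.arange(img.shape[0]), np.arange(img.shape[1]),
                             indexing="ij")
    # Resample the image at the displaced coordinates (bilinear interpolation).
    return map_coordinates(img, [rows + dy, cols + dx], order=1, mode="reflect")
```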
Conv Nets: Examples
Optical Character Recognition, House Number and Traffic Sign
classification
Conv Nets: Examples
Pedestrian detection

Sermanet et al. “Pedestrian detection with unsupervised multi-stage..” CVPR 2013


Conv Nets: Examples
Object Detection

Sermanet et al. “OverFeat: Integrated recognition, localization” arxiv 2013


Girshick et al. “Rich feature hierarchies for accurate object detection” arxiv 2013
Szegedy et al. “DNN for object detection” NIPS 2013
ImageNet Dataset
~14 million images, 20k classes

Examples of “Hammer”

Deng et al. “Imagenet: a large scale hierarchical image database” CVPR 2009
Important Breakthroughs
Deep Convolutional Nets for Vision (Supervised)
Krizhevsky, A., Sutskever, I. and Hinton, G. E., ImageNet Classification with Deep
Convolutional Neural Networks, NIPS, 2012.

~14 million images, 20k classes


Architecture

How can we select the “right” architecture?

Manual tuning of features is now replaced with manual tuning of architectures:

֍ Depth
֍ Width
֍ Parameter count
How to Choose Architecture

Many hyper-parameters:
Number of layers, number of feature maps

֍ Cross Validation

֍ Grid Search (need lots of GPUs)

֍ Smarter Strategies:
  - Random search [Bergstra & Bengio JMLR 2012] (see the sketch below)
  - Bayesian Optimization
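A rough sketch of random search (the search space and the train_and_eval function are hypothetical stand-ins for your own setup):

```python
import random

# Hypothetical search space; train_and_eval is an assumed user-supplied function
# that trains a model with the given hyper-parameters and returns validation error.
space = {
    "n_layers":       lambda: random.randint(2, 8),
    "n_feature_maps": lambda: random.choice([16, 32, 64, 128]),
    "learning_rate":  lambda: 10 ** random.uniform(-4, -1),
}

best_err, best_cfg = float("inf"), None
for _ in range(20):  # 20 random trials
    cfg = {name: sample() for name, sample in space.items()}
    err = train_and_eval(**cfg)
    if err < best_err:
        best_err, best_cfg = err, cfg
```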
Famous architectures
AlexNet
8 layers total, ~60 million parameters.
Trained on the Imagenet dataset [Deng et al. CVPR’09].
18.2% top-5 error.

Architecture (bottom to top): Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Layer 7: Full → Softmax Output

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet: ablation study
Remove the top fully connected layer (Layer 7):
- drops ~16 million parameters
- only 1.1% drop in performance!

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet: ablation study
Remove both fully connected layers (Layers 6 and 7):
- drops ~50 million parameters
- 5.7% drop in performance!

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet: ablation study
Remove the upper feature extractor layers (Layers 3 and 4):
- drops ~1 million parameters
- 3% drop in performance.

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet: ablation study
Remove both the upper feature extractor layers and the fully connected layers (Layers 3, 4, 6 and 7):
- 33.5% drop in performance!

Depth of the network is the key.

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet: intermediate features

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet: translation invariance

[From Rob Fergus’ CIFAR 2016 tutorial]



AlexNet: scaling invariance

[From Rob Fergus’ CIFAR 2016 tutorial]


AlexNet: rotation invariance

[From Rob Fergus’ CIFAR 2016 tutorial]


GoogLeNet
Issue: the multiscale nature of images.

https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202

A larger kernel is good for global features, a smaller kernel for local features.

Idea: have multiple kernels of different sizes at any given level.


GoogLeNet

22-layer model that uses the so-called inception module.

[Going Deeper with Convolutions, Szegedy et al., arXiv:1409.4842, 2014]


GoogLeNet
GoogLeNet inception module:

֍ Multiple filter scales at each layer (1x1, 3x3, and 5x5 filters)

[Going Deeper with Convolutions, Szegedy et al., arXiv:1409.4842, 2014]


GoogLeNet
GoogLeNet inception module:

֍ Multiple filter scales at each layer (1x1, 3x3, and 5x5 filters)

֍ Dimensionality reduction (1x1 convolutions) to keep computational requirements down

[Going Deeper with Convolutions, Szegedy et al., arXiv:1409.4842, 2014]
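A minimal PyTorch sketch of an inception module (branch channel counts are illustrative, not the paper’s exact configuration; the pooling branch follows the paper’s figure):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Sketch of an inception module: parallel 1x1 / 3x3 / 5x5 branches whose
    outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)             # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, 1),          # 1x1 reduction first
                                nn.Conv2d(32, 64, 3, padding=1))  # then 3x3
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))          # pooling branch
    def forward(self, x):
        # Padding keeps spatial size identical in every branch, so we can concatenate.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

out = InceptionModule(192)(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 192, 28, 28]) -> 64 + 64 + 32 + 32 channels
```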


GoogLeNet

֍ Width of inception modules ranges from 256 filters (in early modules)
to 1024 in top inception modules.
֍ Can remove fully connected layers on top completely
֍ Number of parameters is reduced to 5 million
֍ 6.7% top-5 validation error on ImageNet

[Going Deeper with Convolutions, Szegedy et al., arXiv:1409.4842, 2014]


Residual Networks
Really, really deep convnets do not train well; e.g., on CIFAR10, a 56-layer plain network attains higher training error than a 20-layer one.

Reason: gradients involve multiplications of a number of matrices proportional to the depth.
Vanishing/exploding gradients: gradients get very small/very large.

Key idea: introduce an “identity shortcut” connection, skipping one or more layers, so a block outputs F(x) + x rather than F(x).

Intuition: the network can easily simulate a shallower network (at initialization, F is not too far from the 0 map), so performance should not degrade by going deeper.

[He, Zhang, Ren, Sun, CVPR 2016]
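A minimal PyTorch sketch of such a block (layer sizes illustrative, not the paper’s exact configuration):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a basic residual block: output = F(x) + x."""
    def __init__(self, ch):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # The identity shortcut skips the two conv layers; if F is near the
        # zero map at initialization, the block starts out close to identity.
        return self.relu(self.f(x) + x)
```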


Residual Networks
With ensembling, residual networks achieve 3.57% top-5 test error on ImageNet.

[He, Zhang, Ren, Sun, CVPR 2016]


Dense Convolutional Networks
In ResNets, information from earlier layers is carried forward only implicitly, through addition.
Idea: explicitly forward the output of each layer to *all* future layers (by concatenation).

Intuition: helps with vanishing gradients; encourages reuse of features (and hence reduces parameter count).

(Figure: full architecture for Imagenet.)

[Huang, Liu, Weinberger, van der Maaten, CVPR 2017]
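A minimal PyTorch sketch of a dense block (growth rate and depth are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Sketch of a dense block: each layer sees the concatenation of all
    earlier outputs, and the block returns everything concatenated."""
    def __init__(self, in_ch, growth_rate=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.BatchNorm2d(in_ch + i * growth_rate), nn.ReLU(),
                          nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1))
            for i in range(n_layers)
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # concat all previous outputs
        return torch.cat(feats, dim=1)
```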
Debugging hints
֍ Check gradients numerically by finite differences

֍ Visualize features (feature maps need to be uncorrelated and have high variance)

Good training: hidden units are sparse across samples.

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]


Debugging hints
֍ Check gradients numerically by finite differences

֍ Visualize features (feature maps need to be uncorrelated and have high variance)

Bad training: many hidden units ignore the input and/or exhibit strong correlations.

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]


Debugging hints
֍ Check gradients numerically by finite differences

֍ Visualize features (feature maps need to be uncorrelated and have high variance)

֍ Visualize parameters: learned features should exhibit structure and should be uncorrelated

֍ Measure error on both training and validation set

֍ Test on a small subset of the data and check that the error → 0

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]
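A minimal NumPy sketch of gradient checking by central finite differences (the toy loss at the bottom is illustrative):

```python
import numpy as np

def grad_check(f, grad_f, x, eps=1e-5):
    """Compare an analytic gradient against central finite differences."""
    num = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        num.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)  # central difference
    ana = grad_f(x)
    # Relative error; should be tiny (~1e-8 or less) if backprop is correct.
    return np.max(np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana)))

# Usage on a toy loss f(x) = ||x||^2 with gradient 2x:
x = np.random.randn(5)
print(grad_check(lambda v: np.sum(v ** 2), lambda v: 2 * v, x))
```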


When it does not work
Training diverges:
֍ Learning rate may be too large → decrease the learning rate
֍ Backprop is buggy → do numerical gradient checking

Parameters collapse / loss is minimized but accuracy is low:
֍ Check the loss function: is it appropriate for the task you want to solve?
֍ Does it have degenerate solutions?

Network is underperforming:
֍ Compute flops and number of parameters → if too small, make the net larger
֍ Visualize hidden units/params → fix optimization

Network is too slow:
֍ GPU, distributed framework, make the net smaller

[From Marc'Aurelio Ranzato, CVPR 2014 tutorial]
