WEEK 9
Topic
Lecture 41: Popular CNN Models V
Concepts Covered:
CNN
AlexNet
VGG Net
Transfer Learning
Challenges in Deep Learning
GoogLeNet
ResNet
etc.
Challenges
Deep learning is data hungry.
Overfitting or lack of generalization.
Vanishing/Exploding Gradient Problem.
Appropriate Learning Rate.
Covariate Shift.
Effective training.
Vanishing Gradient Problem
Network: X → f1 → f2 → f3 → f4 → O, with weights W1, W2, W3, W4 feeding f1, f2, f3, f4.

∂O/∂W1 = X · f1′ · W2 · f2′ · W3 · f3′ · W4 · f4′
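To see how this product behaves, here is a small illustrative sketch (not from the lecture) that multiplies the activation derivatives along a deep chain, comparing a saturating sigmoid with ReLU. The depth of 20 layers and the pre-activation value are arbitrary assumptions, and the weights are taken as 1 so only the activation derivatives matter.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)                           # never exceeds 0.25

    # Product of activation derivatives along the chain, as in dO/dW1 above.
    z = np.full(20, 1.5)                               # 20 stacked layers (assumed)
    print("sigmoid:", np.prod(sigmoid_prime(z)))       # of the order of 1e-17: the gradient vanishes
    print("ReLU:   ", np.prod(np.ones_like(z)))        # 1.0: ReLU derivative is 1 for z > 0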
Vanishing Gradient Problem
Choice of activation function: ReLU instead of Sigmoid.
Appropriate initialization of weights (a small initialization sketch is given below).
Intelligent Back Propagation Learning Algorithm.
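As a concrete illustration of the weight-initialization remedy, the sketch below uses He initialization (standard deviation sqrt(2/fan_in)), a common choice for ReLU networks; the layer sizes are assumptions for the example, not from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    def he_init(fan_in, fan_out):
        # He/Kaiming initialization: std = sqrt(2 / fan_in), a common choice
        # for ReLU layers; it keeps the activation variance roughly constant
        # with depth, so gradients are less likely to vanish or explode early.
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

    # Assumed layer sizes for the example: 784 -> 256 -> 64
    W1 = he_init(784, 256)
    W2 = he_init(256, 64)
    print(W1.std(), W2.std())   # close to sqrt(2/784) and sqrt(2/256)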
GoogLeNet
ILSVRC 2014 Winner
GoogLeNet
(Architecture figure: convolution layers, feature concatenation, and a final softmax layer; 27 layers in total including the max-pool layers.)
GoogLeNet
Inception Module
Inception Module
Computes 1×1, 3×3, and 5×5 convolutions within the same module of the network.
Covers a larger area while preserving fine resolution for the small details in the image.
Uses convolution kernels of different sizes in parallel, from the finest detail (1×1) to a larger context (5×5).
The 1×1 convolutions also reduce computation.
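A minimal sketch of such a parallel-branch module, written in PyTorch, is given below. It is not the exact GoogLeNet configuration; the branch channel counts are assumptions chosen only so the example output has 512 channels on a 14×14 feature map.

    import torch
    import torch.nn as nn

    class InceptionSketch(nn.Module):
        """Parallel 1x1, 3x3 and 5x5 convolutions plus pooling, concatenated
        along the channel dimension. The 1x1 convolutions in front of the
        3x3 and 5x5 branches reduce the number of input channels and hence
        the computation."""
        def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
            super().__init__()
            self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
            self.b3 = nn.Sequential(
                nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv2d(c3_red, c3, kernel_size=3, padding=1))
            self.b5 = nn.Sequential(
                nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv2d(c5_red, c5, kernel_size=5, padding=2))
            self.bp = nn.Sequential(
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                nn.Conv2d(in_ch, pool_proj, kernel_size=1))

        def forward(self, x):
            # Every branch preserves the spatial size, so the outputs can be
            # stacked along the channel dimension.
            return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

    # Example with assumed channel counts: 480 input channels, 14x14 feature map.
    x = torch.randn(1, 480, 14, 14)
    y = InceptionSketch(480, 192, 96, 208, 16, 48, 64)(x)
    print(y.shape)   # torch.Size([1, 512, 14, 14])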
Inception Module
With a 1×1 reduction (480 → 16 channels) before the 5×5 convolution:
Number of operations for the 1×1 = (14×14×16)×(1×1×480) = 1.5M
Number of operations for the 5×5 = (14×14×48)×(5×5×16) = 3.8M
Total number of operations = 1.5M + 3.8M = 5.3M

Without the reduction, a direct 5×5 convolution would need:
Number of operations = (14×14×48)×(5×5×480) = 112.9M
https://medium.com/coinmonks/paper-review-of-googlenet-inception-v1-winner-of-ilsvlc-2014-image-classification-c2b3565a64e7
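The figures above can be reproduced with a few lines of arithmetic (multiply–accumulate counts only; biases and activations ignored):

    # 14x14 output, 480 input channels, 48 output channels, 5x5 kernels.
    direct_5x5 = (14 * 14 * 48) * (5 * 5 * 480)       # 112,896,000  (~112.9M)
    reduce_1x1 = (14 * 14 * 16) * (1 * 1 * 480)       #   1,505,280  (~1.5M), 480 -> 16 channels
    then_5x5   = (14 * 14 * 48) * (5 * 5 * 16)        #   3,763,200  (~3.8M), 16 -> 48 channels
    print(direct_5x5, reduce_1x1 + then_5x5)          # 112896000 5268480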
Inception Module
The outputs of these filters are then stacked along the channel dimension.
Multi-level feature extractor.
There are 9 such inception modules.
Top-5 error rate of less than 7%.
GoogLeNet
Auxiliary Classifier
Auxiliary Classifier
Due to the large depth of the network, the ability to propagate gradients back through all the layers was a concern.
Auxiliary classifiers are smaller CNNs put on top of the middle Inception modules.
Adding auxiliary classifiers in the middle exploits the discriminative power of the features produced by the intermediate layers.
Auxiliary Classifier
During training, the losses of the auxiliary classifiers are added to the total loss of the network.
Losses from the auxiliary classifiers are weighted by 0.3.
Auxiliary classifiers are discarded at inference time.
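A minimal sketch of how the weighted auxiliary losses might be combined during training; the function name and the three logits tensors are placeholders, not the lecture's code.

    import torch.nn.functional as F

    def training_loss(main_logits, aux1_logits, aux2_logits, targets):
        # Auxiliary losses are added with weight 0.3 during training only;
        # at inference time the auxiliary heads are simply not evaluated.
        main = F.cross_entropy(main_logits, targets)
        aux = F.cross_entropy(aux1_logits, targets) + F.cross_entropy(aux2_logits, targets)
        return main + 0.3 * aux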
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 42: Popular CNN Models VI
Concepts Covered:
CNN
Challenges in Deep Learning
GoogLeNet
ResNet
Momentum Optimizer
Challenges
Deep learning is data hungry.
Overfitting or lack of generalization.
Vanishing/Exploding Gradient Problem.
Appropriate Learning Rate.
Covariate Shift.
Effective training.
Vanishing Gradient Problem
Choice of activation function: ReLU instead of Sigmoid.
Appropriate initialization of weights.
Intelligent Back Propagation Learning Algorithm.
GoogLeNet
Inception Module
GoogLeNet
Auxiliary Classifier
ResNet
ResNet
The core idea is the introduction of a skip connection (identity shortcut connection) that skips one or more layers.
Stacking layers should not degrade performance compared to the shallow counterpart.
The weight layers learn the residual F(x) = H(x) − x.
https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
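A minimal residual block sketch in PyTorch, assuming equal input and output channels so the identity shortcut applies directly; the channel counts and the use of batch normalization are assumptions, not the exact ResNet block.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """Two weight layers learn the residual F(x); the identity shortcut
        adds x back, so the block outputs H(x) = F(x) + x."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return F.relu(out + x)      # skip connection: add the input back

    x = torch.randn(1, 64, 56, 56)      # assumed input size
    print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])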
ResNet
By stacking identity mappings, the resultant deep network should give at least the same performance as its shallow counterpart.
A deeper network should not give higher training error than a shallower network.
During learning, the gradient can flow back to any earlier layer through the shortcut connections, alleviating the vanishing gradient problem.
ResNet
https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
ResNet
Forward flow (skip connection from layer l−2 to layer l with weights W^(l−2,l); normal path through layer l−1 with weights W^(l−1,l)):

a^l = f( W^(l−1,l) · a^(l−1) + b^l + W^(l−2,l) · a^(l−2) )
    = f( z^l + W^(l−2,l) · a^(l−2) )

a^l = f( z^l + a^(l−2) )    if the dimensions are the same (identity shortcut)
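The same forward flow written with plain arrays, assuming the dimensions of layer l and layer l−2 match so that the identity skip applies; the sizes and weights are arbitrary values for illustration.

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    rng = np.random.default_rng(0)
    d = 8                                      # assumed width of all three layers
    a_lm2 = rng.normal(size=d)                 # a^(l-2), activation of layer l-2
    W_lm1 = 0.1 * rng.normal(size=(d, d))      # weights into layer l-1
    W_l   = 0.1 * rng.normal(size=(d, d))      # W^(l-1,l), weights into layer l
    b_lm1 = np.zeros(d)
    b_l   = np.zeros(d)

    a_lm1 = relu(W_lm1 @ a_lm2 + b_lm1)        # normal path through layer l-1
    z_l   = W_l @ a_lm1 + b_l
    a_l   = relu(z_l + a_lm2)                  # a^l = f(z^l + a^(l-2)), identity skip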
ResNet
Backward propagation:

∇W^(l−1,l) = −a^(l−1) · δ^l    (normal path)
∇W^(l−2,l) = −a^(l−2) · δ^l    (skip path)
Optimizing Gradient Descent
Gradient Descent Challenges
Challenges of Mini-batch Gradient Descent
Choice of Proper Learning Rate:
Too small a learning rate leads to slow convergence.
A large learning rate may lead to oscillation around the minima or may even diverge.
Gradient Descent Challenges
Learning Rate Schedules: changing the learning rate according to some predefined schedule (a minimal step-decay sketch is given below).
The same learning rate applies to all parameter updates.
The data may be sparse and different features have very different frequencies.
Updating all of them to the same extent might not be proper.
Larger updates for rarely occurring features might be a better choice.
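As an illustration of a predefined schedule, a simple step decay is sketched below; the decay factor and step size are arbitrary assumptions.

    def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
        # Halve the learning rate every 10 epochs, independent of the data:
        # the same global rate is still applied to every parameter.
        return initial_lr * (drop ** (epoch // epochs_per_drop))

    print([step_decay(0.1, e) for e in range(0, 40, 10)])   # [0.1, 0.05, 0.025, 0.0125]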
Gradient Descent Challenges
Avoiding getting trapped in suboptimal local minima.
Difficulty arises from saddle points, i.e. points where one dimension slopes up and another slopes down.
These saddle points are usually surrounded by a plateau of the same error, which makes it hard for SGD to escape, as the gradient is close to zero in all dimensions.
Momentum Optimizer
(Figure: loss contours L(W) in the W1–W2 plane.)
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 43: Optimizing Gradient Descent I
Challenges
Deep learning is data hungry.
Overfitting or lack of generalization.
Vanishing/Exploding Gradient Problem.
Appropriate Learning Rate.
Covariate Shift.
Effective training.
Concepts Covered:
CNN
ResNet
Gradient Descent Challenges
Momentum Optimizer
Nesterov Accelerated Gradient
Adagrad.
etc.
Gradient Descent Challenges
Challenges of Mini-batch Gradient Descent
(Figure: loss L(W) plotted against the parameter W.)
Choice of Proper Learning Rate:
Too small a learning rate leads to slow convergence.
A large learning rate may lead to oscillation around the minima or may even diverge.
Gradient Descent Challenges
Learning Rate Schedules: changing the learning rate according to some predefined schedule.
The same learning rate applies to all parameter updates.
The data may be sparse and different features have very different frequencies.
Updating all of them to the same extent might not be proper.
Larger updates for rarely occurring features might be a better choice.
Gradient Descent Challenges
Avoiding getting trapped in suboptimal local minima.
Difficulty arises from saddle points, i.e. points where one dimension slopes up and another slopes down.
These saddle points are usually surrounded by a plateau of the same error, which makes it hard for SGD to escape, as the gradient is close to zero in all dimensions.
Optimizing Gradient Descent
Concepts Covered:
CNN
ResNet
Gradient Descent Challenges
Momentum Optimizer
Adagrad.
etc.
Momentum Optimizer
(Figures: gradient-descent trajectories on the loss contours L(W) in the W1–W2 plane; the final frame compares plain SGD with SGD with Momentum.)
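A minimal sketch of the momentum update; the toy loss surface, learning rate, and momentum coefficient are assumptions for illustration.

    import numpy as np

    def sgd_momentum_step(w, v, grad_fn, lr=0.01, alpha=0.9):
        # The velocity accumulates an exponentially decaying average of past
        # gradients, damping oscillations across the narrow direction of the
        # loss surface while speeding up progress along the consistent one.
        g = grad_fn(w)
        v = alpha * v - lr * g
        return w + v, v

    # Toy quadratic bowl L(W) = 0.5 * (10 * w1^2 + w2^2), assumed for illustration.
    grad = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])
    w, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        w, v = sgd_momentum_step(w, v, grad)
    print(w)   # close to the minimum at the origin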
Nesterov Accelerated Gradient (NAG)
(Figure: NAG update trajectory in the W1–W2 plane.)
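A minimal sketch of the Nesterov update, which evaluates the gradient at the look-ahead point W + α·v instead of at W; the hyper-parameters and the toy loss are assumed as before.

    import numpy as np

    def nag_step(w, v, grad_fn, lr=0.01, alpha=0.9):
        # Look ahead along the current velocity, evaluate the gradient there,
        # then update the velocity and the parameters.
        g = grad_fn(w + alpha * v)
        v = alpha * v - lr * g
        return w + v, v

    grad = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])   # same toy bowl as before
    w, v = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(200):
        w, v = nag_step(w, v, grad)
    print(w)   # converges with less overshoot than plain momentum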
Problem with Momentum Optimizer/NAG
Both algorithms require the hyper-parameters to be set manually.
These hyper-parameters decide the learning rate.
The algorithms use the same learning rate for all dimensions.
The high-dimensional (mostly) non-convex nature of the loss function may lead to different sensitivity in different dimensions.
We may require the learning rate to be small in some dimensions and large in others.
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 44: Optimizing Gradient Descent II
Concepts Covered:
CNN
Gradient Descent Challenges
Momentum Optimizer
Nesterov Accelerated Gradient
Adagrad
RMSProp
etc.
Momentum Optimizer
(Figures: gradient-descent trajectories in the W1–W2 plane; plain SGD vs. SGD with Momentum.)
Nesterov Accelerated Gradient (NAG)
(Figure: NAG update trajectory in the W1–W2 plane.)
Problem with Momentum Optimizer/NAG
Both algorithms require the hyper-parameters to be set manually.
These hyper-parameters decide the learning rate.
The algorithms use the same learning rate for all dimensions.
The high-dimensional (mostly) non-convex nature of the loss function may lead to different sensitivity in different dimensions.
We may require the learning rate to be small in some dimensions and large in others.
Adagrad
Adagrad
Adagrad adaptively scales the learning rate for different dimensions.
The scale factor of a parameter is inversely proportional to the square root of the sum of the historical squared values of its gradient.
The parameters with the largest partial derivative of the loss have a rapid decrease in their learning rate.
Parameters with small partial derivatives have a relatively small decrease in learning rate.
Adagrad

g_t = (1/n) · Σ_{X ∈ minibatch} ∇_W L(W_t, X)

r_t = Σ_{τ=1…t} g_τ ⊙ g_τ

W_{t+1} = W_t − ( η / (ε + √r_t) ) ⊙ g_t

⊙ → element-wise product; the square root and the division are also taken element-wise, and ε is a small constant for numerical stability.
Adagrad

The same update written per dimension, for parameters W^(1), …, W^(d):

W_{t+1}^(1) = W_t^(1) − ( η / (ε + √r_t^(1)) ) · g_t^(1)
W_{t+1}^(2) = W_t^(2) − ( η / (ε + √r_t^(2)) ) · g_t^(2)
 ⋮
W_{t+1}^(d) = W_t^(d) − ( η / (ε + √r_t^(d)) ) · g_t^(d)
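The per-dimension Adagrad step written as a sketch; ε, η, and the toy gradient function are assumptions.

    import numpy as np

    def adagrad_step(w, r, grad_fn, lr=0.1, eps=1e-8):
        # r accumulates the squared gradients of every past step, so each
        # dimension's effective rate lr / sqrt(r) can only shrink over time.
        g = grad_fn(w)
        r = r + g * g
        w = w - lr * g / (eps + np.sqrt(r))
        return w, r

    grad = lambda w: np.array([10.0 * w[0], 0.1 * w[1]])   # very different slopes per dimension
    w, r = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(100):
        w, r = adagrad_step(w, r, grad)
    print(w)   # both dimensions make comparable progress despite the different slopes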
Adagrad
Positive side:
Adagrad adaptively scales the learning rate for different dimensions by normalizing with respect to the gradient magnitude in the corresponding dimension.
Adagrad eliminates the need to manually tune the learning rate.
It reduces the learning rate faster for parameters showing a large slope and more slowly for parameters with a smaller slope.
Adagrad converges rapidly when applied to convex functions.
Adagrad
Negative side:
If the function is non-convex, the trajectory may pass through many complex terrains, eventually arriving at a locally convex region.
By then the learning rate may have become too small due to the accumulation of gradients from the beginning of training.
So at some point the model may stop learning.
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 45: Optimizing Gradient Descent III
Concepts Covered:
CNN
Gradient Descent Challenges
Momentum Optimizer
Nesterov Accelerated Gradient
Adagrad
RMSProp
etc.
Adagrad

g_t = (1/n) · Σ_{X ∈ minibatch} ∇_W L(W_t, X)

r_t = Σ_{τ=1…t} g_τ ⊙ g_τ

W_{t+1} = W_t − ( η / (ε + √r_t) ) ⊙ g_t

⊙ → element-wise product; the square root and the division are also taken element-wise, and ε is a small constant for numerical stability.
Adagrad
Positive side:
Adagrad adaptively scales the learning rate for different dimensions by normalizing with respect to the gradient magnitude in the corresponding dimension.
Adagrad eliminates the need to manually tune the learning rate.
It reduces the learning rate faster for parameters showing a large slope and more slowly for parameters with a smaller slope.
Adagrad converges rapidly when applied to convex functions.
Adagrad
Negative side:
If the function is non-convex, the trajectory may pass through many complex terrains, eventually arriving at a locally convex region.
By then the learning rate may have become too small due to the accumulation of gradients from the beginning of training.
So at some point the model may stop learning.
RMSProp
RMSProp
RMSProp uses an exponentially decaying average of the squared gradient and discards history from the extreme past.
It converges rapidly once it finds a locally convex bowl.
It treats this as an instance of the Adagrad algorithm initialized within that bowl.
RMSProp

g_t = (1/n) · Σ_{X ∈ minibatch} ∇_W L(W_t, X)

r_t = β r_{t−1} + (1 − β) g_t ⊙ g_t

W_{t+1} = W_t − ( η / (ε + √r_t) ) ⊙ g_t
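A sketch of the RMSProp update with the exponentially decaying average; β, η, ε, and the toy gradient are assumptions.

    import numpy as np

    def rmsprop_step(w, r, grad_fn, lr=0.01, beta=0.9, eps=1e-8):
        # Exponentially decaying average of squared gradients: old history is
        # forgotten, so the effective rate can recover, unlike in Adagrad.
        g = grad_fn(w)
        r = beta * r + (1.0 - beta) * g * g
        w = w - lr * g / (eps + np.sqrt(r))
        return w, r

    grad = lambda w: np.array([10.0 * w[0], 0.1 * w[1]])
    w, r = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(300):
        w, r = rmsprop_step(w, r, grad)
    print(w)   # both coordinates end up close to the minimum at the origin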
RMSProp with Nesterov Momentum

W̃ = W_t + α v_t                  (look-ahead point)

g_t = (1/n) · Σ_{X ∈ minibatch} ∇_W L(W̃, X)

r_t = β r_{t−1} + (1 − β) g_t ⊙ g_t

v_{t+1} = α v_t − ( η / (ε + √r_t) ) ⊙ g_t

W_{t+1} = W_t + v_{t+1}
Adaptive Moments (Adam)
Adam
Variant of the combination of RMSProp and Momentum.
Incorporates the first-order moment (with exponential weighting) of the gradient (the momentum term).
Momentum is incorporated into RMSProp by adding momentum to the rescaled gradients.
Both the first and second moments are corrected for bias to account for their initialization to zero.
Adam

g_t = (1/n) · Σ_{X ∈ minibatch} ∇_W L(W_t, X)

s_t = β1 s_{t−1} + (1 − β1) g_t

r_t = β2 r_{t−1} + (1 − β2) g_t ⊙ g_t
Adam

Bias-corrected first and second moments:

ŝ_t = s_t / (1 − β1^t)        r̂_t = r_t / (1 − β2^t)

W_{t+1} = W_t − η · ŝ_t / (ε + √r̂_t)
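The full Adam step as a sketch, including the bias correction; the hyper-parameter values are the commonly used defaults, assumed here rather than taken from the lecture.

    import numpy as np

    def adam_step(w, s, r, t, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        # First moment s (momentum term) and second moment r (RMSProp term),
        # both bias-corrected because they are initialized to zero.
        g = grad_fn(w)
        s = beta1 * s + (1.0 - beta1) * g
        r = beta2 * r + (1.0 - beta2) * g * g
        s_hat = s / (1.0 - beta1 ** t)
        r_hat = r / (1.0 - beta2 ** t)
        w = w - lr * s_hat / (eps + np.sqrt(r_hat))
        return w, s, r

    grad = lambda w: np.array([10.0 * w[0], 0.1 * w[1]])
    w, s, r = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
    for t in range(1, 2001):
        w, s, r = adam_step(w, s, r, t, grad)
    print(w)   # both coordinates approach the minimum at the origin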
Momentum Optimizer
Animation source: https://imgur.com/a/Hqolp