WEEK 9

The document outlines lectures on popular CNN models and optimization techniques in deep learning, focusing on models like GoogLeNet and ResNet, as well as challenges such as overfitting and the vanishing gradient problem. It discusses various optimization algorithms including Momentum, Nesterov Accelerated Gradient, and Adagrad, highlighting their advantages and challenges. The content is structured for a course taught by Prof. P. K. Biswas at IIT Kharagpur.

Course Name: Deep Learning

Faculty Name: Prof. P. K. Biswas


Department: E & ECE, IIT Kharagpur

Topic
Lecture 41: Popular CNN Models V
Concepts Covered:
 CNN
 AlexNet
 VGG Net
 Transfer Learning
 Challenges in Deep Learning
 GoogLeNet
 ResNet
 etc.
Challenges
 Deep learning is data hungry.
 Overfitting or lack of generalization.
 Vanishing/Exploding Gradient Problem.
 Appropriate Learning Rate.
 Covariate Shift.
 Effective training.
Vanishing Gradient Problem

[Figure: a chain of layers X → f1(W1) → f2(W2) → f3(W3) → f4(W4) → O]

$$\frac{\partial O}{\partial W_1} = X \cdot f_1' \cdot W_2 \cdot f_2' \cdot W_3 \cdot f_3' \cdot W_4 \cdot f_4'$$
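The product form above is the root of the problem: with sigmoid activations every f′ factor is at most 0.25, so the gradient with respect to W1 shrinks roughly geometrically with depth. A small numeric illustration in Python (a sketch, not from the lecture; weights and pre-activations are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
grad = 1.0
for _ in range(20):                      # 20 stacked sigmoid layers
    z = rng.normal()                     # placeholder pre-activation
    w = rng.normal()                     # placeholder weight of the next layer
    grad *= sigmoid(z) * (1.0 - sigmoid(z)) * w   # each factor is f' * W, with f' <= 0.25
print(f"|dO/dW1| after 20 layers ~ {abs(grad):.1e}")   # a vanishingly small number
```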
Vanishing Gradient Problem
 Choice of activation function: ReLU instead of Sigmoid.
 Appropriate initialization of weights.
 Intelligent back-propagation learning algorithm.

GoogLeNet
ILSVRC 2014 Winner
GoogLeNet
 22 layers with parameters.
 27 layers including max-pool layers.
[Figure: GoogLeNet architecture; layer types shown: Convolution, Max-pool, Feature Concatenation, Softmax]
GoogLeNet
 Inception Module
Inception Module
 Computes 1×1, 3×3, and 5×5 convolutions within the same module of the network.
 Covers a bigger area while preserving fine resolution for the small details in the images.
 Uses convolution kernels of different sizes in parallel, from the finest detail (1×1) to a coarser one (5×5).
 The 1×1 convolutions also reduce computation.
Inception Module
With the 1×1 reduction (a 1×1 convolution reduces 480 channels to 16 before the 5×5 convolution on a 14×14 feature map):
 Number of operations for 1×1 = (14×14×16)×(1×1×480) = 1.5M
 Number of operations for 5×5 = (14×14×48)×(5×5×16) = 3.8M
 Total number of operations = 1.5M + 3.8M = 5.3M
Without the 1×1 reduction (5×5 convolution applied directly to all 480 channels):
 Number of operations = (14×14×48)×(5×5×480) = 112.9M

https://medium.com/coinmonks/paper-review-of-googlenet-inception-v1-winner-of-ilsvlc-2014-image-classification-c2b3565a64e7
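The saving from the 1×1 bottleneck is easy to check by hand; the sketch below simply recomputes the multiply counts quoted above (`conv_ops` is a hypothetical helper, not a library function):

```python
# operations = (output H x output W x output channels) x (kernel H x kernel W x input channels)
def conv_ops(out_hw, out_ch, k, in_ch):
    return out_hw * out_hw * out_ch * k * k * in_ch

ops_1x1 = conv_ops(14, 16, 1, 480)       # 1x1 reduction: 480 -> 16 channels
ops_5x5 = conv_ops(14, 48, 5, 16)        # 5x5 convolution on the reduced 16 channels
ops_direct = conv_ops(14, 48, 5, 480)    # 5x5 convolution applied directly to 480 channels

print(f"with 1x1 reduction : {(ops_1x1 + ops_5x5) / 1e6:.1f}M")   # ~5.3M
print(f"without reduction  : {ops_direct / 1e6:.1f}M")            # ~112.9M
```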
Inception Module
 Outputs of these filters are then stacked along the channel dimension.
 Multi-level feature extractor.
 There are 9 such Inception modules.
 Top-5 error rate of less than 7%.
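Assuming PyTorch, a minimal Inception-style block may make the parallel branches and the channel-wise concatenation concrete. This is a sketch, not the exact GoogLeNet configuration; the branch widths are illustrative:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1, 3x3 and 5x5 convolutions plus a pooling path, concatenated along channels."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        # stack the four branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

# e.g. a 14x14x480 input as in the operation count above
y = InceptionBlock(480, 192, 96, 208, 16, 48, 64)(torch.randn(1, 480, 14, 14))
print(y.shape)   # torch.Size([1, 512, 14, 14])
```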
GoogLeNet
 Auxiliary Classifier
Auxiliary Classifier
 Due to the large depth of the network, the ability to propagate gradients back through all the layers was a concern.
 Auxiliary classifiers are smaller CNNs put on top of the middle Inception modules.
 Adding auxiliary classifiers in the middle exploits the discriminative power of the features produced by the intermediate layers.
Auxiliary Classifier
 During training, the losses of the auxiliary classifiers are added to the total loss of the network.
 The losses from the auxiliary classifiers are weighted by 0.3.
 Auxiliary classifiers are discarded at inference time.
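A sketch of this loss combination, assuming (hypothetically) that the forward pass returns the main logits plus two auxiliary logits:

```python
import torch.nn.functional as F

def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets, aux_weight=0.3):
    # auxiliary losses contribute with weight 0.3 during training only;
    # at inference the auxiliary heads are simply discarded
    loss_main = F.cross_entropy(main_logits, targets)
    loss_aux = F.cross_entropy(aux1_logits, targets) + F.cross_entropy(aux2_logits, targets)
    return loss_main + aux_weight * loss_aux
```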
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department: E & ECE, IIT Kharagpur

Topic
Lecture 42: Popular CNN Models VI
Concepts Covered:
 CNN
 Challenges in Deep Learning
 GoogLeNet
 ResNet
 Momentum Optimizer

ResNet
ResNet
 Core idea: introduction of a Skip Connection / Identity Shortcut Connection that skips one or more layers.
 Stacking layers should not degrade performance compared to the shallow counterpart.
 The weight layers learn the residual F(x) = H(x) - x, where H(x) is the desired mapping.
[Figure: residual block with an identity shortcut around the weight layers]

https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
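A minimal residual block in PyTorch, as a sketch of the idea (the layer sizes are illustrative): the stacked weight layers compute the residual F(x), and the identity shortcut adds x back, so the block outputs H(x) = F(x) + x.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))   # F(x)
        return self.relu(out + x)                                        # H(x) = F(x) + x

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)   # torch.Size([1, 64, 56, 56]) -- the shortcut requires matching shapes
```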
ResNet
 By stacking identity mappings, the resulting deep network should give at least the same performance as its shallow counterpart.
 A deeper network should not give a higher training error than a shallow network.
 During learning, the gradient can flow to any earlier layer through the shortcut connections, alleviating the vanishing gradient problem.
ResNet
Forward flow through a skip connection from layer l-2 to layer l (normal-path weights W^{l-1,l}, skip-path weights W^{l-2,l}):

$$a^{l} = f\left(W^{l-1,l} a^{l-1} + b^{l} + W^{l-2,l} a^{l-2}\right) = f\left(Z^{l} + W^{l-2,l} a^{l-2}\right)$$

$$a^{l} = f\left(Z^{l} + a^{l-2}\right) \quad \text{if the dimensions are the same}$$
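A direct transcription of this forward flow as a NumPy sketch, with a projection W^{l-2,l} on the skip path; when the dimensions match, that projection can simply be the identity:

```python
import numpy as np

def residual_forward(a_prev2, a_prev1, W_l, b_l, W_skip, f=lambda z: np.maximum(z, 0.0)):
    Z_l = W_l @ a_prev1 + b_l        # normal path: Z^l = W^{l-1,l} a^{l-1} + b^l
    skip = W_skip @ a_prev2          # skip path:   W^{l-2,l} a^{l-2}
    return f(Z_l + skip)             # a^l = f(Z^l + W^{l-2,l} a^{l-2})

# with matching dimensions, W_skip = I gives a^l = f(Z^l + a^{l-2})
a2, a1 = np.ones(4), np.ones(4)
print(residual_forward(a2, a1, np.eye(4), np.zeros(4), np.eye(4)))
```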
ResNet
Backward propagation:

$$\nabla W^{l-1,l} = -a^{l-1} \cdot \delta^{l} \quad \text{(normal path)}$$

$$\nabla W^{l-2,l} = -a^{l-2} \cdot \delta^{l} \quad \text{(skip path)}$$

 If the skip path has fixed weights (an identity matrix), then they are not updated.

Optimizing Gradient Descent
Gradient Descent Challenges
Challenges of Mini-batch Gradient Descent
 Choice of a proper learning rate:
 Too small a learning rate leads to slow convergence.
 A large learning rate may lead to oscillation around the minima or may even diverge.
[Figure: loss L(W) versus W]
Gradient Descent Challenges
 Learning rate schedules: changing the learning rate according to some predefined schedule.
 The same learning rate applies to all parameter updates.
 The data may be sparse, and different features may have very different frequencies.
 Updating all of them to the same extent might not be proper.
 A larger update for rarely occurring features might be a better choice.
Gradient Descent Challenges
 Avoiding getting trapped in suboptimal local minima.
 Difficulty also arises from saddle points, i.e. points where one dimension slopes up and another slopes down.
 These saddle points are usually surrounded by a plateau of the same error, which makes it hard for SGD to escape, as the gradient is close to zero in all dimensions.
Momentum Optimizer
[Figure: contours of the loss L(W) over parameters W1 and W2]
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department: E & ECE, IIT Kharagpur

Topic
Lecture 43: Popular Optimizing Gradient Descent
Concepts Covered:
 CNN
 ResNet
 Gradient Descent Challenges
 Momentum Optimizer
 Nesterov Accelerated Gradient
 Adagrad.
 etc.

Momentum Optimizer
[Figure: contours of the loss L(W) over W1 and W2, comparing the update paths of SGD and SGD with Momentum]
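The slides show momentum only graphically. The sketch below follows the standard heavy-ball update, v_{t+1} = γ v_t − η ∇L(W_t), W_{t+1} = W_t + v_{t+1}, on a toy quadratic loss; the constants are assumptions, not values from the lecture:

```python
import numpy as np

def momentum_step(W, v, grad, eta=0.01, gamma=0.9):
    v = gamma * v - eta * grad       # accumulate a velocity that smooths the updates
    return W + v, v

# toy quadratic bowl L(W) = 0.5 * W^T A W with very different curvatures per dimension
A = np.diag([1.0, 50.0])
W, v = np.array([2.0, 2.0]), np.zeros(2)
for _ in range(100):
    W, v = momentum_step(W, v, A @ W)
print(W)   # approaches the minimum at the origin
```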

Nesterov Accelerated Gradient (NAG)
[Figure: NAG update path on the loss contours over W1 and W2]
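NAG differs from plain momentum only in where the gradient is taken: at the look-ahead point W + αv rather than at W itself (the same look-ahead W̃ = W_t + αv reappears later in the RMSProp-with-Nesterov slide). A sketch under the usual convention, with illustrative constants:

```python
import numpy as np

def nag_step(W, v, grad_fn, eta=0.01, alpha=0.9):
    lookahead = W + alpha * v                 # peek ahead along the accumulated velocity
    v = alpha * v - eta * grad_fn(lookahead)  # gradient evaluated at the look-ahead point
    return W + v, v

A = np.diag([1.0, 50.0])                      # same toy quadratic as before
W, v = np.array([2.0, 2.0]), np.zeros(2)
for _ in range(100):
    W, v = nag_step(W, v, lambda x: A @ x)
print(W)   # approaches the minimum at the origin
```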
Problem with Momentum Optimizer/NAG
 Both algorithms require the hyper-parameters to be set manually.
 These hyper-parameters decide the learning rate.
 The algorithms use the same learning rate for all dimensions.
 The high-dimensional (mostly) non-convex nature of the loss function may lead to different sensitivity along different dimensions.
 We may require the learning rate to be small in some dimensions and large in others.
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department: E & ECE, IIT Kharagpur

Topic
Lecture 44: Optimizing Gradient Descent II
Concepts Covered:
 CNN
 Gradient Descent Challenges
 Momentum Optimizer
 Nesterov Accelerated Gradient
 Adagrad
 RMSProp
 etc.

Adagrad
 Adagrad adaptively scales the learning rate for different dimensions.
 The scale factor of a parameter is inversely proportional to the square root of the sum of the historical squared values of its gradient.
 Parameters with the largest partial derivatives of the loss see a rapid decrease in their learning rate.
 Parameters with small partial derivatives see a relatively small decrease in their learning rate.
Adagrad

$$g_t = \frac{1}{n} \sum_{\forall X \in \text{Minibatch}} \nabla_W L(W_t, X)$$

$$r_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau$$

$$W_{t+1} = W_t - \frac{\eta}{\epsilon I + \sqrt{r_t}} \odot g_t$$

($\odot$ denotes the element-wise product.)
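A sketch of the Adagrad update following the formulas above, with the accumulator r kept per dimension:

```python
import numpy as np

def adagrad_step(W, r, grad, eta=0.1, eps=1e-8):
    r = r + grad * grad                           # r_t = r_{t-1} + g_t ⊙ g_t
    W = W - (eta / (eps + np.sqrt(r))) * grad     # per-dimension scaled step
    return W, r

A = np.diag([1.0, 50.0])                          # dimensions with very different slopes
W, r = np.array([2.0, 2.0]), np.zeros(2)
for _ in range(500):
    W, r = adagrad_step(W, r, A @ W)
print(W)   # each coordinate has been updated with its own effective learning rate
```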
Adagrad (per-dimension form)

$$\begin{bmatrix} W_{t+1}^{(1)} \\ W_{t+1}^{(2)} \\ \vdots \\ W_{t+1}^{(d)} \end{bmatrix} = \begin{bmatrix} W_{t}^{(1)} \\ W_{t}^{(2)} \\ \vdots \\ W_{t}^{(d)} \end{bmatrix} - \begin{bmatrix} \dfrac{\eta}{\epsilon + \sqrt{r_t^{(1)}}}\, g_t^{(1)} \\ \dfrac{\eta}{\epsilon + \sqrt{r_t^{(2)}}}\, g_t^{(2)} \\ \vdots \\ \dfrac{\eta}{\epsilon + \sqrt{r_t^{(d)}}}\, g_t^{(d)} \end{bmatrix}$$
Adagrad
Positive side:
 Adagrad adaptively scales the learning rate for different dimensions by normalizing with respect to the gradient magnitude in the corresponding dimension.
 Adagrad eliminates the need to manually tune the learning rate.
 It reduces the learning rate faster for parameters showing a large slope and more slowly for parameters showing a smaller slope.
 Adagrad converges rapidly when applied to convex functions.
Adagrad
Negative side:
 If the function is non-convex, the trajectory may pass through many complex terrains before eventually arriving at a locally convex region.
 By then the learning rate may have become too small due to the accumulation of gradients from the beginning of training.
 So at some point the model may stop learning.
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department: E & ECE, IIT Kharagpur

Topic
Lecture 45: Optimizing Gradient Descent III
Concepts Covered:
 CNN
 Gradient Descent Challenges
 Momentum Optimizer
 Nesterov Accelerated Gradient
 Adagrad
 RMSProp
 etc.

RMSProp
 RMSProp uses an exponentially decaying average of the squared gradient and discards history from the extreme past.
 It converges rapidly once it finds a locally convex bowl.
 It treats this as an instance of the Adagrad algorithm initialized within that bowl.
RMSProp

$$g_t = \frac{1}{n} \sum_{\forall X \in \text{Minibatch}} \nabla_W L(W_t, X)$$

$$r_t = \beta r_{t-1} + (1-\beta)\, g_t \odot g_t \qquad \text{(exponentially decaying average)}$$

$$W_{t+1} = W_t - \frac{\eta}{\epsilon I + \sqrt{r_t}} \odot g_t$$
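A sketch of the RMSProp update following the formulas above; only the accumulation rule differs from Adagrad:

```python
import numpy as np

def rmsprop_step(W, r, grad, eta=0.01, beta=0.9, eps=1e-8):
    r = beta * r + (1.0 - beta) * grad * grad     # exponentially decaying average of g ⊙ g
    W = W - (eta / (eps + np.sqrt(r))) * grad
    return W, r

A = np.diag([1.0, 50.0])
W, r = np.array([2.0, 2.0]), np.zeros(2)
for _ in range(500):
    W, r = rmsprop_step(W, r, A @ W)
print(W)   # both coordinates head toward zero; the decayed history keeps the step size from collapsing
```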
RMSProp with Nesterov Momentum

$$\tilde{W} = W_t + \alpha v_t, \qquad g_t = \frac{1}{n} \sum_{\forall X \in \text{Minibatch}} \nabla_W L(\tilde{W}, X)$$

$$r_t = \beta r_{t-1} + (1-\beta)\, g_t \odot g_t$$

$$v_{t+1} = \alpha v_t - \frac{\eta}{\epsilon I + \sqrt{r_t}} \odot g_t, \qquad W_{t+1} = W_t + v_{t+1}$$
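Combining the two pieces gives one RMSProp-with-Nesterov step; in the sketch below the weight update uses the freshly computed velocity, which is the usual convention:

```python
import numpy as np

def rmsprop_nesterov_step(W, v, r, grad_fn, eta=0.01, alpha=0.9, beta=0.9, eps=1e-8):
    g = grad_fn(W + alpha * v)                    # gradient at the look-ahead point W~
    r = beta * r + (1.0 - beta) * g * g           # decaying average of squared gradients
    v = alpha * v - (eta / (eps + np.sqrt(r))) * g
    return W + v, v, r
```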

Adaptive Moments (Adam)
Adam
 A variant of the combination of RMSProp and Momentum.
 Incorporates the first-order moment (with exponential weighting) of the gradient (the momentum term).
 Momentum is incorporated into RMSProp by adding momentum to the rescaled gradients.
 Both the first and second moments are corrected for bias to account for their initialization to zero.
Adam

$$g_t = \frac{1}{n} \sum_{\forall X \in \text{Minibatch}} \nabla_W L(W_t, X)$$

Biased first and second moments:

$$s_t = \beta_1 s_{t-1} + (1-\beta_1)\, g_t$$

$$r_t = \beta_2 r_{t-1} + (1-\beta_2)\, g_t \odot g_t$$
Adam
Bias-corrected first and second moments:

$$\hat{s}_t = \frac{s_t}{1-\beta_1^{\,t}}, \qquad \hat{r}_t = \frac{r_t}{1-\beta_2^{\,t}}$$

$$W_{t+1} = W_t - \eta\, \frac{\hat{s}_t}{\epsilon I + \sqrt{\hat{r}_t}}$$
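A sketch of the full Adam step following the formulas above; the default β₁, β₂, η values are the commonly used ones, not taken from the slides:

```python
import numpy as np

def adam_step(W, s, r, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    s = beta1 * s + (1.0 - beta1) * grad           # biased first moment (momentum term)
    r = beta2 * r + (1.0 - beta2) * grad * grad    # biased second moment (RMSProp term)
    s_hat = s / (1.0 - beta1 ** t)                 # bias correction, t starting at 1
    r_hat = r / (1.0 - beta2 ** t)
    W = W - eta * s_hat / (eps + np.sqrt(r_hat))
    return W, s, r

A = np.diag([1.0, 50.0])
W, s, r = np.array([2.0, 2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    W, s, r = adam_step(W, s, r, A @ W, t)
print(W)   # both coordinates are driven toward the minimum at the origin
```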
Momentum Optimizer

Animation Source: https://imgur.com/a/Hqolp
