Unit-2 Improving-Deep-Neural-Networks
Contents
1 Improving Deep Neural Networks
  1.1 Training/Dev (Cross Validation (CV))/Test
  1.2 Bias and Variance
  1.3 Regularization
  1.4 Dropout
  1.5 Other regularization methods
  1.6 Normalizing training sets
  1.7 Vanishing / Exploding gradients
  1.8 Weight initialization for deep networks
  1.9 Numerical approximation of gradients
  1.10 Gradient checking (Grad check)
  1.11 Mini-batch gradient descent
  1.12 Optimization algorithms - exponentially weighted moving averages
  1.13 Gradient descent with momentum
  1.14 RMSprop (Root Mean Square prop)
  1.15 Adam (adaptive moment estimation) optimization
  1.16 Learning rate decay (lower on the list of hyper-parameters to try)
  1.17 Local optima and saddle points
  1.18 Hyperparameter tuning process
  1.19 Batch normalization
  1.20 Multi-class classification
Preface
A couple of years ago I completed the Deep Learning Specialization taught by AI pioneer Andrew Ng. I found this series of courses immensely helpful in my deep learning journey. Years later, I decided to prepare this document to share some of the notes that highlight key concepts I learned in the second course of the specialization, Improving Deep Neural Networks. This course teaches how to optimize a model's performance by applying a range of algorithms and techniques: for instance, how to tune the learning rate, the number of layers, and the number of neurons in each layer. It then covers regularization techniques such as dropout and Batch Normalization, and ends with an optimization section that discusses stochastic gradient descent, momentum, RMSprop, and the Adam optimization algorithm. These notes are based on the lecture videos, the supplementary material provided, and my own understanding of the topics.
The content of this document is mainly adapted from this GitHub repository. I have added explanations, illustrations, and visualizations to make some complex concepts easier to grasp. This document can serve as a good reference for Machine Learning Engineers, Deep Learning Engineers, and Data Scientists to refresh their memory on the fundamentals of deep learning. Please don't hesitate to contact me via my website (sefidian.com) if you have any questions.
Happy Learning!
Amir Masoud Sefidian
1 Improving Deep Neural Networks
1.3 Regularization
• L2 regularization is also called weight decay because it causes the weights to become smaller for higher values of lambda (the regularization parameter). It reduces high variance (see the sketch after this list).
• There are L2 and L1 regularization: L2 penalizes the squared norm of the weights, while L1 penalizes the absolute values (the L1 norm) and has the “advantage” of making the weight matrices sparse. L2 is the most used in practice.
– When λ → ∞, it sets the weight matrices W^[l] to be reasonably close to zero. As a result, the neural network becomes a much smaller neural network. See Figure (1).
– If the regularization parameter becomes very large, the parameters W^[l] ≈ 0, so Z will be relatively small. Thus, the activation function, if it is tanh, say, will operate in its relatively linear region when Z ≈ 0. The whole neural network will then compute something not too far from a big linear function, which is a fairly simple function rather than a very complex, highly non-linear one. See Figure (2).
1.4 Dropout
• Dropout regularization consists of training the NN with a number of neurons “switched off” at every training iteration (though not during testing). It has a similar effect to regularization, and it is possible to use a different percentage of dropped units/neurons for each layer, making it more flexible. The same units/neurons are dropped in both the forward and backward steps.
The cost function with dropout does not necessarily decrease monotonically from iteration to iteration, as we usually see for gradient descent.
– Inverted dropout is the most common type of dropout. It consists of scaling the activations by dividing the activation matrix of each layer by keep_prob (the probability of keeping units), so that the expected value of the activations is unchanged.
– Steps: see Figure (3) and the sketch below.
Figure 3: Dropout
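The steps in Figure (3) can be summarized with a short NumPy sketch of inverted dropout for one layer's forward pass. The layer index 3 and the names `a3`, `d3`, and `keep_prob` follow the course's usual example, but the concrete values here are illustrative.

```python
import numpy as np

keep_prob = 0.8              # probability of keeping a unit
a3 = np.random.rand(5, 10)   # activations of layer 3 (illustrative values)

# 1. Build a random mask that is 1 with probability keep_prob.
d3 = np.random.rand(*a3.shape) < keep_prob
# 2. Zero out the dropped units.
a3 = a3 * d3
# 3. Scale up by keep_prob so the expected value of a3 is unchanged
#    (the "inverted" part; no scaling is then needed at test time).
a3 = a3 / keep_prob
```

The same mask `d3` is reused when propagating the gradients in the backward step, so the dropped units stay dropped for that iteration.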
• Intuitions:
– Dropout randomly knocks out units in the network. Hence, it is as if on every iteration we
are working with a smaller neural network, and so using a smaller neural network seems
like it should have a regularizing effect.
– Let’s take a look from the perspective of a single unit. This unit takes some inputs and produces some meaningful output. With dropout, its inputs can get randomly eliminated, so it can’t rely on any one feature, because any one of its inputs could go away at random. Hence, the unit is reluctant to put too much weight on any single input and is more motivated to spread the weights out, giving a little bit of weight to each of its inputs. Spreading out the weights has the effect of shrinking their squared norm.
• One big downside of dropout is that the cost function J is no longer well-defined.
• Note: Orthogonalization is the separation of the cost optimization step (e.g. gradient descent) from the steps taken to avoid overfitting the model (e.g. regularization); in other words, optimizing the model’s parameters vs. tuning the model’s hyperparameters.
1.6 Normalizing training sets
• Normalize the inputs by subtracting the mean and scaling by the variance (a minimal sketch follows at the end of this section):
µ = (1/m) Σ_{i=1}^{m} x^(i),   x := x − µ
σ² = (1/m) Σ_{i=1}^{m} (x^(i))²  (element-wise square),   x := x / σ²
• The mean and variance obtained on the training set should be used to scale the test set as well (we don’t want the training and test sets scaled differently).
• Allows using higher learning rates and faster convergence for gradient descent.
• If features are on very different scales, say the feature x1 ranges from 1 to 1000 and the feature x2 ranges from 0 to 1, then the parameters w1 and w2 will end up taking on values in very different ranges, and the cost function can be very elongated. When the features are normalized, the cost function is more symmetric; with more spherical contours, gradient descent can take much larger steps instead of oscillating. See Figure (6).
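A minimal sketch of this normalization, assuming inputs with one example per column (the course convention) and reusing the training-set statistics on the test set. Here the centered data is divided by the standard deviation √σ²; the function and variable names are placeholders.

```python
import numpy as np

def normalize_inputs(X_train, X_test):
    """Normalize features using statistics computed on the training set only.

    X_train, X_test -- arrays of shape (n_features, m), one example per column.
    """
    mu = np.mean(X_train, axis=1, keepdims=True)            # per-feature mean
    X_train = X_train - mu
    sigma2 = np.mean(X_train ** 2, axis=1, keepdims=True)   # per-feature variance
    X_train = X_train / np.sqrt(sigma2 + 1e-8)              # divide by the standard deviation

    # Scale the test set with the *training* mean and variance.
    X_test = (X_test - mu) / np.sqrt(sigma2 + 1e-8)
    return X_train, X_test
```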
1.7 Vanishing / Exploding gradients
• In a very deep network, if the weights are slightly larger than 1 (or the identity), the activations can grow exponentially with the number of layers; if the weights are slightly smaller than 1, the activations can shrink exponentially through the layers with such small weights (think in terms of a very deep network with linear activations as an intuitive example).
• The above is also applicable to the gradients (not just the activations/outputs), in the opposite direction (backward propagation): gradients can either explode (causing numerical instability) or become very small (with the consequence that lower layers are barely updated, as well as numerical instability).
1.8 Weight initialization for deep networks
• A partial solution is to scale the random initial weights according to the number of inputs of each layer (sketched below).
• For tanh: W = np.random.randn(shape) * np.sqrt(1 / n^[l−1]) (Xavier initialization), or the variant np.sqrt(2 / (n^[l−1] + n^[l])).
• For ReLU: W = np.random.randn(shape) * np.sqrt(2 / n^[l−1]) (He initialization).
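A minimal NumPy sketch of these initialization schemes for a single layer; `n_prev` and `n_curr` stand for n^[l−1] and n^[l], and the function signature is illustrative.

```python
import numpy as np

def initialize_layer(n_prev, n_curr, activation="relu"):
    """Initialize W[l] (shape n_curr x n_prev) with a variance suited to the activation."""
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)                 # He initialization
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_prev)                 # Xavier initialization
        # variant: scale = np.sqrt(2.0 / (n_prev + n_curr))
    else:
        scale = np.sqrt(1.0 / n_prev)
    W = np.random.randn(n_curr, n_prev) * scale
    b = np.zeros((n_curr, 1))
    return W, b
```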
1.10 Gradient checking (Grad check)
• Take W^[1], b^[1], ..., W^[L], b^[L] and concatenate and reshape them into a vector θ.
• Take dW^[1], db^[1], ..., dW^[L], db^[L] and concatenate and reshape them into a vector dθ.
• Note that ||·||₂ denotes the square root of the sum of the squared differences (that is, the Euclidean norm of the vector). The quantity checked is ||dθ_approx − dθ||₂ / (||dθ_approx||₂ + ||dθ||₂), where dθ_approx is the two-sided numerical approximation of the gradient, component-wise (J(θ + ε) − J(θ − ε)) / (2ε).
• If the result is around 10⁻⁷, it is great; if it is around 10⁻⁵, suspect something in the formula; if it is around 10⁻³, something is really wrong.
• Look for which component dθ[i] has the highest difference, to pinpoint the cause of the bug.
• Include the regularization term in the cost function when performing grad check.
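A hedged sketch of the check described above, assuming a `cost(theta)` function that returns J for a flattened 1-D parameter vector and a `grad` vector produced by backprop; both names are placeholders.

```python
import numpy as np

def gradient_check(cost, theta, grad, epsilon=1e-7):
    """Compare backprop gradients with two-sided numerical approximations."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        grad_approx[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * epsilon)

    # Relative difference: around 1e-7 is great, around 1e-3 suggests a bug.
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator
```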
1.11 Mini-batch gradient descent
• Mini-batch gradient descent runs each iteration of gradient descent on a smaller batch of the full dataset, instead of on the entire training set at once, which may take too long per iteration.
• The cost function trends downward but not monotonically in this case.
• If the batch size = m, then it is just batch gradient descent (run on all the examples at once); use this when m ≤ 2000. If the batch size = 1, it is stochastic gradient descent.
• The ideal scenario is in between the two extremes above (it may not exactly converge, but we can reduce the learning rate).
• Typical mini-batch sizes are powers of two (64, 128, 256, 512), to ensure they fit in CPU/GPU memory.
Figure 7: Batch Gradient Descent vs. Mini-Batch Gradient Descent vs. Stochastic Gradient Descent
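A minimal sketch of shuffling a training set and cutting it into mini-batches, assuming X and Y store one example per column (the course convention); the names and the default batch size are illustrative.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the examples (columns) and cut them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]

    mini_batches = []
    for start in range(0, m, batch_size):          # the last batch may be smaller
        end = min(start + batch_size, m)
        mini_batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return mini_batches
```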
1.12 Optimization algorithms - exponentially weighted moving averages
• When β = 0.9, it takes roughly 10 steps (think 10 days for daily time-series data) for the contribution of a point to decay to about 1/3 (more precisely 1/e) of its original weight. The general rule (with ε = 0.1 and β = 1 − ε in this example) is:
(1 − ε)^(1/ε) ≈ 1/e
• To correct the bias of the first few terms (due to the zero initialization), the following formula can be used:
V_t^corrected = V_t / (1 − β^t)
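A minimal sketch of the exponentially weighted average with bias correction, applied to a sequence of scalar observations (for example a daily temperature series); `data` and `beta` are placeholders.

```python
def exponentially_weighted_average(data, beta=0.9):
    """Return the bias-corrected exponentially weighted averages of a sequence."""
    v = 0.0
    averages = []
    for t, x in enumerate(data, start=1):
        v = beta * v + (1 - beta) * x          # V_t = beta * V_{t-1} + (1 - beta) * x_t
        averages.append(v / (1 - beta ** t))   # bias correction: V_t / (1 - beta^t)
    return averages
```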
1.13 Gradient descent with momentum
• The basic idea is to compute an exponentially weighted average of the gradients, and then use that average to update the weights.
• It uses exponentially weighted moving averages to smooth out the derivatives dW and db when updating W and b in each iteration. For example for dW (and similarly for db):
V_dW = β·V_dW + (1 − β)·dW
W := W − α·V_dW
• Sometimes a simplified version is used that drops the (1 − β) factor and instead folds it into the learning rate (which then must be adjusted):
V_dW = β·V_dW + dW,   α_adjusted = α·(1 − β)
• β is most commonly 0.9 (pretty robust value)
• Bias correction is not usually used in practice for gradient descent with momentum.
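A minimal sketch of one momentum update for a single layer's parameters, with the velocities assumed to be initialized to zeros; all names and the default hyperparameters are illustrative.

```python
def momentum_step(W, b, dW, db, v_dW, v_db, learning_rate=0.01, beta=0.9):
    """One gradient-descent-with-momentum update (velocities start at zero)."""
    v_dW = beta * v_dW + (1 - beta) * dW   # smooth the gradients
    v_db = beta * v_db + (1 - beta) * db
    W = W - learning_rate * v_dW           # update with the smoothed gradients
    b = b - learning_rate * v_db
    return W, b, v_dW, v_db
```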
1.14 RMSprop (Root Mean Square prop)
• Update W and b on each iteration with dW or db divided by the root of the exponentially weighted moving average of their squares (the square is element-wise):
S_dW = β·S_dW + (1 − β)·dW²
W := W − α · dW / (√S_dW + ε)
• Implementations add a small ε to the denominator to avoid division by zero.
• The intuition (using the classic elongated-contours example, with b as the oscillating vertical direction and W as the horizontal direction) is to make smaller/slower updates of b and larger/faster updates of W, to improve convergence speed. Because db is large in that example, S_db is relatively large and dividing by its root damps the vertical updates, while S_dW is relatively small, so the horizontal updates stay large.
• It allows using a higher learning rate and gives faster convergence.
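A minimal sketch of one RMSprop update; the names and default hyperparameters are illustrative, and `epsilon` is the small constant mentioned above.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, learning_rate=0.001, beta=0.9, epsilon=1e-8):
    """One RMSprop update: divide each gradient by the root of its running mean square."""
    s_dW = beta * s_dW + (1 - beta) * np.square(dW)   # element-wise square
    s_db = beta * s_db + (1 - beta) * np.square(db)
    W = W - learning_rate * dW / (np.sqrt(s_dW) + epsilon)
    b = b - learning_rate * db / (np.sqrt(s_db) + epsilon)
    return W, b, s_dW, s_db
```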
1.15 Adam (adaptive moment estimation) optimization
• Adam combines momentum and RMSprop: compute V_dW, V_db (momentum) and S_dW, S_db (RMSprop) on each iteration, apply bias correction to both, and update the parameters:
W := W − α · V_dW^corrected / (√(S_dW^corrected) + ε)
b := b − α · V_db^corrected / (√(S_db^corrected) + ε)
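Putting the two previous ideas together, a minimal sketch of one Adam step for a single parameter matrix (the same formulas apply to b); `t` is the 1-based iteration counter and the defaults follow the commonly recommended values β1 = 0.9, β2 = 0.999, ε = 10⁻⁸.

```python
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single parameter matrix W."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW               # momentum term
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)    # RMSprop term
    v_corrected = v_dW / (1 - beta1 ** t)                 # bias corrections
    s_corrected = s_dW / (1 - beta2 ** t)
    W = W - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return W, v_dW, s_dW
```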
• Alternatives:
1.18 Hyperparameter tuning process
• Coarse-to-fine tuning: first make coarse changes to the hyperparameters, then fine-tune them.
• Use an appropriate scale for each hyperparameter (see the sketch after this list).
– One possibility is to sample values at random within an intended range.
– Use a log scale to sample the values to try (applicable, for example, to the learning rate):
For α, to sample between 10^a and 10^b, sample r uniformly from [a, b] ([-4, 0] for example) and set α = 10^r.
For exponentially weighted average hyperparameters (β, β1, β2), sample r uniformly from [a, b] ([-3, -1] for example) and set β = 1 − 10^r.
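A minimal sketch of sampling on a log scale, using the example ranges above; the generator seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Learning rate: sample r uniformly in [-4, 0], then alpha = 10^r lies in [1e-4, 1].
r = rng.uniform(-4, 0)
alpha = 10 ** r

# EWMA parameter: sample r uniformly in [-3, -1], then beta = 1 - 10^r lies in [0.9, 0.999].
r = rng.uniform(-3, -1)
beta = 1 - 10 ** r
```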
1.19 Batch normalization
• Normally Z is normalized before the activation function, though some literature suggests normalizing after the activation function.
• New parameters γ (multiplying Z_norm) and β (added after the multiplication) are introduced and learned during forward/backward propagation. This prevents all neurons from being forced to have activations with mean 0 and variance 1, which is not always desirable. Given the normalized intermediate values Z_norm^(i), the layer computes (see the sketch at the end of this section):
Z̃^(i) = γ·Z_norm^(i) + β
• The bias parameter b in the calculation of Z is no longer needed, because the mean of Z is subtracted, canceling out any effect of adding b. The new parameter β effectively becomes the new bias term.
• At test time there is no mini-batch µ and σ², so these are estimated using an exponentially weighted average of the values computed on the mini-batches during training.
• Why does batch normalization work?
– It makes the weights of deeper layers more robust to changes in the outputs of earlier layers (it reduces the shift in their input distributions), so each layer can learn somewhat more independently.
– It also has a slight regularization effect with mini-batches, due to the “noise” introduced by computing the mean and variance on that mini-batch only rather than on the entire dataset, which has an effect similar to that of dropout.
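A minimal sketch of the batch-normalization computation for one layer's pre-activations during training, assuming Z has shape (units, examples); `gamma`, `beta`, and `epsilon` are placeholders, and a full implementation would also keep exponentially weighted averages of `mu` and `sigma2` for use at test time.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize Z over the mini-batch, then scale and shift with learned gamma, beta."""
    mu = np.mean(Z, axis=1, keepdims=True)       # per-unit mean over the mini-batch
    sigma2 = np.var(Z, axis=1, keepdims=True)    # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(sigma2 + epsilon)
    Z_tilde = gamma * Z_norm + beta              # learned scale and shift
    return Z_tilde, mu, sigma2
```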
1.20 Multi-class classification
• Softmax is in contrast with hardmax, where the network’s output would be a binary vector with a 1 in the position corresponding to the maximum value of Z^[L] and 0s everywhere else.
• Softmax is the generalization of logistic regression to more than two classes. For two classes, it reduces to logistic regression.
• Softmax loss function (assuming y is a one-hot binary vector over the C classes):
L(ŷ, y) = − Σ_{j=1}^{C} y_j · log(ŷ_j)
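A minimal sketch of the softmax activation and the loss above for a single example, where `z` plays the role of Z^[L] and `y` is a one-hot label vector; the names and values are placeholders.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

def cross_entropy_loss(y_hat, y):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j) for a one-hot y."""
    return -np.sum(y * np.log(y_hat + 1e-12))

z = np.array([5.0, 2.0, -1.0, 3.0])   # illustrative logits Z^[L]
y = np.array([1.0, 0.0, 0.0, 0.0])    # one-hot label
print(cross_entropy_loss(softmax(z), y))
```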