
Coursera Deep Learning Specialization Notes:

Improving Deep Neural Networks


Amir Masoud Sefidian

Version 1.0, November 2022


Contents
1 Improving Deep Neural Networks
1.1 Training/Dev (Cross-Validation (CV))/Test
1.2 Bias and Variance
1.3 Regularization
1.4 Dropout
1.5 Other regularization methods
1.6 Normalizing training sets
1.7 Vanishing/Exploding gradients
1.8 Weight initialization for deep networks
1.9 Numerical approximation of gradients
1.10 Gradient checking (Grad check)
1.11 Mini-batch gradient descent
1.12 Optimization algorithms - exponentially weighted moving averages
1.13 Gradient descent with momentum
1.14 RMSprop (Root Mean Square prop)
1.15 Adam (adaptive moment estimation) optimization
1.16 Learning rate decay (lower on the list of hyper-parameters to try)
1.17 Local optima and saddle points
1.18 Hyperparameter tuning process
1.19 Batch normalization
1.20 Multi-class classification


Preface
A couple of years ago I completed the Deep Learning Specialization taught by AI pioneer Andrew Ng. I
found this series of courses immensely helpful in my deep learning journey. Years later, I decided to
prepare this document to share some of the notes highlighting key concepts I learned in the second
course of the specialization, Improving Deep Neural Networks. This course teaches how to optimize a
model's performance by applying a range of algorithms and techniques: for instance, how to tune the
learning rate, the number of layers, and the number of neurons in each layer. Regularization techniques
such as dropout and Batch Normalization are then covered, ending with an optimization section that
discusses stochastic gradient descent, momentum, RMSprop, and the Adam optimization algorithm.
The notes are based on the lecture videos, the supplementary material provided, and my own
understanding of the topics.
The content of this document is mainly adapted from this GitHub repository. I have added some
explanations, illustrations, and visualizations to make some complex concepts easier to grasp for
readers. This document can serve as a good reference for Machine Learning Engineers, Deep Learning
Engineers, and Data Scientists to refresh the fundamentals of deep learning. Please
don't hesitate to contact me via my website (sefidian.com) if you have any questions.

Happy Learning!
Amir Masoud Sefidian


1 Improving Deep Neural Networks


1.1 Training/Dev (Cross-Validation (CV))/Test
• The training set is used to train the model's parameters.
• The dev (cross-validation) set is used to tune the model's hyperparameters and to check the model's performance during development.
• The test set is an unbiased set of data that was never seen by the model. Some teams skip it and use only a dev set instead.
• Traditionally, with small data, the split would be either 70/30% (train/test (or dev)) or 60/20/20% (train/dev/test). With big data the split can be something like 98/1/1%.
• Ensure that dev and test sets are from the same distribution.
• If the training data is from a different distribution than the test set, then it is recommended
that the dev set should belong to the same distribution as the test set.
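As an illustration, here is a minimal NumPy sketch of a random 98/1/1 split; the function name, argument layout (examples along the first axis), and fractions are illustrative assumptions, not part of the course material:

import numpy as np

def train_dev_test_split(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    # shuffle the m examples once, then slice the indices into train/dev/test sets
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_dev = int(m * dev_frac)
    n_test = int(m * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return (X[train_idx], Y[train_idx]), (X[dev_idx], Y[dev_idx]), (X[test_idx], Y[test_idx])

Because all three parts come from the same shuffled pool, the dev and test sets automatically share the same distribution, as recommended above.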

1.2 Bias and Variance


• High bias generally means underfitting - high error on the training set.
• High variance generally means overfitting - high error on the test (or dev) set.
• Base error is the reference error for the same task (e.g. human-level error); the model's errors should be compared against it when judging high bias or high variance.
• The classical tradeoff between the two is discussed less in the scope of deep learning, because we can almost always make the network bigger and/or add more data.
• High bias and high variance can occur at the same time if the model underfits some regions of the data and overfits others.
Solutions for high bias:
• Bigger network (KEY) - does not cause high variance (with good regularization)
• Train longer
• Change NN architecture
• Hyperparameter search
• Increase the number of useful features
Solutions for high variance:
• More training data (KEY) - does not cause high bias
• Regularization
• Reduce model complexity
• Dropout
• Early Stopping
• Data Augmentation
• Batch Normalization


1.3 Regularization
• Regularization is also called weight decay because it makes the weights smaller for higher values of lambda (the regularization parameter). It reduces high variance.

• There are L2 and L1 regularization: L2 penalizes the squared weights, while L1 penalizes only their absolute values and has the "advantage" of making the weight matrix sparse; L2 is the most used in practice.

• There is usually no regularization of the bias term because it is just a constant.


Regularization in Logistic Regression:
L2 norm:
J(W, b) = (1/m) Σ_{i=1}^{m} L(y^(i), ŷ^(i)) + (λ/2m) ||W||_2^2
L1 norm:
J(W, b) = (1/m) Σ_{i=1}^{m} L(y^(i), ŷ^(i)) + (λ/2m) ||W||_1
Regularization in Neural Networks:
J(W^[1], b^[1], · · · , W^[L], b^[L]) = (1/m) Σ_{i=1}^{m} L(y^(i), ŷ^(i)) + (λ/2m) Σ_{l=1}^{L} ||W^[l]||_F^2,
where ||W^[l]||_F^2 = Σ_{i=1}^{n^[l]} Σ_{j=1}^{n^[l−1]} (W_{ij}^[l])^2 (Frobenius norm)
Gradient Descent update:
W^[l] ← (1 − αλ/m) W^[l] − α · (Backprop Term)
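A minimal NumPy sketch of the L2 penalty and the corresponding weight-decay update, assuming grads already holds the unregularized backprop gradients and that parameters/grads are dictionaries keyed as "W1", "b1", "dW1", ... (this layout is an illustrative assumption):

import numpy as np

def l2_cost_term(parameters, lambd, m, L):
    # (lambda / 2m) * sum of squared Frobenius norms of all weight matrices
    return (lambd / (2 * m)) * sum(
        np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))

def update_with_weight_decay(parameters, grads, alpha, lambd, m, L):
    for l in range(1, L + 1):
        # dW holds the data term; add the (lambda/m) * W regularization term
        dW = grads["dW" + str(l)] + (lambd / m) * parameters["W" + str(l)]
        parameters["W" + str(l)] -= alpha * dW
        parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
    return parameters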

• Intuitions: Why does it help with reducing variance problems?

– When λ → ∞, the penalty pushes the weight matrices W^[l] reasonably close to zero. As a result, the neural network effectively becomes a much smaller neural network. See Figure (1).

Figure 1: Regularization intuition.

– If the regularization becomes very large, the parameters W^[l] ≈ 0, so Z will be relatively small. Thus, if the activation function is, say, tanh, it operates in its roughly linear regime when Z → 0. The whole neural network then computes something not too far from a big linear function, which is a pretty simple function rather than a very complex, highly non-linear one. See Figure (2).

1.4 Dropout
• Dropout regularization consists of training the NN with a number of neurons “switched off”
at every training iteration (though not during testing). It has a similar effect to regularization,
and it is possible to have different percentages of dropped units/neurons for each layer, making
it more flexible. The same units/neurons are dropped in both forward and backward steps.


Figure 2: Regularization intuition.

With dropout, the cost function does not necessarily decrease monotonically on every iteration, as it usually does for plain gradient descent.

– Inverted dropout is the most common type of dropout. It consists of scaling the activations by dividing the activation matrix by keep_prob (the probability of keeping units) for each layer.
– Steps:

Listing 1: Inverted Dropout

import numpy as np

keep_prob = 0.8
# random mask for layer 3: keep each unit with probability keep_prob
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)
# ensures that the expected value of a3 remains the same
a3 /= keep_prob
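Since the same units are dropped in both the forward and backward steps, the mask d3 is typically reused in the backward pass. A continuation sketch of Listing 1, where da3 is assumed to hold the gradient with respect to a3 arriving from the next layer (an illustrative assumption):

# backward pass for layer 3: apply the same mask d3 to the incoming gradient
da3 = np.multiply(da3, d3)   # zero out gradients of the dropped units
da3 /= keep_prob             # mirror the forward division by keep_prob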

Figure 3: Dropout

• Intuitions:

– Dropout randomly knocks out units in the network. Hence, it is as if on every iteration we
are working with a smaller neural network, and so using a smaller neural network seems
like it should have a regularizing effect.


– Let's look at it from the perspective of a single unit. This unit takes some inputs and generates some meaningful output. With dropout, its inputs can get randomly eliminated, so it can't rely on any one feature, because any one of its inputs could go away at random. The unit is therefore reluctant to put too much weight on any single input and is motivated to spread the weights out, giving a little bit of weight to each of its inputs. Spreading out the weights has the effect of shrinking their squared norm.

• One big downside of dropout is that the cost function J is no longer well-defined.

1.5 Other regularization methods


• Early stopping consists of stopping training when the error of the network is the lowest for
the dev(cross-validation) dataset, even if it can still be decreased for the training set.

Figure 4: Early stopping

• Data augmentation is a technique for artificially increasing the amount of data by generating new data points from existing data. It is helpful when only a few data samples are available; in deep learning this situation is problematic because the model tends to overfit when trained on a limited number of samples. Augmentation ranges from adding minor alterations to existing data to using machine learning models to generate new data points in the latent space of the original data.

• Note: Orthogonalization is the separation of the cost optimization step (e.g. gradient descent)
from steps taken for not overfitting the model (e.g. regularization), in other words, optimizing
model’s parameters vs. optimizing model hyperparameters.

1.6 Normalizing training sets


• Compute the mean and variance on the training set and normalize element-wise (a short sketch follows at the end of this section):
μ = (1/m) Σ_{i=1}^{m} x^(i),  x ← x − μ
σ² = (1/m) Σ_{i=1}^{m} (x^(i))²,  x ← x / σ
(σ is the element-wise standard deviation of the already-centered data.)


Figure 5: Data augmentation

• The mean and variance obtained in the training set should be used to scale the test set as well,
(we don’t want to scale the training set differently).

• Allows using higher learning rates and faster convergence for gradient descent.

• If features are on very different scales, say the feature x1 ranges from 1 to 1000, and the feature
x2 ranges from 0 to 1, then the ratio or the range of values for the parameters w1 and w2 will
end up taking on very different values. Then the cost function can be very elongated. When
normalizing the features, the cost function will be more symmetric. When contours are spherical,
we can take much larger steps with gradient descent rather than needing to oscillate. See Figure
(6).

Figure 6: Normalization effect
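As referenced above, a minimal NumPy sketch of this normalization that reuses the training-set statistics for the test set (function and variable names are illustrative):

import numpy as np

def fit_normalizer(X_train):
    # per-feature statistics computed on the training set only
    mu = np.mean(X_train, axis=0)
    sigma = np.std(X_train, axis=0) + 1e-8   # small epsilon avoids division by zero
    return mu, sigma

def apply_normalizer(X, mu, sigma):
    return (X - mu) / sigma

# mu, sigma = fit_normalizer(X_train)
# X_train_norm = apply_normalizer(X_train, mu, sigma)
# X_test_norm uses the SAME mu and sigma obtained from the training set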

1.7 Vanishing / Exploding gradients


• In very deep networks (depending on the activation function), weights greater than 1 can make the activations grow exponentially with the number of layers carrying such weights, whereas weights smaller than 1 can make the activations shrink exponentially with the number of layers carrying such small weights (think of a very deep network with linear activations as an intuitive example).

• The same applies to the gradients (not just the activations/outputs) in the opposite direction (backward propagation): gradients can either explode (causing numerical instability) or become very small (so that the lower layers are barely updated, in addition to numerical instability).

1.8 Weight initialization for deep networks


• Partial solution to Vanishing/Exploding gradients

• Initialize the weights randomly with variance 1/n^[l−1] for tanh and 2/n^[l−1] for ReLU, where n^[l−1] is the number of inputs to layer l.

• For tanh:
W^[l] = np.random.randn(shape) * np.sqrt(1/n^[l−1])  (Xavier initialization)
or W^[l] = np.random.randn(shape) * np.sqrt(2/(n^[l−1] + n^[l]))
For ReLU (He initialization):
W^[l] = np.random.randn(shape) * np.sqrt(2/n^[l−1])
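A minimal sketch of these initializations in NumPy, assuming layer_dims is a list of layer sizes such as [n_x, n_h1, ..., n_y] (the list name and dictionary layout are illustrative assumptions):

import numpy as np

def initialize_parameters_he(layer_dims):
    parameters = {}
    for l in range(1, len(layer_dims)):
        # He initialization: variance 2 / n^[l-1], suited to ReLU layers
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2.0 / layer_dims[l - 1]))
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

# For tanh layers, replace np.sqrt(2.0 / layer_dims[l - 1]) with
# np.sqrt(1.0 / layer_dims[l - 1]) (Xavier initialization).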

1.9 Numerical approximation of gradients


• The two-sided difference, f′(θ) ≈ (f(θ + ε) − f(θ − ε)) / (2ε), approximates the derivative of a function with O(ε²) error, which is much better than the one-sided difference with O(ε) error; for ε smaller than 1 this means the two-sided difference has a much smaller error.

1.10 Gradient checking (Grad check)


• Used to check the correctness of the implementation (bugs). Only to be used during debugging,
not during training (it’s slow).

• Take W [1] , b[1] , · · · , W [L] , b[L] and concatenate and reshape them into a vector θ.

• Take dW [1] , db[1] , · · · , dW [L] , db[L] and concatenate and reshape them into a vector dθ.

• With θ being the vector of parameters θ_i, and dθ[i] = ∂J/∂θ_i, compare the approximate derivative with the real dθ using the check (a sketch follows at the end of this subsection):
||dθ_approx − dθ||_2 / (||dθ_approx||_2 + ||dθ||_2),
where
dθ_approx[i] = (J(θ_1, θ_2, · · · , θ_i + ε, · · · ) − J(θ_1, θ_2, · · · , θ_i − ε, · · · )) / (2ε)

• Note that || · ||_2 denotes the square root of the sum of the squared components (that is, the Euclidean norm of the vector).

• If the result is near 10^−7 it is great; if it is around 10^−5, suspect something in the formula; if it is around 10^−3, something is really wrong.

• Look for what dθ[i] (what component) has the highest difference, to pinpoint the cause of the
bug.

• Include the regularization term in the cost function when performing grad check.


• Doesn’t work with dropout (turn it off during grad check).

• Run at initialization and then again after some training.
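As referenced above, a minimal sketch of gradient checking for a generic cost, assuming theta is the flattened parameter vector, cost_fn(theta) returns J(θ), and grad_fn(theta) returns the backprop gradient dθ (these names are illustrative assumptions):

import numpy as np

def gradient_check(cost_fn, grad_fn, theta, epsilon=1e-7):
    dtheta = grad_fn(theta)
    dtheta_approx = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        # two-sided difference: O(epsilon^2) error
        dtheta_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(dtheta_approx - dtheta)
    denominator = np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)
    return numerator / denominator   # ~1e-7: great, ~1e-5: suspicious, ~1e-3: likely a bug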

1.11 Mini-batch gradient descent


• Useful for large datasets, where a single batch-gradient-descent pass over all examples may take too long per iteration.

• Mini-batch gradient descent runs each iteration of gradient descent on a smaller batch of the full dataset.

• The cost function trends downward, but not monotonically in this case.

• If batch size = m, it is just batch gradient descent (one step over all the examples at once); use this for m ≤ 2000.

• If batch size = 1, it is stochastic gradient descent (every example is its own mini-batch). In that case gradient descent never completely converges, and the speedup from vectorization is lost.

• The ideal scenario is in between the two extremes above (it may not exactly converge, but the learning rate can be reduced over time).

• Typical mini-batch sizes are powers of two (64, 128, 256, 512) to ensure they fit in CPU/GPU memory. A training-loop sketch follows.
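A minimal sketch of the mini-batch loop; forward_backward and update are placeholder names for the usual forward/backward pass and parameter update (illustrative assumptions):

import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    # X: (n_features, m), Y: (1, m) -- shuffle columns, then slice into batches
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# One epoch of mini-batch gradient descent:
# for X_batch, Y_batch in random_mini_batches(X_train, Y_train, 64):
#     cost, grads = forward_backward(X_batch, Y_batch, parameters)
#     parameters = update(parameters, grads, alpha)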

Figure 7: Batch Gradient Descent vs. Mini-Batch Gradient Descent vs. Stochastic Gradient Descent (panels: (a) Batch GD, (b) Mini-Batch GD, (c) Stochastic GD)

1.12 Optimization algorithms - exponentially weighted moving averages


• V_t = β V_{t−1} + (1 − β) θ_t. V_t is approximately an average over the past 1/(1 − β) data points.

• All the coefficients add up to 1.


Figure 8: Batch GD cost curve

Figure 9: Convergence of different GD methods

• When β = 0.9, it takes a delay of approximately 10 points (think 10 days for daily time-series data) for the contribution of a point to decay to about 1/3 (more precisely 1/e). The general rule (in which ε is 0.1 and β = 1 − ε to match this example) is:
(1 − ε)^(1/ε) ≈ 1/e

• To correct the bias of the first few terms (compared to the zero initialization), the following formula can be used:
V_t^corrected = V_t / (1 − β^t)
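A minimal sketch of an exponentially weighted moving average with bias correction over a 1-D series (the function name is an illustrative assumption):

import numpy as np

def ewma(theta, beta=0.9, bias_correction=True):
    v = 0.0
    out = np.zeros_like(theta, dtype=float)
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x          # V_t = beta * V_{t-1} + (1 - beta) * theta_t
        out[t - 1] = v / (1 - beta ** t) if bias_correction else v
    return out

# ewma(np.array([1.0, 2.0, 3.0])) -> bias-corrected running averages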

1.13 Gradient descent with momentum


• Converges faster than the standard gradient descent algorithm.

• The basic idea is to compute an exponentially weighted average of gradients, and then use that
gradient to update the weights.

• Uses exponentially weighted moving averages to smooth out the derivatives dW and db when updating W and b in each iteration. For example, for dW (and similarly for db):

V_dW = β V_dW + (1 − β) dW

W = W − α V_dW


Figure 10: Exponentially Weighted Moving Average

• Sometimes a simplified version is used that folds the (1 − β) factor into the learning rate (which must then be adjusted) instead of keeping it explicit:
V_dW = β V_dW + dW
α_adjusted = α (1 − β)
• β is most commonly 0.9 (pretty robust value)
• Bias correction is not usually used for gradient descent.
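A minimal sketch of the momentum update for a single layer, assuming v["dW"] and v["db"] were initialized to zeros of the right shapes (the dictionary layout is an illustrative assumption):

def update_with_momentum(parameters, grads, v, alpha=0.01, beta=0.9):
    # v holds the exponentially weighted averages of the gradients
    for key in ("W", "b"):
        v["d" + key] = beta * v["d" + key] + (1 - beta) * grads["d" + key]
        parameters[key] = parameters[key] - alpha * v["d" + key]
    return parameters, v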

1.14 RMSprop (Root Mean Square prop)

Figure 11: RMSprop.

• Update W and b on each iteration with dW or db divided by the square root of an exponentially weighted average of their element-wise squares:
S_dW = β S_dW + (1 − β) dW²
W = W − α · dW / (√S_dW + ε)
(and analogously for S_db and b)

• Implementations add a small ε to the denominator to avoid division by zero.
• The intuition (in the classic elongated-contour example) is to have smaller/slower updates in the vertical direction (b) and larger/faster updates in the horizontal direction (W), to improve convergence speed. The oscillating vertical derivatives db are large, so S_db is large and dividing by √S_db damps the vertical updates; the horizontal derivatives dW are small, so dividing by the small √S_dW speeds up the horizontal updates.
• Allows using a higher learning rate and faster convergence.
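A minimal sketch of the RMSprop update for a single layer, assuming s["dW"] and s["db"] were initialized to zeros (the dictionary layout is an illustrative assumption):

import numpy as np

def update_with_rmsprop(parameters, grads, s, alpha=0.001, beta=0.9, epsilon=1e-8):
    for key in ("W", "b"):
        # exponentially weighted average of the element-wise squared gradients
        s["d" + key] = beta * s["d" + key] + (1 - beta) * np.square(grads["d" + key])
        parameters[key] = parameters[key] - alpha * grads["d" + key] / (np.sqrt(s["d" + key]) + epsilon)
    return parameters, s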


1.15 Adam (adaptive moment estimation) optimization


• Combine intuitions of Momentum + RMSprop together, both with bias correction!

V_dW = β₁ V_dW + (1 − β₁) dW,   V_db = β₁ V_db + (1 − β₁) db

S_dW = β₂ S_dW + (1 − β₂) dW²,   S_db = β₂ S_db + (1 − β₂) db²

V_dW^corrected = V_dW / (1 − β₁^t),   V_db^corrected = V_db / (1 − β₁^t)

S_dW^corrected = S_dW / (1 − β₂^t),   S_db^corrected = S_db / (1 − β₂^t)

W = W − α · V_dW^corrected / (√(S_dW^corrected) + ε)

b = b − α · V_db^corrected / (√(S_db^corrected) + ε)

• There are two β parameters (plus ε):

– β₁ is the momentum parameter and is usually 0.9
– β₂ is the RMSprop parameter and is usually 0.999
– ε is usually 10^−8
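A minimal sketch of the Adam update for a single layer; v and s (initialized to zeros) and the iteration counter t (starting at 1) follow an illustrative layout:

import numpy as np

def update_with_adam(parameters, grads, v, s, t,
                     alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    for key in ("W", "b"):
        g = grads["d" + key]
        v["d" + key] = beta1 * v["d" + key] + (1 - beta1) * g             # momentum term
        s["d" + key] = beta2 * s["d" + key] + (1 - beta2) * np.square(g)  # RMSprop term
        v_corr = v["d" + key] / (1 - beta1 ** t)                          # bias correction
        s_corr = s["d" + key] / (1 - beta2 ** t)
        parameters[key] = parameters[key] - alpha * v_corr / (np.sqrt(s_corr) + epsilon)
    return parameters, v, s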

1.16 Learning rate decay (lower on the list of hyper-parameters to try)


• Use a slower learning rate as gradient descent approaches convergence:
α = α₀ / (1 + decay_rate × epoch_num)

• Alternatives (sketched below):

– Exponential decay: α = 0.95^epoch_num × α₀
– Or: α = (k / √epoch_num) × α₀
– Discrete staircase, manual decay, etc.
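A minimal sketch of these schedules (function names are illustrative):

import numpy as np

def lr_inverse_decay(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

def lr_exponential_decay(alpha0, epoch_num, base=0.95):
    return base ** epoch_num * alpha0

def lr_sqrt_decay(alpha0, epoch_num, k=1.0):
    return (k / np.sqrt(epoch_num)) * alpha0   # assumes epoch_num >= 1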

1.17 Local optima and saddle points


• In high-dimensional spaces, most points with zero gradient are saddle points, not local optima!

– Plateaus around saddle points slow down learning.

– Local optima are comparatively rare; it is unlikely to get stuck in them.


1.18 Hyperparameter tuning process


• Order of importance of hyperparameters for tuning:
1. learning rate (α)
2. momentum term (β : 0.9)
3. mini-batch size
4. number of hidden units
5. number of layers
6. learning rate decay
7. β1 , β2 , 
• Choose the hyperparameter value combinations at random (don't use a grid): with the high number of hyperparameters nowadays the search space is high-dimensional, and it is not worthwhile to test every value/combination.

Figure 12: Grid vs. Random search.

• Coarse to fine-tuning - first coarse changes of the hyperparameters, then fine tune them.
• Use an appropriate scale for the hyperparameters (see the sketch after Figure 14).
– One possibility is to sample values at random within an intended range.
– Use log scales to sample parameter values (applicable, for example, to the learning rate):
For α (sampling between 10^a and 10^b), sample r uniformly from [a, b] ([−4, 0] for example) and set α = 10^r.
For exponentially weighted average hyperparameters (β, β₁, β₂), sample r uniformly from [a, b] ([−3, −1] for example) and set β = 1 − 10^r.

• Two possible approaches for hyperparameter search:


– Panda approach: watch only one model, change its parameters gradually, and check for improvements. Requires less hardware, but might not be the most efficient method.
– Caviar approach: Run multiple models with different parameters in parallel, if you have
the computing power for it.


Figure 13: Coarse to fine search.

Figure 14: Hyperparameter Search
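As referenced above, a minimal sketch of log-scale random sampling for α and β, using the example ranges from the text:

import numpy as np

rng = np.random.default_rng(0)

def sample_learning_rate(low_exp=-4, high_exp=0):
    r = rng.uniform(low_exp, high_exp)   # r ~ Uniform[a, b]
    return 10.0 ** r                     # alpha = 10^r, log-uniform in [1e-4, 1]

def sample_beta(low_exp=-3, high_exp=-1):
    r = rng.uniform(low_exp, high_exp)
    return 1.0 - 10.0 ** r               # beta = 1 - 10^r, e.g. in [0.9, 0.999]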

1.19 Batch normalization


• Normalize not just the inputs but the activation inputs to the next layer, subtracting the mean
and dividing by the variance (mean 0, variance 1).

• Normally Z is normalized, before the activation function, though some literature suggests nor-
malizing after the activation function.

• New parameters γ (which multiplies Z_norm) and β (added after the multiplication) are introduced and learned during forward/backward propagation. This prevents all neurons from being forced to have activations with mean 0 and variance 1, which is not always desirable. Given some intermediate values z^(1), z^(2), . . . , z^(m) for a layer in the network:

Z_norm^(i) = (z^(i) − μ) / √(σ² + ε),  where μ = (1/m) Σ_i z^(i)  and  σ² = (1/m) Σ_i (z^(i) − μ)²

Z̃^(i) = γ Z_norm^(i) + β

(A forward-pass sketch is given at the end of this section.)

• The bias parameter “b” in the calculation of Z is no longer needed because the mean of Z is
being subtracted, canceling out any effect from adding “b”. The new parameter β effectively
becomes the new bias term.
• At test time there is no mini-batch µ and σ², so these are estimated using an exponentially weighted average of the values computed on the mini-batches during training.

Figure 15: Batch Normalization.

Why does Batch Normalization work?


– It makes weights of deeper layers more robust to changes in weights in earlier layers of
network.
– Covariate shift: the input data distribution changes (e.g. over time, with new batches, etc.), and normally you would need to retrain your network.
– Batch normalization makes the process of learning easier by reducing the variability of
the inputs presented to each layer (which now have similar variance and mean), therefore
reducing the covariate shift, and that is especially important for deeper layers, where
inputs could change significantly as a net effect of all the other changes in the network.
– It reduces the amount that the distribution of hidden unit values shifts around. If we were to plot the distribution of hidden unit values (technically the normalized Z), batch norm ensures that no matter how the output of the previous layer changes, the mean and variance of that layer's inputs remain the same. Batch norm therefore reduces the problem of the input values changing; it makes these values more stable, so that the later layers of the neural network have firmer ground to stand on.


– It also has a slight regularization effect with mini-batches, due to the "noise" introduced by computing the mean and variance on that mini-batch only rather than on the entire dataset, which has an effect similar to that of dropout.

Figure 16: Covariate Shift
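As referenced above, a minimal sketch of the batch-norm forward computation for one layer's pre-activations Z of shape (n_units, batch_size); gamma and beta are the learned per-unit parameters of shape (n_units, 1) (names are illustrative):

import numpy as np

def batchnorm_forward(Z, gamma, beta, epsilon=1e-8):
    mu = np.mean(Z, axis=1, keepdims=True)        # per-unit mean over the mini-batch
    var = np.var(Z, axis=1, keepdims=True)        # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)    # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta               # learned scale and shift
    return Z_tilde, (Z_norm, mu, var)

# At test time, mu and var are replaced by exponentially weighted averages
# accumulated over the training mini-batches.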

1.20 Multi-class classification


• # of neurons in the output layer = # of classes. The sum of all outputs must be 1, since they are the probabilities (likelihoods) of X belonging to each class.

• The softmax activation function is used as the output activation function:

t = e^{Z^[L]}
a_i^[L] = t_i / Σ_{j=1}^{K} t_j,  where K is the number of classes and output units

Figure 17: Softmax

• Softmax is in contrast with Hardmax, where the network’s output will be a binary vector
with all 0s except for the position corresponding to the max value of Z [L] .


• Softmax is the generalization of logistic regression to more than two classes. For two classes,
it can be simplified/reduced to logistic regression.
Softmax Loss function (assuming a one-hot (binary) vector y):

L(ŷ, y) = − Σ_{j=1}^{C} y_j log(ŷ_j)

Softmax Cost function:

J = (1/m) Σ_{i=1}^{m} L(ŷ^(i), y^(i)),  with backprop: ∂J/∂Z^[L] = ŷ − y
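A minimal sketch of the softmax output and the corresponding cost, assuming ZL = Z^[L] has shape (K, m) and Y is one-hot with the same shape (names are illustrative):

import numpy as np

def softmax(ZL):
    t = np.exp(ZL - np.max(ZL, axis=0, keepdims=True))  # shift for numerical stability
    return t / np.sum(t, axis=0, keepdims=True)          # each column sums to 1

def softmax_cost(AL, Y):
    m = Y.shape[1]
    return -np.sum(Y * np.log(AL + 1e-12)) / m           # cross-entropy averaged over examples

# Backprop through softmax + cross-entropy: dZL = AL - Y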
