Unit-2 Improving-Deep-Neural-Networks
Contents
1 Improving Deep Neural Networks
  1.1 Training/Dev (Cross Validation (CV))/Test
  1.2 Bias and Variance
  1.3 Regularization
  1.4 Dropout
  1.5 Other regularization methods
  1.6 Normalizing training sets
  1.7 Vanishing / Exploding gradients
  1.8 Weight initialization for deep networks
  1.9 Numerical approximation of gradients
  1.10 Gradient checking (Grad check)
  1.11 Mini-batch gradient descent
  1.12 Optimization algorithms - exponentially weighted moving averages
  1.13 Gradient descent with momentum
  1.14 RMSprop (Root Mean Square prop)
  1.15 Adam (adaptive moment estimation) optimization
  1.16 Learning rate decay (lower on the list of hyper-parameters to try)
  1.17 Local optima and saddle points
  1.18 Hyperparameter tuning process
  1.19 Batch normalization
  1.20 Multi-class classification
Preface
A couple of years ago I completed the Deep Learning Specialization taught by AI pioneer Andrew Ng. I found this series of courses immensely helpful in my deep learning journey. Years later, I decided to prepare this document to share some of the notes that highlight key concepts I learned in the second course of the specialization, Improving Deep Neural Networks. This course teaches how to optimize a model's performance by applying a range of algorithms and techniques: for instance, how to tune the learning rate, the number of layers, and the number of neurons in each layer. It then covers regularization techniques such as dropout and Batch Normalization, and ends with an optimization section that discusses stochastic gradient descent, momentum, RMSprop, and the Adam optimization algorithm. These notes are based on the lecture videos, the supplementary material provided, and my own understanding of the topics.
The content of this document is mainly adapted from this GitHub repository. I have added explanations, illustrations, and visualizations to make some complex concepts easier to grasp. This document can serve as a good reference for Machine Learning Engineers, Deep Learning Engineers, and Data Scientists to refresh their memory on the fundamentals of deep learning. Please don't hesitate to contact me via my website (sefidian.com) if you have any questions.
Happy Learning!
Amir Masoud Sefidian
1 Improving Deep Neural Networks
1.3 Regularization
• L2 regularization is also called weight decay because it causes the weights to become smaller for higher values of lambda (the regularization parameter). It reduces high variance (see the sketch after this list).
• There are L2 and L1 regularization: L2 penalizes the squared norm of the weights, while L1 penalizes the absolute values (the L1 norm) and has the “advantage” of making the weight matrices sparse. L2 is the most used in practice.
– When λ → ∞, it sets the weight matrices W^[l] to be reasonably close to zero. As a result, the neural network becomes a much smaller neural network. See Figure (1).
– If the regularization parameter becomes very large, the parameters W^[l] ≈ 0, so Z will be relatively small. Thus, the activation function, if it is tanh, say, will operate in its relatively linear region when Z ≈ 0. The whole neural network will then compute something not too far from a big linear function, which is a fairly simple function rather than a very complex, highly non-linear one. See Figure (2).
1.4 Dropout
• Dropout regularization consists of training the NN with a number of neurons “switched off” at every training iteration (though not during testing). It has a similar effect to regularization, and it is possible to use a different percentage of dropped units/neurons for each layer, making it more flexible. The same units/neurons are dropped in both the forward and backward steps.
The cost function with dropout does not necessarily decrease monotonically from iteration to iteration, as we usually see for gradient descent.
– Inverted dropout is the most common type of dropout. It consists of scaling the activations by dividing the activation matrix of each layer by keep_prob (the probability of keeping units), so that the expected value of the activations is unchanged.
– Steps: see Figure (3) and the sketch below.
Figure 3: Dropout
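The steps in Figure (3) can be summarized with a short NumPy sketch of inverted dropout for one layer's forward pass. The layer index 3 and the names `a3`, `d3`, and `keep_prob` follow the course's usual example, but the concrete values here are illustrative.

```python
import numpy as np

keep_prob = 0.8              # probability of keeping a unit
a3 = np.random.rand(5, 10)   # activations of layer 3 (illustrative values)

# 1. Build a random mask that is 1 with probability keep_prob.
d3 = np.random.rand(*a3.shape) < keep_prob
# 2. Zero out the dropped units.
a3 = a3 * d3
# 3. Scale up by keep_prob so the expected value of a3 is unchanged
#    (the "inverted" part; no scaling is then needed at test time).
a3 = a3 / keep_prob
```

The same mask `d3` is reused when propagating the gradients in the backward step, so the dropped units stay dropped for that iteration.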
• Intuitions:
– Dropout randomly knocks out units in the network. Hence, it is as if on every iteration we
are working with a smaller neural network, and so using a smaller neural network seems
like it should have a regularizing effect.
– Let’s take a look from the perspective of a single unit. This unit takes some inputs and produces some meaningful output. With dropout, its inputs can get randomly eliminated, so it can’t rely on any one feature, because any one of its inputs could go away at random. Hence, the unit is reluctant to put too much weight on any single input and is more motivated to spread the weights out, giving a little bit of weight to each of its inputs. Spreading out the weights has the effect of shrinking their squared norm.
• One big downside of dropout is that the cost function J is no longer well-defined.
• Note: Orthogonalization is the separation of the cost optimization step (e.g. gradient descent) from the steps taken to avoid overfitting the model (e.g. regularization); in other words, optimizing the model’s parameters vs. tuning the model’s hyperparameters.
1.6 Normalizing training sets
• Normalize the inputs by subtracting the mean and scaling by the variance (a minimal sketch follows at the end of this section):
µ = (1/m) Σ_{i=1}^{m} x^(i),   x := x − µ
σ² = (1/m) Σ_{i=1}^{m} (x^(i))²  (element-wise square),   x := x / σ²
• The mean and variance obtained on the training set should be used to scale the test set as well (we don’t want the training and test sets scaled differently).
• Allows using higher learning rates and faster convergence for gradient descent.
• If features are on very different scales, say the feature x1 ranges from 1 to 1000 and the feature x2 ranges from 0 to 1, then the parameters w1 and w2 will end up taking on values in very different ranges, and the cost function can be very elongated. When the features are normalized, the cost function is more symmetric; with more spherical contours, gradient descent can take much larger steps instead of oscillating. See Figure (6).
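A minimal sketch of this normalization, assuming inputs with one example per column (the course convention) and reusing the training-set statistics on the test set. Here the centered data is divided by the standard deviation √σ²; the function and variable names are placeholders.

```python
import numpy as np

def normalize_inputs(X_train, X_test):
    """Normalize features using statistics computed on the training set only.

    X_train, X_test -- arrays of shape (n_features, m), one example per column.
    """
    mu = np.mean(X_train, axis=1, keepdims=True)            # per-feature mean
    X_train = X_train - mu
    sigma2 = np.mean(X_train ** 2, axis=1, keepdims=True)   # per-feature variance
    X_train = X_train / np.sqrt(sigma2 + 1e-8)              # divide by the standard deviation

    # Scale the test set with the *training* mean and variance.
    X_test = (X_test - mu) / np.sqrt(sigma2 + 1e-8)
    return X_train, X_test
```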
1.7 Vanishing / Exploding gradients
• In a very deep network, if the weights are slightly larger than 1 (or the identity), the activations can grow exponentially with the number of layers; if the weights are slightly smaller than 1, the activations can shrink exponentially through the layers with such small weights (think in terms of a very deep network with linear activations as an intuitive example).
• The above is also applicable to the gradients (not just the activations/outputs), in the opposite direction (backward propagation): gradients can either explode (causing numerical instability) or become very small (with the consequence that lower layers are barely updated, as well as numerical instability).
1.8 Weight initialization for deep networks
• A partial solution is to scale the random initial weights according to the number of inputs of each layer (sketched below).
• For tanh: W = np.random.randn(shape) * np.sqrt(1 / n^[l−1]) (Xavier initialization), or the variant np.sqrt(2 / (n^[l−1] + n^[l])).
• For ReLU: W = np.random.randn(shape) * np.sqrt(2 / n^[l−1]) (He initialization).
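A minimal NumPy sketch of these initialization schemes for a single layer; `n_prev` and `n_curr` stand for n^[l−1] and n^[l], and the function signature is illustrative.

```python
import numpy as np

def initialize_layer(n_prev, n_curr, activation="relu"):
    """Initialize W[l] (shape n_curr x n_prev) with a variance suited to the activation."""
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)                 # He initialization
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_prev)                 # Xavier initialization
        # variant: scale = np.sqrt(2.0 / (n_prev + n_curr))
    else:
        scale = np.sqrt(1.0 / n_prev)
    W = np.random.randn(n_curr, n_prev) * scale
    b = np.zeros((n_curr, 1))
    return W, b
```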
1.10 Gradient checking (Grad check)
• Take W^[1], b^[1], ..., W^[L], b^[L] and concatenate and reshape them into a vector θ.
• Take dW^[1], db^[1], ..., dW^[L], db^[L] and concatenate and reshape them into a vector dθ.
• Note that ||·||₂ denotes the square root of the sum of the squared differences (that is, the Euclidean norm of the vector). The quantity checked is ||dθ_approx − dθ||₂ / (||dθ_approx||₂ + ||dθ||₂), where dθ_approx is the two-sided numerical approximation of the gradient, component-wise (J(θ + ε) − J(θ − ε)) / (2ε).
• If the result is around 10⁻⁷, it is great; if it is around 10⁻⁵, suspect something in the formula; if it is around 10⁻³, something is really wrong.
• Look for which component dθ[i] has the highest difference, to pinpoint the cause of the bug.
• Include the regularization term in the cost function when performing grad check.
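A hedged sketch of the check described above, assuming a `cost(theta)` function that returns J for a flattened 1-D parameter vector and a `grad` vector produced by backprop; both names are placeholders.

```python
import numpy as np

def gradient_check(cost, theta, grad, epsilon=1e-7):
    """Compare backprop gradients with two-sided numerical approximations."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        grad_approx[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * epsilon)

    # Relative difference: around 1e-7 is great, around 1e-3 suggests a bug.
    numerator = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return numerator / denominator
```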
1.11 Mini-batch gradient descent
• Mini-batch gradient descent runs each iteration of gradient descent on a smaller batch of the full dataset, instead of on the entire training set at once, which may take too long per iteration.
• The cost function trends downward but not monotonically in this case.
• If the batch size = m, then it is just batch gradient descent (run on all the examples at once); use this when m ≤ 2000. If the batch size = 1, it is stochastic gradient descent.
• The ideal scenario is in between the two extremes above (it may not exactly converge, but we can reduce the learning rate).
• Typical mini-batch sizes are powers of two (64, 128, 256, 512), to ensure they fit in CPU/GPU memory.
Figure 7: Batch Gradient Descent vs. Mini-Batch Gradient Descent vs. Stochastic Gradient Descent
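A minimal sketch of shuffling a training set and cutting it into mini-batches, assuming X and Y store one example per column (the course convention); the names and the default batch size are illustrative.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the examples (columns) and cut them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]

    mini_batches = []
    for start in range(0, m, batch_size):          # the last batch may be smaller
        end = min(start + batch_size, m)
        mini_batches.append((X_shuffled[:, start:end], Y_shuffled[:, start:end]))
    return mini_batches
```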
1.12 Optimization algorithms - exponentially weighted moving averages
• When β = 0.9, it takes roughly 10 steps (think 10 days for daily time-series data) for the contribution of a point to decay to about 1/3 (more precisely 1/e) of its original weight. The general rule (with ε = 0.1 and β = 1 − ε in this example) is:
(1 − ε)^(1/ε) ≈ 1/e
• To correct the bias of the first few terms (due to the zero initialization), the following formula can be used:
V_t^corrected = V_t / (1 − β^t)
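A minimal sketch of the exponentially weighted average with bias correction, applied to a sequence of scalar observations (for example a daily temperature series); `data` and `beta` are placeholders.

```python
def exponentially_weighted_average(data, beta=0.9):
    """Return the bias-corrected exponentially weighted averages of a sequence."""
    v = 0.0
    averages = []
    for t, x in enumerate(data, start=1):
        v = beta * v + (1 - beta) * x          # V_t = beta * V_{t-1} + (1 - beta) * x_t
        averages.append(v / (1 - beta ** t))   # bias correction: V_t / (1 - beta^t)
    return averages
```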
1.13 Gradient descent with momentum
• The basic idea is to compute an exponentially weighted average of the gradients, and then use that average to update the weights.
• It uses exponentially weighted moving averages to smooth out the derivatives dW and db when updating W and b in each iteration. For example for dW (and similarly for db):
V_dW = β·V_dW + (1 − β)·dW
W := W − α·V_dW
• Sometimes a simplified version is used that drops the (1 − β) factor and instead folds it into the learning rate (which then must be adjusted):
V_dW = β·V_dW + dW,   α_adjusted = α·(1 − β)
• β is most commonly 0.9 (pretty robust value)
• Bias correction is not usually used in practice for gradient descent with momentum.
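A minimal sketch of one momentum update for a single layer's parameters, with the velocities assumed to be initialized to zeros; all names and the default hyperparameters are illustrative.

```python
def momentum_step(W, b, dW, db, v_dW, v_db, learning_rate=0.01, beta=0.9):
    """One gradient-descent-with-momentum update (velocities start at zero)."""
    v_dW = beta * v_dW + (1 - beta) * dW   # smooth the gradients
    v_db = beta * v_db + (1 - beta) * db
    W = W - learning_rate * v_dW           # update with the smoothed gradients
    b = b - learning_rate * v_db
    return W, b, v_dW, v_db
```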
1.14 RMSprop (Root Mean Square prop)
• Update W and b on each iteration with dW or db divided by the root of the exponentially weighted moving average of their squares (the square is element-wise):
S_dW = β·S_dW + (1 − β)·dW²
W := W − α · dW / (√S_dW + ε)
• Implementations add a small ε to the denominator to avoid division by zero.
• The intuition (using the classic elongated-contours example, with b as the oscillating vertical direction and W as the horizontal direction) is to make smaller/slower updates of b and larger/faster updates of W, to improve convergence speed. Because db is large in that example, S_db is relatively large and dividing by its root damps the vertical updates, while S_dW is relatively small, so the horizontal updates stay large.
• It allows using a higher learning rate and gives faster convergence.
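A minimal sketch of one RMSprop update; the names and default hyperparameters are illustrative, and `epsilon` is the small constant mentioned above.

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, learning_rate=0.001, beta=0.9, epsilon=1e-8):
    """One RMSprop update: divide each gradient by the root of its running mean square."""
    s_dW = beta * s_dW + (1 - beta) * np.square(dW)   # element-wise square
    s_db = beta * s_db + (1 - beta) * np.square(db)
    W = W - learning_rate * dW / (np.sqrt(s_dW) + epsilon)
    b = b - learning_rate * db / (np.sqrt(s_db) + epsilon)
    return W, b, s_dW, s_db
```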
1.15 Adam (adaptive moment estimation) optimization
• Adam combines momentum and RMSprop: compute V_dW, V_db (momentum) and S_dW, S_db (RMSprop) on each iteration, apply bias correction to both, and update the parameters:
W := W − α · V_dW^corrected / (√(S_dW^corrected) + ε)
b := b − α · V_db^corrected / (√(S_db^corrected) + ε)
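Putting the two previous ideas together, a minimal sketch of one Adam step for a single parameter matrix (the same formulas apply to b); `t` is the 1-based iteration counter and the defaults follow the commonly recommended values β1 = 0.9, β2 = 0.999, ε = 10⁻⁸.

```python
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a single parameter matrix W."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW               # momentum term
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)    # RMSprop term
    v_corrected = v_dW / (1 - beta1 ** t)                 # bias corrections
    s_corrected = s_dW / (1 - beta2 ** t)
    W = W - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return W, v_dW, s_dW
```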
• Alternatives:
1.18 Hyperparameter tuning process
• Coarse-to-fine tuning: first make coarse changes to the hyperparameters, then fine-tune them.
• Use an appropriate scale for each hyperparameter (see the sketch after this list).
– One possibility is to sample values at random within an intended range.
– Use a log scale to sample the values to try (applicable, for example, to the learning rate):
For α, to sample between 10^a and 10^b, sample r uniformly from [a, b] ([-4, 0] for example) and set α = 10^r.
For exponentially weighted average hyperparameters (β, β1, β2), sample r uniformly from [a, b] ([-3, -1] for example) and set β = 1 − 10^r.
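A minimal sketch of sampling on a log scale, using the example ranges above; the generator seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Learning rate: sample r uniformly in [-4, 0], then alpha = 10^r lies in [1e-4, 1].
r = rng.uniform(-4, 0)
alpha = 10 ** r

# EWMA parameter: sample r uniformly in [-3, -1], then beta = 1 - 10^r lies in [0.9, 0.999].
r = rng.uniform(-3, -1)
beta = 1 - 10 ** r
```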
1.19 Batch normalization
• Normally Z is normalized before the activation function, though some literature suggests normalizing after the activation function.
• New parameters γ (multiplying Z_norm) and β (added after the multiplication) are introduced and learned during forward/backward propagation. This prevents all neurons from being forced to have activations with mean 0 and variance 1, which is not always desirable. Given the normalized intermediate values Z_norm^(i), the layer computes (see the sketch at the end of this section):
Z̃^(i) = γ·Z_norm^(i) + β
• The bias parameter b in the calculation of Z is no longer needed, because the mean of Z is subtracted, canceling out any effect of adding b. The new parameter β effectively becomes the new bias term.
• At test time there is no mini-batch µ and σ², so these are estimated using an exponentially weighted average of the values computed on the mini-batches during training.
• Why does batch normalization work?
– It makes the weights of deeper layers more robust to changes in the outputs of earlier layers (it reduces the shift in their input distributions), so each layer can learn somewhat more independently.
– It also has a slight regularization effect with mini-batches, due to the “noise” introduced by computing the mean and variance on that mini-batch only rather than on the entire dataset, which has an effect similar to that of dropout.
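A minimal sketch of the batch-normalization computation for one layer's pre-activations during training, assuming Z has shape (units, examples); `gamma`, `beta`, and `epsilon` are placeholders, and a full implementation would also keep exponentially weighted averages of `mu` and `sigma2` for use at test time.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize Z over the mini-batch, then scale and shift with learned gamma, beta."""
    mu = np.mean(Z, axis=1, keepdims=True)       # per-unit mean over the mini-batch
    sigma2 = np.var(Z, axis=1, keepdims=True)    # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(sigma2 + epsilon)
    Z_tilde = gamma * Z_norm + beta              # learned scale and shift
    return Z_tilde, mu, sigma2
```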
1.20 Multi-class classification
• Softmax is in contrast with hardmax, where the network’s output would be a binary vector with a 1 in the position corresponding to the maximum value of Z^[L] and 0s everywhere else.
• Softmax is the generalization of logistic regression to more than two classes. For two classes, it reduces to logistic regression.
• Softmax loss function (assuming y is a one-hot binary vector over the C classes):
L(ŷ, y) = − Σ_{j=1}^{C} y_j · log(ŷ_j)
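A minimal sketch of the softmax activation and the loss above for a single example, where `z` plays the role of Z^[L] and `y` is a one-hot label vector; the names and values are placeholders.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

def cross_entropy_loss(y_hat, y):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j) for a one-hot y."""
    return -np.sum(y * np.log(y_hat + 1e-12))

z = np.array([5.0, 2.0, -1.0, 3.0])   # illustrative logits Z^[L]
y = np.array([1.0, 0.0, 0.0, 0.0])    # one-hot label
print(cross_entropy_loss(softmax(z), y))
```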