Part 1.3. Optimization of Learning Algorithms
Instructor:
Assoc. Prof. Dr. Truong Ngoc Son
Chapter 3
Optimization of the learning process
Outline
The challenges in Deep learning
Momentum
ADAGRAD – Adaptive Gradient Descent
RMSPROP (Root Mean Squared Propagation)
ADAM
Dropout
The challenge in Deep Learning
Local minima
The objective function of deep learning usually has many local minima
When the numerical solution of an optimization problem is near a local
optimum, the solution obtained at the final iteration may only minimize the
objective function locally rather than globally, because the gradient of the
objective function approaches or becomes zero there
[Figure: loss curve L(w) versus w, showing several local minima and the global minimum]
The challenge in Deep Learning
Vanishing Gradient
As more layers using certain activation functions are added to a neural network,
the gradients of the loss function approach zero, making the network hard to
train
The simplest solution is to use other activation functions, such as ReLU, whose
derivative does not shrink toward zero for positive inputs
Residual networks are another solution: they provide residual (skip) connections
straight to earlier layers, which let gradients flow back more easily
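To make the ReLU point concrete, a quick illustration (not from the slides): backpropagation multiplies one activation derivative per layer, and the sigmoid derivative is at most 0.25, so the product shrinks geometrically with depth, whereas ReLU contributes a factor of 1 for positive pre-activations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Multiply one sigmoid derivative per layer (each <= 0.25), evaluated at
# random pre-activations; the product shrinks geometrically with depth.
rng = np.random.default_rng(0)
factor = 1.0
for layer in range(1, 11):
    z = rng.standard_normal()
    s = sigmoid(z)
    factor *= s * (1.0 - s)   # sigmoid'(z) = s * (1 - s) <= 0.25
    print(f"layer {layer:2d}: product of derivatives = {factor:.2e}")
# ReLU'(z) is 1 for z > 0, so it does not add this shrinking factor.
```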
The challenge in Deep Learning
Overfitting and underfitting
Overfitting is a modeling error in statistics that occurs when a function is fitted
too closely to a limited set of data points. As a result, the model is useful only
for its initial data set, and not for any other data sets
The model fits the training data well but does not perform well on the test data
Underfitting is a scenario in data science where a data model is unable to capture
the relationship between the input and output variables accurately, generating a
high error rate on both the training set and unseen data
Momentum
The method of momentum is designed to accelerate learning.
The momentum algorithm accumulates an exponentially decaying moving average
of past gradients and continues to move in their direction
[Figure: two panels showing gradient descent trajectories near local minima]
Gradient descent update:
$w_t = w_{t-1} - \Delta w_t, \qquad \Delta w_t = \eta \nabla C(w)$

Adaptive learning rate (RMSProp-style running average of the squared gradient):
$w_t = w_{t-1} - \eta' \frac{\partial L}{\partial w_{t-1}}$
where
$\eta' = \frac{\eta}{\sqrt{\alpha_t + \varepsilon}}, \qquad \alpha_t = \beta \alpha_{t-1} + (1 - \beta)\left(\frac{\partial L}{\partial w_{t-1}}\right)^2$
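A minimal NumPy sketch of these two update rules (the function names, hyperparameter values, and the toy loss are illustrative assumptions, not from the slides):

```python
import numpy as np

def momentum_update(w, v, grad, lr=0.1, beta=0.9):
    """Momentum: accumulate an exponentially decaying average of past gradients."""
    v = beta * v + (1.0 - beta) * grad
    return w - lr * v, v

def rmsprop_update(w, alpha, grad, lr=0.1, beta=0.9, eps=1e-8):
    """Adaptive learning rate: divide by the running average of squared gradients."""
    alpha = beta * alpha + (1.0 - beta) * grad ** 2
    return w - lr * grad / np.sqrt(alpha + eps), alpha

# Toy usage on L(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w, v = 0.0, 0.0
for _ in range(200):
    w, v = momentum_update(w, v, 2.0 * (w - 3.0))
print(w)  # close to 3.0
```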
Adam — Adaptive Moment Estimation
Adam combines two stochastic gradient descent extensions: Adaptive Gradient
(AdaGrad) and Root Mean Square Propagation (RMSProp)
Adam also keeps an exponentially decaying average of past gradients, similar to
SGD with momentum
$v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\frac{\partial L}{\partial w}$
$s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\left(\frac{\partial L}{\partial w}\right)^2$
$\hat{v}_{dW} = \frac{v_{dW}}{1 - \beta_1^t}, \qquad \hat{s}_{dW} = \frac{s_{dW}}{1 - \beta_2^t}$
$W = W - \eta \frac{\hat{v}_{dW}}{\sqrt{\hat{s}_{dW}} + \varepsilon}$
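A minimal NumPy sketch of this Adam step (the function name, state variables, and default hyperparameter values are illustrative assumptions):

```python
import numpy as np

def adam_update(W, v, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: moment estimates, bias correction, parameter update."""
    v = beta1 * v + (1.0 - beta1) * grad        # moving average of gradients
    s = beta2 * s + (1.0 - beta2) * grad ** 2   # moving average of squared gradients
    v_hat = v / (1.0 - beta1 ** t)              # bias correction, t starts at 1
    s_hat = s / (1.0 - beta2 ** t)
    return W - lr * v_hat / (np.sqrt(s_hat) + eps), v, s
```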
Dropout
Dropout helps avoid the overfitting problem
Probabilistically dropping out nodes in the network is a simple and effective
regularization method
Dropout is implemented per-layer in a neural network
A common value is a probability of 0.5 for retaining the output of each node in a
hidden layer
[Figure: a single unit in a standard NN vs. a dropout NN; during training the unit appears with probability p (weight W), while at test time it always appears and its outgoing weights are scaled to pW]
Dropout
How to apply dropout
Standard network (layer $l$ to $l+1$):
$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}$
$y_i^{(l+1)} = f(z_i^{(l+1)})$

Dropout network:
$r_j^{(l)} \sim \mathrm{Bernoulli}(p)$
$\tilde{y}^{(l)} = r^{(l)} * y^{(l)}$
$z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}$
$y_i^{(l+1)} = f(z_i^{(l+1)})$

[Figure: the same unit in a standard NN and in a dropout NN, where each input $y_j^{(l)}$ is multiplied by a Bernoulli mask $r_j^{(l)}$ before the weighted sum]
PYTHON CODE
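A minimal NumPy sketch of the dropout forward pass from the previous slide, including the test-time weight scaling W → pW (the layer sizes, the ReLU choice for f, and p = 0.5 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(y, W, b, p=0.5, train=True):
    """Forward one layer with dropout: mask inputs with Bernoulli(p) during
    training; at test time keep every unit and scale the weights by p."""
    if train:
        r = rng.binomial(1, p, size=y.shape)   # r_j ~ Bernoulli(p), 1 means "keep"
        y_tilde = r * y                        # drop a fraction (1 - p) of the inputs
        z = W @ y_tilde + b
    else:
        z = (p * W) @ y + b                    # expected activation: W -> pW
    return np.maximum(z, 0.0)                  # f = ReLU (illustrative choice)

# Toy usage: a layer with 4 inputs and 3 outputs
y = rng.standard_normal(4)
W = rng.standard_normal((3, 4))
b = np.zeros(3)
print(dropout_layer(y, W, b, train=True))      # training pass with a random mask
print(dropout_layer(y, W, b, train=False))     # test pass with scaled weights
```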
Assignments
Design a multilayer neural network and apply the
learning-optimization algorithms.
(input layer, 2 hidden layers (sigmoid, ReLU), output layer)
Optimization: Momentum, Adagrad, Dropout (+
Momentum, Adagrad)
Compare: accuracy, (convergence time)
Dataset: MNIST
Assignments
Week 8: Submit assignment
Week 9: Quiz, decision on the final project
Week 10-14: CNN
Week 15-17: Final project presentations