
Neural Network training

• Optimization
• Mini-batch SGD
• Learning rate decay
• Adaptive methods
• Massaging the numbers
• Data augmentation
• Data preprocessing
• Weight initialization
• Batch normalization
• Regularization
• Classic regularization: L2 and L1
• Dropout
• Label smoothing
• Test time: ensembles, averaging predictions
Slides from L. Lazebnik

Mini-batch SGD
• Iterate over epochs
• Iterate over dataset mini-batches (x_1, y_1), …, (x_b, y_b)
• Compute gradient of the mini-batch loss:
  ∇L̂ = (1/b) Σ_{i=1}^{b} ∇l(w, x_i, y_i)
• Update parameters:
  w ← w − η ∇L̂
• Check for convergence, decide whether to decay
learning rate

• What are the hyperparameters?


• Mini-batch size, learning rate decay schedule,
deciding when to stop
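
As a concrete illustration, here is a minimal NumPy sketch of this loop for a linear least-squares model; the model, loss, and hyperparameter values are placeholders, not prescriptions from the slides.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=32, epochs=10, decay=0.5, decay_every=5):
    """Mini-batch SGD for linear least squares: l(w, x, y) = 0.5 * (x @ w - y)**2."""
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        perm = np.random.permutation(n)              # reshuffle the dataset each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)   # (1/b) * sum_i grad l(w, x_i, y_i)
            w -= lr * grad                           # w <- w - eta * grad
        if (epoch + 1) % decay_every == 0:
            lr *= decay                              # decide whether to decay the learning rate
    return w
```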

SGD and mini-batch size
• Larger mini-batches: more expensive and
less frequent updates, lower gradient
variance, more parallelizable
• In the literature, SGD with larger batches is
generally reported to generalize more poorly
(e.g., Keskar et al., 2016)
• But can be made to work by using larger learning
rates with larger mini-batches (Goyal et al., 2017)

Learning rate decay


• Exponential decay: η = η_0 e^{−kt}, where η_0
and k are hyperparameters, t is the iteration
or epoch number
• 1/t decay: η = η_0 / (1 + kt)
• Step decay: reduce rate by a constant factor
every few epochs, e.g., by 0.5 every 5
epochs, 0.1 every 20 epochs
• Manual: watch validation error and reduce
learning rate whenever it stops improving
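
The closed-form schedules above, written out as a small sketch; η_0 = 0.1 and k = 0.01 are arbitrary placeholder values.

```python
import math

eta0, k = 0.1, 0.01                              # placeholder hyperparameters

def exponential_decay(t):
    return eta0 * math.exp(-k * t)               # eta = eta0 * e^(-k*t)

def one_over_t_decay(t):
    return eta0 / (1.0 + k * t)                  # eta = eta0 / (1 + k*t)

def step_decay(epoch, factor=0.5, every=5):
    return eta0 * factor ** (epoch // every)     # cut the rate by `factor` every `every` epochs
```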

Diagnosing learning rates

Image source: Stanford CS231n

A typical phenomenon

• Why does the learning curve look like this?

Image source: Stanford CS231n

A typical phenomenon
Possible explanation

Image source

Debugging learning curves

(Figure: six example learning curves and their diagnoses)
• Not training → bug in update calculation?
• Error increasing → bug in update calculation?
• Error decreasing → not converged yet
• Slow start → suboptimal initialization?
• Possible overfitting
• Definite overfitting

Image source: Stanford CS231n

Early stopping
• Idea: do not train a network to achieve too
low training error
• Monitor validation error to decide when to
stop

Figure from Deep Learning Book
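
A sketch of the monitoring logic; `train_one_epoch`, `validation_error`, and `save_checkpoint` are assumed callbacks, not functions defined in the slides.

```python
def train_with_early_stopping(train_one_epoch, validation_error, save_checkpoint,
                              patience=5, max_epochs=200):
    """Stop once validation error has not improved for `patience` consecutive epochs."""
    best_err = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        err = validation_error()
        if err < best_err:
            best_err = err
            save_checkpoint()                 # keep the best model seen so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                         # validation error stopped improving
    return best_err
```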

Advanced optimizers
• SGD with momentum
• RMSProp
• Adam

SGD with momentum
What will SGD do?

Image source

SGD with momentum


• Introduce a “momentum” variable v and
associated “friction” coefficient ρ:
  v ← ρv − η∇L
  w ← w + v
• Typically start with ρ = 0.5, gradually increase
over time

(Figure: the update combines the momentum term ρv with the gradient step −η∇L)

Image source

SGD with momentum
• Introduce a “momentum” variable v and
associated “friction” coefficient ρ:
  v ← ρv − η∇L
  w ← w + v
• Move faster in directions with consistent gradient
• Avoid oscillating in directions with large but
inconsistent gradients

(Figure: trajectories of standard SGD vs. SGD with momentum)

Image source

SGD with momentum


• Introduce a “momentum” variable v and
associated “friction” coefficient ρ:
  v ← ρv − η∇L
  w ← w + v
• Nesterov momentum: evaluate gradient at
“lookahead” position w + ρv:
  v ← ρv − η∇L(w + ρv)

(Figure: standard momentum combines ρv with −η∇L(w); Nesterov combines ρv with −η∇L(w + ρv))
Image source
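
Both variants as a small sketch; `grad` is an assumed function returning ∇L at a given point, and w, v may be scalars or NumPy arrays.

```python
def momentum_step(w, v, grad, lr=0.01, rho=0.9):
    """Standard momentum: v <- rho*v - lr*grad(w); w <- w + v."""
    v = rho * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, rho=0.9):
    """Nesterov momentum: evaluate the gradient at the lookahead point w + rho*v."""
    v = rho * v - lr * grad(w + rho * v)
    return w + v, v
```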

Adaptive per-parameter learning rates
• Gradients of different layers have different
magnitudes
• Want an automatic way to set different
learning rates for different parameters

Adagrad
• Keep track of history of gradient magnitudes,
scale the learning rate for each parameter
based on this history:

  s_j ← s_j + (∂L/∂w_j)²
  w_j ← w_j − (η / (√s_j + ε)) ∂L/∂w_j
• Parameters with small gradients get large
updates and vice versa
• Long-ago gradient magnitudes are not “forgotten”
so learning rate decays too quickly
J. Duchi, Adaptive subgradient methods for online learning and stochastic
optimization, JMLR 2011
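
The element-wise update as a NumPy sketch; the ε value is a typical choice, not specified on the slide.

```python
import numpy as np

def adagrad_step(w, s, grad_w, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients; parameters with a large history get small steps."""
    s = s + grad_w ** 2                        # squared-gradient history (never forgotten)
    w = w - lr * grad_w / (np.sqrt(s) + eps)
    return w, s
```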

RMSProp
• Introduce decay factor β (typically ≥ 0.9) to
downweight past history exponentially:

  s_j ← β s_j + (1 − β) (∂L/∂w_j)²
  w_j ← w_j − (η / (√s_j + ε)) ∂L/∂w_j

http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

Adam
• Combine RMSProp with momentum:
  v ← β_1 v + (1 − β_1) ∇L
  s_j ← β_2 s_j + (1 − β_2) (∂L/∂w_j)²
  w_j ← w_j − (η / (√s_j + ε)) v_j
• Default parameters from paper:
  β_1 = 0.9, β_2 = 0.999, ε = 1e−8
• Full algorithm includes bias correction terms to
account for v and s starting at 0:
  v̂ = v / (1 − β_1^t),  ŝ = s / (1 − β_2^t)   (t is the timestep)

D. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR 2015
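
A sketch of one full Adam step, including the bias correction; the step size of 1e-3 is the paper's default, the other constants follow the slide.

```python
import numpy as np

def adam_step(w, v, s, grad_w, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at timestep t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * grad_w            # momentum (first moment)
    s = beta2 * s + (1 - beta2) * grad_w ** 2       # RMSProp-style second moment
    v_hat = v / (1 - beta1 ** t)                    # bias correction for v starting at 0
    s_hat = s / (1 - beta2 ** t)                    # bias correction for s starting at 0
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```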

Which optimizer to use in practice?
• Adaptive methods tend to reduce initial training
error faster than SGD
• Adam with default parameters is a popular choice,
SGD+momentum may work better but requires more
tuning
• However, adaptive methods may quickly
plateau on the validation set or generalize more
poorly
• Use Adam first, then switch to SGD?
• Or just stick with plain old SGD? (Wilson et al., 2017)
• All methods require careful tuning and learning
rate control

Massaging the numbers

Data augmentation
• Introduce transformations not adequately
sampled in the training data
• Geometric: flipping, rotation, shearing, multiple crops
• Photometric: color transformations
• Other: add noise, compression artifacts, lens
distortions, etc.
• Limited only by your imagination and
time/memory constraints!
• Avoid introducing obvious artifacts
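
A minimal NumPy sketch of a few such transformations applied on the fly; the specific parameter ranges are illustrative only.

```python
import numpy as np

def augment(image, rng):
    """image: H x W x 3 float array in [0, 1]; returns a randomly transformed copy."""
    if rng.random() < 0.5:                                    # geometric: horizontal flip
        image = image[:, ::-1, :]
    h, w, _ = image.shape                                     # geometric: random 90% crop
    ch, cw = int(0.9 * h), int(0.9 * w)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    image = image[top:top + ch, left:left + cw, :]
    image = image * rng.uniform(0.8, 1.2)                     # photometric: brightness scaling
    image = image + rng.normal(0.0, 0.02, size=image.shape)   # other: additive Gaussian noise
    return np.clip(image, 0.0, 1.0)

# usage: augmented = augment(img, np.random.default_rng(0))
```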

Data preprocessing
• Zero centering
• Subtract mean image – all input images need to
have the same resolution
• Subtract per-channel means – images don’t need
to have the same resolution
• Optional: rescaling – divide each value by
(per-pixel or per-channel) standard deviation

• Be sure to apply the same transformation at
training and test time!
• Save training set statistics and apply to test data
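
A sketch of per-channel zero centering and rescaling; the key point is that the statistics come from the training set only and are reused at test time.

```python
import numpy as np

def compute_stats(train_images):
    """train_images: N x H x W x 3. Per-channel mean and standard deviation."""
    mean = train_images.mean(axis=(0, 1, 2))
    std = train_images.std(axis=(0, 1, 2))
    return mean, std

def preprocess(images, mean, std):
    """Apply the same transformation at training and test time."""
    return (images - mean) / std

# mean, std = compute_stats(train_images)   # save these alongside the model
# test_in = preprocess(test_images, mean, std)
```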

Weight initialization
• What’s wrong with initializing all weights to
the same number (e.g., zero)?

Weight initialization
• Typically: initialize to random values sampled from
zero-mean Gaussian: w ~ N(0, σ²)
• Standard deviation matters!
• Key idea: avoid reducing or amplifying the variance of
layer responses, which would lead to vanishing or
exploding gradients
• Common heuristics:
• σ = 1/√n_in, where n_in is the number of inputs to a layer
• σ = √(2/(n_in + n_out)) (Glorot and Bengio, 2010)
• σ = √(2/n_in) for ReLU (He et al., 2015)
• Initializing biases: just set them to 0

More details: http://cs231n.github.io/neural-networks-2/#init
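
The heuristics above as a sketch for one fully connected layer; the scheme names are labels of convenience, not from the slide.

```python
import numpy as np

def init_layer(n_in, n_out, scheme="he", rng=None):
    """Zero-mean Gaussian init, w ~ N(0, sigma^2), with sigma set by the chosen heuristic."""
    if rng is None:
        rng = np.random.default_rng()
    if scheme == "simple":
        sigma = 1.0 / np.sqrt(n_in)                 # sigma = 1/sqrt(n_in)
    elif scheme == "glorot":
        sigma = np.sqrt(2.0 / (n_in + n_out))       # Glorot and Bengio, 2010
    else:
        sigma = np.sqrt(2.0 / n_in)                 # He et al., 2015 (for ReLU)
    W = rng.normal(0.0, sigma, size=(n_in, n_out))
    b = np.zeros(n_out)                             # biases: just set them to 0
    return W, b
```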

Review: L2 regularization
• Regularized objective:
  L̂(w) = (λ/2) ‖w‖² + Σ_{i=1}^{n} l(w, x_i, y_i)
• Gradient of objective:
  ∇L̂(w) = λw + Σ_{i=1}^{n} ∇l(w, x_i, y_i)
• SGD update:
  w ← w − η (λw + ∇l(w, x_i, y_i))
  w ← (1 − ηλ) w − η ∇l(w, x_i, y_i)
• Interpretation: weight decay
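
The weight-decay reading is easiest to see in code; a one-line sketch assuming `grad_loss` returns ∇l(w, x_i, y_i):

```python
def sgd_step_l2(w, grad_loss, lr, lam):
    """L2-regularized SGD step: shrink w by (1 - lr*lam), then take the usual gradient step."""
    return (1 - lr * lam) * w - lr * grad_loss(w)
```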

L1 regularization
• Regularized objective:
  L̂(w) = λ ‖w‖_1 + Σ_{i=1}^{n} l(w, x_i, y_i)
       = λ Σ_j |w_j| + Σ_{i=1}^{n} l(w, x_i, y_i)
• Gradient: ∇L̂(w) = λ sgn(w) + Σ_{i=1}^{n} ∇l(w, x_i, y_i)
• SGD update:
  w ← w − ηλ sgn(w) − η ∇l(w, x_i, y_i)
• Interpretation: encouraging sparsity
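
The corresponding L1 step: the regularizer pulls every weight toward zero by a fixed amount ηλ per step regardless of its magnitude, which is what drives many weights to (near) zero. A sketch with the same assumed `grad_loss` interface:

```python
import numpy as np

def sgd_step_l1(w, grad_loss, lr, lam):
    """L1-regularized SGD step: w <- w - lr*lam*sgn(w) - lr*grad."""
    return w - lr * lam * np.sign(w) - lr * grad_loss(w)
```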

Dropout
• At training time, in each forward pass, keep each
neuron active with probability p (turn it off otherwise)
• At test time, to have deterministic behavior,
multiply the output of each neuron by p

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014
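
A sketch of this scheme, with p as the probability of keeping a unit active (matching the test-time scaling by p); the function signature is an assumption.

```python
import numpy as np

def dropout_forward(h, p, train, rng=None):
    """h: layer activations; p: probability of keeping each unit active."""
    if train:
        if rng is None:
            rng = np.random.default_rng()
        mask = rng.random(h.shape) < p        # keep each unit with probability p
        return h * mask
    return h * p                              # test time: deterministic, scale outputs by p
```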

Dropout
• Intuitions
• Prevent “co-adaptation” of units, increase
robustness to noise
• Train implicit ensemble

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014

Current status of dropout


• Against
• Slows down convergence
• Made redundant by batch normalization or
possibly even clashes with it
• Unnecessary for larger datasets or with sufficient
data augmentation
• In favor
• Can still help in certain scenarios: e.g., used in
Wide Residual Networks

Label smoothing
• Idea: avoid overly confident predictions,
account for label noise
• When using softmax loss, replace hard 1 and
0 prediction targets with “soft” targets of
1 − ε for the correct class and ε/(K − 1) for the
others (K is the number of classes)
• Used in Inception-v2 architecture
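
A sketch of building the smoothed targets for a K-class problem; ε = 0.1 is just a common placeholder value.

```python
import numpy as np

def smoothed_targets(labels, num_classes, eps=0.1):
    """labels: integer class indices. Correct class gets 1 - eps, every other class eps/(K - 1)."""
    targets = np.full((len(labels), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# usage: smoothed_targets(np.array([2, 0]), num_classes=4) -> each row sums to 1
```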

Test time
• Ensembles: train multiple independent models,
then average their predicted label distributions
• Gives 1-2% improvement in most cases
• Can take multiple snapshots of models obtained
during training, especially if you cycle the learning rate

G. Huang et al., Snapshot ensembles: Train 1, get M for free, ICLR 2017

Test time
• Average predictions across multiple crops of
test image
• There is a more elegant way to do this with fully
convolutional networks (FCNs)
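
Both kinds of test-time averaging as a sketch; `predict_proba` is an assumed interface returning per-class softmax probabilities.

```python
import numpy as np

def ensemble_predict(models, x, predict_proba):
    """Average predicted label distributions across independently trained models."""
    probs = np.mean([predict_proba(m, x) for m in models], axis=0)
    return probs.argmax(axis=-1)

def multicrop_predict(model, crops, predict_proba):
    """Average predictions over multiple crops of the same test image."""
    probs = np.mean([predict_proba(model, c) for c in crops], axis=0)
    return probs.argmax(axis=-1)
```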

Attempt at a conclusion
• Training neural networks is still a black art
• Process requires close “babysitting”
• For many techniques, the reasons why, when, and whether
they work are in active dispute
• Read everything but don’t trust anything
• It all comes down to (principled) trial and error
