08 Training
• Optimization
• Mini-batch SGD
• Learning rate decay
• Adaptive methods
• Massaging the numbers
• Data augmentation
• Data preprocessing
• Weight initialization
• Batch normalization
• Regularization
• Classic regularization: L2 and L1
• Dropout
• Label smoothing
• Test time: ensembles, averaging predictions
Slides from L. Lazebnik
Mini-batch SGD
• Iterate over epochs
• Iterate over dataset mini-batches $(x_1, y_1), \dots, (x_b, y_b)$
• Compute gradient of the mini-batch loss:
  $\nabla \hat{L}(\theta) = \frac{1}{b} \sum_{i=1}^{b} \nabla l(\theta, x_i, y_i)$
• Update parameters:
  $\theta \leftarrow \theta - \eta \nabla \hat{L}(\theta)$
• Check for convergence, decide whether to decay
learning rate
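A minimal NumPy sketch of this loop; the `loss_grad` function, the arrays `X`, `y`, and the decay schedule are illustrative placeholders, not part of the slides:

```python
import numpy as np

def sgd_train(theta, X, y, loss_grad, lr=0.1, batch_size=64, epochs=10, decay=0.5):
    """Plain mini-batch SGD. `loss_grad(theta, X_batch, y_batch)` is assumed
    to return the average gradient of the loss over the mini-batch."""
    n = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(n)              # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grad = loss_grad(theta, X[idx], y[idx])  # (1/b) * sum of per-example gradients
            theta -= lr * grad                       # theta <- theta - eta * grad
        lr *= decay  # crude decay; a real schedule would also watch validation error
    return theta
```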
SGD and mini-batch size
• Larger mini-batches: more expensive and
less frequent updates, lower gradient
variance, more parallelizable
• In the literature, SGD with larger batches is
generally reported to generalize more poorly
(e.g., Keskar et al., 2016)
• But can be made to work by using larger learning
rates with larger mini-batches (Goyal et al., 2017)
Diagnosing learning rates
[Figures omitted: “A typical phenomenon” and “Possible explanation”]
Early stopping
• Idea: do not train a network to achieve too
low training error
• Monitor validation error to decide when to
stop
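A hedged sketch of early stopping on validation error; `train_one_epoch`, `evaluate`, and the `get_state`/`set_state` snapshot methods are hypothetical placeholders:

```python
def train_with_early_stopping(model, train_one_epoch, evaluate, patience=5, max_epochs=200):
    """Stop when validation error has not improved for `patience` epochs.
    `evaluate(model)` is assumed to return the validation error."""
    best_err, best_state, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_err = evaluate(model)
        if val_err < best_err:
            best_err, best_state = val_err, model.get_state()  # hypothetical snapshot call
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    model.set_state(best_state)  # roll back to the best validation checkpoint
    return model
```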
Advanced optimizers
• SGD with momentum
• RMSProp
• Adam
SGD with momentum
What will SGD do?
SGD with momentum
• Introduce a “momentum” variable $v$ and an associated “friction” coefficient $\beta$:
  $v \leftarrow \beta v - \eta \nabla L$
  $\theta \leftarrow \theta + v$
• Move faster in directions with consistent gradient
• Avoid oscillating in directions with large but
inconsistent gradients
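A one-step NumPy sketch of the update above (the β and learning-rate values are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """One SGD+momentum update: v <- beta*v - lr*grad, theta <- theta + v."""
    v = beta * v - lr * grad
    theta = theta + v
    return theta, v

# usage: the velocity v starts at zero with the same shape as theta
# theta, v = momentum_step(theta, np.zeros_like(theta), grad)
```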
Adaptive per-parameter learning rates
• Gradients of different layers have different
magnitudes
• Want an automatic way to set different
learning rates for different parameters
Adagrad
• Keep track of history of gradient magnitudes,
scale the learning rate for each parameter
based on this history:
  $s_j \leftarrow s_j + \left( \frac{\partial L}{\partial \theta_j} \right)^2$
  $\theta_j \leftarrow \theta_j - \frac{\eta}{\sqrt{s_j} + \epsilon} \frac{\partial L}{\partial \theta_j}$
• Parameters with small gradients get large
updates and vice versa
• Long-ago gradient magnitudes are not “forgotten”
so learning rate decays too quickly
J. Duchi, Adaptive subgradient methods for online learning and stochastic
optimization, JMLR 2011
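A minimal NumPy sketch of the Adagrad update (variable names are illustrative):

```python
import numpy as np

def adagrad_step(theta, s, grad, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients, scale each parameter's step by that history."""
    s = s + grad ** 2                               # per-parameter history of squared gradients
    theta = theta - lr * grad / (np.sqrt(s) + eps)  # small-gradient parameters get larger steps
    return theta, s
```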
RMSProp
• Introduce a decay factor $\beta$ (typically ≥ 0.9) to downweight past history exponentially:
  $s_j \leftarrow \beta s_j + (1 - \beta) \left( \frac{\partial L}{\partial \theta_j} \right)^2$
  $\theta_j \leftarrow \theta_j - \frac{\eta}{\sqrt{s_j} + \epsilon} \frac{\partial L}{\partial \theta_j}$
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
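A corresponding NumPy sketch of the RMSProp update:

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    """RMSProp: exponentially decayed average of squared gradients."""
    s = beta * s + (1 - beta) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s
```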
Adam
• Combine RMSProp with momentum:
  $m \leftarrow \beta_1 m + (1 - \beta_1) \nabla L$
  $s_j \leftarrow \beta_2 s_j + (1 - \beta_2) \left( \frac{\partial L}{\partial \theta_j} \right)^2$
  $\theta_j \leftarrow \theta_j - \frac{\eta}{\sqrt{s_j} + \epsilon} m_j$
• Default parameters from the paper:
  $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$
• Full algorithm includes bias correction terms to account for $m$ and $s$ starting at 0:
  $\hat{m} = \frac{m}{1 - \beta_1^t}$, $\hat{s} = \frac{s}{1 - \beta_2^t}$ ($t$ is the timestep)
D. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR 2015
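A NumPy sketch of one Adam step, including the bias correction (hyperparameter values follow the defaults above):

```python
import numpy as np

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based timestep."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    s = beta2 * s + (1 - beta2) * grad ** 2   # second moment (squared-gradient history)
    m_hat = m / (1 - beta1 ** t)              # bias correction for m starting at 0
    s_hat = s / (1 - beta2 ** t)              # bias correction for s starting at 0
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s
```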
Which optimizer to use in practice?
• Adaptive methods tend to reduce initial training
error faster than SGD
• Adam with default parameters is a popular choice; SGD+momentum may work better but requires more tuning
• However, adaptive methods may quickly
plateau on the validation set or generalize more
poorly
• Use Adam first, then switch to SGD?
• Or just stick with plain old SGD? (Wilson et al., 2017)
• All methods require careful tuning and learning
rate control
Data augmentation
• Introduce transformations not adequately
sampled in the training data
• Geometric: flipping, rotation, shearing, multiple crops
• Photometric: color transformations
• Other: add noise, compression artifacts, lens
distortions, etc.
Data augmentation
• Introduce transformations not adequately
sampled in the training data
• Limited only by your imagination and
time/memory constraints!
• Avoid introducing obvious artifacts
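One possible augmentation pipeline, sketched with torchvision transforms; the specific transforms and magnitudes are illustrative, not prescribed by the slides:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                 # geometric: random crop + rescale
    transforms.RandomHorizontalFlip(),                 # geometric: flip
    transforms.RandomRotation(degrees=10),             # geometric: small rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),  # photometric: color transformations
    transforms.ToTensor(),
])
```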
Data preprocessing
• Zero centering
• Subtract mean image – all input images need to
have the same resolution
• Subtract per-channel means – images don’t need
to have the same resolution
• Optional: rescaling – divide each value by
(per-pixel or per-channel) standard deviation
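A NumPy sketch of per-channel zero centering and rescaling; the statistics are computed on the training set and assumed to be reused for validation/test data:

```python
import numpy as np

def compute_stats(train_images):
    """Per-channel mean and std from the *training* set; images shaped (N, H, W, C)."""
    mean = train_images.mean(axis=(0, 1, 2))
    std = train_images.std(axis=(0, 1, 2))
    return mean, std

def preprocess(images, mean, std):
    """Zero-center and rescale; apply the same training-set stats to every split."""
    return (images - mean) / std
```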
Weight initialization
• What’s wrong with initializing all weights to
the same number (e.g., zero)?
Weight initialization
• Typically: initialize to random values sampled from a zero-mean Gaussian: $w \sim \mathcal{N}(0, \sigma^2)$
• Standard deviation matters!
• Key idea: avoid reducing or amplifying the variance of
layer responses, which would lead to vanishing or
exploding gradients
• Common heuristics:
  • $\sigma = 1/\sqrt{n_{in}}$, where $n_{in}$ is the number of inputs to a layer
  • $\sigma = \sqrt{2 / (n_{in} + n_{out})}$ (Glorot and Bengio, 2010)
  • $\sigma = \sqrt{2 / n_{in}}$ for ReLU (He et al., 2015)
• Initializing biases: just set them to 0
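A NumPy sketch of these initialization heuristics; the `mode` names are made up for illustration:

```python
import numpy as np

def init_layer(n_in, n_out, mode="he"):
    """Gaussian weight init using the heuristics above; biases set to zero."""
    if mode == "he":            # sigma = sqrt(2 / n_in), recommended for ReLU
        sigma = np.sqrt(2.0 / n_in)
    elif mode == "glorot":      # sigma = sqrt(2 / (n_in + n_out))
        sigma = np.sqrt(2.0 / (n_in + n_out))
    else:                       # sigma = 1 / sqrt(n_in)
        sigma = 1.0 / np.sqrt(n_in)
    W = np.random.normal(0.0, sigma, size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b
```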
Review: L2 regularization
• Regularized objective:
  $\hat{L}(\theta) = \frac{\lambda}{2} \|\theta\|^2 + \sum_{i=1}^{n} l(\theta, x_i, y_i)$
• Gradient of objective:
  $\nabla \hat{L}(\theta) = \lambda \theta + \sum_{i=1}^{n} \nabla l(\theta, x_i, y_i)$
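A one-line NumPy sketch of the resulting SGD step, showing why L2 regularization acts as “weight decay” (the λ value is illustrative):

```python
def sgd_step_l2(theta, grad, lr=0.01, lam=1e-4):
    """SGD on the L2-regularized objective: the gradient gains a lam*theta term,
    which shrinks ('decays') the weights at every step."""
    return theta - lr * (lam * theta + grad)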
L1 regularization
• Regularized objective:
  $\hat{L}(\theta) = \lambda \|\theta\|_1 + \sum_{i=1}^{n} l(\theta, x_i, y_i) = \lambda \sum_j |\theta_j| + \sum_{i=1}^{n} l(\theta, x_i, y_i)$
• Gradient: $\nabla \hat{L}(\theta) = \lambda\, \mathrm{sgn}(\theta) + \sum_{i=1}^{n} \nabla l(\theta, x_i, y_i)$
• SGD update:
  $\theta \leftarrow \theta - \eta \lambda\, \mathrm{sgn}(\theta) - \eta \nabla l(\theta, x_i, y_i)$
• Interpretation: encouraging sparsity
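An analogous NumPy sketch of the L1-regularized SGD step:

```python
import numpy as np

def sgd_step_l1(theta, grad, lr=0.01, lam=1e-4):
    """SGD on the L1-regularized objective: a constant-magnitude pull toward zero,
    which drives many parameters to exactly zero (sparsity)."""
    return theta - lr * lam * np.sign(theta) - lr * grad
```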
Dropout
• At training time, in each forward pass, keep each neuron with probability p (i.e., turn it off with probability 1 − p)
• At test time, to have deterministic behavior, multiply the output of each neuron by p
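A NumPy sketch of this (“vanilla”) dropout scheme, with train-time masking and test-time scaling by p:

```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    """Keep each unit with probability p at training time; scale by p at test time."""
    if train:
        mask = np.random.rand(*x.shape) < p   # 1 = keep, 0 = drop
        return x * mask
    return x * p                              # matches the expected training-time output
```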
Dropout
• Intuitions
• Prevent “co-adaptation” of units, increase
robustness to noise
• Train implicit ensemble
Label smoothing
• Idea: avoid overly confident predictions,
account for label noise
• When using the softmax loss, replace hard 1 and 0 prediction targets with “soft” targets of $1 - \epsilon$ for the correct class and $\frac{\epsilon}{K - 1}$ for the others ($K$ = number of classes)
• Used in Inception-v2 architecture
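A NumPy sketch of building such soft targets (ε = 0.1 is just a common illustrative value):

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Turn integer labels into soft targets: 1 - eps for the true class,
    eps / (K - 1) spread over the remaining classes."""
    targets = np.full((len(y), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets
```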
Test time
• Ensembles: train multiple independent models,
then average their predicted label distributions
• Gives 1-2% improvement in most cases
• Can take multiple snapshots of models obtained
during training, especially if you cycle the learning rate
G. Huang et al., Snapshot ensembles: Train 1, get M for free, ICLR 2017
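A sketch of averaging predicted label distributions across an ensemble; the `predict_proba` method is an assumed interface, not from the slides:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the class-probability outputs of independently trained models,
    then pick the most likely class."""
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return probs.argmax(axis=1)
```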
Test time
• Average predictions across multiple crops of
test image
• There is a more elegant way to do this with fully
convolutional networks (FCNs)
Attempt at a conclusion
• Training neural networks is still a black art
• Process requires close “babysitting”
• For many techniques, the reasons why, when, and whether
they work are in active dispute
• Read everything but don’t trust anything
• It all comes down to (principled) trial and error