Neural networks and Backpropagation
Charles Ollion - Olivier Grisel
Neural Network for classification

Vector function with tunable parameters $\theta$:

$$f(\cdot; \theta): \mathbb{R}^N \rightarrow (0, 1)^K$$

Sample $s$ in dataset $S$:

- input: $x^s \in \mathbb{R}^N$
- expected output: $y^s \in \{0, \ldots, K-1\}$

Output is a conditional probability distribution:

$$f(x^s; \theta)_c = P(Y = c | X = x^s)$$
Artificial Neuron

$$z(x) = w^T x + b$$

$$f(x) = g(w^T x + b)$$

- $x$, $f(x)$: input and output
- $z(x)$: pre-activation
- $w$, $b$: weights and bias
- $g$: activation function
Layer of Neurons

$$f(x) = g(z(x)) = g(Wx + b)$$

$W$, $b$ are now a weight matrix and a bias vector.
One Hidden Layer Network

$$z^h(x) = W^h x + b^h$$

$$h(x) = g(z^h(x)) = g(W^h x + b^h)$$

$$z^o(x) = W^o h(x) + b^o$$

$$f(x) = \mathrm{softmax}(z^o) = \mathrm{softmax}(W^o h(x) + b^o)$$
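A minimal NumPy sketch of this forward pass (the sizes $N$, $H$, $K$, the tanh activation and the small random init are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
N, H, K = 4, 8, 3                 # input, hidden and output dimensions
W_h, b_h = rng.normal(0, 0.01, (H, N)), np.zeros(H)
W_o, b_o = rng.normal(0, 0.01, (K, H)), np.zeros(K)

x = rng.normal(size=N)
z_h = W_h @ x + b_h               # hidden pre-activation z^h(x)
h = np.tanh(z_h)                  # hidden layer h(x)
z_o = W_o @ h + b_o               # output pre-activation z^o(x)
f_x = softmax(z_o)                # f(x): K probabilities summing to 1
```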
One Hidden Layer Network
Alternate representation
One Hidden Layer Network
Keras implementation
```python
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(H, input_dim=N))   # weight matrix of dim [N x H]
model.add(Activation("tanh"))      # hidden activation g
model.add(Dense(K))                # weight matrix of dim [H x K]
model.add(Activation("softmax"))   # output probabilities
```
Element-wise activation functions

[figure: common activation functions (blue) and their derivatives (green)]
Softmax function

$$\mathrm{softmax}(x) = \frac{1}{\sum_{i=1}^{n} e^{x_i}} \cdot
\begin{bmatrix} e^{x_1} \\ e^{x_2} \\ \vdots \\ e^{x_n} \end{bmatrix}$$

$$\frac{\partial\, \mathrm{softmax}(x)_i}{\partial x_j} =
\begin{cases}
\mathrm{softmax}(x)_i \cdot (1 - \mathrm{softmax}(x)_i) & i = j \\
-\mathrm{softmax}(x)_i \cdot \mathrm{softmax}(x)_j & i \neq j
\end{cases}$$

- vector of values in $(0, 1)$ that add up to 1
- $p(Y = c | X = x) = \mathrm{softmax}(z(x))_c$
- the pre-activation vector $z(x)$ is often called "the logits"
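The two cases of the derivative above have a compact matrix form; a small NumPy sketch (function names are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtracting the max avoids overflow
    return e / e.sum()

def softmax_jacobian(x):
    # J[i, j] = softmax(x)_i * (1_{i=j} - softmax(x)_j),
    # which matches the i = j and i != j cases of the formula above
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)
```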
Training the network

Find parameters $\theta = (W^h; b^h; W^o; b^o)$ that minimize the negative log likelihood (or cross entropy).

The loss function for a given sample $s \in S$:

$$l(f(x^s; \theta), y^s) = nll(x^s, y^s; \theta) = -\log f(x^s; \theta)_{y^s}$$

The cost function is the negative likelihood of the model computed on the full training set (for i.i.d. samples):

$$L_S(\theta) = -\frac{1}{|S|} \sum_{s \in S} \log f(x^s; \theta)_{y^s} + \lambda \Omega(\theta)$$

$\lambda \Omega(\theta) = \lambda (||W^h||^2 + ||W^o||^2)$ is an optional regularization term.
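As a concrete check, a NumPy sketch of the unregularized cost on a toy batch (the probability values are made up for illustration):

```python
import numpy as np

def nll_cost(probs, y):
    # probs[s, c] = f(x^s; theta)_c, y[s] = integer class label of sample s
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

probs = np.array([[0.9, 0.1],   # sample 0: confident and correct
                  [0.3, 0.7]])  # sample 1: less confident, still correct
y = np.array([0, 1])
print(nll_cost(probs, y))       # -(log 0.9 + log 0.7) / 2 ~ 0.231
```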
Stochastic Gradient Descent

Initialize $\theta$ randomly.

For $E$ epochs perform:

- Randomly select a small batch of samples ($B \subset S$)
- Compute gradients: $\Delta = \nabla_\theta L_B(\theta)$
- Update parameters: $\theta \leftarrow \theta - \eta \Delta$, where $\eta > 0$ is called the learning rate
- Repeat until the epoch is completed (all of $S$ is covered)

Stop when reaching a stopping criterion, e.g. when the nll computed on a validation set stops decreasing (a schematic loop is sketched below).
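A schematic NumPy version of this loop (grad and samples are placeholders for a mini-batch gradient function and a NumPy array of training samples):

```python
import numpy as np

def sgd(theta, grad, samples, eta=0.1, n_epochs=10, batch_size=32):
    # grad(theta, batch) is assumed to return nabla_theta L_B(theta)
    for epoch in range(n_epochs):
        perm = np.random.permutation(len(samples))  # reshuffle every epoch
        for start in range(0, len(samples), batch_size):
            batch = samples[perm[start:start + batch_size]]
            theta = theta - eta * grad(theta, batch)  # theta <- theta - eta * Delta
    return theta
```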
Computing Gradients

- output weights: $\partial l(f(x), y) / \partial W^o_{i,j}$
- output bias: $\partial l(f(x), y) / \partial b^o_i$
- hidden weights: $\partial l(f(x), y) / \partial W^h_{i,j}$
- hidden bias: $\partial l(f(x), y) / \partial b^h_i$

The network is a composition of differentiable modules.

We can apply the "chain rule".
Chain rule

[figures: the chain rule unrolled through the network's composition of differentiable modules]
Backpropagation

Compute partial derivatives of the loss:

$$\frac{\partial l}{\partial f(x)_i} = \frac{\partial l(f(x), y)}{\partial f(x)_i} = \frac{\partial (-\log f(x)_y)}{\partial f(x)_i} = \frac{-\mathbf{1}_{y=i}}{f(x)_y}$$

$$\frac{\partial l}{\partial z^o(x)_i} = \,?$$
Chain rule!
$e(y)$: one-hot encoding of $y$
Backpropagation

Gradients:

$$\nabla_{z^o(x)} l = f(x) - e(y)$$

$$\nabla_{b^o} l = f(x) - e(y)$$

because $z^o(x) = W^o h(x) + b^o$ and then $\dfrac{\partial z^o(x)_i}{\partial b^o_j} = \mathbf{1}_{i=j}$
Backpropagation

Partial derivatives related to $W^o$:

$$\frac{\partial l}{\partial W^o_{i,j}} = \sum_k \frac{\partial l}{\partial z^o(x)_k} \cdot \frac{\partial z^o(x)_k}{\partial W^o_{i,j}}$$

$$\nabla_{W^o} l = (f(x) - e(y)) \cdot h(x)^\top$$
Backprop gradients

Compute activation gradients:

$$\nabla_{z^o(x)} l = f(x) - e(y)$$

Compute layer params gradients:

$$\nabla_{W^o} l = \nabla_{z^o(x)} l \cdot h(x)^\top$$

$$\nabla_{b^o} l = \nabla_{z^o(x)} l$$

Compute prev layer activation gradients:

$$\nabla_{h(x)} l = W^{o\top} \nabla_{z^o(x)} l$$

$$\nabla_{z^h(x)} l = \nabla_{h(x)} l \odot \sigma'(z^h(x))$$
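Putting the three steps together, a NumPy sketch of one backward pass (assuming the tanh hidden activation from the Keras example, so $\sigma'(z) = 1 - \tanh^2(z)$; the hidden-layer gradients follow the same pattern as the output layer):

```python
import numpy as np

def backward(x, y, z_h, h, f_x, W_o):
    # e(y): one-hot encoding of the true class
    e_y = np.zeros_like(f_x)
    e_y[y] = 1.0
    grad_z_o = f_x - e_y                         # activation gradient at the output
    grad_W_o = np.outer(grad_z_o, h)             # layer parameter gradients
    grad_b_o = grad_z_o
    grad_h = W_o.T @ grad_z_o                    # previous layer activation gradient
    grad_z_h = grad_h * (1 - np.tanh(z_h) ** 2)  # elementwise product with sigma'(z^h)
    grad_W_h = np.outer(grad_z_h, x)             # same pattern for the hidden layer
    grad_b_h = grad_z_h
    return grad_W_h, grad_b_h, grad_W_o, grad_b_o
```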
Loss, Initialization and Learning Tricks
Discrete output (classification)

Binary classification: $y \in \{0, 1\}$

- $Y|X=x \sim \mathrm{Bernoulli}(b = f(x; \theta))$
- output function: $\mathrm{logistic}(x) = \frac{1}{1 + e^{-x}}$
- loss function: binary cross-entropy (see the sketch after this list)

Multiclass classification: $y \in \{0, \ldots, K-1\}$

- $Y|X=x \sim \mathrm{Multinoulli}(p = f(x; \theta))$
- output function: $\mathrm{softmax}$
- loss function: categorical cross-entropy
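For the binary case, a short NumPy sketch of the output and loss functions (function names are mine):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(p, y):
    # p = f(x; theta) in (0, 1), y in {0, 1}
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```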
Continuous output (regression)

Continuous output: $y \in \mathbb{R}^n$

- $Y|X=x \sim \mathcal{N}(\mu = f(x; \theta), \sigma^2 I)$
- output function: identity
- loss function: square loss
- heteroscedastic if $f(x; \theta)$ predicts both $\mu$ and $\sigma^2$

Mixture Density Network (multimodal output):

- $Y|X=x \sim GMM_x$
- $f(x; \theta)$ predicts all the parameters: the means, covariance matrices and mixture weights
Initialization and normalization

Input data should be normalized to have approximately the same range: standardization or quantile normalization.

Initializing $W^h$ and $W^o$:

- Zero is a saddle point: no gradient, no learning
- Constant init: hidden units collapse by symmetry
- Solution: random init, e.g. $w \sim \mathcal{N}(0, 0.01)$
- Better inits: Xavier Glorot, Kaiming He, and orthogonal (Glorot and He are sketched below)
- Biases can (should) be initialized to zero
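A NumPy sketch of these schemes (the fan sizes are illustrative; Keras exposes the last two as the glorot_uniform and he_normal initializers):

```python
import numpy as np

rng = np.random.default_rng(42)
fan_in, fan_out = 256, 128

# naive small zero-mean Gaussian init, as on this slide
W_naive = rng.normal(0.0, 0.01, size=(fan_out, fan_in))

# Glorot / Xavier uniform: Var(w) = 2 / (fan_in + fan_out)
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot = rng.uniform(-limit, limit, size=(fan_out, fan_in))

# He init (suited to ReLU): Var(w) = 2 / fan_in
W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

b = np.zeros(fan_out)  # biases initialized to zero
```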
SGD learning rate

Very sensitive:

- too high → early plateau or even divergence
- too low → slow convergence

Try a large value first: $\eta = 0.1$ or even $\eta = 1$. Divide by 10 and retry in case of divergence.

A large constant LR prevents final convergence:

- multiply $\eta_t$ by $\beta < 1$ after each update,
- or monitor the validation loss and divide $\eta_t$ by 2 or 10 when it stops making progress.

See ReduceLROnPlateau in Keras (usage sketched below).
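For example (a sketch; model, X_train and y_train are assumed to be defined as in the earlier Keras example):

```python
from keras.callbacks import ReduceLROnPlateau

# divide eta by 10 whenever the validation loss stops improving
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5)
model.fit(X_train, y_train, epochs=50, validation_split=0.1,
          callbacks=[reduce_lr])
```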
Momentum

Accumulate gradients across successive updates:

$$m_t = \gamma m_{t-1} + \eta \nabla_\theta L_{B_t}(\theta_{t-1})$$

$$\theta_t = \theta_{t-1} - m_t$$

$\gamma$ is typically set to 0.9.

Larger updates in directions where the gradient sign is constant, to accelerate in low curvature areas.

Nesterov accelerated gradient:

$$m_t = \gamma m_{t-1} + \eta \nabla_\theta L_{B_t}(\theta_{t-1} - \gamma m_{t-1})$$

$$\theta_t = \theta_{t-1} - m_t$$

Better at handling changes in gradient direction (both variants are sketched below).
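Both variants in a compact sketch (grad is a placeholder for the mini-batch gradient function):

```python
def momentum_step(theta, m, grad, eta=0.1, gamma=0.9, nesterov=False):
    # grad(theta) is assumed to return nabla_theta L_{B_t}(theta)
    lookahead = theta - gamma * m if nesterov else theta
    m = gamma * m + eta * grad(lookahead)  # accumulate the velocity m_t
    theta = theta - m                      # theta_t = theta_{t-1} - m_t
    return theta, m
```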
Why Momentum Really Works
Alternative optimizers

SGD (with Nesterov momentum):

- simple to implement
- very sensitive to the initial value of $\eta$
- needs learning rate scheduling

Adam: adaptive learning rate scale for each parameter:

- global $\eta$ set to 3e-4 often works well enough
- good default choice of optimizer (often)
- but well-tuned SGD with LR scheduling can generalize better than Adam (with naive $L^2$ regularization)...

Promising stochastic second-order methods: K-FAC and Shampoo can be used to accelerate the training of very large models. Both baseline setups are sketched below.
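In Keras, the two baseline choices look like this (a sketch reusing the model defined earlier; in older Keras releases the argument is named lr rather than learning_rate):

```python
from keras.optimizers import SGD, Adam

# well-tuned SGD with Nesterov momentum (pair it with LR scheduling)
model.compile(optimizer=SGD(learning_rate=0.1, momentum=0.9, nesterov=True),
              loss="categorical_crossentropy")

# or Adam with the common default scale
model.compile(optimizer=Adam(learning_rate=3e-4),
              loss="categorical_crossentropy")
```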
The Karpathy Constant for Adam

[figure: the tongue-in-cheek claim that $\eta$ = 3e-4 is the best learning rate for Adam]
Optimizers around a saddle point
Credits: Alec Radford