Lecture 20: Backpropagation and Automatic Differentiation

Tom Rainforth
Hilary 2022
[email protected]
Empirical Risk

We train by minimising the regularised empirical risk¹ over a dataset $\{(x_i, y_i)\}_{i=1}^n$,

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} L_i + \lambda\, r(\theta), \qquad L_i := L(y_i, f_\theta(x_i)),$$

where $f_\theta$ is our network, $r(\theta)$ is a regulariser, and $\lambda \ge 0$ its weight.

¹ The loss might also depend directly on $x_i$ but we are omitting this for brevity. We can also have unsupervised settings where the loss is simply of the form $L(f(x_i))$ or $L(x_i, f(x_i))$, for which all the ideas will equally apply.
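As a concrete (and purely illustrative) rendering of this quantity in code, the sketch below evaluates $\hat{R}(\theta)$ for a generic scalar-output model; the squared loss, L2 regulariser, and helper names are assumptions rather than anything taken from the lecture.

```python
# Illustrative only: evaluating the regularised empirical risk R_hat(theta)
# for a scalar-output model f(theta, x), assuming a squared loss and an
# L2 regulariser r(theta) = sum of squared parameter entries.
import numpy as np

def empirical_risk(theta, xs, ys, f, lam=1e-2):
    losses = [(y - f(theta, x)) ** 2 for x, y in zip(xs, ys)]   # the L_i terms
    r = sum(np.sum(p ** 2) for p in theta)                      # r(theta)
    return np.mean(losses) + lam * r                            # (1/n) sum L_i + lambda r(theta)

# Example usage with a linear model f(theta, x) = w @ x + b
theta = [np.ones(3), np.zeros(1)]
xs, ys = np.random.randn(10, 3), np.random.randn(10)
print(empirical_risk(theta, xs, ys, lambda th, x: th[0] @ x + th[1][0]))
```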
The Chain Rule for Feed–Forward Neural Networks

Writing $h_i^k = f_{\theta^k}^k(h_i^{k-1})$ for the output of layer $k$ of an $m$-layer network on input $x_i$ (so $h_i^0 = x_i$ and $h_i^m = f_\theta(x_i)$, with $d_k$ denoting the dimension of $h_i^k$), the chain rule decomposes the derivative of the loss with respect to the parameters $\theta^\ell$ of layer $\ell$ as

$$\frac{\partial L_i}{\partial \theta^\ell} = \frac{\partial L_i}{\partial h_i^m}\, \frac{\partial h_i^m}{\partial h_i^{m-1}} \cdots \frac{\partial h_i^{\ell+1}}{\partial h_i^\ell}\, \frac{\partial h_i^\ell}{\partial \theta^\ell}. \qquad (1)$$
Individual Terms

Breaking down the terms in $\frac{\partial L_i}{\partial \theta^\ell}$, we have that, given input $x_i$:

- $\frac{\partial L_i}{\partial h_i^m}$ is a row vector with $d_m$ elements representing the derivative of the loss with respect to our set of output units $h_i^m = f_\theta(x_i)$ (note that even if our output is multi-dimensional, our loss is still a scalar)

- Each $\frac{\partial h_i^k}{\partial h_i^{k-1}}$ is a $d_k \times d_{k-1}$ matrix representing the Jacobian of $f_{\theta^k}^k$ with respect to the input $h_i^{k-1}$

- $\frac{\partial h_i^\ell}{\partial \theta^\ell}$ is a $d_\ell \times p_\ell$ matrix (where $p_\ell$ is the number of parameters in $\theta^\ell$) representing the Jacobian of $f_{\theta^\ell}^\ell$ with respect to its parameters $\theta^\ell$
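These shapes are easy to verify numerically. A minimal sketch (not from the slides; the layer sizes and the use of `torch.autograd.functional.jacobian` are illustrative) checking that a tanh layer's input Jacobian is $d_k \times d_{k-1}$:

```python
# Illustrative shape check: the Jacobian of a tanh layer
# h^k = tanh(W^k h^{k-1} + b^k) with respect to its input is d_k x d_{k-1}.
import torch
from torch.autograd.functional import jacobian

d_prev, d_k = 4, 3                       # d_{k-1}, d_k (made-up sizes)
W = torch.randn(d_k, d_prev)             # W^k
b = torch.randn(d_k)                     # b^k

def layer(h_prev):
    return torch.tanh(W @ h_prev + b)    # h^k_i = f^k_{theta^k}(h^{k-1}_i)

h_prev = torch.randn(d_prev)             # stand-in for h^{k-1}_i
J = jacobian(layer, h_prev)              # dh^k_i / dh^{k-1}_i
print(J.shape)                           # torch.Size([3, 4]), i.e. d_k x d_{k-1}
```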
Computation Order

Presuming that the loss, regularizer, and each layer in our network is differentiable (with respect to both its inputs and parameters), we can directly calculate the overall derivative by calculating each such term individually and then combining them as per (1).

However, the order of the computations will massively change the cost: we can either use the breakdown

$$\frac{\partial L_i}{\partial \theta^\ell} = \left(\left(\frac{\partial L_i}{\partial h_i^m}\, \frac{\partial h_i^m}{\partial h_i^{m-1}}\right) \cdots \frac{\partial h_i^{\ell+1}}{\partial h_i^\ell}\right) \frac{\partial h_i^\ell}{\partial \theta^\ell} \qquad (2)$$

or

$$\frac{\partial L_i}{\partial \theta^\ell} = \frac{\partial L_i}{\partial h_i^m} \left(\frac{\partial h_i^m}{\partial h_i^{m-1}} \left(\cdots \left(\frac{\partial h_i^{\ell+1}}{\partial h_i^\ell}\, \frac{\partial h_i^\ell}{\partial \theta^\ell}\right)\right)\right) \qquad (3)$$
Computational Cost

For (2) and (3) we now respectively have the following costs²

$$(2):\; O\!\left(d_m d_{m-1} + d_{m-1} d_{m-2} + \cdots + d_{\ell+1} d_\ell + d_\ell p_\ell\right) = O\!\left(d_\ell p_\ell + \sum_{k=\ell+1}^{m} d_k d_{k-1}\right)$$

$$(3):\; O\!\left(d_{\ell+1} d_\ell p_\ell + d_{\ell+2} d_{\ell+1} p_\ell + \cdots + d_m d_{m-1} p_\ell + d_m p_\ell\right) = O\!\left(d_m p_\ell + \sum_{k=\ell+1}^{m} d_k d_{k-1} p_\ell\right)$$

The latter is roughly $p_\ell$ times more costly; as $p_\ell$ could easily be a million or more parameters, this is a massive difference!

Calculating our derivatives by applying the chain rule in the former order is known as backpropagation in the deep learning literature.

² In many common cases, e.g. a fully connected layer, $\partial h_i^\ell / \partial \theta^\ell$ is sparse and so the cost for (2) can be further reduced to $O\!\left(p_\ell + \sum_{k=\ell+1}^{m} d_k d_{k-1}\right)$.
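The gap is easy to see empirically. The sketch below (illustrative only; all dimensions are made up) evaluates the same chain of random Jacobians in the order of (2) and in the order of (3) and times both:

```python
# Illustrative timing of the two bracketings (dimensions are made up).
import time
import numpy as np

n_layers, d, p = 6, 256, 5_000          # chain length m - l, widths d_k = d, parameter count p_l
dL_dhm = np.random.randn(1, d)          # dL_i/dh^m_i (row vector)
Js = [np.random.randn(d, d) for _ in range(n_layers)]   # the dh^k_i/dh^{k-1}_i Jacobians
dhl_dtheta = np.random.randn(d, p)      # dh^l_i/dtheta^l

t0 = time.perf_counter()                # order (2): vector-Jacobian products, left to right
v = dL_dhm
for J in Js:
    v = v @ J                           # (1 x d)(d x d): O(d^2) each
grad2 = v @ dhl_dtheta                  # (1 x d)(d x p): O(d p)
t2 = time.perf_counter() - t0

t0 = time.perf_counter()                # order (3): matrix-matrix products, right to left
M = dhl_dtheta
for J in reversed(Js):
    M = J @ M                           # (d x d)(d x p): O(d^2 p) each
grad3 = dL_dhm @ M
t3 = time.perf_counter() - t0

print(np.allclose(grad2, grad3))        # same result...
print(f"order (2): {t2:.4f}s  order (3): {t3:.4f}s")   # ...very different cost
```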
Backpropagation
Backpropagation in Sum Notation

Written out elementwise, the backpropagation recursions become: for the hidden units,

$$\frac{\partial L_i}{\partial h_{ij}^\ell} = \sum_{k=1}^{d_{\ell+1}} \frac{\partial L_i}{\partial h_{ik}^{\ell+1}}\, \frac{\partial h_{ik}^{\ell+1}}{\partial h_{ij}^\ell} \qquad (4)$$

and for the parameters,

$$\frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{d_\ell} \frac{\partial L_i}{\partial h_{ij}^\ell}\, \frac{\partial h_{ij}^\ell}{\partial \theta^\ell} + \lambda\, \frac{\partial r(\theta)}{\partial \theta^\ell} \qquad (5)$$
Backpropagation Example

Consider an MLP with squared loss $L(y_i, f(x_i)) = (y_i - f_\theta(x_i))^2$ and tanh activations (note $\frac{d \tanh(a)}{da} = 1 - \tanh^2 a$). Here we have

$$\frac{\partial L_i}{\partial h_i^m} = 2(h_i^m - y_i) \quad \text{(can be a scalar or a row vector)}$$

$$\frac{\partial L_i}{\partial h_{ij}^\ell} = \sum_{k=1}^{d_{\ell+1}} \frac{\partial L_i}{\partial h_{ik}^{\ell+1}}\, \frac{\partial h_{ik}^{\ell+1}}{\partial h_{ij}^\ell} \quad \forall \ell \in \{1, \ldots, m-1\}$$

$$\frac{\partial h_{ik}^{\ell+1}}{\partial h_{ij}^\ell} = \frac{\partial}{\partial h_{ij}^\ell} \tanh\!\left(\sum_{t=1}^{d_\ell} W_{kt}^{\ell+1} h_{it}^\ell + b_k^{\ell+1}\right) = W_{kj}^{\ell+1} \left(1 - \left(h_{ik}^{\ell+1}\right)^2\right)$$

$$\implies \frac{\partial L_i}{\partial h_{ij}^\ell} = \sum_{k=1}^{d_{\ell+1}} \frac{\partial L_i}{\partial h_{ik}^{\ell+1}}\, W_{kj}^{\ell+1} \left(1 - \left(h_{ik}^{\ell+1}\right)^2\right)$$
Backpropagation Example (2)

For the weights and biases of layer $\ell$ we similarly obtain

$$\frac{\partial h_{ij}^\ell}{\partial W_{ts}^\ell} = I(t = j)\, h_{is}^{\ell-1} \left(1 - \left(h_{ij}^\ell\right)^2\right)$$

$$\implies \frac{\partial \hat{R}(\theta)}{\partial W_{ts}^\ell} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L_i}{\partial h_{it}^\ell}\, h_{is}^{\ell-1} \left(1 - \left(h_{it}^\ell\right)^2\right) + \lambda\, \frac{\partial r(\theta)}{\partial W_{ts}^\ell}$$

$$\frac{\partial h_{ij}^\ell}{\partial b_t^\ell} = I(t = j) \left(1 - \left(h_{ij}^\ell\right)^2\right)$$

$$\implies \frac{\partial \hat{R}(\theta)}{\partial b_t^\ell} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial L_i}{\partial h_{it}^\ell} \left(1 - \left(h_{it}^\ell\right)^2\right) + \lambda\, \frac{\partial r(\theta)}{\partial b_t^\ell}$$
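These expressions translate directly into code. The following NumPy sketch (illustrative; the layer sizes are made up, a single data point is used, and the regulariser is taken to be $r(\theta) = 0$) runs a forward pass through a tanh MLP and then backpropagates exactly the quantities above, checking one weight gradient against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tanh MLP: h^l = tanh(W^l h^{l-1} + b^l), with h^0 = x.  Sizes are illustrative.
dims = [5, 4, 3, 1]                                   # d_0, d_1, d_2, d_3 (= d_m)
Ws = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
bs = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]

def forward(x):
    hs = [x]                                          # hs[l] = h^l_i
    for W, b in zip(Ws, bs):
        hs.append(np.tanh(W @ hs[-1] + b))
    return hs

def backprop(x, y):
    """Gradients of L_i = (y - h^m)^2 w.r.t. every W^l and b^l (no regulariser)."""
    hs = forward(x)
    dL_dh = 2.0 * (hs[-1] - y)                        # dL_i/dh^m_i
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):                # walk back through the layers
        local = dL_dh * (1.0 - hs[l + 1] ** 2)        # dL_i/dh^{l+1} times tanh'
        grads_W.append(np.outer(local, hs[l]))        # dL_i/dW^{l+1}_{ts} = local_t * h^l_s
        grads_b.append(local)                         # dL_i/db^{l+1}_t   = local_t
        dL_dh = Ws[l].T @ local                       # eq (4): propagate to layer l
    return list(reversed(grads_W)), list(reversed(grads_b))

# Finite-difference check of one weight entry
x, y = rng.standard_normal(dims[0]), np.array([0.7])
gW, gb = backprop(x, y)
eps = 1e-6
Ws[0][0, 0] += eps
loss_plus = np.sum((y - forward(x)[-1]) ** 2)
Ws[0][0, 0] -= 2 * eps
loss_minus = np.sum((y - forward(x)[-1]) ** 2)
Ws[0][0, 0] += eps
print(gW[0][0, 0], (loss_plus - loss_minus) / (2 * eps))   # should agree closely
```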
Automatic Differentiation (AutoDiff)
Reverse Mode AutoDiff Performs Backpropagation
Reverse Mode AutoDiff Example

Backpropagation is Reverse Mode AutoDiff.

[Figure: worked example on a computation graph. Presume we want to calculate derivatives for y1; forward function calculations are shown in blue.]
Creating an AutoDiff System
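As a rough sketch of the key ingredients such a system needs (illustrative only; the `Var` class and its operator set are made up rather than being the design from the lecture): each primitive operation records its parent nodes and local derivatives during the forward pass, and a backward sweep then accumulates adjoints in reverse topological order.

```python
# A minimal scalar reverse-mode AutoDiff sketch; the Var class and its
# operator set are illustrative only.  Each operation records its parent
# nodes and the local partial derivatives; backward() then sweeps the
# recorded graph in reverse, accumulating the adjoint of each node into .grad.
import math

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents           # pairs (parent_node, local_derivative)
        self.grad = 0.0                  # adjoint, filled in by backward()

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, [(self, 1.0 - t * t)])

    def backward(self):
        order, seen = [], set()          # build a reverse topological order
        def visit(v):
            if v not in seen:
                seen.add(v)
                for parent, _ in v.parents:
                    visit(parent)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, local in v.parents:
                parent.grad += v.grad * local   # chain rule: accumulate into each parent

# Usage: derivatives of y = tanh(x * w + b)
x, w, b = Var(0.5), Var(2.0), Var(-1.0)
y = (x * w + b).tanh()
y.backward()
print(y.value, x.grad, w.grad)           # x.grad = w (1 - y^2), w.grad = x (1 - y^2)
```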
Creating an AutoDiff System (2)
AutoDiff in PyTorch
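A minimal illustration of reverse-mode AutoDiff in PyTorch (the model and data below are made up): we build the computation out of tensors that track gradients, call `.backward()` once on the scalar loss, and read the gradients from the parameters' `.grad` fields.

```python
# Illustrative only: PyTorch autograd performing backpropagation for us.
import torch

torch.manual_seed(0)

model = torch.nn.Sequential(             # a small tanh MLP, mirroring the earlier example
    torch.nn.Linear(5, 4), torch.nn.Tanh(),
    torch.nn.Linear(4, 1), torch.nn.Tanh(),
)
x = torch.randn(10, 5)                   # a batch of 10 inputs
y = torch.randn(10, 1)                   # targets

loss = ((y - model(x)) ** 2).mean()      # empirical risk with squared loss (no regulariser)
loss.backward()                          # one reverse pass populates all parameter gradients

for name, p in model.named_parameters():
    print(name, p.grad.shape)            # dR_hat/dW^l and dR_hat/db^l for every layer
```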
Recap