
Chapter 8, Part 4: Backpropagation and Automatic Differentiation

Advanced Topics in Statistical Machine Learning

Tom Rainforth
Hilary 2022
[email protected]
Backpropagation and Automatic Differentiation

Being able to differentiate the empirical risk is key to deep learning, as it allows us to optimize the parameters using gradient methods.

In this lecture we will look at how we can actually calculate these derivatives in practice. In particular, we will look at:

- Backpropagation: a particular way of applying the chain rule that minimizes the cost of calculating the derivatives
- Automatic differentiation: a programming-languages tool that allows us to perform this backpropagation automatically and forms the basis for deep learning packages like PyTorch and TensorFlow
Empirical Risk

For a network $f_\theta$ with parameters $\theta$, loss function $L$, and regularizer $r$, our regularized empirical risk is¹

$$\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f_\theta(x_i)) + \lambda r(\theta)$$

Using the shorthand $L_i = L(y_i, f_\theta(x_i))$, the derivative is thus

$$\nabla_\theta \hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta L_i + \lambda \nabla_\theta r(\theta)$$

Presuming suitable choices for $r$, $\nabla_\theta r(\theta)$ can always be calculated straightforwardly, so the key term we need to calculate is $\nabla_\theta L_i$.

¹ The loss might also depend directly on $x_i$, but we are omitting this for brevity. We can also have unsupervised settings where the loss is simply of the form $L(f(x_i))$ or $L(x_i, f(x_i))$, for which all the ideas will equally apply.
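To make the definition concrete, here is a minimal NumPy sketch (our own construction: squared loss and an L2 regularizer are illustrative choices, not fixed by the slides):

```python
import numpy as np

# A sketch of the regularized empirical risk
# R(theta) = (1/n) sum_i L(y_i, f_theta(x_i)) + lambda * r(theta),
# with squared loss and r(theta) = sum of squared parameters.
def empirical_risk(f, theta, xs, ys, lam):
    losses = [(y - f(x, theta)) ** 2 for x, y in zip(xs, ys)]  # the L_i terms
    return np.mean(losses) + lam * sum(np.sum(t ** 2) for t in theta)
```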
The Chain Rule for Feed–Forward Neural Networks

An arbitrary network without loops, skip connections, or parameter sharing between layers (e.g. MLP, basic CNN) can be expressed as

$$h^0 = x, \qquad h^\ell = f^\ell_{\theta^\ell}\!\left(h^{\ell-1}\right) \;\; \forall \ell \in \{1, \dots, m\}, \qquad f_\theta(x) = h^m$$

We will further introduce the notation $h^\ell_i \in \mathbb{R}^{d_\ell}$ to denote the vector of hidden unit values $h^\ell$ when the network is given input $x_i$.

Noting the Markovian dependencies in the layers and applying the chain rule yields the series of vector and matrix products

$$\frac{\partial L_i}{\partial \theta^\ell} = \frac{\partial L_i}{\partial h^m_i} \frac{\partial h^m_i}{\partial h^{m-1}_i} \cdots \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell}$$

and thus

$$\frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^m_i} \frac{\partial h^m_i}{\partial h^{m-1}_i} \cdots \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell} \tag{1}$$
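In code, this layered structure is just repeated function composition; a minimal sketch (the tanh layers and parameter layout are illustrative assumptions):

```python
import numpy as np

# A sketch of the feed-forward recursion h^0 = x, h^l = f^l_{theta^l}(h^{l-1}).
def forward(x, params):
    h = x                            # h^0 = x
    hs = [h]                         # cache h^0, ..., h^m for backpropagation later
    for W, b in params:              # layer l has parameters theta^l = (W^l, b^l)
        h = np.tanh(W @ h + b)       # h^l = f^l_{theta^l}(h^{l-1})
        hs.append(h)
    return hs                        # hs[-1] = h^m = f_theta(x)
```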
Individual Terms

Breaking down the terms in $\frac{\partial L_i}{\partial \theta^\ell}$, we have that, given input $x_i$:

- $\frac{\partial L_i}{\partial h^m_i}$ is a row vector with $d_m$ elements representing the derivative of the loss with respect to our set of output units $h^m_i = f_\theta(x_i)$ (note that even if our output is multi-dimensional, our loss is still a scalar)
- Each $\frac{\partial h^k_i}{\partial h^{k-1}_i}$ is a $d_k \times d_{k-1}$ matrix representing the Jacobian of $f^k_{\theta^k}$ with respect to the input $h^{k-1}_i$
- $\frac{\partial h^\ell_i}{\partial \theta^\ell}$ is a $d_\ell \times p_\ell$ matrix (where $p_\ell$ is the number of parameters in $\theta^\ell$) representing the Jacobian of $f^\ell_{\theta^\ell}$ with respect to its parameters $\theta^\ell$ (a quick shape check follows below)
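These shapes are easy to verify numerically; a small sketch (illustrative sizes) using PyTorch's Jacobian utility on a single tanh layer:

```python
import torch
from torch.autograd.functional import jacobian

# Check the Jacobian shapes for one layer h^k = tanh(W h^{k-1} + b).
d_prev, d = 4, 3
W, b = torch.randn(d, d_prev), torch.randn(d)
h_prev = torch.randn(d_prev)

J_input = jacobian(lambda h: torch.tanh(W @ h + b), h_prev)
print(J_input.shape)   # torch.Size([3, 4]): a d_k x d_{k-1} matrix

J_params = jacobian(lambda W_: torch.tanh(W_ @ h_prev + b), W)
print(J_params.shape)  # torch.Size([3, 3, 4]): d_k outputs by the entries of W
```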
Computation Order

Presuming that the loss, regularizer, and each layer in our network are differentiable (with respect to both their inputs and parameters), we can directly calculate the overall derivative by calculating each such term individually and then combining them as per (1).

However, the order of the computations will massively change the cost: we can either use the breakdown

$$\frac{\partial L_i}{\partial \theta^\ell} = \left(\!\left(\!\left(\frac{\partial L_i}{\partial h^m_i} \frac{\partial h^m_i}{\partial h^{m-1}_i}\right) \cdots \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i}\right) \frac{\partial h^\ell_i}{\partial \theta^\ell}\right) \tag{2}$$

or

$$\frac{\partial L_i}{\partial \theta^\ell} = \frac{\partial L_i}{\partial h^m_i} \left(\frac{\partial h^m_i}{\partial h^{m-1}_i} \cdots \left(\frac{\partial h^{\ell+1}_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell}\right)\!\right) \tag{3}$$
Computational Cost

For (2) and (3) we now respectively have the following costs²

$$(2): \;\; O\left(d_m d_{m-1} + d_{m-1} d_{m-2} + \cdots + d_{\ell+1} d_\ell + d_\ell p_\ell\right) = O\left(d_\ell p_\ell + \sum_{k=\ell+1}^{m} d_k d_{k-1}\right)$$

$$(3): \;\; O\left(d_{\ell+1} d_\ell p_\ell + d_{\ell+2} d_{\ell+1} p_\ell + \cdots + d_m d_{m-1} p_\ell + d_m p_\ell\right) = O\left(d_m p_\ell + \sum_{k=\ell+1}^{m} d_k d_{k-1} p_\ell\right)$$

The latter is roughly $p_\ell$ times more costly; as $p_\ell$ could easily be a million or more parameters, this is a massive difference!

Calculating our derivatives by applying the chain rule in the former order is known as backpropagation in the deep learning literature.

² In many common cases, e.g. a fully connected layer, $\partial h^\ell_i / \partial \theta^\ell$ is sparse and so the cost for (2) can be further reduced to $O\left(p_\ell + \sum_{k=\ell+1}^{m} d_k d_{k-1}\right)$.
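The two orderings compute the same answer at very different cost; a quick NumPy sketch (arbitrary illustrative dimensions) makes the contrast concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 256, 10_000                           # illustrative layer width and parameter count
v = rng.random((1, d))                       # dL/dh^m: a row vector
Js = [rng.random((d, d)) for _ in range(3)]  # the dh^k/dh^{k-1} Jacobians
Jp = rng.random((d, p))                      # dh^l/dtheta^l

# Order (2): vector-matrix products, O(d*d) per step plus O(d*p) once.
g = v
for J in Js:
    g = g @ J                                # (1,d) @ (d,d) -> (1,d)
grad2 = g @ Jp                               # (1,d) @ (d,p) -> (1,p)

# Order (3): matrix-matrix products, O(d*d*p) per step -- roughly p times more work.
M = Jp
for J in reversed(Js):
    M = J @ M                                # (d,d) @ (d,p) -> (d,p)
grad3 = v @ M

print(np.allclose(grad2, grad3))             # True: same result, wildly different cost
```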
Backpropagation

The name backpropagation derives from the fact that we calculate our derivatives in a backwards fashion through the network:

- We first calculate the derivative of the loss with respect to the last layer, $\frac{\partial L_i}{\partial h^m_i} \left(= \frac{\partial L_i}{\partial f(x_i)}\right)$
- We then go backwards through the network and recursively calculate the vector–matrix products
$$\frac{\partial L_i}{\partial h^\ell_i} = \frac{\partial L_i}{\partial h^{\ell+1}_i} \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i} \quad \forall \ell = m-1, m-2, \dots, 1$$
- From these we can calculate the empirical risk derivatives for each set of parameters (a code sketch follows below) as
$$\frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell}$$
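Continuing the forward() sketch from earlier, here is a minimal backward recursion for the same tanh network (names and shapes are our own illustrative choices):

```python
import numpy as np

# Backward pass for h^l = tanh(W^l h^{l-1} + b^l), given cached hs from forward().
def backward(hs, params, dL_dhm):
    """hs: cached h^0..h^m; dL_dhm: dL/dh^m as a vector."""
    grads = []
    delta = dL_dhm                       # dL/dh^l, starting at l = m
    for l in range(len(params), 0, -1):
        W, b = params[l - 1]
        h, h_prev = hs[l], hs[l - 1]
        pre = delta * (1 - h ** 2)       # fold in tanh'(a) = 1 - tanh(a)^2
        grads.append((np.outer(pre, h_prev), pre))  # (dL/dW^l, dL/db^l)
        delta = pre @ W                  # dL/dh^{l-1}: a vector-matrix product
    return grads[::-1]                   # gradients ordered from layer 1 to layer m
```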
Backpropagation in Sum Notation

When performing backpropagation manually, it can sometimes be helpful to express these rules as summations rather than vector–matrix products. Namely, using the shorthand $h^\ell_{ij} = \left(h^\ell_i\right)_j$, we have

$$\frac{\partial L_i}{\partial h^\ell_{ij}} = \sum_{k=1}^{d_{\ell+1}} \frac{\partial L_i}{\partial h^{\ell+1}_{ik}} \frac{\partial h^{\ell+1}_{ik}}{\partial h^\ell_{ij}} \tag{4}$$

$$\frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d_\ell} \frac{\partial L_i}{\partial h^\ell_{ij}} \frac{\partial h^\ell_{ij}}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell} \tag{5}$$

Note that backpropagation is a recursive algorithm: we do not algebraically roll out the equation, but instead step backwards and separately calculate the values of the gradient at each layer from those at the next, for each $\{x_i, y_i\}$.
Backpropagation Example

Consider an MLP with squared loss $L(y_i, f(x_i)) = (y_i - f_\theta(x_i))^2$ and tanh activations (note $\frac{d \tanh(a)}{da} = 1 - \tanh^2 a$). Here we have

$$\frac{\partial L_i}{\partial h^m_i} = 2(h^m_i - y_i) \quad \text{(can be a scalar or a row vector)}$$

$$\frac{\partial L_i}{\partial h^\ell_{ij}} = \sum_{k=1}^{d_{\ell+1}} \frac{\partial L_i}{\partial h^{\ell+1}_{ik}} \frac{\partial h^{\ell+1}_{ik}}{\partial h^\ell_{ij}} \quad \forall \ell \in \{1, \dots, m-1\}$$

$$\frac{\partial h^{\ell+1}_{ik}}{\partial h^\ell_{ij}} = \frac{\partial}{\partial h^\ell_{ij}} \tanh\left(\sum_{t=1}^{d_\ell} W^{\ell+1}_{kt} h^\ell_{it} + b^{\ell+1}_k\right) = W^{\ell+1}_{kj}\left(1 - \left(h^{\ell+1}_{ik}\right)^2\right)$$

$$\implies \frac{\partial L_i}{\partial h^\ell_{ij}} = \sum_{k=1}^{d_{\ell+1}} \frac{\partial L_i}{\partial h^{\ell+1}_{ik}} W^{\ell+1}_{kj}\left(1 - \left(h^{\ell+1}_{ik}\right)^2\right)$$
Backpropagation Example (2)

These can be recursively calculated and we further have

$$\frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d_\ell} \frac{\partial L_i}{\partial h^\ell_{ij}} \frac{\partial h^\ell_{ij}}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell}$$

$$\frac{\partial h^\ell_{ij}}{\partial W^\ell_{ts}} = \mathbb{I}(t = j)\, h^{\ell-1}_{is}\left(1 - \left(h^\ell_{ij}\right)^2\right)$$

$$\implies \frac{\partial \hat{R}(\theta)}{\partial W^\ell_{ts}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^\ell_{it}}\, h^{\ell-1}_{is}\left(1 - \left(h^\ell_{it}\right)^2\right) + \lambda \frac{\partial r(\theta)}{\partial W^\ell_{ts}}$$

$$\frac{\partial h^\ell_{ij}}{\partial b^\ell_t} = \mathbb{I}(t = j)\left(1 - \left(h^\ell_{ij}\right)^2\right)$$

$$\implies \frac{\partial \hat{R}(\theta)}{\partial b^\ell_t} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^\ell_{it}}\left(1 - \left(h^\ell_{it}\right)^2\right) + \lambda \frac{\partial r(\theta)}{\partial b^\ell_t}$$
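These formulas are exactly what the backward() sketch above implements; a quick finite-difference check (our own construction, on a tiny illustrative two-layer network) confirms them:

```python
import numpy as np

# Sanity-check the derived gradients with a central finite difference,
# using the forward()/backward() sketches from earlier.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), rng.normal(size=3)),
          (rng.normal(size=(1, 3)), rng.normal(size=1))]
x, y = rng.normal(size=2), 0.5

hs = forward(x, params)
grads = backward(hs, params, dL_dhm=2 * (hs[-1] - y))  # dL/dh^m = 2(h^m - y)

def loss_with(w00):  # the loss as a function of the single entry W^1_{00}
    (W1, b1), (W2, b2) = params
    W1 = W1.copy(); W1[0, 0] = w00
    h = np.tanh(W2 @ np.tanh(W1 @ x + b1) + b2)
    return float((y - h) ** 2)

eps, w00 = 1e-6, params[0][0][0, 0]
fd = (loss_with(w00 + eps) - loss_with(w00 - eps)) / (2 * eps)
print(grads[0][0][0, 0], fd)   # the two should agree to several decimal places
```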
Automatic Differentiation (AutoDiff)

- Many modern systems can calculate derivatives for you automatically, even for large complex programs, using a method called automatic differentiation (AutoDiff)
- Most popular frameworks: PyTorch and TensorFlow
- In practice you never need to manually calculate network derivatives (except potentially in exams)!
- AutoDiff is exact (i.e. it is not a numerical approximation), but it is not symbolic either
- It has two forms: forward mode and reverse mode
- The latter is cheap and scalable for deep learning: calculating derivatives only adds a small constant factor to the runtime compared to simply evaluating the program itself
- It still works for networks that do not satisfy our earlier feed–forward assumptions (e.g. RNNs, ResNets)
Reverse Mode AutoDiff Performs Backpropagation

Consider an arbitrary node $u$ in a computation graph and assume that its child nodes are $v_{1:n_c}$.

By the chain rule, the gradient of some arbitrary downstream node in the computation graph, $z$, with respect to $u$ is given by

$$\frac{\partial z}{\partial u} = \sum_{j=1}^{n_c} \frac{\partial z}{\partial v_j} \frac{\partial v_j}{\partial u}$$

Here $\frac{\partial v_j}{\partial u}$ is a local computation, so if all the $\frac{\partial z}{\partial v_j}$ are known, the above allows us to calculate $\frac{\partial z}{\partial u}$ using only the current value of $u$ and its relationship with its children.

If all parent–child derivatives are known, we can calculate the partial derivatives of $z$ with respect to all nodes by recursively calculating the derivatives of parent nodes from those of their child nodes.
Reverse Mode AutoDiff Example

Backpropagation is Reverse Mode AutoDiff

[Figure (not reproduced): a computation graph worked example, built up over several slides. Presume we want to calculate derivatives for $y_1$; forward function calculations are shown in blue and reverse derivative calculations in red. Slide credit: Güneş Baydin.]
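Since the figure is lost here, a small stand-in example (our own choice of function, $y = x_1 x_2 + \sin(x_1)$, not necessarily the slide's) shows the same forward/reverse split by hand:

```python
import math

x1, x2 = 2.0, 3.0

# Forward function calculations (the "blue" pass):
v1 = x1 * x2           # product node
v2 = math.sin(x1)      # sin node
y = v1 + v2

# Reverse derivative calculations (the "red" pass), seeded with dy/dy = 1:
dy_dv1 = 1.0           # y = v1 + v2
dy_dv2 = 1.0
dy_dx1 = dy_dv1 * x2 + dy_dv2 * math.cos(x1)  # x1 has two children: v1 and v2
dy_dx2 = dy_dv1 * x1

print(dy_dx1, dy_dx2)  # x2 + cos(x1) and x1, as the chain rule dictates
```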
Creating an AutoDiff System

Consider an arbitrary primitive operation pr that takes in inputs $u_1, \dots, u_{n_u}$ and returns outputs $v_1, \dots, v_{n_v}$. Assume for simplicity that each $u_j, v_j \in \mathbb{R}$.

To construct a reverse mode AutoDiff system, each such pr must have two associated methods:

- The forward calculation $(v_1, \dots, v_{n_v}) = \text{pr}(u_1, \dots, u_{n_u})$
- A vector–Jacobian product calculator, vjp, that takes an additional input $\Delta$ of the same size as $v$, such that
$$\text{pr.vjp}(u, \Delta) = \Delta^T \frac{\partial v}{\partial u} = \left(\sum_{j=1}^{n_v} \Delta_j \frac{\partial v_j}{\partial u_1}, \dots, \sum_{j=1}^{n_v} \Delta_j \frac{\partial v_j}{\partial u_{n_u}}\right)$$
where $\frac{\partial v}{\partial u}$ is a Jacobian of pr whose $(i,j)$th entry is $\frac{\partial v_i}{\partial u_j}$ and we are implicitly evaluating these at the provided input values
Creating an AutoDiff System (2)

Example: product operator primitive (with $u \in \mathbb{R}^2$, $v \in \mathbb{R}$)

$$\text{prod}(u_1, u_2) = u_1 u_2 \qquad \text{prod.vjp}(u, \Delta) = [u_2 \Delta, \; u_1 \Delta]$$

If our language only allows such primitives, we can perform reverse mode AutoDiff on arbitrary computation graphs: if $\Delta$ is the set of required derivatives for the output nodes, $\text{pr.vjp}(u, \Delta)$ produces the set of required derivatives for the input nodes.

We can also implicitly define a vjp operator for compound operations by running AutoDiff on the low-level graph, e.g.

$$h^m_i = f^m_{\theta^m}\left(h^{m-1}_i\right) \qquad f^m_{\theta^m}.\text{vjp}\left(h^{m-1}_i, \left(\frac{\partial L_i}{\partial h^m_i}\right)^{\!T}\right) = \frac{\partial L_i}{\partial h^m_i} \frac{\partial h^m_i}{\partial h^{m-1}_i}$$

Most systems further allow arbitrary tensor sizes for $u$ and $v$, e.g. so that we can batch computation across multiple different $i$.
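Putting these pieces together, here is a minimal sketch of a vjp-based reverse mode AutoDiff system (a toy design of our own; real systems such as PyTorch are far more elaborate):

```python
import math

tape = []  # records (inputs, output, vjp) in execution order

class Var:
    def __init__(self, value):
        self.value, self.grad = value, 0.0

def primitive(fwd, vjp):
    """Wrap a forward calculation and its vector-Jacobian product."""
    def apply(*inputs):
        out = Var(fwd(*[u.value for u in inputs]))
        tape.append((inputs, out, vjp))
        return out
    return apply

prod = primitive(lambda u1, u2: u1 * u2,
                 lambda u, delta: [u[1] * delta, u[0] * delta])
add  = primitive(lambda u1, u2: u1 + u2,
                 lambda u, delta: [delta, delta])
sin  = primitive(lambda u1: math.sin(u1),
                 lambda u, delta: [math.cos(u[0]) * delta])

def reverse_pass(output):
    output.grad = 1.0                        # seed dz/dz = 1
    for inputs, out, vjp in reversed(tape):  # walk the graph in reverse
        for u, g in zip(inputs, vjp([u.value for u in inputs], out.grad)):
            u.grad += g                      # accumulate over all children

# Usage: differentiate y = x1*x2 + sin(x1) from the earlier example.
x1, x2 = Var(2.0), Var(3.0)
reverse_pass(add(prod(x1, x2), sin(x1)))
print(x1.grad, x2.grad)                      # x2 + cos(x1), x1
```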
AutoDiff in PyTorch

In PyTorch, AutoDiff allows us to calculate gradients by calling .backward() on the term we wish to calculate gradients for.

[Code figure not reproduced. Credit: Güneş Baydin]

In the context of deep learning, given a forward model, we can introduce a variable loss corresponding to $L(y_i, f_\theta(x_i))$ (typically as a vector over multiple $i$) and then simply call loss.backward() to calculate the gradients of all our network parameters.
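A minimal sketch of this workflow (the architecture and sizes are illustrative choices, not the slide's original example):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 3), nn.Tanh(), nn.Linear(3, 1))
xs, ys = torch.randn(8, 2), torch.randn(8, 1)   # a batch over multiple i

loss = ((ys - net(xs)) ** 2).mean()   # (1/n) sum_i L(y_i, f_theta(x_i))
loss.backward()                       # reverse mode AutoDiff through the graph

# Every parameter now holds its gradient in .grad:
for name, p in net.named_parameters():
    print(name, p.grad.shape)
```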
Recap

- The order we do computations in is critically important when using the chain rule
- Backpropagation allows us to efficiently calculate all the required gradients for a neural network using the recursive equations (for feed–forward networks)
$$\frac{\partial L_i}{\partial h^\ell_i} = \frac{\partial L_i}{\partial h^{\ell+1}_i} \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i}, \qquad \frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell}$$
- In practice, we do not need to calculate these manually, as AutoDiff systems allow the gradient calculations to be performed automatically
Further Reading

- Helpful videos by 3Blue1Brown: https://youtu.be/Ilg3gGewQ5U and https://youtu.be/tIeHLnjs5U8
- Lecture 4 from the Stanford course (https://youtu.be/d14TUNcbn1k)
- Güneş Baydin's slides on automatic differentiation (https://www.cs.ox.ac.uk/teaching/courses/2019-2020/advml/) and tutorial paper (https://www.jmlr.org/papers/volume18/17-468/17-468.pdf)
- Have a play with PyTorch and/or TensorFlow!
