
Chapter 8, Part 4: Backpropagation and Automatic Differentiation

Advanced Topics in Statistical Machine Learning

Tom Rainforth
Hilary 2022
[email protected]
Backpropagation and Automatic Differentiation

Being able to differentiate the empirical risk is key to deep learning, as it allows us to optimize the parameters using gradient methods.

In this lecture we will look at how we can actually calculate these derivatives in practice. In particular, we will look at:

- Backpropagation: a particular way of applying the chain rule that minimizes the cost of calculating the derivatives
- Automatic differentiation: a programming-languages tool that allows us to perform this backpropagation automatically and forms the basis for deep learning packages like PyTorch and TensorFlow
Empirical Risk

For a network $f_\theta$ with parameters $\theta$, loss function $L$, and regularizer $r$, our regularized empirical risk is¹

$$\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f_\theta(x_i)) + \lambda r(\theta)$$

Using the shorthand $L_i = L(y_i, f_\theta(x_i))$, the derivative is thus

$$\nabla_\theta \hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta L_i + \lambda \nabla_\theta r(\theta)$$

Presuming suitable choices for $r$, $\nabla_\theta r(\theta)$ can always be calculated straightforwardly, so the key term we need to calculate is $\nabla_\theta L_i$.

¹ The loss might also depend directly on $x_i$, but we are omitting this for brevity. We can also have unsupervised settings where the loss is simply of the form $L(f(x_i))$ or $L(x_i, f(x_i))$, for which all the ideas will equally apply.
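To make the definition concrete, here is a minimal NumPy sketch (our own construction: squared loss and an L2 regularizer are illustrative choices, not fixed by the slides):

```python
import numpy as np

# A sketch of the regularized empirical risk
# R(theta) = (1/n) sum_i L(y_i, f_theta(x_i)) + lambda * r(theta),
# with squared loss and r(theta) = sum of squared parameters.
def empirical_risk(f, theta, xs, ys, lam):
    losses = [(y - f(x, theta)) ** 2 for x, y in zip(xs, ys)]  # the L_i terms
    return np.mean(losses) + lam * sum(np.sum(t ** 2) for t in theta)
```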
The Chain Rule for Feed–Forward Neural Networks

An arbitrary network without loops, skip connections, or parameter sharing between layers (e.g. MLP, basic CNN) can be expressed as

$$h^0 = x, \qquad h^\ell = f^\ell_{\theta^\ell}\!\left(h^{\ell-1}\right) \;\; \forall \ell \in \{1, \dots, m\}, \qquad f_\theta(x) = h^m$$

We will further introduce the notation $h^\ell_i \in \mathbb{R}^{d_\ell}$ to denote the vector of hidden unit values $h^\ell$ when the network is given input $x_i$.

Noting the Markovian dependencies in the layers and applying the chain rule yields the series of vector and matrix products

$$\frac{\partial L_i}{\partial \theta^\ell} = \frac{\partial L_i}{\partial h^m_i} \frac{\partial h^m_i}{\partial h^{m-1}_i} \cdots \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell}$$

and thus

$$\frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^m_i} \frac{\partial h^m_i}{\partial h^{m-1}_i} \cdots \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell} \tag{1}$$
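In code, this layered structure is just repeated function composition; a minimal sketch (the tanh layers and parameter layout are illustrative assumptions):

```python
import numpy as np

# A sketch of the feed-forward recursion h^0 = x, h^l = f^l_{theta^l}(h^{l-1}).
def forward(x, params):
    h = x                            # h^0 = x
    hs = [h]                         # cache h^0, ..., h^m for backpropagation later
    for W, b in params:              # layer l has parameters theta^l = (W^l, b^l)
        h = np.tanh(W @ h + b)       # h^l = f^l_{theta^l}(h^{l-1})
        hs.append(h)
    return hs                        # hs[-1] = h^m = f_theta(x)
```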
Individual Terms

Breaking down the terms in $\frac{\partial L_i}{\partial \theta^\ell}$, we have that, given input $x_i$:

- $\frac{\partial L_i}{\partial h^m_i}$ is a row vector with $d_m$ elements representing the derivative of the loss with respect to our set of output units $h^m_i = f_\theta(x_i)$ (note that even if our output is multi-dimensional, our loss is still a scalar)
- Each $\frac{\partial h^k_i}{\partial h^{k-1}_i}$ is a $d_k \times d_{k-1}$ matrix representing the Jacobian of $f^k_{\theta^k}$ with respect to the input $h^{k-1}_i$
- $\frac{\partial h^\ell_i}{\partial \theta^\ell}$ is a $d_\ell \times p_\ell$ matrix (where $p_\ell$ is the number of parameters in $\theta^\ell$) representing the Jacobian of $f^\ell_{\theta^\ell}$ with respect to its parameters $\theta^\ell$ (a quick shape check follows below)
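These shapes are easy to verify numerically; a small sketch (illustrative sizes) using PyTorch's Jacobian utility on a single tanh layer:

```python
import torch
from torch.autograd.functional import jacobian

# Check the Jacobian shapes for one layer h^k = tanh(W h^{k-1} + b).
d_prev, d = 4, 3
W, b = torch.randn(d, d_prev), torch.randn(d)
h_prev = torch.randn(d_prev)

J_input = jacobian(lambda h: torch.tanh(W @ h + b), h_prev)
print(J_input.shape)   # torch.Size([3, 4]): a d_k x d_{k-1} matrix

J_params = jacobian(lambda W_: torch.tanh(W_ @ h_prev + b), W)
print(J_params.shape)  # torch.Size([3, 3, 4]): d_k outputs by the entries of W
```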
Computation Order

Presuming that the loss, regularizer, and each layer in our network are differentiable (with respect to both their inputs and parameters), we can directly calculate the overall derivative by calculating each such term individually and then combining them as per (1).

However, the order of the computations will massively change the cost: we can either use the breakdown

$$\frac{\partial L_i}{\partial \theta^\ell} = \left(\!\left(\!\left(\frac{\partial L_i}{\partial h^m_i} \frac{\partial h^m_i}{\partial h^{m-1}_i}\right) \cdots \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i}\right) \frac{\partial h^\ell_i}{\partial \theta^\ell}\right) \tag{2}$$

or

$$\frac{\partial L_i}{\partial \theta^\ell} = \frac{\partial L_i}{\partial h^m_i} \left(\frac{\partial h^m_i}{\partial h^{m-1}_i} \cdots \left(\frac{\partial h^{\ell+1}_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell}\right)\!\right) \tag{3}$$
Computational Cost

For (2) and (3) we now respectively have the following costs²

$$(2): \;\; O\left(d_m d_{m-1} + d_{m-1} d_{m-2} + \cdots + d_{\ell+1} d_\ell + d_\ell p_\ell\right) = O\left(d_\ell p_\ell + \sum_{k=\ell+1}^{m} d_k d_{k-1}\right)$$

$$(3): \;\; O\left(d_{\ell+1} d_\ell p_\ell + d_{\ell+2} d_{\ell+1} p_\ell + \cdots + d_m d_{m-1} p_\ell + d_m p_\ell\right) = O\left(d_m p_\ell + \sum_{k=\ell+1}^{m} d_k d_{k-1} p_\ell\right)$$

The latter is roughly $p_\ell$ times more costly; as $p_\ell$ could easily be a million or more parameters, this is a massive difference!

Calculating our derivatives by applying the chain rule in the former order is known as backpropagation in the deep learning literature.

² In many common cases, e.g. a fully connected layer, $\partial h^\ell_i / \partial \theta^\ell$ is sparse and so the cost for (2) can be further reduced to $O\left(p_\ell + \sum_{k=\ell+1}^{m} d_k d_{k-1}\right)$.
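The two orderings compute the same answer at very different cost; a quick NumPy sketch (arbitrary illustrative dimensions) makes the contrast concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 256, 10_000                           # illustrative layer width and parameter count
v = rng.random((1, d))                       # dL/dh^m: a row vector
Js = [rng.random((d, d)) for _ in range(3)]  # the dh^k/dh^{k-1} Jacobians
Jp = rng.random((d, p))                      # dh^l/dtheta^l

# Order (2): vector-matrix products, O(d*d) per step plus O(d*p) once.
g = v
for J in Js:
    g = g @ J                                # (1,d) @ (d,d) -> (1,d)
grad2 = g @ Jp                               # (1,d) @ (d,p) -> (1,p)

# Order (3): matrix-matrix products, O(d*d*p) per step -- roughly p times more work.
M = Jp
for J in reversed(Js):
    M = J @ M                                # (d,d) @ (d,p) -> (d,p)
grad3 = v @ M

print(np.allclose(grad2, grad3))             # True: same result, wildly different cost
```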
Backpropagation

The name backpropagation derives from the fact that we calculate our derivatives in a backwards fashion through the network:

- We first calculate the derivative of the loss with respect to the last layer, $\frac{\partial L_i}{\partial h^m_i} \left(= \frac{\partial L_i}{\partial f(x_i)}\right)$
- We then go backwards through the network and recursively calculate the vector–matrix products
$$\frac{\partial L_i}{\partial h^\ell_i} = \frac{\partial L_i}{\partial h^{\ell+1}_i} \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i} \quad \forall \ell = m-1, m-2, \dots, 1$$
- From these we can calculate the empirical risk derivatives for each set of parameters (a code sketch follows below) as
$$\frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell}$$
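Continuing the forward() sketch from earlier, here is a minimal backward recursion for the same tanh network (names and shapes are our own illustrative choices):

```python
import numpy as np

# Backward pass for h^l = tanh(W^l h^{l-1} + b^l), given cached hs from forward().
def backward(hs, params, dL_dhm):
    """hs: cached h^0..h^m; dL_dhm: dL/dh^m as a vector."""
    grads = []
    delta = dL_dhm                       # dL/dh^l, starting at l = m
    for l in range(len(params), 0, -1):
        W, b = params[l - 1]
        h, h_prev = hs[l], hs[l - 1]
        pre = delta * (1 - h ** 2)       # fold in tanh'(a) = 1 - tanh(a)^2
        grads.append((np.outer(pre, h_prev), pre))  # (dL/dW^l, dL/db^l)
        delta = pre @ W                  # dL/dh^{l-1}: a vector-matrix product
    return grads[::-1]                   # gradients ordered from layer 1 to layer m
```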
Backpropagation in Sum Notation

When performing backpropagation manually, it can sometimes be helpful to express these rules as summations rather than vector–matrix products. Namely, using the shorthand $h^\ell_{ij} = \left(h^\ell_i\right)_j$, we have

$$\frac{\partial L_i}{\partial h^\ell_{ij}} = \sum_{k=1}^{d_{\ell+1}} \frac{\partial L_i}{\partial h^{\ell+1}_{ik}} \frac{\partial h^{\ell+1}_{ik}}{\partial h^\ell_{ij}} \tag{4}$$

$$\frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d_\ell} \frac{\partial L_i}{\partial h^\ell_{ij}} \frac{\partial h^\ell_{ij}}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell} \tag{5}$$

Note that backpropagation is a recursive algorithm: we do not algebraically roll out the equation, but instead step backwards and separately calculate the values of the gradient at each layer from those at the next, for each $\{x_i, y_i\}$.
Backpropagation Example

Consider an MLP with squared loss $L(y_i, f(x_i)) = (y_i - f_\theta(x_i))^2$ and tanh activations (note $\frac{d \tanh(a)}{da} = 1 - \tanh^2 a$). Here we have

$$\frac{\partial L_i}{\partial h^m_i} = 2(h^m_i - y_i) \quad \text{(can be a scalar or a row vector)}$$

$$\frac{\partial L_i}{\partial h^\ell_{ij}} = \sum_{k=1}^{d_{\ell+1}} \frac{\partial L_i}{\partial h^{\ell+1}_{ik}} \frac{\partial h^{\ell+1}_{ik}}{\partial h^\ell_{ij}} \quad \forall \ell \in \{1, \dots, m-1\}$$

$$\frac{\partial h^{\ell+1}_{ik}}{\partial h^\ell_{ij}} = \frac{\partial}{\partial h^\ell_{ij}} \tanh\left(\sum_{t=1}^{d_\ell} W^{\ell+1}_{kt} h^\ell_{it} + b^{\ell+1}_k\right) = W^{\ell+1}_{kj}\left(1 - \left(h^{\ell+1}_{ik}\right)^2\right)$$

$$\implies \frac{\partial L_i}{\partial h^\ell_{ij}} = \sum_{k=1}^{d_{\ell+1}} \frac{\partial L_i}{\partial h^{\ell+1}_{ik}} W^{\ell+1}_{kj}\left(1 - \left(h^{\ell+1}_{ik}\right)^2\right)$$
Backpropagation Example (2)

These can be recursively calculated and we further have

$$\frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{d_\ell} \frac{\partial L_i}{\partial h^\ell_{ij}} \frac{\partial h^\ell_{ij}}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell}$$

$$\frac{\partial h^\ell_{ij}}{\partial W^\ell_{ts}} = \mathbb{I}(t = j)\, h^{\ell-1}_{is}\left(1 - \left(h^\ell_{ij}\right)^2\right)$$

$$\implies \frac{\partial \hat{R}(\theta)}{\partial W^\ell_{ts}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^\ell_{it}}\, h^{\ell-1}_{is}\left(1 - \left(h^\ell_{it}\right)^2\right) + \lambda \frac{\partial r(\theta)}{\partial W^\ell_{ts}}$$

$$\frac{\partial h^\ell_{ij}}{\partial b^\ell_t} = \mathbb{I}(t = j)\left(1 - \left(h^\ell_{ij}\right)^2\right)$$

$$\implies \frac{\partial \hat{R}(\theta)}{\partial b^\ell_t} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^\ell_{it}}\left(1 - \left(h^\ell_{it}\right)^2\right) + \lambda \frac{\partial r(\theta)}{\partial b^\ell_t}$$
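These formulas are exactly what the backward() sketch above implements; a quick finite-difference check (our own construction, on a tiny illustrative two-layer network) confirms them:

```python
import numpy as np

# Sanity-check the derived gradients with a central finite difference,
# using the forward()/backward() sketches from earlier.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 2)), rng.normal(size=3)),
          (rng.normal(size=(1, 3)), rng.normal(size=1))]
x, y = rng.normal(size=2), 0.5

hs = forward(x, params)
grads = backward(hs, params, dL_dhm=2 * (hs[-1] - y))  # dL/dh^m = 2(h^m - y)

def loss_with(w00):  # the loss as a function of the single entry W^1_{00}
    (W1, b1), (W2, b2) = params
    W1 = W1.copy(); W1[0, 0] = w00
    h = np.tanh(W2 @ np.tanh(W1 @ x + b1) + b2)
    return float((y - h) ** 2)

eps, w00 = 1e-6, params[0][0][0, 0]
fd = (loss_with(w00 + eps) - loss_with(w00 - eps)) / (2 * eps)
print(grads[0][0][0, 0], fd)   # the two should agree to several decimal places
```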
Automatic Differentiation (AutoDiff)

- Many modern systems can calculate derivatives for you automatically, even for large complex programs, using a method called automatic differentiation (AutoDiff)
- Most popular frameworks: PyTorch and TensorFlow
- In practice you never need to manually calculate network derivatives (except potentially in exams)!
- AutoDiff is exact (i.e. it is not a numerical approximation), but it is not symbolic either
- It has two forms: forward mode and reverse mode
- The latter is cheap and scalable for deep learning: calculating derivatives only adds a small constant factor to the runtime compared to simply evaluating the program itself
- It still works for networks that do not satisfy our earlier feed–forward assumptions (e.g. RNNs, ResNets)
Reverse Mode AutoDiff Performs Backpropagation

Consider an arbitrary node $u$ in a computation graph and assume that its child nodes are $v_{1:n_c}$.

By the chain rule, the gradient of some arbitrary downstream node in the computation graph, $z$, with respect to $u$ is given by

$$\frac{\partial z}{\partial u} = \sum_{j=1}^{n_c} \frac{\partial z}{\partial v_j} \frac{\partial v_j}{\partial u}$$

Here $\frac{\partial v_j}{\partial u}$ is a local computation, so if all the $\frac{\partial z}{\partial v_j}$ are known, the above allows us to calculate $\frac{\partial z}{\partial u}$ using only the current value of $u$ and its relationship with its children.

If all parent–child derivatives are known, we can calculate the partial derivatives of $z$ with respect to all nodes by recursively calculating the derivatives of parent nodes from those of their child nodes.
Reverse Mode AutoDiff Example

Backpropagation is Reverse Mode AutoDiff

[Figure (not reproduced): a computation graph worked example, built up over several slides. Presume we want to calculate derivatives for $y_1$; forward function calculations are shown in blue and reverse derivative calculations in red. Slide credit: Güneş Baydin.]
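Since the figure is lost here, a small stand-in example (our own choice of function, $y = x_1 x_2 + \sin(x_1)$, not necessarily the slide's) shows the same forward/reverse split by hand:

```python
import math

x1, x2 = 2.0, 3.0

# Forward function calculations (the "blue" pass):
v1 = x1 * x2           # product node
v2 = math.sin(x1)      # sin node
y = v1 + v2

# Reverse derivative calculations (the "red" pass), seeded with dy/dy = 1:
dy_dv1 = 1.0           # y = v1 + v2
dy_dv2 = 1.0
dy_dx1 = dy_dv1 * x2 + dy_dv2 * math.cos(x1)  # x1 has two children: v1 and v2
dy_dx2 = dy_dv1 * x1

print(dy_dx1, dy_dx2)  # x2 + cos(x1) and x1, as the chain rule dictates
```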
Creating an AutoDiff System

Consider an arbitrary primitive operation pr that takes in inputs $u_1, \dots, u_{n_u}$ and returns outputs $v_1, \dots, v_{n_v}$. Assume for simplicity that each $u_j, v_j \in \mathbb{R}$.

To construct a reverse mode AutoDiff system, each such pr must have two associated methods:

- The forward calculation $(v_1, \dots, v_{n_v}) = \text{pr}(u_1, \dots, u_{n_u})$
- A vector–Jacobian product calculator, vjp, that takes an additional input $\Delta$ of the same size as $v$, such that
$$\text{pr.vjp}(u, \Delta) = \Delta^T \frac{\partial v}{\partial u} = \left(\sum_{j=1}^{n_v} \Delta_j \frac{\partial v_j}{\partial u_1}, \dots, \sum_{j=1}^{n_v} \Delta_j \frac{\partial v_j}{\partial u_{n_u}}\right)$$
where $\frac{\partial v}{\partial u}$ is a Jacobian of pr whose $(i,j)$th entry is $\frac{\partial v_i}{\partial u_j}$ and we are implicitly evaluating these at the provided input values
Creating an AutoDiff System (2)

Example: product operator primitive (with $u \in \mathbb{R}^2$, $v \in \mathbb{R}$)

$$\text{prod}(u_1, u_2) = u_1 u_2 \qquad \text{prod.vjp}(u, \Delta) = [u_2 \Delta, \; u_1 \Delta]$$

If our language only allows such primitives, we can perform reverse mode AutoDiff on arbitrary computation graphs: if $\Delta$ is the set of required derivatives for the output nodes, $\text{pr.vjp}(u, \Delta)$ produces the set of required derivatives for the input nodes.

We can also implicitly define a vjp operator for compound operations by running AutoDiff on the low-level graph, e.g.

$$h^m_i = f^m_{\theta^m}\left(h^{m-1}_i\right) \qquad f^m_{\theta^m}.\text{vjp}\left(h^{m-1}_i, \left(\frac{\partial L_i}{\partial h^m_i}\right)^{\!T}\right) = \frac{\partial L_i}{\partial h^m_i} \frac{\partial h^m_i}{\partial h^{m-1}_i}$$

Most systems further allow arbitrary tensor sizes for $u$ and $v$, e.g. so that we can batch computation across multiple different $i$.
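Putting these pieces together, here is a minimal sketch of a vjp-based reverse mode AutoDiff system (a toy design of our own; real systems such as PyTorch are far more elaborate):

```python
import math

tape = []  # records (inputs, output, vjp) in execution order

class Var:
    def __init__(self, value):
        self.value, self.grad = value, 0.0

def primitive(fwd, vjp):
    """Wrap a forward calculation and its vector-Jacobian product."""
    def apply(*inputs):
        out = Var(fwd(*[u.value for u in inputs]))
        tape.append((inputs, out, vjp))
        return out
    return apply

prod = primitive(lambda u1, u2: u1 * u2,
                 lambda u, delta: [u[1] * delta, u[0] * delta])
add  = primitive(lambda u1, u2: u1 + u2,
                 lambda u, delta: [delta, delta])
sin  = primitive(lambda u1: math.sin(u1),
                 lambda u, delta: [math.cos(u[0]) * delta])

def reverse_pass(output):
    output.grad = 1.0                        # seed dz/dz = 1
    for inputs, out, vjp in reversed(tape):  # walk the graph in reverse
        for u, g in zip(inputs, vjp([u.value for u in inputs], out.grad)):
            u.grad += g                      # accumulate over all children

# Usage: differentiate y = x1*x2 + sin(x1) from the earlier example.
x1, x2 = Var(2.0), Var(3.0)
reverse_pass(add(prod(x1, x2), sin(x1)))
print(x1.grad, x2.grad)                      # x2 + cos(x1), x1
```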
AutoDiff in PyTorch

In PyTorch, AutoDiff allows us to calculate gradients by calling .backward() on the term we wish to calculate gradients for.

[Code figure not reproduced. Credit: Güneş Baydin]

In the context of deep learning, given a forward model, we can introduce a variable loss corresponding to $L(y_i, f_\theta(x_i))$ (typically as a vector over multiple $i$) and then simply call loss.backward() to calculate the gradients of all our network parameters.
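A minimal sketch of this workflow (the architecture and sizes are illustrative choices, not the slide's original example):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 3), nn.Tanh(), nn.Linear(3, 1))
xs, ys = torch.randn(8, 2), torch.randn(8, 1)   # a batch over multiple i

loss = ((ys - net(xs)) ** 2).mean()   # (1/n) sum_i L(y_i, f_theta(x_i))
loss.backward()                       # reverse mode AutoDiff through the graph

# Every parameter now holds its gradient in .grad:
for name, p in net.named_parameters():
    print(name, p.grad.shape)
```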
Recap

- The order we do computations in is critically important when using the chain rule
- Backpropagation allows us to efficiently calculate all the required gradients for a neural network using the recursive equations (for feed–forward networks)
$$\frac{\partial L_i}{\partial h^\ell_i} = \frac{\partial L_i}{\partial h^{\ell+1}_i} \frac{\partial h^{\ell+1}_i}{\partial h^\ell_i}, \qquad \frac{\partial \hat{R}(\theta)}{\partial \theta^\ell} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial L_i}{\partial h^\ell_i} \frac{\partial h^\ell_i}{\partial \theta^\ell} + \lambda \frac{\partial r(\theta)}{\partial \theta^\ell}$$
- In practice, we do not need to calculate these manually, as AutoDiff systems allow the gradient calculations to be performed automatically
Further Reading

- Helpful videos by 3Blue1Brown: https://youtu.be/Ilg3gGewQ5U and https://youtu.be/tIeHLnjs5U8
- Lecture 4 from the Stanford course (https://youtu.be/d14TUNcbn1k)
- Güneş Baydin's slides on automatic differentiation (https://www.cs.ox.ac.uk/teaching/courses/2019-2020/advml/) and tutorial paper (https://www.jmlr.org/papers/volume18/17-468/17-468.pdf)
- Have a play with PyTorch and/or TensorFlow!
