
Lecture - 14 - FFNN


Lecture – 4

Feedforward Neural Networks, Backpropagation

Dr. Jagendra Singh


References/Acknowledgments
See the excellent videos by Hugo Larochelle on Backpropagation

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
Module 4.1: Feedforward Neural Networks (a.k.a.
multilayered network of neurons)

[Figure: a feedforward network with inputs $x_1, x_2, \dots, x_n$ at the bottom, pre-activations $a_1, a_2, a_3$, activations $h_1, h_2$, parameters $(W_1, b_1)$, $(W_2, b_2)$, $(W_3, b_3)$, and output $h_L = \hat{y} = f(x)$ at the top]
The input to the network is an n-dimensional vector.
The network contains $L - 1$ hidden layers (2, in this case) having n neurons each.
Finally, there is one output layer containing k neurons (say, corresponding to k classes).
Each neuron in the hidden layers and the output layer can be split into two parts: pre-activation and activation ($a_i$ and $h_i$ are vectors).
The input layer can be called the 0-th layer and the output layer can be called the L-th layer.
$W_i \in \mathbb{R}^{n \times n}$ and $b_i \in \mathbb{R}^{n}$ are the weight and bias between layers $i - 1$ and $i$ ($0 < i < L$).
$W_L \in \mathbb{R}^{k \times n}$ and $b_L \in \mathbb{R}^{k}$ are the weight and bias between the last hidden layer and the output layer ($L = 3$ in this case), so that $a_L = b_L + W_L h_{L-1}$ is k-dimensional.
The pre-activation at layer i is given by
$$a_i(x) = b_i + W_i h_{i-1}(x)$$
The activation at layer i is given by
$$h_i(x) = g(a_i(x))$$
where g is called the activation function (for example, logistic, tanh, linear, etc.).
The activation at the output layer is given by
$$f(x) = h_L(x) = O(a_L(x))$$
where O is the output activation function (for example, softmax, linear, etc.).
To simplify notation we will refer to $a_i(x)$ as $a_i$ and $h_i(x)$ as $h_i$, so that
$$a_i = b_i + W_i h_{i-1}, \qquad h_i = g(a_i), \qquad f(x) = h_L = O(a_L)$$
Data: $\{x_i, y_i\}_{i=1}^{N}$
Model:
$$\hat{y}_i = f(x_i) = O(W_3\, g(W_2\, g(W_1 x_i + b_1) + b_2) + b_3)$$
Parameters: $\theta = W_1, \dots, W_L, b_1, b_2, \dots, b_L$ ($L = 3$)
Algorithm: Gradient Descent with Backpropagation (we will see soon)
Objective/Loss/Error function: say,
$$\min \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} (\hat{y}_{ij} - y_{ij})^2$$
In general, $\min \mathscr{L}(\theta)$, where $\mathscr{L}(\theta)$ is some function of the parameters.

Module 4.2: Learning Parameters of Feedforward
Neural Networks (Intuition)

The story so far...
We have introduced feedforward neural networks
We are now interested in finding an algorithm for learning the parameters of
this model

Recall our gradient descent algorithm
Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize w_0, b_0;
    while t++ < max_iterations do
        w_{t+1} ← w_t − η ∇w_t;
        b_{t+1} ← b_t − η ∇b_t;
    end
We can write it more concisely as
Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize θ_0 = [w_0, b_0];
    while t++ < max_iterations do
        θ_{t+1} ← θ_t − η ∇θ_t;
    end
where $\nabla\theta_t = \left[\frac{\partial \mathscr{L}(\theta)}{\partial w_t}, \frac{\partial \mathscr{L}(\theta)}{\partial b_t}\right]^T$
Now, in this feedforward neural network, instead of $\theta = [w, b]$ we have $\theta = [W_1, W_2, \dots, W_L, b_1, b_2, \dots, b_L]$.
We can still use the same algorithm for learning the parameters of our model; only the initialization and the gradient change:
    Initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0];
where $\nabla\theta_t = \left[\frac{\partial \mathscr{L}(\theta)}{\partial W_{1,t}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,t}}, \frac{\partial \mathscr{L}(\theta)}{\partial b_{1,t}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial b_{L,t}}\right]^T$
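The update loop can be sketched in a few lines of Python. This is only an illustration: the function name `gradient_descent`, the flat-list representation of θ, and the toy quadratic loss are assumptions made for the example, not part of the lecture.

```python
def gradient_descent(grad_fn, theta0, eta=0.1, max_iterations=1000):
    """Repeat theta <- theta - eta * grad(theta) for a fixed number of steps."""
    theta = list(theta0)
    for _ in range(max_iterations):
        grad = grad_fn(theta)
        theta = [p - eta * g for p, g in zip(theta, grad)]
    return theta


# Hypothetical toy loss: L(w, b) = (w - 3)^2 + (b + 1)^2, minimised at (3, -1).
def toy_grad(theta):
    w, b = theta
    return [2.0 * (w - 3.0), 2.0 * (b + 1.0)]


theta_star = gradient_descent(toy_grad, [0.0, 0.0])
```

For a feedforward network the same loop applies unchanged; only `grad_fn` becomes more involved, since it has to return one entry per weight and bias.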
Except that now our $\nabla\theta$ looks much nastier, because it collects the partial derivative of the loss with respect to every individual weight and bias:
$$\nabla\theta = \left[ \frac{\partial \mathscr{L}(\theta)}{\partial W_{1,11}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{1,nn}}, \frac{\partial \mathscr{L}(\theta)}{\partial W_{2,11}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{2,nn}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,11}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial W_{L,kn}}, \frac{\partial \mathscr{L}(\theta)}{\partial b_{1,1}}, \dots, \frac{\partial \mathscr{L}(\theta)}{\partial b_{L,k}} \right]$$
$\nabla\theta$ is thus composed of
$\nabla W_1, \nabla W_2, \dots, \nabla W_{L-1} \in \mathbb{R}^{n \times n}$, $\nabla W_L \in \mathbb{R}^{k \times n}$,
$\nabla b_1, \nabla b_2, \dots, \nabla b_{L-1} \in \mathbb{R}^{n}$ and $\nabla b_L \in \mathbb{R}^{k}$

We need to answer two questions
How to choose the loss function $\mathscr{L}(\theta)$?
How to compute $\nabla\theta$, that is, each of the blocks $\nabla W_1, \dots, \nabla W_L$ and $\nabla b_1, \dots, \nabla b_L$?

Module 4.3: Output Functions and Loss Functions

The choice of loss function depends on the problem at hand.
We will illustrate this with the help of two examples.
Consider our movie example again, but this time we are interested in predicting ratings.
[Figure: a neural network with $L - 1$ hidden layers takes the input features $x_i$ (isActor Damon, ..., isDirector Nolan, ...) and predicts $y_i = \{7.5, 8.2, 7.7\}$, the IMDb, Critics and RT ratings]
Here $y_i \in \mathbb{R}^3$.
The loss function should capture how much $\hat{y}_i$ deviates from $y_i$.
If $y_i \in \mathbb{R}^n$ then the squared error loss can capture this deviation. For the three ratings here,
$$\mathscr{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{3} (\hat{y}_{ij} - y_{ij})^2$$
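As a small numerical illustration of this loss, the sketch below evaluates it for a single movie. The target ratings are the ones on the slide; the predicted ratings and the function name `squared_error_loss` are made up for the example.

```python
def squared_error_loss(y_hat, y):
    """Average over the N examples of the summed squared output differences."""
    n_examples = len(y)
    total = 0.0
    for pred, target in zip(y_hat, y):
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / n_examples


y = [[7.5, 8.2, 7.7]]        # true IMDb, Critics and RT ratings for one movie
y_hat = [[7.0, 8.0, 8.0]]    # hypothetical network predictions
loss = squared_error_loss(y_hat, y)   # 0.25 + 0.04 + 0.09, up to rounding
```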

A related question: what should the output function O be if $y_i \in \mathbb{R}$?
More specifically, can it be the logistic function?
No, because it restricts $\hat{y}_i$ to a value between 0 and 1, but we want $\hat{y}_i \in \mathbb{R}$.
So, in such cases it makes sense to have O as a linear function:
$$f(x) = h_L = O(a_L) = W_O a_L + b_O$$
$\hat{y}_i = f(x_i)$ is no longer bounded between 0 and 1.

Now let us consider another problem for which a different loss function would be appropriate.
Suppose we want to classify an image into 1 of k classes.
[Figure: a neural network with $L - 1$ hidden layers maps an image to the target $y = [1\ 0\ 0\ 0]$ over the classes Apple, Mango, Orange, Banana]
Here again we could use the squared error loss to capture the deviation.
But can you think of a better function?

Notice that y is a probability distribution.
Therefore we should also ensure that $\hat{y}$ is a probability distribution.
What choice of the output activation O will ensure this?
$$a_L = W_L h_{L-1} + b_L$$
$$\hat{y}_j = O(a_L)_j = \frac{e^{a_{L,j}}}{\sum_{i=1}^{k} e^{a_{L,i}}}$$
$O(a_L)_j$ is the j-th element of $\hat{y}$ and $a_{L,j}$ is the j-th element of the vector $a_L$.
This function is called the softmax function.
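A minimal sketch of the softmax function in plain Python follows. Subtracting the maximum entry before exponentiating is a common numerical-stability trick that is not on the slide; it leaves the result unchanged because the common factor cancels between numerator and denominator. The input values are arbitrary.

```python
import math


def softmax(a):
    """Map a vector of pre-activations to a probability distribution."""
    m = max(a)                              # shift for stability; cancels out
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


y_hat = softmax([2.0, 1.0, 0.1, -1.0])      # arbitrary example pre-activations
```

Every entry of `y_hat` is positive and the entries sum to 1, which is exactly the property the slide asks of the output activation.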

Now that we have ensured that both y and $\hat{y}$ are probability distributions, can you think of a function which captures the difference between them?
Cross-entropy:
$$\mathscr{L}(\theta) = -\sum_{c=1}^{k} y_c \log \hat{y}_c$$
Notice that
$$y_c = \begin{cases} 1 & \text{if } c = \ell \text{ (the true class label)} \\ 0 & \text{otherwise} \end{cases}$$
$$\therefore\ \mathscr{L}(\theta) = -\log \hat{y}_\ell$$
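The reduction of the sum to a single term can be checked numerically. In the sketch below the predicted distribution is invented for illustration, and terms with $y_c = 0$ are skipped because they contribute nothing to the sum.

```python
import math


def cross_entropy(y, y_hat):
    """Return -sum_c y_c * log(y_hat_c), skipping terms where y_c is zero."""
    return -sum(yc * math.log(pc) for yc, pc in zip(y, y_hat) if yc > 0)


y = [1, 0, 0, 0]                 # true class: Apple
y_hat = [0.7, 0.1, 0.1, 0.1]     # hypothetical softmax output
loss = cross_entropy(y, y_hat)   # only the true-class term survives
```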

So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function
$$\underset{\theta}{\text{minimize}}\ \mathscr{L}(\theta) = -\log \hat{y}_\ell$$
or
$$\underset{\theta}{\text{maximize}}\ -\mathscr{L}(\theta) = \log \hat{y}_\ell$$
But wait! Is $\hat{y}_\ell$ a function of $\theta = [W_1, W_2, \dots, W_L, b_1, b_2, \dots, b_L]$?
Yes, it is indeed a function of $\theta$:
$$\hat{y}_\ell = [O(W_3\, g(W_2\, g(W_1 x + b_1) + b_2) + b_3)]_\ell$$
What does $\hat{y}_\ell$ encode?
It is the probability that x belongs to the $\ell$-th class (we want to bring it as close to 1 as possible).
$\log \hat{y}_\ell$ is called the log-likelihood of the data.
Outputs              Real values      Probabilities
Output activation    Linear           Softmax
Loss function        Squared error    Cross entropy

Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often.
For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy.

Module 4.4: Backpropagation (Intuition)

[Figure: the network with inputs $x_1, x_2, \dots, x_d$, with the weight $W_{112}$ highlighted]
Let us focus on this one weight ($W_{112}$).
To learn this weight using SGD we need a formula for $\frac{\partial \mathscr{L}(\theta)}{\partial W_{112}}$, which is one entry of the $\nabla\theta_t$ used in the update $\theta_{t+1} \leftarrow \theta_t - \eta \nabla\theta_t$ of the gradient descent algorithm.
We will see how to calculate this.

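First let us take the simple case when we have a deep but thin network (one neuron per layer: $x_1 \to a_{11} \to h_{11} \to a_{21} \to h_{21} \to a_{L1} \to \hat{y} = f(x) \to \mathscr{L}(\theta)$).
In this case it is easy to find the derivative by the chain rule.
$$\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}} = \frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_{L1}} \frac{\partial a_{L1}}{\partial h_{21}} \frac{\partial h_{21}}{\partial a_{21}} \frac{\partial a_{21}}{\partial h_{11}} \frac{\partial h_{11}}{\partial a_{11}} \frac{\partial a_{11}}{\partial W_{111}}$$
$$\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}} = \frac{\partial \mathscr{L}(\theta)}{\partial h_{11}} \frac{\partial h_{11}}{\partial W_{111}} \quad \text{(just compressing the chain rule)}$$
$$\frac{\partial \mathscr{L}(\theta)}{\partial W_{211}} = \frac{\partial \mathscr{L}(\theta)}{\partial h_{21}} \frac{\partial h_{21}}{\partial W_{211}}$$
$$\frac{\partial \mathscr{L}(\theta)}{\partial W_{L11}} = \frac{\partial \mathscr{L}(\theta)}{\partial a_{L1}} \frac{\partial a_{L1}}{\partial W_{L11}}$$

Let us see an intuitive explanation of backpropagation before we get into the mathematical details.
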
We get a certain loss at the output ($-\log \hat{y}_\ell$) and we try to figure out who is responsible for this loss.
So, we talk to the output layer and say “Hey! You are not producing the desired output, better take responsibility”.
The output layer says “Well, I take responsibility for my part but please understand that I am only as good as the hidden layer and weights below me”. After all,
$$f(x) = \hat{y} = O(W_L h_{L-1} + b_L)$$

So, we talk to $W_L$, $b_L$ and $h_L$ and ask them “What is wrong with you?”
$W_L$ and $b_L$ take full responsibility but $h_L$ says “Well, please understand that I am only as good as the pre-activation layer”.
The pre-activation layer in turn says “I am only as good as the hidden layer and weights below me”.
We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e. all the parameters of the model).
But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do):
$$\underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}}_{\text{talk to the weight directly}} = \underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_3}}_{\text{talk to the output layer}} \underbrace{\frac{\partial a_3}{\partial h_2} \frac{\partial h_2}{\partial a_2}}_{\text{talk to the previous hidden layer}} \underbrace{\frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1}}_{\text{talk to the previous hidden layer}} \underbrace{\frac{\partial a_1}{\partial W_{111}}}_{\text{and now talk to the weights}}$$
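Quantities of interest (roadmap for the remaining part):
Gradient w.r.t. output units
Gradient w.r.t. hidden units
Gradient w.r.t. weights and biases

$$\underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial W_{111}}}_{\text{talk to the weight directly}} = \underbrace{\frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_3}}_{\text{talk to the output layer}} \underbrace{\frac{\partial a_3}{\partial h_2} \frac{\partial h_2}{\partial a_2}}_{\text{talk to the previous hidden layer}} \underbrace{\frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1}}_{\text{talk to the previous hidden layer}} \underbrace{\frac{\partial a_1}{\partial W_{111}}}_{\text{and now talk to the weights}}$$

Our focus is on cross-entropy loss and softmax output.
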
Module 4.5: Backpropagation: Computing Gradients
w.r.t. the Output Units

Let us first consider the partial derivative w.r.t. the i-th output.
$$\mathscr{L}(\theta) = -\log \hat{y}_\ell \quad (\ell = \text{true class label})$$
$$\frac{\partial}{\partial \hat{y}_i} \mathscr{L}(\theta) = \frac{\partial}{\partial \hat{y}_i} (-\log \hat{y}_\ell) = \begin{cases} -\frac{1}{\hat{y}_\ell} & \text{if } i = \ell \\ 0 & \text{otherwise} \end{cases}$$
More compactly,
$$\frac{\partial}{\partial \hat{y}_i} \mathscr{L}(\theta) = -\frac{\mathbb{1}_{(i=\ell)}}{\hat{y}_\ell}$$
We can now talk about the gradient w.r.t. the vector $\hat{y}$:
$$\nabla_{\hat{y}} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}_1} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial \hat{y}_k} \end{bmatrix} = -\frac{1}{\hat{y}_\ell} \begin{bmatrix} \mathbb{1}_{\ell=1} \\ \mathbb{1}_{\ell=2} \\ \vdots \\ \mathbb{1}_{\ell=k} \end{bmatrix} = -\frac{1}{\hat{y}_\ell}\, e(\ell)$$
where $e(\ell)$ is a k-dimensional vector whose $\ell$-th element is 1 and all other elements are 0.

What we are actually interested in is
$$\frac{\partial \mathscr{L}(\theta)}{\partial a_{L,i}} = \frac{\partial (-\log \hat{y}_\ell)}{\partial a_{L,i}} = \frac{\partial (-\log \hat{y}_\ell)}{\partial \hat{y}_\ell} \frac{\partial \hat{y}_\ell}{\partial a_{L,i}}$$
Does $\hat{y}_\ell$ depend on $a_{L,i}$? Indeed, it does.
$$\hat{y}_\ell = \frac{\exp(a_{L,\ell})}{\sum_{i} \exp(a_{L,i})}$$
Having established this, we will now derive the full expression.

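We use the quotient rule
$$\frac{\partial}{\partial x} \frac{g(x)}{h(x)} = \frac{\partial g(x)}{\partial x} \frac{1}{h(x)} - \frac{g(x)}{h(x)^2} \frac{\partial h(x)}{\partial x}$$
$$\frac{\partial}{\partial a_{L,i}} (-\log \hat{y}_\ell) = \frac{-1}{\hat{y}_\ell} \frac{\partial}{\partial a_{L,i}} \hat{y}_\ell$$
$$= \frac{-1}{\hat{y}_\ell} \frac{\partial}{\partial a_{L,i}} \text{softmax}(a_L)_\ell$$
$$= \frac{-1}{\hat{y}_\ell} \frac{\partial}{\partial a_{L,i}} \frac{\exp(a_{L,\ell})}{\sum_{i'} \exp(a_{L,i'})}$$
$$= \frac{-1}{\hat{y}_\ell} \left( \frac{\frac{\partial}{\partial a_{L,i}} \exp(a_{L,\ell})}{\sum_{i'} \exp(a_{L,i'})} - \frac{\exp(a_{L,\ell}) \left( \frac{\partial}{\partial a_{L,i}} \sum_{i'} \exp(a_{L,i'}) \right)}{\left( \sum_{i'} \exp(a_{L,i'}) \right)^2} \right)$$
$$= \frac{-1}{\hat{y}_\ell} \left( \frac{\mathbb{1}_{(\ell=i)} \exp(a_{L,\ell})}{\sum_{i'} \exp(a_{L,i'})} - \frac{\exp(a_{L,\ell})}{\sum_{i'} \exp(a_{L,i'})} \cdot \frac{\exp(a_{L,i})}{\sum_{i'} \exp(a_{L,i'})} \right)$$
$$= \frac{-1}{\hat{y}_\ell} \left( \mathbb{1}_{(\ell=i)}\, \text{softmax}(a_L)_\ell - \text{softmax}(a_L)_\ell\, \text{softmax}(a_L)_i \right)$$
$$= \frac{-1}{\hat{y}_\ell} \left( \mathbb{1}_{(\ell=i)}\, \hat{y}_\ell - \hat{y}_\ell\, \hat{y}_i \right)$$
$$= -\left( \mathbb{1}_{(\ell=i)} - \hat{y}_i \right)$$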
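The result of this derivation, $\frac{\partial \mathscr{L}(\theta)}{\partial a_{L,i}} = -(\mathbb{1}_{(\ell=i)} - \hat{y}_i)$, can be sanity-checked against a central finite difference. The sketch below does that for arbitrary pre-activation values; the helper names are hypothetical and class labels are 0-based, as is usual in Python.

```python
import math


def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


def loss(a, label):
    """Cross-entropy loss -log(y_hat[label]) as a function of a_L."""
    return -math.log(softmax(a)[label])


def analytic_grad(a, label):
    """-(1[label == i] - y_hat_i), written as y_hat_i - 1[label == i]."""
    y_hat = softmax(a)
    return [y_hat[i] - (1.0 if i == label else 0.0) for i in range(len(a))]


def numeric_grad(a, label, eps=1e-6):
    """Central finite difference, one coordinate of a_L at a time."""
    grad = []
    for i in range(len(a)):
        up, down = list(a), list(a)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, label) - loss(down, label)) / (2 * eps))
    return grad


a_L = [0.5, -1.2, 2.0]   # arbitrary pre-activations for k = 3 classes
label = 2                # 0-based index of the true class
max_gap = max(abs(x - y) for x, y in
              zip(analytic_grad(a_L, label), numeric_grad(a_L, label)))
```

The two gradients should agree up to finite-difference error, so `max_gap` is expected to be tiny.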


So far we have derived the partial derivative w.r.t. the i-th element of $a_L$:
$$\frac{\partial \mathscr{L}(\theta)}{\partial a_{L,i}} = -(\mathbb{1}_{\ell=i} - \hat{y}_i)$$
We can now write the gradient w.r.t. the vector $a_L$:
$$\nabla_{a_L} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{L,1}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{L,k}} \end{bmatrix} = \begin{bmatrix} -(\mathbb{1}_{\ell=1} - \hat{y}_1) \\ -(\mathbb{1}_{\ell=2} - \hat{y}_2) \\ \vdots \\ -(\mathbb{1}_{\ell=k} - \hat{y}_k) \end{bmatrix} = -(e(\ell) - \hat{y})$$

Module 4.6: Backpropagation: Computing Gradients
w.r.t. Hidden Units

Chain rule along multiple paths: if a function p(z) can be written as a function of intermediate results $q_m(z)$, then we have
$$\frac{\partial p(z)}{\partial z} = \sum_{m} \frac{\partial p(z)}{\partial q_m(z)} \frac{\partial q_m(z)}{\partial z}$$
In our case:
p(z) is the loss function $\mathscr{L}(\theta)$
$z = h_{ij}$
$q_m(z) = a_{L,m}$
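Since $a_{i+1} = W_{i+1} h_i + b_{i+1}$, every element of $a_{i+1}$ depends on $h_{ij}$, so
$$\frac{\partial \mathscr{L}(\theta)}{\partial h_{ij}} = \sum_{m=1}^{k} \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,m}} \frac{\partial a_{i+1,m}}{\partial h_{ij}} = \sum_{m=1}^{k} \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,m}} W_{i+1,m,j}$$
Now consider these two vectors,
$$\nabla_{a_{i+1}} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,1}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,k}} \end{bmatrix}; \qquad W_{i+1,\cdot,j} = \begin{bmatrix} W_{i+1,1,j} \\ \vdots \\ W_{i+1,k,j} \end{bmatrix}$$
$W_{i+1,\cdot,j}$ is the j-th column of $W_{i+1}$; see that
$$(W_{i+1,\cdot,j})^T \nabla_{a_{i+1}} \mathscr{L}(\theta) = \sum_{m=1}^{k} \frac{\partial \mathscr{L}(\theta)}{\partial a_{i+1,m}} W_{i+1,m,j}$$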
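We have
$$\frac{\partial \mathscr{L}(\theta)}{\partial h_{ij}} = (W_{i+1,\cdot,j})^T \nabla_{a_{i+1}} \mathscr{L}(\theta)$$
We can now write the gradient w.r.t. $h_i$:
$$\nabla_{h_i} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial h_{i1}} \\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{i2}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{in}} \end{bmatrix} = \begin{bmatrix} (W_{i+1,\cdot,1})^T \nabla_{a_{i+1}} \mathscr{L}(\theta) \\ (W_{i+1,\cdot,2})^T \nabla_{a_{i+1}} \mathscr{L}(\theta) \\ \vdots \\ (W_{i+1,\cdot,n})^T \nabla_{a_{i+1}} \mathscr{L}(\theta) \end{bmatrix} = (W_{i+1})^T (\nabla_{a_{i+1}} \mathscr{L}(\theta))$$
We are almost done except that we do not know how to calculate $\nabla_{a_{i+1}} \mathscr{L}(\theta)$ for $i < L - 1$.
We will see how to compute that.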
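The matrix form $(W_{i+1})^T \nabla_{a_{i+1}} \mathscr{L}(\theta)$ is just the sum over m written once per column j. A minimal sketch with made-up numbers (a layer of 3 units feeding a layer of 2 units; the function name is hypothetical):

```python
def grad_wrt_hidden(W_next, grad_a_next):
    """Entry j is sum_m W_next[m][j] * grad_a_next[m], i.e. (W^T grad_a)_j."""
    n_rows, n_cols = len(W_next), len(W_next[0])
    return [sum(W_next[m][j] * grad_a_next[m] for m in range(n_rows))
            for j in range(n_cols)]


W_next = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]        # row m holds the weights into unit m of layer i+1
grad_a_next = [0.5, -1.0]         # hypothetical gradient w.r.t. a_{i+1}
grad_h = grad_wrt_hidden(W_next, grad_a_next)   # [-3.5, -4.0, -4.5]
```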
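$$\nabla_{a_i} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{i1}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{in}} \end{bmatrix}$$
$$\frac{\partial \mathscr{L}(\theta)}{\partial a_{ij}} = \frac{\partial \mathscr{L}(\theta)}{\partial h_{ij}} \frac{\partial h_{ij}}{\partial a_{ij}} = \frac{\partial \mathscr{L}(\theta)}{\partial h_{ij}}\, g'(a_{ij}) \quad [\because h_{ij} = g(a_{ij})]$$
$$\nabla_{a_i} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial h_{i1}}\, g'(a_{i1}) \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial h_{in}}\, g'(a_{in}) \end{bmatrix} = \nabla_{h_i} \mathscr{L}(\theta) \odot [\dots, g'(a_{ik}), \dots]$$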
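The element-wise product can be sketched as follows, assuming the logistic function as the activation g (the slide leaves g generic). The gradient values and pre-activations are invented for the example.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_prime(z):
    s = logistic(z)
    return s * (1.0 - s)


def grad_wrt_preactivation(grad_h, a, g_prime=logistic_prime):
    """Element-wise product of grad_h with g'(a), one entry per neuron."""
    return [gh * g_prime(aj) for gh, aj in zip(grad_h, a)]


grad_h = [-3.5, -4.0, -4.5]      # hypothetical gradient w.r.t. h_i
a_i = [0.0, 0.0, 0.0]            # pre-activations; logistic'(0) = 0.25
grad_a = grad_wrt_preactivation(grad_h, a_i)   # [-0.875, -1.0, -1.125]
```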

Module 4.7: Backpropagation: Computing Gradients
w.r.t. Parameters

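Recall that
$$a_k = b_k + W_k h_{k-1}$$
$$\frac{\partial a_{ki}}{\partial W_{kij}} = h_{k-1,j}$$
$$\frac{\partial \mathscr{L}(\theta)}{\partial W_{kij}} = \frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}} \frac{\partial a_{ki}}{\partial W_{kij}} = \frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}\, h_{k-1,j}$$
$$\nabla_{W_k} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial W_{k11}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k12}} & \dots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k1n}} \\ \vdots & \vdots & \ddots & \vdots \\ \dots & \dots & \dots & \frac{\partial \mathscr{L}(\theta)}{\partial W_{knn}} \end{bmatrix}$$
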
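Let's take a simple example of a $W_k \in \mathbb{R}^{3 \times 3}$ and see what each entry looks like.
$$\nabla_{W_k} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial W_{k11}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k12}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k13}} \\ \frac{\partial \mathscr{L}(\theta)}{\partial W_{k21}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k22}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k23}} \\ \frac{\partial \mathscr{L}(\theta)}{\partial W_{k31}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k32}} & \frac{\partial \mathscr{L}(\theta)}{\partial W_{k33}} \end{bmatrix}, \qquad \frac{\partial \mathscr{L}(\theta)}{\partial W_{kij}} = \frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}} \frac{\partial a_{ki}}{\partial W_{kij}}$$
$$\nabla_{W_k} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}} h_{k-1,1} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}} h_{k-1,2} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}} h_{k-1,3} \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}} h_{k-1,1} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}} h_{k-1,2} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}} h_{k-1,3} \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{k3}} h_{k-1,1} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k3}} h_{k-1,2} & \frac{\partial \mathscr{L}(\theta)}{\partial a_{k3}} h_{k-1,3} \end{bmatrix} = \nabla_{a_k} \mathscr{L}(\theta) \cdot h_{k-1}^T$$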
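The product $\nabla_{a_k} \mathscr{L}(\theta) \cdot h_{k-1}^T$ is an outer product: row i is the i-th gradient entry times the whole vector $h_{k-1}$. A small sketch with invented 3-dimensional values (the function name is hypothetical):

```python
def grad_wrt_weights(grad_a, h_prev):
    """Outer product: entry (i, j) is grad_a[i] * h_prev[j]."""
    return [[ga * hp for hp in h_prev] for ga in grad_a]


grad_a_k = [1.0, -2.0, 0.5]      # hypothetical gradient w.r.t. a_k
h_prev = [2.0, 3.0, 4.0]         # hypothetical activations h_{k-1}
grad_W_k = grad_wrt_weights(grad_a_k, h_prev)
# [[2.0, 3.0, 4.0], [-4.0, -6.0, -8.0], [1.0, 1.5, 2.0]]
```

The result has the same shape as $W_k$, which is what the update $W_k \leftarrow W_k - \eta \nabla_{W_k} \mathscr{L}(\theta)$ requires.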

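Finally, coming to the biases
$$a_{ki} = b_{ki} + \sum_{j} W_{kij}\, h_{k-1,j}$$
$$\frac{\partial \mathscr{L}(\theta)}{\partial b_{ki}} = \frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}} \frac{\partial a_{ki}}{\partial b_{ki}} = \frac{\partial \mathscr{L}(\theta)}{\partial a_{ki}}$$
We can now write the gradient w.r.t. the vector $b_k$:
$$\nabla_{b_k} \mathscr{L}(\theta) = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial a_{k1}} \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{k2}} \\ \vdots \\ \frac{\partial \mathscr{L}(\theta)}{\partial a_{kn}} \end{bmatrix} = \nabla_{a_k} \mathscr{L}(\theta)$$
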
Module 4.8: Backpropagation: Pseudo code

Finally, we have all the pieces of the puzzle

$\nabla_{a_L} \mathscr{L}(\theta)$ (gradient w.r.t. output layer)

$\nabla_{h_k} \mathscr{L}(\theta), \nabla_{a_k} \mathscr{L}(\theta)$ (gradient w.r.t. hidden layers, $1 \le k < L$)

$\nabla_{W_k} \mathscr{L}(\theta), \nabla_{b_k} \mathscr{L}(\theta)$ (gradient w.r.t. weights and biases, $1 \le k \le L$)

We can now write the full learning algorithm

Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0];
    while t++ < max_iterations do
        h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, ŷ = forward_propagation(θ_t);
        ∇θ_t = backward_propagation(h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, y, ŷ);
        θ_{t+1} ← θ_t − η ∇θ_t;
    end

Algorithm: forward_propagation(θ)
    for k = 1 to L − 1 do
        a_k = b_k + W_k h_{k−1};
        h_k = g(a_k);
    end
    a_L = b_L + W_L h_{L−1};
    ŷ = O(a_L);
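A direct transcription of this pseudocode into plain Python might look as follows. It is a sketch under stated assumptions: matrices are lists of rows, g is taken to be the logistic function and O the softmax, $h_0$ is the input x, and the weight values in the example are arbitrary.

```python
import math


def affine(W, b, h):
    """Return b + W h for a matrix W (list of rows) and vectors b, h."""
    return [bi + sum(w * x for w, x in zip(row, h)) for row, bi in zip(W, b)]


def logistic(a):
    return [1.0 / (1.0 + math.exp(-v)) for v in a]


def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


def forward_propagation(weights, biases, x, g=logistic, O=softmax):
    hs, pre = [x], []                     # hs[0] plays the role of h_0 = x
    for W, b in zip(weights[:-1], biases[:-1]):       # layers 1 .. L-1
        pre.append(affine(W, b, hs[-1]))              # a_k = b_k + W_k h_{k-1}
        hs.append(g(pre[-1]))                         # h_k = g(a_k)
    pre.append(affine(weights[-1], biases[-1], hs[-1]))   # a_L
    return hs, pre, O(pre[-1])                        # y_hat = O(a_L)


# Arbitrary example with L = 3, n = 2 inputs/hidden units and k = 3 classes.
weights = [[[0.1, -0.2], [0.4, 0.3]],
           [[0.5, 0.6], [-0.7, 0.8]],
           [[0.2, -0.1], [0.0, 0.3], [-0.4, 0.5]]]
biases = [[0.0, 0.1], [0.1, -0.1], [0.0, 0.0, 0.0]]
hs, pre, y_hat = forward_propagation(weights, biases, [1.0, 2.0])
```

The function returns every $h_k$ and $a_k$ along with $\hat{y}$ because the backward pass needs all of them.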

Just do a forward propagation and compute all $h_i$'s, $a_i$'s, and $\hat{y}$
Algorithm: back_propagation(h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, y, ŷ)
    // Compute output gradient;
    ∇_{a_L} L(θ) = −(e(y) − ŷ);
    for k = L to 1 do
        // Compute gradients w.r.t. parameters;
        ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k−1}^T;
        ∇_{b_k} L(θ) = ∇_{a_k} L(θ);
        // Compute gradients w.r.t. layer below;
        ∇_{h_{k−1}} L(θ) = W_k^T (∇_{a_k} L(θ));
        // Compute gradients w.r.t. layer below (pre-activation);
        ∇_{a_{k−1}} L(θ) = ∇_{h_{k−1}} L(θ) ⊙ [..., g′(a_{k−1,j}), ...];
    end
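The backward pass can be sketched in plain Python and checked against a finite difference. Everything concrete here is an assumption for illustration: a tiny network with L = 2, logistic hidden units, softmax output, 0-based class labels, and arbitrary weights. The loop skips the last two steps for the first layer, since gradients w.r.t. the input are not needed for the parameter update.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, biases, x):
    hs, pre = [x], []
    for k, (W, b) in enumerate(zip(weights, biases)):
        a = [bi + sum(w * h for w, h in zip(row, hs[-1])) for row, bi in zip(W, b)]
        pre.append(a)
        if k < len(weights) - 1:              # hidden layers only
            hs.append([sigmoid(v) for v in a])
    return hs, pre, softmax(pre[-1])


def back_propagation(weights, hs, pre, label, y_hat):
    L = len(weights)
    grad_W, grad_b = [None] * L, [None] * L
    # Output gradient: -(e(label) - y_hat)
    grad_a = [p - (1.0 if i == label else 0.0) for i, p in enumerate(y_hat)]
    for k in range(L - 1, -1, -1):            # index k is layer k + 1 in the slides
        h_prev = hs[k]
        grad_W[k] = [[ga * hp for hp in h_prev] for ga in grad_a]
        grad_b[k] = list(grad_a)
        if k > 0:
            W = weights[k]
            grad_h = [sum(W[m][j] * grad_a[m] for m in range(len(W)))
                      for j in range(len(h_prev))]
            grad_a = [gh * sigmoid(a) * (1.0 - sigmoid(a))
                      for gh, a in zip(grad_h, pre[k - 1])]
    return grad_W, grad_b


weights = [[[0.1, -0.2], [0.4, 0.3]],
           [[0.5, -0.6], [0.7, 0.8], [-0.9, 1.0]]]
biases = [[0.0, 0.1], [0.2, -0.1, 0.0]]
x, label = [1.0, 2.0], 1

hs, pre, y_hat = forward(weights, biases, x)
grad_W, grad_b = back_propagation(weights, hs, pre, label, y_hat)


def loss_at(w00):
    """Loss as a function of the single weight W_1[0][0], all else fixed."""
    perturbed = [[[w00, -0.2], [0.4, 0.3]], weights[1]]
    return -math.log(forward(perturbed, biases, x)[2][label])


eps = 1e-6
numeric = (loss_at(0.1 + eps) - loss_at(0.1 - eps)) / (2 * eps)
gap = abs(numeric - grad_W[0][0][0])
```

Comparing one backpropagated entry with a finite difference, as `gap` does, is a cheap way to catch indexing mistakes in an implementation like this.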

Module 4.9: Derivative of the activation function

Now, the only thing we need to figure out is how to compute $g'$

Logistic function
$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
$$g'(z) = (-1) \frac{1}{(1 + e^{-z})^2} \frac{d}{dz} (1 + e^{-z})$$
$$= (-1) \frac{1}{(1 + e^{-z})^2} (-e^{-z})$$
$$= \frac{1}{1 + e^{-z}} \left( \frac{1 + e^{-z} - 1}{1 + e^{-z}} \right)$$
$$= g(z)(1 - g(z))$$

tanh
$$g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
$$g'(z) = \frac{(e^{z} + e^{-z}) \frac{d}{dz}(e^{z} - e^{-z}) - (e^{z} - e^{-z}) \frac{d}{dz}(e^{z} + e^{-z})}{(e^{z} + e^{-z})^2}$$
$$= \frac{(e^{z} + e^{-z})^2 - (e^{z} - e^{-z})^2}{(e^{z} + e^{-z})^2}$$
$$= 1 - \frac{(e^{z} - e^{-z})^2}{(e^{z} + e^{-z})^2}$$
$$= 1 - (g(z))^2$$
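Both closed forms are easy to verify numerically. The sketch below compares them with a central finite difference at an arbitrarily chosen point; the helper names are hypothetical.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_prime(z):
    g = logistic(z)
    return g * (1.0 - g)            # g'(z) = g(z)(1 - g(z))


def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2  # g'(z) = 1 - g(z)^2


def central_difference(f, z, eps=1e-6):
    return (f(z + eps) - f(z - eps)) / (2 * eps)


z = 0.7                             # arbitrary test point
logistic_gap = abs(logistic_prime(z) - central_difference(logistic, z))
tanh_gap = abs(tanh_prime(z) - central_difference(math.tanh, z))
```

Because both derivatives are expressed through g(z) itself, an implementation can reuse the activations stored during the forward pass instead of recomputing exponentials.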
