Lecture - 14 - FFNN

Uploaded by

suryapratp369

Lecture – 4

Feedforward Neural Networks, Backpropagation

Dr. Jagendra Singh

References/Acknowledgments
See the excellent videos by Hugo Larochelle on Backpropagation

Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4
Module 4.1: Feedforward Neural Networks (a.k.a.
multilayered network of neurons)

[Figure: feedforward network with inputs $x_1, x_2, \dots, x_n$, hidden layers $(a_1, h_1)$ and $(a_2, h_2)$, output pre-activation $a_3$ and output $h_L = \hat{y} = f(x)$; layer $i$ has parameters $W_i$, $b_i$]
The input to the network is an n-dimensional vector.
The network contains $L - 1$ hidden layers (2, in this case) having n neurons each.
Finally, there is one output layer containing k neurons (say, corresponding to k classes).
Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation ($a_i$ and $h_i$ are vectors).
The input layer can be called the 0-th layer and the output layer can be called the L-th layer.
$W_i \in \mathbb{R}^{n \times n}$ and $b_i \in \mathbb{R}^n$ are the weight and bias between layers $i - 1$ and $i$ ($0 < i < L$).
$W_L \in \mathbb{R}^{n \times k}$ and $b_L \in \mathbb{R}^k$ are the weight and bias between the last hidden layer and the output layer ($L = 3$ in this case).

The pre-activation at layer i is given by
$$a_i(x) = b_i + W_i h_{i-1}(x)$$
The activation at layer i is given by
$$h_i(x) = g(a_i(x))$$
where g is called the activation function (for example, logistic, tanh, linear, etc.).
The activation at the output layer is given by
$$f(x) = h_L(x) = O(a_L(x))$$
where O is the output activation function (for example, softmax, linear, etc.).
To simplify notation we will refer to $a_i(x)$ as $a_i$ and $h_i(x)$ as $h_i$.

The pre-activation at layer i is given by
$$a_i = b_i + W_i h_{i-1}$$
The activation at layer i is given by
$$h_i = g(a_i)$$
where g is called the activation function (for example, logistic, tanh, linear, etc.).
The activation at the output layer is given by
$$f(x) = h_L = O(a_L)$$
where O is the output activation function (for example, softmax, linear, etc.).

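The pre-activation and activation of a single layer can be sketched in a few lines of plain Python. This is only an illustration: the helper name `layer_forward`, the choice of the logistic function for g, and the numbers in `W1`, `b1` and `x` are assumptions made for the example, and each weight matrix is stored as a list of rows (one row per neuron of the layer).

```python
import math

def logistic(z):
    """Logistic activation g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, h_prev, g=logistic):
    """Return (a, h) for one layer: a = b + W h_prev and h = g(a) elementwise."""
    a = [b_i + sum(w_ij * h_j for w_ij, h_j in zip(row, h_prev))
         for row, b_i in zip(W, b)]
    h = [g(a_i) for a_i in a]
    return a, h

# Hypothetical layer with 2 inputs and 2 neurons (values chosen for illustration).
W1 = [[0.5, -0.5],
      [1.0, 1.0]]
b1 = [0.0, -1.0]
x = [1.0, 1.0]          # h_0 is the input x

a1, h1 = layer_forward(W1, b1, x)
print("a1 =", a1)
print("h1 =", h1)
```

Stacking calls to such a helper, with the output of one layer fed in as `h_prev` of the next, gives the composition $O(W_3\, g(W_2\, g(W_1 x + b_1) + b_2) + b_3)$.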
Data: $\{x_i, y_i\}_{i=1}^{N}$
Model:
$$\hat{y}_i = f(x_i) = O(W_3\, g(W_2\, g(W_1 x_i + b_1) + b_2) + b_3)$$
Parameters: $\theta = W_1, \dots, W_L, b_1, b_2, \dots, b_L$ ($L = 3$)
Algorithm: Gradient Descent with Backpropagation (we will see soon)
Objective/Loss/Error function: Say,
$$\min \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} (\hat{y}_{ij} - y_{ij})^2$$
In general, $\min \mathcal{L}(\theta)$, where $\mathcal{L}(\theta)$ is some function of the parameters.

Module 4.2: Learning Parameters of Feedforward
Neural Networks (Intuition)

The story so far...
We have introduced feedforward neural networks.
We are now interested in finding an algorithm for learning the parameters of this model.

Recall our gradient descent algorithm:
Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize $w_0, b_0$;
    while t++ < max_iterations do
        $w_{t+1} \leftarrow w_t - \eta \nabla w_t$;
        $b_{t+1} \leftarrow b_t - \eta \nabla b_t$;
    end

Recall our gradient descent algorithm. We can write it more concisely as
Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize $\theta_0 = [w_0, b_0]$;
    while t++ < max_iterations do
        $\theta_{t+1} \leftarrow \theta_t - \eta \nabla \theta_t$;
    end
where $\nabla \theta_t = \left[ \frac{\partial \mathcal{L}(\theta)}{\partial w_t}, \frac{\partial \mathcal{L}(\theta)}{\partial b_t} \right]^T$.
Now, in this feedforward neural network, instead of $\theta = [w, b]$ we have $\theta = [W_1, W_2, \dots, W_L, b_1, b_2, \dots, b_L]$.
We can still use the same algorithm for learning the parameters of our model.

With $\theta = [W_1, W_2, \dots, W_L, b_1, b_2, \dots, b_L]$ the algorithm becomes
Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize $\theta_0 = [W_1^0, \dots, W_L^0, b_1^0, \dots, b_L^0]$;
    while t++ < max_iterations do
        $\theta_{t+1} \leftarrow \theta_t - \eta \nabla \theta_t$;
    end
where $\nabla \theta_t = \left[ \frac{\partial \mathcal{L}(\theta)}{\partial W_{1,t}}, \dots, \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,t}}, \frac{\partial \mathcal{L}(\theta)}{\partial b_{1,t}}, \dots, \frac{\partial \mathcal{L}(\theta)}{\partial b_{L,t}} \right]^T$.

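The update rule $\theta_{t+1} \leftarrow \theta_t - \eta \nabla \theta_t$ can be tried out on a toy problem before any network is involved. The sketch below is an assumption-laden illustration, not the lecture's code: `gradient_descent` takes a caller-supplied gradient function, θ is a flat list of numbers, and the loss is a made-up quadratic $\sum_i (\theta_i - c_i)^2$ whose minimiser is known.

```python
def gradient_descent(grad, theta0, eta=0.1, max_iterations=1000):
    """Repeat theta <- theta - eta * grad(theta) for a fixed number of steps."""
    theta = list(theta0)
    for _ in range(max_iterations):
        g = grad(theta)
        theta = [t - eta * g_i for t, g_i in zip(theta, g)]
    return theta

# Toy loss (hypothetical): L(theta) = sum_i (theta_i - c_i)^2, minimised at theta = c.
target = [1.0, -2.0, 0.5]

def toy_grad(theta):
    """Gradient of the toy loss: 2 * (theta - c)."""
    return [2.0 * (t - c) for t, c in zip(theta, target)]

theta_star = gradient_descent(toy_grad, [0.0, 0.0, 0.0])
print(theta_star)  # close to the target vector
```

For a feedforward network the only change is that `grad` has to return the derivative of the loss with respect to every weight and bias, which is what backpropagation computes.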
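Except that now our $\nabla \theta$ looks much more nasty:
$$
\begin{bmatrix}
\frac{\partial \mathcal{L}(\theta)}{\partial W_{111}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{11n}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{211}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{21n}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,11}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,1k}} & \frac{\partial \mathcal{L}(\theta)}{\partial b_{11}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial b_{L1}} \\
\frac{\partial \mathcal{L}(\theta)}{\partial W_{121}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{12n}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{221}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{22n}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,21}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,2k}} & \frac{\partial \mathcal{L}(\theta)}{\partial b_{12}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial b_{L2}} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots & \vdots & & \vdots \\
\frac{\partial \mathcal{L}(\theta)}{\partial W_{1n1}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{1nn}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{2n1}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{2nn}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,n1}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,nk}} & \frac{\partial \mathcal{L}(\theta)}{\partial b_{1n}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial b_{Lk}}
\end{bmatrix}
$$
$\nabla \theta$ is thus composed of
$\nabla W_1, \nabla W_2, \dots, \nabla W_{L-1} \in \mathbb{R}^{n \times n}$, $\nabla W_L \in \mathbb{R}^{n \times k}$,
$\nabla b_1, \nabla b_2, \dots, \nabla b_{L-1} \in \mathbb{R}^n$ and $\nabla b_L \in \mathbb{R}^k$.
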
We need to answer two questions:
How to choose the loss function $\mathcal{L}(\theta)$?
How to compute $\nabla \theta$, which is composed of
$\nabla W_1, \nabla W_2, \dots, \nabla W_{L-1} \in \mathbb{R}^{n \times n}$, $\nabla W_L \in \mathbb{R}^{n \times k}$,
$\nabla b_1, \nabla b_2, \dots, \nabla b_{L-1} \in \mathbb{R}^n$ and $\nabla b_L \in \mathbb{R}^k$?

Module 4.3: Output Functions and Loss Functions

[Figure: neural network with $L - 1$ hidden layers; input $x_i$ with features such as isActor Damon, isDirector Nolan, ...; target $y_i = \{7.5, 8.2, 7.7\}$ (imdb Rating, Critics Rating, RT Rating)]
The choice of loss function depends on the problem at hand.
We will illustrate this with the help of two examples.
Consider our movie example again, but this time we are interested in predicting ratings.
Here $y_i \in \mathbb{R}^3$.
The loss function should capture how much $\hat{y}_i$ deviates from $y_i$.
If $y_i \in \mathbb{R}^n$ then the squared error loss can capture this deviation:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{3} (\hat{y}_{ij} - y_{ij})^2$$

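As a quick numerical illustration of the squared error loss, the sketch below evaluates it for a single movie. The target ratings are the ones from the slide; the predicted ratings in `Y_hat` and the function name `squared_error_loss` are made up for the example.

```python
def squared_error_loss(Y_hat, Y):
    """Mean over the N examples of the summed squared differences per example."""
    N = len(Y)
    total = sum((y_hat_ij - y_ij) ** 2
                for y_hat_i, y_i in zip(Y_hat, Y)
                for y_hat_ij, y_ij in zip(y_hat_i, y_i))
    return total / N

Y = [[7.5, 8.2, 7.7]]        # true ratings (imdb, Critics, RT) from the slide
Y_hat = [[7.0, 8.0, 8.0]]    # hypothetical network predictions

loss = squared_error_loss(Y_hat, Y)
print(loss)  # 0.25 + 0.04 + 0.09, up to floating-point rounding
```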
A related question: What should the output function 'O' be if $y_i \in \mathbb{R}$?
More specifically, can it be the logistic function?
No, because it restricts $\hat{y}_i$ to a value between 0 and 1, but we want $\hat{y}_i \in \mathbb{R}$.
So, in such cases it makes sense to have 'O' as a linear function:
$$f(x) = h_L = O(a_L) = W_O a_L + b_O$$
$\hat{y}_i = f(x_i)$ is no longer bounded between 0 and 1.

[Figure: neural network with $L - 1$ hidden layers; target $y = [1\ 0\ 0\ 0]$ over the classes Apple, Mango, Orange, Banana]
Now let us consider another problem for which a different loss function would be appropriate.
Suppose we want to classify an image into 1 of k classes.
Here again we could use the squared error loss to capture the deviation.
But can you think of a better function?

Notice that y is a probability distribution.
Therefore we should also ensure that $\hat{y}$ is a probability distribution.
What choice of the output activation 'O' will ensure this?
$$a_L = W_L h_{L-1} + b_L$$
$$\hat{y}_j = O(a_L)_j = \frac{e^{a_{L,j}}}{\sum_{i=1}^{k} e^{a_{L,i}}}$$
$O(a_L)_j$ is the j-th element of $\hat{y}$ and $a_{L,j}$ is the j-th element of the vector $a_L$.
This function is called the softmax function.

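A minimal sketch of the softmax function follows. One detail is an addition not found in the slides: the maximum entry is subtracted before exponentiating, which leaves the result mathematically unchanged (the common factor cancels in the ratio) but avoids overflow for large pre-activations. The input values are arbitrary.

```python
import math

def softmax(a):
    """Map a vector of pre-activations to a probability distribution."""
    m = max(a)                                # shift for numerical stability
    exps = [math.exp(a_j - m) for a_j in a]
    total = sum(exps)
    return [e / total for e in exps]

y_hat = softmax([2.0, 1.0, 0.1, -1.0])        # hypothetical a_L for k = 4 classes
print(y_hat, sum(y_hat))
```

Every entry of the result is positive and the entries sum to 1, so $\hat{y}$ can be read as a distribution over the k classes.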
Now that we have ensured that both $y$ and $\hat{y}$ are probability distributions, can you think of a function which captures the difference between them?
Cross-entropy:
$$\mathcal{L}(\theta) = -\sum_{c=1}^{k} y_c \log \hat{y}_c$$
Notice that
$$y_c = \begin{cases} 1 & \text{if } c = \ell \text{ (the true class label)} \\ 0 & \text{otherwise} \end{cases}$$
$$\therefore\ \mathcal{L}(\theta) = -\log \hat{y}_\ell$$

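The collapse of the cross-entropy sum to a single term can be checked numerically. In this sketch the predicted distribution `y_hat` is invented for the example; terms with $y_c = 0$ are skipped, which matches the convention that they contribute nothing to the sum.

```python
import math

def cross_entropy(y, y_hat):
    """Cross-entropy -sum_c y_c log(y_hat_c); terms with y_c = 0 contribute nothing."""
    return -sum(y_c * math.log(y_hat_c)
                for y_c, y_hat_c in zip(y, y_hat) if y_c > 0)

y = [1, 0, 0, 0]                  # one-hot target: the true class is the first one
y_hat = [0.7, 0.1, 0.1, 0.1]      # hypothetical softmax output

loss = cross_entropy(y, y_hat)
print(loss, -math.log(0.7))       # the two values agree
```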
So, for a classification problem (where you have to choose 1 of K classes), we use the following objective function:
$$\underset{\theta}{\text{minimize}}\ \mathcal{L}(\theta) = -\log \hat{y}_\ell$$
or
$$\underset{\theta}{\text{maximize}}\ -\mathcal{L}(\theta) = \log \hat{y}_\ell$$
But wait! Is $\hat{y}_\ell$ a function of $\theta = [W_1, W_2, \dots, W_L, b_1, b_2, \dots, b_L]$?
Yes, it is indeed a function of $\theta$:
$$\hat{y}_\ell = [O(W_3\, g(W_2\, g(W_1 x + b_1) + b_2) + b_3)]_\ell$$
What does $\hat{y}_\ell$ encode?
It is the probability that x belongs to the $\ell$-th class (bring it as close to 1).
$\log \hat{y}_\ell$ is called the log-likelihood of the data.

                       Outputs
                       Real Values      Probabilities
Output Activation      Linear           Softmax
Loss Function          Squared Error    Cross Entropy

Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often.
For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy.

Module 4.4: Backpropagation (Intuition)

[Figure: network with inputs $x_1, x_2, \dots, x_d$ and output $\hat{y} = f(x)$; the weights $W_{111}$, $W_{112}$, $W_{211}$ and $W_{311}$ are marked]
Let us focus on this one weight ($W_{112}$).
To learn this weight using SGD we need a formula for $\frac{\partial \mathcal{L}(\theta)}{\partial W_{112}}$.
We will see how to calculate this.
Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize $\theta_0$;
    while t++ < max_iterations do
        $\theta_{t+1} \leftarrow \theta_t - \eta \nabla \theta_t$;
    end

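[Figure: deep but thin network $x_1 \to a_{11} \to h_{11} \to a_{21} \to h_{21} \to \dots \to a_{L1} \to \hat{y} = f(x) \to \mathcal{L}(\theta)$, with weights $W_{111}, W_{211}, \dots, W_{L11}$]
First let us take the simple case when we have a deep but thin network.
In this case it is easy to find the derivative by chain rule.
$$\frac{\partial \mathcal{L}(\theta)}{\partial W_{111}} = \frac{\partial \mathcal{L}(\theta)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_{L1}} \frac{\partial a_{L1}}{\partial h_{21}} \frac{\partial h_{21}}{\partial a_{21}} \frac{\partial a_{21}}{\partial h_{11}} \frac{\partial h_{11}}{\partial a_{11}} \frac{\partial a_{11}}{\partial W_{111}}$$
$$\frac{\partial \mathcal{L}(\theta)}{\partial W_{111}} = \frac{\partial \mathcal{L}(\theta)}{\partial h_{11}} \frac{\partial h_{11}}{\partial W_{111}} \quad \text{(just compressing the chain rule)}$$
$$\frac{\partial \mathcal{L}(\theta)}{\partial W_{211}} = \frac{\partial \mathcal{L}(\theta)}{\partial h_{21}} \frac{\partial h_{21}}{\partial W_{211}}$$
$$\frac{\partial \mathcal{L}(\theta)}{\partial W_{L11}} = \frac{\partial \mathcal{L}(\theta)}{\partial a_{L1}} \frac{\partial a_{L1}}{\partial W_{L11}}$$
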
Let us see an intuitive explanation of backpropagation before we get into the
mathematical details

We get a certain loss ($-\log \hat{y}_\ell$) at the output and we try to figure out who is responsible for this loss.
So, we talk to the output layer and say “Hey! You are not producing the desired output, better take responsibility”.
The output layer says “Well, I take responsibility for my part but please understand that I am only as good as the hidden layer and weights below me”. After all,
$$f(x) = \hat{y} = O(W_L h_{L-1} + b_L)$$

So, we talk to $W_L$, $b_L$ and $h_L$ and ask them “What is wrong with you?”
$W_L$ and $b_L$ take full responsibility but $h_L$ says “Well, please understand that I am only as good as the pre-activation layer”.
The pre-activation layer in turn says that it is only as good as the hidden layer and weights below it.
We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e. all the parameters of the model).
But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do):
$$\underbrace{\frac{\partial \mathcal{L}(\theta)}{\partial W_{111}}}_{\text{Talk to the weight directly}} = \underbrace{\frac{\partial \mathcal{L}(\theta)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_3}}_{\text{Talk to the output layer}} \underbrace{\frac{\partial a_3}{\partial h_2} \frac{\partial h_2}{\partial a_2}}_{\text{Talk to the previous hidden layer}} \underbrace{\frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1}}_{\text{Talk to the previous hidden layer}} \underbrace{\frac{\partial a_1}{\partial W_{111}}}_{\text{and now talk to the weights}}$$

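Quantities of interest (roadmap for the remaining part):
Gradient w.r.t. output units
Gradient w.r.t. hidden units
Gradient w.r.t. weights and biases
$$\underbrace{\frac{\partial \mathcal{L}(\theta)}{\partial W_{111}}}_{\text{Talk to the weight directly}} = \underbrace{\frac{\partial \mathcal{L}(\theta)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial a_3}}_{\text{Talk to the output layer}} \underbrace{\frac{\partial a_3}{\partial h_2} \frac{\partial h_2}{\partial a_2}}_{\text{Talk to the previous hidden layer}} \underbrace{\frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1}}_{\text{Talk to the previous hidden layer}} \underbrace{\frac{\partial a_1}{\partial W_{111}}}_{\text{and now talk to the weights}}$$
Our focus is on cross-entropy loss and softmax output.
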
Module 4.5: Backpropagation: Computing Gradients
w.r.t. the Output Units

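Let us first consider the partial derivative w.r.t. the i-th output.
$$\mathcal{L}(\theta) = -\log \hat{y}_\ell \quad (\ell = \text{true class label})$$
$$\frac{\partial}{\partial \hat{y}_i}(\mathcal{L}(\theta)) = \frac{\partial}{\partial \hat{y}_i}(-\log \hat{y}_\ell) = \begin{cases} -\dfrac{1}{\hat{y}_\ell} & \text{if } i = \ell \\ 0 & \text{otherwise} \end{cases}$$
More compactly,
$$\frac{\partial}{\partial \hat{y}_i}(\mathcal{L}(\theta)) = -\frac{\mathbb{1}_{(i=\ell)}}{\hat{y}_\ell}$$
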
$$\frac{\partial}{\partial \hat{y}_i}(\mathcal{L}(\theta)) = -\frac{\mathbb{1}_{(\ell=i)}}{\hat{y}_\ell}$$
We can now talk about the gradient w.r.t. the vector $\hat{y}$:
$$\nabla_{\hat{y}} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial \hat{y}_1} \\ \vdots \\ \frac{\partial \mathcal{L}(\theta)}{\partial \hat{y}_k} \end{bmatrix} = -\frac{1}{\hat{y}_\ell} \begin{bmatrix} \mathbb{1}_{\ell=1} \\ \mathbb{1}_{\ell=2} \\ \vdots \\ \mathbb{1}_{\ell=k} \end{bmatrix} = -\frac{1}{\hat{y}_\ell}\, e(\ell)$$
where $e(\ell)$ is a k-dimensional vector whose $\ell$-th element is 1 and all other elements are 0.

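What we are actually interested in is
$$\frac{\partial \mathcal{L}(\theta)}{\partial a_{Li}} = \frac{\partial (-\log \hat{y}_\ell)}{\partial a_{Li}} = \frac{\partial (-\log \hat{y}_\ell)}{\partial \hat{y}_\ell} \frac{\partial \hat{y}_\ell}{\partial a_{Li}}$$
Does $\hat{y}_\ell$ depend on $a_{Li}$? Indeed, it does, because every pre-activation appears in the softmax denominator:
$$\hat{y}_\ell = \frac{\exp(a_{L\ell})}{\sum_{i'} \exp(a_{Li'})}$$
Having established this, we will now derive the full expression on the next slide.
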
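The derivation uses the quotient rule:
$$\frac{\partial}{\partial x} \frac{g(x)}{h(x)} = \frac{\partial g(x)}{\partial x} \frac{1}{h(x)} - \frac{g(x)}{h(x)^2} \frac{\partial h(x)}{\partial x}$$
$$
\begin{aligned}
\frac{\partial}{\partial a_{Li}} \left(-\log \hat{y}_\ell\right)
&= \frac{-1}{\hat{y}_\ell} \frac{\partial}{\partial a_{Li}} \hat{y}_\ell \\
&= \frac{-1}{\hat{y}_\ell} \frac{\partial}{\partial a_{Li}} \mathrm{softmax}(a_L)_\ell \\
&= \frac{-1}{\hat{y}_\ell} \frac{\partial}{\partial a_{Li}} \frac{\exp(a_L)_\ell}{\sum_{i'} \exp(a_L)_{i'}} \\
&= \frac{-1}{\hat{y}_\ell} \left( \frac{\frac{\partial}{\partial a_{Li}} \exp(a_L)_\ell}{\sum_{i'} \exp(a_L)_{i'}} - \frac{\exp(a_L)_\ell \, \frac{\partial}{\partial a_{Li}} \sum_{i'} \exp(a_L)_{i'}}{\left(\sum_{i'} \exp(a_L)_{i'}\right)^2} \right) \\
&= \frac{-1}{\hat{y}_\ell} \left( \frac{\mathbb{1}_{(\ell=i)} \exp(a_L)_\ell}{\sum_{i'} \exp(a_L)_{i'}} - \frac{\exp(a_L)_\ell}{\sum_{i'} \exp(a_L)_{i'}} \frac{\exp(a_L)_i}{\sum_{i'} \exp(a_L)_{i'}} \right) \\
&= \frac{-1}{\hat{y}_\ell} \left( \mathbb{1}_{(\ell=i)}\, \mathrm{softmax}(a_L)_\ell - \mathrm{softmax}(a_L)_\ell\, \mathrm{softmax}(a_L)_i \right) \\
&= \frac{-1}{\hat{y}_\ell} \left( \mathbb{1}_{(\ell=i)}\, \hat{y}_\ell - \hat{y}_\ell\, \hat{y}_i \right) \\
&= -\left( \mathbb{1}_{(\ell=i)} - \hat{y}_i \right)
\end{aligned}
$$
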
So far we have derived the partial derivative w.r.t. the i-th element of $a_L$:
$$\frac{\partial \mathcal{L}(\theta)}{\partial a_{L,i}} = -(\mathbb{1}_{\ell=i} - \hat{y}_i)$$
We can now write the gradient w.r.t. the vector $a_L$:
$$\nabla_{a_L} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial a_{L1}} \\ \vdots \\ \frac{\partial \mathcal{L}(\theta)}{\partial a_{Lk}} \end{bmatrix} = \begin{bmatrix} -(\mathbb{1}_{\ell=1} - \hat{y}_1) \\ -(\mathbb{1}_{\ell=2} - \hat{y}_2) \\ \vdots \\ -(\mathbb{1}_{\ell=k} - \hat{y}_k) \end{bmatrix} = -(e(\ell) - \hat{y})$$

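The result $\nabla_{a_L} \mathcal{L}(\theta) = -(e(\ell) - \hat{y})$ can be sanity-checked against a central finite-difference approximation of the loss. The sketch below is an illustration under assumptions: the pre-activation vector, the label (0-based here) and the step size `eps` are arbitrary choices, and the max-subtraction inside `softmax` is a stability trick that is not part of the slides.

```python
import math

def softmax(a):
    m = max(a)                                   # stability shift, result unchanged
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def loss(a, label):
    """Cross-entropy loss -log y_hat[label] as a function of the pre-activation a."""
    return -math.log(softmax(a)[label])

def grad_a(a, label):
    """Analytic gradient -(e(label) - y_hat), written as y_hat - e(label)."""
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(softmax(a))]

def numerical_grad(a, label, eps=1e-6):
    """Central differences, perturbing one coordinate of a at a time."""
    grads = []
    for i in range(len(a)):
        a_plus, a_minus = list(a), list(a)
        a_plus[i] += eps
        a_minus[i] -= eps
        grads.append((loss(a_plus, label) - loss(a_minus, label)) / (2 * eps))
    return grads

a_L = [0.2, -1.3, 0.7]      # hypothetical output pre-activations (k = 3)
label = 2                   # hypothetical true class, 0-based index
analytic = grad_a(a_L, label)
numeric = numerical_grad(a_L, label)
max_diff = max(abs(x - y) for x, y in zip(analytic, numeric))
print(analytic, numeric, max_diff)
```

The two gradients should agree to several decimal places, and the analytic gradient sums to zero because both $e(\ell)$ and $\hat{y}$ sum to 1.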
Module 4.6: Backpropagation: Computing Gradients
w.r.t. Hidden Units

Chain rule along multiple paths: If a function $p(z)$ can be written as a function of intermediate results $q_i(z)$ then we have:
$$\frac{\partial p(z)}{\partial z} = \sum_{m} \frac{\partial p(z)}{\partial q_m(z)} \frac{\partial q_m(z)}{\partial z}$$
In our case:
$p(z)$ is the loss function $\mathcal{L}(\theta)$
$z = h_{ij}$
$q_m(z) = a_{Lm}$

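Since $a_{i+1} = W_{i+1} h_i + b_{i+1}$, we have
$$\frac{\partial \mathcal{L}(\theta)}{\partial h_{ij}} = \sum_{m=1}^{k} \frac{\partial \mathcal{L}(\theta)}{\partial a_{i+1,m}} \frac{\partial a_{i+1,m}}{\partial h_{ij}} = \sum_{m=1}^{k} \frac{\partial \mathcal{L}(\theta)}{\partial a_{i+1,m}} W_{i+1,m,j}$$
Now consider these two vectors,
$$\nabla_{a_{i+1}} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial a_{i+1,1}} \\ \vdots \\ \frac{\partial \mathcal{L}(\theta)}{\partial a_{i+1,k}} \end{bmatrix}; \qquad W_{i+1,\cdot,j} = \begin{bmatrix} W_{i+1,1,j} \\ \vdots \\ W_{i+1,k,j} \end{bmatrix}$$
$W_{i+1,\cdot,j}$ is the j-th column of $W_{i+1}$; see that,
$$(W_{i+1,\cdot,j})^T \nabla_{a_{i+1}} \mathcal{L}(\theta) = \sum_{m=1}^{k} \frac{\partial \mathcal{L}(\theta)}{\partial a_{i+1,m}} W_{i+1,m,j}$$
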
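We have,
$$\frac{\partial \mathcal{L}(\theta)}{\partial h_{ij}} = (W_{i+1,\cdot,j})^T \nabla_{a_{i+1}} \mathcal{L}(\theta)$$
We can now write the gradient w.r.t. $h_i$:
$$\nabla_{h_i} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial h_{i1}} \\ \frac{\partial \mathcal{L}(\theta)}{\partial h_{i2}} \\ \vdots \\ \frac{\partial \mathcal{L}(\theta)}{\partial h_{in}} \end{bmatrix} = \begin{bmatrix} (W_{i+1,\cdot,1})^T \nabla_{a_{i+1}} \mathcal{L}(\theta) \\ (W_{i+1,\cdot,2})^T \nabla_{a_{i+1}} \mathcal{L}(\theta) \\ \vdots \\ (W_{i+1,\cdot,n})^T \nabla_{a_{i+1}} \mathcal{L}(\theta) \end{bmatrix} = (W_{i+1})^T (\nabla_{a_{i+1}} \mathcal{L}(\theta))$$
We are almost done except that we do not know how to calculate $\nabla_{a_{i+1}} \mathcal{L}(\theta)$ for $i < L - 1$.
We will see how to compute that.
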
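$$\nabla_{a_i} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial a_{i1}} \\ \vdots \\ \frac{\partial \mathcal{L}(\theta)}{\partial a_{in}} \end{bmatrix}$$
$$\frac{\partial \mathcal{L}(\theta)}{\partial a_{ij}} = \frac{\partial \mathcal{L}(\theta)}{\partial h_{ij}} \frac{\partial h_{ij}}{\partial a_{ij}} = \frac{\partial \mathcal{L}(\theta)}{\partial h_{ij}}\, g'(a_{ij}) \qquad [\because h_{ij} = g(a_{ij})]$$
$$\nabla_{a_i} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial h_{i1}}\, g'(a_{i1}) \\ \vdots \\ \frac{\partial \mathcal{L}(\theta)}{\partial h_{in}}\, g'(a_{in}) \end{bmatrix} = \nabla_{h_i} \mathcal{L}(\theta) \odot [\dots, g'(a_{ik}), \dots]$$
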
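The two steps for the hidden units, $\nabla_{h_i} \mathcal{L} = (W_{i+1})^T \nabla_{a_{i+1}} \mathcal{L}$ followed by $\nabla_{a_i} \mathcal{L} = \nabla_{h_i} \mathcal{L} \odot g'(a_i)$, can be sketched as below. All numbers are invented for the example, g is assumed to be the logistic function, and the weight matrix is stored as a list of rows so that row m holds the weights into unit m of the layer above.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_prime(z):
    s = logistic(z)
    return s * (1.0 - s)

def grad_h(W_next, grad_a_next):
    """(W_{i+1})^T grad_a_{i+1}: entry j sums W_next[m][j] * grad_a_next[m] over m."""
    n = len(W_next[0])
    return [sum(W_next[m][j] * grad_a_next[m] for m in range(len(W_next)))
            for j in range(n)]

def grad_a(grad_h_i, a_i, g_prime=logistic_prime):
    """Elementwise product of grad_h_i with g'(a_i)."""
    return [gh * g_prime(a_ij) for gh, a_ij in zip(grad_h_i, a_i)]

# Hypothetical values: layer i has 2 units, layer i+1 has 3 units.
W2 = [[1.0, 2.0],
      [0.0, -1.0],
      [3.0, 0.5]]
grad_a2 = [0.1, -0.2, 0.3]
a1 = [0.0, 1.0]

grad_h1 = grad_h(W2, grad_a2)
grad_a1 = grad_a(grad_h1, a1)
print(grad_h1, grad_a1)
```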
Module 4.7: Backpropagation: Computing Gradients
w.r.t. Parameters

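Recall that,
$$a_k = b_k + W_k h_{k-1}$$
$$\frac{\partial a_{ki}}{\partial W_{kij}} = h_{k-1,j}$$
$$\frac{\partial \mathcal{L}(\theta)}{\partial W_{kij}} = \frac{\partial \mathcal{L}(\theta)}{\partial a_{ki}} \frac{\partial a_{ki}}{\partial W_{kij}} = \frac{\partial \mathcal{L}(\theta)}{\partial a_{ki}}\, h_{k-1,j}$$
$$\nabla_{W_k} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial W_{k11}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{k12}} & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{k1n}} \\ \vdots & \vdots & \ddots & \vdots \\ \dots & \dots & \dots & \frac{\partial \mathcal{L}(\theta)}{\partial W_{knn}} \end{bmatrix}$$
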
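Let's take a simple example of a $W_k \in \mathbb{R}^{3 \times 3}$ and see what each entry looks like:
$$\nabla_{W_k} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial W_{k11}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{k12}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{k13}} \\ \frac{\partial \mathcal{L}(\theta)}{\partial W_{k21}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{k22}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{k23}} \\ \frac{\partial \mathcal{L}(\theta)}{\partial W_{k31}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{k32}} & \frac{\partial \mathcal{L}(\theta)}{\partial W_{k33}} \end{bmatrix}, \qquad \frac{\partial \mathcal{L}(\theta)}{\partial W_{kij}} = \frac{\partial \mathcal{L}(\theta)}{\partial a_{ki}} \frac{\partial a_{ki}}{\partial W_{kij}}$$
$$\nabla_{W_k} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial a_{k1}} h_{k-1,1} & \frac{\partial \mathcal{L}(\theta)}{\partial a_{k1}} h_{k-1,2} & \frac{\partial \mathcal{L}(\theta)}{\partial a_{k1}} h_{k-1,3} \\ \frac{\partial \mathcal{L}(\theta)}{\partial a_{k2}} h_{k-1,1} & \frac{\partial \mathcal{L}(\theta)}{\partial a_{k2}} h_{k-1,2} & \frac{\partial \mathcal{L}(\theta)}{\partial a_{k2}} h_{k-1,3} \\ \frac{\partial \mathcal{L}(\theta)}{\partial a_{k3}} h_{k-1,1} & \frac{\partial \mathcal{L}(\theta)}{\partial a_{k3}} h_{k-1,2} & \frac{\partial \mathcal{L}(\theta)}{\partial a_{k3}} h_{k-1,3} \end{bmatrix} = \nabla_{a_k} \mathcal{L}(\theta) \cdot h_{k-1}^T$$
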
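Finally, coming to the biases
$$a_{ki} = b_{ki} + \sum_{j} W_{kij} h_{k-1,j}$$
$$\frac{\partial \mathcal{L}(\theta)}{\partial b_{ki}} = \frac{\partial \mathcal{L}(\theta)}{\partial a_{ki}} \frac{\partial a_{ki}}{\partial b_{ki}} = \frac{\partial \mathcal{L}(\theta)}{\partial a_{ki}}$$
We can now write the gradient w.r.t. the vector $b_k$:
$$\nabla_{b_k} \mathcal{L}(\theta) = \begin{bmatrix} \frac{\partial \mathcal{L}(\theta)}{\partial a_{k1}} \\ \frac{\partial \mathcal{L}(\theta)}{\partial a_{k2}} \\ \vdots \\ \frac{\partial \mathcal{L}(\theta)}{\partial a_{kn}} \end{bmatrix} = \nabla_{a_k} \mathcal{L}(\theta)$$
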
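In code, the parameter gradients of one layer are an outer product and a copy. The sketch below uses made-up values for $\nabla_{a_k} \mathcal{L}(\theta)$ and $h_{k-1}$ in the 3 × 3 setting of the example; the helper name `outer` is an assumption.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows: entry (i, j) is u[i] * v[j]."""
    return [[u_i * v_j for v_j in v] for u_i in u]

grad_a_k = [0.1, -0.2, 0.3]      # hypothetical gradient w.r.t. a_k
h_prev = [1.0, 2.0, 3.0]         # hypothetical activation h_{k-1}

grad_W_k = outer(grad_a_k, h_prev)   # grad_a_k h_{k-1}^T, same shape as W_k
grad_b_k = list(grad_a_k)            # the bias gradient equals grad_a_k

for row in grad_W_k:
    print(row)
print(grad_b_k)
```

Entry (i, j) of `grad_W_k` is $\frac{\partial \mathcal{L}(\theta)}{\partial a_{ki}} h_{k-1,j}$, matching the per-entry formula.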
Module 4.8: Backpropagation: Pseudo code

Finally, we have all the pieces of the puzzle:
$\nabla_{a_L} \mathcal{L}(\theta)$ (gradient w.r.t. output layer)
$\nabla_{h_k} \mathcal{L}(\theta), \nabla_{a_k} \mathcal{L}(\theta)$ (gradient w.r.t. hidden layers, $1 \le k < L$)
$\nabla_{W_k} \mathcal{L}(\theta), \nabla_{b_k} \mathcal{L}(\theta)$ (gradient w.r.t. weights and biases, $1 \le k \le L$)
We can now write the full learning algorithm.

Algorithm: gradient_descent()
    t ← 0;
    max_iterations ← 1000;
    Initialize $\theta_0 = [W_1^0, \dots, W_L^0, b_1^0, \dots, b_L^0]$;
    while t++ < max_iterations do
        $h_1, h_2, \dots, h_{L-1}, a_1, a_2, \dots, a_L, \hat{y}$ = forward_propagation($\theta_t$);
        $\nabla \theta_t$ = backward_propagation($h_1, h_2, \dots, h_{L-1}, a_1, a_2, \dots, a_L, y, \hat{y}$);
        $\theta_{t+1} \leftarrow \theta_t - \eta \nabla \theta_t$;
    end

Algorithm: forward_propagation($\theta$)
    for k = 1 to L − 1 do
        $a_k = b_k + W_k h_{k-1}$;
        $h_k = g(a_k)$;
    end
    $a_L = b_L + W_L h_{L-1}$;
    $\hat{y} = O(a_L)$;

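A runnable version of the forward pass might look like the sketch below. It assumes a logistic hidden activation g and a softmax output O, takes $h_0 = x$, and stores each $W_k$ as a list of rows; the tiny network (2 inputs, one hidden layer of 2 units, 3 classes) and all of its numbers are hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(a):
    m = max(a)                               # stability shift, result unchanged
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, b, h):
    """Pre-activation a = b + W h."""
    return [b_i + sum(w * h_j for w, h_j in zip(row, h)) for row, b_i in zip(W, b)]

def forward_propagation(x, weights, biases):
    """Return hs = [h_0, ..., h_{L-1}], as_ = [a_1, ..., a_L] and y_hat."""
    hs, as_ = [list(x)], []
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers 1 .. L-1
        a = affine(W, b, hs[-1])
        as_.append(a)
        hs.append([logistic(v) for v in a])
    a_L = affine(weights[-1], biases[-1], hs[-1])  # output layer
    as_.append(a_L)
    return hs, as_, softmax(a_L)

weights = [[[0.1, -0.2], [0.4, 0.3]],
           [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.8]]]
biases = [[0.0, 0.1], [0.0, 0.0, 0.0]]

hs, as_, y_hat = forward_propagation([1.0, 2.0], weights, biases)
print(as_[0], y_hat)
```

Keeping every $a_k$ and $h_k$ around is deliberate: the backward pass needs them to form the gradients.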
Just do a forward propagation and compute all $h_i$'s, $a_i$'s, and $\hat{y}$.
Algorithm: backward_propagation($h_1, h_2, \dots, h_{L-1}, a_1, a_2, \dots, a_L, y, \hat{y}$)
    // Compute output gradient;
    $\nabla_{a_L} \mathcal{L}(\theta) = -(e(y) - \hat{y})$;
    for k = L to 1 do
        // Compute gradients w.r.t. parameters;
        $\nabla_{W_k} \mathcal{L}(\theta) = \nabla_{a_k} \mathcal{L}(\theta)\, h_{k-1}^T$;
        $\nabla_{b_k} \mathcal{L}(\theta) = \nabla_{a_k} \mathcal{L}(\theta)$;
        // Compute gradients w.r.t. layer below;
        $\nabla_{h_{k-1}} \mathcal{L}(\theta) = W_k^T (\nabla_{a_k} \mathcal{L}(\theta))$;
        // Compute gradients w.r.t. layer below (pre-activation);
        $\nabla_{a_{k-1}} \mathcal{L}(\theta) = \nabla_{h_{k-1}} \mathcal{L}(\theta) \odot [\dots, g'(a_{k-1,j}), \dots]$;
    end

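Putting the pieces together, a compact end-to-end sketch of the backward pass is given below, together with a finite-difference check on one weight. It is an illustration under assumptions rather than the lecture's implementation: hidden units use the logistic function, the output is softmax with cross-entropy loss, labels are 0-based, weight matrices are lists of rows, and the network and its numbers are invented. The gradients for the layer below are only computed while a hidden layer remains, since $h_0 = x$ has no pre-activation.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_prime(z):
    s = logistic(z)
    return s * (1.0 - s)

def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def forward_propagation(x, Ws, bs):
    """Return hs = [h_0, ..., h_{L-1}], as_ = [a_1, ..., a_L] and y_hat."""
    hs, as_ = [list(x)], []
    for k, (W, b) in enumerate(zip(Ws, bs)):
        a = [b_i + sum(w * h for w, h in zip(row, hs[-1])) for row, b_i in zip(W, b)]
        as_.append(a)
        if k < len(Ws) - 1:                      # hidden layers only
            hs.append([logistic(v) for v in a])
    return hs, as_, softmax(as_[-1])

def backward_propagation(hs, as_, Ws, label, y_hat):
    """Return (grad_Ws, grad_bs) for the loss -log y_hat[label]."""
    # Output gradient: -(e(label) - y_hat).
    grad_a = [p - (1.0 if i == label else 0.0) for i, p in enumerate(y_hat)]
    grad_Ws, grad_bs = [None] * len(Ws), [None] * len(Ws)
    for k in range(len(Ws) - 1, -1, -1):         # list index k holds layer k + 1
        grad_Ws[k] = [[g * h for h in hs[k]] for g in grad_a]   # outer product
        grad_bs[k] = list(grad_a)
        if k > 0:
            # Gradient w.r.t. the activations below: W^T grad_a.
            grad_h = [sum(Ws[k][m][j] * grad_a[m] for m in range(len(grad_a)))
                      for j in range(len(hs[k]))]
            # Gradient w.r.t. the pre-activations below: grad_h * g'(a).
            grad_a = [gh * logistic_prime(a) for gh, a in zip(grad_h, as_[k - 1])]
    return grad_Ws, grad_bs

def loss(x, Ws, bs, label):
    return -math.log(forward_propagation(x, Ws, bs)[2][label])

Ws = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.8]]]
bs = [[0.0, 0.1], [0.0, 0.0, 0.0]]
x, label = [1.0, 2.0], 2

hs, as_, y_hat = forward_propagation(x, Ws, bs)
grad_Ws, grad_bs = backward_propagation(hs, as_, Ws, label, y_hat)

# Central finite difference on one first-layer weight as a sanity check.
eps = 1e-6
Ws[0][0][1] += eps
loss_up = loss(x, Ws, bs, label)
Ws[0][0][1] -= 2 * eps
loss_down = loss(x, Ws, bs, label)
Ws[0][0][1] += eps                               # restore the weight
numeric = (loss_up - loss_down) / (2 * eps)
print(grad_Ws[0][0][1], numeric)
```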
Module 4.9: Derivative of the activation function

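Now, the only thing we need to figure out is how to compute $g'$.
Logistic function:
$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
$$
\begin{aligned}
g'(z) &= (-1) \frac{1}{(1 + e^{-z})^2} \frac{d}{dz}(1 + e^{-z}) \\
&= (-1) \frac{1}{(1 + e^{-z})^2} (-e^{-z}) \\
&= \frac{1}{1 + e^{-z}} \left( \frac{1 + e^{-z} - 1}{1 + e^{-z}} \right) \\
&= g(z)(1 - g(z))
\end{aligned}
$$
tanh:
$$g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
$$
\begin{aligned}
g'(z) &= \frac{(e^{z} + e^{-z}) \frac{d}{dz}(e^{z} - e^{-z}) - (e^{z} - e^{-z}) \frac{d}{dz}(e^{z} + e^{-z})}{(e^{z} + e^{-z})^2} \\
&= \frac{(e^{z} + e^{-z})^2 - (e^{z} - e^{-z})^2}{(e^{z} + e^{-z})^2} \\
&= 1 - \frac{(e^{z} - e^{-z})^2}{(e^{z} + e^{-z})^2} \\
&= 1 - (g(z))^2
\end{aligned}
$$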
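Both closed forms are easy to confirm numerically. The sketch below compares them with a central finite-difference estimate at an arbitrarily chosen point `z`; the helper `numerical_derivative` and the step size are assumptions for the example.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_prime(z):
    """g'(z) = g(z) (1 - g(z)) for the logistic function."""
    g = logistic(z)
    return g * (1.0 - g)

def tanh_prime(z):
    """g'(z) = 1 - g(z)^2 for tanh."""
    return 1.0 - math.tanh(z) ** 2

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = 0.3   # arbitrary test point
err_logistic = abs(logistic_prime(z) - numerical_derivative(logistic, z))
err_tanh = abs(tanh_prime(z) - numerical_derivative(math.tanh, z))
print(err_logistic, err_tanh)   # both should be tiny
```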
