XCS224N_Module2_Slides
Christopher Manning
Lecture 3: Neural net learning: Gradients by hand (matrix calculus)
and algorithmically (the backpropagation algorithm)
Named Entity Recognition (NER)
• The task: find and classify names in text, for example, person, location, and organization names
• Possible uses:
• Tracking mentions of particular entities in documents
• For question answering, answers are usually named entities
• Often followed by Named Entity Linking/Canonicalization into Knowledge Base
3
Simple NER: Window classification using binary logistic classifier
• Idea: classify each word in its context window of neighboring words
• Train logistic classifier on hand-labeled data to classify center word {yes/no} for each
class based on a concatenation of word vectors in a window
• Really, we usually use a multi-class softmax classifier, but we're trying to keep it simple here
• Example: Classify “Paris” as +/– location in context of sentence with window length 2:
J_t(\theta) = \sigma(s) = \frac{1}{1 + e^{-s}}

(the predicted model probability of the class, e.g., "+")

We train with stochastic gradient descent, i.e., for each parameter:

\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j^{\text{old}}}
In deep learning, we update the data representation (e.g., word vectors) too!
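As a concrete, illustrative sketch of this setup, here is a tiny NumPy version of the binary window classifier; the sentence, window size, vector dimension, and the purely linear score are simplifying assumptions, not the lecture's exact model:

import numpy as np

d, radius = 4, 2                                   # word-vector dimension and window radius (assumed)
words = ["museums", "in", "Paris", "are", "amazing"]
vectors = {w: np.random.randn(d) for w in words}   # toy word vectors

# Concatenate the word vectors of the center word ("Paris") and its +/- 2 neighbors
x = np.concatenate([vectors[words[i]] for i in range(2 - radius, 2 + radius + 1)])

theta = np.random.randn(x.size)                    # classifier parameters (simple linear score)
s = theta @ x                                      # score that "Paris" is a location
prob = 1.0 / (1.0 + np.exp(-s))                    # J_t(theta) = sigma(s)
print(prob)

# One stochastic gradient descent step on the cross-entropy loss for a positive example (y = 1)
alpha = 0.1
grad = (prob - 1.0) * x                            # d(loss)/d(theta) for the logistic loss
theta = theta - alpha * grad                       # theta_new = theta_old - alpha * gradient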
7
Computing Gradients by Hand
• Matrix calculus: Fully vectorized gradients
• “Multivariable calculus is just like single-variable calculus if you use matrices”
• Much faster and more useful than non-vectorized gradients
• But doing a non-vectorized gradient can be good for intuition; recall the first
lecture for an example
• Lecture notes and matrix calculus notes cover this material in more detail
• You might also review Math 51, which has a new online textbook:
http://web.stanford.edu/class/math51/textbook.html
or maybe you’re luckier if you did Engr 108
8
Gradients
• Given a function with 1 output and 1 input
f(x) = x^3
• Its gradient (slope) is its derivative:
\frac{df}{dx} = 3x^2
“How much will the output change if we change the input a bit?”
At x = 1 it changes about 3 times as much: 1.01^3 = 1.03
At x = 4 it changes about 48 times as much: 4.01^3 = 64.48
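These numbers are easy to verify with a quick finite-difference check (a throwaway sketch, not from the slides):

f = lambda x: x ** 3
for x in (1.0, 4.0):
    # how much the output moves when the input moves by 0.01
    print(x, (f(x + 0.01) - f(x)) / 0.01, 3 * x ** 2)   # about 3 at x = 1, about 48 at x = 4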
9
Gradients
• Given a function with 1 output and n inputs
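For f : R^n -> R (one output, n inputs), the gradient is the vector of partial derivatives, one per input:

\frac{\partial f}{\partial \mathbf{x}} = \left[ \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_n} \right]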
10
Jacobian Matrix: Generalization of the Gradient
• Given a function with m outputs and n inputs
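For f : R^n -> R^m (m outputs, n inputs), the Jacobian is the m x n matrix of all partial derivatives:

\left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)_{ij} = \frac{\partial f_i}{\partial x_j}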
11
Chain Rule
• For composition of one-variable functions: multiply derivatives
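For example, if z = g(y) and y = h(x), then

\frac{dz}{dx} = \frac{dz}{dy} \, \frac{dy}{dx}

The same pattern, with Jacobians in place of single-variable derivatives, handles vector-valued functions, which is what the next slides use.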
12
Example Jacobian: Elementwise activation Function
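For an elementwise activation h = f(z), i.e., h_i = f(z_i), the Jacobian is diagonal:

\left( \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \right)_{ij} = \frac{\partial h_i}{\partial z_j} =
\begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
\qquad \text{i.e.,} \qquad
\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathrm{diag}\big(f'(\mathbf{z})\big)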
13
Other Jacobians
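Other Jacobians that come up repeatedly (standard results, stated here so the later derivations can refer to them):

\frac{\partial}{\partial \mathbf{x}} (\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{W}
\qquad
\frac{\partial}{\partial \mathbf{b}} (\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{I}
\qquad
\frac{\partial}{\partial \mathbf{h}} (\mathbf{u}^{\top}\mathbf{h}) = \mathbf{u}^{\top}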
21
Back to our Neural Net!
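This is the window classifier from the start of the lecture, with one hidden layer; in the notation used in the derivations below (x is the concatenated window of word vectors):

s = \mathbf{u}^{\top}\mathbf{h}, \qquad
\mathbf{h} = f(\mathbf{z}), \qquad
\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}

and the quantities we want are ∂s/∂b and ∂s/∂W.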
25
2. Apply the chain rule
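For ∂s/∂b, for instance, the chain rule gives (the same pattern works for ∂s/∂W):

\frac{\partial s}{\partial \mathbf{b}} =
\frac{\partial s}{\partial \mathbf{h}} \,
\frac{\partial \mathbf{h}}{\partial \mathbf{z}} \,
\frac{\partial \mathbf{z}}{\partial \mathbf{b}}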
26
3. Write out the Jacobians
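Substituting the useful Jacobians listed earlier:

\frac{\partial s}{\partial \mathbf{h}} = \mathbf{u}^{\top}, \qquad
\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathrm{diag}\big(f'(\mathbf{z})\big), \qquad
\frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{I}

\frac{\partial s}{\partial \mathbf{b}} =
\mathbf{u}^{\top}\,\mathrm{diag}\big(f'(\mathbf{z})\big)\,\mathbf{I} =
\mathbf{u}^{\top}\,\mathrm{diag}\big(f'(\mathbf{z})\big)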
29
(using the useful Jacobians from the previous slide)
30–33
Re-using Computation
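The point of this step, in the notation above: ∂s/∂W and ∂s/∂b share their first two factors, so compute them once and give the product a name, δ:

\boldsymbol{\delta} = \frac{\partial s}{\partial \mathbf{z}} =
\mathbf{u}^{\top}\,\mathrm{diag}\big(f'(\mathbf{z})\big)
\qquad\Longrightarrow\qquad
\frac{\partial s}{\partial \mathbf{b}} = \boldsymbol{\delta}, \qquad
\frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\delta}\,\frac{\partial \mathbf{z}}{\partial \mathbf{W}}

δ is the local error signal that gets passed backwards and reused, rather than recomputed.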
34
Derivative with respect to Matrix: Output shape
• So ∂s/∂W is n by m (matching the shape of W, following the shape convention):
38
Derivative with respect to Matrix
• What is ∂s/∂W?
• δ = ∂s/∂z is going to be in our answer
• The other term should be x, because z = Wx + b
• Answer is: ∂s/∂W = δᵀxᵀ (an outer product)
39
Deriving local input gradient in backprop
"𝒛
• For "𝑾 in our equation:
𝜕𝑠 𝜕𝒛 𝜕
=𝜹 =𝜹 (𝑾𝒙 + 𝒃)
𝜕𝑾 𝜕𝑾 𝜕𝑾
• Let’s consider the derivative of a single weight Wij
• Wij only contributes to zi u2
• For example: W23 is only
s
used to compute z2 not z1 f(z1)= h1 h2 =f(z2)
W23
𝜕𝑧2 𝜕
= 𝑾23 𝒙 + 𝑏2 b2
𝜕𝑊2$ 𝜕𝑊2$
+
= ∑*567 𝑊25 𝑥5 = 𝑥$ x1 x2 x3 +1
+4%!
40
Why the Transposes?
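In terms of shapes (with x in R^m and W in R^{n x m}, so δ is a 1 x n row vector), the transposes are what make the result come out with the same shape as W:

\frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\delta}^{\top}\mathbf{x}^{\top},
\qquad
\boldsymbol{\delta}^{\top} \in \mathbb{R}^{n \times 1},\;
\mathbf{x}^{\top} \in \mathbb{R}^{1 \times m},\;
\boldsymbol{\delta}^{\top}\mathbf{x}^{\top} \in \mathbb{R}^{n \times m}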
42
What shape should derivatives be?
Two options:
1. Use Jacobian form as much as possible, reshape to follow the shape convention at the end:
• What we just did. But at the end, transpose to make the derivative a column vector, resulting in δᵀ
2. Always follow the shape convention (the shape of the gradient is the shape of the parameters)
43
3. Backpropagation
44
Computation Graphs and Backpropagation
• Software represents our neural net equations as a graph
• Source nodes: inputs
• Interior nodes: operations
• Edges pass along the result of the operation
• "Forward Propagation": compute values along the edges, from the inputs through to the output
45–47
Backpropagation
• Then go backwards along edges
• Pass along gradients
48
Backpropagation: Single Node
• Node receives an “upstream gradient”
• Goal is to pass on the correct
“downstream gradient”
[Figure: the downstream gradient (on the node's input edge) is computed from the upstream gradient (on its output edge)]
49
Backpropagation: Single Node
Chain rule!
Downstream gradient = Local gradient × Upstream gradient
51
Backpropagation: Single Node
• Multiple inputs → multiple local gradients
55
An Example
f(x, y, z) = (x + y) · max(y, z), evaluated at x = 1, y = 2, z = 0

Forward pass:
a = x + y = 3,  b = max(y, z) = 2,  f = a · b = 6

Backward pass, using upstream × local = downstream at each node:
• At the * node (upstream = 1): ∂f/∂a = 1 · b = 2,  ∂f/∂b = 1 · a = 3
• At the max node (upstream = 3): ∂f/∂y = 3 · 1 = 3,  ∂f/∂z = 3 · 0 = 0
• At the + node (upstream = 2): ∂f/∂x = 2 · 1 = 2,  ∂f/∂y = 2 · 1 = 2
56–65
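A quick numeric check of these numbers (a throwaway sketch; the function and values are the ones in the example above):

def f(x, y, z):
    return (x + y) * max(y, z)

x, y, z, h = 1.0, 2.0, 0.0, 1e-6
print(f(x, y, z))                            # forward value: 6
print((f(x + h, y, z) - f(x, y, z)) / h)     # df/dx, about 2
print((f(x, y + h, z) - f(x, y, z)) / h)     # df/dy, about 5 (2 from +, plus 3 from max; see next slide)
print((f(x, y, z + h) - f(x, y, z)) / h)     # df/dz, about 0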
Gradients sum at outward branches
• y feeds into both the + node and the max node, so its gradient is the sum of the gradients arriving from the two branches: ∂f/∂y = 2 + 3 = 5
66–67
Node Intuitions
• + "distributes" the upstream gradient to each summand (upstream 2 → 2 to x and 2 to y)
• max "routes" the upstream gradient: the larger input gets all of it (upstream 3 → 3 to y, 0 to z)
• * "switches" the upstream gradient, scaling it by the value of the other input (upstream 1 → 2 and 3)
68–70
Efficiency: compute all gradients at once
• Incorrect way of doing backprop:
• First compute ∂s/∂b
• Then independently compute ∂s/∂W
• Duplicated computation!
71–72
Efficiency: compute all gradients at once
• Correct way:
• Compute all the gradients at once
• Analogous to using 𝜹 when we
computed gradients by hand
73
Back-Prop in General Computation Graph
(single scalar output z at the end of the graph)
1. Fprop: visit nodes in topological sort order
- Compute value of node given predecessors
2. Bprop:
- initialize output gradient = 1
- visit nodes in reverse order:
compute the gradient wrt each node using the gradients wrt its successors {y_1, …, y_n}:

\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}
76
Implementation: forward/backward API
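A minimal sketch of what such an API can look like (illustrative only; the class and method names are assumptions, not the course's codebase): each node caches what it needs during forward and turns the upstream gradient into upstream × local contributions in backward, and the graph runs forward in topological order and backward in reverse, as on the previous slide. The example reuses the (x + y) · max(y, z) graph from earlier.

class Multiply:
    def forward(self, a, b):
        self.a, self.b = a, b          # cache inputs for the backward pass
        return a * b
    def backward(self, upstream):
        # downstream = upstream * local gradient, for each input
        return upstream * self.b, upstream * self.a

class Add:
    def forward(self, a, b):
        return a + b
    def backward(self, upstream):
        return upstream * 1.0, upstream * 1.0   # + distributes the gradient

class Max:
    def forward(self, a, b):
        self.a, self.b = a, b
        return max(a, b)
    def backward(self, upstream):
        # max routes the gradient to the larger input
        return (upstream, 0.0) if self.a >= self.b else (0.0, upstream)

# Forward pass in topological order for f(x, y, z) = (x + y) * max(y, z)
x, y, z = 1.0, 2.0, 0.0
add, mx, mul = Add(), Max(), Multiply()
a = add.forward(x, y)        # 3
b = mx.forward(y, z)         # 2
f = mul.forward(a, b)        # 6

# Backward pass in reverse order, starting from output gradient = 1
da, db = mul.backward(1.0)   # 2, 3
dx, dy_add = add.backward(da)    # 2, 2
dy_max, dz = mx.backward(db)     # 3, 0
dy = dy_add + dy_max             # gradients sum at outward branches: 5
print(f, dx, dy, dz)             # 6.0 2.0 5.0 0.0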
77
Manual Gradient checking: Numeric Gradient
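A common way to check a hand-derived or backprop gradient is a two-sided finite-difference estimate, f'(x) ≈ (f(x + h) - f(x - h)) / 2h. A sketch (the test function s = uᵀ tanh(Wx + b) and all sizes are illustrative assumptions):

import numpy as np

def numeric_gradient(f, theta, h=1e-4):
    """Estimate the gradient of a scalar-valued f at theta, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = h
        grad.flat[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

# Check the analytic result ds/dW = delta^T x^T for s = u^T tanh(Wx + b)
n, m = 2, 3
W, b, u, x = np.random.randn(n, m), np.random.randn(n), np.random.randn(n), np.random.randn(m)
delta = u * (1 - np.tanh(W @ x + b) ** 2)            # delta as a length-n vector (u^T diag(f'(z)))
analytic = np.outer(delta, x)                        # delta^T x^T, shape n x m
numeric = numeric_gradient(lambda w: u @ np.tanh(w.reshape(n, m) @ x + b), W.ravel()).reshape(n, m)
print(np.allclose(analytic, numeric, atol=1e-6))     # True if the analytic gradient is correct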
79
Summary
• Modern deep learning frameworks compute gradients for you!
• But why take a class on compilers or systems when they are implemented for you?
• Understanding what is going on under the hood is useful!
81