XCS224N_Module2_Slides
Christopher Manning
Lecture 3: Neural net learning: Gradients by hand (matrix calculus)
and algorithmically (the backpropagation algorithm)
Named Entity Recognition (NER)
• The task: find and classify names in text, for example, person, location, and organization names
• Possible uses:
• Tracking mentions of particular entities in documents
• For question answering, answers are usually named entities
• Often followed by Named Entity Linking/Canonicalization into Knowledge Base
3
Simple NER: Window classification using binary logistic classifier
• Idea: classify each word in its context window of neighboring words
• Train logistic classifier on hand-labeled data to classify center word {yes/no} for each
class based on a concatenation of word vectors in a window
• Really, we usually use a multi-class softmax classifier, but we're trying to keep it simple here
• Example: Classify “Paris” as +/– location in context of sentence with window length 2:
J_t(\theta) = \sigma(s) = \frac{1}{1 + e^{-s}}

(the predicted model probability of the class, e.g., "+")

We train with stochastic gradient descent, i.e., for each parameter:

\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j^{\text{old}}}
In deep learning, we update the data representation (e.g., word vectors) too!
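As a concrete, illustrative sketch of this setup, here is a tiny NumPy version of the binary window classifier; the sentence, window size, vector dimension, and the purely linear score are simplifying assumptions, not the lecture's exact model:

import numpy as np

d, radius = 4, 2                                   # word-vector dimension and window radius (assumed)
words = ["museums", "in", "Paris", "are", "amazing"]
vectors = {w: np.random.randn(d) for w in words}   # toy word vectors

# Concatenate the word vectors of the center word ("Paris") and its +/- 2 neighbors
x = np.concatenate([vectors[words[i]] for i in range(2 - radius, 2 + radius + 1)])

theta = np.random.randn(x.size)                    # classifier parameters (simple linear score)
s = theta @ x                                      # score that "Paris" is a location
prob = 1.0 / (1.0 + np.exp(-s))                    # J_t(theta) = sigma(s)
print(prob)

# One stochastic gradient descent step on the cross-entropy loss for a positive example (y = 1)
alpha = 0.1
grad = (prob - 1.0) * x                            # d(loss)/d(theta) for the logistic loss
theta = theta - alpha * grad                       # theta_new = theta_old - alpha * gradient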
7
Computing Gradients by Hand
• Matrix calculus: Fully vectorized gradients
• “Multivariable calculus is just like single-variable calculus if you use matrices”
• Much faster and more useful than non-vectorized gradients
• But doing a non-vectorized gradient can be good for intuition; recall the first
lecture for an example
• Lecture notes and matrix calculus notes cover this material in more detail
• You might also review Math 51, which has a new online textbook:
http://web.stanford.edu/class/math51/textbook.html
or maybe you’re luckier if you did Engr 108
8
Gradients
• Given a function with 1 output and 1 input
f(x) = x^3
• Its gradient (slope) is its derivative:
\frac{df}{dx} = 3x^2
“How much will the output change if we change the input a bit?”
At x = 1 it changes about 3 times as much: 1.01^3 = 1.03
At x = 4 it changes about 48 times as much: 4.01^3 = 64.48
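These numbers are easy to verify with a quick finite-difference check (a throwaway sketch, not from the slides):

f = lambda x: x ** 3
for x in (1.0, 4.0):
    # how much the output moves when the input moves by 0.01
    print(x, (f(x + 0.01) - f(x)) / 0.01, 3 * x ** 2)   # about 3 at x = 1, about 48 at x = 4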
9
Gradients
• Given a function with 1 output and n inputs
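For f : R^n -> R (one output, n inputs), the gradient is the vector of partial derivatives, one per input:

\frac{\partial f}{\partial \mathbf{x}} = \left[ \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_n} \right]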
10
Jacobian Matrix: Generalization of the Gradient
• Given a function with m outputs and n inputs
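For f : R^n -> R^m (m outputs, n inputs), the Jacobian is the m x n matrix of all partial derivatives:

\left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)_{ij} = \frac{\partial f_i}{\partial x_j}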
11
Chain Rule
• For composition of one-variable functions: multiply derivatives
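For example, if z = g(y) and y = h(x), then

\frac{dz}{dx} = \frac{dz}{dy} \, \frac{dy}{dx}

The same pattern, with Jacobians in place of single-variable derivatives, handles vector-valued functions, which is what the next slides use.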
12
Example Jacobian: Elementwise activation Function
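For an elementwise activation h = f(z), i.e., h_i = f(z_i), the Jacobian is diagonal:

\left( \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \right)_{ij} = \frac{\partial h_i}{\partial z_j} =
\begin{cases} f'(z_i) & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}
\qquad \text{i.e.,} \qquad
\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathrm{diag}\big(f'(\mathbf{z})\big)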
13
Other Jacobians
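Other Jacobians that come up repeatedly (standard results, stated here so the later derivations can refer to them):

\frac{\partial}{\partial \mathbf{x}} (\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{W}
\qquad
\frac{\partial}{\partial \mathbf{b}} (\mathbf{W}\mathbf{x} + \mathbf{b}) = \mathbf{I}
\qquad
\frac{\partial}{\partial \mathbf{h}} (\mathbf{u}^{\top}\mathbf{h}) = \mathbf{u}^{\top}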
21
Back to our Neural Net!
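This is the window classifier from the start of the lecture, with one hidden layer; in the notation used in the derivations below (x is the concatenated window of word vectors):

s = \mathbf{u}^{\top}\mathbf{h}, \qquad
\mathbf{h} = f(\mathbf{z}), \qquad
\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}

and the quantities we want are ∂s/∂b and ∂s/∂W.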
25
2. Apply the chain rule
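For ∂s/∂b, for instance, the chain rule gives (the same pattern works for ∂s/∂W):

\frac{\partial s}{\partial \mathbf{b}} =
\frac{\partial s}{\partial \mathbf{h}} \,
\frac{\partial \mathbf{h}}{\partial \mathbf{z}} \,
\frac{\partial \mathbf{z}}{\partial \mathbf{b}}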
26
3. Write out the Jacobians
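Substituting the useful Jacobians listed earlier:

\frac{\partial s}{\partial \mathbf{h}} = \mathbf{u}^{\top}, \qquad
\frac{\partial \mathbf{h}}{\partial \mathbf{z}} = \mathrm{diag}\big(f'(\mathbf{z})\big), \qquad
\frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \mathbf{I}

\frac{\partial s}{\partial \mathbf{b}} =
\mathbf{u}^{\top}\,\mathrm{diag}\big(f'(\mathbf{z})\big)\,\mathbf{I} =
\mathbf{u}^{\top}\,\mathrm{diag}\big(f'(\mathbf{z})\big)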
29
(using the useful Jacobians from the previous slide)
30–33
Re-using Computation
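The point of this step, in the notation above: ∂s/∂W and ∂s/∂b share their first two factors, so compute them once and give the product a name, δ:

\boldsymbol{\delta} = \frac{\partial s}{\partial \mathbf{z}} =
\mathbf{u}^{\top}\,\mathrm{diag}\big(f'(\mathbf{z})\big)
\qquad\Longrightarrow\qquad
\frac{\partial s}{\partial \mathbf{b}} = \boldsymbol{\delta}, \qquad
\frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\delta}\,\frac{\partial \mathbf{z}}{\partial \mathbf{W}}

δ is the local error signal that gets passed backwards and reused, rather than recomputed.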
34
Derivative with respect to Matrix: Output shape
• So ∂s/∂W is n by m (matching the shape of W, following the shape convention):
38
Derivative with respect to Matrix
• What is ∂s/∂W?
• δ = ∂s/∂z is going to be in our answer
• The other term should be x, because z = Wx + b
• Answer is: ∂s/∂W = δᵀxᵀ (an outer product)
39
Deriving local input gradient in backprop
"𝒛
• For "𝑾 in our equation:
𝜕𝑠 𝜕𝒛 𝜕
=𝜹 =𝜹 (𝑾𝒙 + 𝒃)
𝜕𝑾 𝜕𝑾 𝜕𝑾
• Let’s consider the derivative of a single weight Wij
• Wij only contributes to zi u2
• For example: W23 is only
s
used to compute z2 not z1 f(z1)= h1 h2 =f(z2)
W23
𝜕𝑧2 𝜕
= 𝑾23 𝒙 + 𝑏2 b2
𝜕𝑊2$ 𝜕𝑊2$
+
= ∑*567 𝑊25 𝑥5 = 𝑥$ x1 x2 x3 +1
+4%!
40
Why the Transposes?
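In terms of shapes (with x in R^m and W in R^{n x m}, so δ is a 1 x n row vector), the transposes are what make the result come out with the same shape as W:

\frac{\partial s}{\partial \mathbf{W}} = \boldsymbol{\delta}^{\top}\mathbf{x}^{\top},
\qquad
\boldsymbol{\delta}^{\top} \in \mathbb{R}^{n \times 1},\;
\mathbf{x}^{\top} \in \mathbb{R}^{1 \times m},\;
\boldsymbol{\delta}^{\top}\mathbf{x}^{\top} \in \mathbb{R}^{n \times m}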
42
What shape should derivatives be?
Two options:
1. Use Jacobian form as much as possible, reshape to follow the shape convention at the end:
• What we just did. But at the end, transpose to make the derivative a column vector, resulting in δᵀ
2. Always follow the shape convention (the shape of the gradient is the shape of the parameters)
43
3. Backpropagation
44
Computation Graphs and Backpropagation
• Software represents our neural net equations as a graph
• Source nodes: inputs
• Interior nodes: operations
• Edges pass along the result of the operation
• "Forward Propagation": compute values along the edges, from the inputs through to the output
45–47
Backpropagation
• Then go backwards along edges
• Pass along gradients
48
Backpropagation: Single Node
• Node receives an “upstream gradient”
• Goal is to pass on the correct
“downstream gradient”
[Figure: the downstream gradient (on the node's input edge) is computed from the upstream gradient (on its output edge)]
49
Backpropagation: Single Node
Chain rule!
Downstream gradient = Local gradient × Upstream gradient
51
Backpropagation: Single Node
• Multiple inputs → multiple local gradients
55
An Example
f(x, y, z) = (x + y) · max(y, z), evaluated at x = 1, y = 2, z = 0

Forward pass:
a = x + y = 3,  b = max(y, z) = 2,  f = a · b = 6

Backward pass, using upstream × local = downstream at each node:
• At the * node (upstream = 1): ∂f/∂a = 1 · b = 2,  ∂f/∂b = 1 · a = 3
• At the max node (upstream = 3): ∂f/∂y = 3 · 1 = 3,  ∂f/∂z = 3 · 0 = 0
• At the + node (upstream = 2): ∂f/∂x = 2 · 1 = 2,  ∂f/∂y = 2 · 1 = 2
56–65
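A quick numeric check of these numbers (a throwaway sketch; the function and values are the ones in the example above):

def f(x, y, z):
    return (x + y) * max(y, z)

x, y, z, h = 1.0, 2.0, 0.0, 1e-6
print(f(x, y, z))                            # forward value: 6
print((f(x + h, y, z) - f(x, y, z)) / h)     # df/dx, about 2
print((f(x, y + h, z) - f(x, y, z)) / h)     # df/dy, about 5 (2 from +, plus 3 from max; see next slide)
print((f(x, y, z + h) - f(x, y, z)) / h)     # df/dz, about 0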
Gradients sum at outward branches
• y feeds into both the + node and the max node, so its gradient is the sum of the gradients arriving from the two branches: ∂f/∂y = 2 + 3 = 5
66–67
Node Intuitions
• + "distributes" the upstream gradient to each summand (upstream 2 → 2 to x and 2 to y)
• max "routes" the upstream gradient: the larger input gets all of it (upstream 3 → 3 to y, 0 to z)
• * "switches" the upstream gradient, scaling it by the value of the other input (upstream 1 → 2 and 3)
68–70
Efficiency: compute all gradients at once
• Incorrect way of doing backprop:
• First compute ∂s/∂b
• Then independently compute ∂s/∂W
• Duplicated computation!
71–72
Efficiency: compute all gradients at once
• Correct way:
• Compute all the gradients at once
• Analogous to using 𝜹 when we
computed gradients by hand
73
Back-Prop in General Computation Graph
(single scalar output z at the end of the graph)
1. Fprop: visit nodes in topological sort order
- Compute value of node given predecessors
2. Bprop:
- initialize output gradient = 1
- visit nodes in reverse order:
compute the gradient wrt each node using the gradients wrt its successors {y_1, …, y_n}:

\frac{\partial z}{\partial x} = \sum_{i=1}^{n} \frac{\partial z}{\partial y_i}\,\frac{\partial y_i}{\partial x}
76
Implementation: forward/backward API
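A minimal sketch of what such an API can look like (illustrative only; the class and method names are assumptions, not the course's codebase): each node caches what it needs during forward and turns the upstream gradient into upstream × local contributions in backward, and the graph runs forward in topological order and backward in reverse, as on the previous slide. The example reuses the (x + y) · max(y, z) graph from earlier.

class Multiply:
    def forward(self, a, b):
        self.a, self.b = a, b          # cache inputs for the backward pass
        return a * b
    def backward(self, upstream):
        # downstream = upstream * local gradient, for each input
        return upstream * self.b, upstream * self.a

class Add:
    def forward(self, a, b):
        return a + b
    def backward(self, upstream):
        return upstream * 1.0, upstream * 1.0   # + distributes the gradient

class Max:
    def forward(self, a, b):
        self.a, self.b = a, b
        return max(a, b)
    def backward(self, upstream):
        # max routes the gradient to the larger input
        return (upstream, 0.0) if self.a >= self.b else (0.0, upstream)

# Forward pass in topological order for f(x, y, z) = (x + y) * max(y, z)
x, y, z = 1.0, 2.0, 0.0
add, mx, mul = Add(), Max(), Multiply()
a = add.forward(x, y)        # 3
b = mx.forward(y, z)         # 2
f = mul.forward(a, b)        # 6

# Backward pass in reverse order, starting from output gradient = 1
da, db = mul.backward(1.0)   # 2, 3
dx, dy_add = add.backward(da)    # 2, 2
dy_max, dz = mx.backward(db)     # 3, 0
dy = dy_add + dy_max             # gradients sum at outward branches: 5
print(f, dx, dy, dz)             # 6.0 2.0 5.0 0.0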
77
Manual Gradient checking: Numeric Gradient
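A common way to check a hand-derived or backprop gradient is a two-sided finite-difference estimate, f'(x) ≈ (f(x + h) - f(x - h)) / 2h. A sketch (the test function s = uᵀ tanh(Wx + b) and all sizes are illustrative assumptions):

import numpy as np

def numeric_gradient(f, theta, h=1e-4):
    """Estimate the gradient of a scalar-valued f at theta, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = h
        grad.flat[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

# Check the analytic result ds/dW = delta^T x^T for s = u^T tanh(Wx + b)
n, m = 2, 3
W, b, u, x = np.random.randn(n, m), np.random.randn(n), np.random.randn(n), np.random.randn(m)
delta = u * (1 - np.tanh(W @ x + b) ** 2)            # delta as a length-n vector (u^T diag(f'(z)))
analytic = np.outer(delta, x)                        # delta^T x^T, shape n x m
numeric = numeric_gradient(lambda w: u @ np.tanh(w.reshape(n, m) @ x + b), W.ravel()).reshape(n, m)
print(np.allclose(analytic, numeric, atol=1e-6))     # True if the analytic gradient is correct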
79
Summary
• Modern deep learning frameworks compute gradients for you!
• But why take a class on compilers or systems when they are implemented for you?
• Understanding what is going on under the hood is useful!
81