Gradient Notes
Christopher Yeh
June 20, 2024
1 Jacobian
Consider a vector-valued function f : Rn → Rm . Then the Jacobian is the matrix
\[
J =
\begin{bmatrix}
\dfrac{\partial f}{\partial x_1} & \cdots & \dfrac{\partial f}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix}
\]
where element-wise J_{ij} = ∂f_i/∂x_j.
If f : Rn → R is a scalar-valued function with vector inputs, then its gradient is just a special
case of the Jacobian with shape 1 × n.
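As a quick numerical illustration, the sketch below (NumPy; the function f and the point x are arbitrary examples, not anything from these notes) approximates the Jacobian by central differences and compares it against the analytic entries J_{ij} = ∂f_i/∂x_j.

```python
import numpy as np

# Example f: R^3 -> R^2 (chosen arbitrarily for illustration)
def f(x):
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0] ** 2])

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate J_ij = df_i/dx_j with central differences."""
    m, n = f(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.array([1.0, 2.0, 0.5])
# Analytic Jacobian of the example f above
J_analytic = np.array([
    [x[1], x[0], 0.0],
    [2 * x[0], 0.0, np.cos(x[2])],
])
assert np.allclose(numerical_jacobian(f, x), J_analytic, atol=1e-6)
```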
2 Softmax Cross-Entropy Loss w.r.t. Logits
We want to compute the gradient for the cross-entropy loss J ∈ R between our predicted
softmax probabilities ŷ and the true one-hot probabilities y. Both y and ŷ are vectors of the
same length. They can be either row or column vectors; the result is the same.
We are given the following:
1. ŷ = softmax(θ)
2. y is a one-hot vector, where y_k = 1 and y_c = 0 for all c ≠ k
3. y, ŷ, θ ∈ Rn
The cross-entropy loss J is computed as follows. The second line expands out the softmax function.
\[
\begin{aligned}
J(\theta) = \mathrm{CE}(y, \hat{y}) &= -\sum_c y_c \log \hat{y}_c = -\log \hat{y}_k \\
&= -\log\!\left(\frac{e^{\theta_k}}{\sum_c e^{\theta_c}}\right) = \log \sum_c e^{\theta_c} - \theta_k
\end{aligned}
\]
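A quick numerical sanity check of this last simplification (a minimal NumPy sketch; the logits θ and the class index k are arbitrary test values):

```python
import numpy as np

theta = np.array([2.0, -1.0, 0.5])   # arbitrary logits
k = 1                                # index of the true class

y_hat = np.exp(theta) / np.sum(np.exp(theta))           # softmax
loss_direct = -np.log(y_hat[k])                         # -log softmax(theta)_k
loss_simplified = np.log(np.sum(np.exp(theta))) - theta[k]
assert np.isclose(loss_direct, loss_simplified)
```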
The gradient of the loss w.r.t. the logits θ is
\[
\frac{\partial J}{\partial \theta_i} = \frac{e^{\theta_i}}{\sum_c e^{\theta_c}} - 1[i = k] = \hat{y}_i - y_i
\quad\Longrightarrow\quad
\nabla_\theta J = \hat{y} - y
\]
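As a check on this result, the sketch below (a minimal NumPy example; the logits and class index are arbitrary test values, and the simplified form of the loss above is the function being differentiated) compares ∇θ J = ŷ − y against a finite-difference gradient.

```python
import numpy as np

def ce_loss(theta, k):
    """Cross-entropy loss between softmax(theta) and the one-hot vector at index k."""
    return np.log(np.sum(np.exp(theta))) - theta[k]

theta = np.array([2.0, -1.0, 0.5])
k = 1
y = np.eye(len(theta))[k]                         # one-hot true label
y_hat = np.exp(theta) / np.sum(np.exp(theta))     # softmax probabilities

# Central-difference gradient of the loss w.r.t. theta
eps = 1e-6
grad_num = np.array([
    (ce_loss(theta + eps * np.eye(len(theta))[i], k)
     - ce_loss(theta - eps * np.eye(len(theta))[i], k)) / (2 * eps)
    for i in range(len(theta))
])
assert np.allclose(grad_num, y_hat - y, atol=1e-6)
```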
3 Matrix times column vector w.r.t. matrix
Given z = W x and r = ∂J/∂z, what is ∂J/∂W?
1. z ∈ Rn and x ∈ Rm are column vectors
2. W ∈ Rn×m is a matrix
3. J ∈ R is some scalar function of z
4. r ∈ R1×n is the Jacobian of J w.r.t. z
Note on notation: Technically, J is a scalar-valued function taking nm inputs (the entries of W). This means the Jacobian ∂J/∂W should be a 1 × nm vector, which isn’t very useful. Instead, we will let ∂J/∂W be an n × m matrix where (∂J/∂W)_{ij} = ∂J/∂W_{ij}.
\[
\frac{\partial J}{\partial W} =
\begin{bmatrix}
\dfrac{\partial J}{\partial W_{1,1}} & \cdots & \dfrac{\partial J}{\partial W_{1,m}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial J}{\partial W_{n,1}} & \cdots & \dfrac{\partial J}{\partial W_{n,m}}
\end{bmatrix}
\]
Naively, we can write
\[
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z} \frac{\partial z}{\partial W} = r \frac{\partial z}{\partial W}
\]
However, it is unclear how to derive ∂z/∂W, since this is the gradient of a vector w.r.t. a matrix. This gradient would have to be 3-dimensional, and multiplying the vector r by this 3-D tensor is not well-defined. Thus, we instead have to take the element-wise derivative ∂J/∂W_{ij}.
Note that z_k is the dot product between the k-th row of W and x.
\[
z_k = \sum_{l=1}^m W_{kl} x_l
\qquad
\frac{\partial z_k}{\partial W_{ij}} = \sum_{l=1}^m x_l \frac{\partial W_{kl}}{\partial W_{ij}}
\]
Clearly, ∂W_{kl}/∂W_{ij} = 1 only when i = k and j = l, and 0 otherwise. Thus, ∂z_k/∂W_{ij} = 1[k = i] x_j.
Another way we can write this is
\[
\frac{\partial z}{\partial W_{ij}} =
\begin{bmatrix}
0 \\ \vdots \\ 0 \\ x_j \\ 0 \\ \vdots \\ 0
\end{bmatrix}
\quad \leftarrow \text{$i$th element}
\]
Now we can compute
\[
\frac{\partial J}{\partial W_{ij}} = \sum_k \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial W_{ij}} = \sum_k r_k 1[k = i]\, x_j = r_i x_j
\]
where the summation comes from the Chain Rule. (Every change to W_{ij} influences each z_k, which in turn influences J, so the total effect of W_{ij} on J is the sum of the influences of each z_k on J.)
Thus the full matrix ∂J/∂W is the outer product
\[
\frac{\partial J}{\partial W} = r^\top x^\top
\]
(recall that r is a row vector).
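A numerical check of this result (a minimal NumPy sketch; the scalar loss J(z) = Σ_k z_k² is an arbitrary choice made only so that r = 2zᵀ is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.normal(size=(n, m))
x = rng.normal(size=(m, 1))          # column vector

def J(W):
    z = W @ x                        # z = Wx
    return float(np.sum(z ** 2))     # example scalar loss J(z) = sum_k z_k^2

z = W @ x
r = 2 * z.T                          # dJ/dz as a 1 x n row vector
grad_analytic = r.T @ x.T            # dJ/dW = r^T x^T  (n x m)

# Central-difference check of each entry dJ/dW_ij
eps = 1e-6
grad_num = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad_num[i, j] = (J(W + E) - J(W - E)) / (2 * eps)
assert np.allclose(grad_num, grad_analytic, atol=1e-5)
```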
4 Row vector times matrix w.r.t. matrix
The problem setup is basically the same as the previous case, except with row vectors instead
of column vectors.
Given z = xW and r = ∂J/∂z, what is ∂J/∂W?
1. z ∈ R1×n and x ∈ R1×m are row vectors
2. W ∈ Rm×n is a matrix
3. J ∈ R is some scalar function of z
4. r ∈ R1×n is the Jacobian of J w.r.t. z
Similar to the previous case, we have
\[
z_l = \sum_{k=1}^m x_k W_{kl}
\qquad
\frac{\partial z_l}{\partial W_{ij}} = \sum_{k=1}^m x_k \frac{\partial W_{kl}}{\partial W_{ij}} = 1[j = l]\, x_i
\]
Now we can compute
\[
\frac{\partial J}{\partial W_{ij}} = \sum_l \frac{\partial J}{\partial z_l} \frac{\partial z_l}{\partial W_{ij}} = \sum_l r_l 1[j = l]\, x_i = x_i r_j
\]
Thus the full matrix ∂J/∂W is the outer product
\[
\frac{\partial J}{\partial W} = x^\top r
\]
(recall that both x and r are row vectors).
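The same kind of finite-difference check works here (NumPy sketch; again the scalar loss J(z) = Σ_l z_l² is an arbitrary assumption so that r = 2z is known):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
x = rng.normal(size=(1, m))          # row vector

def J(W):
    z = x @ W                        # z = xW, shape (1, n)
    return float(np.sum(z ** 2))     # example scalar loss

z = x @ W
r = 2 * z                            # dJ/dz as a 1 x n row vector
grad_analytic = x.T @ r              # dJ/dW = x^T r  (m x n)

eps = 1e-6
grad_num = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad_num[i, j] = (J(W + E) - J(W - E)) / (2 * eps)
assert np.allclose(grad_num, grad_analytic, atol=1e-5)
```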
5 Scalar Function of Matrix Multiplication w.r.t. Matrix
Let B = XY be some matrix multiplication. Let A be a scalar that is a function of B, where ∂A/∂B is known. We want to find ∂A/∂X and ∂A/∂Y.
1. Let X ∈ Rn×m and Y ∈ Rm×p .
2. This means that B, ∂A/∂B ∈ Rn×p .
Note on notation: Technically, A is a scalar-valued function taking np inputs (the entries of B). This means the Jacobian ∂A/∂B should be a 1 × np vector, which isn’t very useful. Instead, we will let ∂A/∂B be an n × p matrix where (∂A/∂B)_{ij} = ∂A/∂B_{ij}. We define ∂A/∂X and ∂A/∂Y similarly.
Naively, we can write ∂A/∂X = (∂A/∂B)(∂B/∂X). However, it is unclear how to derive ∂B/∂X, since this is the gradient of a matrix w.r.t. another matrix. This gradient would have to be 4-dimensional, and multiplying the matrix ∂A/∂B by this 4-D tensor is not well-defined. Thus, we instead take the element-wise derivative ∂A/∂X_{ij}.
First, we compute the derivatives for each element of B w.r.t. each element of X and Y .
\[
\begin{aligned}
\frac{\partial B_{k,l}}{\partial X_{i,j}} &= \frac{\partial}{\partial X_{i,j}} \left(X_{k,:} \cdot Y_{:,l}\right) = 1[k = i]\, Y_{j,l} \\
\frac{\partial B_{k,l}}{\partial Y_{i,j}} &= \frac{\partial}{\partial Y_{i,j}} \left(X_{k,:} \cdot Y_{:,l}\right) = 1[l = j]\, X_{k,i}
\end{aligned}
\]
Now we can use the (multi-path) chain rule to take the derivative of A w.r.t. each element
of X and Y .
\[
\begin{aligned}
\frac{\partial A}{\partial X_{i,j}} &= \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial X_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} 1[k = i]\, Y_{j,l} = \sum_l \frac{\partial A}{\partial B_{i,l}} Y_{j,l} = \frac{\partial A}{\partial B_{i,:}} \cdot Y_{j,:} \\
\frac{\partial A}{\partial Y_{i,j}} &= \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial Y_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} 1[l = j]\, X_{k,i} = \sum_k \frac{\partial A}{\partial B_{k,j}} X_{k,i} = \frac{\partial A}{\partial B_{:,j}} \cdot X_{:,i}
\end{aligned}
\]
Combining these element-wise derivatives yields the matrix equations
\[
\frac{\partial A}{\partial X} = \frac{\partial A}{\partial B} \cdot Y^\top
\qquad\text{and}\qquad
\frac{\partial A}{\partial Y} = X^\top \cdot \frac{\partial A}{\partial B}
\]
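A numerical check of both equations (NumPy sketch; the scalar A(B) = Σ B_{ij}² is an arbitrary assumption made so that ∂A/∂B = 2B is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 4, 3, 5
X = rng.normal(size=(n, m))
Y = rng.normal(size=(m, p))

def A(X, Y):
    B = X @ Y
    return float(np.sum(B ** 2))     # example scalar function of B

dA_dB = 2 * (X @ Y)                  # known gradient for this choice of A
dA_dX = dA_dB @ Y.T                  # (n x m)
dA_dY = X.T @ dA_dB                  # (m x p)

def fd_grad(f, M, eps=1e-6):
    """Central-difference gradient of scalar f w.r.t. matrix argument M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        E = np.zeros_like(M)
        E[idx] = eps
        G[idx] = (f(M + E) - f(M - E)) / (2 * eps)
    return G

assert np.allclose(fd_grad(lambda M: A(M, Y), X), dA_dX, atol=1e-4)
assert np.allclose(fd_grad(lambda M: A(X, M), Y), dA_dY, atol=1e-4)
```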
6 Scalar Function of Matrix-Vector Broadcast Sum w.r.t. Vector
Let A be a scalar that is a function of a matrix B ∈ Rn×m, and suppose ∂A/∂B is known. Let B = X + y be some broadcasted sum between a matrix X and a row vector y ∈ R1×m. We want to find ∂A/∂y.
Intuitively, notice that any change in y directly and linearly affects every row of B. Each
row of B in turn affects A. Therefore,
\[
\frac{\partial A}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \cdot I = \sum_i \frac{\partial A}{\partial B_i}
\]
where B_i is the i-th row of B.
Alternatively, we can write this broadcasted sum properly as
\[
B = X + \mathbf{1} \cdot y
\]
where \(\mathbf{1}\) is an n-dimensional column vector of ones. Then we can use the gradient rules derived earlier to get the equivalent result.
\[
\frac{\partial A}{\partial y} = \mathbf{1}^\top \cdot \frac{\partial A}{\partial B} = \sum_i \frac{\partial A}{\partial B_i}
\]
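A short numerical check (NumPy sketch; as before, the scalar A(B) = Σ B_{ij}² is an arbitrary assumption so that ∂A/∂B = 2B is known):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.normal(size=(n, m))
y = rng.normal(size=(1, m))          # row vector broadcast over the rows of X

def A(y):
    B = X + y                        # NumPy broadcasts y across the n rows
    return float(np.sum(B ** 2))     # example scalar function of B

dA_dB = 2 * (X + y)                  # known gradient for this choice of A
dA_dy = np.ones((1, n)) @ dA_dB      # 1^T (dA/dB) = sum of the rows of dA/dB

eps = 1e-6
grad_num = np.array([[(A(y + eps * np.eye(m)[j]) - A(y - eps * np.eye(m)[j])) / (2 * eps)
                      for j in range(m)]])
assert np.allclose(grad_num, dA_dy, atol=1e-5)
```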
7 Scalar Function of Matrix-Vector Broadcast Product
Let A be a scalar that is a function of a matrix B ∈ Rn×m, and suppose ∂A/∂B is known. Let B = y · X be a broadcasted Hadamard (element-wise) product between a row vector y ∈ R1×m and a matrix X. In other words, the i-th row of B is computed by the Hadamard product B_i = y ⊙ X_i. We want to find ∂A/∂y and ∂A/∂X.
We first find ∂A/∂y. Intuitively, any change in y directly affects every row of B by a factor of the same row in X. Each row of B in turn affects A. Therefore,
\[
\frac{\partial A}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \odot X_i
\]
Next we find ∂A/∂X. We can find this element-wise, then compose the entire gradient. Note that changing X_{ij} only affects B_{ij}, by a scale of y_j. No other indices in B are affected.
\[
\frac{\partial A}{\partial X_{ij}} = y_j \frac{\partial A}{\partial B_{ij}}
\]
\[
\frac{\partial A}{\partial X} = y \cdot \frac{\partial A}{\partial B}
\]