Gradient Notes
Christopher Yeh
June 20, 2024
1 Jacobian
Consider a vector-valued function f : Rn → Rm . Then the Jacobian is the matrix
\[
J =
\begin{bmatrix}
\dfrac{\partial f}{\partial x_1} & \cdots & \dfrac{\partial f}{\partial x_n}
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix}
\]
where element-wise J_{ij} = ∂f_i/∂x_j.
If f : Rn → R is a scalar-valued function with vector inputs, then its gradient is just a special
case of the Jacobian with shape 1 × n.
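As a quick numerical illustration, the sketch below (NumPy; the function f and the point x are arbitrary examples, not anything from these notes) approximates the Jacobian by central differences and compares it against the analytic entries J_{ij} = ∂f_i/∂x_j.

```python
import numpy as np

# Example f: R^3 -> R^2 (chosen arbitrarily for illustration)
def f(x):
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0] ** 2])

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate J_ij = df_i/dx_j with central differences."""
    m, n = f(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.array([1.0, 2.0, 0.5])
# Analytic Jacobian of the example f above
J_analytic = np.array([
    [x[1], x[0], 0.0],
    [2 * x[0], 0.0, np.cos(x[2])],
])
assert np.allclose(numerical_jacobian(f, x), J_analytic, atol=1e-6)
```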
2 Softmax Cross-Entropy Loss w.r.t. Logits
We want to compute the gradient for the cross-entropy loss J ∈ R between our predicted
softmax probabilities ŷ and the true one-hot probabilities y. Both y and ŷ are vectors of the
same length. They can be either row or column vectors; the result is the same.
We are given the following:
1. ŷ = softmax(θ)
2. y is a one-hot vector, where y_k = 1 and y_c = 0 for all c ≠ k
3. y, ŷ, θ ∈ Rn
The cross-entropy loss J is computed as follows. The second line expands out the softmax function.
\[
\begin{aligned}
J(\theta) = \mathrm{CE}(y, \hat{y}) &= -\sum_c y_c \log \hat{y}_c = -\log \hat{y}_k \\
&= -\log\!\left(\frac{e^{\theta_k}}{\sum_c e^{\theta_c}}\right) = \log \sum_c e^{\theta_c} - \theta_k
\end{aligned}
\]
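A quick numerical sanity check of this last simplification (a minimal NumPy sketch; the logits θ and the class index k are arbitrary test values):

```python
import numpy as np

theta = np.array([2.0, -1.0, 0.5])   # arbitrary logits
k = 1                                # index of the true class

y_hat = np.exp(theta) / np.sum(np.exp(theta))           # softmax
loss_direct = -np.log(y_hat[k])                         # -log softmax(theta)_k
loss_simplified = np.log(np.sum(np.exp(theta))) - theta[k]
assert np.isclose(loss_direct, loss_simplified)
```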
The gradient of the loss w.r.t. the logits θ is
\[
\frac{\partial J}{\partial \theta_i} = \frac{e^{\theta_i}}{\sum_c e^{\theta_c}} - 1[i = k] = \hat{y}_i - y_i
\quad\Longrightarrow\quad
\nabla_\theta J = \hat{y} - y
\]
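As a check on this result, the sketch below (a minimal NumPy example; the logits and class index are arbitrary test values, and the simplified form of the loss above is the function being differentiated) compares ∇θ J = ŷ − y against a finite-difference gradient.

```python
import numpy as np

def ce_loss(theta, k):
    """Cross-entropy loss between softmax(theta) and the one-hot vector at index k."""
    return np.log(np.sum(np.exp(theta))) - theta[k]

theta = np.array([2.0, -1.0, 0.5])
k = 1
y = np.eye(len(theta))[k]                         # one-hot true label
y_hat = np.exp(theta) / np.sum(np.exp(theta))     # softmax probabilities

# Central-difference gradient of the loss w.r.t. theta
eps = 1e-6
grad_num = np.array([
    (ce_loss(theta + eps * np.eye(len(theta))[i], k)
     - ce_loss(theta - eps * np.eye(len(theta))[i], k)) / (2 * eps)
    for i in range(len(theta))
])
assert np.allclose(grad_num, y_hat - y, atol=1e-6)
```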
3 Matrix times column vector w.r.t. matrix
Given z = W x and r = ∂J/∂z, what is ∂J/∂W?
1. z ∈ Rn and x ∈ Rm are column vectors
2. W ∈ Rn×m is a matrix
3. J ∈ R is some scalar function of z
4. r ∈ R1×n is the Jacobian of J w.r.t. z
Note on notation: Technically, J is a scalar-valued function taking nm inputs (the entries of W). This means the Jacobian ∂J/∂W should be a 1 × nm vector, which isn’t very useful. Instead, we will let ∂J/∂W be an n × m matrix where (∂J/∂W)_{ij} = ∂J/∂W_{ij}.
\[
\frac{\partial J}{\partial W} =
\begin{bmatrix}
\dfrac{\partial J}{\partial W_{1,1}} & \cdots & \dfrac{\partial J}{\partial W_{1,m}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial J}{\partial W_{n,1}} & \cdots & \dfrac{\partial J}{\partial W_{n,m}}
\end{bmatrix}
\]
Naively, we can write
\[
\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z} \frac{\partial z}{\partial W} = r \frac{\partial z}{\partial W}
\]
However, it is unclear how to derive ∂z/∂W, since this is the gradient of a vector w.r.t. a matrix. This gradient would have to be 3-dimensional, and multiplying the vector r by this 3-D tensor is not well-defined. Thus, we instead have to take the element-wise derivative ∂J/∂W_{ij}.
Note that z_k is the dot product between the k-th row of W and x.
\[
z_k = \sum_{l=1}^m W_{kl} x_l
\qquad
\frac{\partial z_k}{\partial W_{ij}} = \sum_{l=1}^m x_l \frac{\partial W_{kl}}{\partial W_{ij}}
\]
Clearly, ∂W_{kl}/∂W_{ij} = 1 only when i = k and j = l, and 0 otherwise. Thus, ∂z_k/∂W_{ij} = 1[k = i] x_j.
Another way we can write this is
\[
\frac{\partial z}{\partial W_{ij}} =
\begin{bmatrix}
0 \\ \vdots \\ 0 \\ x_j \\ 0 \\ \vdots \\ 0
\end{bmatrix}
\quad \leftarrow \text{$i$th element}
\]
Now we can compute
\[
\frac{\partial J}{\partial W_{ij}} = \sum_k \frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial W_{ij}} = \sum_k r_k 1[k = i]\, x_j = r_i x_j
\]
where the summation comes from the Chain Rule. (Every change to W_{ij} influences each z_k, which in turn influences J, so the total effect of W_{ij} on J is the sum of the influences of each z_k on J.)
Thus the full matrix ∂J/∂W is the outer product
\[
\frac{\partial J}{\partial W} = r^\top x^\top
\]
(recall that r is a row vector).
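A numerical check of this result (a minimal NumPy sketch; the scalar loss J(z) = Σ_k z_k² is an arbitrary choice made only so that r = 2zᵀ is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
W = rng.normal(size=(n, m))
x = rng.normal(size=(m, 1))          # column vector

def J(W):
    z = W @ x                        # z = Wx
    return float(np.sum(z ** 2))     # example scalar loss J(z) = sum_k z_k^2

z = W @ x
r = 2 * z.T                          # dJ/dz as a 1 x n row vector
grad_analytic = r.T @ x.T            # dJ/dW = r^T x^T  (n x m)

# Central-difference check of each entry dJ/dW_ij
eps = 1e-6
grad_num = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad_num[i, j] = (J(W + E) - J(W - E)) / (2 * eps)
assert np.allclose(grad_num, grad_analytic, atol=1e-5)
```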
4 Row vector times matrix w.r.t. matrix
The problem setup is basically the same as the previous case, except with row vectors instead
of column vectors.
Given z = xW and r = ∂J/∂z, what is ∂J/∂W?
1. z ∈ R1×n and x ∈ R1×m are row vectors
2. W ∈ Rm×n is a matrix
3. J ∈ R is some scalar function of z
4. r ∈ R1×n is the Jacobian of J w.r.t. z
Similar to the previous case, we have
\[
z_l = \sum_{k=1}^m x_k W_{kl}
\qquad
\frac{\partial z_l}{\partial W_{ij}} = \sum_{k=1}^m x_k \frac{\partial W_{kl}}{\partial W_{ij}} = 1[j = l]\, x_i
\]
Now we can compute
\[
\frac{\partial J}{\partial W_{ij}} = \sum_l \frac{\partial J}{\partial z_l} \frac{\partial z_l}{\partial W_{ij}} = \sum_l r_l 1[j = l]\, x_i = x_i r_j
\]
Thus the full matrix ∂J/∂W is the outer product
\[
\frac{\partial J}{\partial W} = x^\top r
\]
(recall that both x and r are row vectors).
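The same kind of finite-difference check works here (NumPy sketch; again the scalar loss J(z) = Σ_l z_l² is an arbitrary assumption so that r = 2z is known):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.normal(size=(m, n))
x = rng.normal(size=(1, m))          # row vector

def J(W):
    z = x @ W                        # z = xW, shape (1, n)
    return float(np.sum(z ** 2))     # example scalar loss

z = x @ W
r = 2 * z                            # dJ/dz as a 1 x n row vector
grad_analytic = x.T @ r              # dJ/dW = x^T r  (m x n)

eps = 1e-6
grad_num = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad_num[i, j] = (J(W + E) - J(W - E)) / (2 * eps)
assert np.allclose(grad_num, grad_analytic, atol=1e-5)
```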
5 Scalar Function of Matrix Multiplication w.r.t. Matrix
Let B = XY be some matrix multiplication. Let A be a scalar that is a function of B, where ∂A/∂B is known. We want to find ∂A/∂X and ∂A/∂Y.
1. Let X ∈ Rn×m and Y ∈ Rm×p .
2. This means that B, ∂A/∂B ∈ Rn×p .
Note on notation: Technically, A is a scalar-valued function taking np inputs (the entries of B). This means the Jacobian ∂A/∂B should be a 1 × np vector, which isn’t very useful. Instead, we will let ∂A/∂B be an n × p matrix where (∂A/∂B)_{ij} = ∂A/∂B_{ij}. We define ∂A/∂X and ∂A/∂Y similarly.
Naively, we can write ∂A/∂X = (∂A/∂B)(∂B/∂X). However, it is unclear how to derive ∂B/∂X, since this is the gradient of a matrix w.r.t. another matrix. This gradient would have to be 4-dimensional, and multiplying the matrix ∂A/∂B by this 4-D tensor is not well-defined. Thus, we instead take the element-wise derivative ∂A/∂X_{ij}.
First, we compute the derivatives for each element of B w.r.t. each element of X and Y .
\[
\begin{aligned}
\frac{\partial B_{k,l}}{\partial X_{i,j}} &= \frac{\partial}{\partial X_{i,j}} \left(X_{k,:} \cdot Y_{:,l}\right) = 1[k = i]\, Y_{j,l} \\
\frac{\partial B_{k,l}}{\partial Y_{i,j}} &= \frac{\partial}{\partial Y_{i,j}} \left(X_{k,:} \cdot Y_{:,l}\right) = 1[l = j]\, X_{k,i}
\end{aligned}
\]
Now we can use the (multi-path) chain rule to take the derivative of A w.r.t. each element
of X and Y .
\[
\begin{aligned}
\frac{\partial A}{\partial X_{i,j}} &= \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial X_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} 1[k = i]\, Y_{j,l} = \sum_l \frac{\partial A}{\partial B_{i,l}} Y_{j,l} = \frac{\partial A}{\partial B_{i,:}} \cdot Y_{j,:} \\
\frac{\partial A}{\partial Y_{i,j}} &= \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} \frac{\partial B_{k,l}}{\partial Y_{i,j}} = \sum_{k,l} \frac{\partial A}{\partial B_{k,l}} 1[l = j]\, X_{k,i} = \sum_k \frac{\partial A}{\partial B_{k,j}} X_{k,i} = \frac{\partial A}{\partial B_{:,j}} \cdot X_{:,i}
\end{aligned}
\]
Combining these element-wise derivatives yields the matrix equations
\[
\frac{\partial A}{\partial X} = \frac{\partial A}{\partial B} \cdot Y^\top
\qquad\text{and}\qquad
\frac{\partial A}{\partial Y} = X^\top \cdot \frac{\partial A}{\partial B}
\]
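A numerical check of both equations (NumPy sketch; the scalar A(B) = Σ B_{ij}² is an arbitrary assumption made so that ∂A/∂B = 2B is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 4, 3, 5
X = rng.normal(size=(n, m))
Y = rng.normal(size=(m, p))

def A(X, Y):
    B = X @ Y
    return float(np.sum(B ** 2))     # example scalar function of B

dA_dB = 2 * (X @ Y)                  # known gradient for this choice of A
dA_dX = dA_dB @ Y.T                  # (n x m)
dA_dY = X.T @ dA_dB                  # (m x p)

def fd_grad(f, M, eps=1e-6):
    """Central-difference gradient of scalar f w.r.t. matrix argument M."""
    G = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        E = np.zeros_like(M)
        E[idx] = eps
        G[idx] = (f(M + E) - f(M - E)) / (2 * eps)
    return G

assert np.allclose(fd_grad(lambda M: A(M, Y), X), dA_dX, atol=1e-4)
assert np.allclose(fd_grad(lambda M: A(X, M), Y), dA_dY, atol=1e-4)
```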
6 Scalar Function of Matrix-Vector Broadcast Sum w.r.t. Vector
Let A be a scalar that is a function of a matrix B ∈ Rn×m, and suppose ∂A/∂B is known. Let B = X + y be some broadcasted sum between a matrix X and a row vector y ∈ R1×m. We want to find ∂A/∂y.
Intuitively, notice that any change in y directly and linearly affects every row of B. Each
row of B in turn affects A. Therefore,
\[
\frac{\partial A}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \cdot I = \sum_i \frac{\partial A}{\partial B_i}
\]
where B_i is the i-th row of B.
Alternatively, we can write this broadcasted sum properly as
\[
B = X + \mathbf{1} \cdot y
\]
where \(\mathbf{1}\) is an n-dimensional column vector of ones. Then we can use the gradient rules derived earlier to get the equivalent result.
\[
\frac{\partial A}{\partial y} = \mathbf{1}^\top \cdot \frac{\partial A}{\partial B} = \sum_i \frac{\partial A}{\partial B_i}
\]
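A short numerical check (NumPy sketch; as before, the scalar A(B) = Σ B_{ij}² is an arbitrary assumption so that ∂A/∂B = 2B is known):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.normal(size=(n, m))
y = rng.normal(size=(1, m))          # row vector broadcast over the rows of X

def A(y):
    B = X + y                        # NumPy broadcasts y across the n rows
    return float(np.sum(B ** 2))     # example scalar function of B

dA_dB = 2 * (X + y)                  # known gradient for this choice of A
dA_dy = np.ones((1, n)) @ dA_dB      # 1^T (dA/dB) = sum of the rows of dA/dB

eps = 1e-6
grad_num = np.array([[(A(y + eps * np.eye(m)[j]) - A(y - eps * np.eye(m)[j])) / (2 * eps)
                      for j in range(m)]])
assert np.allclose(grad_num, dA_dy, atol=1e-5)
```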
7 Scalar Function of Matrix-Vector Broadcast Product
Let A be a scalar that is a function of a matrix B ∈ Rn×m, and suppose ∂A/∂B is known. Let B = y · X be a broadcasted Hadamard (element-wise) product between a row vector y ∈ R1×m and a matrix X. In other words, the i-th row of B is computed by the Hadamard product B_i = y ⊙ X_i. We want to find ∂A/∂y and ∂A/∂X.
We first find ∂A/∂y. Intuitively, any change in y directly affects every row of B by a factor of the same row in X. Each row of B in turn affects A. Therefore,
\[
\frac{\partial A}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \frac{\partial B_i}{\partial y} = \sum_i \frac{\partial A}{\partial B_i} \odot X_i
\]
Next we find ∂A/∂X. We can find this element-wise, then compose the entire gradient. Note that changing X_{ij} only affects B_{ij}, by a scale of y_j. No other indices in B are affected.
\[
\frac{\partial A}{\partial X_{ij}} = y_j \frac{\partial A}{\partial B_{ij}}
\]
\[
\frac{\partial A}{\partial X} = y \cdot \frac{\partial A}{\partial B}
\]