Matrix Calculus
Sourya Dey
1 Notation
• Scalars are written as lower case letters.
• Vectors are written as lower case bold letters, such as x, and can be either row (dimensions
1×n) or column (dimensions n×1). Column vectors are the default choice, unless otherwise
mentioned. Individual elements are indexed by subscripts, such as xi (i ∈ {1, · · · , n}).
• Matrices are written as upper case bold letters, such as X, and have dimensions m × n
corresponding to m rows and n columns. Individual elements are indexed by double
subscripts for row and column, such as Xij (i ∈ {1, · · · , m}, j ∈ {1, · · · , n}).
• Occasionally higher order tensors occur, such as 3rd order with dimensions m × n × p, etc.
Note that a matrix is a 2nd order tensor. A row vector is a matrix with 1 row, and a column
vector is a matrix with 1 column. A scalar is a matrix with 1 row and 1 column. Essentially,
scalars and vectors are special cases of matrices.
The derivative of $f$ with respect to $x$ is $\frac{\partial f}{\partial x}$. Both $x$ and $f$ can be a scalar, vector, or matrix, leading to 9 types of derivatives. The gradient of $f$ w.r.t. $x$ is $\nabla_x f = \left( \frac{\partial f}{\partial x} \right)^T$, i.e. the gradient is the transpose of the derivative. The gradient at any point $x_0$ in the domain has a physical interpretation: its direction is the direction of maximum increase of the function $f$ at the point $x_0$, and its magnitude is the rate of increase in that direction. We do not generally deal with the gradient when $x$ is a scalar.
2 Basic Rules
We’ll first state the most general matrix-matrix derivative type. All other types are sim-
plifications since scalars and vectors are special cases of matrices. Consider a function F (·)
which maps $m \times n$ matrices to $p \times q$ matrices, i.e. domain $\subset \mathbb{R}^{m \times n}$ and range $\subset \mathbb{R}^{p \times q}$. So, $F(\cdot): \underset{m \times n}{X} \to \underset{p \times q}{F(X)}$. Its derivative $\frac{\partial F}{\partial X}$ is a 4th order tensor of dimensions $p \times q \times n \times m$. This is an outer matrix of dimensions $n \times m$ (transposed dimensions of the denominator $X$), with each element being a $p \times q$ inner matrix (same dimensions as the numerator $F$). It is given as:

$$\frac{\partial F}{\partial X} = \begin{bmatrix} \dfrac{\partial F}{\partial X_{1,1}} & \cdots & \dfrac{\partial F}{\partial X_{m,1}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial F}{\partial X_{1,n}} & \cdots & \dfrac{\partial F}{\partial X_{m,n}} \end{bmatrix} \tag{1a}$$
which has n rows and m columns, and the (i, j)th element is given as:
$$\frac{\partial F}{\partial X_{i,j}} = \begin{bmatrix} \dfrac{\partial F_{1,1}}{\partial X_{i,j}} & \cdots & \dfrac{\partial F_{1,q}}{\partial X_{i,j}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial F_{p,1}}{\partial X_{i,j}} & \cdots & \dfrac{\partial F_{p,q}}{\partial X_{i,j}} \end{bmatrix} \tag{1b}$$
which has p rows and q columns.
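To make this layout concrete, here is a small numerical sketch. It assumes NumPy is available, and the helper name `matrix_derivative` is our own invention, not from the text; it builds the $n \times m$ outer grid of $p \times q$ inner matrices by central finite differences, for the linear map $F(X) = AX$:

```python
import numpy as np

def matrix_derivative(F, X, eps=1e-6):
    """Build dF/dX numerically, following (1a)-(1b): an n x m outer grid
    (transposed dimensions of X) whose (i, j) entry is the p x q inner
    matrix dF / dX[j, i] (note the swapped indices into X)."""
    m, n = X.shape
    p, q = F(X).shape
    D = np.zeros((n, m, p, q))
    for i in range(n):
        for j in range(m):
            dX = np.zeros_like(X)
            dX[j, i] = eps
            D[i, j] = (F(X + dX) - F(X - dX)) / (2 * eps)  # central difference
    return D

# Linear map F(X) = A @ X from 2x3 matrices to 4x3 matrices (p = 4, q = 3)
A = np.arange(8.0).reshape(4, 2)
X = np.ones((2, 3))
D = matrix_derivative(lambda M: A @ M, X)
print(D.shape)  # (3, 2, 4, 3): outer n x m, inner p x q
```

The resulting shape `(n, m, p, q)` matches the convention: transposed dimensions of the denominator $X$ on the outside, dimensions of the numerator $F$ on the inside.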
Whew! Now that that’s out of the way, let’s get to some general rules (for the following, x and
y can represent scalar, vector or matrix):
• The derivative $\frac{\partial y}{\partial x}$ always has outer matrix dimensions = transposed dimensions of denominator $x$, and each individual element (inner matrix) has dimensions = same dimensions of numerator $y$. If you do a calculation and the dimensions don't come out right, the answer is not correct.
• Derivatives usually obey the chain rule, i.e. $\dfrac{\partial f(g(x))}{\partial x} = \dfrac{\partial f(g(x))}{\partial g(x)} \dfrac{\partial g(x)}{\partial x}$.
• Derivatives usually obey the product rule, i.e. $\dfrac{\partial f(x) g(x)}{\partial x} = f(x) \dfrac{\partial g(x)}{\partial x} + g(x) \dfrac{\partial f(x)}{\partial x}$.
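The product rule can be sanity-checked numerically for scalar-valued functions of a vector. This is a sketch assuming NumPy; the helper `row_grad` is our own name, and it returns the derivative as a $1 \times m$ row vector per the dimension rule above:

```python
import numpy as np

def row_grad(h, x, eps=1e-6):
    """Numerical derivative of scalar h w.r.t. vector x: a 1 x m
    row vector (transposed dimensions of the denominator x)."""
    g = np.zeros((1, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        g[0, i] = (h(x + dx) - h(x - dx)) / (2 * eps)  # central difference
    return g

f = lambda x: float(np.sin(x).sum())   # scalar-valued
g = lambda x: float((x ** 2).sum())    # scalar-valued
x = np.array([0.3, -1.2, 2.0])

lhs = row_grad(lambda v: f(v) * g(v), x)                 # d(fg)/dx
rhs = f(x) * row_grad(g, x) + g(x) * row_grad(f, x)      # product rule
print(np.allclose(lhs, rhs, atol=1e-4))  # True
```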
3 Types of derivatives
3.1 Scalar by scalar
Nothing special here. The derivative is a scalar, and can also be written as $f'(x)$. For example, if $f(x) = \sin x$, then $f'(x) = \cos x$.
3.3 Vector by scalar

$$\frac{\partial f}{\partial x} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x} \\ \dfrac{\partial f_2}{\partial x} \\ \vdots \\ \dfrac{\partial f_n}{\partial x} \end{bmatrix} \tag{3}$$
For a vectorized scalar function applied element-wise, i.e. $f(x) = [f(x_1), f(x_2), \cdots, f(x_m)]^T$, both the derivative and gradient are the same $m \times m$ diagonal matrix, given as:

$$\nabla_x f = \frac{\partial f}{\partial x} = \begin{bmatrix} f'(x_1) & & & \\ & f'(x_2) & & \\ & & \ddots & \\ & & & f'(x_m) \end{bmatrix} \tag{6}$$

where $f'(x_i) = \dfrac{\partial f(x_i)}{\partial x_i}$.
Note: Some texts take the derivative of a vectorized scalar function by taking element-wise derivatives to get an $m \times 1$ vector. To avoid confusion with (6), we will refer to this as $f'(x)$:

$$f'(x) = \begin{bmatrix} f'(x_1) \\ f'(x_2) \\ \vdots \\ f'(x_m) \end{bmatrix} \tag{7}$$
To realize the effect of this, let’s say we want to multiply the gradient from (6) with some
m-dimensional vector a. This would result in:
$$\nabla_x f \, a = \begin{bmatrix} f'(x_1)\, a_1 \\ f'(x_2)\, a_2 \\ \vdots \\ f'(x_m)\, a_m \end{bmatrix} \tag{8}$$
Achieving the same result with f 0 (x) from (7) would require the Hadamard product ◦, defined
as element-wise multiplication of 2 vectors:
$$f'(x) \circ a = \begin{bmatrix} f'(x_1)\, a_1 \\ f'(x_2)\, a_2 \\ \vdots \\ f'(x_m)\, a_m \end{bmatrix} \tag{9}$$
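The equivalence of (8) and (9) can be checked in a few lines. This sketch assumes NumPy, where `*` on arrays is exactly the Hadamard (element-wise) product:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0])
a = np.array([3.0, -1.0, 4.0])
fprime = np.cos(x)                 # element-wise f'(x) for f = sin, as in (7)

left = np.diag(fprime) @ a         # (8): diagonal gradient of (6) times a
right = fprime * a                 # (9): Hadamard product f'(x) ∘ a
print(np.allclose(left, right))    # True
```

In practice the Hadamard form (9) is preferred since it avoids materializing the $m \times m$ diagonal matrix.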
Consider the type of function in Sec. 3.2, i.e. $f(\cdot): \underset{m \times 1}{x} \to \underset{1 \times 1}{f(x)}$. Its gradient is a vector-to-vector function given as $\nabla_x f(\cdot): \underset{m \times 1}{x} \to \underset{m \times 1}{\nabla_x f(x)}$. The transpose of its derivative is the Hessian:

$$H = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_m \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_m^2} \end{bmatrix} \tag{10}$$

i.e. $H = \left( \dfrac{\partial \nabla_x f}{\partial x} \right)^T$. If the derivatives are continuous, then $\dfrac{\partial^2 f}{\partial x_i \partial x_j} = \dfrac{\partial^2 f}{\partial x_j \partial x_i}$, so the Hessian is symmetric.
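The symmetry of the Hessian is easy to observe numerically. This is a sketch assuming NumPy; the helper `hessian` is our own name, building each entry by a second-order central difference:

```python
import numpy as np

def hessian(f, x, eps=1e-4):
    """Numerical Hessian of a scalar f of a vector x, built entry-wise
    as H[i, j] = d^2 f / (dx_i dx_j) via central differences."""
    m = x.size
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei = np.zeros(m); ei[i] = eps
            ej = np.zeros(m); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

f = lambda x: x[0] ** 2 * x[1] + np.sin(x[1]) * x[2]  # smooth scalar function
x = np.array([1.0, 2.0, 0.5])
H = hessian(f, x)
print(np.allclose(H, H.T, atol=1e-5))  # True: mixed partials commute
```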
3.5 Scalar by matrix
$$\frac{\partial f}{\partial X} = \begin{bmatrix} \dfrac{\partial f}{\partial X_{1,1}} & \cdots & \dfrac{\partial f}{\partial X_{m,1}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f}{\partial X_{1,n}} & \cdots & \dfrac{\partial f}{\partial X_{m,n}} \end{bmatrix} \tag{11}$$
The gradient has the same dimensions as the input matrix, i.e. m × n.
3.6 Matrix by scalar

$$\frac{\partial F}{\partial x} = \begin{bmatrix} \dfrac{\partial F_{1,1}}{\partial x} & \cdots & \dfrac{\partial F_{1,q}}{\partial x} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial F_{p,1}}{\partial x} & \cdots & \dfrac{\partial F_{p,q}}{\partial x} \end{bmatrix} \tag{12}$$
3.7 Vector by matrix

$f(\cdot): \underset{m \times n}{X} \to \underset{p \times 1}{f(X)}$. In this case, the derivative is a 3rd-order tensor with dimensions $p \times n \times m$. This is the same $n \times m$ matrix as in (11), but with the scalar $f$ replaced by the $p$-dimensional vector $f$, i.e.:

$$\frac{\partial f}{\partial X} = \begin{bmatrix} \dfrac{\partial f}{\partial X_{1,1}} & \cdots & \dfrac{\partial f}{\partial X_{m,1}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f}{\partial X_{1,n}} & \cdots & \dfrac{\partial f}{\partial X_{m,n}} \end{bmatrix} \tag{13}$$
3.8 Matrix by vector

$F(\cdot): \underset{m \times 1}{x} \to \underset{p \times q}{F(x)}$. In this case, the derivative is a 3rd-order tensor with dimensions $p \times q \times m$. This is the same $1 \times m$ row vector as in (2), but with the scalar $f$ replaced by the $p \times q$ matrix $F$, i.e.:

$$\frac{\partial F}{\partial x} = \begin{bmatrix} \dfrac{\partial F}{\partial x_1} & \dfrac{\partial F}{\partial x_2} & \cdots & \dfrac{\partial F}{\partial x_m} \end{bmatrix} \tag{14}$$
4 Operations and Examples
4.1 Commutation
If things normally don’t commute (such as for matrices, AB 6= BA), then order should be
maintained when taking derivatives. If things normally commute (such as for vector inner
product, a·b = b·a), their order can be switched when taking derivatives. Output dimensions
must always come out right.
For example, let $f(x) = (a^T x)\, b$, where $a^T$ is $1 \times m$, $x$ is $m \times 1$, and $b$ is $n \times 1$, so $f(x)$ is $n \times 1$. The derivative $\frac{\partial f}{\partial x}$ should be an $n \times m$ matrix. Keeping order fixed, we get $\frac{\partial f}{\partial x} = a^T \frac{\partial x}{\partial x} b = a^T I b = a^T b$. This is a scalar, which is wrong! The solution? Note that $a^T x$ is a scalar, which can sit either to the right or the left of vector $b$, i.e. ordering doesn't really matter. Rewriting $f(x) = b\, (a^T x)$, we get $\frac{\partial f}{\partial x} = b\, a^T \frac{\partial x}{\partial x} = b a^T I = b a^T$, which is the correct $n \times m$ matrix.
If this seems confusing, it might be useful to take a simple example with low values for m and
n, and write out the full derivative in matrix form as shown in (4). The resulting matrix will
be baT .
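The claim can be verified with small concrete vectors. This sketch assumes NumPy and checks the finite-difference derivative of $f(x) = (a^T x)\, b$ against $b a^T$; the particular values of $a$, $b$, $x$ are arbitrary choices of ours:

```python
import numpy as np

a = np.array([[1.0], [-2.0], [0.5]])            # m x 1, m = 3
b = np.array([[2.0], [0.0], [-1.0], [3.0]])     # n x 1, n = 4
x = np.array([[0.1], [0.2], [0.3]])             # m x 1

f = lambda v: (a.T @ v) * b                     # n x 1: scalar (a^T v) times b

# Column j of the n x m derivative is df / dx_j
eps = 1e-6
D = np.zeros((4, 3))
for j in range(3):
    dx = np.zeros_like(x)
    dx[j] = eps
    D[:, [j]] = (f(x + dx) - f(x - dx)) / (2 * eps)

print(np.allclose(D, b @ a.T, atol=1e-8))  # True
```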
The derivative of a transposed vector w.r.t. itself is the identity matrix, but the transpose gets applied to everything after. For example, let $f(w) = (y - w^T x)^2 = y^2 - (w^T x)\, y - y\, (w^T x) + (w^T x)(w^T x)$, where $y$ and $x$ are not functions of $w$. Taking the derivative of the terms individually:
• $\dfrac{\partial y^2}{\partial w} = 0^T$, i.e. a row vector of all 0s.
• $\dfrac{\partial (w^T x) y}{\partial w} = \dfrac{\partial w^T}{\partial w}\, x y = (x y)^T = y^T x^T$. Since $y$ is a scalar, this is simply $y x^T$.
• $\dfrac{\partial y (w^T x)}{\partial w} = y\, \dfrac{\partial w^T}{\partial w}\, x = y x^T$
• $\dfrac{\partial (w^T x)(w^T x)}{\partial w} = \dfrac{\partial w^T}{\partial w}\, x\, (w^T x) + (w^T x)\, \dfrac{\partial w^T}{\partial w}\, x = (x^T w)\, x^T + (w^T x)\, x^T$. Since vector inner products commute, this is $2\, (w^T x)\, x^T$.

So $\dfrac{\partial f}{\partial w} = -2 y x^T + 2\, (w^T x)\, x^T$
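The final expression can be checked against a finite-difference derivative. This sketch assumes NumPy; the values of $w$, $x$, $y$ are arbitrary choices of ours:

```python
import numpy as np

w = np.array([[0.5], [-1.0], [2.0]])
x = np.array([[1.0], [2.0], [-0.5]])
y = 0.7

f = lambda v: float((y - v.T @ x) ** 2)

# Analytic result from the text: a 1 x m row vector
analytic = -2 * y * x.T + 2 * float(w.T @ x) * x.T

# Central-difference check
eps = 1e-6
numeric = np.zeros((1, 3))
for i in range(3):
    dw = np.zeros_like(w)
    dw[i] = eps
    numeric[0, i] = (f(w + dw) - f(w - dw)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```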
of dimensions p × q, i.e. for each inner matrix, pre-multiply with a 1 × p row vector and post-
multiply with a q × 1 column vector to get a scalar. This gives a final matrix of dimensions
n × m.
Example: $f(W) = a^T W b$, where $a^T$ is $1 \times m$, $W$ is $m \times n$, and $b$ is $n \times 1$. This is a scalar, so $\frac{\partial f}{\partial W}$ should be a matrix which has transposed dimensions as $W$, i.e. $n \times m$. Now, $\frac{\partial f}{\partial W} = a^T \frac{\partial W}{\partial W} b$, where $\frac{\partial W}{\partial W}$ has dimensions $m \times n \times n \times m$. For example, if $m = 3$, $n = 2$, then:

$$\frac{\partial W}{\partial W} = \begin{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} & \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 0 \end{bmatrix} & \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix} \\[2ex] \begin{bmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} & \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} & \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix} \end{bmatrix} \tag{15}$$

Note that the $(i,j)$th inner matrix has a 1 in its $(j,i)$th position. Pre- and post-multiplying the $(i,j)$th inner matrix with $a^T$ and $b$ gives $a_j b_i$, where $i \in \{1, 2\}$ and $j \in \{1, 2, 3\}$. So:

$$a^T \frac{\partial W}{\partial W} b = \begin{bmatrix} a_1 b_1 & a_2 b_1 & a_3 b_1 \\ a_1 b_2 & a_2 b_2 & a_3 b_2 \end{bmatrix} \tag{16}$$

Thus, $\dfrac{\partial f}{\partial W} = b a^T$.
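The result $\frac{\partial f}{\partial W} = b a^T$ can also be confirmed numerically for the same $m = 3$, $n = 2$ case. This sketch assumes NumPy; the concrete values of $a$, $b$, $W$ are arbitrary choices of ours:

```python
import numpy as np

a = np.array([[1.0], [2.0], [-1.0]])                  # m x 1, m = 3
b = np.array([[0.5], [3.0]])                          # n x 1, n = 2
W = np.array([[1.0, 2.0], [0.0, -1.0], [4.0, 0.5]])   # m x n

f = lambda M: float(a.T @ M @ b)

# Entry (i, j) of the n x m derivative is df / dW[j, i]
eps = 1e-6
D = np.zeros((2, 3))
for i in range(2):
    for j in range(3):
        dW = np.zeros_like(W)
        dW[j, i] = eps
        D[i, j] = (f(W + dW) - f(W - dW)) / (2 * eps)

print(np.allclose(D, b @ a.T, atol=1e-8))  # True
```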
Example: let $f(x) = \|x - a\|_2 = \sqrt{(x-a)^T (x-a)}$, the Euclidean distance from $a$ to $x$. Using the chain rule with $\dfrac{\partial (x-a)^T (x-a)}{\partial x} = 2\, (x^T - a^T)$:

$$\frac{\partial f}{\partial x} = \frac{x^T - a^T}{\sqrt{(x-a)^T (x-a)}}$$

So $\nabla_x f = \dfrac{x - a}{\|x - a\|_2}$, which is basically the unit displacement vector from $a$ to $x$. This means that to get maximum increase in $f(x)$, one should move away from $a$ along the straight line joining $a$ and $x$. Alternatively, to get maximum decrease in $f(x)$, one should move from $x$ directly towards $a$, which makes sense geometrically.
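This gradient can also be checked by finite differences. This sketch assumes NumPy; the vectors $x$ and $a$ are arbitrary choices of ours:

```python
import numpy as np

x = np.array([[1.0], [2.0], [3.0], [4.0]])
a = np.array([[0.0], [1.0], [1.0], [2.0]])

f = lambda v: float(np.sqrt((v - a).T @ (v - a)))   # ||v - a||_2

analytic = (x - a) / np.linalg.norm(x - a)          # unit vector from a to x

# Central-difference gradient (column vector, same dims as x)
eps = 1e-6
numeric = np.zeros((4, 1))
for i in range(4):
    dx = np.zeros_like(x)
    dx[i] = eps
    numeric[i, 0] = (f(x + dx) - f(x - dx)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```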
5 Notes and Further Reading
The chain rule and product rule do not always hold when dealing with matrices. However,
some modified forms can hold when using the T race(·) function. For a full list of derivatives,
the reader should consult a textbook or websites such as Wikipedia’s page on Matrix calculus.
Keep in mind that some texts may use denominator layout convention, where results will look
different.