Vector and Matrix Calculus
Herman Kamper
[email protected]
30 January 2013
1 Introduction
As explained in detail in [1], there unfortunately exist multiple competing notations for
the layout of matrix derivatives. This can cause a lot of difficulty when consulting several
sources, since different sources might use different conventions. Some sources, for example [2]
(from which I use a lot of identities), even use a mixed layout (according to [1, Notes]). Identities
for both the numerator layout (sometimes called the Jacobian formulation) and the denominator
layout (sometimes called the Hessian formulation) are given in [1], which makes it easy to check
which layout a particular source uses. I will aim to stick to the denominator layout, which seems
to be the most widely used in the field of statistics and pattern recognition (e.g. [3] and [4,
pp. 327–332]). Other useful references concerning matrix calculus include [5] and [6]. In this
document column vectors are assumed in all cases except where specifically stated otherwise.
Table 1: Derivatives of scalars, vector functions and matrices [1, 6].
|                                                 | scalar $y$                                                               | column vector $\mathbf{y} \in \mathbb{R}^m$                                        | matrix $\mathbf{Y} \in \mathbb{R}^{m \times n}$                                 |
|-------------------------------------------------|--------------------------------------------------------------------------|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|
| scalar $x$                                      | scalar $\frac{\partial y}{\partial x}$                                   | row vector $\frac{\partial \mathbf{y}}{\partial x} \in \mathbb{R}^m$                | matrix $\frac{\partial \mathbf{Y}}{\partial x}$ (only numerator layout)           |
| column vector $\mathbf{x} \in \mathbb{R}^n$     | column vector $\frac{\partial y}{\partial \mathbf{x}} \in \mathbb{R}^n$  | matrix $\frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{n \times m}$ |                                                                                    |
| matrix $\mathbf{X} \in \mathbb{R}^{p \times q}$ | matrix $\frac{\partial y}{\partial \mathbf{X}} \in \mathbb{R}^{p \times q}$ |                                                                                     |                                                                                    |
2 Definitions
Table 1 indicates the six possible kinds of derivatives when using the denominator layout. Using
this layout notation consistently, we have the following definitions.
The derivative of a scalar function f : Rn → R with respect to vector x ∈ Rn is
∂f (x)
∂x1
∂f (x)
∂f (x) def ∂x2
= ..
(1)
∂x .
∂f (x)
∂xn
This is the transpose of the gradient (some authors simply call this the gradient, irrespective of
whether numerator or denominator layout is used).
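As a quick numerical sanity check of (1), the following NumPy sketch compares a central finite-difference approximation against the analytic derivative $2\mathbf{x}$ of $f(\mathbf{x}) = \mathbf{x}^\mathsf{T}\mathbf{x}$ (the helper name `num_grad` is mine, purely for illustration, and not from any of the cited references):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the column vector in (1)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([1.0, -2.0, 0.5])
f = lambda x: x @ x  # f(x) = x^T x, whose derivative (1) is 2x
print(np.allclose(num_grad(f, x), 2 * x, atol=1e-4))  # True
```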
The derivative of a vector function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) & f_2(\mathbf{x}) & \cdots & f_m(\mathbf{x}) \end{bmatrix}^\mathsf{T}$ and $\mathbf{x} \in \mathbb{R}^n$, with respect to scalar $x_i$ is

$$
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial x_i} \overset{\text{def}}{=} \begin{bmatrix} \dfrac{\partial f_1(\mathbf{x})}{\partial x_i} & \dfrac{\partial f_2(\mathbf{x})}{\partial x_i} & \cdots & \dfrac{\partial f_m(\mathbf{x})}{\partial x_i} \end{bmatrix} \tag{2}
$$
The derivative of a vector function $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) & f_2(\mathbf{x}) & \cdots & f_m(\mathbf{x}) \end{bmatrix}^\mathsf{T}$, with respect to vector $\mathbf{x} \in \mathbb{R}^n$ is

$$
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}} \overset{\text{def}}{=} \begin{bmatrix} \dfrac{\partial \mathbf{f}(\mathbf{x})}{\partial x_1} \\ \dfrac{\partial \mathbf{f}(\mathbf{x})}{\partial x_2} \\ \vdots \\ \dfrac{\partial \mathbf{f}(\mathbf{x})}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f_1(\mathbf{x})}{\partial x_1} & \dfrac{\partial f_2(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_m(\mathbf{x})}{\partial x_1} \\ \dfrac{\partial f_1(\mathbf{x})}{\partial x_2} & \dfrac{\partial f_2(\mathbf{x})}{\partial x_2} & \cdots & \dfrac{\partial f_m(\mathbf{x})}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial f_1(\mathbf{x})}{\partial x_n} & \dfrac{\partial f_2(\mathbf{x})}{\partial x_n} & \cdots & \dfrac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix} \tag{3}
$$
This is just the transpose of the Jacobian matrix.
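A small NumPy sketch of (3), checking that for the linear map $\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}$ the denominator-layout derivative is $\mathbf{A}^\mathsf{T}$, the transpose of the Jacobian $\mathbf{A}$ (this anticipates identity (15) below; the helper name is again mine):

```python
import numpy as np

def num_vec_deriv(f, x, eps=1e-6):
    """Approximate the n x m matrix in (3): row i holds the df_j/dx_i."""
    m = f(x).size
    D = np.zeros((x.size, m))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        D[i, :] = (f(x + e) - f(x - e)) / (2 * eps)
    return D

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # f(x) = Ax : R^2 -> R^3
x = np.array([0.3, -1.1])
print(np.allclose(num_vec_deriv(lambda x: A @ x, x), A.T, atol=1e-4))  # True
```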
The derivative of a scalar function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ with respect to matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ is

$$
\frac{\partial f(\mathbf{X})}{\partial \mathbf{X}} \overset{\text{def}}{=} \begin{bmatrix} \dfrac{\partial f(\mathbf{X})}{\partial X_{11}} & \dfrac{\partial f(\mathbf{X})}{\partial X_{12}} & \cdots & \dfrac{\partial f(\mathbf{X})}{\partial X_{1n}} \\ \dfrac{\partial f(\mathbf{X})}{\partial X_{21}} & \dfrac{\partial f(\mathbf{X})}{\partial X_{22}} & \cdots & \dfrac{\partial f(\mathbf{X})}{\partial X_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial f(\mathbf{X})}{\partial X_{m1}} & \dfrac{\partial f(\mathbf{X})}{\partial X_{m2}} & \cdots & \dfrac{\partial f(\mathbf{X})}{\partial X_{mn}} \end{bmatrix} \tag{4}
$$
Observe that (1) is just a special case of (4) for column vectors. Often (as in [3]) the gradient
notation is used as an alternative to the notation used above, for example:

$$
\nabla_{\mathbf{x}} f(\mathbf{x}) = \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \tag{5}
$$

$$
\nabla_{\mathbf{X}} f(\mathbf{X}) = \frac{\partial f(\mathbf{X})}{\partial \mathbf{X}} \tag{6}
$$
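Definition (4) can also be checked element by element. The sketch below perturbs each entry $X_{ij}$ in turn and compares against $\mathbf{a}\mathbf{b}^\mathsf{T}$, the well-known derivative of $f(\mathbf{X}) = \mathbf{a}^\mathsf{T}\mathbf{X}\mathbf{b}$ (see e.g. [2]); the helper name is once more my own:

```python
import numpy as np

def num_mat_deriv(f, X, eps=1e-6):
    """Approximate the m x n matrix in (4), with (i, j) entry df/dX_ij."""
    D = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            D[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return D

rng = np.random.default_rng(0)
a, b = rng.standard_normal(3), rng.standard_normal(2)
X = rng.standard_normal((3, 2))
# For f(X) = a^T X b the derivative (4) is the outer product a b^T.
print(np.allclose(num_mat_deriv(lambda X: a @ X @ b, X), np.outer(a, b),
                  atol=1e-4))  # True
```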
3 Identities
3.1 Scalar-by-vector product rule
If $\mathbf{a} \in \mathbb{R}^m$, $\mathbf{b} \in \mathbb{R}^n$ and $\mathbf{C} \in \mathbb{R}^{m \times n}$ then

$$
\mathbf{a}^\mathsf{T} \mathbf{C} \mathbf{b} = \sum_{i=1}^m a_i (\mathbf{C}\mathbf{b})_i = \sum_{i=1}^m a_i \sum_{j=1}^n C_{ij} b_j = \sum_{i=1}^m \sum_{j=1}^n C_{ij} a_i b_j \tag{7}
$$
Now assume we have vector functions $\mathbf{u} : \mathbb{R}^q \to \mathbb{R}^m$ and $\mathbf{v} : \mathbb{R}^q \to \mathbb{R}^n$, and a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$. The vector
functions $\mathbf{u}$ and $\mathbf{v}$ are functions of $\mathbf{x} \in \mathbb{R}^q$, but $\mathbf{A}$ is not. We want to find an identity for

$$
\frac{\partial \mathbf{u}^\mathsf{T} \mathbf{A} \mathbf{v}}{\partial \mathbf{x}} \tag{8}
$$
From (7), we have:

$$
\begin{aligned}
\left[ \frac{\partial \mathbf{u}^\mathsf{T} \mathbf{A} \mathbf{v}}{\partial \mathbf{x}} \right]_l = \frac{\partial \mathbf{u}^\mathsf{T} \mathbf{A} \mathbf{v}}{\partial x_l} &= \frac{\partial}{\partial x_l} \sum_{i=1}^m \sum_{j=1}^n A_{ij} u_i v_j \\
&= \sum_{i=1}^m \sum_{j=1}^n \frac{\partial}{\partial x_l} \left( A_{ij} u_i v_j \right) \\
&= \sum_{i=1}^m \sum_{j=1}^n A_{ij} \left( \frac{\partial u_i}{\partial x_l} v_j + u_i \frac{\partial v_j}{\partial x_l} \right) \\
&= \sum_{i=1}^m \sum_{j=1}^n A_{ij} \frac{\partial u_i}{\partial x_l} v_j + \sum_{i=1}^m \sum_{j=1}^n A_{ij} u_i \frac{\partial v_j}{\partial x_l}
\end{aligned} \tag{9}
$$
Now we can show (by writing out the elements [Notebook, 2012-05-22]) that:

$$
\begin{aligned}
\left[ \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \mathbf{A} \mathbf{v} + \frac{\partial \mathbf{v}}{\partial \mathbf{x}} \mathbf{A}^\mathsf{T} \mathbf{u} \right]_l &= \sum_{i=1}^m \sum_{j=1}^n \frac{\partial u_i}{\partial x_l} A_{ij} v_j + \sum_{i=1}^m \sum_{j=1}^n (\mathbf{A}^\mathsf{T})_{ji} \frac{\partial v_j}{\partial x_l} u_i \\
&= \sum_{i=1}^m \sum_{j=1}^n A_{ij} \frac{\partial u_i}{\partial x_l} v_j + \sum_{i=1}^m \sum_{j=1}^n A_{ij} u_i \frac{\partial v_j}{\partial x_l}
\end{aligned} \tag{10}
$$
A comparison of (9) and (10) completes the proof that

$$
\frac{\partial \mathbf{u}^\mathsf{T} \mathbf{A} \mathbf{v}}{\partial \mathbf{x}} = \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \mathbf{A} \mathbf{v} + \frac{\partial \mathbf{v}}{\partial \mathbf{x}} \mathbf{A}^\mathsf{T} \mathbf{u} \tag{11}
$$
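A numerical check of (11) is easy with the linear choices $\mathbf{u}(\mathbf{x}) = \mathbf{B}\mathbf{x}$ and $\mathbf{v}(\mathbf{x}) = \mathbf{D}\mathbf{x}$, for which $\partial\mathbf{u}/\partial\mathbf{x} = \mathbf{B}^\mathsf{T}$ and $\partial\mathbf{v}/\partial\mathbf{x} = \mathbf{D}^\mathsf{T}$ by (15) below. The following is a NumPy sketch; `num_grad` is the finite-difference helper from the earlier sketch, repeated so the snippet runs on its own:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
m, n, q = 3, 4, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, q))  # u(x) = Bx, so du/dx = B^T
D = rng.standard_normal((n, q))  # v(x) = Dx, so dv/dx = D^T
x = rng.standard_normal(q)

f = lambda x: (B @ x) @ A @ (D @ x)  # the scalar u^T A v
rhs = B.T @ A @ (D @ x) + D.T @ A.T @ (B @ x)  # du/dx Av + dv/dx A^T u
print(np.allclose(num_grad(f, x), rhs, atol=1e-4))  # True
```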
3.2 Useful identities from scalar-by-vector product rule
From (11) it follows, with vectors and matrices $\mathbf{b} \in \mathbb{R}^m$, $\mathbf{d} \in \mathbb{R}^q$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{B} \in \mathbb{R}^{m \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times q}$,
$\mathbf{D} \in \mathbb{R}^{q \times n}$, that

$$
\frac{\partial (\mathbf{B}\mathbf{x} + \mathbf{b})^\mathsf{T} \mathbf{C} (\mathbf{D}\mathbf{x} + \mathbf{d})}{\partial \mathbf{x}} = \frac{\partial (\mathbf{B}\mathbf{x} + \mathbf{b})}{\partial \mathbf{x}} \mathbf{C} (\mathbf{D}\mathbf{x} + \mathbf{d}) + \frac{\partial (\mathbf{D}\mathbf{x} + \mathbf{d})}{\partial \mathbf{x}} \mathbf{C}^\mathsf{T} (\mathbf{B}\mathbf{x} + \mathbf{b}) \tag{12}
$$

resulting in the identity:

$$
\frac{\partial (\mathbf{B}\mathbf{x} + \mathbf{b})^\mathsf{T} \mathbf{C} (\mathbf{D}\mathbf{x} + \mathbf{d})}{\partial \mathbf{x}} = \mathbf{B}^\mathsf{T} \mathbf{C} (\mathbf{D}\mathbf{x} + \mathbf{d}) + \mathbf{D}^\mathsf{T} \mathbf{C}^\mathsf{T} (\mathbf{B}\mathbf{x} + \mathbf{b}) \tag{13}
$$
by using the easily verifiable identities:

$$
\frac{\partial (\mathbf{u}(\mathbf{x}) + \mathbf{v}(\mathbf{x}))}{\partial \mathbf{x}} = \frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}} + \frac{\partial \mathbf{v}(\mathbf{x})}{\partial \mathbf{x}} \tag{14}
$$

$$
\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}^\mathsf{T} \tag{15}
$$

$$
\frac{\partial \mathbf{a}}{\partial \mathbf{x}} = \mathbf{0} \tag{16}
$$
Some other useful special cases of (11):

$$
\frac{\partial \mathbf{x}^\mathsf{T} \mathbf{A} \mathbf{b}}{\partial \mathbf{x}} = \mathbf{A}\mathbf{b} \tag{17}
$$

$$
\frac{\partial \mathbf{x}^\mathsf{T} \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^\mathsf{T})\mathbf{x} \tag{18}
$$

$$
\frac{\partial \mathbf{x}^\mathsf{T} \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = 2\mathbf{A}\mathbf{x} \quad \text{if } \mathbf{A} \text{ is symmetric} \tag{19}
$$
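The same finite-difference helper verifies (13), (18) and (19) directly. Again, this is only a NumPy sketch under the dimension assumptions stated above (12):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(2)
m, n, q = 3, 5, 2
B, b = rng.standard_normal((m, n)), rng.standard_normal(m)
C, d = rng.standard_normal((m, q)), rng.standard_normal(q)
D = rng.standard_normal((q, n))
x = rng.standard_normal(n)

# Identity (13):
f = lambda x: (B @ x + b) @ C @ (D @ x + d)
rhs = B.T @ C @ (D @ x + d) + D.T @ C.T @ (B @ x + b)
print(np.allclose(num_grad(f, x), rhs, atol=1e-4))  # True

# Identities (18) and (19):
A = rng.standard_normal((n, n))
print(np.allclose(num_grad(lambda x: x @ A @ x, x), (A + A.T) @ x,
                  atol=1e-4))  # True
S = A + A.T  # a symmetric matrix
print(np.allclose(num_grad(lambda x: x @ S @ x, x), 2 * S @ x,
                  atol=1e-4))  # True
```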
3.3 Derivatives of determinant
See [7, p. 374] for the definition of cofactors. Also see [Notebook, 2012-05-22].
We can write the determinant of matrix $\mathbf{X} \in \mathbb{R}^{n \times n}$ as the cofactor expansion along row $i$:

$$
|\mathbf{X}| = X_{i1} C_{i1} + X_{i2} C_{i2} + \ldots + X_{in} C_{in} = \sum_{j=1}^n X_{ij} C_{ij} \tag{20}
$$
Thus the derivative will be

$$
\begin{aligned}
\left[ \frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} \right]_{kl} &= \frac{\partial}{\partial X_{kl}} \left\{ X_{i1} C_{i1} + X_{i2} C_{i2} + \ldots + X_{in} C_{in} \right\} \\
&= \frac{\partial}{\partial X_{kl}} \left\{ X_{k1} C_{k1} + X_{k2} C_{k2} + \ldots + X_{kn} C_{kn} \right\} \quad \text{(we can choose } i \text{ to be any row, so choose } i = k \text{)} \\
&= C_{kl}
\end{aligned} \tag{21}
$$

The last step follows since the cofactors $C_{k1}, \ldots, C_{kn}$ are computed from submatrices that exclude row $k$, and therefore do not depend on $X_{kl}$.
Thus (see [7, p. 386])

$$
\frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} = \operatorname{cofactor} \mathbf{X} = (\operatorname{adj} \mathbf{X})^\mathsf{T} \tag{22}
$$
But we know that the inverse of $\mathbf{X}$ is given by [7, p. 387]

$$
\mathbf{X}^{-1} = \frac{1}{|\mathbf{X}|} \operatorname{adj} \mathbf{X} \tag{23}
$$

thus

$$
\operatorname{adj} \mathbf{X} = |\mathbf{X}| \mathbf{X}^{-1} \tag{24}
$$
which, when substituted into (22), results in the identity

$$
\frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} = |\mathbf{X}| (\mathbf{X}^{-1})^\mathsf{T} \tag{25}
$$
From (25) we can also write

$$
\left[ \frac{\partial \ln |\mathbf{X}|}{\partial \mathbf{X}} \right]_{kl} = \frac{\partial \ln |\mathbf{X}|}{\partial X_{kl}} = \frac{1}{|\mathbf{X}|} \frac{\partial |\mathbf{X}|}{\partial X_{kl}} \quad \Rightarrow \quad \frac{\partial \ln |\mathbf{X}|}{\partial \mathbf{X}} = \frac{1}{|\mathbf{X}|} |\mathbf{X}| (\mathbf{X}^{-1})^\mathsf{T} \tag{26}
$$

giving the identity

$$
\frac{\partial \ln |\mathbf{X}|}{\partial \mathbf{X}} = (\mathbf{X}^{-1})^\mathsf{T} \tag{27}
$$
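Both determinant identities are easy to confirm numerically. The sketch below reuses the element-wise helper from the earlier matrix-derivative example and assumes a matrix with a safely positive determinant so that $\ln |\mathbf{X}|$ is defined:

```python
import numpy as np

def num_mat_deriv(f, X, eps=1e-6):
    D = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            D[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return D

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 3)) + 5 * np.eye(3)  # keeps |X| safely positive
inv_T = np.linalg.inv(X).T

# Identity (25): d|X|/dX = |X| (X^{-1})^T
print(np.allclose(num_mat_deriv(np.linalg.det, X),
                  np.linalg.det(X) * inv_T, atol=1e-4))  # True
# Identity (27): d ln|X|/dX = (X^{-1})^T
print(np.allclose(num_mat_deriv(lambda X: np.log(np.linalg.det(X)), X),
                  inv_T, atol=1e-4))  # True
```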
References

[1] Matrix calculus. [Online]. Available: http://en.wikipedia.org/wiki/Matrix_calculus

[2] K. B. Petersen and M. S. Pedersen, "The Matrix Cookbook," 2008.

[3] A. Ng, Machine Learning. Class notes for CS229, Stanford Engineering Everywhere, Stanford University, 2008. [Online]. Available: http://see.stanford.edu

[4] S. R. Searle, Matrix Algebra Useful for Statistics. New York, NY: John Wiley & Sons, 1982.

[5] J. R. Schott, Matrix Analysis for Statistics. New York, NY: John Wiley & Sons, 1996.

[6] T. P. Minka, "Old and new matrix algebra useful for statistics," 2000. [Online]. Available: http://research.microsoft.com/en-us/um/people/minka/papers/matrix

[7] D. G. Zill and M. R. Cullen, Advanced Engineering Mathematics, 3rd ed. Jones and Bartlett, 2006.