
Matrix Calculus

Sourya Dey

1 Notation
• Scalars are written as lower case letters.
• Vectors are written as lower case bold letters, such as x, and can be either row (dimensions
1×n) or column (dimensions n×1). Column vectors are the default choice, unless otherwise
mentioned. Individual elements are indexed by subscripts, such as xi (i ∈ {1, · · · , n}).
• Matrices are written as upper case bold letters, such as X, and have dimensions m × n
corresponding to m rows and n columns. Individual elements are indexed by double
subscripts for row and column, such as Xij (i ∈ {1, · · · , m}, j ∈ {1, · · · , n}).

• Occasionally higher order tensors occur, such as 3rd order with dimensions m × n × p, etc.

Note that a matrix is a 2nd order tensor. A row vector is a matrix with 1 row, and a column
vector is a matrix with 1 column. A scalar is a matrix with 1 row and 1 column. Essentially,
scalars and vectors are special cases of matrices.
The derivative of $f$ with respect to $x$ is $\frac{\partial f}{\partial x}$. Both $x$ and $f$ can be a scalar, vector, or matrix, leading to 9 types of derivatives. The gradient of $f$ w.r.t. $x$ is
\[
\nabla_x f = \left( \frac{\partial f}{\partial x} \right)^T
\]
i.e. the gradient is the transpose of the derivative. The gradient at any point $x_0$ in the domain has a physical interpretation: its direction is the direction of maximum increase of the function $f$ at the point $x_0$, and its magnitude is the rate of increase in that direction. We do not generally deal with the gradient when $x$ is a scalar.

2 Basic Rules

This document follows the numerator layout convention. There is an alternative denominator layout convention, in which several results are transposed. Do not mix different layout conventions.

We’ll first state the most general matrix-matrix derivative type. All other types are simplifications, since scalars and vectors are special cases of matrices. Consider a function $F(\cdot)$ which maps $m \times n$ matrices to $p \times q$ matrices, i.e. domain $\subset \mathbb{R}^{m \times n}$ and range $\subset \mathbb{R}^{p \times q}$. So $F(\cdot) : X \to F(X)$, where $X$ is $m \times n$ and $F(X)$ is $p \times q$. Its derivative $\frac{\partial F}{\partial X}$ is a 4th order tensor of dimensions $p \times q \times n \times m$. This is an outer matrix of dimensions $n \times m$ (transposed dimensions of the denominator $X$), with each element being a $p \times q$ inner matrix (same dimensions as the numerator $F$). It is given as:

\[
\frac{\partial F}{\partial X} =
\begin{bmatrix}
\dfrac{\partial F}{\partial X_{1,1}} & \cdots & \dfrac{\partial F}{\partial X_{m,1}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial F}{\partial X_{1,n}} & \cdots & \dfrac{\partial F}{\partial X_{m,n}}
\end{bmatrix} \tag{1a}
\]

which has $n$ rows and $m$ columns, and the $(i,j)$th element is given as:

\[
\frac{\partial F}{\partial X_{i,j}} =
\begin{bmatrix}
\dfrac{\partial F_{1,1}}{\partial X_{i,j}} & \cdots & \dfrac{\partial F_{1,q}}{\partial X_{i,j}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial F_{p,1}}{\partial X_{i,j}} & \cdots & \dfrac{\partial F_{p,q}}{\partial X_{i,j}}
\end{bmatrix} \tag{1b}
\]

which has $p$ rows and $q$ columns.

Whew! Now that that’s out of the way, let’s get to some general rules (in the following, $x$ and $y$ can represent a scalar, vector, or matrix):

• The derivative $\frac{\partial y}{\partial x}$ always has outer matrix dimensions equal to the transposed dimensions of the denominator $x$, and each individual element (inner matrix) has the same dimensions as the numerator $y$. If you do a calculation and the dimensions don’t come out right, the answer is not correct.

• Derivatives usually obey the chain rule, i.e. $\dfrac{\partial f(g(x))}{\partial x} = \dfrac{\partial f(g(x))}{\partial g(x)} \dfrac{\partial g(x)}{\partial x}$.

• Derivatives usually obey the product rule, i.e. $\dfrac{\partial f(x)g(x)}{\partial x} = f(x)\dfrac{\partial g(x)}{\partial x} + g(x)\dfrac{\partial f(x)}{\partial x}$.
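The chain rule above can be checked numerically. Below is a minimal sketch using finite differences, with illustrative functions $g$ and $f$ chosen here (not taken from the text): the Jacobian of the composite matches the product of the individual Jacobians.

```python
import numpy as np

def g(x):
    # illustrative inner function, R^2 -> R^2
    return np.array([x[0] * x[1], x[0] + x[1]])

def f(u):
    # illustrative outer function, R^2 -> R^2
    return np.array([np.sin(u[0]), u[0] * u[1]])

def jacobian(func, x, eps=1e-6):
    """Finite-difference Jacobian: (i, j) entry is d(func_i)/d(x_j)."""
    fx = func(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (func(x + dx) - fx) / eps
    return J

x0 = np.array([0.7, -1.3])
lhs = jacobian(lambda x: f(g(x)), x0)       # d f(g(x)) / dx directly
rhs = jacobian(f, g(x0)) @ jacobian(g, x0)  # (df/dg)(dg/dx)
print(np.allclose(lhs, rhs, atol=1e-4))     # True
```

Note that the order of the two factors in `rhs` matters: dimensions only line up one way, which is the dimension-checking rule in action.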

3 Types of derivatives

3.1 Scalar by scalar

Nothing special here. The derivative is a scalar, and can also be written as f 0 (x). For example,
if f (x) = sin x, then f 0 (x) = cos x.

3.2 Scalar by vector

$f(\cdot) : x \to f(x)$, where $x$ is $m \times 1$ and $f(x)$ is $1 \times 1$. For this, the derivative is a $1 \times m$ row vector:

\[
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_m}
\end{bmatrix} \tag{2}
\]

The gradient $\nabla_x f$ is its transposed column vector.

3.3 Vector by scalar

$f(\cdot) : x \to f(x)$, where $x$ is $1 \times 1$ and $f(x)$ is $n \times 1$. For this, the derivative is an $n \times 1$ column vector:

\[
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x} \\[1ex]
\dfrac{\partial f_2}{\partial x} \\
\vdots \\
\dfrac{\partial f_n}{\partial x}
\end{bmatrix} \tag{3}
\]

3.4 Vector by vector

$f(\cdot) : x \to f(x)$, where $x$ is $m \times 1$ and $f(x)$ is $n \times 1$. The derivative, also known as the Jacobian, is a matrix of dimensions $n \times m$. Its $(i,j)$th element is the scalar derivative of the $i$th output component w.r.t. the $j$th input component, i.e.:

\[
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_n}{\partial x_1} & \cdots & \dfrac{\partial f_n}{\partial x_m}
\end{bmatrix} \tag{4}
\]
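As a concrete sketch of (4), here is an illustrative $n = m = 2$ function (chosen for this example, not from the text) whose analytic Jacobian is checked against finite differences:

```python
import numpy as np

def f(x):
    # illustrative f: R^2 -> R^2
    return np.array([x[0] * x[1], np.sin(x[0])])

def analytic_jacobian(x):
    # (i, j) entry is d f_i / d x_j, per equation (4)
    return np.array([[x[1],         x[0]],
                     [np.cos(x[0]), 0.0]])

x0 = np.array([0.5, 2.0])
eps = 1e-6
J_num = np.zeros((2, 2))
for j in range(2):
    dx = np.zeros(2); dx[j] = eps
    J_num[:, j] = (f(x0 + dx) - f(x0)) / eps

print(np.allclose(J_num, analytic_jacobian(x0), atol=1e-4))  # True
```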

3.4.1 Special case – Vectorized scalar function

This is a scalar-scalar function applied element-wise to a vector, and is denoted by $f(\cdot) : x \to f(x)$, where both $x$ and $f(x)$ are $m \times 1$. For example:

\[
f\left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \right) =
\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_m) \end{bmatrix} \tag{5}
\]

In this case, both the derivative and the gradient are the same $m \times m$ diagonal matrix, given as:

\[
\nabla_x f = \frac{\partial f}{\partial x} =
\begin{bmatrix}
f'(x_1) & & & \\
& f'(x_2) & & \\
& & \ddots & \\
& & & f'(x_m)
\end{bmatrix} \tag{6}
\]

where $f'(x_i) = \dfrac{\partial f(x_i)}{\partial x_i}$.

Note: Some texts take the derivative of a vectorized scalar function by taking element-wise derivatives to get an $m \times 1$ vector. To avoid confusion with (6), we will refer to this as $f'(x)$:

\[
f'(x) = \begin{bmatrix} f'(x_1) \\ f'(x_2) \\ \vdots \\ f'(x_m) \end{bmatrix} \tag{7}
\]

To see the effect of this, let’s say we want to multiply the gradient from (6) with some $m$-dimensional vector $a$. This results in:

\[
\nabla_x f \, a =
\begin{bmatrix} f'(x_1)\,a_1 \\ f'(x_2)\,a_2 \\ \vdots \\ f'(x_m)\,a_m \end{bmatrix} \tag{8}
\]

Achieving the same result with $f'(x)$ from (7) requires the Hadamard product $\circ$, defined as element-wise multiplication of two vectors:

\[
f'(x) \circ a =
\begin{bmatrix} f'(x_1)\,a_1 \\ f'(x_2)\,a_2 \\ \vdots \\ f'(x_m)\,a_m \end{bmatrix} \tag{9}
\]
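A quick numerical sketch of the equivalence between (8) and (9), using $f = \tanh$ (an illustrative choice, so $f'(x) = 1 - \tanh^2(x)$):

```python
import numpy as np

x = np.array([0.1, -0.5, 2.0])
a = np.array([1.0, 2.0, 3.0])

fprime = 1 - np.tanh(x) ** 2   # f'(x) as in (7), an m-vector
grad = np.diag(fprime)         # gradient as in (6), m x m diagonal

# Matrix product with the diagonal gradient equals the Hadamard product
print(np.allclose(grad @ a, fprime * a))  # True
```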

3.4.2 Special Case – Hessian

Consider the type of function in Sec. 3.2, i.e. $f(\cdot) : x \to f(x)$, where $x$ is $m \times 1$ and $f(x)$ is a scalar. Its gradient is a vector-to-vector function given as $\nabla_x f(\cdot) : x \to \nabla_x f(x)$, mapping $m \times 1$ to $m \times 1$. The transpose of its derivative is the Hessian:

\[
H =
\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_m} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_m \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_m^2}
\end{bmatrix} \tag{10}
\]

i.e. $H = \left( \dfrac{\partial \nabla_x f}{\partial x} \right)^T$. If the derivatives are continuous, then $\dfrac{\partial^2 f}{\partial x_i \partial x_j} = \dfrac{\partial^2 f}{\partial x_j \partial x_i}$, so the Hessian is symmetric.
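The symmetry of (10) can be checked numerically. Here is a minimal sketch using the illustrative function $f(x) = x_1^2 x_2 + \sin(x_2)$ (chosen for this example), differentiating its analytic gradient by finite differences:

```python
import numpy as np

def grad_f(x):
    # analytic gradient of f(x) = x1^2 * x2 + sin(x2)
    return np.array([2 * x[0] * x[1],
                     x[0] ** 2 + np.cos(x[1])])

x0 = np.array([1.0, 0.5])
eps = 1e-6
H = np.zeros((2, 2))
for j in range(2):
    dx = np.zeros(2); dx[j] = eps
    H[:, j] = (grad_f(x0 + dx) - grad_f(x0)) / eps

print(np.allclose(H, H.T, atol=1e-4))  # True: the Hessian is symmetric
```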

3.5 Scalar by matrix

$f(\cdot) : X \to f(X)$, where $X$ is $m \times n$ and $f(X)$ is $1 \times 1$. In this case, the derivative is an $n \times m$ matrix:

\[
\frac{\partial f}{\partial X} =
\begin{bmatrix}
\dfrac{\partial f}{\partial X_{1,1}} & \cdots & \dfrac{\partial f}{\partial X_{m,1}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f}{\partial X_{1,n}} & \cdots & \dfrac{\partial f}{\partial X_{m,n}}
\end{bmatrix} \tag{11}
\]

The gradient has the same dimensions as the input matrix, i.e. $m \times n$.
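A sketch of (11) with the illustrative function $f(X) = \sum_{i,j} X_{i,j}^2$ (chosen here, not from the text), for which $\partial f / \partial X_{i,j} = 2 X_{i,j}$; in numerator layout the derivative is the $n \times m$ matrix $2X^T$, while the gradient is $2X$:

```python
import numpy as np

m, n = 3, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))

f = lambda X: np.sum(X ** 2)

eps = 1e-6
deriv = np.zeros((n, m))  # transposed dimensions of the denominator X
for i in range(m):
    for j in range(n):
        dX = np.zeros((m, n)); dX[i, j] = eps
        deriv[j, i] = (f(X + dX) - f(X)) / eps  # entry for X_{i,j} at (j,i)

print(np.allclose(deriv, 2 * X.T, atol=1e-4))  # True
```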

3.6 Matrix by scalar

$F(\cdot) : x \to F(x)$, where $x$ is $1 \times 1$ and $F(x)$ is $p \times q$. In this case, the derivative is a $p \times q$ matrix:

\[
\frac{\partial F}{\partial x} =
\begin{bmatrix}
\dfrac{\partial F_{1,1}}{\partial x} & \cdots & \dfrac{\partial F_{1,q}}{\partial x} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial F_{p,1}}{\partial x} & \cdots & \dfrac{\partial F_{p,q}}{\partial x}
\end{bmatrix} \tag{12}
\]

3.7 Vector by matrix

$f(\cdot) : X \to f(X)$, where $X$ is $m \times n$ and $f(X)$ is $p \times 1$. In this case, the derivative is a 3rd-order tensor of dimensions $p \times n \times m$. This is the same $n \times m$ matrix as in (11), but with $f$ replaced by the $p$-dimensional vector $f$, i.e.:

\[
\frac{\partial f}{\partial X} =
\begin{bmatrix}
\dfrac{\partial f}{\partial X_{1,1}} & \cdots & \dfrac{\partial f}{\partial X_{m,1}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f}{\partial X_{1,n}} & \cdots & \dfrac{\partial f}{\partial X_{m,n}}
\end{bmatrix} \tag{13}
\]

3.8 Matrix by vector

$F(\cdot) : x \to F(x)$, where $x$ is $m \times 1$ and $F(x)$ is $p \times q$. In this case, the derivative is a 3rd-order tensor of dimensions $p \times q \times m$. This is the same $1 \times m$ row vector as in (2), but with $f$ replaced by the $p \times q$ matrix $F$, i.e.:

\[
\frac{\partial F}{\partial x} =
\begin{bmatrix}
\dfrac{\partial F}{\partial x_1} & \dfrac{\partial F}{\partial x_2} & \cdots & \dfrac{\partial F}{\partial x_m}
\end{bmatrix} \tag{14}
\]

4 Operations and Examples

4.1 Commutation

If things normally don’t commute (such as matrices, where $AB \neq BA$), then order should be maintained when taking derivatives. If things normally commute (such as a vector inner product, where $a \cdot b = b \cdot a$), their order can be switched when taking derivatives. Output dimensions must always come out right.
For example, let $f(x) = (a^T x)\, b$, where $a$ and $x$ are $m \times 1$ and $b$ is $n \times 1$, so $f(x)$ is $n \times 1$. The derivative $\frac{\partial f}{\partial x}$ should be an $n \times m$ matrix. Keeping the order fixed, we get $\frac{\partial f}{\partial x} = a^T \frac{\partial x}{\partial x} b = a^T I b = a^T b$. This is a scalar, which is wrong! The solution? Note that $a^T x$ is a scalar, which can sit either to the right or the left of the vector $b$, i.e. ordering doesn’t really matter. Rewriting $f(x) = b\,(a^T x)$, we get $\frac{\partial f}{\partial x} = b a^T \frac{\partial x}{\partial x} = b a^T I = b a^T$, which is the correct $n \times m$ matrix.

If this seems confusing, it might be useful to take a simple example with low values of $m$ and $n$, and write out the full derivative in matrix form as shown in (4). The resulting matrix will be $ba^T$.
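Writing out the full derivative numerically confirms this. A minimal sketch with illustrative $m = 3$, $n = 2$ and random $a$, $b$, $x$, comparing the finite-difference Jacobian of $f(x) = (a^T x)\,b$ against $ba^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 2
a = rng.standard_normal(m)
b = rng.standard_normal(n)
x0 = rng.standard_normal(m)

f = lambda x: (a @ x) * b   # f(x) = (a^T x) b, an n-vector

eps = 1e-6
J = np.zeros((n, m))
for j in range(m):
    dx = np.zeros(m); dx[j] = eps
    J[:, j] = (f(x0 + dx) - f(x0)) / eps

print(np.allclose(J, np.outer(b, a), atol=1e-4))  # True: Jacobian is b a^T
```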

4.2 Derivative of a transposed vector

The derivative of a transposed vector w.r.t. itself is the identity matrix, but the transpose gets applied to everything after. For example, let $f(w) = (y - w^T x)^2 = y^2 - (w^T x)\,y - y\,(w^T x) + (w^T x)(w^T x)$, where $y$ and $x$ are not functions of $w$. Taking the derivative of the terms individually:

• $\dfrac{\partial y^2}{\partial w} = 0^T$, i.e. a row vector of all 0s.

• $\dfrac{\partial (w^T x)\,y}{\partial w} = \dfrac{\partial w^T}{\partial w}\, xy = (xy)^T = y^T x^T$. Since $y$ is a scalar, this is simply $y x^T$.

• $\dfrac{\partial y\,(w^T x)}{\partial w} = y\, \dfrac{\partial w^T}{\partial w}\, x = y x^T$

• $\dfrac{\partial (w^T x)(w^T x)}{\partial w} = \dfrac{\partial w^T}{\partial w}\, x\,(w^T x) + (w^T x)\, \dfrac{\partial w^T}{\partial w}\, x = (x^T w)\, x^T + (w^T x)\, x^T$. Since vector inner products commute, this is $2\,(w^T x)\, x^T$.

So $\dfrac{\partial f}{\partial w} = -2y x^T + 2\,(w^T x)\, x^T$.
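The final row-vector derivative above can be checked against finite differences. A minimal sketch with illustrative random values:

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.standard_normal(4)
x = rng.standard_normal(4)
y = 1.5

f = lambda w: (y - w @ x) ** 2

eps = 1e-6
deriv_num = np.array([(f(w + eps * np.eye(4)[j]) - f(w)) / eps
                      for j in range(4)])
deriv = -2 * y * x + 2 * (w @ x) * x   # the row vector derived in the text

print(np.allclose(deriv_num, deriv, atol=1e-4))  # True
```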

4.3 Dealing with tensors

A tensor of dimensions $p \times q \times n \times m$ (such as given in (1)) can be pre- and post-multiplied by vectors just like an ordinary matrix. These vectors must be compatible with the inner matrices of dimensions $p \times q$, i.e. for each inner matrix, pre-multiply with a $1 \times p$ row vector and post-multiply with a $q \times 1$ column vector to get a scalar. This gives a final matrix of dimensions $n \times m$.
Example: $f(W) = a^T W b$, where $a^T$ is $1 \times m$, $W$ is $m \times n$, and $b$ is $n \times 1$. This is a scalar, so $\frac{\partial f}{\partial W}$ should be a matrix with transposed dimensions of $W$, i.e. $n \times m$. Now, $\frac{\partial f}{\partial W} = a^T \frac{\partial W}{\partial W} b$, where $\frac{\partial W}{\partial W}$ has dimensions $m \times n \times n \times m$. For example, if $m = 3$, $n = 2$, then:

\[
\frac{\partial W}{\partial W} =
\begin{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} &
\begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 0 \end{bmatrix} &
\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix} \\[2ex]
\begin{bmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} &
\begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} &
\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}
\end{bmatrix} \tag{15}
\]

Note that the $(i,j)$th inner matrix has a 1 in its $(j,i)$th position. Pre- and post-multiplying the $(i,j)$th inner matrix with $a^T$ and $b$ gives $a_j b_i$, where $i \in \{1, 2\}$ and $j \in \{1, 2, 3\}$. So:

\[
a^T \frac{\partial W}{\partial W} b =
\begin{bmatrix}
a_1 b_1 & a_2 b_1 & a_3 b_1 \\
a_1 b_2 & a_2 b_2 & a_3 b_2
\end{bmatrix} \tag{16}
\]

Thus, $\dfrac{\partial f}{\partial W} = b a^T$.
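The result $\frac{\partial f}{\partial W} = ba^T$ can be verified without ever forming the 4th-order tensor, by perturbing each entry of $W$ in turn. A minimal sketch with the text's $m = 3$, $n = 2$ and illustrative random values:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 2
a = rng.standard_normal(m)
b = rng.standard_normal(n)
W = rng.standard_normal((m, n))

f = lambda W: a @ W @ b   # f(W) = a^T W b, a scalar

eps = 1e-6
deriv = np.zeros((n, m))  # transposed dimensions of the denominator W
for i in range(m):
    for j in range(n):
        dW = np.zeros((m, n)); dW[i, j] = eps
        deriv[j, i] = (f(W + dW) - f(W)) / eps  # entry for W_{i,j} at (j,i)

print(np.allclose(deriv, np.outer(b, a), atol=1e-4))  # True: equals b a^T
```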

4.4 Gradient Example: L2 Norm

Problem: Given $f(x) = \lVert x - a \rVert_2$, find $\nabla_x f$.

Note that $\lVert x - a \rVert_2 = \sqrt{(x-a)^T (x-a)}$, which is a scalar. So the derivative will be a row vector, and the gradient will be a column vector of the same dimensions as $x$. Let’s use the chain rule:

\[
\frac{\partial f}{\partial x} = \frac{\partial \sqrt{(x-a)^T (x-a)}}{\partial \left[ (x-a)^T (x-a) \right]} \times \frac{\partial (x-a)^T (x-a)}{\partial x} \tag{17}
\]

The first term is a scalar-scalar derivative equal to $\dfrac{1}{2\sqrt{(x-a)^T (x-a)}}$. The second term is:

\[
\begin{aligned}
\frac{\partial (x-a)^T (x-a)}{\partial x} &= \frac{\partial \left( x^T x - a^T x - x^T a + a^T a \right)}{\partial x} \\
&= x^T + x^T - a^T - a^T + 0^T \\
&= 2\left( x^T - a^T \right)
\end{aligned} \tag{18}
\]

So $\dfrac{\partial f}{\partial x} = \dfrac{x^T - a^T}{\sqrt{(x-a)^T (x-a)}}$.

So $\nabla_x f = \dfrac{x - a}{\lVert x - a \rVert_2}$, which is simply the unit displacement vector from $a$ to $x$. This means that to get the maximum increase in $f(x)$, one should move away from $a$ along the straight line joining $a$ and $x$. Alternatively, to get the maximum decrease in $f(x)$, one should move from $x$ directly towards $a$, which makes sense geometrically.
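The unit-vector gradient derived above can be checked by finite differences. A minimal sketch with illustrative random points (valid as long as $x \neq a$, since the norm is not differentiable at $x = a$):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(5)
a = rng.standard_normal(5)

f = lambda x: np.linalg.norm(x - a)   # f(x) = ||x - a||_2

eps = 1e-6
grad_num = np.array([(f(x + eps * np.eye(5)[j]) - f(x)) / eps
                     for j in range(5)])
grad = (x - a) / np.linalg.norm(x - a)   # the unit displacement vector

print(np.allclose(grad_num, grad, atol=1e-4))  # True
```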

5 Notes and Further Reading

The chain rule and product rule do not always hold when dealing with matrices. However, some modified forms can hold when using the $\mathrm{Trace}(\cdot)$ function. For a full list of derivatives, the reader should consult a textbook or websites such as Wikipedia’s page on Matrix calculus. Keep in mind that some texts use the denominator layout convention, where results will look different.
