2 Linear Algebra

When formalizing intuitive concepts, a common approach is to construct a set of objects (symbols) and a set of rules to manipulate these objects. This is known as an algebra. Linear algebra is the study of vectors and certain rules to manipulate vectors. The vectors many of us know from school are called "geometric vectors", which are usually denoted by a small arrow above the letter, e.g., $\vec{x}$ and $\vec{y}$. In this book, we discuss more general concepts of vectors and use a bold letter to represent them, e.g., x and y.

In general, vectors are special objects that can be added together and multiplied by scalars to produce another object of the same kind. From an abstract mathematical viewpoint, any object that satisfies these two properties can be considered a vector. Here are some examples of such vector objects:

1. Geometric vectors. This example of a vector may be familiar from high school mathematics and physics. Geometric vectors – see Figure 2.1(a) – are directed segments, which can be drawn (at least in two dimensions). Two geometric vectors $\vec{x}, \vec{y}$ can be added, such that $\vec{x} + \vec{y} = \vec{z}$ is another geometric vector. Furthermore, multiplication by a scalar $\lambda\vec{x}$, $\lambda \in \mathbb{R}$, is also a geometric vector. In fact, it is the original vector scaled by $\lambda$. Therefore, geometric vectors are instances of the vector concepts introduced previously. Interpreting vectors as geometric vectors enables us to use our intuitions about direction and magnitude to reason about mathematical operations.
2. Polynomials are also vectors; see Figure 2.1(b): Two polynomials can be added together, which results in another polynomial; and they can be multiplied by a scalar λ ∈ R, and the result is a polynomial as well. Therefore, polynomials are (rather unusual) instances of vectors. Note that polynomials are very different from geometric vectors. While geometric vectors are concrete "drawings", polynomials are abstract concepts. However, they are both vectors in the sense previously described.

Figure 2.1 Different types of vectors. Vectors can be surprising objects, including (a) geometric vectors and (b) polynomials.

This material is published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong (2020). This version is free to view and download for personal use only. Not for re-distribution, re-sale, or use in derivative works. © by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2021. [Link]
3. Audio signals are vectors. Audio signals are represented as a series of
numbers. We can add audio signals together, and their sum is a new
audio signal. If we scale an audio signal, we also obtain an audio signal.
Therefore, audio signals are a type of vector, too.
4. Elements of Rn (tuples of n real numbers) are vectors. Rn is more
abstract than polynomials, and it is the concept we focus on in this
book. For instance,
 
$$a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3 \tag{2.1}$$

is an example of a triplet of numbers. Adding two vectors a, b ∈ Rn component-wise results in another vector: a + b = c ∈ Rn. Moreover, multiplying a ∈ Rn by λ ∈ R results in a scaled vector λa ∈ Rn.
[Margin note: Be careful to check whether array operations actually perform vector operations when implementing on a computer.]

Considering vectors as elements of Rn has an additional benefit that it loosely corresponds to arrays of real numbers on a computer. Many programming languages support array operations, which allow for convenient implementation of algorithms that involve vector operations.

Linear algebra focuses on the similarities between these vector concepts. We can add them together and multiply them by scalars. We will largely focus on vectors in Rn since most algorithms in linear algebra are formulated in Rn. We will see in Chapter 8 that we often consider data to be represented as vectors in Rn. In this book, we will focus on finite-dimensional vector spaces, in which case there is a 1:1 correspondence between any kind of vector and Rn. When it is convenient, we will use intuitions about geometric vectors and consider array-based algorithms.

One major idea in mathematics is the idea of "closure". This is the question: What is the set of all things that can result from my proposed operations? In the case of vectors: What is the set of vectors that can result by starting with a small set of vectors, and adding them to each other and scaling them? This results in a vector space (Section 2.4). The concept of a vector space and its properties underlie much of machine learning. The concepts introduced in this chapter are summarized in Figure 2.2.

[Margin notes: Pavel Grinfeld's series on linear algebra: [Link]. Gilbert Strang's course on linear algebra: [Link]. 3Blue1Brown series on linear algebra: [Link].]
This chapter is mostly based on the lecture notes and books by Drumm
and Weil (2001), Strang (2003), Hogben (2013), Liesen and Mehrmann
(2015), as well as Pavel Grinfeld’s Linear Algebra series. Other excellent

Draft (2022-01-11) of “Mathematics for Machine Learning”. Feedback: [Link]



Figure 2.2 A mind map of the concepts introduced in this chapter, along with where they are used in other parts of the book.
resources are Gilbert Strang's Linear Algebra course at MIT and the Linear Algebra Series by 3Blue1Brown.

Linear algebra plays an important role in machine learning and general mathematics. The concepts introduced in this chapter are further expanded to include the idea of geometry in Chapter 3. In Chapter 5, we will discuss vector calculus, where a principled knowledge of matrix operations is essential. In Chapter 10, we will use projections (to be introduced in Section 3.8) for dimensionality reduction with principal component analysis (PCA). In Chapter 9, we will discuss linear regression, where linear algebra plays a central role for solving least-squares problems.

2.1 Systems of Linear Equations


Systems of linear equations play a central part in linear algebra. Many problems can be formulated as systems of linear equations, and linear algebra gives us the tools for solving them.

Example 2.1
A company produces products N1 , . . . , Nn for which resources
R1 , . . . , Rm are required. To produce a unit of product Nj , aij units of
resource Ri are needed, where i = 1, . . . , m and j = 1, . . . , n.
The objective is to find an optimal production plan, i.e., a plan of how
many units xj of product Nj should be produced if a total of bi units of
resource Ri are available and (ideally) no resources are left over.
If we produce x1 , . . . , xn units of the corresponding products, we need

a total of

ai1 x1 + · · · + ain xn    (2.2)

many units of resource Ri. An optimal production plan (x1, ..., xn) ∈ Rn, therefore, has to satisfy the following system of equations:

a11 x1 + · · · + a1n xn = b1
    ⋮                       (2.3)
am1 x1 + · · · + amn xn = bm,

where aij ∈ R and bi ∈ R.

Equation (2.3) is the general form of a system of linear equations, and x1, ..., xn are the unknowns of this system. Every n-tuple (x1, ..., xn) ∈ Rn that satisfies (2.3) is a solution of the linear equation system.

Example 2.2
The system of linear equations
x1 + x2 + x3 = 3 (1)
x1 − x2 + 2x3 = 2 (2) (2.4)
2x1 + 3x3 = 1 (3)
has no solution: Adding the first two equations yields 2x1 +3x3 = 5, which
contradicts the third equation (3).
Let us have a look at the system of linear equations
x1 + x2 + x3 = 3 (1)
x1 − x2 + 2x3 = 2 (2) . (2.5)
x2 + x3 = 2 (3)
From the first and third equation, it follows that x1 = 1. From (1)+(2),
we get 2x1 + 3x3 = 5, i.e., x3 = 1. From (3), we then get that x2 = 1.
Therefore, (1, 1, 1) is the only possible and unique solution (verify that
(1, 1, 1) is a solution by plugging in).
As a third example, we consider
x1 + x2 + x3 = 3 (1)
x1 − x2 + 2x3 = 2 (2) . (2.6)
2x1 + 3x3 = 5 (3)
Since (1)+(2)=(3), we can omit the third equation (redundancy). From
(1) and (2), we get 2x1 = 5−3x3 and 2x2 = 1+x3 . We define x3 = a ∈ R
as a free variable, such that any triplet
$$\left(\tfrac{5}{2} - \tfrac{3}{2}a,\ \tfrac{1}{2} + \tfrac{1}{2}a,\ a\right), \quad a \in \mathbb{R} \tag{2.7}$$

is a solution of the system of linear equations, i.e., we obtain a solution set that contains infinitely many solutions.

Figure 2.3 The solution space of a system of two linear equations with two variables can be geometrically interpreted as the intersection of two lines. Every linear equation represents a line.

In general, for a real-valued system of linear equations we obtain either no, exactly one, or infinitely many solutions. Linear regression (Chapter 9) solves a version of Example 2.1 when we cannot solve the system of linear equations.
Remark (Geometric Interpretation of Systems of Linear Equations). In a
system of linear equations with two variables x1 , x2 , each linear equation
defines a line on the x1 x2 -plane. Since a solution to a system of linear
equations must satisfy all equations simultaneously, the solution set is the
intersection of these lines. This intersection set can be a line (if the linear
equations describe the same line), a point, or empty (when the lines are
parallel). An illustration is given in Figure 2.3 for the system

4x1 + 4x2 = 5
(2.8)
2x1 − 4x2 = 1

where the solution space is the point (x1, x2) = (1, 1/4). Similarly, for three
variables, each linear equation determines a plane in three-dimensional
space. When we intersect these planes, i.e., satisfy all linear equations at
the same time, we can obtain a solution set that is a plane, a line, a point
or empty (when the planes have no common intersection). ♦
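The intersection point of the two lines in (2.8) can be checked numerically; a small sketch, assuming NumPy is available:

```python
import numpy as np

# The system 4x1 + 4x2 = 5, 2x1 - 4x2 = 1 in matrix form A x = b
A = np.array([[4.0, 4.0],
              [2.0, -4.0]])
b = np.array([5.0, 1.0])

# The unique solution is the intersection point of the two lines
x = np.linalg.solve(A, b)
print(x)  # the point (1, 0.25)
```

`np.linalg.solve` raises an error for singular systems, matching the "parallel lines" case where the intersection is empty or not unique.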
For a systematic approach to solving systems of linear equations, we
will introduce a useful compact notation. We collect the coefficients aij
into vectors and collect the vectors into matrices. In other words, we write
the system from (2.3) in the following form:
       
$$\begin{bmatrix} a_{11} \\ \vdots \\ a_{m1} \end{bmatrix} x_1 + \begin{bmatrix} a_{12} \\ \vdots \\ a_{m2} \end{bmatrix} x_2 + \cdots + \begin{bmatrix} a_{1n} \\ \vdots \\ a_{mn} \end{bmatrix} x_n = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \tag{2.9}$$

$$\iff \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \tag{2.10}$$

In the following, we will have a close look at these matrices and define computation rules. We will return to solving linear equations in Section 2.3.

2.2 Matrices
Matrices play a central role in linear algebra. They can be used to compactly represent systems of linear equations, but they also represent linear functions (linear mappings) as we will see later in Section 2.7. Before we discuss some of these interesting topics, let us first define what a matrix is and what kind of operations we can do with matrices. We will see more properties of matrices in Chapter 4.
Definition 2.1 (Matrix). With m, n ∈ N a real-valued (m, n) matrix A is an m·n-tuple of elements aij, i = 1, ..., m, j = 1, ..., n, which is ordered according to a rectangular scheme consisting of m rows and n columns:

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}, \quad a_{ij} \in \mathbb{R}. \tag{2.11}$$

By convention (1, n)-matrices are called rows and (m, 1)-matrices are called columns. These special matrices are also called row/column vectors.

Rm×n is the set of all real-valued (m, n)-matrices. A ∈ Rm×n can be equivalently represented as a ∈ Rmn by stacking all n columns of the matrix into a long vector; see Figure 2.4.

[Margin note: Figure 2.4 By stacking its columns, a matrix A ∈ R4×2 can be re-shaped into, and represented as, a long vector a ∈ R8.]

2.2.1 Matrix Addition and Multiplication

The sum of two matrices A ∈ Rm×n, B ∈ Rm×n is defined as the element-wise sum, i.e.,

$$A + B := \begin{bmatrix} a_{11}+b_{11} & \cdots & a_{1n}+b_{1n} \\ \vdots & & \vdots \\ a_{m1}+b_{m1} & \cdots & a_{mn}+b_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}. \tag{2.12}$$
For matrices A ∈ Rm×n, B ∈ Rn×k, the elements cij of the product C = AB ∈ Rm×k are computed as

$$c_{ij} = \sum_{l=1}^{n} a_{il}\, b_{lj}, \quad i = 1, \dots, m, \quad j = 1, \dots, k. \tag{2.13}$$

[Margin note: Note the size of the matrices. In NumPy, C = np.einsum('il,lj', A, B).]

This means, to compute element cij we multiply the elements of the ith row of A with the jth column of B and sum them up. Later in Section 3.2, we will call this the dot product of the corresponding row and column. In cases where we need to be explicit that we are performing multiplication, we use the notation A · B to denote multiplication (explicitly showing "·").

[Margin notes: There are n columns in A and n rows in B so that we can compute ail blj for l = 1, ..., n. Commonly, the dot product between two vectors a, b is denoted by aᵀb or ⟨a, b⟩.]

Remark. Matrices can only be multiplied if their "neighboring" dimensions match. For instance, an n × k-matrix A can be multiplied with a k × m-matrix B, but only from the left side:

$$\underbrace{A}_{n \times k}\ \underbrace{B}_{k \times m} = \underbrace{C}_{n \times m} \tag{2.14}$$

The product BA is not defined if m ≠ n since the neighboring dimensions do not match. ♦

Remark. Matrix multiplication is not defined as an element-wise operation on matrix elements, i.e., cij ≠ aij bij (even if the size of A, B was chosen appropriately). This kind of element-wise multiplication often appears in programming languages when we multiply (multi-dimensional) arrays with each other, and is called a Hadamard product. ♦

Example 2.3
For $A = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{2\times3}$, $B = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{3\times2}$, we obtain

$$AB = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 2 & 5 \end{bmatrix} \in \mathbb{R}^{2\times2}, \tag{2.15}$$

$$BA = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 6 & 4 & 2 \\ -2 & 0 & 2 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{3\times3}. \tag{2.16}$$
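The products in Example 2.3 can be checked numerically; a quick sketch with NumPy:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [3, 2, 1]])
B = np.array([[0, 2],
              [1, -1],
              [0, 1]])

AB = A @ B  # (2, 3) @ (3, 2) -> (2, 2), cf. (2.15)
BA = B @ A  # (3, 2) @ (2, 3) -> (3, 3), cf. (2.16)

print(AB.shape, BA.shape)  # (2, 2) (3, 3)
```

Note that AB and BA do not even have the same shape here, so matrix multiplication cannot be commutative.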

From this example, we can already see that matrix multiplication is not commutative, i.e., AB ≠ BA; see also Figure 2.5 for an illustration.

[Margin note: Figure 2.5 Even if both matrix multiplications AB and BA are defined, the dimensions of the results can be different.]

Definition 2.2 (Identity Matrix). In Rn×n, we define the identity matrix

$$I_n := \begin{bmatrix} 1 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{n \times n} \tag{2.17}$$

as the n × n-matrix containing 1 on the diagonal and 0 everywhere else.


Now that we defined matrix multiplication, matrix addition and the identity matrix, let us have a look at some properties of matrices:

- Associativity:
  ∀A ∈ Rm×n, B ∈ Rn×p, C ∈ Rp×q: (AB)C = A(BC)    (2.18)
- Distributivity:
  ∀A, B ∈ Rm×n, C, D ∈ Rn×p: (A + B)C = AC + BC    (2.19a)
                              A(C + D) = AC + AD    (2.19b)
- Multiplication with the identity matrix:
  ∀A ∈ Rm×n: Im A = A In = A    (2.20)
  Note that Im ≠ In for m ≠ n.

2.2.2 Inverse and Transpose


Definition 2.3 (Inverse). Consider a square matrix A ∈ Rn×n. Let matrix B ∈ Rn×n have the property that AB = In = BA. B is called the inverse of A and denoted by A⁻¹.

[Margin note: A square matrix possesses the same number of columns and rows.]

Unfortunately, not every matrix A possesses an inverse A⁻¹. If this inverse does exist, A is called regular/invertible/nonsingular, otherwise singular/noninvertible. When the matrix inverse exists, it is unique. In Section 2.3, we will discuss a general way to compute the inverse of a matrix by solving a system of linear equations.
Remark (Existence of the Inverse of a 2 × 2-matrix). Consider a matrix

$$A := \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \in \mathbb{R}^{2 \times 2}. \tag{2.21}$$

If we multiply A with

$$A' := \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \tag{2.22}$$

we obtain

$$AA' = \begin{bmatrix} a_{11}a_{22} - a_{12}a_{21} & 0 \\ 0 & a_{11}a_{22} - a_{12}a_{21} \end{bmatrix} = (a_{11}a_{22} - a_{12}a_{21})\, I. \tag{2.23}$$

Therefore,

$$A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \tag{2.24}$$

if and only if a11 a22 − a12 a21 ≠ 0. In Section 4.1, we will see that a11 a22 − a12 a21 is the determinant of a 2×2-matrix. Furthermore, we can generally use the determinant to check whether a matrix is invertible. ♦
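The closed-form inverse in (2.24) is easy to check against a numerical inverse; a sketch with NumPy, using an arbitrary invertible 2 × 2 matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# a11*a22 - a12*a21, the determinant of a 2x2 matrix
det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
assert det != 0  # A is invertible iff the determinant is nonzero

# Formula (2.24): swap the diagonal, negate the off-diagonal, divide by det
A_inv = np.array([[A[1, 1], -A[0, 1]],
                  [-A[1, 0], A[0, 0]]]) / det

print(np.allclose(A_inv, np.linalg.inv(A)))  # True
print(np.allclose(A @ A_inv, np.eye(2)))     # True
```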

Example 2.4 (Inverse Matrix)
The matrices

$$A = \begin{bmatrix} 1 & 2 & 1 \\ 4 & 4 & 5 \\ 6 & 7 & 7 \end{bmatrix}, \quad B = \begin{bmatrix} -7 & -7 & 6 \\ 2 & 1 & -1 \\ 4 & 5 & -4 \end{bmatrix} \tag{2.25}$$

are inverse to each other since AB = I = BA.

Definition 2.4 (Transpose). For A ∈ Rm×n the matrix B ∈ Rn×m with bij = aji is called the transpose of A. We write B = Aᵀ.

In general, Aᵀ can be obtained by writing the columns of A as the rows of Aᵀ. The following are important properties of inverses and transposes:

AA⁻¹ = I = A⁻¹A    (2.26)
(AB)⁻¹ = B⁻¹A⁻¹    (2.27)
(A + B)⁻¹ ≠ A⁻¹ + B⁻¹    (2.28)
(Aᵀ)ᵀ = A    (2.29)
(A + B)ᵀ = Aᵀ + Bᵀ    (2.30)
(AB)ᵀ = BᵀAᵀ    (2.31)

[Margin notes: The main diagonal (sometimes called "principal diagonal", "primary diagonal", "leading diagonal", or "major diagonal") of a matrix A is the collection of entries Aij where i = j. The scalar case of (2.28) is 1/(2+4) = 1/6 ≠ 1/2 + 1/4.]

Definition 2.5 (Symmetric Matrix). A matrix A ∈ Rn×n is symmetric if A = Aᵀ.

Note that only (n, n)-matrices can be symmetric. Generally, we call (n, n)-matrices also square matrices because they possess the same number of rows and columns. Moreover, if A is invertible, then so is Aᵀ, and (A⁻¹)ᵀ = (Aᵀ)⁻¹ =: A⁻ᵀ.

Remark (Sum and Product of Symmetric Matrices). The sum of symmetric matrices A, B ∈ Rn×n is always symmetric. However, although their product is always defined, it is generally not symmetric:

$$\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix}. \tag{2.32}$$
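Properties (2.27) and (2.31), and the remark on symmetric products, can be spot-checked numerically; a sketch with NumPy using the matrices from Example 2.4:

```python
import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [4.0, 4.0, 5.0],
              [6.0, 7.0, 7.0]])
B = np.array([[-7.0, -7.0, 6.0],
              [2.0, 1.0, -1.0],
              [4.0, 5.0, -4.0]])

# (2.27): the inverse of a product reverses the order of the factors
print(np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A)))  # True

# (2.31): the transpose of a product reverses the order of the factors
print(np.allclose((A @ B).T, B.T @ A.T))  # True

# A product of symmetric matrices need not be symmetric, cf. (2.32)
P = np.array([[1.0, 0.0], [0.0, 0.0]]) @ np.array([[1.0, 1.0], [1.0, 1.0]])
print(np.allclose(P, P.T))  # False
```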

2.2.3 Multiplication by a Scalar

Let us look at what happens to matrices when they are multiplied by a scalar λ ∈ R. Let A ∈ Rm×n and λ ∈ R. Then λA = K, Kij = λ aij. Practically, λ scales each element of A. For λ, ψ ∈ R, the following holds:

- Associativity:
  (λψ)C = λ(ψC), C ∈ Rm×n
  λ(BC) = (λB)C = B(λC) = (BC)λ, B ∈ Rm×n, C ∈ Rn×k. Note that this allows us to move scalar values around.
  (λC)ᵀ = Cᵀλᵀ = Cᵀλ = λCᵀ since λ = λᵀ for all λ ∈ R.
- Distributivity:
  (λ + ψ)C = λC + ψC, C ∈ Rm×n
  λ(B + C) = λB + λC, B, C ∈ Rm×n

Example 2.5 (Distributivity)
If we define

$$C := \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \tag{2.33}$$

then for any λ, ψ ∈ R we obtain

$$(\lambda + \psi)C = \begin{bmatrix} (\lambda+\psi)\cdot 1 & (\lambda+\psi)\cdot 2 \\ (\lambda+\psi)\cdot 3 & (\lambda+\psi)\cdot 4 \end{bmatrix} = \begin{bmatrix} \lambda+\psi & 2\lambda+2\psi \\ 3\lambda+3\psi & 4\lambda+4\psi \end{bmatrix} \tag{2.34a}$$
$$= \begin{bmatrix} \lambda & 2\lambda \\ 3\lambda & 4\lambda \end{bmatrix} + \begin{bmatrix} \psi & 2\psi \\ 3\psi & 4\psi \end{bmatrix} = \lambda C + \psi C\,. \tag{2.34b}$$

2.2.4 Compact Representations of Systems of Linear Equations


If we consider the system of linear equations

2x1 + 3x2 + 5x3 = 1
4x1 − 2x2 − 7x3 = 8    (2.35)
9x1 + 5x2 − 3x3 = 2

and use the rules for matrix multiplication, we can write this equation system in a more compact form as

$$\begin{bmatrix} 2 & 3 & 5 \\ 4 & -2 & -7 \\ 9 & 5 & -3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 8 \\ 2 \end{bmatrix}. \tag{2.36}$$

Note that x1 scales the first column, x2 the second one, and x3 the third one.
Generally, a system of linear equations can be compactly represented in
their matrix form as Ax = b; see (2.3), and the product Ax is a (linear)
combination of the columns of A. We will discuss linear combinations in
more detail in Section 2.5.
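For an invertible coefficient matrix such as the one in (2.36), the compact system Ax = b can be solved directly; a sketch with NumPy that also illustrates the column-combination view:

```python
import numpy as np

# The system (2.35)/(2.36) in matrix form A x = b
A = np.array([[2.0, 3.0, 5.0],
              [4.0, -2.0, -7.0],
              [9.0, 5.0, -3.0]])
b = np.array([1.0, 8.0, 2.0])

x = np.linalg.solve(A, b)

# The product Ax is a linear combination of the columns of A,
# weighted by the entries of x:
combo = A[:, 0] * x[0] + A[:, 1] * x[1] + A[:, 2] * x[2]
print(np.allclose(combo, b))  # True
```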


2.3 Solving Systems of Linear Equations


In (2.3), we introduced the general form of an equation system, i.e.,

a11 x1 + · · · + a1n xn = b1
    ⋮                       (2.37)
am1 x1 + · · · + amn xn = bm,

where aij ∈ R and bi ∈ R are known constants and xj are unknowns, i = 1, ..., m, j = 1, ..., n. Thus far, we saw that matrices can be used as a compact way of formulating systems of linear equations so that we can write Ax = b, see (2.10). Moreover, we defined basic matrix operations, such as addition and multiplication of matrices. In the following, we will focus on solving systems of linear equations and provide an algorithm for finding the inverse of a matrix.

2.3.1 Particular and General Solution


Before discussing how to generally solve systems of linear equations, let us have a look at an example. Consider the system of equations

$$\begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 42 \\ 8 \end{bmatrix}. \tag{2.38}$$

The system has two equations and four unknowns. Therefore, in general we would expect infinitely many solutions. This system of equations is in a particularly easy form, where the first two columns consist of a 1 and a 0. Remember that we want to find scalars x1, ..., x4, such that $\sum_{i=1}^{4} x_i c_i = b$, where we define ci to be the ith column of the matrix and b the right-hand side of (2.38). A solution to the problem in (2.38) can be found immediately by taking 42 times the first column and 8 times the second column so that

$$b = \begin{bmatrix} 42 \\ 8 \end{bmatrix} = 42 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 8 \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{2.39}$$

Therefore, a solution is [42, 8, 0, 0]ᵀ. This solution is called a particular solution or special solution. However, this is not the only solution of this system of linear equations. To capture all the other solutions, we need to be creative in generating 0 in a non-trivial way using the columns of the matrix: Adding 0 to our special solution does not change the special solution. To do so, we express the third column using the first two columns (which are of this very simple form)

$$\begin{bmatrix} 8 \\ 2 \end{bmatrix} = 8 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 2 \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{2.40}$$

so that 0 = 8c1 + 2c2 − 1c3 + 0c4 and (x1, x2, x3, x4) = (8, 2, −1, 0). In fact, any scaling of this solution by λ1 ∈ R produces the 0 vector, i.e.,

$$\begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \left( \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} \right) = \lambda_1 (8c_1 + 2c_2 - c_3) = 0. \tag{2.41}$$
Following the same line of reasoning, we express the fourth column of the matrix in (2.38) using the first two columns and generate another set of non-trivial versions of 0 as

$$\begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \left( \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix} \right) = \lambda_2 (-4c_1 + 12c_2 - c_4) = 0 \tag{2.42}$$

for any λ2 ∈ R. Putting everything together, we obtain all solutions of the equation system in (2.38), which is called the general solution, as the set

$$\left\{ x \in \mathbb{R}^4 : x = \begin{bmatrix} 42 \\ 8 \\ 0 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix},\ \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \tag{2.43}$$
Remark. The general approach we followed consisted of the following three steps:

1. Find a particular solution to Ax = b.
2. Find all solutions to Ax = 0.
3. Combine the solutions from steps 1. and 2. to the general solution.

Neither the general nor the particular solution is unique. ♦

The system of linear equations in the preceding example was easy to solve because the matrix in (2.38) has this particularly convenient form, which allowed us to find the particular and the general solution by inspection. However, general equation systems are not of this simple form. Fortunately, there exists a constructive algorithmic way of transforming any system of linear equations into this particularly simple form: Gaussian elimination. Key to Gaussian elimination are elementary transformations of systems of linear equations, which transform the equation system into a simple form. Then, we can apply the three steps to the simple form that we just discussed in the context of the example in (2.38).
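The three-step recipe can be checked on example (2.38); a sketch with NumPy in which the free parameters λ1, λ2 are sampled arbitrarily:

```python
import numpy as np

A = np.array([[1.0, 0.0, 8.0, -4.0],
              [0.0, 1.0, 2.0, 12.0]])
b = np.array([42.0, 8.0])

x_p = np.array([42.0, 8.0, 0.0, 0.0])   # step 1: particular solution
n1 = np.array([8.0, 2.0, -1.0, 0.0])    # step 2: solutions of Ax = 0
n2 = np.array([-4.0, 12.0, 0.0, -1.0])

rng = np.random.default_rng(0)
for _ in range(5):
    lam1, lam2 = rng.normal(size=2)
    x = x_p + lam1 * n1 + lam2 * n2     # step 3: general solution
    assert np.allclose(A @ x, b)

print("all sampled solutions satisfy Ax = b")
```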

2.3.2 Elementary Transformations

Key to solving a system of linear equations are elementary transformations that keep the solution set the same, but that transform the equation system into a simpler form:

- Exchange of two equations (rows in the matrix representing the system of equations)
- Multiplication of an equation (row) with a constant λ ∈ R\{0}
- Addition of two equations (rows)

Example 2.6
For a ∈ R, we seek all solutions of the following system of equations:

−2x1 + 4x2 − 2x3 −  x4 + 4x5 = −3
 4x1 − 8x2 + 3x3 − 3x4 +  x5 =  2
  x1 − 2x2 +  x3 −  x4 +  x5 =  0        (2.44)
  x1 − 2x2       − 3x4 + 4x5 =  a

We start by converting this system of equations into the compact matrix notation Ax = b. We no longer mention the variables x explicitly and build the augmented matrix (in the form [A | b])

$$\left[\begin{array}{ccccc|c} -2 & 4 & -2 & -1 & 4 & -3 \\ 4 & -8 & 3 & -3 & 1 & 2 \\ 1 & -2 & 1 & -1 & 1 & 0 \\ 1 & -2 & 0 & -3 & 4 & a \end{array}\right] \begin{array}{l} \text{swap with } R_3 \\ \\ \text{swap with } R_1 \\ \\ \end{array}$$

where we used the vertical line to separate the left-hand side from the right-hand side in (2.44). We use ⇝ to indicate a transformation of the augmented matrix using elementary transformations.

[Margin note: The augmented matrix [A | b] compactly represents the system of linear equations Ax = b.]

Swapping Rows 1 and 3 leads to

$$\left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 4 & -8 & 3 & -3 & 1 & 2 \\ -2 & 4 & -2 & -1 & 4 & -3 \\ 1 & -2 & 0 & -3 & 4 & a \end{array}\right] \begin{array}{l} \\ -4R_1 \\ +2R_1 \\ -R_1 \end{array}$$

When we now apply the indicated transformations (e.g., subtract Row 1 four times from Row 2), we obtain

$$\leadsto \left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 & -3 & 2 \\ 0 & 0 & 0 & -3 & 6 & -3 \\ 0 & 0 & -1 & -2 & 3 & a \end{array}\right] \begin{array}{l} \\ \\ \\ -R_2 - R_3 \end{array}$$

$$\leadsto \left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 & -3 & 2 \\ 0 & 0 & 0 & -3 & 6 & -3 \\ 0 & 0 & 0 & 0 & 0 & a+1 \end{array}\right] \begin{array}{l} \\ \cdot(-1) \\ \cdot(-\tfrac{1}{3}) \\ \\ \end{array}$$

$$\leadsto \left[\begin{array}{ccccc|c} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & 1 & -1 & 3 & -2 \\ 0 & 0 & 0 & 1 & -2 & 1 \\ 0 & 0 & 0 & 0 & 0 & a+1 \end{array}\right]$$


This (augmented) matrix is in a convenient form, the row-echelon form (REF). Reverting this compact notation back into the explicit notation with the variables we seek, we obtain

x1 − 2x2 + x3 − x4 +  x5 =  0
           x3 − x4 + 3x5 = −2
                x4 − 2x5 =  1        (2.45)
                       0 = a + 1

Only for a = −1 can this system be solved. A particular solution is

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix}. \tag{2.46}$$

The general solution, which captures the set of all possible solutions, is

$$\left\{ x \in \mathbb{R}^5 : x = \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 2 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 2 \\ 0 \\ -1 \\ 2 \\ 1 \end{bmatrix},\ \lambda_1, \lambda_2 \in \mathbb{R} \right\}. \tag{2.47}$$
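The particular and general solutions of Example 2.6 (for a = −1) can be verified against the original system (2.44); a sketch with NumPy:

```python
import numpy as np

# Coefficient matrix and right-hand side of (2.44) with a = -1
A = np.array([[-2.0, 4.0, -2.0, -1.0, 4.0],
              [4.0, -8.0, 3.0, -3.0, 1.0],
              [1.0, -2.0, 1.0, -1.0, 1.0],
              [1.0, -2.0, 0.0, -3.0, 4.0]])
b = np.array([-3.0, 2.0, 0.0, -1.0])

x_p = np.array([2.0, 0.0, -1.0, 1.0, 0.0])  # particular solution (2.46)
n1 = np.array([2.0, 1.0, 0.0, 0.0, 0.0])    # homogeneous solutions in (2.47)
n2 = np.array([2.0, 0.0, -1.0, 2.0, 1.0])

print(np.allclose(A @ x_p, b))                           # True
print(np.allclose(A @ (x_p + 3.0 * n1 - 2.0 * n2), b))   # True
```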
In the following, we will detail a constructive way to obtain a particular and general solution of a system of linear equations.

Remark (Pivots and Staircase Structure). The leading coefficient of a row (first nonzero number from the left) is called the pivot and is always strictly to the right of the pivot of the row above it. Therefore, any equation system in row-echelon form always has a "staircase" structure. ♦
Definition 2.6 (Row-Echelon Form). A matrix is in row-echelon form if

- All rows that contain only zeros are at the bottom of the matrix; correspondingly, all rows that contain at least one nonzero element are on top of rows that contain only zeros.
- Looking at nonzero rows only, the first nonzero number from the left (also called the pivot or the leading coefficient) is always strictly to the right of the pivot of the row above it.

[Margin note: In other texts, it is sometimes required that the pivot is 1.]

Remark (Basic and Free Variables). The variables corresponding to the pivots in the row-echelon form are called basic variables and the other variables are free variables. For example, in (2.45), x1, x3, x4 are basic variables, whereas x2, x5 are free variables. ♦
Remark (Obtaining a Particular Solution). The row-echelon form makes our lives easier when we need to determine a particular solution. To do this, we express the right-hand side of the equation system using the pivot columns, such that $b = \sum_{i=1}^{P} \lambda_i p_i$, where pi, i = 1, ..., P, are the pivot columns. The λi are determined easiest if we start with the rightmost pivot column and work our way to the left.

In the previous example, we would try to find λ1, λ2, λ3 so that

$$\lambda_1 \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1 \\ -1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ -2 \\ 1 \\ 0 \end{bmatrix}. \tag{2.48}$$

From here, we find relatively directly that λ3 = 1, λ2 = −1, λ1 = 2. When we put everything together, we must not forget the non-pivot columns for which we set the coefficients implicitly to 0. Therefore, we get the particular solution x = [2, 0, −1, 1, 0]ᵀ. ♦
Remark (Reduced Row-Echelon Form). An equation system is in reduced row-echelon form (also: row-reduced echelon form or row canonical form) if

- It is in row-echelon form.
- Every pivot is 1.
- The pivot is the only nonzero entry in its column.

The reduced row-echelon form will play an important role later in Section 2.3.3 because it allows us to determine the general solution of a system of linear equations in a straightforward way. ♦

Remark (Gaussian Elimination). Gaussian elimination is an algorithm that performs elementary transformations to bring a system of linear equations into reduced row-echelon form. ♦
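Gaussian elimination as just described — using only the three elementary row operations — can be sketched as a small routine (a didactic implementation with partial pivoting, not a numerically robust replacement for library solvers):

```python
import numpy as np

def rref(M, tol=1e-12):
    """Bring a matrix to reduced row-echelon form via elementary row operations."""
    A = np.array(M, dtype=float)
    rows, cols = A.shape
    r = 0
    for c in range(cols):
        if r >= rows:
            break
        # find a pivot row for column c and swap it up (row exchange)
        p = r + np.argmax(np.abs(A[r:, c]))
        if abs(A[p, c]) < tol:
            continue  # no pivot in this column
        A[[r, p]] = A[[p, r]]
        A[r] /= A[r, c]            # scale the row so the pivot becomes 1
        for i in range(rows):      # eliminate the pivot column everywhere else
            if i != r:
                A[i] -= A[i, c] * A[r]
        r += 1
    return A

# Augmented matrix of Example 2.6 with a = -1
aug = [[-2, 4, -2, -1, 4, -3],
       [4, -8, 3, -3, 1, 2],
       [1, -2, 1, -1, 1, 0],
       [1, -2, 0, -3, 4, -1]]
print(rref(aug))
```

Applied to the augmented matrix of Example 2.6, the routine reproduces a reduced row-echelon form whose pivot rows encode the particular solution (2.46).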

Example 2.7 (Reduced Row Echelon Form)


Verify that the following matrix is in reduced row-echelon form (the pivots
are in bold):
 
1 3 0 0 3
A = 0 0 1 0 9  . (2.49)
0 0 0 1 −4
The key idea for finding the solutions of Ax = 0 is to look at the non-
pivot columns, which we will need to express as a (linear) combination of
the pivot columns. The reduced row echelon form makes this relatively
straightforward, and we express the non-pivot columns in terms of sums
and multiples of the pivot columns that are on their left: The second col-
umn is 3 times the first column (we can ignore the pivot columns on the
right of the second column). Therefore, to obtain 0, we need to subtract

©2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).


32 Linear Algebra

the second column from three times the first column. Now, we look at the
fifth column, which is our second non-pivot column. The fifth column can
be expressed as 3 times the first pivot column, 9 times the second pivot
column, and −4 times the third pivot column. We need to keep track of
the indices of the pivot columns and translate this into 3 times the first col-
umn, 0 times the second column (which is a non-pivot column), 9 times
the third column (which is our second pivot column), and −4 times the
fourth column (which is the third pivot column). Then we need to subtract
the fifth column to obtain 0. In the end, we are still solving a homogeneous
equation system.
To summarize, all solutions of Ax = 0, x ∈ R5 , are given by

    { x ∈ R5 : x = λ1 [3, −1, 0, 0, 0]> + λ2 [3, 0, 9, −4, −1]> ,  λ1 , λ2 ∈ R } .   (2.50)
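A quick NumPy check confirms that both spanning vectors in (2.50) solve Ax = 0 (a sketch; any linear combination of them does, too):

```python
import numpy as np

# The matrix A from (2.49) and the two spanning vectors from (2.50)
A = np.array([[1., 3, 0, 0, 3],
              [0., 0, 1, 0, 9],
              [0., 0, 0, 1, -4]])
s1 = np.array([3., -1, 0, 0, 0])
s2 = np.array([3., 0, 9, -4, -1])

print(A @ s1)                      # [0. 0. 0.]
print(A @ (2.0 * s1 - 3.0 * s2))   # [0. 0. 0.]: linear combinations solve Ax = 0 as well
```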
 

2.3.3 The Minus-1 Trick


In the following, we introduce a practical trick for reading out the solu-
tions x of a homogeneous system of linear equations Ax = 0, where
A ∈ Rk×n , x ∈ Rn .
To start, we assume that A is in reduced row-echelon form without any
rows that just contain zeros, i.e.,
         0 ··· 0 1 ∗ ··· ∗ 0 ∗ ··· ∗   0 ∗ ··· ∗
         ⋮     ⋮ 0 0 ··· 0 1 ∗ ··· ∗   ⋮ ⋮     ⋮
    A =  ⋮     ⋮ ⋮ ⋮     ⋮ 0    ⋱      ⋮ ⋮     ⋮  ,        (2.51)
         0 ··· 0 0 0 ··· 0 0 0 ··· 0   1 ∗ ··· ∗
where ∗ can be an arbitrary real number, with the constraints that the first
nonzero entry per row must be 1 and all other entries in the corresponding
column must be 0. The columns j1 , . . . , jk with the pivots (marked in
bold) are the standard unit vectors e1 , . . . , ek ∈ Rk . We extend this matrix
to an n × n-matrix à by adding n − k rows of the form
 
    [ 0 · · · 0 −1 0 · · · 0 ]                  (2.52)

so that the diagonal of the augmented matrix à contains either 1 or −1.


Then, the columns of à that contain the −1 as pivots are solutions of

Draft (2022-01-11) of “Mathematics for Machine Learning”. Feedback: [Link]



the homogeneous equation system Ax = 0. To be more precise, these
columns form a basis (Section 2.6.1) of the solution space of Ax = 0,
which we will later call the kernel or null space (see Section 2.7.3).

Example 2.8 (Minus-1 Trick)


Let us revisit the matrix in (2.49), which is already in reduced
row-echelon form:

        1  3  0  0   3
    A = 0  0  1  0   9  .                       (2.53)
        0  0  0  1  −4
We now augment this matrix to a 5 × 5 matrix by adding rows of the
form (2.52) at the places where the pivots on the diagonal are missing
and obtain
         1   3  0  0   3
         0  −1  0  0   0
    Ã =  0   0  1  0   9  .                     (2.54)
         0   0  0  1  −4
         0   0  0  0  −1
From this form, we can immediately read out the solutions of Ax = 0 by
taking the columns of Ã, which contain −1 on the diagonal:
 
    { x ∈ R5 : x = λ1 [3, −1, 0, 0, 0]> + λ2 [3, 0, 9, −4, −1]> ,  λ1 , λ2 ∈ R } ,   (2.55)
which is identical to the solution in (2.50) that we obtained by “insight”.
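The Minus-1 Trick is mechanical enough to sketch in NumPy (the function name and interface below are our own, purely illustrative):

```python
import numpy as np

def minus_one_trick(A_rref, pivot_cols):
    """Null-space basis of A x = 0 via the Minus-1 Trick.

    A_rref: k x n matrix in reduced row-echelon form, no zero rows.
    pivot_cols: indices of its pivot columns (ordered left to right)."""
    k, n = A_rref.shape
    A_tilde = np.zeros((n, n))
    row = 0
    for j in range(n):
        if j in pivot_cols:        # place the next RREF row so its pivot sits on the diagonal
            A_tilde[j] = A_rref[row]
            row += 1
        else:                      # pad with a -1 row as in (2.52)
            A_tilde[j, j] = -1.0
    free = [j for j in range(n) if j not in pivot_cols]
    return A_tilde[:, free]        # columns with -1 on the diagonal

A = np.array([[1., 3, 0, 0, 3],
              [0., 0, 1, 0, 9],
              [0., 0, 0, 1, -4]])
basis = minus_one_trick(A, pivot_cols=[0, 2, 3])
print(basis.T)     # rows [3, -1, 0, 0, 0] and [3, 0, 9, -4, -1], as in (2.55)
print(A @ basis)   # all zeros: every column solves A x = 0
```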

Calculating the Inverse


To compute the inverse A−1 of A ∈ Rn×n , we need to find a matrix X
that satisfies AX = I n . Then, X = A−1 . We can write this down as
a set of simultaneous linear equations AX = I n , where we solve for
X = [x1 | · · · |xn ]. We use the augmented matrix notation for a compact
representation of this set of systems of linear equations and obtain

    [ A | I n ]   ⇝ · · · ⇝   [ I n | A−1 ] .   (2.56)

This means that if we bring the augmented equation system into reduced
row-echelon form, we can read out the inverse on the right-hand side of
the equation system. Hence, determining the inverse of a matrix is equiv-
alent to solving systems of linear equations.


Example 2.9 (Calculating an Inverse Matrix by Gaussian Elimination)


To determine the inverse of
 
        1  0  2  0
        1  1  0  0
    A = 1  2  0  1                              (2.57)
        1  1  1  1
we write down the augmented matrix
 
1 0 2 0 1 0 0 0
 1 1 0 0 0 1 0 0 
 
 1 2 0 1 0 0 1 0 
1 1 1 1 0 0 0 1
and use Gaussian elimination to bring it into reduced row-echelon form
    1  0  0  0    −1   2  −2   2
    0  1  0  0     1  −1   2  −2
    0  0  1  0     1  −1   1  −1   ,
    0  0  0  1    −1   0  −1   2
such that the desired inverse is given as its right-hand side:
           −1   2  −2   2
            1  −1   2  −2
    A−1 =   1  −1   1  −1  .                    (2.58)
           −1   0  −1   2
We can verify that (2.58) is indeed the inverse by performing the multi-
plication AA−1 and observing that we recover I 4 .
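The same computation can be sketched in NumPy as a naive Gauss-Jordan pass over the augmented matrix [A | I 4 ] (no row pivoting, which happens to suffice for this particular matrix; in practice one would use np.linalg.inv or an LU factorization):

```python
import numpy as np

# The matrix A from (2.57)
A = np.array([[1., 0, 2, 0],
              [1., 1, 0, 0],
              [1., 2, 0, 1],
              [1., 1, 1, 1]])

aug = np.hstack([A, np.eye(4)])        # augmented matrix [A | I]
for i in range(4):
    aug[i] /= aug[i, i]                # normalize the pivot to 1
    for r in range(4):
        if r != i:
            aug[r] -= aug[r, i] * aug[i]   # clear column i in the other rows
A_inv = aug[:, 4:]                     # right-hand block now holds A^{-1}

print(np.round(A_inv))                 # matches (2.58)
```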

2.3.4 Algorithms for Solving a System of Linear Equations


In the following, we briefly discuss approaches to solving a system of lin-
ear equations of the form Ax = b. We make the assumption that a solu-
tion exists. Should there be no solution, we need to resort to approximate
solutions, which we do not cover in this chapter. One way to solve the ap-
proximate problem is using the approach of linear regression, which we
discuss in detail in Chapter 9.
In special cases, we may be able to determine the inverse A−1 , such
that the solution of Ax = b is given as x = A−1 b. However, this is
only possible if A is a square matrix and invertible, which is often not the
case. Otherwise, under mild assumptions (i.e., A needs to have linearly
independent columns) we can use the transformation

Ax = b ⇐⇒ A> Ax = A> b ⇐⇒ x = (A> A)−1 A> b (2.59)


and use the Moore-Penrose pseudo-inverse (A> A)−1 A> to determine the
solution (2.59) that solves Ax = b, which also corresponds to the
minimum norm least-squares solution. A disadvantage of this approach is that
it requires many computations for the matrix-matrix product and comput-
ing the inverse of A> A. Moreover, for reasons of numerical precision it
is generally not recommended to compute the inverse or pseudo-inverse.
In the following, we therefore briefly discuss alternative approaches to
solving systems of linear equations.
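In code, the normal-equation route of (2.59) and a numerically more robust least-squares solver give the same answer (a small sketch with random data; for random A the columns are linearly independent with probability 1):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))    # tall matrix with linearly independent columns
b = rng.normal(size=6)

# Normal equations x = (A^T A)^{-1} A^T b from (2.59)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Preferred in practice: avoids explicitly forming and inverting A^T A
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))   # True
```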
Gaussian elimination plays an important role when computing deter-
minants (Section 4.1), checking whether a set of vectors is linearly inde-
pendent (Section 2.5), computing the inverse of a matrix (Section 2.2.2),
computing the rank of a matrix (Section 2.6.2), and determining a basis
of a vector space (Section 2.6.1). Gaussian elimination is an intuitive and
constructive way to solve a system of linear equations with thousands of
variables. However, for systems with millions of variables, it is impracti-
cal as the required number of arithmetic operations scales cubically in the
number of simultaneous equations.
In practice, systems of many linear equations are solved indirectly, by ei-
ther stationary iterative methods, such as the Richardson method, the Ja-
cobi method, the Gauß-Seidel method, and the successive over-relaxation
method, or Krylov subspace methods, such as conjugate gradients, gener-
alized minimal residual, or biconjugate gradients. We refer to the books
by Stoer and Bulirsch (2002), Strang (2003), and Liesen and Mehrmann
(2015) for further details.
Let x∗ be a solution of Ax = b. The key idea of these iterative methods
is to set up an iteration of the form
x(k+1) = Cx(k) + d (2.60)
for suitable C and d that reduces the residual error kx(k+1) − x∗ k in every
iteration and converges to x∗ . We will introduce norms k · k, which allow
us to compute similarities between vectors, in Section 3.1.
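As one concrete instance of (2.60), the Jacobi method splits A into its diagonal D and the rest, giving C = I − D−1 A and d = D−1 b (a sketch; the example matrix is strictly diagonally dominant, which guarantees convergence of this iteration):

```python
import numpy as np

# Jacobi iteration x_{k+1} = C x_k + d with C = I - D^{-1} A, d = D^{-1} b
A = np.array([[4., 1, 0],
              [1., 5, 2],
              [0., 2, 6]])   # strictly diagonally dominant
b = np.array([1., 2, 3])

D_inv = np.diag(1.0 / np.diag(A))
C = np.eye(3) - D_inv @ A
d = D_inv @ b

x = np.zeros(3)
for _ in range(100):          # the residual error shrinks towards x*
    x = C @ x + d

print(np.allclose(A @ x, b))  # True
```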

2.4 Vector Spaces


Thus far, we have looked at systems of linear equations and how to solve
them (Section 2.3). We saw that systems of linear equations can be com-
pactly represented using matrix-vector notation (2.10). In the following,
we will have a closer look at vector spaces, i.e., a structured space in which
vectors live.
In the beginning of this chapter, we informally characterized vectors as
objects that can be added together and multiplied by a scalar, and they
remain objects of the same type. Now, we are ready to formalize this,
and we will start by introducing the concept of a group, which is a set
of elements and an operation defined on these elements that keeps some
structure of the set intact.


2.4.1 Groups
Groups play an important role in computer science. Besides providing a
fundamental framework for operations on sets, they are heavily used in
cryptography, coding theory, and graphics.
Definition 2.7 (Group). Consider a set G and an operation ⊗ : G × G → G
defined on G . Then G := (G, ⊗) is called a group if the following hold:

1. Closure of G under ⊗: ∀x, y ∈ G : x ⊗ y ∈ G
2. Associativity: ∀x, y, z ∈ G : (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z)
3. Neutral element: ∃e ∈ G ∀x ∈ G : x ⊗ e = x and e ⊗ x = x
4. Inverse element: ∀x ∈ G ∃y ∈ G : x ⊗ y = e and y ⊗ x = e, where e is
   the neutral element. We often write x−1 to denote the inverse element
   of x.

Remark. The inverse element is defined with respect to the operation ⊗
and does not necessarily mean 1/x. ♦

If additionally ∀x, y ∈ G : x ⊗ y = y ⊗ x, then G = (G, ⊗) is an Abelian
group (commutative).

Example 2.10 (Groups)


Let us have a look at some examples of sets with associated operations
and see whether they are groups:
(Z, +) is an Abelian group.
(N0 , +), where N0 := N ∪ {0}, is not a group: Although (N0 , +) possesses
a neutral element (0), the inverse elements are missing.
(Z, ·) is not a group: Although (Z, ·) contains a neutral element (1), the
inverse elements for any z ∈ Z, z ≠ ±1, are missing.
(R, ·) is not a group since 0 does not possess an inverse element.
(R\{0}, ·) is Abelian.
(Rn , +), (Zn , +), n ∈ N, are Abelian if + is defined componentwise, i.e.,

    (x1 , · · · , xn ) + (y1 , · · · , yn ) = (x1 + y1 , · · · , xn + yn ).   (2.61)

Then, (x1 , · · · , xn )−1 := (−x1 , · · · , −xn ) is the inverse element and
e = (0, · · · , 0) is the neutral element.
(Rm×n , +), the set of m × n-matrices, is Abelian (with componentwise
addition as defined in (2.61)).
Let us have a closer look at (Rn×n , ·), i.e., the set of n × n-matrices with
matrix multiplication as defined in (2.13).
– Closure and associativity follow directly from the definition of matrix
multiplication.
– Neutral element: The identity matrix I n is the neutral element with
respect to matrix multiplication “·” in (Rn×n , ·).


– Inverse element: If the inverse exists (A is regular), then A−1 is the
  inverse element of A ∈ Rn×n , and in exactly this case (Rn×n , ·) is a
  group, called the general linear group.

Definition 2.8 (General Linear Group). The set of regular (invertible)
matrices A ∈ Rn×n is a group with respect to matrix multiplication as
defined in (2.13) and is called the general linear group GL(n, R). However,
since matrix multiplication is not commutative, the group is not Abelian.
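Two invertible 2 × 2 matrices illustrate both points numerically: inverses exist, but multiplication does not commute (a minimal sketch):

```python
import numpy as np

A = np.array([[1., 1], [0., 1]])   # both matrices are invertible (determinant 1)
B = np.array([[1., 0], [1., 1]])

print(np.allclose(A @ B, B @ A))                      # False: GL(n, R) is not Abelian
print(np.allclose(A @ np.linalg.inv(A), np.eye(2)))   # True: the inverse element exists
```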

2.4.2 Vector Spaces


When we discussed groups, we looked at sets G and inner operations on
G , i.e., mappings G × G → G that only operate on elements in G . In the
following, we will consider sets that in addition to an inner operation +
also contain an outer operation ·, the multiplication of a vector x ∈ G by
a scalar λ ∈ R. We can think of the inner operation as a form of addition,
and the outer operation as a form of scaling. Note that the inner/outer
operations have nothing to do with inner/outer products.
Definition 2.9 (Vector Space). A real-valued vector space V = (V, +, ·) is
a set V with two operations

    + : V × V → V                               (2.62)
    · : R × V → V                               (2.63)

where

1. (V, +) is an Abelian group
2. Distributivity:
   a. ∀λ ∈ R, x, y ∈ V : λ · (x + y) = λ · x + λ · y
   b. ∀λ, ψ ∈ R, x ∈ V : (λ + ψ) · x = λ · x + ψ · x
3. Associativity (outer operation): ∀λ, ψ ∈ R, x ∈ V : λ · (ψ · x) = (λψ) · x
4. Neutral element with respect to the outer operation: ∀x ∈ V : 1 · x = x

The elements x ∈ V are called vectors. The neutral element of (V, +) is
the zero vector 0 = [0, . . . , 0]> , and the inner operation + is called vector
addition. The elements λ ∈ R are called scalars, and the outer operation
· is a multiplication by scalars. Note that a scalar product is something
different, and we will get to this in Section 3.2.

Remark. A “vector multiplication” ab, a, b ∈ Rn , is not defined. Theoret-


ically, we could define an element-wise multiplication, such that c = ab
with cj = aj bj . This “array multiplication” is common to many program-
ming languages but makes mathematically limited sense using the stan-
dard rules for matrix multiplication: By treating vectors as n × 1 matrices


(which we usually do), we can use the matrix multiplication as defined
in (2.13). However, then the dimensions of the vectors do not match. Only
the following multiplications for vectors are defined: ab> ∈ Rn×n (outer
product), a> b ∈ R (inner/scalar/dot product). ♦
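In NumPy the two well-defined vector products look as follows (element-wise `a * b` also exists, corresponding to the "array multiplication" mentioned above, but it is not a matrix product):

```python
import numpy as np

a = np.array([1., 2., 3.])
b = np.array([4., 5., 6.])

inner = a @ b            # a> b in R: 1*4 + 2*5 + 3*6 = 32
outer = np.outer(a, b)   # a b> in R^{3x3}
hadamard = a * b         # element-wise "array multiplication", not a matrix product

print(inner)             # 32.0
print(outer.shape)       # (3, 3)
print(hadamard)          # [ 4. 10. 18.]
```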

Example 2.11 (Vector Spaces)


Let us have a look at some important examples:
V = Rn , n ∈ N is a vector space with operations defined as follows:
– Addition: x+y = (x1 , . . . , xn )+(y1 , . . . , yn ) = (x1 +y1 , . . . , xn +yn )
for all x, y ∈ Rn
– Multiplication by scalars: λx = λ(x1 , . . . , xn ) = (λx1 , . . . , λxn ) for
all λ ∈ R, x ∈ Rn
V = Rm×n , m, n ∈ N, is a vector space with

– Addition: A + B is defined elementwise for all A, B ∈ V :

              a11 + b11   · · ·   a1n + b1n
    A + B =       ⋮                   ⋮
              am1 + bm1   · · ·   amn + bmn

– Multiplication by scalars:

           λa11   · · ·   λa1n
    λA =     ⋮              ⋮
           λam1   · · ·   λamn

  as defined in Section 2.2. Remember that Rm×n is equivalent to Rmn .
V = C, with the standard definition of addition of complex numbers.

Remark. In the following, we will denote a vector space (V, +, ·) by V


when + and · are the standard vector addition and scalar multiplication.
Moreover, we will use the notation x ∈ V for vectors in V to simplify
notation. ♦
Remark. The vector spaces Rn , Rn×1 , R1×n are only different in the way
we write vectors. In the following, we will not make a distinction between
Rn and Rn×1 , which allows us to write n-tuples as column vectors

    x = [x1 , . . . , xn ]> .                   (2.64)

This simplifies the notation regarding vector space operations. However,
we do distinguish between Rn×1 and R1×n (the row vectors) to avoid con-
fusion with matrix multiplication. By default, we write x to denote a col-
umn vector, and a row vector is denoted by x> , the transpose of x. ♦


2.4.3 Vector Subspaces


In the following, we will introduce vector subspaces. Intuitively, they are
sets contained in the original vector space with the property that when
we perform vector space operations on elements within this subspace, we
will never leave it. In this sense, they are “closed”. Vector subspaces are a
key idea in machine learning. For example, Chapter 10 demonstrates how
to use vector subspaces for dimensionality reduction.
Definition 2.10 (Vector Subspace). Let V = (V, +, ·) be a vector space
and U ⊆ V , U ≠ ∅. Then U = (U, +, ·) is called a vector subspace of V (or
linear subspace) if U is a vector space with the vector space operations +
and · restricted to U × U and R × U . We write U ⊆ V to denote a subspace
U of V .
If U ⊆ V and V is a vector space, then U naturally inherits many prop-
erties directly from V because they hold for all x ∈ V , and in particular for
all x ∈ U ⊆ V . This includes the Abelian group properties, the distribu-
tivity, the associativity and the neutral element. To determine whether
(U, +, ·) is a subspace of V we still do need to show
1. U 6= ∅, in particular: 0 ∈ U
2. Closure of U :
a. With respect to the outer operation: ∀λ ∈ R ∀x ∈ U : λx ∈ U .
b. With respect to the inner operation: ∀x, y ∈ U : x + y ∈ U .

Example 2.12 (Vector Subspaces)


Let us have a look at some examples:
For every vector space V , the trivial subspaces are V itself and {0}.
Only example D in Figure 2.6 is a subspace of R2 (with the usual inner/
outer operations). In A and C , the closure property is violated; B does
not contain 0.
The solution set of a homogeneous system of linear equations Ax = 0
with n unknowns x = [x1 , . . . , xn ]> is a subspace of Rn .
The solution of an inhomogeneous system of linear equations Ax =
b, b 6= 0 is not a subspace of Rn .
The intersection of arbitrarily many subspaces is a subspace itself.

[Figure 2.6: Not all subsets of R2 are subspaces. In A and C, the closure
property is violated; B does not contain 0. Only D is a subspace.]


Remark. Every subspace U ⊆ (Rn , +, ·) is the solution space of a homo-


geneous system of linear equations Ax = 0 for x ∈ Rn . ♦

2.5 Linear Independence


In the following, we will have a close look at what we can do with vectors
(elements of the vector space). In particular, we can add vectors together
and multiply them with scalars. The closure property guarantees that we
end up with another vector in the same vector space. It is possible to find
a set of vectors with which we can represent every vector in the vector
space by adding them together and scaling them. This set of vectors is
a basis, and we will discuss them in Section 2.6.1. Before we get there,
we will need to introduce the concepts of linear combinations and linear
independence.
Definition 2.11 (Linear Combination). Consider a vector space V and a
finite number of vectors x1 , . . . , xk ∈ V . Then, every v ∈ V of the form

    v = λ1 x1 + · · · + λk xk = Σ_{i=1}^{k} λi xi ∈ V       (2.65)

with λ1 , . . . , λk ∈ R is a linear combination of the vectors x1 , . . . , xk .


The 0-vector can always be written as the linear combination of k vec-
tors x1 , . . . , xk because 0 = Σ_{i=1}^{k} 0 xi is always true. In the following,
we are interested in non-trivial linear combinations of a set of vectors to
represent 0, i.e., linear combinations of vectors x1 , . . . , xk , where not all
coefficients λi in (2.65) are 0.
Definition 2.12 (Linear (In)dependence). Let us consider a vector space
V with k ∈ N and x1 , . . . , xk ∈ V . If there is a non-trivial linear com-
bination, such that 0 = Σ_{i=1}^{k} λi xi with at least one λi ≠ 0, the vectors
x1 , . . . , xk are linearly dependent. If only the trivial solution exists, i.e.,
λ1 = . . . = λk = 0, the vectors x1 , . . . , xk are linearly independent.
Linear independence is one of the most important concepts in linear
algebra. Intuitively, a set of linearly independent vectors consists of vectors
that have no redundancy, i.e., if we remove any of those vectors from
the set, we will lose something. Throughout the next sections, we will
formalize this intuition more.

Example 2.13 (Linearly Dependent Vectors)


A geographic example may help to clarify the concept of linear indepen-
dence. A person in Nairobi (Kenya) describing where Kigali (Rwanda) is
might say, “You can get to Kigali by first going 506 km Northwest to Kam-
pala (Uganda) and then 374 km Southwest.”. This is sufficient information


to describe the location of Kigali because the geographic coordinate sys-


tem may be considered a two-dimensional vector space (ignoring altitude
and the Earth’s curved surface). The person may add, “It is about 751 km
West of here.” Although this last statement is true, it is not necessary to
find Kigali given the previous information (see Figure 2.7 for an illus-
tration). In this example, the “506 km Northwest” vector (blue) and the
“374 km Southwest” vector (purple) are linearly independent. This means
the Southwest vector cannot be described in terms of the Northwest vec-
tor, and vice versa. However, the third “751 km West” vector (black) is a
linear combination of the other two vectors, and it makes the set of vec-
tors linearly dependent. Equivalently, “751 km West” and “374 km
Southwest” can be linearly combined to obtain “506 km Northwest”.

[Figure 2.7: Geographic example (with crude approximations to cardinal
directions) of linearly dependent vectors in a two-dimensional space
(plane): “506 km Northwest” (Nairobi to Kampala), “374 km Southwest”
(Kampala to Kigali), and “751 km West” (Nairobi to Kigali).]
Remark. The following properties are useful to find out whether vectors
are linearly independent:

k vectors are either linearly dependent or linearly independent. There


is no third option.
If at least one of the vectors x1 , . . . , xk is 0 then they are linearly de-
pendent. The same holds if two vectors are identical.
The vectors {x1 , . . . , xk : xi ≠ 0, i = 1, . . . , k}, k ≥ 2, are linearly
dependent if and only if (at least) one of them is a linear combination
of the others. In particular, if one vector is a multiple of another vector,
i.e., xi = λxj , λ ∈ R, then the set {x1 , . . . , xk : xi ≠ 0, i = 1, . . . , k}
is linearly dependent.
A practical way of checking whether vectors x1 , . . . , xk ∈ V are linearly
independent is to use Gaussian elimination: Write all vectors as columns
of a matrix A and perform Gaussian elimination until the matrix is in
row echelon form (the reduced row-echelon form is unnecessary here):


– The pivot columns indicate the vectors, which are linearly indepen-
dent of the vectors on the left. Note that there is an ordering of vec-
tors when the matrix is built.
– The non-pivot columns can be expressed as linear combinations of
the pivot columns on their left. For instance, the row-echelon form
 
    1  3  0
    0  0  2                                     (2.66)
tells us that the first and third columns are pivot columns. The sec-
ond column is a non-pivot column because it is three times the first
column.
All column vectors are linearly independent if and only if all columns
are pivot columns. If there is at least one non-pivot column, the columns
(and, therefore, the corresponding vectors) are linearly dependent.

Example 2.14
Consider R4 with
    x1 = [1, 2, −3, 4]> ,   x2 = [1, 1, 0, 2]> ,   x3 = [−1, −2, 1, 1]> .   (2.67)
To check whether they are linearly dependent, we follow the general ap-
proach and solve
    λ1 x1 + λ2 x2 + λ3 x3 = λ1 [1, 2, −3, 4]> + λ2 [1, 1, 0, 2]> + λ3 [−1, −2, 1, 1]> = 0   (2.68)
for λ1 , . . . , λ3 . We write the vectors xi , i = 1, 2, 3, as the columns of a
matrix and apply elementary row operations until we identify the pivot
columns:

1 −1 1 −1
   
1 1
 2 1 −2 0 1 0

−3
 ···  . (2.69)
0 1 0 0 1
4 2 1 0 0 0
Here, every column of the matrix is a pivot column. Therefore, there is no
non-trivial solution, and we require λ1 = 0, λ2 = 0, λ3 = 0 to solve the
equation system. Hence, the vectors x1 , x2 , x3 are linearly independent.
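The elimination in (2.69) can be reproduced with SymPy's exact rref (a sketch, assuming SymPy is available; `rref` returns the reduced form together with the pivot-column indices):

```python
import sympy as sp

# Columns are x1, x2, x3 from (2.67)
X = sp.Matrix([[ 1,  1, -1],
               [ 2,  1, -2],
               [-3,  0,  1],
               [ 4,  2,  1]])
rref, pivots = X.rref()
print(pivots)   # (0, 1, 2): every column is a pivot column,
                # so x1, x2, x3 are linearly independent
```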


Remark. Consider a vector space V with k linearly independent vectors


b1 , . . . , bk and m linear combinations
    x1 = Σ_{i=1}^{k} λi1 bi ,
      ⋮                                         (2.70)
    xm = Σ_{i=1}^{k} λim bi .

Defining B = [b1 , . . . , bk ] as the matrix whose columns are the linearly


independent vectors b1 , . . . , bk , we can write
 
    xj = B λj ,   λj = [λ1j , . . . , λkj ]> ,   j = 1, . . . , m ,     (2.71)
in a more compact form.
We want to test whether x1 , . . . , xm are linearly independent. For this
purpose, we follow the general approach of testing when Σ_{j=1}^{m} ψj xj = 0.
With (2.71), we obtain
    Σ_{j=1}^{m} ψj xj = Σ_{j=1}^{m} ψj B λj = B Σ_{j=1}^{m} ψj λj .     (2.72)

This means that {x1 , . . . , xm } are linearly independent if and only if the
column vectors {λ1 , . . . , λm } are linearly independent.

Remark. In a vector space V , m linear combinations of k vectors x1 , . . . , xk
are linearly dependent if m > k . ♦

Example 2.15
Consider a set of linearly independent vectors b1 , b2 , b3 , b4 ∈ Rn and
    x1 =   b1 −  2b2 +   b3 −  b4
    x2 = −4b1 −  2b2        + 4b4
    x3 =  2b1 +  3b2 −   b3 − 3b4               (2.73)
    x4 = 17b1 − 10b2 + 11b3 +  b4
Are the vectors x1 , . . . , x4 ∈ Rn linearly independent? To answer this
question, we investigate whether the column vectors
       

 1 −4 2 17 
−2 , −2 ,  3  , −10
       
 1   0  −1  11  (2.74)

 
−1 4 −3 1
 


are linearly independent. The reduced row-echelon form of the corre-


sponding linear equation system with coefficient matrix
         1  −4   2   17
    A = −2  −2   3  −10                         (2.75)
         1   0  −1   11
        −1   4  −3    1
is given as
    1  0  0   −7
    0  1  0  −15
    0  0  1  −18  .                             (2.76)
    0  0  0    0
We see that the corresponding linear equation system is non-trivially solv-
able: The last column is not a pivot column, and x4 = −7x1 −15x2 −18x3 .
Therefore, x1 , . . . , x4 are linearly dependent as x4 can be expressed as a
linear combination of x1 , . . . , x3 .
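SymPy confirms the reduced row-echelon form (2.76) and reads off the dependency directly (a sketch, assuming SymPy is available):

```python
import sympy as sp

# Coefficient matrix (2.75); columns correspond to x1, ..., x4
A = sp.Matrix([[ 1, -4,  2,  17],
               [-2, -2,  3, -10],
               [ 1,  0, -1,  11],
               [-1,  4, -3,   1]])
rref, pivots = A.rref()
print(pivots)       # (0, 1, 2): the last column is not a pivot column
print(rref[:, 3])   # [-7, -15, -18, 0]: x4 = -7 x1 - 15 x2 - 18 x3
```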

2.6 Basis and Rank


In a vector space V , we are particularly interested in sets of vectors A that
possess the property that any vector v ∈ V can be obtained by a linear
combination of vectors in A. These vectors are special vectors, and in the
following, we will characterize them.

2.6.1 Generating Set and Basis


Definition 2.13 (Generating Set and Span). Consider a vector space V =
(V, +, ·) and a set of vectors A = {x1 , . . . , xk } ⊆ V . If every vector v ∈
V can be expressed as a linear combination of x1 , . . . , xk , A is called a
generating set of V . The set of all linear combinations of vectors in A is
called the span of A. If A spans the vector space V , we write V = span[A]
or V = span[x1 , . . . , xk ].

Generating sets are sets of vectors that span vector (sub)spaces, i.e.,
every vector can be represented as a linear combination of the vectors
in the generating set. Now, we will be more specific and characterize the
smallest generating set that spans a vector (sub)space.

Definition 2.14 (Basis). Consider a vector space V = (V, +, ·) and A ⊆
V . A generating set A of V is called minimal if there exists no smaller set
Ã ⊊ A ⊆ V that spans V . Every linearly independent generating set of V
is minimal and is called a basis of V .


Let V = (V, +, ·) be a vector space and B ⊆ V , B ≠ ∅. Then, the
following statements are equivalent (in short: a basis is a minimal
generating set and a maximal linearly independent set of vectors):

B is a basis of V .
B is a minimal generating set.
B is a maximal linearly independent set of vectors in V , i.e., adding any
other vector to this set will make it linearly dependent.
Every vector x ∈ V is a linear combination of vectors from B , and every
linear combination is unique, i.e., with

    x = Σ_{i=1}^{k} λi bi = Σ_{i=1}^{k} ψi bi   (2.77)

and λi , ψi ∈ R, bi ∈ B it follows that λi = ψi , i = 1, . . . , k .

Example 2.16

In R3 , the canonical/standard basis is

    B = { [1, 0, 0]> , [0, 1, 0]> , [0, 0, 1]> } .          (2.78)

Different bases in R3 are

    B1 = { [1, 0, 0]> , [1, 1, 0]> , [1, 1, 1]> } ,
    B2 = { [0.5, 0.8, 0.4]> , [1.8, 0.3, 0.3]> , [−2.2, −1.3, 3.5]> } .   (2.79)

The set

    A = { [1, 2, 3, 4]> , [2, −1, 0, 2]> , [1, 1, 0, −4]> }            (2.80)

is linearly independent, but not a generating set (and no basis) of R4 :
For instance, the vector [1, 0, 0, 0]> cannot be obtained by a linear com-
bination of elements in A.

Remark. Every vector space V possesses a basis B . The preceding exam-
ples show that there can be many bases of a vector space V , i.e., there is
no unique basis. However, all bases possess the same number of elements,
the basis vectors. ♦

We only consider finite-dimensional vector spaces V . In this case, the
dimension of V is the number of basis vectors of V , and we write dim(V ).
If U ⊆ V is a subspace of V , then dim(U ) ≤ dim(V ) and dim(U ) =


dim(V ) if and only if U = V . Intuitively, the dimension of a vector space


can be thought of as the number of independent directions in this vector
space. (The dimension of a vector space corresponds to the number of its
basis vectors.)

Remark. The dimension of a vector space is not necessarily the number
of elements in a vector. For instance, the vector space V = span[ [0, 1]> ] is
one-dimensional, although the basis vector possesses two elements. ♦
Remark. A basis of a subspace U = span[x1 , . . . , xm ] ⊆ Rn can be found
by executing the following steps:

1. Write the spanning vectors as columns of a matrix A.
2. Determine the row-echelon form of A.
3. The spanning vectors associated with the pivot columns are a basis of
   U . ♦

Example 2.17 (Determining a Basis)


For a vector subspace U ⊆ R5 , spanned by the vectors
    x1 = [1, 2, −1, −1, −1]> ,   x2 = [2, −1, 1, 2, −2]> ,
    x3 = [3, −4, 3, 5, −3]> ,    x4 = [−1, 8, −5, −6, 1]> ∈ R5 ,    (2.81)
we are interested in finding out which vectors x1 , . . . , x4 are a basis for U .
For this, we need to check whether x1 , . . . , x4 are linearly independent.
Therefore, we need to solve
    Σ_{i=1}^{4} λi xi = 0 ,                     (2.82)

which leads to a homogeneous system of equations with matrix

                            1   2   3  −1
                            2  −1  −4   8
    [x1 , x2 , x3 , x4 ] = −1   1   3  −5  .    (2.83)
                           −1   2   5  −6
                           −1  −2  −3   1
With the basic transformation rules for systems of linear equations, we
obtain the row-echelon form
     1   2   3  −1          1  2  3  −1
     2  −1  −4   8          0  1  2  −2
    −1   1   3  −5  · · ·   0  0  0   1  .
    −1   2   5  −6          0  0  0   0
    −1  −2  −3   1          0  0  0   0


Since the pivot columns indicate which set of vectors is linearly indepen-
dent, we see from the row-echelon form that x1 , x2 , x4 are linearly inde-
pendent (because the system of linear equations λ1 x1 + λ2 x2 + λ4 x4 = 0
can only be solved with λ1 = λ2 = λ4 = 0). Therefore, {x1 , x2 , x4 } is a
basis of U .
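The pivot-column computation of Example 2.17 can be reproduced in SymPy (a sketch, assuming SymPy is available):

```python
import sympy as sp

# Columns are x1, ..., x4 from (2.81)
X = sp.Matrix([[ 1,  2,  3, -1],
               [ 2, -1, -4,  8],
               [-1,  1,  3, -5],
               [-1,  2,  5, -6],
               [-1, -2, -3,  1]])
_, pivots = X.rref()
print(pivots)   # (0, 1, 3): x1, x2, x4 form a basis of U
```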

2.6.2 Rank

The number of linearly independent columns of a matrix A ∈ Rm×n
equals the number of linearly independent rows and is called the rank
of A and is denoted by rk(A).

Remark. The rank of a matrix has some important properties:

rk(A) = rk(A> ), i.e., the column rank equals the row rank.
The columns of A ∈ Rm×n span a subspace U ⊆ Rm with dim(U ) =
rk(A). Later we will call this subspace the image or range. A basis of
U can be found by applying Gaussian elimination to A to identify the
pivot columns.
The rows of A ∈ Rm×n span a subspace W ⊆ Rn with dim(W ) =
rk(A). A basis of W can be found by applying Gaussian elimination to
A> .
For all A ∈ Rn×n it holds that A is regular (invertible) if and only if
rk(A) = n.
For all A ∈ Rm×n and all b ∈ Rm it holds that the linear equation
system Ax = b can be solved if and only if rk(A) = rk(A|b), where
A|b denotes the augmented system.
For A ∈ Rm×n the subspace of solutions for Ax = 0 possesses dimen-
sion n − rk(A). Later, we will call this subspace the kernel or the null
space.
A matrix A ∈ Rm×n has full rank if its rank equals the largest possible
rank for a matrix of the same dimensions. This means that the rank of
a full-rank matrix is the lesser of the number of rows and columns, i.e.,
rk(A) = min(m, n). A matrix is said to be rank deficient if it does not
have full rank. ♦

Example 2.18 (Rank)


 
A = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix} .
A has two linearly independent rows/columns so that rk(A) = 2.

©2021 M. P. Deisenroth, A. A. Faisal, C. S. Ong. Published by Cambridge University Press (2020).



 
A = \begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix} .

We use Gaussian elimination to determine the rank:

\begin{bmatrix} 1 & 2 & 1 \\ -2 & -3 & 1 \\ 3 & 5 & 0 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 2 & 1 \\ 0 & 1 & 3 \\ 0 & 0 & 0 \end{bmatrix} . \quad (2.84)
Here, we see that the number of linearly independent rows and columns
is 2, such that rk(A) = 2.
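Such rank computations can be sanity-checked numerically. The following NumPy sketch (our own illustration, not code from the book) recomputes the rank of the matrix from (2.84):

```python
import numpy as np

# Rank of the matrix from (2.84): Gaussian elimination leaves two
# nonzero rows, so the rank should be 2.
A = np.array([[1, 2, 1],
              [-2, -3, 1],
              [3, 5, 0]])
print(np.linalg.matrix_rank(A))  # 2
```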

2.7 Linear Mappings


In the following, we will study mappings on vector spaces that preserve
their structure, which will allow us to define the concept of a coordinate.
In the beginning of the chapter, we said that vectors are objects that can be
added together and multiplied by a scalar, and the resulting object is still
a vector. We wish to preserve this property when applying the mapping:
Consider two real vector spaces V, W . A mapping Φ : V → W preserves
the structure of the vector space if
Φ(x + y) = Φ(x) + Φ(y) (2.85)
Φ(λx) = λΦ(x) (2.86)
for all x, y ∈ V and λ ∈ R. We can summarize this in the following
definition:
Definition 2.15 (Linear Mapping). For vector spaces V, W , a mapping Φ : V → W is called a linear mapping (or vector space homomorphism/linear transformation) if

∀x, y ∈ V ∀λ, ψ ∈ R : Φ(λx + ψy) = λΦ(x) + ψΦ(y) . (2.87)
It turns out that we can represent linear mappings as matrices (Sec-
tion 2.7.1). Recall that we can also collect a set of vectors as columns of a
matrix. When working with matrices, we have to keep in mind what the
matrix represents: a linear mapping or a collection of vectors. We will see
more about linear mappings in Chapter 4. Before we continue, we will
briefly introduce special mappings.
Definition 2.16 (Injective, Surjective, Bijective). Consider a mapping Φ : V → W , where V, W can be arbitrary sets. Then Φ is called

- injective if ∀x, y ∈ V : Φ(x) = Φ(y) =⇒ x = y ;
- surjective if Φ(V) = W ;
- bijective if it is injective and surjective.


If Φ is surjective, then every element in W can be “reached” from V using Φ. A bijective Φ can be “undone”, i.e., there exists a mapping Ψ :
W → V so that Ψ ◦ Φ(x) = x. This mapping Ψ is then called the inverse
of Φ and normally denoted by Φ−1 .
With these definitions, we introduce the following special cases of linear
mappings between vector spaces V and W :
- Isomorphism: Φ : V → W linear and bijective
- Endomorphism: Φ : V → V linear
- Automorphism: Φ : V → V linear and bijective

We define idV : V → V , x ↦ x, as the identity mapping or identity automorphism in V .

Example 2.19 (Homomorphism)


The mapping Φ : R2 → C, Φ(x) = x1 + ix2 , is a homomorphism:
   
x1 y
Φ + 1 = (x1 + y1 ) + i(x2 + y2 ) = x1 + ix2 + y1 + iy2
x2 y2
   
x1 y1
=Φ +Φ
x2 y2
    
x x1
Φ λ 1 = λx1 + λix2 = λ(x1 + ix2 ) = λΦ .
x2 x2
(2.88)
This also justifies why complex numbers can be represented as tuples in
R2 : There is a bijective linear mapping that converts the elementwise addi-
tion of tuples in R2 into the set of complex numbers with the correspond-
ing addition. Note that we only showed linearity, but not the bijection.

Theorem 2.17 (Theorem 3.59 in Axler (2015)). Finite-dimensional vector spaces V and W are isomorphic if and only if dim(V ) = dim(W ).
Theorem 2.17 states that there exists a linear, bijective mapping be-
tween two vector spaces of the same dimension. Intuitively, this means
that vector spaces of the same dimension are kind of the same thing, as
they can be transformed into each other without incurring any loss.
Theorem 2.17 also gives us the justification to treat Rm×n (the vector
space of m × n-matrices) and Rmn (the vector space of vectors of length
mn) the same, as their dimensions are mn, and there exists a linear, bi-
jective mapping that transforms one into the other.
Remark. Consider vector spaces V, W, X . Then:

- For linear mappings Φ : V → W and Ψ : W → X , the mapping Ψ ◦ Φ : V → X is also linear.
- If Φ : V → W is an isomorphism, then Φ−1 : W → V is an isomorphism, too.


[Figure 2.8: Two different coordinate systems defined by two sets of basis vectors. A vector x has different coordinate representations depending on which coordinate system is chosen.]

- If Φ : V → W and Ψ : V → W are linear, then Φ + Ψ and λΦ, λ ∈ R, are linear, too.

2.7.1 Matrix Representation of Linear Mappings


Any n-dimensional vector space is isomorphic to Rn (Theorem 2.17). We
consider a basis {b1 , . . . , bn } of an n-dimensional vector space V . In the
following, the order of the basis vectors will be important. Therefore, we
write

B = (b1 , . . . , bn ) (2.89)

and call this n-tuple an ordered basis of V .


Remark (Notation). We are at the point where notation gets a bit tricky.
Therefore, we summarize some parts here. B = (b1 , . . . , bn ) is an ordered
basis, B = {b1 , . . . , bn } is an (unordered) basis, and B = [b1 , . . . , bn ] is a
matrix whose columns are the vectors b1 , . . . , bn . ♦
Definition 2.18 (Coordinates). Consider a vector space V and an ordered
basis B = (b1 , . . . , bn ) of V . For any x ∈ V we obtain a unique represen-
tation (linear combination)

x = α1 b1 + . . . + αn bn (2.90)

of x with respect to B . Then α1 , . . . , αn are the coordinates of x with respect to B , and the vector

\alpha = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix} \in \mathbb{R}^n \quad (2.91)

is the coordinate vector/coordinate representation of x with respect to the ordered basis B .


A basis effectively defines a coordinate system. We are familiar with the


Cartesian coordinate system in two dimensions, which is spanned by the
canonical basis vectors e1 , e2 . In this coordinate system, a vector x ∈ R2
has a representation that tells us how to linearly combine e1 and e2 to
obtain x. However, any basis of R2 defines a valid coordinate system,
and the same vector x from before may have a different coordinate rep-
resentation in the (b1 , b2 ) basis. In Figure 2.8, the coordinates of x with
respect to the standard basis (e1 , e2 ) are [2, 2]⊤ . However, with respect to the basis (b1 , b2 ) the same vector x is represented as [1.09, 0.72]⊤ , i.e.,
x = 1.09b1 + 0.72b2 . In the following sections, we will discover how to
obtain this representation.

Example 2.20
Let us have a look at a geometric vector x ∈ R2 with coordinates [2, 3]⊤ with respect to the standard basis (e1 , e2 ) of R2 . This means we can write x = 2e1 + 3e2 . However, we do not have to choose the standard basis to represent this vector. If we use the basis vectors b1 = [1, −1]⊤ , b2 = [1, 1]⊤ , we obtain the coordinates (1/2)[−1, 5]⊤ to represent the same vector with respect to (b1 , b2 ) (see Figure 2.9).

[Figure 2.9: Different coordinate representations of a vector x, depending on the choice of basis: x = 2e1 + 3e2 = −(1/2)b1 + (5/2)b2 .]
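Finding coordinates with respect to a non-standard basis amounts to solving a small linear system. A NumPy sketch of this computation for the example above (our own illustration; variable names are not from the book):

```python
import numpy as np

# Columns of B are the basis vectors b1 = [1, -1] and b2 = [1, 1].
# Solving B @ alpha = x yields the coordinates of x w.r.t. (b1, b2).
B = np.array([[1.0, 1.0],
              [-1.0, 1.0]])
x = np.array([2.0, 3.0])
alpha = np.linalg.solve(B, x)
print(alpha)  # [-0.5  2.5]
```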

Remark. For an n-dimensional vector space V and an ordered basis B of V , the mapping Φ : Rn → V , Φ(ei ) = bi , i = 1, . . . , n, is linear (and because of Theorem 2.17 an isomorphism), where (e1 , . . . , en ) is the standard basis of Rn . ♦
Now we are ready to make an explicit connection between matrices and
linear mappings between finite-dimensional vector spaces.
Definition 2.19 (Transformation Matrix). Consider vector spaces V, W
with corresponding (ordered) bases B = (b1 , . . . , bn ) and C = (c1 , . . . , cm ).
Moreover, we consider a linear mapping Φ : V → W . For j ∈ {1, . . . , n},

\Phi(b_j) = \alpha_{1j} c_1 + \cdots + \alpha_{mj} c_m = \sum_{i=1}^{m} \alpha_{ij} c_i \quad (2.92)

is the unique representation of Φ(bj ) with respect to C . Then, we call the m × n-matrix AΦ , whose elements are given by

A_\Phi(i, j) = \alpha_{ij} , \quad (2.93)

the transformation matrix of Φ (with respect to the ordered bases B of V and C of W ).

The coordinates of Φ(bj ) with respect to the ordered basis C of W


are the j -th column of AΦ . Consider (finite-dimensional) vector spaces
V, W with ordered bases B, C and a linear mapping Φ : V → W with


transformation matrix AΦ . If x̂ is the coordinate vector of x ∈ V with


respect to B and ŷ the coordinate vector of y = Φ(x) ∈ W with respect
to C , then

ŷ = AΦ x̂ . (2.94)

This means that the transformation matrix can be used to map coordinates
with respect to an ordered basis in V to coordinates with respect to an
ordered basis in W .

Example 2.21 (Transformation Matrix)


Consider a homomorphism Φ : V → W and ordered bases B =
(b1 , . . . , b3 ) of V and C = (c1 , . . . , c4 ) of W . With
Φ(b1 ) = c1 − c2 + 3c3 − c4
Φ(b2 ) = 2c1 + c2 + 7c3 + 2c4 (2.95)
Φ(b3 ) = 3c2 + c3 + 4c4

the transformation matrix AΦ with respect to B and C satisfies \Phi(b_k) = \sum_{i=1}^{4} \alpha_{ik} c_i for k = 1, . . . , 3 and is given as

A_\Phi = [\alpha_1, \alpha_2, \alpha_3] = \begin{bmatrix} 1 & 2 & 0 \\ -1 & 1 & 3 \\ 3 & 7 & 1 \\ -1 & 2 & 4 \end{bmatrix} , \quad (2.96)
where the αj , j = 1, 2, 3, are the coordinate vectors of Φ(bj ) with respect
to C .
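Relation (2.94), ŷ = AΦ x̂, can be checked numerically for this example. The following sketch (our own, not from the book) maps the coordinate vector of b1 , which is e1 with respect to B :

```python
import numpy as np

# Transformation matrix A_phi from (2.96). The image of the coordinate
# vector e1 is the first column of A_phi, i.e., the coordinates of
# Phi(b1) w.r.t. C, in agreement with (2.95).
A_phi = np.array([[1, 2, 0],
                  [-1, 1, 3],
                  [3, 7, 1],
                  [-1, 2, 4]])
x_hat = np.array([1, 0, 0])
print(A_phi @ x_hat)  # [ 1 -1  3 -1]
```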

Example 2.22 (Linear Transformations of Vectors)

[Figure 2.10: Three examples of linear transformations of the vectors shown as dots in (a): (b) rotation by 45°; (c) stretching of the horizontal coordinates by 2; (d) combination of reflection, rotation, and stretching.]

We consider three linear transformations of a set of vectors in R2 with the transformation matrices

A_1 = \begin{bmatrix} \cos(\tfrac{\pi}{4}) & -\sin(\tfrac{\pi}{4}) \\ \sin(\tfrac{\pi}{4}) & \cos(\tfrac{\pi}{4}) \end{bmatrix} , \quad A_2 = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} , \quad A_3 = \frac{1}{2} \begin{bmatrix} 3 & -1 \\ 1 & -1 \end{bmatrix} . \quad (2.97)


Figure 2.10 gives three examples of linear transformations of a set of vec-


tors. Figure 2.10(a) shows 400 vectors in R2 , each of which is represented
by a dot at the corresponding (x1 , x2 )-coordinates. The vectors are ar-
ranged in a square. When we use matrix A1 in (2.97) to linearly transform
each of these vectors, we obtain the rotated square in Figure 2.10(b). If we
apply the linear mapping represented by A2 , we obtain the rectangle in
Figure 2.10(c) where each x1 -coordinate is stretched by 2. Figure 2.10(d)
shows the original square from Figure 2.10(a) when linearly transformed
using A3 , which is a combination of a reflection, a rotation, and a stretch.
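Each transformation in Figure 2.10 amounts to multiplying every point by the corresponding matrix. A small sketch applying the rotation A1 to the standard basis vectors (our own illustration):

```python
import numpy as np

# A1 rotates by 45 degrees; applying it to the standard basis vectors
# (the columns of X) gives the rotated basis.
theta = np.pi / 4
A1 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta), np.cos(theta)]])
X = np.eye(2)  # columns: e1, e2
print(A1 @ X)
```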

2.7.2 Basis Change


In the following, we will have a closer look at how transformation matrices
of a linear mapping Φ : V → W change if we change the bases in V and
W . Consider two ordered bases
B = (b1 , . . . , bn ), B̃ = (b̃1 , . . . , b̃n ) (2.98)
of V and two ordered bases
C = (c1 , . . . , cm ), C̃ = (c̃1 , . . . , c̃m ) (2.99)
of W . Moreover, AΦ ∈ Rm×n is the transformation matrix of the linear
mapping Φ : V → W with respect to the bases B and C , and ÃΦ ∈ Rm×n
is the corresponding transformation matrix with respect to B̃ and C̃ . In the following, we will investigate how AΦ and ÃΦ are related, i.e., how/whether we can transform AΦ into ÃΦ if we choose to perform a basis
change from B, C to B̃, C̃ .
Remark. We effectively get different coordinate representations of the
identity mapping idV . In the context of Figure 2.9, this would mean to
map coordinates with respect to (e1 , e2 ) onto coordinates with respect to
(b1 , b2 ) without changing the vector x. By changing the basis and corre-
spondingly the representation of vectors, the transformation matrix with
respect to this new basis can have a particularly simple form that allows
for straightforward computation. ♦

Example 2.23 (Basis Change)


Consider a transformation matrix

A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \quad (2.100)

with respect to the canonical basis in R2 . If we define a new basis

B = \left( \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ -1 \end{bmatrix} \right) \quad (2.101)

we obtain a diagonal transformation matrix

\tilde{A} = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix} \quad (2.102)

with respect to B , which is easier to work with than A.
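This can be verified numerically: with the new basis vectors as the columns of a matrix S, the basis-changed matrix is S⁻¹AS (an illustrative sketch; the new basis vectors happen to be eigenvectors of A, which is why the result is diagonal):

```python
import numpy as np

# Basis change from Example 2.23: S holds the new basis vectors as
# columns, and S^{-1} A S is diagonal.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
S = np.array([[1.0, 1.0],
              [1.0, -1.0]])
A_tilde = np.linalg.inv(S) @ A @ S
print(A_tilde)  # diag(3, 1)
```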

In the following, we will look at mappings that transform coordinate vectors with respect to one basis into coordinate vectors with respect to
a different basis. We will state our main result first and then provide an
explanation.
Theorem 2.20 (Basis Change). For a linear mapping Φ : V → W , ordered
bases
B = (b1 , . . . , bn ), B̃ = (b̃1 , . . . , b̃n ) (2.103)
of V and
C = (c1 , . . . , cm ), C̃ = (c̃1 , . . . , c̃m ) (2.104)
of W , and a transformation matrix AΦ of Φ with respect to B and C , the
corresponding transformation matrix ÃΦ with respect to the bases B̃ and C̃
is given as
ÃΦ = T −1 AΦ S . (2.105)
Here, S ∈ Rn×n is the transformation matrix of idV that maps coordinates
with respect to B̃ onto coordinates with respect to B , and T ∈ Rm×m is the
transformation matrix of idW that maps coordinates with respect to C̃ onto
coordinates with respect to C .
Proof Following Drumm and Weil (2001), we can write the vectors of
the new basis B̃ of V as a linear combination of the basis vectors of B ,
such that
\tilde{b}_j = s_{1j} b_1 + \cdots + s_{nj} b_n = \sum_{i=1}^{n} s_{ij} b_i , \quad j = 1, \ldots, n . \quad (2.106)

Similarly, we write the new basis vectors C̃ of W as a linear combination of the basis vectors of C , which yields

\tilde{c}_k = t_{1k} c_1 + \cdots + t_{mk} c_m = \sum_{l=1}^{m} t_{lk} c_l , \quad k = 1, \ldots, m . \quad (2.107)

We define S = ((sij )) ∈ Rn×n as the transformation matrix that maps coordinates with respect to B̃ onto coordinates with respect to B and
T = ((tlk )) ∈ Rm×m as the transformation matrix that maps coordinates
with respect to C̃ onto coordinates with respect to C . In particular, the j th
column of S is the coordinate representation of b̃j with respect to B and


the k th column of T is the coordinate representation of c̃k with respect to


C . Note that both S and T are regular.
We are going to look at Φ(b̃j ) from two perspectives. First, applying the
mapping Φ, we get that for all j = 1, . . . , n
\Phi(\tilde{b}_j) = \sum_{k=1}^{m} \tilde{a}_{kj} \tilde{c}_k \overset{(2.107)}{=} \sum_{k=1}^{m} \tilde{a}_{kj} \sum_{l=1}^{m} t_{lk} c_l = \sum_{l=1}^{m} \left( \sum_{k=1}^{m} t_{lk} \tilde{a}_{kj} \right) c_l , \quad (2.108)

where we first expressed the new basis vectors c̃k ∈ W as linear com-
binations of the basis vectors cl ∈ W and then swapped the order of
summation.
Alternatively, when we express the b̃j ∈ V as linear combinations of
bj ∈ V , we arrive at
\Phi(\tilde{b}_j) \overset{(2.106)}{=} \Phi\left( \sum_{i=1}^{n} s_{ij} b_i \right) = \sum_{i=1}^{n} s_{ij} \Phi(b_i) = \sum_{i=1}^{n} s_{ij} \sum_{l=1}^{m} a_{li} c_l \quad (2.109a)

= \sum_{l=1}^{m} \left( \sum_{i=1}^{n} a_{li} s_{ij} \right) c_l , \quad j = 1, \ldots, n , \quad (2.109b)

where we exploited the linearity of Φ. Comparing (2.108) and (2.109b), it follows for all j = 1, . . . , n and l = 1, . . . , m that
\sum_{k=1}^{m} t_{lk} \tilde{a}_{kj} = \sum_{i=1}^{n} a_{li} s_{ij} \quad (2.110)

and, therefore,

T ÃΦ = AΦ S ∈ Rm×n , (2.111)

such that

ÃΦ = T −1 AΦ S , (2.112)

which proves Theorem 2.20.

Theorem 2.20 tells us that with a basis change in V (B is replaced with B̃ ) and W (C is replaced with C̃ ), the transformation matrix AΦ of a
linear mapping Φ : V → W is replaced by an equivalent matrix ÃΦ with

ÃΦ = T −1 AΦ S. (2.113)

Figure 2.11 illustrates this relation: Consider a homomorphism Φ : V → W and ordered bases B, B̃ of V and C, C̃ of W . The mapping ΦCB is an
instantiation of Φ and maps basis vectors of B onto linear combinations
of basis vectors of C . Assume that we know the transformation matrix AΦ
of ΦCB with respect to the ordered bases B, C . When we perform a basis
change from B to B̃ in V and from C to C̃ in W , we can determine the


corresponding transformation matrix ÃΦ as follows: First, we find the matrix representation of the linear mapping ΨB B̃ : V → V that maps coordinates with respect to the new basis B̃ onto the (unique) coordinates with respect to the “old” basis B (in V ). Then, we use the transformation matrix AΦ of ΦCB : V → W to map these coordinates onto the coordinates with respect to C in W . Finally, we use a linear mapping ΞC̃C : W → W to map the coordinates with respect to C onto coordinates with respect to C̃ . Therefore, we can express the linear mapping ΦC̃ B̃ as a composition of linear mappings that involve the “old” basis:

\Phi_{\tilde{C}\tilde{B}} = \Xi_{\tilde{C}C} \circ \Phi_{CB} \circ \Psi_{B\tilde{B}} = \Xi_{C\tilde{C}}^{-1} \circ \Phi_{CB} \circ \Psi_{B\tilde{B}} . \quad (2.114)

Concretely, we use ΨB B̃ = idV and ΞC C̃ = idW , i.e., the identity mappings that map vectors onto themselves, but with respect to a different basis.

[Figure 2.11: For a homomorphism Φ : V → W and ordered bases B, B̃ of V and C, C̃ of W , we can express the mapping ΦC̃ B̃ with respect to the bases B̃, C̃ equivalently as a composition of the homomorphisms ΦC̃ B̃ = ΞC̃C ◦ ΦCB ◦ ΨB B̃ with respect to the bases in the subscripts.]

Definition 2.21 (Equivalence). Two matrices A, Ã ∈ Rm×n are equivalent if there exist regular matrices S ∈ Rn×n and T ∈ Rm×m , such that Ã = T −1 AS .

Definition 2.22 (Similarity). Two matrices A, Ã ∈ Rn×n are similar if there exists a regular matrix S ∈ Rn×n with Ã = S −1 AS .

Remark. Similar matrices are always equivalent. However, equivalent matrices are not necessarily similar. ♦
Remark. Consider vector spaces V, W, X . From the remark that follows
Theorem 2.17, we already know that for linear mappings Φ : V → W
and Ψ : W → X the mapping Ψ ◦ Φ : V → X is also linear. With
transformation matrices AΦ and AΨ of the corresponding mappings, the
overall transformation matrix is AΨ◦Φ = AΨ AΦ . ♦
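The composition rule AΨ◦Φ = AΨ AΦ can be sanity-checked with arbitrary example matrices (an illustrative sketch, not from the book):

```python
import numpy as np

# Applying Phi then Psi to a vector gives the same result as applying
# the single matrix A_Psi @ A_Phi: A_{Psi o Phi} = A_Psi A_Phi.
A_Phi = np.array([[1, 2],
                  [0, 1]])
A_Psi = np.array([[0, 1],
                  [1, 0]])
x = np.array([1, 1])
assert np.array_equal(A_Psi @ (A_Phi @ x), (A_Psi @ A_Phi) @ x)
print(A_Psi @ A_Phi)
```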
In light of this remark, we can look at basis changes from the perspec-
tive of composing linear mappings:

- AΦ is the transformation matrix of a linear mapping ΦCB : V → W with respect to the bases B, C .
- ÃΦ is the transformation matrix of the linear mapping ΦC̃ B̃ : V → W with respect to the bases B̃, C̃ .
- S is the transformation matrix of a linear mapping ΨB B̃ : V → V (automorphism) that represents B̃ in terms of B . Normally, Ψ = idV is the identity mapping in V .


- T is the transformation matrix of a linear mapping ΞC C̃ : W → W (automorphism) that represents C̃ in terms of C . Normally, Ξ = idW is the identity mapping in W .
If we (informally) write down the transformations just in terms of bases,
then AΦ : B → C , ÃΦ : B̃ → C̃ , S : B̃ → B , T : C̃ → C and
T −1 : C → C̃ , and
B̃ → C̃ = B̃ → B→ C → C̃ (2.115)
−1
ÃΦ = T AΦ S . (2.116)
Note that the execution order in (2.116) is from right to left because vec-
tors are multiplied at the right-hand side so that x 7→ Sx 7→ AΦ (Sx) 7→
T −1 AΦ (Sx) = ÃΦ x.


Example 2.24 (Basis Change)


Consider a linear mapping Φ : R3 → R4 whose transformation matrix is

A_\Phi = \begin{bmatrix} 1 & 2 & 0 \\ -1 & 1 & 3 \\ 3 & 7 & 1 \\ -1 & 2 & 4 \end{bmatrix} \quad (2.117)

with respect to the standard bases

B = \left( \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right) , \quad C = \left( \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \right) . \quad (2.118)

We seek the transformation matrix ÃΦ of Φ with respect to the new bases

\tilde{B} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \right) \subset \mathbb{R}^3 , \quad \tilde{C} = \left( \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} \right) . \quad (2.119)

Then,

S = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix} , \quad T = \begin{bmatrix} 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} , \quad (2.120)

where the ith column of S is the coordinate representation of b̃i in terms of the basis vectors of B . Since B is the standard basis, the coordinate representation is straightforward to find. For a general basis B , we would need to solve a linear equation system to find the λi such that \sum_{i=1}^{3} \lambda_i b_i = \tilde{b}_j , j = 1, . . . , 3. Similarly, the j th column of T is the coordinate representation of c̃j in terms of the basis vectors of C .
Therefore, we obtain

\tilde{A}_\Phi = T^{-1} A_\Phi S = \frac{1}{2} \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ -1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 2 \end{bmatrix} \begin{bmatrix} 3 & 2 & 1 \\ 0 & 4 & 2 \\ 10 & 8 & 4 \\ 1 & 6 & 3 \end{bmatrix} \quad (2.121a)

= \begin{bmatrix} -4 & -4 & -2 \\ 6 & 0 & 0 \\ 4 & 8 & 4 \\ 1 & 6 & 3 \end{bmatrix} . \quad (2.121b)
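The result of Example 2.24 can be verified numerically via (2.112) (an illustrative NumPy sketch, not code from the book):

```python
import numpy as np

# Numerically verify the basis change in Example 2.24:
# A_tilde = T^{-1} @ A_phi @ S, cf. (2.112).
A_phi = np.array([[1, 2, 0],
                  [-1, 1, 3],
                  [3, 7, 1],
                  [-1, 2, 4]], dtype=float)
S = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]], dtype=float)
T = np.array([[1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)
A_tilde = np.linalg.inv(T) @ A_phi @ S
print(A_tilde)
```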

In Chapter 4, we will be able to exploit the concept of a basis change to find a basis with respect to which the transformation matrix of an endomorphism has a particularly simple (diagonal) form. In Chapter 10, we
will look at a data compression problem and find a convenient basis onto
which we can project the data while minimizing the compression loss.

2.7.3 Image and Kernel


The image and kernel of a linear mapping are vector subspaces with cer-
tain important properties. In the following, we will characterize them
more carefully.
Definition 2.23 (Image and Kernel). For Φ : V → W , we define the kernel/null space

ker(Φ) := Φ−1 (0W ) = {v ∈ V : Φ(v) = 0W } (2.122)

and the image/range

Im(Φ) := Φ(V ) = {w ∈ W | ∃v ∈ V : Φ(v) = w} . (2.123)

We also call V and W the domain and codomain of Φ, respectively.
Intuitively, the kernel is the set of vectors v ∈ V that Φ maps onto the
neutral element 0W ∈ W . The image is the set of vectors w ∈ W that
can be “reached” by Φ from any vector in V . An illustration is given in
Figure 2.12.
Remark. Consider a linear mapping Φ : V → W , where V, W are vector spaces.

- It always holds that Φ(0V ) = 0W and, therefore, 0V ∈ ker(Φ). In particular, the null space is never empty.
- Im(Φ) ⊆ W is a subspace of W , and ker(Φ) ⊆ V is a subspace of V .


[Figure 2.12: Kernel and image of a linear mapping Φ : V → W .]

- Φ is injective (one-to-one) if and only if ker(Φ) = {0}.



Remark (Null Space and Column Space). Let us consider A ∈ Rm×n and a linear mapping Φ : Rn → Rm , x ↦ Ax.
- For A = [a1 , . . . , an ], where ai are the columns of A, we obtain

  \operatorname{Im}(\Phi) = \{ Ax : x \in \mathbb{R}^n \} = \left\{ \sum_{i=1}^{n} x_i a_i : x_1, \ldots, x_n \in \mathbb{R} \right\} \quad (2.124a)
  = \operatorname{span}[a_1, \ldots, a_n] \subseteq \mathbb{R}^m , \quad (2.124b)

  i.e., the image is the span of the columns of A, also called the column space. Therefore, the column space (image) is a subspace of Rm , where m is the “height” of the matrix.
- rk(A) = dim(Im(Φ)).
- The kernel/null space ker(Φ) is the general solution to the homogeneous system of linear equations Ax = 0 and captures all possible linear combinations of the elements in Rn that produce 0 ∈ Rm .
- The kernel is a subspace of Rn , where n is the “width” of the matrix.
- The kernel focuses on the relationship among the columns, and we can use it to determine whether/how we can express a column as a linear combination of other columns.

Example 2.25 (Image and Kernel of a Linear Mapping)


The mapping

\Phi : \mathbb{R}^4 \to \mathbb{R}^2 , \quad \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \mapsto \begin{bmatrix} 1 & 2 & -1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} x_1 + 2x_2 - x_3 \\ x_1 + x_4 \end{bmatrix} \quad (2.125a)

= x_1 \begin{bmatrix} 1 \\ 1 \end{bmatrix} + x_2 \begin{bmatrix} 2 \\ 0 \end{bmatrix} + x_3 \begin{bmatrix} -1 \\ 0 \end{bmatrix} + x_4 \begin{bmatrix} 0 \\ 1 \end{bmatrix} \quad (2.125b)
is linear. To determine Im(Φ), we can take the span of the columns of the
transformation matrix and obtain
       
\operatorname{Im}(\Phi) = \operatorname{span}\left[ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right] . \quad (2.126)
To compute the kernel (null space) of Φ, we need to solve Ax = 0, i.e.,
we need to solve a homogeneous equation system. To do this, we use
Gaussian elimination to transform A into reduced row-echelon form:
   
\begin{bmatrix} 1 & 2 & -1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix} \rightsquigarrow \cdots \rightsquigarrow \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & -\frac{1}{2} & -\frac{1}{2} \end{bmatrix} . \quad (2.127)
This matrix is in reduced row-echelon form, and we can use the Minus-
1 Trick to compute a basis of the kernel (see Section 2.3.3). Alternatively,
we can express the non-pivot columns (columns 3 and 4) as linear com-
binations of the pivot columns (columns 1 and 2). The third column a3 is
equivalent to − 21 times the second column a2 . Therefore, 0 = a3 + 12 a2 . In
the same way, we see that a4 = a1 − 12 a2 and, therefore, 0 = a1 − 12 a2 −a4 .
Overall, this gives us the kernel (null space) as
\ker(\Phi) = \operatorname{span}\left[ \begin{bmatrix} 0 \\ \frac{1}{2} \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ \frac{1}{2} \\ 0 \\ 1 \end{bmatrix} \right] . \quad (2.128)
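The kernel basis in (2.128) can be checked by multiplying it with the matrix of the mapping: both basis vectors must be mapped to the zero vector (an illustrative NumPy sketch, not from the book):

```python
import numpy as np

# Check the kernel basis from (2.128): A maps both basis vectors of
# ker(Phi) to the zero vector.
A = np.array([[1, 2, -1, 0],
              [1, 0, 0, 1]], dtype=float)
K = np.array([[0.0, -1.0],
              [0.5, 0.5],
              [1.0, 0.0],
              [0.0, 1.0]])  # columns: the two kernel basis vectors
print(A @ K)  # both columns are the zero vector
```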

Theorem 2.24 (Rank-Nullity Theorem). For vector spaces V, W and a linear mapping Φ : V → W it holds that

dim(ker(Φ)) + dim(Im(Φ)) = dim(V ) . (2.129)
The rank-nullity theorem is also referred to as the fundamental theorem of linear mappings (Axler, 2015, Theorem 3.22). The following are direct consequences of Theorem 2.24:
- If dim(Im(Φ)) < dim(V ), then ker(Φ) is non-trivial, i.e., the kernel contains more than 0V and dim(ker(Φ)) ≥ 1.
- If AΦ is the transformation matrix of Φ with respect to an ordered basis and dim(Im(Φ)) < dim(V ), then the system of linear equations AΦ x = 0 has infinitely many solutions.
- If dim(V ) = dim(W ), then the following three-way equivalence holds: Φ is injective ⟺ Φ is surjective ⟺ Φ is bijective, since Im(Φ) ⊆ W .
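Rank-nullity can be checked numerically for the mapping of Example 2.25: the rank gives dim(Im(Φ)), and the nullity follows as n − rk(A) (an illustrative sketch, not from the book):

```python
import numpy as np

# Rank-nullity (2.129) for the matrix of Example 2.25:
# dim(ker(Phi)) + dim(Im(Phi)) = dim(V) = 4.
A = np.array([[1, 2, -1, 0],
              [1, 0, 0, 1]], dtype=float)
rank = np.linalg.matrix_rank(A)  # dim(Im(Phi))
nullity = A.shape[1] - rank      # dim(ker(Phi)) = n - rk(A)
print(rank, nullity)  # 2 2
```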


2.8 Affine Spaces


In the following, we will have a closer look at spaces that are offset from
the origin, i.e., spaces that are no longer vector subspaces. Moreover, we
will briefly discuss properties of mappings between these affine spaces,
which resemble linear mappings.
Remark. In the machine learning literature, the distinction between linear
and affine is sometimes not clear so that we can find references to affine
spaces/mappings as linear spaces/mappings. ♦

2.8.1 Affine Subspaces


Definition 2.25 (Affine Subspace). Let V be a vector space, x0 ∈ V and U ⊆ V a subspace. Then the subset

L = x0 + U := {x0 + u : u ∈ U } (2.130a)
= {v ∈ V | ∃u ∈ U : v = x0 + u} ⊆ V (2.130b)

is called an affine subspace or linear manifold of V . U is called the direction or direction space, and x0 is called the support point. In Chapter 12, we refer to such a subspace as a hyperplane.

Note that the definition of an affine subspace excludes 0 if x0 ∉ U . Therefore, an affine subspace is not a (linear) subspace (vector subspace) of V for x0 ∉ U .
Examples of affine subspaces are points, lines, and planes in R3 , which do not (necessarily) go through the origin.

Remark. Consider two affine subspaces L = x0 + U and L̃ = x̃0 + Ũ of a vector space V . Then, L ⊆ L̃ if and only if U ⊆ Ũ and x0 − x̃0 ∈ Ũ .
Affine subspaces are often described by parameters: Consider a k-dimensional affine space L = x0 + U of V . If (b1 , . . . , bk ) is an ordered basis of U , then every element x ∈ L can be uniquely described as

x = x0 + λ1 b1 + . . . + λk bk , (2.131)

where λ1 , . . . , λk ∈ R. This representation is called the parametric equation of L with directional vectors b1 , . . . , bk and parameters λ1 , . . . , λk . ♦

Example 2.26 (Affine Subspaces)

- One-dimensional affine subspaces are called lines and can be written as y = x0 + λb1 , where λ ∈ R and U = span[b1 ] ⊆ Rn is a one-dimensional subspace of Rn . This means that a line is defined by a support point x0 and a vector b1 that defines the direction. See Figure 2.13 for an illustration.
- Two-dimensional affine subspaces of Rn are called planes. The parametric equation for planes is y = x0 + λ1 b1 + λ2 b2 , where λ1 , λ2 ∈ R and U = span[b1 , b2 ] ⊆ Rn . This means that a plane is defined by a support point x0 and two linearly independent vectors b1 , b2 that span the direction space.
- In Rn , the (n − 1)-dimensional affine subspaces are called hyperplanes, and the corresponding parametric equation is y = x_0 + \sum_{i=1}^{n-1} \lambda_i b_i , where b1 , . . . , bn−1 form a basis of an (n − 1)-dimensional subspace U of Rn . This means that a hyperplane is defined by a support point x0 and (n − 1) linearly independent vectors b1 , . . . , bn−1 that span the direction space. In R2 , a line is also a hyperplane. In R3 , a plane is also a hyperplane.
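The parametric equation of a line is straightforward to evaluate numerically; the support point and direction below are made-up illustrative values (our own sketch, not from the book):

```python
import numpy as np

# A point on the affine line L = x0 + lambda * b1.
x0 = np.array([1.0, 2.0])   # support point
b1 = np.array([2.0, 1.0])   # direction
lam = 0.5                   # parameter
y = x0 + lam * b1
print(y)  # [2.  2.5]
```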

[Figure 2.13: Lines are affine subspaces. Vectors y on a line x0 + λb1 lie in an affine subspace L with support point x0 and direction b1 .]

Remark (Inhomogeneous systems of linear equations and affine subspaces).


For A ∈ Rm×n and x ∈ Rm , the solution of the system of linear equa-
tions Aλ = x is either the empty set or an affine subspace of Rn of
dimension n − rk(A). In particular, the solution of the linear equation
λ1 b1 + . . . + λn bn = x, where (b1 , . . . , bn ) ≠ (0, . . . , 0), is a hyperplane
in Rn .
In Rn , every k -dimensional affine subspace is the solution of an inho-
mogeneous system of linear equations Ax = b, where A ∈ Rm×n , b ∈
Rm and rk(A) = n − k . Recall that for homogeneous equation systems
Ax = 0 the solution was a vector subspace, which we can also think of
as a special affine space with support point x0 = 0. ♦

2.8.2 Affine Mappings


Similar to linear mappings between vector spaces, which we discussed
in Section 2.7, we can define affine mappings between two affine spaces.
Linear and affine mappings are closely related. Therefore, many properties
that we already know from linear mappings, e.g., that the composition of
linear mappings is a linear mapping, also hold for affine mappings.

Definition 2.26 (Affine Mapping). For two vector spaces V, W , a linear mapping Φ : V → W , and a ∈ W , the mapping

φ : V → W (2.132)
x ↦ a + Φ(x) (2.133)

is an affine mapping from V to W . The vector a is called the translation vector of φ.

- Every affine mapping φ : V → W is also the composition of a linear mapping Φ : V → W and a translation τ : W → W in W , such that φ = τ ◦ Φ. The mappings Φ and τ are uniquely determined.
- The composition φ′ ◦ φ of affine mappings φ : V → W , φ′ : W → X is affine.
- Affine mappings keep the geometric structure invariant. They also preserve the dimension and parallelism.
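An affine mapping is thus a linear part followed by a translation; a minimal sketch with illustrative values (our own, not from the book):

```python
import numpy as np

# An affine mapping phi(x) = a + Phi(x): a linear part followed by a
# translation.
Phi = np.array([[2.0, 0.0],
                [0.0, 1.0]])  # linear mapping (as a matrix)
a = np.array([1.0, -1.0])     # translation vector
x = np.array([3.0, 4.0])
y = a + Phi @ x
print(y)  # [7. 3.]
```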

2.9 Further Reading


There are many resources for learning linear algebra, including the text-
books by Strang (2003), Golan (2007), Axler (2015), and Liesen and
Mehrmann (2015). There are also several online resources that we men-
tioned in the introduction to this chapter. We only covered Gaussian elim-
ination here, but there are many other approaches for solving systems of
linear equations, and we refer to numerical linear algebra textbooks by
Stoer and Bulirsch (2002), Golub and Van Loan (2012), and Horn and
Johnson (2013) for an in-depth discussion.
In this book, we distinguish between the topics of linear algebra (e.g.,
vectors, matrices, linear independence, basis) and topics related to the
geometry of a vector space. In Chapter 3, we will introduce the inner
product, which induces a norm. These concepts allow us to define angles,
lengths and distances, which we will use for orthogonal projections. Pro-
jections turn out to be key in many machine learning algorithms, such as
linear regression and principal component analysis, both of which we will
cover in Chapters 9 and 10, respectively.


Exercises
2.1 We consider (R\{−1}, ?), where

a ? b := ab + a + b, a, b ∈ R\{−1} (2.134)

a. Show that (R\{−1}, ?) is an Abelian group.


b. Solve

3 ? x ? x = 15

in the Abelian group (R\{−1}, ?), where ? is defined in (2.134).


2.2 Let n be in N\{0}. Let k, x be in Z. We define the congruence class k̄ of the
integer k as the set

k̄ = {x ∈ Z | x − k = 0 (mod n)}
  = {x ∈ Z | ∃a ∈ Z : (x − k = n · a)} .

We now define Z/nZ (sometimes written Zn ) as the set of all congruence


classes modulo n. Euclidean division implies that this set is a finite set con-
taining n elements:

Zn = {0, 1, . . . , n − 1}

For all a, b ∈ Zn , we define

a ⊕ b := a + b

a. Show that (Zn , ⊕) is a group. Is it Abelian?


b. We now define another operation ⊗ for all a and b in Zn as

a ⊗ b = a × b, (2.135)

where a × b represents the usual multiplication in Z.


Let n = 5. Draw the times table of the elements of Z5 \{0} under ⊗, i.e.,
calculate the products a ⊗ b for all a and b in Z5 \{0}.
Hence, show that Z5 \{0} is closed under ⊗ and possesses a neutral
element for ⊗. Display the inverse of all elements in Z5 \{0} under ⊗.
Conclude that (Z5 \{0}, ⊗) is an Abelian group.
c. Show that (Z8 \{0}, ⊗) is not a group.
d. We recall that the Bézout theorem states that two integers a and b are
relatively prime (i.e., gcd(a, b) = 1) if and only if there exist two integers
u and v such that au + bv = 1. Show that (Zn \{0}, ⊗) is a group if and
only if n ∈ N\{0} is prime.
2.3 Consider the set G of 3 × 3 matrices defined as follows:

         [1 x z]
    G = {[0 1 y] ∈ R3×3 | x, y, z ∈ R} .
         [0 0 1]

    We define · as the standard matrix multiplication.
    Is (G, ·) a group? If yes, is it Abelian? Justify your answer.
2.4 Compute the following matrix products, if possible:

Draft (2022-01-11) of “Mathematics for Machine Learning”. Feedback: [Link]



a.
    [1 2]   [1 1 0]
    [4 5] · [0 1 1]
    [7 8]   [1 0 1]

b.
    [1 2 3]   [1 1 0]
    [4 5 6] · [0 1 1]
    [7 8 9]   [1 0 1]

c.
    [1 1 0]   [1 2 3]
    [0 1 1] · [4 5 6]
    [1 0 1]   [7 8 9]

d.
                    [0  3]
    [1 2  1  2]     [1 −1]
    [4 1 −1 −4]  ·  [2  1]
                    [5  2]

e.
    [0  3]
    [1 −1]   [1 2  1  2]
    [2  1] · [4 1 −1 −4]
    [5  2]
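To check which of these products are defined, it helps to track the inner dimensions explicitly. A minimal helper (our own; not part of the text) that multiplies plain nested lists and rejects non-conforming shapes:

```python
# Sketch for checking which products are defined (helper name is ours).

def matmul(A, B):
    """Product of A (m x n) and B (p x q); raises ValueError unless n == p."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions do not match")
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Part (a): a 3x2 times a 3x3 matrix is undefined (inner dimensions 2 != 3).
try:
    matmul([[1, 2], [4, 5], [7, 8]],
           [[1, 1, 0], [0, 1, 1], [1, 0, 1]])
except ValueError as e:
    print("(a):", e)

# Part (b): 3x3 times 3x3 is defined.
C = matmul([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
           [[1, 1, 0], [0, 1, 1], [1, 0, 1]])
print("(b):", C)  # [[4, 3, 5], [10, 9, 11], [16, 15, 17]]
```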

2.5 Find the set S of all solutions in x of the following inhomogeneous linear
systems Ax = b, where A and b are defined as follows:
a.
        [1  1 −1 −1]        [ 1]
        [2  5 −7 −5]        [−2]
    A = [2 −1  1  3] ,  b = [ 4]
        [5  2 −4  2]        [ 6]

b.
        [ 1 −1 0  0  1]        [ 3]
        [ 1  1 0 −3  0]        [ 6]
    A = [ 2 −1 0  1 −1] ,  b = [ 5]
        [−1  2 0 −2 −1]        [−1]

2.6 Using Gaussian elimination, find all solutions of the inhomogeneous equa-
tion system Ax = b with
   
0 1 0 0 1 0 2
A = 0 0 0 1 1 0 , b = −1 .
0 1 0 0 0 1 1
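Hand elimination for systems like this can be double-checked mechanically. Below is a small sketch (our own helper, using exact `Fraction` arithmetic) that reduces an augmented matrix [A | b] to reduced row-echelon form; pivot columns identify the basic variables and the remaining columns the free ones.

```python
from fractions import Fraction

def rref(M):
    """Reduced row-echelon form of M (list of rows), exact arithmetic.
    Returns the reduced matrix and the list of pivot column indices."""
    M = [[Fraction(x) for x in row] for row in M]
    pivots, r = [], 0
    for c in range(len(M[0])):
        # Find a row at or below r with a nonzero entry in column c.
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        M[r] = [x / M[r][c] for x in M[r]]          # normalize pivot to 1
        for i in range(len(M)):
            if i != r and M[i][c] != 0:             # clear the column
                M[i] = [a - M[i][c] * b for a, b in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    return M, pivots

# Augmented matrix [A | b] for this exercise (last column is b).
aug = [[0, 1, 0, 0, 1, 0, 2],
       [0, 0, 0, 1, 1, 0, -1],
       [0, 1, 0, 0, 0, 1, 1]]
R, pivots = rref(aug)
print(pivots)  # → [1, 3, 4]
```

Since the last column is not a pivot column, the system is consistent; setting the free variables to zero in `R` gives one particular solution, and the free columns parametrize the rest.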



2.7 Find all solutions in x = [x1 , x2 , x3 ]⊤ ∈ R3 of the equation system
    Ax = 12x, where

        [6 4 3]
    A = [6 0 9]
        [0 8 0]

    and ∑_{i=1}^{3} xi = 1.
2.8 Determine the inverses of the following matrices if possible:
a.
        [2 3 4]
    A = [3 4 5]
        [4 5 6]

b.
        [1 0 1 0]
        [0 1 1 0]
    A = [1 1 0 1]
        [1 1 1 0]
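Whether an inverse exists can be tested in advance: a square matrix is invertible exactly when its determinant is nonzero (determinants are treated in Section 4.1). A sketch of an exact determinant via elimination (helper name is ours):

```python
from fractions import Fraction

def det(M):
    """Determinant via Gaussian elimination with exact Fractions."""
    M = [[Fraction(x) for x in row] for row in M]
    n, sign = len(M), 1
    for c in range(n):
        piv = next((i for i in range(c, n) if M[i][c] != 0), None)
        if piv is None:
            return Fraction(0)          # no pivot in this column => singular
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            sign = -sign                # a row swap flips the sign
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[c])]
    result = Fraction(sign)
    for i in range(n):
        result *= M[i][i]               # product of the diagonal entries
    return result

# Part (a): rows form an arithmetic progression (row1 + row3 = 2 * row2),
# so the matrix is singular and has no inverse.
print(det([[2, 3, 4], [3, 4, 5], [4, 5, 6]]))  # 0

# Part (b): the determinant is nonzero, so the inverse exists.
print(det([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 0]]))  # 1
```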

2.9 Which of the following sets are subspaces of R3 ?


a. A = {(λ, λ + µ3 , λ − µ3 ) | λ, µ ∈ R}
b. B = {(λ2 , −λ2 , 0) | λ ∈ R}
c. Let γ be in R.
C = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ1 − 2ξ2 + 3ξ3 = γ}
d. D = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ2 ∈ Z}
2.10 Are the following sets of vectors linearly independent?
a.
         [ 2]         [ 1]         [ 3]
    x1 = [−1] ,  x2 = [ 1] ,  x3 = [−3]
         [ 3]         [−2]         [ 8]

b.
         [1]         [1]         [1]
         [2]         [1]         [0]
    x1 = [1] ,  x2 = [0] ,  x3 = [0]
         [0]         [1]         [1]
         [0]         [1]         [1]
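A set of vectors is linearly independent exactly when the matrix having them as columns has rank equal to the number of vectors. The sketch below (our own helpers, exact arithmetic) counts pivots during elimination to decide this:

```python
from fractions import Fraction

def rank(M):
    """Rank of M, counted as the number of pivots found by elimination."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def independent(vectors):
    """True iff the given vectors (as lists) are linearly independent."""
    cols = [list(col) for col in zip(*vectors)]  # vectors as matrix columns
    return rank(cols) == len(vectors)

# Part (a): the three vectors turn out to be dependent (x3 = 2*x1 - x2).
print(independent([[2, -1, 3], [1, 1, -2], [3, -3, 8]]))   # False
# Part (b):
print(independent([[1, 2, 1, 0, 0], [1, 1, 0, 1, 1], [1, 0, 0, 1, 1]]))  # True
```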
2.11 Write

         [ 1]
     y = [−2]
         [ 5]

     as a linear combination of

          [1]        [1]        [ 2]
     x1 = [1] , x2 = [2] , x3 = [−1] .
          [1]        [3]        [ 1]
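Finding such coefficients means solving the 3 × 3 linear system whose coefficient matrix has x1, x2, x3 as columns and whose right-hand side is y. For a system this small, Cramer's rule is convenient; the sketch below (helper names ours, exact arithmetic) returns the coefficients λ1, λ2, λ3:

```python
from fractions import Fraction

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along row 1."""
    a, b, c = M[0]
    d, e, f = M[1]
    g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def cramer3(A, y):
    """Solve A lam = y for a 3x3 A with det(A) != 0 via Cramer's rule."""
    D = det3(A)
    sols = []
    for j in range(3):
        Aj = [row[:] for row in A]
        for i in range(3):
            Aj[i][j] = y[i]             # replace column j by y
        sols.append(Fraction(det3(Aj), D))
    return sols

# Columns of A are x1, x2, x3; the right-hand side is y.
A = [[1, 1, 2],
     [1, 2, -1],
     [1, 3, 1]]
y = [1, -2, 5]
print(cramer3(A, y))  # the coefficients lambda_1, lambda_2, lambda_3
```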




2.12 Consider two subspaces of R4 :

               [ 1]  [ 2]  [−1]                [−1]  [ 2]  [−3]
               [ 1]  [−1]  [ 1]                [−2]  [−2]  [ 6]
     U1 = span[[−3], [ 0], [−1]] ,  U2 = span[[ 2], [ 0], [−2]] .
               [ 1]  [−1]  [ 1]                [ 1]  [ 0]  [−1]

     Determine a basis of U1 ∩ U2 .
2.13 Consider two subspaces U1 and U2 , where U1 is the solution space of the
homogeneous equation system A1 x = 0 and U2 is the solution space of the
homogeneous equation system A2 x = 0 with

          [1  0  1]        [3 −3 0]
          [1 −2 −1]        [1  2 3]
     A1 = [2  1  3] , A2 = [7 −5 2] .
          [1  0  1]        [3 −1 2]

a. Determine the dimension of U1 , U2 .


b. Determine bases of U1 and U2 .
c. Determine a basis of U1 ∩ U2 .
2.14 Consider two subspaces U1 and U2 , where U1 is spanned by the columns of
A1 and U2 is spanned by the columns of A2 with

          [1  0  1]        [3 −3 0]
          [1 −2 −1]        [1  2 3]
     A1 = [2  1  3] , A2 = [7 −5 2] .
          [1  0  1]        [3 −1 2]

a. Determine the dimension of U1 , U2


b. Determine bases of U1 and U2
c. Determine a basis of U1 ∩ U2
2.15 Let F = {(x, y, z) ∈ R3 | x+y−z = 0} and G = {(a−b, a+b, a−3b) | a, b ∈ R}.
a. Show that F and G are subspaces of R3 .
b. Calculate F ∩ G without resorting to any basis vector.
c. Find one basis for F and one for G, calculate F ∩G using the basis vectors
previously found and check your result with the previous question.
2.16 Are the following mappings linear?
a. Let a, b ∈ R.

       Φ : L1([a, b]) → R
       f ↦ Φ(f) = ∫_a^b f(x) dx ,

   where L1([a, b]) denotes the set of integrable functions on [a, b].
b.
       Φ : C1 → C0
       f ↦ Φ(f) = f′ ,

   where for k ≥ 1, Ck denotes the set of k times continuously differentiable
   functions, and C0 denotes the set of continuous functions.




c.
       Φ : R → R
       x ↦ Φ(x) = cos(x)

d.
       Φ : R3 → R2

           [1 2 3]
       x ↦ [1 4 3] x

e. Let θ be in [0, 2π[ and

       Φ : R2 → R2

           [ cos(θ)  sin(θ)]
       x ↦ [−sin(θ)  cos(θ)] x

2.17 Consider the linear mapping

     Φ : R3 → R4

         [x1]     [3x1 + 2x2 + x3]
       Φ [x2]  =  [ x1 + x2 + x3 ]
         [x3]     [   x1 − 3x2   ]
                  [2x1 + 3x2 + x3]
Find the transformation matrix AΦ .


Determine rk(AΦ ).
Compute the kernel and image of Φ. What are dim(ker(Φ)) and dim(Im(Φ))?
2.18 Let E be a vector space. Let f and g be two automorphisms on E such that
f ◦ g = idE (i.e., f ◦ g is the identity mapping idE ). Show that ker(f ) =
ker(g ◦ f ), Im(g) = Im(g ◦ f ) and that ker(f ) ∩ Im(g) = {0E }.
2.19 Consider an endomorphism Φ : R3 → R3 whose transformation matrix
(with respect to the standard basis in R3 ) is

          [1  1 0]
     AΦ = [1 −1 0] .
          [1  1 1]

a. Determine ker(Φ) and Im(Φ).


b. Determine the transformation matrix ÃΦ with respect to the basis

           [1]  [1]  [1]
      B = ([1], [2], [0]) ,
           [1]  [1]  [0]

i.e., perform a basis change toward the new basis B .


2.20 Let us consider b1 , b2 , b′1 , b′2 , four vectors of R2 expressed in the
     standard basis of R2 as

          [2]        [−1]         [ 2]         [1]
     b1 = [1] , b2 = [−1] , b′1 = [−2] , b′2 = [1]

     and let us define two ordered bases B = (b1 , b2 ) and B′ = (b′1 , b′2 ) of R2 .

     a. Show that B and B′ are two bases of R2 and draw those basis vectors.
     b. Compute the matrix P1 that performs a basis change from B′ to B.
     c. We consider c1 , c2 , c3 , three vectors of R3 defined in the standard
        basis of R3 as

             [ 1]        [ 0]        [ 1]
        c1 = [ 2] , c2 = [−1] , c3 = [ 0]
             [−1]        [ 2]        [−1]

        and we define C = (c1 , c2 , c3 ).

        (i) Show that C is a basis of R3 , e.g., by using determinants (see
            Section 4.1).
        (ii) Let us call C′ = (c′1 , c′2 , c′3 ) the standard basis of R3 . Determine
            the matrix P2 that performs the basis change from C to C′ .
     d. We consider a homomorphism Φ : R2 → R3 , such that

        Φ(b1 + b2 ) = c2 + c3
        Φ(b1 − b2 ) = 2c1 − c2 + 3c3 ,

        where B = (b1 , b2 ) and C = (c1 , c2 , c3 ) are ordered bases of R2 and
        R3 , respectively.
        Determine the transformation matrix AΦ of Φ with respect to the
        ordered bases B and C .
     e. Determine A′ , the transformation matrix of Φ with respect to the
        bases B′ and C′ .
     f. Let us consider the vector x ∈ R2 whose coordinates in B′ are [2, 3]⊤ .
        In other words, x = 2b′1 + 3b′2 .
        (i) Calculate the coordinates of x in B .
        (ii) Based on that, compute the coordinates of Φ(x) expressed in C .
        (iii) Then, write Φ(x) in terms of c′1 , c′2 , c′3 .
        (iv) Use the representation of x in B′ and the matrix A′ to find this
            result directly.

