Matrix Analysis and Algorithms Overview

Contents

1 Introduction
2 Gaussian Elimination
3 LU factorisation
9 Error Analysis
10 Conditioning
13 Computational Cost
17 Conditioning of LSQ
18 QR factorisation
25 More on CG
29 Power Iteration
Lecture 1
Introduction
The subject of this course is a set of problems that are central to numerical linear algebra and occur
in many applications.
In this case,

A = \begin{pmatrix} 1 & \xi_1 \\ \vdots & \vdots \\ 1 & \xi_m \end{pmatrix}, \qquad b = \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix}.
• s with probability p,
• r with probability 1 − p.
Similarly, there is a probability q ∈ (0, 1) for no change of state r and 1 − q for a change from r
to s.
[State diagram: states s and r; s → s with probability p, s → r with probability 1 − p, r → r with probability q, r → s with probability 1 − q.]
Let us denote the actual state by a vector µ ∈ R2 such that (1, 0) corresponds to s (i.e., we
associate state s with index 1) and (0, 1) corresponds to r (associate r with index 2). Let us
furthermore introduce the probability matrix P = (p_{i,j})_{i,j=1}^{2} where p_{i,j} denotes the probability
that the system being in state i changes to state j, i.e.,
P = \begin{pmatrix} p & 1-p \\ 1-q & q \end{pmatrix}.
Multiplying the actual state with P from the right then yields a vector of probabilities for the
next day’s state, for instance if we start with a sunny day, µ(0) := (1, 0), then
\mu^{(0)} P = (1, 0) \begin{pmatrix} p & 1-p \\ 1-q & q \end{pmatrix} = (p, 1-p) =: \mu^{(1)}
indeed contains the correct probabilities for s and r. The nice thing about this notation is
that we may simply go on multiplying with P from the right to obtain the probabilities for the
subsequent days, for instance
Experts will recognise this iteration as a discrete Markov chain. In applications, one is often
interested in stationary distributions π satisfying
\pi = \pi P, \qquad \pi_i \ge 0 \ \forall i, \qquad \sum_i \pi_i = 1
which means that π is a left eigenvector of P for the eigenvalue 1. In fact, 1 always is an eigenvalue
of such probability matrices, so the goal is just to compute a corresponding eigenvector. A further
question of interest is when Markov chains converge to such stationary states.
We will see iterations of this type, like the Markov chain (1.1), again in the slightly more general
context of the so-called power iteration. Apart from convergence, another question immediately
arising is how long one has to iterate in order to obtain an acceptable approximation to
a stationary state (as this is an iterative method, one will not expect to get an exact solution).
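As an illustration of such an iteration, here is a sketch in plain Python; the transition probabilities p = 0.8 and q = 0.5 are invented for the example and are not from the notes.

```python
# Iterate mu^(k+1) = mu^(k) P for the two-state weather chain.
# p, q are illustrative values only.
p, q = 0.8, 0.5
P = [[p, 1 - p],
     [1 - q, q]]

def step(mu, P):
    """One Markov step: row vector times matrix, (mu P)_j = sum_i mu_i p_{i,j}."""
    n = len(mu)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

mu = [1.0, 0.0]          # start with a sunny day, mu^(0) = (1, 0)
for _ in range(100):      # repeated right-multiplication with P
    mu = step(mu, P)

# For this 2x2 chain the stationary distribution pi = pi P with
# components summing to 1 can be written in closed form:
pi = [(1 - q) / (2 - p - q), (1 - p) / (2 - p - q)]
```

After a moderate number of iterations mu agrees with pi to machine accuracy, since the second eigenvalue of P is p + q − 1 = 0.3 and its powers decay quickly.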
– conditioning of problems,
– stability of algorithms,
Lecture 2
Gaussian Elimination
2x1 + x2 + x3 = 4
; 4x1 + 3x2 + 3x3 = 10
8x1 + 7x2 + 9x3 = 24
multiply the first equation with −2 = −a2,1 /a1,1 and add to the second equation,
multiply the first equation with −4 = −a3,1 /a1,1 and add to the third equation,
2x1 + x2 + x3 = 4
; 0 + x2 + x3 = 2
0 + 3x2 + 5x3 = 8
multiply the second equation with −3 and add to the third equation,
2x1 + x2 + x3 = 4
; x2 + x3 = 2
0 + 2x3 = 2
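The elimination steps above can be reproduced in code; this is a sketch in plain Python, without pivoting, assuming nonzero pivots as in the example.

```python
# Forward elimination on the 3x3 example, followed by back substitution.
A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
b = [4.0, 10.0, 24.0]
n = 3

for k in range(n - 1):                 # eliminate below pivot a[k][k]
    for j in range(k + 1, n):
        m = A[j][k] / A[k][k]          # e.g. m = a21/a11 = 2 in the first step
        for i in range(k, n):
            A[j][i] -= m * A[k][i]
        b[j] -= m * b[k]

# A is now upper triangular; back substitution recovers x
x = [0.0] * n
for k in range(n - 1, -1, -1):
    s = sum(A[k][i] * x[i] for i in range(k + 1, n))
    x[k] = (b[k] - s) / A[k][k]
```

The final row reads 2x_3 = 2, exactly as in the worked example, and x = (1, 1, 1) solves the system.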
1. ai,j = 0 if 1 ≤ i < j ≤ n,
Lemma 2.1. [5.1] (a) Let L ∈ C^{n×n} be unit lower triangular with non-zero entries below the
diagonal only in column k. Then L^{-1} is unit lower triangular with non-zero entries below the
diagonal only in column k, with entries

(L^{-1})_{i,k} = -L_{i,k}, \qquad i = k+1, \dots, n.
(b) Let L, M ∈ C^{n×n} be unit lower triangular, k ∈ {1, . . . , n}, and assume that

L has non-zero entries below the diagonal only in columns 1, . . . , k,
M has non-zero entries below the diagonal only in columns k + 1, . . . , n.
Then LM is unit lower triangular with
(LM)_{i,j} = \begin{cases} L_{i,j} & \text{if } j \le k, \\ M_{i,j} & \text{else.} \end{cases}
Proof. Exercise.
Consequence:
(\tilde L_1)^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 0 & 1 \end{pmatrix} =: L_1, \qquad (\tilde L_2)^{-1} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 3 & 1 \end{pmatrix} =: L_2,
Lecture 3
LU factorisation
Proof. (a) The proof is based on induction; similar proofs will follow → Exercise.
(b) Assume that A = LU exists but A_j is singular. In block form:

A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} = \begin{pmatrix} L_{11} U_{11} & ? \\ ? & ? \end{pmatrix}

with A_{11} = A_j, and since L_{11} is unit lower triangular and U_{11} is regular and upper triangular,
A_j = L_{11} U_{11} is the LU factorisation of A_j. But then
Algorithm 2 LU
input: A = (a_{ij})_{i,j=1}^{n} ∈ C^{n×n} with det(A_k) ≠ 0, k = 1, . . . , n.
output: L ∈ Cn×n unit lower triangular, U ∈ Cn×n upper triangular and regular with A = LU .
1: U = A, L = I.
2: for k = 1 to n − 1 do
3: for j = k + 1 to n do
4: lj,k := uj,k /uk,k
5: uj,k := 0
6: for i = k + 1 to n do
7: uj,i := uj,i − lj,k uk,i
8: end for
9: end for
10: end for
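Algorithm 2 translates almost line by line into code; the sketch below assumes, as the algorithm's input condition requires, that all leading principal minors are nonzero.

```python
def lu(A):
    """LU factorisation without pivoting (Algorithm 2): returns L, U with A = LU."""
    n = len(A)
    U = [row[:] for row in A]                   # U starts as a copy of A
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        for j in range(k + 1, n):
            L[j][k] = U[j][k] / U[k][k]         # l_{j,k} := u_{j,k} / u_{k,k}
            U[j][k] = 0.0
            for i in range(k + 1, n):
                U[j][i] -= L[j][k] * U[k][i]    # u_{j,i} := u_{j,i} - l_{j,k} u_{k,i}
    return L, U

# The matrix from the Gaussian elimination example; L collects the
# multipliers 2, 4 and 3 that appeared in the worked steps.
L, U = lu([[2.0, 1.0, 1.0],
           [4.0, 3.0, 3.0],
           [8.0, 7.0, 9.0]])
```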
Definition 3.2. P ∈ C^{n×n} is a permutation matrix if every row and every column contains
n − 1 zeros and exactly one 1.
We may write P = (eσ(1) , eσ(2) , . . . , eσ(n) ) where ej is the j th vector of the standard basis of Cn .
With π = σ −1 we also may write
P = \begin{pmatrix} e_{\pi(1)}^T \\ \vdots \\ e_{\pi(n)}^T \end{pmatrix}.
hence, a multiplication with a permutation matrix from the left exchanges the rows according
to the associated permutation. Similarly, a multiplication with a permutation matrix from the
right exchanges the columns.
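As a quick sanity check of this rule, here is a sketch; the permutation σ = (2, 3, 1) and the matrix A are arbitrary examples.

```python
# P = (e_{sigma(1)}, ..., e_{sigma(n)}) as columns; left multiplication by P
# permutes the rows of A according to pi = sigma^{-1}.
sigma = [2, 3, 1]                      # sigma(1)=2, sigma(2)=3, sigma(3)=1
n = len(sigma)
P = [[1.0 if sigma[j] == i + 1 else 0.0 for j in range(n)] for i in range(n)]

A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
PA = [[sum(P[i][k] * A[k][j] for k in range(n)) for j in range(2)]
      for i in range(n)]
# Row i of PA is row pi(i) of A, where pi is the inverse of sigma;
# here pi = (3, 1, 2), so PA = (row 3, row 1, row 2) of A.
```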
Theorem 3.3. Let A ∈ Cn×n be regular. Then there are a permutation matrix P ∈ Cn×n ,
L ∈ Cn×n unit lower triangular, and U ∈ Cn×n upper triangular with P A = LU .
Proof. (Recording available here) By induction on n. The case n = 1 is trivial as one just has
to choose P = L = 1 and U = A.
Let n > 1 and assume that the assertion is true for n − 1. Choose a permutation matrix P1 such
that a := (P1 A)1,1 6= 0. Such a matrix exists because A is regular, whence the first column of A
will contain a nonzero entry. We write
P_1 A = \begin{pmatrix} a & u^* \\ l & B \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \frac{1}{a} l & I \end{pmatrix} \begin{pmatrix} a & u^* \\ 0 & \tilde A \end{pmatrix}
By the induction hypothesis there are P̃ (permutation), L̃ (unit lower triangular) and Ũ (regular
upper triangular) with P̃ Ã = L̃Ũ . Therefore
P_1 A = \begin{pmatrix} 1 & 0 \\ \frac{1}{a} l & I \end{pmatrix}
        \begin{pmatrix} 1 & 0 \\ 0 & \tilde P^{-1} \tilde L \end{pmatrix}
        \underbrace{\begin{pmatrix} a & u^* \\ 0 & \tilde U \end{pmatrix}}_{=:U}
      = \begin{pmatrix} 1 & 0 \\ \frac{1}{a} l & \tilde P^{-1} \tilde L \end{pmatrix} U
      = \begin{pmatrix} 1 & 0 \\ 0 & \tilde P^{-1} \end{pmatrix}
        \underbrace{\begin{pmatrix} 1 & 0 \\ \frac{1}{a} \tilde P l & \tilde L \end{pmatrix}}_{=:L} U

\Rightarrow \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & \tilde P^{-1} \end{pmatrix}^{-1} P_1}_{=:P} A = L U
where P is a permutation matrix and L and U are of the desired structure, too.
Lecture 4
Gaussian Elimination with Pivoting
Before step k:
U^{(k-1)} = \begin{pmatrix}
* & \cdots & \cdots & * & * & \cdots & * \\
0 & \ddots & & \vdots & \vdots & & \vdots \\
\vdots & \ddots & \ddots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & u_{k,k} & * & \cdots & * \\
0 & \cdots & 0 & * & * & \cdots & * \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & * & * & \cdots & *
\end{pmatrix}
and we have a problem if the pivot uk,k is zero.
In fact, even a small u_{k,k} is undesirable, as it leads to stability problems → we will see this later on.
There are basically two strategies to overcome this problem.
1. GEPP, Gaussian Elimination with Partial Pivoting: Swap rows to maximise |uk,k | among
the entries ul,k , l = k, . . . , n.
2. GECP, Gaussian Elimination with Complete Pivoting: Swap rows and columns to max-
imise |uk,k | among the entries ul,m , l, m = k, . . . , n.
In the following we will only consider GEPP since, in practice, partial pivoting usually is
sufficient and the gain in stability due to complete pivoting is negligible.
With an appropriate permutation matrix P_k that realises the swap of the rows we then compute
U^{(k)} = \tilde L_k P_k U^{(k-1)}, so that in the end
The permutation associated with P_l exchanges l with some number i_l > l and leaves the other
entries unchanged,
Observation: the matrix L'_k := P_{n-1} \cdots P_{k+1} \tilde L_k (P_{n-1} \cdots P_{k+1})^{-1}
has the same structure as \tilde L_k; just the entries in column k below the diagonal are permuted.
Since

L'_k P_{n-1} \cdots P_{k+1} = P_{n-1} \cdots P_{k+1} \tilde L_k
we obtain from (4.1) that
Algorithm 3 LUPP
input: A = (a_{ij})_{i,j=1}^{n} ∈ C^{n×n} regular.
output: L ∈ Cn×n unit lower triangular, U ∈ Cn×n regular upper triangular, P ∈ Cn×n
permutation matrix with P A = LU .
1: U = A, L = I, P = I.
2: for k = 1 to n − 1 do
3: choose i ∈ {k, . . . , n} such that |ui,k | is maximal
4: exchange (uk,k , . . . , uk,n ) with (ui,k , . . . , ui,n )
5: exchange (lk,1 , . . . , lk,k−1 ) with (li,1 , . . . , li,k−1 )
6: exchange (pk,1 , . . . , pk,n ) with (pi,1 , . . . , pi,n )
7: for j = k + 1 to n do
8: lj,k := uj,k /uk,k
9: uj,k := 0
10: for i = k + 1 to n do
11: uj,i := uj,i − lj,k uk,i
12: end for
13: end for
14: end for
A more elegant variant of the algorithm stores the permutation in a vector of length n, initialised
with the numbers (1, . . . , n), that finally contains the permutation π associated with P , i.e., the
ith entry of that vector contains π(i). See the example below.
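A sketch of LUPP with the permutation stored as a vector, as just described (plain Python, illustrative only):

```python
def lupp(A):
    """LU factorisation with partial pivoting (Algorithm 3); the permutation
    is stored as a vector pi so that row i of PA is row pi(i) of A."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    pi = list(range(1, n + 1))                 # starts as (1, ..., n)
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(U[i][k]))    # pivot search
        if p != k:
            U[k], U[p] = U[p], U[k]                          # swap rows of U
            L[k][:k], L[p][:k] = L[p][:k], L[k][:k]          # swap computed part of L
            pi[k], pi[p] = pi[p], pi[k]                      # record in pi
        for j in range(k + 1, n):
            L[j][k] = U[j][k] / U[k][k]
            U[j][k] = 0.0
            for i in range(k + 1, n):
                U[j][i] -= L[j][k] * U[k][i]
    return L, U, pi

# The matrix of the worked example below; the run reproduces pi = (1, 3, 4, 2).
L, U, pi = lupp([[-2.0, 2.0, 0.0, 0.0],
                 [2.0, -4.0, 1.0, 1.0],
                 [0.0, 4.0, -2.0, 0.0],
                 [1.0, 1.0, 0.0, 1.0]])
```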
Example: Consider
A = \begin{pmatrix} -2 & 2 & 0 & 0 \\ 2 & -4 & 1 & 1 \\ 0 & 4 & -2 & 0 \\ 1 & 1 & 0 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 0 \\ 0 \\ 2 \\ 3 \end{pmatrix}
and execute GEPP. In what follows it will be useful to keep track of the signs of the entries and
of the lower/upper triangular structure of the matrices involved.
Step 1, computation of the LU factorisation with LUPP.
The last vector will contain the permutation π:
L^{(0)} = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \end{pmatrix}, \quad U^{(0)} = \begin{pmatrix} -2&2&0&0 \\ 2&-4&1&1 \\ 0&4&-2&0 \\ 1&1&0&1 \end{pmatrix}, \quad \pi^{(0)} = \begin{pmatrix} 1\\2\\3\\4 \end{pmatrix}

Apparently, 2 = |U^{(0)}_{1,1}| \ge \max_{j \ge 1} |U^{(0)}_{j,1}|, hence we need no permutation. One elimination step
leads to
L^{(1)} = \begin{pmatrix} 1&0&0&0 \\ -1&1&0&0 \\ 0&0&1&0 \\ -\frac12&0&0&1 \end{pmatrix}, \quad U^{(1)} = \begin{pmatrix} -2&2&0&0 \\ 0&-2&1&1 \\ 0&4&-2&0 \\ 0&2&0&1 \end{pmatrix}, \quad \pi^{(1)} = \begin{pmatrix} 1\\2\\3\\4 \end{pmatrix}
Now, 4 = |U^{(1)}_{3,2}| \ge \max_{j \ge 2} |U^{(1)}_{j,2}|, hence we permute rows 2 and 3:
(L^{(1)})' = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ -1&0&1&0 \\ -\frac12&0&0&1 \end{pmatrix}, \quad (U^{(1)})' = \begin{pmatrix} -2&2&0&0 \\ 0&4&-2&0 \\ 0&-2&1&1 \\ 0&2&0&1 \end{pmatrix}, \quad \pi^{(2)} = \begin{pmatrix} 1\\3\\2\\4 \end{pmatrix}
The next elimination step yields
L^{(2)} = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ -1&-\frac12&1&0 \\ -\frac12&\frac12&0&1 \end{pmatrix}, \quad U^{(2)} = \begin{pmatrix} -2&2&0&0 \\ 0&4&-2&0 \\ 0&0&0&1 \\ 0&0&1&1 \end{pmatrix}, \quad \pi^{(2)} = \begin{pmatrix} 1\\3\\2\\4 \end{pmatrix}
We have that 1 = |U^{(2)}_{4,3}| \ge \max_{j \ge 3} |U^{(2)}_{j,3}|, hence we permute rows 3 and 4:
(L^{(2)})' = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ -\frac12&\frac12&1&0 \\ -1&-\frac12&0&1 \end{pmatrix}, \quad (U^{(2)})' = \begin{pmatrix} -2&2&0&0 \\ 0&4&-2&0 \\ 0&0&1&1 \\ 0&0&0&1 \end{pmatrix}, \quad \pi^{(3)} = \begin{pmatrix} 1\\3\\4\\2 \end{pmatrix}
Step 2, solving Ly = P b. The right-hand side is obtained entrywise via

(P b)_i = e_{\pi(i)} \cdot b = b_{\pi(i)},
which indicates how the action of the permutation matrix P on a vector can be computed given
the associated permutation π. In our example, the solution is y = (0, 2, 2, 1)T .
Step 3, solving U x = y.
One can check that x = (1, 1, 1, 1)T which indeed is the solution to Ax = b.
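The two substitution steps can be checked in code; this sketch hard-codes the factors L, U and the permutation vector π = (1, 3, 4, 2) computed in the example above.

```python
# Solve Ax = b via PA = LU: first Ly = Pb, then Ux = y.
L = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [-0.5, 0.5, 1.0, 0.0],
     [-1.0, -0.5, 0.0, 1.0]]
U = [[-2.0, 2.0, 0.0, 0.0],
     [0.0, 4.0, -2.0, 0.0],
     [0.0, 0.0, 1.0, 1.0],
     [0.0, 0.0, 0.0, 1.0]]
pi = [1, 3, 4, 2]
b = [0.0, 0.0, 2.0, 3.0]
n = 4

Pb = [b[pi[i] - 1] for i in range(n)]         # (Pb)_i = b_{pi(i)}

y = [0.0] * n                                  # forward substitution for Ly = Pb
for k in range(n):
    y[k] = Pb[k] - sum(L[k][i] * y[i] for i in range(k))

x = [0.0] * n                                  # back substitution for Ux = y
for k in range(n - 1, -1, -1):
    x[k] = (y[k] - sum(U[k][i] * x[i] for i in range(k + 1, n))) / U[k][k]
```

This reproduces y = (0, 2, 2, 1)^T and x = (1, 1, 1, 1)^T as stated in the notes.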
Lecture 5
Matrix Norms, Part I
is a norm on Cn .
Then
x^* y = \begin{pmatrix} \bar x_1 & \cdots & \bar x_n \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \sum_{i=1}^{n} \bar x_i y_i = \langle x, y \rangle.
We also write
xy^* = (x_i \bar y_j)_{i,j=1}^{n,n} = x \otimes y \in C^{n \times n}.
moreover,
\langle Ax, y \rangle = (Ax)^* y = x^* A^* y = \langle x, A^* y \rangle
for all A ∈ Cm×n , x ∈ Cn and y ∈ Cm .
Some further definitions:
2. A ∈ Cn×n is Hermitian if A∗ = A,
A ∈ Rn×n is symmetric if AT = A.
Q being unitary means that the columns are orthonormal. In the case m = n we have that
Q^{-1} = Q^* and also QQ^* = I.
Proof. Exercise.
Another way of assigning a norm to a matrix is the operator norm: Given norms k · km̂ on Cm
and k · kn̂ on Cn define
\|A\|_{(\hat m, \hat n)} := \max_{x \in C^n \setminus \{0\}} \frac{\|Ax\|_{\hat m}}{\|x\|_{\hat n}} = \max_{\|x\|_{\hat n} = 1} \|Ax\|_{\hat m}.
Definition 5.5. [1.25] A matrix norm on Cn×n is a mapping k·k : Cn×n → R with the properties
The last condition is what distinguishes a matrix norm from a vector norm: it means that the
norm is compatible with the matrix-matrix product.
Definition 5.6. [1.26] Given a vector norm k · kv on Cn , we define the induced (operator) norm
k · km on Cn×n by
\|A\|_m := \max_{x \in C^n \setminus \{0\}} \frac{\|Ax\|_v}{\|x\|_v} = \max_{\|x\|_v = 1} \|Ax\|_v.
Theorem 5.7 (1.27). The induced norm \|\cdot\|_m of a vector norm \|\cdot\|_v is a matrix norm with
\|I_n\|_m = 1 and

\|Ax\|_v \le \|A\|_m \|x\|_v \quad \forall A \in C^{n \times n},\ x \in C^n.
\|A\|_m = 0 \ \Leftrightarrow\ \frac{\|Ax\|_v}{\|x\|_v} = 0 \ \forall x \in C^n \setminus \{0\} \ \Leftrightarrow\ \|Ax\|_v = 0 \ \forall x \in C^n \ \Leftrightarrow\ A = 0
which shows the first point. For the second point we observe that
and similarly the third property can be deduced from the corresponding property of the vector
norm:
\|A\|_m = \max_{x \ne 0} \frac{\|Ax\|_v}{\|x\|_v} \ge \frac{\|Ay\|_v}{\|y\|_v} \ \Rightarrow\ \|Ay\|_v \le \|A\|_m \|y\|_v.
Using this we can show the submultiplicativity, the fourth property of matrix norms:
\|AB\|_m = \max_{\|x\|_v = 1} \|ABx\|_v \le \max_{\|x\|_v = 1} \|A\|_m \|Bx\|_v = \|A\|_m \max_{\|x\|_v = 1} \|Bx\|_v = \|A\|_m \|B\|_m
Some remarks:
• We will use the same symbol / subscript for a vector norm and its induced norm, e.g. \|x\|_2
and \|A\|_2.
Theorem 5.8. [1.2] The matrix norm induced by the infinity norm is the maximum row sum,
\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{i,j}|, \qquad A \in C^{n \times n}.
To show the other inequality let k ∈ {1, . . . , n} be the row index with maximal sum,
\max_i \sum_j |a_{i,j}| = \sum_j |a_{k,j}|.
Define x ∈ C^n by

x_j := \begin{cases} \overline{a_{k,j}} / |a_{k,j}| & \text{if } a_{k,j} \ne 0, \\ 0 & \text{else.} \end{cases}
Then \|x\|_\infty = 1 and

\|A\|_\infty \ge \|Ax\|_\infty = \max_i \Big| \sum_j a_{i,j} \frac{\overline{a_{k,j}}}{|a_{k,j}|} \Big| \ge \Big| \sum_j a_{k,j} \frac{\overline{a_{k,j}}}{|a_{k,j}|} \Big| = \sum_j |a_{k,j}| = \max_i \sum_j |a_{i,j}|.
Theorem 5.9. [1.29] The matrix norm induced by the 1-norm is the maximum column sum,
\|A\|_1 = \|A^*\|_\infty = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{i,j}|, \qquad A \in C^{n \times n}.
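Both the row-sum and the column-sum formulas are easy to check numerically; a small sketch, with an arbitrary real example matrix:

```python
# Maximum row sum (induced infinity-norm) and maximum column sum (induced 1-norm).
A = [[1.0, -2.0, 3.0],
     [0.0, 4.0, -1.0],
     [2.0, 2.0, 2.0]]
n = 3

norm_inf = max(sum(abs(A[i][j]) for j in range(n)) for i in range(n))
norm_one = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))

# The maximiser from the proof of Theorem 5.8: in the real case, x_j is the
# sign of a_{k,j} for the row k with maximal sum, so ||Ax||_inf equals that sum.
k = max(range(n), key=lambda i: sum(abs(A[i][j]) for j in range(n)))
x = [1.0 if A[k][j] > 0 else (-1.0 if A[k][j] < 0 else 0.0) for j in range(n)]
Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
```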
Lecture 6
Eigenvalues and Eigenvectors
Example: The matrix

A = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix}

has \rho_A(z) = (2 - z)(1 - z)^3. Eigenvalue λ = 1: algebraic multiplicity q = 3, geometric
multiplicity r = 2, since the kernel of
A - \lambda I_n = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}
is two-dimensional.
Theorem 6.1. r ≤ q.
Definition 6.2. The spectral radius of A ∈ Cn×n is
\rho(A) = \max\{ |\lambda| : \lambda \text{ eigenvalue of } A \}.
We will see below that normal matrices indeed can be characterised by this relation! To see this,
we need
Theorem 6.5. [2.2] Given A ∈ Cn×n , there is a unitary Q ∈ Cn×n and an upper triangular
T ∈ Cn×n such that A = QT Q∗ .
AU = U \begin{pmatrix} \lambda & r^* \\ 0 & \tilde A \end{pmatrix} \quad \text{with some } r \in C^{n-1},\ \tilde A \in C^{(n-1) \times (n-1)}.
A = U \begin{pmatrix} \lambda & r^* \\ 0 & V \tilde T V^* \end{pmatrix} U^*
  = \underbrace{U \begin{pmatrix} 1 & 0 \\ 0 & V \end{pmatrix}}_{=:Q}
    \underbrace{\begin{pmatrix} \lambda & r^* V \\ 0 & \tilde T \end{pmatrix}}_{=:T}
    \begin{pmatrix} 1 & 0 \\ 0 & V^* \end{pmatrix} U^*.
Observing that Q is unitary since U and V are, this factorisation is of the desired structure.
Theorem 6.6. [2.3] If A ∈ Cn×n satisfies A∗ A = AA∗ then there is a unitary Q ∈ Cn×n and a
diagonal D ∈ Cn×n such that A = QDQ∗ .
Proof. We have from Th. [2.2] that A = QT Q∗ with T upper triangular. We will show that T
is diagonal.
Since QT ∗ T Q∗ = QT ∗ Q∗ QT Q∗ = A∗ A = AA∗ = QT Q∗ QT ∗ Q∗ = QT T ∗ Q∗ we obtain T ∗ T =
T T ∗ . Therefore
(T^* T)_{i,i} = \sum_{k=1}^{n} (T^*)_{i,k} T_{k,i} = \sum_{k=1}^{n} \overline{T_{k,i}}\, T_{k,i} = \sum_{k=1}^{i} |T_{k,i}|^2 \qquad (\star_1)

and, analogously,

(T T^*)_{i,i} = \sum_{k=i}^{n} |T_{i,k}|^2. \qquad (\star_2)

For i = 1, equating (\star_1) and (\star_2) gives |T_{1,1}|^2 = \sum_{k=1}^{n} |T_{1,k}|^2, hence

T_{1,k} = 0 \quad \text{for } k = 2, \dots, n.
Let now i > 1 and assume that T_{k,j} = 0 for 1 ≤ k ≤ i − 1 and all j > k. We need to show
that T_{i,k} = 0 for k = i + 1, . . . , n. Since, in particular, T_{k,i} = 0 for k = 1, . . . , i − 1, we obtain
from (\star_1) and (\star_2) that |T_{i,i}|^2 = \sum_{k=i}^{n} |T_{i,k}|^2, from which we can conclude that indeed T_{i,k} = 0
for k = i + 1, . . . , n.
Theorem 6.7. [2.4] If A ∈ Cn×n is Hermitian then there is a unitary Q and a real Λ ∈ Rn×n
such that A = QΛQ∗ .
Lemma 6.8. [2.5] Let A ∈ C^{n×n} be positive definite. Then there is a positive definite A^{1/2} ∈ C^{n×n} (the
square root of A) such that A = A^{1/2} A^{1/2}.
Video material covering statements 6.3, 6.7 and 6.8 can be found here.
Lecture 7
Matrix Norms, Part II
The inequalities in the above theorem become equalities if the matrix is normal and we use the
matrix norm induced by the Euclidean norm:
Theorem 7.2. [1.31] If A ∈ C^{n×n} is normal then \rho(A)^l = \|A\|_2^l for all l ∈ N.
Therefore
\frac{\|Ax\|_2^2}{\|x\|_2^2} = \frac{\sum_j |\alpha_j|^2 |\lambda_j|^2}{\sum_j |\alpha_j|^2} \le \frac{\sum_j |\alpha_j|^2 |\lambda_1|^2}{\sum_j |\alpha_j|^2} = |\lambda_1|^2
from which we see that kAk2 ≤ |λ1 |. Together with Theorem 7.1 the assertion follows.
Theorem 7.3. [1.32] For all A ∈ C^{m×n} the equality \|A\|_2^2 = \rho(A^* A) holds true.
\rho(A^* A) = \|A^* A\|_2 = \max_{\|x\|_2 = 1} \|A^* A x\|_2 = \max_{\|x\|_2 = 1} \max_{\|y\|_2 = 1} \langle y, A^* A x \rangle \qquad (\star)
where we used a duality argument◦ for the last identity. On the one hand,
(\star) \ge \max_{\|x\|_2 = 1} \langle x, A^* A x \rangle = \max_{\|x\|_2 = 1} \langle Ax, Ax \rangle = \max_{\|x\|_2 = 1} \|Ax\|_2^2 = \|A\|_2^2.
kxk2 =1 kxk2 =1 kxk2 =1
◦ We will not lean heavily on duality as part of the module, but some knowledge thereof could
become useful in other contexts. As a quick introduction (you can find more in Section 1.3 of
Stuart & Voss): Given a norm k · k on Cn , the pair (Cn , k · k) is a Banach space (a complete
normed vector space) B. The Banach space B′, the dual of B, is the pair (C^n, \|\cdot\|_{B'}), where
\|x\|_{B'} = \max_{\|y\| = 1} |\langle x, y \rangle|. The usage of max here implicitly relies on the fact that a continuous
function on a closed, bounded set achieves its maximum value.
Exercise: Use the above theorem to show that kU Ak2 = kAk2 = kAV k2 for all unitary matrices
U ∈ Cm×m and V ∈ Cn×n .
Lemma 7.4. For any A ∈ Cn×n and δ > 0 there is a vector norm k · kδ on Cn such that the
induced norm fulfills ρ(A) ≤ kAkδ ≤ ρ(A) + δ.
= \|J_\delta\|_\infty

and, recalling that \|\cdot\|_\infty is the maximum row sum, we obtain that this is

= \max_i \sum_j |(J_\delta)_{i,j}| \le \max_i |\lambda_i| + \delta = \rho(A) + \delta.
Lecture 8
Floating Point Representation
Computers are based on the binary system, β = 2. And as they are finite machines, the idea is to
truncate the infinite sum and to approximate x by

\xi = \sigma\, 2^e \Big( 1 + \sum_{n=1}^{t} a_n 2^{-n} \Big) = \sigma\, 2^e \times (1.a_1 \dots a_t)_2, \qquad e = \sum_{i=1}^{s} b_i\, 2^{s-i} - m.
The bits (a_1 . . . a_t) are called the mantissa, here of length t, with fraction bits a_n ∈ {0, 1}; the
bits (b_1 . . . b_s) represent the exponent, of length s, with exponent bits b_i ∈ {0, 1}. The number m is
called the bias or shift.
Example, IEEE Standard 754 Double Precision:
There are 64 bits to represent a number. The first bit is the sign bit, the next eleven bits are
the exponent bits, and the final 52 bits are the fraction bits:
(σb1 . . . b11 a1 . . . a52 )
The bias is fixed at m = 1023.
You can try this out on a particular numerical example, e.g. converting 286.75 into an IEEE
Standard 754 format. In what follows we can use single (rather than double) precision just to
simplify some of the algebra. Key steps:
1. Represent the decimal number in standard binary: (286.75)10 = (100011110.11)2 . This
is a good opportunity for a refresher, particularly for the fractional parts which require
contributions of 2−1 and 2−2 in this case in order to represent the 0.75 part of the original
decimal number.
2. Normalise the binary number via binary shift (the so-called 1.m form) such that only one
hidden one is left at the start: 1.0001111011.
3. Adjust with the bias for the single precision format, which for us is 2(8−1) − 1 = 127. (You
can try out the double precision version as an exercise.)
4. The exponent value (+8, for 2^8 as obtained in step 2) is added to the bias
(8 + 127 = 135_{10}), which leads us to our 8-bit exponent field being
(1000 0111)_2.
5. Putting everything together (0 for the sign bit, 1000 0111 for the exponent bits (8 bits in total)
and 000 1111 0110 0000 0000 0000 for the fraction bits (23 bits in total, with padding at the end)), we retrieve
our final result: (286.75)_{10} = (0100 0011 1000 1111 0110 0000 0000 0000)_{2, IEEE 754 single precision}.
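The conversion can be double-checked with Python's struct module; a verification sketch, not part of the notes:

```python
import struct

# Pack 286.75 as a big-endian IEEE 754 single-precision float and
# reinterpret the 4 bytes as an unsigned 32-bit integer.
bits = struct.unpack(">I", struct.pack(">f", 286.75))[0]
pattern = format(bits, "032b")

sign = pattern[0]            # sign bit
exponent = pattern[1:9]      # 8 exponent bits: 10000111 = 135 = 8 + 127
fraction = pattern[9:]       # 23 fraction bits: 0001111011 followed by zeros
```

This reproduces the bit fields derived by hand in steps 1 to 5.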
Additional resource: if you would like to try out some examples (or verify your own calcula-
tions) there are nice converters out there, as well as other worked-out cases of various degrees
of complexity.
The relative error when approximating a number x by its nearest neighbour ξ is
\frac{|\xi - x|}{|x|} \le \varepsilon_m \approx 10^{-16}.
ε_m is called the machine precision. For positive x, representable values ξ range between ≈ 10^{-320} and ≈ 10^{308}. If
x is bigger (smaller) then we have to deal with an overflow (underflow).
Landau (Big O) Notation. Asymptotic notation is a useful tool in assessing algorithmic
performance, as well as analysing (and indeed constraining) error levels in implementation.
You may have come across descriptions of algorithms (in terms of operation count) as being
linear/polynomial/exponential in the relevant variable n (e.g. the size of a matrix) or having a
concrete estimate as being O(n), O(n2 ), O(n log n) etc.
In general form, let f be a real- or complex-valued function and g a real-valued function, both defined
on some unbounded subset of the positive real numbers, with g(x) strictly positive for all sufficiently
large values of x. Then we write f(x) = O(g(x)) as x → ∞ if there exist a positive M ∈ R and
an x_0 ∈ R such that |f(x)| ≤ M g(x) ∀ x ≥ x_0.
In more practical terms, O(g(n)) provides a worst-case estimate for the runtime of
an algorithm (in that the operation count is bounded between 0 and M g(n)). As a concrete example, if
f(x) = 5x^3 + 6x + 2 then f = O(x^3), since the cubic term provides the dominant growth
as x → ∞. Similarly, there exists a lower-bound counterpart (denoted by Ω).
Finally, this framework also provides the means to create an asymptotic tight bound using the
Θ-notation, which sets a proportionality relationship between the two relevant functions. In
other words, f (x) = Θ(g(x)) means that there exist positive constants c1 , c2 and x0 such that
0 ≤ c1 g(x) ≤ f (x) ≤ c2 g(x) ∀ x ≥ x0 .
Additional resources and examples: freeCodeCamp offers some nice background material
and visualisations if you are keen to develop your intuition on the topic.
Lecture 9
Error Analysis
In an abstract way, a problem consists of input data of a specific type that are transformed to
output data of a specific type using an algorithm:
The subject of numerical mathematics is to analyse and estimate errors in the result in terms
of errors in the input data and occurring when performing operations:
The question is: how close is the computed solution to the correct one?
Input Errors: We have two sources in mind,
2. input data uncertainty, for example parameters obtained from experimental measurements.
[Diagram: the exact problem f maps input x ∈ E to output y = f(x) ∈ R; the algorithm φ maps the perturbed input ξ to θ = φ(ξ).]
absolute: \|\varphi(\xi) - f(x)\| = \|\theta - y\|, \qquad relative: \frac{\|\varphi(\xi) - f(x)\|}{\|f(x)\|} = \frac{\|\theta - y\|}{\|y\|}.
kf (x)k kyk
An algorithm is forward stable if
\frac{\|\theta - y\|}{\|y\|} = \frac{\|\varphi(\xi) - f(x)\|}{\|f(x)\|} = O(\varepsilon_m) \quad \text{whenever} \quad \frac{\|\xi - x\|}{\|x\|} = O(\varepsilon_m) \ \text{as } \varepsilon_m \searrow 0.
A straightforward way towards an error analysis is to take into account errors of order ε_m for every input
number and every elementary executable operation, and to drop terms of order
o(ε_m) as ε_m → 0.
Example: (Recording available here) Given three numbers x1 , x2 , x3 6= 0 we want to compute
f (x1 , x2 , x3 ) = x1 x2 /x3 .
A possible algorithm is to first compute the product x1 x2 and then to divide the result by x3 .
By assumption A1, there are floating point numbers ξi and small reals εi = O(εm ) as εm → 0
such that ξi = xi (1 + εi ), i = 1, 2, 3. With assumption A2, the first step of the algorithm
yields a number ξ1 ξ2 (1 + δ1 ) with some δ1 = O(εm ). The second step involves a δ2 = O(εm )
associated with the small error due to the division and results in (ξ1 ξ2 /ξ3 )(1 + δ1 )(1 + δ2 ) =
(ξ1 ξ2 /ξ3 )(1 + δ1 + δ2 + O(ε2m )) where we used that δ1 δ2 = O(ε2m ) = o(εm ). Together with the
erroneous input data the algorithm computes
(x_1, x_2, x_3) \mapsto \frac{\xi_1 \xi_2}{\xi_3} (1 + \delta_1 + \delta_2 + O(\varepsilon_m^2))
= \frac{x_1(1+\varepsilon_1)\, x_2(1+\varepsilon_2)}{x_3(1+\varepsilon_3)} (1 + \delta_1 + \delta_2 + O(\varepsilon_m^2))
= \frac{x_1 x_2}{x_3} (1 - \varepsilon_3 + O(\varepsilon_m^2))(1 + \varepsilon_1 + \varepsilon_2 + \delta_1 + \delta_2 + O(\varepsilon_m^2))
= \frac{x_1 x_2}{x_3} (1 + \varepsilon_1 + \varepsilon_2 - \varepsilon_3 + \delta_1 + \delta_2 + O(\varepsilon_m^2)) =: \tilde f(x)
Hence we obtain for the relative forward error that
\frac{|\tilde f(x) - f(x)|}{|f(x)|} = |\varepsilon_1 + \varepsilon_2 - \varepsilon_3 + \delta_1 + \delta_2 + O(\varepsilon_m^2)| \le 5\varepsilon_m + O(\varepsilon_m^2) = O(\varepsilon_m) \ \text{as } \varepsilon_m \to 0.
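This O(ε_m) forward error can be observed experimentally by comparing double-precision evaluation with exact rational arithmetic; in this sketch the inputs are arbitrary nonzero test values.

```python
from fractions import Fraction
import sys

eps = sys.float_info.epsilon          # machine precision, about 2.2e-16

x1, x2, x3 = 0.1, 0.7, 0.3            # arbitrary nonzero inputs
computed = x1 * x2 / x3               # the two-step algorithm in floats

# Exact value of f applied to the *stored* inputs; Fraction(float) is exact,
# so only the two floating point operations contribute to the error.
exact = Fraction(x1) * Fraction(x2) / Fraction(x3)
rel_err = abs(Fraction(computed) - exact) / abs(exact)
```

The relative error stays within a small multiple of ε_m, consistent with the bound 5ε_m + O(ε_m²) derived above.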
One can imagine that this type of error analysis becomes quite tedious when the problems
become large and the algorithms complicated. An advantage is that it is then possible to have
a computer perform the analysis.
Nevertheless, it turns out that the resulting estimates usually are quite rough and, even more
importantly, yield no particular insight. As we will see, there are problems which intrinsically lead to
relatively large errors when the input data involve small errors, independently of the algorithm used to
solve them. The above method of forward error analysis does not account for this so-called conditioning
of problems, let alone detect ill-conditioned problems.
A more sophisticated approach, which has meanwhile become standard in numerical linear algebra, is backward
error analysis. But before turning to this idea we have a look at the notion of conditioning.
Lecture 10
Conditioning
[Diagram: the set E of perturbed inputs ξ around x is mapped by f to the result set R = f(E) around y.]
The result set is R = f(E). The conditioning depends on the size of R: the problem f is called
well-conditioned if R is small and ill-conditioned if R is rather big.
Example: Consider a 2 × 2 linear system of equations or, as an equivalent geometric problem,
the intersection of two lines.
[Diagram: two line-intersection problems f_1 and f_2 with outputs y_1, y_2 and output sets η_1, η_2; for f_2 the lines meet at a small angle.]
Errors in the input data correspond to changes of the lines. Problem f1 is better conditioned
than problem f2 since the change of the intersection point is less drastic when shifting one of
the lines by about the same amount. In this particular example, a small angle between the lines
is deemed unfavourable.
The general goal is to find a significant number to measure how well a problem is conditioned.
Consider a simple scalar problem f : R → R. Using a Taylor expansion we have, to first order,
f(x + h) − f(x) ≈ f'(x)h, from which we obtain

\frac{f(x+h) - f(x)}{f(x)} \approx \Big[ \frac{x f'(x)}{f(x)} \Big] \times \frac{h}{x}
which relates the relative output error (on the left hand side) to the relative input error h/x.
A meaningful definition of a condition number (of problem f in a point x) therefore is
\kappa_f(x) := \Big| \frac{x f'(x)}{f(x)} \Big|.
\kappa_f(x) = \|J(x)\| \frac{\|x\|}{\|f(x)\|}

where J(x) is the Jacobian of f and \|J(x)\| is the matrix norm induced by the vector norms used
for x on C^m and f(x) on C^n.
Example: f(x) = arcsin(x). Then f'(x) = 1/\sqrt{1-x^2} and \kappa_f = x/(\sqrt{1-x^2}\, \arcsin(x)); \kappa_f \to \infty
as x \nearrow 1, hence the evaluation of arcsin close to x = 1 is an ill-conditioned problem.
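A numerical sketch of this blow-up; the sample points are arbitrary:

```python
import math

def kappa_arcsin(x):
    """Condition number |x f'(x) / f(x)| for f = arcsin."""
    return abs(x / (math.sqrt(1 - x * x) * math.asin(x)))

# The condition number grows without bound as x approaches 1.
values = [kappa_arcsin(x) for x in (0.5, 0.9, 0.99, 0.999999)]
```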
Another example of a conditioning argument for a simple 2 × 2 linear system solving scenario
is provided in the form of video material.
Conditioning of (SLE)
Consider the problem of computing the matrix-vector product f(A, x) = Ax =: b, x, b ∈ C^n,
A ∈ C^{n×n}. Then f'(x) = A, and the condition number becomes \kappa_f(x) = \|A\| \|x\| / \|Ax\| (we
omit the discussion of the trivial case x = 0). Assume now that A is invertible and observe that
then
then
\|x\| = \|A^{-1} A x\| \le \|A^{-1}\| \|Ax\|.

We deduce that

\kappa_f(x) = \|A\| \frac{\|x\|}{\|Ax\|} \le \|A\| \|A^{-1}\|.
Similarly, when considering the problem g(A, b) = A^{-1} b =: x, we just have to replace A by
A^{-1} in the above analysis to obtain the same estimate for the condition number.
Proposition 10.2. [3.3] Let A be regular, Ax = b ≠ 0 and A(x + ∆x) = b + ∆b. Then

\frac{\|\Delta x\|}{\|x\|} \le \kappa(A) \frac{\|\Delta b\|}{\|b\|}.
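Proposition 10.2 can be illustrated numerically; in this sketch the matrix, the right-hand side and the perturbation are all arbitrary choices, and norms are taken as ∞-norms.

```python
# A 2x2 example: compare the relative error amplification with
# kappa(A) = ||A||_inf * ||A^{-1}||_inf.
A = [[1.0, 2.0],
     [3.0, 4.0]]
Ainv = [[-2.0, 1.0],          # exact inverse of A (det = -2)
        [1.5, -0.5]]

def norm_inf(M):
    """Maximum row sum of a matrix."""
    return max(sum(abs(v) for v in row) for row in M)

kappa = norm_inf(A) * norm_inf(Ainv)          # 7 * 3 = 21

b  = [1.0, 1.0]
db = [1e-8, -1e-8]                            # perturbation of b
x  = [sum(Ainv[i][j] * b[j] for j in range(2)) for i in range(2)]
dx = [sum(Ainv[i][j] * db[j] for j in range(2)) for i in range(2)]

lhs = max(abs(v) for v in dx) / max(abs(v) for v in x)     # ||dx|| / ||x||
rhs = kappa * max(abs(v) for v in db) / max(abs(v) for v in b)
```

As the proposition predicts, the relative output error lhs is bounded by κ(A) times the relative input error.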
In order to get an estimate with respect to perturbations of the matrix A we need the following
lemma:
Lemma 10.3. [3.4] (and available via recording) Assume that B ∈ C^{n×n} satisfies \|B\| < 1.
Then I + B is invertible and

\|(I + B)^{-1}\| \le (1 - \|B\|)^{-1}.
Proof. Clearly
\|A^{-1} \Delta A\| \le \|A^{-1}\| \|\Delta A\| = \kappa(A) \frac{\|\Delta A\|}{\|A\|}. \qquad (+)
From the assumptions (A + ∆A)∆x = −∆Ax, hence
Bringing the results of the propositions 10.2 and 10.4 together one can show
Theorem 10.5. Let A, ∆A ∈ C^{n×n}, A regular, such that \kappa(A)\|\Delta A\| \le \|A\|, and let b ∈ C^n \setminus \{0\},
∆b ∈ C^n. Assume that x ∈ C^n solves Ax = b and x̂ solves (A + ∆A)x̂ = b + ∆b. Then
Hints:
• One can start from Ax̂ = b + ∆b − ∆Ax̂, and subtract Ax and b (given they are equal to
one another) from the lhs and rhs, respectively.
• Multiply from the left by A−1 and use norm properties (separate out terms as much
as possible with relevant inequalities). It also helps at this stage to notice that ||x̂|| ≤
||x|| + ||x̂ − x|| on the rhs. The extra terms can then be suitably shifted to the lhs and
re-cast into a term involving A and ∆A.
• Division by ||x|| and using the definition of the condition number κ(A) to replace relevant
terms involving A and its inverse lead us to an almost final result.
• At the final stage note one of our assumptions (i.e. κ(A)k∆Ak ≤ kAk) to ensure that a
suitable division does not become problematic and we reach our conclusion.
Lecture 11
Backward Error Analysis
Before we can proceed with an error analysis we have to say what an algorithm is. We give a
general definition first.
Definition 11.1. An algorithm is a process consisting of a set of elementary components (called
steps) to derive specific output data from specific input data where the following conditions have
to be fulfilled:
• finiteness: the whole process is described by a finite text,
• effectivity: each step can be executed on a computing machine,
• termination: the process terminates after a finite number of steps,
• determinacy: the course of the process is completely and uniquely prescribed.
Elementary executable operations are the operations \{+, -, \times, /, \sqrt{\cdot}\}. An algorithm for a map f
is a decomposition

f = f^{(l)} \circ \cdots \circ f^{(1)}, \qquad l \in \mathbb{N},

into maps f^{(i)} involving at most one elementary executable operation.
Occasionally, a step will contain a reference to another algorithm.
In a realisation of an algorithm, elementary executable operations involve errors (assumption
A2), and we denote by φ^{(i)} a realisation of f^{(i)}, so that φ = φ^{(l)} \circ \cdots \circ φ^{(1)} is a realisation of f.
Backward Error Analysis: The basic idea is to assume that the computed value θ = φ(ξ) is the
exact result for perturbed input data ζ in place of ξ, i.e. θ = f(ζ). If no such ζ exists, the algorithm is
termed not backward stable. If there are multiple choices for ζ (f not injective), we choose a ζ
minimising \|\zeta - x\|.
[Diagram: input x with perturbed input ξ ∈ E; the algorithm φ computes θ = φ(ξ), which is interpreted as the exact value θ = f(ζ) of a backward-perturbed input ζ ∈ Z.]
The backward error is measured

absolute: \|\zeta - x\|, \qquad relative: \frac{\|\zeta - x\|}{\|x\|}.

An algorithm is backward stable if

\frac{\|\zeta - x\|}{\|x\|} = O(\varepsilon_m) \ \text{as } \varepsilon_m \searrow 0. \qquad (11.1)
This notion of backward stability is useful in the following sense: Obtaining the solution with
an algorithm involving approximations instead of performing exact computation is converted
to exactly solving the problem with perturbed initial data. But then the conditioning of the
problem can tell us how far the computed solution may deviate from the exact one.
Recall that the condition number was intended to yield an estimate of the form
Recall that, in the end, we are interested in estimating the forward error. The backward
error analysis outlined above typically yields sharper estimates than the standard forward error analysis
and, in addition, splits the estimate into a problem-intrinsic part (the conditioning) and an algorithm-intrinsic
part (the backward error). For a deeper discussion of error analysis, see [5] (Higham, 2002).
Example 1: The subtraction is backward stable.
The exact version is x = (x_1, x_2)^T, f(x) = x_1 - x_2 = y. With some numbers |\varepsilon^{(i)}| \le \varepsilon_m,
i = 1, 2, 3, the computed version is

\theta = \big( x_1(1+\varepsilon^{(1)}) - x_2(1+\varepsilon^{(2)}) \big)(1+\varepsilon^{(3)}) = x_1(1+\varepsilon^{(4)}) - x_2(1+\varepsilon^{(5)})

where \varepsilon^{(4)} = \varepsilon^{(1)} + \varepsilon^{(3)} + \varepsilon^{(1)}\varepsilon^{(3)} = O(\varepsilon_m) as \varepsilon_m \searrow 0, and \varepsilon^{(5)} similarly.
Hence f (ζ) = θ for ζ = (x1 (1 + ε(4) ), x2 (1 + ε(5) ))T , and we conclude that
Example 2: Let x, y ∈ Cn where, for simplicity, we assume that they are not defective. Com-
puting the standard inner product recursively by
φn (x, y) = hζ, yi
Proof. The recursive definition of the algorithm for the scalar product invites a proof by
induction. If n = 1 we have the ordinary product on C, which is backward stable, as can be
shown similarly to the subtraction in the previous example.
Let n > 1 and assume that the assertion is true for n − 1. We then have
as well as
Lecture 12
Error Analysis of the Gaussian Elimination
and inequalities like |B| ≤ |A| for matrices have to be understood as valid for each element.
Proof. We consider the forward substitution for a unit lower triangular T only. The algorithm
can be defined recursively by

x_k = b_k - \langle l^{k-1}, x^{k-1} \rangle

for k = 1, . . . , n, where l^{k-1} := (l_{k,1}, \dots, l_{k,k-1})^T, x^{k-1} := (x_1, \dots, x_{k-1})^T, and the l_{i,j} are
the entries of T. Using the result for the scalar product, a realisation in floating point arithmetic yields

\hat x_k = \big( b_k - \varphi_{k-1}(l^{k-1}, \hat x^{k-1}) \big)(1 + \varepsilon^{(k)}) = \big( b_k - \langle \hat l^{k-1}, \hat x^{k-1} \rangle \big)(1 + \varepsilon^{(k)})

Setting (\Delta T)_{k,i} := \hat l_i^{k-1} - l_i^{k-1}, i = 1, . . . , k − 1, and k = 2, . . . , n, and (\Delta T)_{k,k} := -\varepsilon^{(k)}/(1 + \varepsilon^{(k)}), we
indeed have that (T + \Delta T)\hat x = b, and (12.1) holds true, too.
Theorem 12.2. [5.6] Assume that the LU factorisation of a matrix A ∈ Cn×n exists and denote
by L̂, Û the LU factors computed by LU. Then L̂Û = A + ∆A where
|\Delta A| \le \frac{n\varepsilon_m}{1 - n\varepsilon_m} |\hat L||\hat U| \le (n\varepsilon_m + o(\varepsilon_m)) |\hat L||\hat U|.
As a consequence of Theorems 12.1 and 12.2 we obtain
Theorem 12.3. [5.7] Assume that the LU factorisation of a matrix A ∈ Cn×n exists and denote
by L̂, Û the LU factors computed by LU. Then the solution x̂ ∈ Cn for Ax = b computed by
GE satisfies
(A + ∆A)x̂ = b
with a matrix ∆A ∈ Cn×n satisfying
|\Delta A| \le \frac{3n\varepsilon_m}{1 - 3n\varepsilon_m} |\hat L||\hat U| \le (3n\varepsilon_m + o(\varepsilon_m)) |\hat L||\hat U|.
In order to obtain an estimate for the relative backward error (something like \|\Delta A\| / \|A\|) it
apparently would be sufficient to estimate \| |\hat L||\hat U| \| in terms of \|A\|. To see that this can be
problematic consider the matrix
A = \begin{pmatrix} \delta & 1 \\ 1 & 1 \end{pmatrix} \ \Rightarrow\ L = \begin{pmatrix} 1 & 0 \\ \frac{1}{\delta} & 1 \end{pmatrix}, \quad U = \begin{pmatrix} \delta & 1 \\ 0 & 1 - \frac{1}{\delta} \end{pmatrix}
where 0 < δ ≪ 1 is small. The condition number is \kappa_\infty(A) = 4 + O(\delta), therefore the problem of
solving Ax = b is well-conditioned. But
|L||U| = \begin{pmatrix} \delta & 1 \\ 1 & \frac{2}{\delta} - 1 \end{pmatrix} \ \Rightarrow\ \big\| |L||U| \big\|_\infty = O\Big( \frac{1}{\delta} \Big) \ \text{as } \delta \to 0.
with P being the permutation matrix exchanging rows 1 and 2. And then
|L||U| = \begin{pmatrix} 1 & 1 \\ \delta & 1 \end{pmatrix} = P A,
g_n(A) := \frac{\|U\|_{\max}}{\|A\|_{\max}}.
\|U\|_\infty \le n\, g_n(A)\, \|A\|_\infty.
Since all the matrix entries of L have modulus at most 1 thanks to the pivoting, we also have that

\|L\|_\infty \le n.
(A + ∆A)x̂ = b
To conclude, we need an answer on how big the growth factor gn (A) can become.
Lemma 12.6. [5.10] g_n(A) \le 2^{n-1} for all A ∈ C^{n×n}, and this estimate is sharp.
Remark: In view of the exponential growth of g_n(A) in n, the stability estimate (12.2) is rather weak. A better estimate for the growth factor is obtained when applying complete pivoting (GECP). However, the upper bound in Lemma 12.6 is a worst case and, in practice, the additional effort due to the more expensive pivot search in GECP does not pay off. Other, more costly algorithms for solving (SLE) with better stability properties will be discussed later on.
Lecture 13
Computational Cost
Example: Consider the following algorithm for the standard matrix-vector product.
1: for i = 1 to m do
2: xi := 0
3: for j = 1 to n do
4: xi := xi + ai,j bj
5: end for
6: end for
Line 4 involves 1 addition and 1 multiplication, i.e. 2 op. With the loop in line 3 we obtain n × 2 op = 2n op. The assignment in line 2 does not count as an operation, but with the loop in line 1 we get m × 2n op, whence Algorithm 5 has computational cost

  C(m, n) = 2mn op.
Any algorithm for the matrix-vector product requires at least Ω(m) operations as m → ∞, which can be seen from the fact that m values have to be computed.
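As a sanity check, the operation count can be reproduced programmatically; the following Python sketch (the function name matvec_counted is ours) mirrors Algorithm 5 and adds 2 op per pass of the inner loop.

```python
# Python sketch of Algorithm 5 (matrix-vector product x = Ab) with an
# operation counter; each inner-loop pass does 1 addition + 1
# multiplication, so the total is C(m, n) = 2mn op.
def matvec_counted(A, b):
    m, n = len(A), len(b)
    x = [0.0] * m              # the assignments x_i := 0 are not counted
    ops = 0
    for i in range(m):
        for j in range(n):
            x[i] = x[i] + A[i][j] * b[j]
            ops += 2
    return x, ops
```

For a 3 × 2 matrix this reports 2 · 3 · 2 = 12 operations.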
Some remarks:
• Estimating the computation time is not the aim as this depends too much on the computer
architecture. We rather want to get an idea of the complexity of the algorithm for which
the number of operations is a good measure.
• High performance computers are parallel computers containing multiple processing units.
Issues here are not only the number of operations but also the data exchange and the
balancing of the work load. An algorithm is said to scale optimally if doubling the number
of processors halves the computation time.
Let us turn our attention to the computational cost of solving systems of linear equations with
Gaussian elimination, [5.1], [5.2].
It is relatively straightforward to show that the cost of FS or BS is O(n²), which leads to the following result:
Lecture 14
Divide & Conquer is an important design paradigm for algorithms. The idea is to break down a problem into sub-problems which are recursively solved until the sub-problems become small enough to be solved directly. Afterwards, the sub-solutions are combined in merging steps to yield the solution to the original problem.
As an example, we will look at a method to compute the matrix-matrix product which goes back
to Strassen (1969). Recall that the standard matrix-matrix product C = AB of two matrices A, B ∈ C^{n×n} computed via

  c_{i,j} = Σ_{k=1}^{n} a_{i,k} b_{k,j},  i, j = 1, …, n,

has a computational cost of C_{MMStd}(n) = Θ(n³) as n → ∞. Since n² entries are to be computed any algorithm has cost Ω(n²) as n → ∞.
Assume, just for convenience, that n = 2^k with some k ∈ N and write

  A = [A_{11} A_{12}; A_{21} A_{22}] ∈ C^{2^k × 2^k},  A_{ij} ∈ C^{2^{k−1} × 2^{k−1}},

where the n² contribution comes from the 4 additions of (n/2 × n/2)-matrices. One can show that this leads to a cost of Θ(n³) as n → ∞, and the algorithm is not better than the standard one.
LECTURE 14. DIVIDE & CONQUER ALGORITHMS
C11 = P1 + P4 − P5 + P7 ,
C12 = P3 + P5 ,
C21 = P2 + P4 ,
C22 = P1 + P3 − P2 + P6 . (14.2)
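To make the recursion concrete, here is a sketch in Python for n a power of 2. The products P1, …, P7 below are the classical Strassen products, which we assume is what (14.1) defines; the combinations are exactly (14.2).

```python
# Strassen multiplication for n x n matrices, n a power of 2, using the
# combination formulas (14.2); P1..P7 are the classical Strassen products.
def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def msub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def strassen(A, B):
    n = len(A)
    if n == 1:                       # base case: product of scalars
        return [[A[0][0] * B[0][0]]]
    h = n // 2                       # split into four (n/2 x n/2) blocks
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    P1 = strassen(madd(A11, A22), madd(B11, B22))
    P2 = strassen(madd(A21, A22), B11)
    P3 = strassen(A11, msub(B12, B22))
    P4 = strassen(A22, msub(B21, B11))
    P5 = strassen(madd(A11, A12), B22)
    P6 = strassen(msub(A21, A11), madd(B11, B12))
    P7 = strassen(msub(A12, A22), madd(B21, B22))
    C11 = madd(msub(madd(P1, P4), P5), P7)   # P1 + P4 - P5 + P7
    C12 = madd(P3, P5)                       # P3 + P5
    C21 = madd(P2, P4)                       # P2 + P4
    C22 = madd(msub(madd(P1, P3), P2), P6)   # P1 + P3 - P2 + P6
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

Note that only 7 recursive multiplications occur per level, at the price of 18 block additions/subtractions.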
and since log₂(7) ≈ 2.807 < 3 we have obtained an algorithm that asymptotically is of lower cost, Θ(n^{log₂ 7}) as n → ∞, than the previously presented algorithms. Let us briefly prove (14.3) - also available as recording:
Proof. The recursive definition of the Strassen multiplication invites for an induction proof. For
k = 0 there is only one multiplication of scalars, and formula (14.3) indeed yields one.
Assume now that (14.3) is true for k − 1. As there are 18 additions of (n/2 × n/2)-matrices in (14.1) and (14.2) we obtain the formula
To deal with the case that n is not a power of 2 one may pick k ∈ N such that 2^k ≥ n > 2^{k−1} and then define

  Ã = [A 0; 0 0] ∈ C^{2^k × 2^k}
by adding some blocks containing zeros, similarly with B̃. It turns out that the upper left n × n block of C̃ := ÃB̃ then contains the desired product C = AB. Applying the Strassen multiplication to Ã and B̃ involves a cost that is Θ((2^k)^{log₂ 7}) as k → ∞, but since 2^k ≤ 2n by the choice of k in dependence of n this is Θ(n^{log₂ 7}) as n → ∞. We may summarise the findings in
Theorem 14.1. Using the Strassen multiplication one can construct an algorithm for computing
the product of n × n-matrices, n ∈ N, with computational cost
An algorithm developed by Coppersmith and Winograd (1990) scales even better as it has a cost of about Θ(n^{2.376}), but it is not (yet?) practical as its advantage only becomes perceptible for such big n that the corresponding matrices are just too big for even the most modern supercomputers. Any algorithm will have a cost Ω(n²) as n² is the number of elements to be computed.
What is this studying of the exponent useful for? Well, if one can multiply n × n-matrices with an asymptotic cost of O(n^α) as n → ∞, α ≥ 2, then it is also possible to invert regular n × n matrices with cost O(n^α). For the proof one may use the Schur complement S = D_{22} − D_{12}^* D_{11}^{−1} D_{12} of a Hermitian matrix

  D = [D_{11} D_{12}; D_{12}^* D_{22}]
and proceed recursively. The assertion on the cost then follows similarly as for the Strassen
multiplication which is why the proof is omitted.
Lecture 15
Definition 15.1. Given a matrix A ∈ R^{m×n} and a vector b ∈ R^m, the least squares problem LSQ consists of minimising the function

  g : R^n → R,  g(x) = (1/2) ‖Ax − b‖₂².
Example: Recall the linear regression problem. Given points (ξ_i, y_i)_{i=1}^m, find a linear function ξ ↦ x_1 + x_2 ξ such that

  g(x) = (1/2) Σ_{i=1}^m (x_1 + x_2 ξ_i − y_i)²  is minimal.

In this case,

  A = [1 ξ_1; ⋮ ⋮; 1 ξ_m],  b = (y_1, …, y_m)^T.
Theorem 15.2. ([7.1], available via recording here) x ∈ R^n solves the least squares problem if and only if Ax − b ⊥ range(A), which is the case if and only if the normal equation

  A^T A x = A^T b  (7.1)

is satisfied.
LECTURE 15. LEAST SQUARES PROBLEMS
which means that Ax − b ⊥ range(A). The other way round, if Ax − b ⊥ range(A) then ⟨Ax − b, Ay − Ax⟩ = 0 for all y ∈ R^n, hence with Pythagoras

  2g(y) = ‖Ay − b‖₂² = ‖Ay − Ax‖₂² + ‖Ax − b‖₂² ≥ ‖Ax − b‖₂² = 2g(x).

For the second assertion we use that Ax − b ⊥ range(A) ⇔ Ax − b ⊥ a_i where the a_i, i = 1, …, n, are the column vectors of A. But this is equivalent to

  (⟨a_i, Ax⟩)_{i=1}^n = (⟨a_i, b⟩)_{i=1}^n  ⇔  (7.1).
Example: In the linear regression problem the normal equation is a 2 × 2 system where

  A^T A = [m  Σ_{i=1}^m ξ_i; Σ_{i=1}^m ξ_i  Σ_{i=1}^m ξ_i²],  A^T b = (Σ_{i=1}^m y_i, Σ_{i=1}^m ξ_i y_i)^T.
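As an illustration, this 2 × 2 system can be solved directly, e.g. by Cramer's rule (a sketch; the helper name fit_line is ours).

```python
# Solving the linear regression LSQ via the 2x2 normal equations:
# [m, sum(xi); sum(xi), sum(xi^2)] (x1, x2)^T = (sum(y), sum(xi*y))^T.
def fit_line(xi, y):
    m = len(xi)
    s1, s2 = sum(xi), sum(t * t for t in xi)
    r1, r2 = sum(y), sum(t * v for t, v in zip(xi, y))
    det = m * s2 - s1 * s1          # Cramer's rule for the 2x2 system
    x1 = (r1 * s2 - s1 * r2) / det
    x2 = (m * r2 - s1 * r1) / det
    return x1, x2
```

For the data (0, 1), (1, 3), (2, 5), which lie exactly on the line 1 + 2ξ, this returns x1 = 1 and x2 = 2.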
It is certainly possible to use (7.1) to solve LSQ, for example one could use Cholesky since A^T A is positive definite provided that A has full rank. But, as we will see later on, we have κ₂(A^T A) = κ₂(A)² for the condition number, and even the condition number κ₂(A) (which is not yet defined for m ≠ n) can be big in practical applications. There are better approaches, based on the QR factorisation (later) or on the singular value decomposition that we consider next.
[Figure: schematics of the full SVD A = UΣV^T, with U = (u_1, …, u_m), singular values σ_1, …, σ_n on the diagonal of Σ, and V = (v_1, …, v_n), together with the corresponding reduced SVD.]
The SVD is a very powerful decomposition for analytical purposes as it provides much insight
into the properties of the linear map associated with the matrix. The rank is the number of
non-vanishing singular values, i.e., the biggest integer r ≤ p such that σr > 0 but σr+1 = 0 (if
r + 1 ≤ p). Moreover, the image of A is spanned by the first r left singular vectors,
range(A) = span{u1 , . . . , ur }
and the kernel is
kernel(A) = span{vr+1 , . . . , vp }.
In addition to this algebraic information the SVD also reveals some geometric information. Recalling the identities Av_i = u_i σ_i, i = 1, …, p, we see that the unit sphere in R^n containing the vectors v_i is mapped to an ellipsoid with semi-axes in the directions of the u_i and of lengths σ_i.
[Figure: the unit sphere spanned by the v_i is mapped by A to an ellipsoid with semi-axes σ_i u_i.]
Example: Images can be compressed using the SVD. Try the following matlab code (it’s
probably best to put it into an m-file):
load [Link];
figure;
image(X);
colormap(’gray’);
pause
[U,S,V] = svd(X);
figure;
for k=[Link]
image(U(:,1:k)*S(1:k,1:k)*V(:,1:k)’);
colormap(’gray’);
disp(k)
disp(’Reduction: ’)
disp(521*k / 64000)
pause
end
Lecture 16
Theorem 16.1. [2.12] Every matrix A ∈ Rm×n has a SVD, and the singular values are uniquely
determined.
Proof. We first show the existence by induction on p = min(m, n).
For p = 1 we may choose u = 1, and if a11 6= 0 we may set v = a11 /|a11 | and σ1 = |a11 | whilst
in the case a11 = 0 we choose v = 1 and σ1 = 0.
Let now p > 1 and assume without loss of generality that A ≠ 0 (since the SVD is trivial otherwise with U and V being the identity and Σ = 0). The map x ↦ ‖Ax‖₂ on R^n is continuous. When restricted to the compact unit sphere S^{n−1} = {x ∈ R^n : ‖x‖₂ = 1} it attains a maximum, at a point which we denote by v_1. Further, we define
Since

  ‖C‖₂ ≥ ‖C (σ_1, w)^T‖₂ / ‖(σ_1, w)^T‖₂ ≥ (σ_1² + ‖w‖₂²) / √(σ_1² + ‖w‖₂²) = √(σ_1² + ‖w‖₂²),

we conclude that

  σ_1 = ‖C‖₂ ≥ √(σ_1² + ‖w‖₂²)  ⇒  w = 0.
LECTURE 16. MORE ON THE SINGULAR VALUE DECOMPOSITION
Using an induction argument again one can show the uniqueness of the singular values.
Corollary 16.2. kAk2 = σ1 .
Examples:
1. A symmetric matrix A ∈ Rn×n can be diagonalised in the form A = QΛQT with Q ∈ Rn×n
orthogonal and Λ = diag(λ1 , . . . , λn ) ∈ Rn×n . Without loss of generality we may assume
that |λ1 | ≥ · · · ≥ |λn |. Otherwise perform a similarity transformation of Λ with an
appropriate permutation matrix which then is absorbed into Q. Denote the columns of Q
by q1 , . . . , qn . A SVD of A is obtained by setting U := (u1 , . . . , un ) where ui = sign(λi )qi ,
Σ = diag(|λ1 |, . . . , |λn |), and V := Q.
2. Let A ∈ Rm×n with m ≥ n (so that p = min(m, n) = n). The matrix AT A ∈ Rn×n has
eigenvalues σi2 with corresponding eigenvectors vi , i = 1, . . . , p. The matrix AAT ∈ Rm×m
has eigenvectors {u1 , . . . , um } with corresponding eigenvalues {σ12 , . . . , σp2 , 0, . . . , 0} (with
m − n zeros).
To see this, let us consider the latter case as an example. We have that

  AA^T = UΣV^T VΣ^T U^T = UΣΣ^T U^T  ⇒  AA^T U = UΣΣ^T,

but ΣΣ^T = diag(σ_1², …, σ_p², 0, …, 0) with exactly m − n zeros.
3. Assume now that A ∈ R^{n×n} is regular. Then the matrix

  H := [0 A^T; A 0] ∈ R^{2n×2n}

has the 2n eigenvalues {σ_1, −σ_1, σ_2, −σ_2, …, σ_n, −σ_n} with corresponding eigenvectors (v_i, u_i)^T for σ_i and (v_i, −u_i)^T for −σ_i, i = 1, …, n.
To show this, assume that Hx = λx for some λ ∈ R and some x ∈ R^{2n} \ {0}. Writing x = (y, z)^T with y, z ∈ R^n this means that

  A^T z = λy,  Ay = λz  ⇒  AA^T z = λAy = λ²z,  A^T Ay = λA^T z = λ²y.

From the previous example we know that the eigenvalues of A^T A and AA^T are {σ_1², …, σ_n²} where σ_n > 0 by the regularity of A. Hence λ = ±σ_i for some i ∈ {1, …, n}. That (v_i, ±u_i)^T is a corresponding eigenvector is easy to show.
Remark 16.3. The above results hold true analogously for complex matrices if the transposed
matrices are replaced by the adjoint matrices.
Lecture 17
Conditioning of LSQ
The key parts of this lecture (and a few extra remarks) are also available via recording here.
Recall the normal equation (7.1) AT Ax = AT b for a solution to LSQ.
Introducing the pseudo-inverse or Moore-Penrose inverse
A† := (AT A)−1 AT
the solution is just x = A† b, provided A† exists. This is the case if and only if A has full rank
which is equivalent to AT A being regular.
Definition 17.1. [3.7] The condition number of a matrix A ∈ C^{m×n} with respect to a norm ‖·‖ is

  κ(A) = ‖A‖ ‖A†‖  if A has full rank,  and  κ(A) = ∞  otherwise.
We remark that A† = A−1 if m = n so that the above definition is consistent with the previous
one for n × n matrices.
Lemma 17.2. If A has full rank: κ2 (A) = σ1 /σn where σ1 and σn are the biggest and smallest
singular value, respectively.
Proof. From Corollary 16.2 we know that ‖A‖₂ = σ_1. Writing A = UΣV^T for a SVD, a short calculation results in a SVD

  A† = V Σ̃ Ũ^T  (17.1)

with Σ̃ = diag(1/σ_n, …, 1/σ_1), so that ‖A†‖₂ = 1/σ_n.
Moreover ‖b‖₂² = ‖b − Ax‖₂² + ‖Ax‖₂² ≥ ‖Ax‖₂², which motivates to introduce the angle θ ∈ [0, π/2] via

  cos(θ) = ‖Ax‖₂ / ‖b‖₂.
To provide a geometric interpretation, this is the angle between the b and the range of A:
[Figure: b, its orthogonal projection Ax onto rg(A), and the angle θ between them.]
Theorem 17.3. Assume that x ≠ 0 solves LSQ for data (A, b) and x + ∆x for (A, b + ∆b). Then

  ‖∆x‖₂/‖x‖₂ ≤ (κ₂(A)/(η cos(θ))) ‖∆b‖₂/‖b‖₂

where η := ‖A‖₂‖x‖₂/‖Ax‖₂ ≥ 1.
Proof. Assume that A has full rank (otherwise the assertion is trivial). From the assumptions we furthermore have that ∆x = A†∆b. Therefore

  ‖∆x‖₂/‖x‖₂ ≤ ‖A†‖₂‖∆b‖₂/‖x‖₂ = κ₂(A)‖∆b‖₂/(‖A‖₂‖x‖₂) = κ₂(A)‖∆b‖₂/(η‖Ax‖₂) = (κ₂(A)/(η cos(θ))) ‖∆b‖₂/‖b‖₂.
Imagine now that b is (almost) orthogonal to range(A), which means that the solution is close to zero, i.e., ‖x‖ is small. A small error in the data b then may lead to a small absolute deviation in the solution but can also lead to a big relative error in the solution. Example:

  A = (1, 1)^T,  b = (−1 + δ, 1)^T  ⇒  x = δ/2.

Now, think of a small error in the data b, ∆b = (ε, 0)^T ⇒ ∆x = ε/2. Hence, ‖∆x‖₂/‖x‖₂ = ε/δ may be large.
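This example can be checked numerically; for A = (1, 1)^T the LSQ solution is just the mean of the two components of b (a sketch, with hypothetical values δ = 10⁻³ and ε = 10⁻⁶):

```python
# For A = (1, 1)^T the normal equation reads 2x = b_1 + b_2, so the LSQ
# solution is the mean of the data; a perturbation of size eps in b_1
# produces a relative error of eps/delta in x.
def lsq_mean(b):
    return (b[0] + b[1]) / 2

delta, eps = 1e-3, 1e-6
x = lsq_mean([-1 + delta, 1])              # = delta/2
x_pert = lsq_mean([-1 + delta + eps, 1])   # = delta/2 + eps/2
rel_err = abs(x_pert - x) / abs(x)         # = eps/delta
```

Here the relative data error is of size ε while the relative error of the solution is ε/δ, three orders of magnitude larger.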
  A^T A x = V Σ̂^T Û^T Û Σ̂ V^T x = V Σ̂^T Σ̂ z = V Σ̂^T y = V Σ̂^T Û^T b = A^T b,  using Û^T Û = I,
which is the normal equation (7.1) and, hence, x indeed solves LSQ.
The cost of LSQ SVD is dominated by computing a reduced SVD of A as the subsequent steps
essentially are matrix-vector multiplications only. According to [4] we have
This is more than for solving the normal equation with, for instance, GEPP (or Cholesky, a special version for positive definite matrices). The benefit is better stability. Example:

  A = [1 1; δ 0; 0 δ],  b = (2, δ, δ)^T  ⇒  x = (1, 1)^T.
When using the normal equation we encounter problems for much larger values of δ than when
employing the method based on the SVD.
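The loss of information when forming A^T A can be seen directly in floating point (a sketch; δ = 10⁻⁸ is a hypothetical value with δ² below machine precision):

```python
# Forming A^T A = [1 + delta^2, 1; 1, 1 + delta^2] for the example above:
# once delta^2 drops below machine precision, 1 + delta^2 rounds to 1 and
# the computed normal matrix is exactly singular, although A itself still
# has full rank.
delta = 1e-8
ata = [[1.0 + delta * delta, 1.0],
       [1.0, 1.0 + delta * delta]]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
# exact determinant: 2*delta**2 + delta**4 > 0; computed det: 0.0
```

This reflects the squaring of the condition number, κ₂(A^T A) = κ₂(A)², which an SVD- or QR-based solver avoids.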
We will next learn about a method in between the two presented methods with respect to
stability and cost.
Lecture 18
QR factorisation
we obtain a
QR factorisation: A = QR. (18.2)
Theorem 18.1. Every matrix A ∈ R^{m×n} with m ≥ n can be factorised in the form A = QR with Q ∈ R^{m×m} orthogonal and R ∈ R^{m×n} upper triangular.
Solving LSQ using the QR factorisation [7.2]
Householder reflections
A QR factorisation may be computed by transforming the matrix to upper triangular form by
successively multiplying with appropriate orthogonal matrices as follows:
  A −→(Q_1·) R^{(1)} −→(Q_2·) R^{(2)} −→ · · · −→ R^{(n)}  (18.3)

where R^{(k)} has zeros below the diagonal in its first k columns; the successive multiplications eliminate the subdiagonal entries column by column.
so that R = Q_n ⋯ Q_1 A ⇔ A = QR with Q = Q_1^T ⋯ Q_n^T. For the orthogonal matrices, reflections may be used. Consider the vector u^{(k)} := (r_{k,k}^{(k−1)}, …, r_{m,k}^{(k−1)})^T ∈ R^{m−k+1}. In step k, we want to reflect it to −sign(u_1^{(k)}) ‖u^{(k)}‖₂ e_1 where e_1 denotes the first standard basis vector in R^{m−k+1}. An appropriate reflection matrix is given by

  H_k := I_{m−k+1} − 2 v^{(k)} (v^{(k)})^T ∈ R^{(m−k+1)×(m−k+1)}

where v^{(k)} := v̂^{(k)}/‖v̂^{(k)}‖₂ with v̂^{(k)} := sign(u_1^{(k)}) ‖u^{(k)}‖₂ e_1 + u^{(k)}.
[Figure: the reflection at the plane with unit normal v maps u to −sign(u_1)‖u‖₂ e_1.]
we obtain that
A reduced QR factorisation is obtained by dropping the last m − n columns of Q and the last
m − n rows of R (which contain zeros only).
Lecture 19
We will have a closer look at Algorithm 18, discussing the computational complexity and error analysis. But first, consider a concrete example with the following data:

  A = [3 3 2; 4 4 1; 0 6 2] = R^{(0)}.
k = 1:

  u^{(1)} = (3, 4, 0)^T,  ‖u^{(1)}‖₂ = 5,
  v̂^{(1)} = (5, 0, 0)^T + (3, 4, 0)^T = (8, 4, 0)^T,  ‖v̂^{(1)}‖₂ = √80,
  v^{(1)} = (1/√80)(8, 4, 0)^T,
  Q_1 = H_1 = I_3 − 2v^{(1)}(v^{(1)})^T = I_3 − (1/40)[64 32 0; 32 16 0; 0 0 0] = I_3 − (1/5)[8 4 0; 4 2 0; 0 0 0],
  R^{(1)} = Q_1 R^{(0)} = [3 3 2; 4 4 1; 0 6 2] − (1/5)[40 40 20; 20 20 10; 0 0 0] = [−5 −5 −2; 0 0 −1; 0 6 2].
LECTURE 19. QR FACTORISATION WITH HOUSEHOLDER
REFLECTIONS - CONTINUED
k = 2:

  u^{(2)} = (0, 6)^T,  ‖u^{(2)}‖₂ = 6,
  v̂^{(2)} = (6, 0)^T + (0, 6)^T = (6, 6)^T,  ‖v̂^{(2)}‖₂ = √72,
  v^{(2)} = (1/√72)(6, 6)^T,
  H_2 = I_2 − (1/36)[36 36; 36 36] = [0 −1; −1 0],
  Q_2 = [1 0; 0 H_2] = [1 0 0; 0 0 −1; 0 −1 0],
  R^{(2)} = Q_2 R^{(1)} = [−5 −5 −2; 0 −6 −2; 0 0 1] = R.
We end up with R = R^{(2)} and

  Q = Q_1 Q_2 = [−3/5 0 4/5; −4/5 0 −3/5; 0 −1 0].
Computational Complexity
Lemma 19.1. [5.11] The cost for QR HH except line 9 is

  C_{QR HH}(m, n) ∼ 2mn² − (2/3)n³  as m, n → ∞.
Proof. Let us first consider the cost for one k-step. Computing ‖u^{(k)}‖₂ involves m − k additions, m − k + 1 multiplications, and one square root, hence constructing v̂^{(k)} requires 2(m − k + 1) + 1 operations (we here assume that obtaining sign(u_1) and changing the sign does not involve any cost).
Line 8 essentially requires multiplying the lower right (m − k + 1) × (n − k + 1) block R̃^{(k−1)} of R^{(k−1)} with H_k = I_{m−k+1} − (2/‖v̂^{(k)}‖₂²) v̂^{(k)}(v̂^{(k)})^T from the left:

• The computation of (s^{(k)})^T := (v̂^{(k)})^T R̃^{(k−1)} (these are n − k + 1 standard inner products of vectors of length m − k + 1) involves m − k additions and m − k + 1 multiplications for each column of R̃^{(k−1)}, hence (n − k + 1)(2(m − k) + 1) operations.
Error Analysis
The following result is restated from [5] (Higham, 2002):
Theorem 19.2. Let x denote the solution to LSQ with data (A, b) and x̂ the solution computed via LSQ QR. Then x̂ minimises ĝ(y) = (1/2)‖(A + ∆A)y − (b + ∆b)‖₂² with

  ‖∆a_i‖₂ ≤ (Cmnε_m/(1 − Cmnε_m)) ‖a_i‖₂,  i = 1, …, n,
  ‖∆b‖₂ ≤ (Cmnε_m/(1 − Cmnε_m)) ‖b‖₂
where ai and ∆ai denote the column vectors of A and ∆A, respectively.
In the case m = n we may use the QR factorisation to solve (SLE). Recalling that for GEPP
we had an error bound of the form
we see from the previous theorem that LSQ QR has much better stability properties.
However, the cost for LSQ QR is C_{LSQ QR}(n) ∼ (4/3)n³ as n → ∞, while for GEPP we only had C_{GEPP}(n) ∼ (2/3)n³.
In practice, GEPP is preferred to solve (SLE). The exponential growth of the growth factor g_n(A) is only a worst case estimate and, in applications, matrices do not behave that badly.
Lecture 20
Iterative methods aim at constructing a sequence {x^{(k)}}_{k∈N} ⊂ C^n such that x^{(k)} → x. Within this context, the two (typically competing) criteria are:
• the computation of x(k) from the data and previous iterates should be inexpensive (relative
to a direct SLE solve),
(2) the matrix is sparse so that the matrix-vector product, usually needed by iterative methods,
is cheap to compute (e.g. banded matrices),
(3) the computational resources are limited since from the actual iterate one may learn at
least something about the solution (e.g. real-time control).
[Figure: error versus computation time for a direct and an iterative method; the iterative method decreases the error gradually and can reach a moderate tolerance ε early, while the direct method delivers the solution (up to ε_m) only at the very end of the computation.]
The basic idea of linear iterative methods is to split A = M + N and, given some initial guess x^{(0)} ∈ C^n for the solution, to solve

  M x^{(k)} = b − N x^{(k−1)}.

Assuming that x^{(k)} → x, passing to the limit in M x^{(k)} = −N x^{(k−1)} + b yields M x = −N x + b, i.e., Ax = b.
LECTURE 20. LINEAR ITERATIVE METHODS FOR SLE
Assuming convergence we will need a criterion to stop the iteration. Let us define the

  error e^{(k)} := x − x^{(k)}.

It would be best if we could ensure that ‖e^{(k)}‖ is small enough, say, smaller than a given tolerance. But since x is not available one has to estimate the error. This may be done using the

  residual (vector) r^{(k)} := b − Ax^{(k)}.
Observe that r^{(k)} = Ae^{(k)}, from which we deduce that

  ‖e^{(k)}‖ = ‖A^{−1}r^{(k)}‖ ≤ ‖A^{−1}‖ ‖r^{(k)}‖.
Similarly, ‖b‖ = ‖Ax‖ ≤ ‖A‖ ‖x‖, and provided that we are not in the trivial case x = b = 0 we conclude that

  1/‖x‖ ≤ ‖A‖/‖b‖.
Putting both inequalities together we obtain (similarly as Proposition 10.2) that

  ‖e^{(k)}‖/‖x‖ ≤ (‖A‖/‖b‖) ‖A^{−1}‖ ‖r^{(k)}‖ = κ(A) ‖r^{(k)}‖/‖b‖.  (20.1)
Error Analysis
Error analysis for iterative methods means analysing the convergence of algorithms (rather than
their stability).
Lemma 20.1. [6.1] Assume that e^{(k)} = Re^{(k−1)} (= R^k e^{(0)}) with some matrix R ∈ C^{n×n}. Then e^{(k)} → 0 for all e^{(0)} if and only if ρ(R) < 1.

Proof. Assume that ρ(R) < 1. For every δ > 0 there is a matrix norm ‖·‖_δ with ‖R‖_δ ≤ ρ(R) + δ. Choosing δ small enough such that ρ(R) + δ < 1 we obtain for all e^{(0)} that

  ‖e^{(k)}‖_δ = ‖R^k e^{(0)}‖_δ ≤ ‖R‖_δ^k ‖e^{(0)}‖_δ → 0  as k → ∞.

In turn, if ρ(R) ≥ 1 then Re^{(0)} = λe^{(0)} for some e^{(0)} ≠ 0 and |λ| ≥ 1. Hence ‖e^{(k)}‖ = |λ|^k ‖e^{(0)}‖ does not converge to zero.
Lecture 21
Let us split our matrix A ∈ C^{n×n} in the form A = D + L + U where D = diag(a_{1,1}, …, a_{n,n}) is the diagonal part and L and U are the lower and upper triangular parts given by

  l_{i,j} := a_{i,j} if i > j, 0 else;  u_{i,j} := a_{i,j} if i < j, 0 else.
The Jacobi method is the linear iterative method that consists in choosing M = D and N = L + U, whence

  x^{(k)} = D^{−1}(b − (L + U)x^{(k−1)}).
Example: For the tridiagonal matrix

  A = [2 −1 0 ⋯ 0; −1 2 −1 ⋱ ⋮; 0 ⋱ ⋱ ⋱ 0; ⋮ ⋱ −1 2 −1; 0 ⋯ 0 −1 2] ∈ R^{n×n}

we have D = diag(2, …, 2), while L and U contain the entries −1 on the first sub- and superdiagonal, respectively, and zeros elsewhere, so that

  x^{(k)} = (1/2)(b − (L + U)x^{(k−1)}).
LECTURE 21. THE JACOBI METHOD
Writing x^{(k)} = (x_1^{(k)}, …, x_n^{(k)})^T, this means that

  x_i^{(k)} = (1/2)(b_i + x_{i−1}^{(k−1)} + x_{i+1}^{(k−1)}),  2 ≤ i ≤ n − 1,
  x_1^{(k)} = (1/2)(b_1 + x_2^{(k−1)}),
  x_n^{(k)} = (1/2)(b_n + x_{n−1}^{(k−1)}).
2
One can weaken this criterion a little so that it becomes much more useful for quite a few applications. We need the notion of irreducibility for this purpose. A matrix A ∈ C^{n×n} is irreducible if there is no permutation matrix P such that

  P^T A P = [Ã_{1,1} Ã_{1,2}; 0 Ã_{2,2}]

where Ã_{1,1} and Ã_{2,2} both are square blocks of size p × p and (n − p) × (n − p) with some 1 ≤ p ≤ n − 1.
Graph of a matrix
This content is also covered in this video material. It can be very tedious to check this irreducibility criterion but, fortunately, there is another way based on studying the oriented graph G(A) of the matrix A. It consists of the vertices 1, …, n, and there is an (oriented) edge from vertex i to vertex j (denoted by i → j) if a_{i,j} ≠ 0.
Example:

  A_1 = [2 −1 0; 0 2 −1; 0 0 2],  A_2 = [2 −1 0; −1 2 −1; 0 −1 2]

[Figure: the oriented graphs G(A_1) and G(A_2) on the vertices 1, 2, 3.]
We say that two vertices i, j are connected if there is a chain of connecting edges (or direct
connections) i = i0 → i1 → · · · → ik = j with some k ∈ N. The graph G(A) then is called
connected if any two vertices i, j of it are connected. Irreducibility of the matrix A now may be
checked using the following lemma.

Lemma 21.2. The matrix A is irreducible if and only if the graph G(A) is connected.

Proof. Exercise.
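Connectivity of G(A), and hence irreducibility, is easy to check mechanically; a Python sketch using breadth-first search (the function name is ours):

```python
from collections import deque

# Check irreducibility of A by testing whether the oriented graph G(A)
# is connected, i.e. every vertex reaches every other vertex.
def is_irreducible(A):
    n = len(A)
    for start in range(n):
        seen = {start}
        queue = deque([start])
        while queue:                  # BFS along edges i -> j with a_ij != 0
            i = queue.popleft()
            for j in range(n):
                if A[i][j] != 0 and j not in seen:
                    seen.add(j)
                    queue.append(j)
        if len(seen) < n:
            return False
    return True
```

For the matrices of the example above, G(A_1) is not connected (A_1 is reducible) while G(A_2) is connected (A_2 is irreducible).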
Theorem 21.3. If A is irreducible and satisfies the weak row sum criterion

(1) |a_{i,i}| ≥ Σ_{j≠i} |a_{i,j}| for all i = 1, …, n, and

(2) |a_{k,k}| > Σ_{j≠k} |a_{k,j}| for at least one index k ∈ {1, …, n},

then the Jacobi method converges for any initial guess.
Proof. (also covered in this video) Recall that we need to prove that ρ(R) < 1. Let e := (1, …, 1)^T ∈ C^n and |R| = (|r_{i,j}|)_{i,j}. Then thanks to the first condition

  0 ≤ (|R|e)_i = Σ_{j=1}^n |r_{i,j}| = Σ_{j≠i} |a_{i,j}|/|a_{i,i}| ≤ 1 = e_i
so that e ≥ |R|e ≥ |R|²e ≥ … where the inequality for vectors here and in the following has to be understood component-wise.
Let t^{(l)} := e − |R|^l e ≥ 0, l ∈ N. Assume now that there is a positive number of non-vanishing components of t^{(l)} that become stationary. We may assume that these are the first m entries where m > 0 thanks to the second condition, i.e.,

  t^{(l)} = (b^{(l)}, 0)^T,  t^{(l+1)} = (b^{(l+1)}, 0)^T

where b^{(l)}, b^{(l+1)} ∈ R^m have positive entries, b^{(l)} > 0, b^{(l+1)} > 0.
Suppose that m < n. Then

  (b^{(l+1)}, 0)^T = e − |R|^{l+1}e ≥ |R|e − |R|^{l+1}e = |R|(e − |R|^l e) = [|R_{1,1}| |R_{1,2}|; |R_{2,1}| |R_{2,2}|] (b^{(l)}, 0)^T

with R_{1,1} ∈ R^{m×m} and the other blocks accordingly. Since b^{(l)} > 0, necessarily |R_{2,1}| = 0. Therefore R is not irreducible. And since |r_{i,j}| = |a_{i,j}|/|a_{i,i}| for i ≠ j we obtain that A is not irreducible, in contradiction to the assumption. Hence m = n.
Consequently, t^{(l)} > 0 as long as l is big enough (using the above contradiction argument again we see that l > n is sufficient). This means that e > |R|^l e, whence
Lecture 22
In some applications, the goal will be to decrease the relative forward error ke(k) k/kxk below a
given threshold while in others it is sufficient to decrease the relative backward error kr(k) k/kbk.
We will concentrate on the latter goal but recall that by (20.1) the two goals are related. Of
course, the knowledge of the condition number is required to deduce an estimate for the forward
error from the backward error.
So our goal is
  ‖r^{(k)}‖ ≤ ε_r ‖b‖  (22.1)

where ε_r > 0 is a given tolerance. Using r^{(k)} = Ae^{(k)} = AR^k e^{(0)} it is sufficient to achieve that

  ‖A‖ ‖R‖^k ‖e^{(0)}‖ ≤ ε_r ‖b‖  ⇔  (1/‖R‖)^k ≥ ‖A‖ ‖e^{(0)}‖/(‖b‖ ε_r),

where 1/‖R‖ > 1 (assuming ‖R‖ < 1).
In practice, the iteration matrix R often depends on n in a very unfavourable way while the
other data A, b and e(0) do not affect the number of steps that much.
Assumption 22.1.
2. kRk = 1 − h(n) with a positive function h such that h(n) = Θ(n−β ) as n → ∞ with some
β > 0.
Theorem 22.2. Under Assumption 22.1, the computational cost to achieve (22.1) is bounded
by a function C(n, εr ) satisfying
LECTURE 22. COMPUTATIONAL COMPLEXITY OF LINEAR ITERATIVE
METHODS
Proof. (also covered in this video) Recall that log(1/(1 − x)) = x + x²/2 + x³/3 + ⋯. Thus, by Assumption 22.1.2

so that (log(‖R‖^{−1}))^{−1} = Θ(n^β) as n → ∞. From (22.2) and Assumption 22.1.3 we get
for the number of steps. Taking the cost per step into account (Assumption 22.1.1), the total
cost is
  k_#(n, ε_r) C_{one step}(n) = Θ(n^{α+β} log(ε_r^{−1})).
Assuming a polynomial dependence in Assumption 22.1.3 one would obtain additional log(n)
terms in the cost estimate.
• computing b − . . . is O(n),
So the essential cost is coming from the first step. If the matrix is sparse, i.e., the number of non-vanishing entries in each row is ∼ n^η with some η < 1, then the number of operations is O(n^{1+η}). For instance, η = 0 if A is tridiagonal as the number of non-vanishing entries in each row is bounded by a constant (namely 3). In any case, with α ∈ [1, 2] the general result on the computational cost in Theorem 22.2 is applicable.
M := L + ωD, N := U + (1 − ω)D, ω ∈ R.
Lecture 23
LECTURE 23. NONLINEAR ITERATIVE METHODS, STEEPEST DESCENT
  α^{(k−1)} = ⟨r^{(k−1)}, d^{(k−1)}⟩ / ‖d^{(k−1)}‖_A².  (23.5)
Before we start looking at possible search directions we make two observations:
1. The residual is subject to the iterative formula
Introducing an auxiliary variable h^{(k−1)} = Ar^{(k−1)} it is possible to formulate the algorithm such that only one matrix-vector multiplication per iteration step is required (exercise).
One observes that subsequent search directions are orthogonal with respect to the standard
scalar product: Thanks to (23.6) and d(k−1) = r(k−1)
Figure 23.1: Behaviour of SD, zigzag path to the minimum due to orthogonal (w.r.t. the
Euclidean scalar product) search directions.
iterate close to the minimum. It turns out that the main axes of the level set ellipsoids are the eigenspaces, and the stretching of the ellipsoids depends on the ratio of the eigenvalues, most prominently λ_max/λ_min. Recalling that for positive definite matrices ‖A‖₂ = λ_max and ‖A^{−1}‖₂ = 1/λ_min, we see that it is exactly the condition number κ₂(A) = ‖A‖₂‖A^{−1}‖₂ = λ_max/λ_min that can serve as a measure of how elongated the level sets are. We will see later on how the condition number influences the convergence of SD.
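A minimal Python sketch of the SD iteration with d^{(k−1)} = r^{(k−1)} and the step size (23.5) (function names are ours):

```python
# Steepest descent for SPD A: search direction d = r, step size
# alpha = <r, d> / ||d||_A^2 as in (23.5) with d = r.
def steepest_descent(A, b, x0, num_iter):
    n = len(b)
    mv = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    x = x0[:]
    for _ in range(num_iter):
        r = [b[i] - y for i, y in enumerate(mv(x))]   # residual r = b - Ax
        h = mv(r)                                     # h = A r
        denom = sum(r[i] * h[i] for i in range(n))    # ||r||_A^2
        if denom == 0.0:
            break                                     # r = 0: x is the solution
        alpha = sum(r[i] * r[i] for i in range(n)) / denom
        x = [x[i] + alpha * r[i] for i in range(n)]
    return x
```

For A = [2 −1; −1 2] and b = (1, 0)^T the iterates zigzag towards the solution (2/3, 1/3)^T, as in Figure 23.1.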
Lecture 24
Since h is convex and tends to infinity as |γ_0|, …, |γ_{l−1}| → ∞, it has a unique minimum γ̂. Recalling that ∇g(x) = Ax − b = −r and using the A-orthogonality of the search directions we obtain for all m = 0, …, l − 1 that

  0 = ∂h/∂γ_m (γ̂) = ⟨∇g(x^{(0)} + Σ_{i=0}^{l−1} γ̂_i d^{(i)}), d^{(m)}⟩
    = ⟨Ax^{(0)} − b + Σ_{i=0}^{l−1} γ̂_i Ad^{(i)}, d^{(m)}⟩ = −⟨r^{(0)}, d^{(m)}⟩ + Σ_{i=0}^{l−1} γ̂_i ⟨Ad^{(i)}, d^{(m)}⟩

where ⟨Ad^{(i)}, d^{(m)}⟩ = ‖d^{(m)}‖_A² if i = m and = 0 else, so only the term i = m remains in the sum.
For l = n and assuming that d^{(k)} ≠ 0 for all k = 0, …, n − 1 we have that x^{(0)} + span{d^{(0)}, …, d^{(n−1)}} = R^n, so the above lemma then means that x^{(n)} is the global minimiser
LECTURE 24. CONJUGATE GRADIENT METHOD
of g and the desired solution to Ax = b. Going back to the case n = 2 and Figure 23.1, choosing d^{(0)} = r^{(0)} as first search direction, this means that the second search direction would ensure jumping from x^{(1)} immediately to the minimum and we would avoid the zigzag path. So the big question is: how can we obtain A-orthogonal search directions?
More precisely, given d(0) , . . . , d(k−1) (and the x(i) and r(i) ), how can an appropriate d(k) A-
orthogonal to all the previous search directions be obtained? The following ideas go back to
Hestenes and Stiefel:
1. Given any v ∉ span{d^{(0)}, …, d^{(k−1)}}, such a vector can be computed via the Gram-Schmidt orthogonalisation method (applied with the A-scalar product, of course):
Apart from stability issues, this becomes very expensive when k becomes big.
2. The choice v = r(k) = −∇g(x(k) ) is quite reasonable: If r(k) ∈ span{d(0) , . . . , d(k−1) } then
r(k) = 0 because x(k) = x already is the minimum by the preceding result in Lemma 24.1
and we would stop the iteration anyway. Moreover, since r(k) points in the direction of
the steepest descent it gives a good idea into which direction roughly to proceed next. So set d^{(k)} := d̃^{(k)}(r^{(k)}).
3. Set d(0) = r(0) . Consequently, the first step of the iteration is the same as for SD.
The most amazing point with the above choices is that ⟨d^{(i)}, r^{(k)}⟩_A = 0 for i = 0, …, k − 2 as we shall prove later on. This means that

  d^{(k)} = r^{(k)} − (⟨d^{(k−1)}, r^{(k)}⟩_A/‖d^{(k−1)}‖_A²) d^{(k−1)} − Σ_{i=0}^{k−2} (⟨d^{(i)}, r^{(k)}⟩_A/‖d^{(i)}‖_A²) d^{(i)} = r^{(k)} + β^{(k)} d^{(k−1)},

with β^{(k)} := −⟨d^{(k−1)}, r^{(k)}⟩_A/‖d^{(k−1)}‖_A², the sum vanishing by the claimed orthogonality.
It is possible to formulate the algorithm such that only one matrix-vector multiplication per
iteration step is required (exercise).
Example: Consider the data

  A = [2 −1; −1 2],  b = (1, 0)^T,  x^{(0)} = (0, 0)^T.
Step k = 2:

  h^{(1)} := Ad^{(1)} = (0, 3/4)^T,
  α^{(1)} := ‖r^{(1)}‖₂² / ⟨d^{(1)}, h^{(1)}⟩ = (1/4)/(3/8) = 2/3,
  x^{(2)} := x^{(1)} + α^{(1)} d^{(1)} = (1/2, 0)^T + (2/3)(1/4, 1/2)^T = (2/3, 1/3)^T,
  r^{(2)} := r^{(1)} − α^{(1)} h^{(1)} = (0, 1/2)^T − (2/3)(0, 3/4)^T = (0, 0)^T.
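The whole iteration fits in a few lines; a Python sketch (function names are ours; β is computed in the equivalent form ‖r^{(k)}‖₂²/‖r^{(k−1)}‖₂², which is standard for CG):

```python
# Conjugate gradient iteration: alpha = ||r||_2^2 / <d, Ad>,
# beta = ||r_new||_2^2 / ||r_old||_2^2, d_new = r_new + beta * d.
def cg(A, b, x0, num_iter):
    n = len(b)
    mv = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    x = x0[:]
    r = [b[i] - y for i, y in enumerate(mv(A, x))]
    d = r[:]                            # first step identical to SD
    for _ in range(num_iter):
        rr = sum(t * t for t in r)
        if rr == 0.0:
            break                       # residual vanished: x is exact
        h = mv(A, d)                    # the single matrix-vector product
        alpha = rr / sum(d[i] * h[i] for i in range(n))
        x = [x[i] + alpha * d[i] for i in range(n)]
        r = [r[i] - alpha * h[i] for i in range(n)]
        beta = sum(t * t for t in r) / rr
        d = [r[i] + beta * d[i] for i in range(n)]
    return x
```

For the data above, two steps reproduce x^{(2)} = (2/3, 1/3)^T with vanishing residual.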
Lecture 25
More on CG
We have motivated the use of conjugate or A-orthogonal search directions in order to improve the convergence in comparison with the straightforward steepest descent method. An open issue has been the claim that the update of the search direction in each step, which is based on a Gram-Schmidt orthogonalisation, is cheap as most of the terms drop out.
Lemma 25.1. Let x^{(1)}, …, x^{(k)} be the iterates computed by CG and assume that r^{(0)}, …, r^{(k)}, d^{(0)}, …, d^{(k−1)} ≠ 0. Then
Proof. Using the update formulas r^{(k)} = r^{(k−1)} − α^{(k−1)} Ad^{(k−1)} and the one for α^{(k−1)} we obtain that

  ⟨d^{(k−1)}, r^{(k)}⟩ = ⟨d^{(k−1)}, r^{(k−1)}⟩ − (⟨d^{(k−1)}, r^{(k−1)}⟩/‖d^{(k−1)}‖_A²) ⟨d^{(k−1)}, Ad^{(k−1)}⟩ = 0  (25.1)

which proves (2). A consequence of this Euclidean orthogonality is that

  ‖d^{(k)}‖₂² = ‖r^{(k)} + β^{(k)} d^{(k−1)}‖₂² = ‖r^{(k)}‖₂² + |β^{(k)}|² ‖d^{(k−1)}‖₂² ≥ ‖r^{(k)}‖₂² > 0
and thanks to the choice d(0) = r(0) this is also true for k = 1. This implies assertion (4),
To show (3) we first observe that for k > 1 thanks to the A orthogonality of the d(i)
and by the choice d(0) = r(0) this is also true for the case k = 1. Therefore, using the already
proved identity (4)
for l = 0, . . . , k − 1.
Proof. (available as video recording here) We start with proving the second assertion by induc-
tion where the case l = 0 is clear thanks to the choice d(0) = r(0) . So let l > 0 and assume
that (25.2) is true for l − 1. Since α^{(l−1)} ≠ 0 we have that Ad^{(l−1)} = (1/α^{(l−1)})(r^{(l−1)} − r^{(l)}) ∈ span{r^{(l−1)}, r^{(l)}}. Therefore, using the induction hypothesis,
Now, r^{(l)} = d^{(l)} − β^{(l)} d^{(l−1)} ∈ span{d^{(l−1)}, d^{(l)}} which, with the induction hypothesis, yields that
Finally, since Ad^{(l−1)} ∈ span{r^{(0)}, …, A^l r^{(0)}} and using the induction hypothesis again,

  d^{(l)} = r^{(l)} + β^{(l)} d^{(l−1)} = r^{(l−1)} − α^{(l−1)} Ad^{(l−1)} + β^{(l)} d^{(l−1)} ∈ span{r^{(0)}, …, A^l r^{(0)}}

so that

  span{d^{(0)}, …, d^{(l)}} ⊆ span{r^{(0)}, …, A^l r^{(0)}}.
Let us now come to the assertion on the A-orthogonality. We show this by induction, too, where the case k = 1 is trivially fulfilled. So let k > 1 and assume that d^(0), …, d^(k−1) are A-orthogonal.

Consider an index i ≤ k − 1. Then for all j = i + 1, …, k
where the second term vanishes thanks to the induction hypothesis. But since ⟨d^(i), r^(i+1)⟩ = 0 by Lemma 25.1 we conclude that

where i = 0, …, k − 1.

Let l < k − 1. Then thanks to (25.2), A d^(l) ∈ span{r^(l), r^(l+1)} ⊂ span{d^(0), …, d^(l+1)}, so that, using (25.3), ⟨d^(l), r^(k)⟩_A = ⟨A d^(l), r^(k)⟩ = 0. Consequently,
are called Krylov subspaces and play a prominent role in other iterative methods such as GMRES and BiCGstab which, indeed, are even termed Krylov (sub)space methods. A characteristic property of these methods is that the increment lies in the current Krylov subspace:

Theorem 25.3. The CG algorithm reaches the exact solution of the SLE in at most n steps for any x^(0).
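Theorem 25.3 can be observed numerically. The following is a minimal plain-Python sketch of the CG iteration (the helper names `dot`, `matvec` and `cg` are ours, not from the notes); on a 3×3 SPD system the residual vanishes, up to rounding, after at most n = 3 steps.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def cg(A, b, x0, tol=1e-12, max_steps=None):
    # plain CG: alpha = ||r||_2^2 / <d, Ad>, beta = ||r_new||_2^2 / ||r||_2^2
    n = len(b)
    max_steps = n if max_steps is None else max_steps
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    d = list(r)                 # d^(0) = r^(0)
    rs = dot(r, r)
    for _ in range(max_steps):
        if rs <= tol:
            break
        h = matvec(A, d)
        alpha = rs / dot(d, h)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * hi for ri, hi in zip(r, h)]
        rs_new = dot(r, r)
        beta = rs_new / rs      # Gram-Schmidt coefficient
        d = [ri + beta * di for ri, di in zip(r, d)]
        rs = rs_new
    return x

# SPD test system; CG terminates after at most n = 3 steps
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = cg(A, b, [0.0, 0.0, 0.0])
res = [bi - axi for bi, axi in zip(b, matvec(A, x))]
print(max(abs(ri) for ri in res))   # residual at rounding level
```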
Lecture 26

Error Analysis - Comparison of SD and CG

Error analysis for iterative methods is convergence analysis; rounding errors are not taken into account. The concepts for analysing SD and CG are similar, which is why they are presented together. We start with a helpful lemma relating the energy and the Euclidean norm.
Lemma 26.1. Assume that ‖e^(k)‖_A ≤ c q^k ‖e^(0)‖_A for all k ∈ ℕ and some constants c, q > 0. Then

    ‖e^(k)‖₂ ≤ √(κ₂(A)) c q^k ‖e^(0)‖₂  ∀ k ∈ ℕ.

Proof. Denoting the minimal and maximal eigenvalue of A by λ_min and λ_max, respectively, and recalling that κ₂(A) = λ_max/λ_min:

    ‖e^(k)‖₂² ≤ (1/λ_min) ‖e^(k)‖_A² ≤ (1/λ_min) c² q^{2k} ‖e^(0)‖_A² ≤ (λ_max/λ_min) (c q^k)² ‖e^(0)‖₂².
The first result concerns SD (the previous and following proofs are also explained as recordings).

Theorem 26.2. The convergence rate of SD satisfies

    ‖e^(k)‖_A ≤ (√(1 − 1/κ₂(A)))^k ‖e^(0)‖_A.
Proof. Using the estimates ‖v‖_A² ≤ λ_max ‖v‖₂² and ‖v‖_{A⁻¹}² ≤ (1/λ_min) ‖v‖₂² we obtain that this is

    ≤ (1 − ‖r^(k−1)‖₂⁴ / (λ_max ‖r^(k−1)‖₂² · λ_min^{−1} ‖r^(k−1)‖₂²)) g(x^(k−1)) = (1 − 1/κ₂(A)) g(x^(k−1)).

Therefore

    g(x^(k)) ≤ (1 − 1/κ₂(A))^k g(x^(0)).

Using that g(x^(l)) = ½ ‖e^(l)‖_A² for l = 0, k (see (23.2)) yields the assertion.
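The bound of Theorem 26.2 can be checked experimentally. Below is a small steepest-descent sketch (the helper names and the 2×2 test matrix are ours): A has eigenvalues 1 and 10, so κ₂(A) = 10, and the energy-norm error stays below (1 − 1/κ₂(A))^{k/2} ‖e^(0)‖_A in every step.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def sd_step(A, b, x):
    # one SD step: x <- x + alpha r with alpha = ||r||_2^2 / <r, Ar>
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    alpha = dot(r, r) / dot(r, matvec(A, r))
    return [xi + alpha * ri for xi, ri in zip(x, r)]

# SPD matrix with eigenvalues 1 and 10, hence kappa_2(A) = 10
A = [[1.0, 0.0], [0.0, 10.0]]
x_exact = [1.0, 1.0]
b = matvec(A, x_exact)
kappa = 10.0

def err_A(x):
    # energy norm ||x_exact - x||_A of the error
    e = [xe - xi for xe, xi in zip(x_exact, x)]
    return math.sqrt(dot(e, matvec(A, e)))

x = [0.0, 0.0]
e0 = err_A(x)
for k in range(1, 21):
    x = sd_step(A, b, x)
    assert err_A(x) <= (1 - 1 / kappa) ** (k / 2) * e0 + 1e-12
print("Theorem 26.2 bound holds for 20 steps")
```

The observed zig-zag convergence is in fact faster than the bound, which is only an upper estimate.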
For CG we have the following result, where P^k denotes the set of polynomials p of degree ≤ k with p(0) = 1 and Λ(A) is the set of eigenvalues of A:
Proof. We only prove the first equality. The second one is an exercise.

As x^(0) − x^(k) is an element of the Krylov space K_k(A, r^(0)) = span{r^(0), …, A^{k−1} r^(0)} we can write

    e^(k) = x − x^(k) = x − x^(0) + x^(0) − x^(k) = e^(0) + Σ_{j=0}^{k−1} η_{j+1} A^j r^(0) = (I + Σ_{j=1}^{k} η_j A^j) e^(0),

where we used that r^(0) = A e^(0). Conversely, for any p ∈ P^k with coefficients γ_j,

    p(A) e^(0) = e^(0) + Σ_{j=1}^{k} γ_j A^j e^(0) = x − x^(0) + Σ_{j=0}^{k−1} γ_{j+1} A^j r^(0),

where Σ_{j=0}^{k−1} γ_{j+1} A^j r^(0) ∈ K_k(A, r^(0)).

But as we have seen in Lemma 24.1, the iterate x^(k) minimises y ↦ ‖y − x‖_A = √(2 g(y)) on x^(0) + K_k(A, r^(0)). Therefore

    ‖e^(k)‖_A = ‖x − x^(k)‖_A ≤ ‖x − x^(0) + Σ_{j=0}^{k−1} γ_{j+1} A^j r^(0)‖_A = ‖p(A) e^(0)‖_A.
A recording of the previous result, as well as an introduction to the following topic, can be found here.
Chebyshev polynomials

These polynomials are defined by

    T_n(x) := ½ ((x + √(x² − 1))^n + (x − √(x² − 1))^n), |x| ≥ 1,

and fulfil the recursive formula

    T_0(x) = 1,  T_1(x) = x,  T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x),  n = 1, 2, …

A further definition is

    T_n(x) = cos(n arccos(x)),  x ∈ (−1, 1).

We see that max_{|x|≤1} |T_n(x)| ≤ 1, and indeed the Chebyshev polynomials play an important role when optimising with respect to the norm ‖·‖_∞.
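The equivalence of these definitions is easy to test numerically; the sketch below (function names are ours) compares the three-term recursion with the closed form for |x| ≥ 1 and with the cosine form on (−1, 1).

```python
import math

def cheb_rec(n, x):
    # three-term recursion T0 = 1, T1 = x, T_{n+1} = 2x T_n - T_{n-1}
    if n == 0:
        return 1.0
    t_prev, t = 1.0, x
    for _ in range(n - 1):
        t_prev, t = t, 2 * x * t - t_prev
    return t

def cheb_closed(n, x):
    # closed form for |x| >= 1
    s = math.sqrt(x * x - 1)
    return ((x + s) ** n + (x - s) ** n) / 2

def cheb_cos(n, x):
    # cos form for |x| < 1
    return math.cos(n * math.acos(x))

for n in range(6):
    assert abs(cheb_rec(n, 1.5) - cheb_closed(n, 1.5)) < 1e-9
    assert abs(cheb_rec(n, 0.3) - cheb_cos(n, 0.3)) < 1e-9
print("recursion matches both closed forms")
```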
Let λ_max and λ_min denote the maximal and minimal eigenvalue of A and consider the rescaled Chebyshev polynomial

    p(x) := T_n(γ − 2x/(λ_max − λ_min)) / T_n(γ)

where

    γ = (λ_max + λ_min)/(λ_max − λ_min) = (λ_max/λ_min + 1)/(λ_max/λ_min − 1) = (κ₂(A) + 1)/(κ₂(A) − 1),    (26.2)

and where we recall that ‖A‖₂ = λ_max and ‖A⁻¹‖₂ = λ_min^{−1} so that κ₂(A) = λ_max/λ_min. For x ∈ [λ_min, λ_max] we have that γ − 2x/(λ_max − λ_min) ∈ [−1, 1], and since then |T_n(γ − 2x/(λ_max − λ_min))| ≤ 1 we arrive at

    |p(x)| ≤ 1/T_n(γ),  x ∈ [λ_min, λ_max].    (26.3)
Writing κ := κ₂(A) we first observe that

    (κ+1)/(κ−1) ± √((κ+1)²/(κ−1)² − 1) = (κ+1)/(κ−1) ± √(((κ+1)² − (κ−1)²)/(κ−1)²)
        = (κ + 1 ± √(4κ))/(κ − 1) = (√κ ± 1)² / ((√κ + 1)(√κ − 1)) = ((√κ + 1)/(√κ − 1))^{±1}.

Thanks to (26.2) and (26.3)

    T_n(γ) = ½ (((κ+1)/(κ−1) + √((κ+1)²/(κ−1)² − 1))^n + ((κ+1)/(κ−1) − √((κ+1)²/(κ−1)² − 1))^n)
        = ½ (((√κ + 1)/(√κ − 1))^n + ((√κ − 1)/(√κ + 1))^n).

Using (26.3) we see that for all x ∈ [λ_min, λ_max]

    |p(x)| ≤ 2 (((√κ + 1)/(√κ − 1))^n + ((√κ − 1)/(√κ + 1))^n)^{−1}.    (26.4)
Since p ∈ P^k for k = n and since estimate (26.4) holds true for all x ∈ Λ(A) ⊂ [λ_min, λ_max] we deduce from Theorem 26.3 the following result:

Theorem 26.4. [6.20] If CG has not yet converged after step k then

    ‖e^(k)‖_A ≤ 2 (((√(κ₂(A)) + 1)/(√(κ₂(A)) − 1))^k + ((√(κ₂(A)) + 1)/(√(κ₂(A)) − 1))^{−k})^{−1} ‖e^(0)‖_A.
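To see what the square root of the condition number gains, one can compare the per-step contraction factors of the two bounds: √(1 − 1/κ) from Theorem 26.2 for SD against (√κ − 1)/(√κ + 1), the rate hidden in the bound of Theorem 26.4 for CG. A small illustrative script (ours, not from the notes):

```python
import math

# per-step contraction factors of the two error bounds
for kappa in (10.0, 100.0, 10_000.0):
    sd_rate = math.sqrt(1 - 1 / kappa)                         # Theorem 26.2
    cg_rate = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)  # Theorem 26.4
    print(f"kappa = {kappa:7.0f}: SD {sd_rate:.6f}  vs  CG {cg_rate:.6f}")
```

For κ = 10 000 the SD factor is about 0.99995 per step while the CG factor is about 0.98, i.e. CG needs roughly the square root of the number of SD steps.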
Lecture 27

Computational Complexity and Preconditioning

This lecture is available as recording alongside helpful extra materials on our Moodle page.

For the computational complexity we proceed analogously as in the context of the linear iterative methods: get an estimate for the required number of steps in terms of the system size n and the tolerance ε_r, and then multiply with the cost per step.
Assumption 27.1.

1. Computing Ax involves a cost of Θ(n^α) as n → ∞ with some α ∈ [1, 2].
2. κ₂(A) = Θ(n^β) as n → ∞ with some β ≥ 0.
3. ‖e^(0)‖_A is uniformly bounded in n.

Theorem 27.2. Under Assumption 27.1, the cost to achieve ‖e^(k)‖_A ≤ ε_r with SD is bounded by a function C(n, ε_r) satisfying

    C(n, ε_r) = Θ(n^{α+β} log(ε_r^{−1})) as (n, ε_r) → (∞, 0).
Proof. By Theorem 26.2, ‖e^(k)‖_A ≤ (√(1 − 1/κ₂(A)))^k ‖e^(0)‖_A, hence it is sufficient to achieve that

    ‖e^(0)‖_A / ε_r ≤ (1/√(1 − 1/κ₂(A)))^k  ⇔  k ≥ (log(‖e^(0)‖_A) + log(ε_r^{−1})) / log(1/√(1 − 1/κ₂(A))) =: k^♯(n, ε_r).
Theorem 27.3. Under Assumption 27.1, the cost to achieve ‖e^(k)‖_A ≤ ε_r with CG is bounded by a function C(n, ε_r) satisfying

    C(n, ε_r) = Θ(n^{α+β/2} log(ε_r^{−1})) as (n, ε_r) → (∞, 0).
Proof. As for the previous theorem but based on the estimate in Theorem 26.4. The fact that there only √(κ₂(A)) appears rather than κ₂(A) as in the estimate for SD (see Theorem 26.2) leads to the prefactor ½ in front of β.
Preconditioning

Evidently, the lower the condition number κ₂(A) the better the convergence of SD and CG, and this holds true for Krylov space methods in general. Since for every regular B ∈ ℝ^{n×n} the solution of Ax = b can be computed by solving AB x̃ = b and setting x = B x̃, an appropriate choice of B may yield a system with κ(AB) < κ(A). But a problem emerges: even if B ∈ ℝ^{n×n} is positive definite (which we assume from now on) there is no guarantee that AB is symmetric positive definite. To solve this problem let us recall what we effectively need for CG. The symmetry of A is equivalent to

    ⟨Ax, y⟩ = ⟨x, Ay⟩  ∀ x, y ∈ ℝⁿ.

This means that A considered as a linear operator on ℝⁿ is self-adjoint with respect to the standard inner product. Consider now the bilinear form

    ⟨x, y⟩_B := ⟨x, By⟩

which, thanks to the positivity of B, is an inner product (recall the arguments around (1.4) on page 15, and recall also the notation ‖·‖_B for the associated norm). It turns out that Ã = AB is self-adjoint with respect to ⟨·,·⟩_B: for all x, y ∈ ℝⁿ

    ⟨Ãx, y⟩_B = ⟨ABx, By⟩ = ⟨x, BABy⟩ = ⟨x, ABy⟩_B = ⟨x, Ãy⟩_B.

Hence, we may just replace the standard inner product by ⟨·,·⟩_B in algorithm CG and thus obtain a method for computing the solution of AB x̃ = b. The essential lines read as follows, where we inserted the definition of ⟨·,·⟩_B:
1: d^(0) := r^(0) := b − AB x̃^(0)
2: l^(0) := ⟨r^(0), B r^(0)⟩
3: if l^(0) ≤ ε_r then
4: return x̃^(0)
5: else
6: for k = 1, 2, … do
7: h^(k−1) := AB d^(k−1)
8: α^(k−1) := l^(k−1)/⟨d^(k−1), B h^(k−1)⟩ = ⟨r^(k−1), B r^(k−1)⟩/⟨B d^(k−1), AB d^(k−1)⟩
9: x̃^(k) := x̃^(k−1) + α^(k−1) d^(k−1)
10: r^(k) := r^(k−1) − α^(k−1) h^(k−1) = r^(k−1) − α^(k−1) AB d^(k−1)
11: l^(k) := ⟨r^(k), B r^(k)⟩
12: if l^(k) ≤ ε_r then
13: return x̃^(k)
14: end if
15: β^(k) := l^(k)/l^(k−1) = ⟨r^(k), B r^(k)⟩/⟨r^(k−1), B r^(k−1)⟩
16: d^(k) := r^(k) + β^(k) d^(k−1)
17: end for
The big question now: how to choose B? One will want the computation of s^(k) = B r^(k) to be cheap. But a good approximation of A^{−1} or, at least, keeping κ(AB) low, is desired as well. Finding a good trade-off depends strongly on the actual problem and the computing hardware used, so it usually requires some testing and trying. Some ideas for defining preconditioners are stated below. Observe that in algorithm PCG we need the action of B on a vector, which may be cheaper to compute than building the matrix B and employing the usual matrix-vector product algorithm.
• Very simple but occasionally highly effective: choose B := D^{−1} where D is the diagonal of A.
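As a concrete illustration of the PCG algorithm above with this diagonal choice, here is a plain-Python sketch (the names `pcg_jacobi`, `applyB` etc. are ours); the action of B = D^{−1} is a cheap diagonal scaling, and the returned solution of Ax = b is x = B x̃.

```python
def pcg_jacobi(A, b, eps=1e-24, max_steps=None):
    # CG in the <.,.>_B inner product for AB xt = b, with B = D^{-1} (Jacobi);
    # x = B xt then solves A x = b
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]
    Bdiag = [1.0 / A[i][i] for i in range(len(A))]
    applyB = lambda v: [bd * vi for bd, vi in zip(Bdiag, v)]

    xt = [0.0] * len(b)              # xtilde^(0)
    r = list(b)                      # r^(0) = b - AB xt^(0)
    d = list(r)                      # d^(0) = r^(0)
    l = dot(r, applyB(r))            # l^(0) = <r^(0), B r^(0)>
    for _ in range(max_steps or len(b)):
        if l <= eps:
            break
        h = matvec(A, applyB(d))     # h = A B d
        alpha = l / dot(applyB(d), h)   # <B d, A B d>
        xt = [xi + alpha * di for xi, di in zip(xt, d)]
        r = [ri - alpha * hi for ri, hi in zip(r, h)]
        l_new = dot(r, applyB(r))
        beta = l_new / l
        d = [ri + beta * di for ri, di in zip(r, d)]
        l = l_new
    return applyB(xt)                # x = B xtilde

# SPD example: the exact solution of A x = b is (1/11, 7/11)
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = pcg_jacobi(A, b)
print(x)   # ~ [0.0909..., 0.6363...]
```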
• An approximate Cholesky factorisation A ≈ Ã = L̃L̃^T; the preconditioner then is B := (L̃L̃^T)^{−1}.
• The linear iterative solvers can serve as preconditioners as well. For instance, the action of B on a vector may correspond to a few steps of the Jacobi method.
• Modern general preconditioners also include (possibly algebraic) multigrid methods (not
discussed in this course).
Lecture 28

Introduction to Eigenvalue Problems

Let A ∈ ℂ^{n×n}. Suppose that Ax = αx with some x ∈ ℂⁿ \ {0} and α ∈ ℂ. Then ⟨x, Ax⟩ = α⟨x, x⟩. The fraction

    r_A(x) := ⟨x, Ax⟩/⟨x, x⟩

is called the Rayleigh quotient. From the preceding calculation we see that Ax = r_A(x) x if x is an eigenvector. Remarkably, this provides a method to compute the eigenvalue corresponding to an eigenvector.
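As a quick sanity check of r_A (a tiny script of ours, not from the notes): for the symmetric matrix below, x = (1, 1)^T is an eigenvector for the eigenvalue 3, and r_A recovers it.

```python
def rayleigh(A, x):
    # r_A(x) = <x, Ax> / <x, x>
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    return sum(xi * yi for xi, yi in zip(x, Ax)) / sum(xi * xi for xi in x)

A = [[2.0, 1.0], [1.0, 2.0]]
print(rayleigh(A, [1.0, 1.0]))   # 3.0
```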
Finding the eigenvalues of A is equivalent to finding the roots of the characteristic polynomial. However, there is the following result of Abel (1824):

Theorem 28.1. If n ≥ 5 then there is a polynomial of degree n with rational coefficients that has a real root which cannot be expressed by only using rational numbers, +, −, ∗, /, and (·)^{1/k} with k ∈ ℕ.
Conditioning

The question is what impact a small perturbation ∆A of A has on the eigenvalues.

Let λ(A) ∈ ℂⁿ denote the vector of eigenvalues, ordered by decreasing absolute value and repeated according to their algebraic multiplicity. The coefficients of the characteristic polynomial ρ_A(z) of A depend continuously on the matrix entries, whence so do the roots. Therefore, λ : ℂ^{n×n} → ℂⁿ is a continuous function.
Theorem 28.2. Let λ be a simple eigenvalue of A with associated right and left normalised eigenvectors x, y. Then for all sufficiently small ∆A ∈ ℂ^{n×n} the matrix A + ∆A has an eigenvalue λ + ∆λ with

    ∆λ = ⟨y, ∆A x⟩/⟨x, y⟩ + O(‖∆A‖₂²) as ‖∆A‖₂ → 0.
Definition 28.3. Given A ∈ ℂ^{n×n} and an eigenvalue λ ∈ ℂ, let x, y ∈ ℂⁿ denote normalised right and left eigenvectors, respectively. Then the eigenvalue condition number is

    κ_A(λ) := min_{x,y} 1/|⟨x, y⟩|, the minimum over normalised right and left eigenvectors x, y with ⟨x, y⟩ ≠ 0,

and κ_A(λ) := ∞ if no such right and left eigenvectors x, y ∈ ℂⁿ exist.
Corollary 28.4. Let λ ∈ ℂ be a simple eigenvalue of A ∈ ℂ^{n×n} with corresponding right and left normalised eigenvectors x ∈ ℂⁿ and y ∈ ℂⁿ. Then for all sufficiently small ∆A ∈ ℂ^{n×n} the matrix A + ∆A has an eigenvalue λ + ∆λ with

    |∆λ| ≤ κ_A(λ) ‖∆A‖₂ + O(‖∆A‖₂²) as ‖∆A‖₂ → 0.

A proof of the above Theorem 28.2 is given in Stuart & Voss [1], Theorem 3.15. We look at an example with a non-simple eigenvalue in order to see why things go wrong in that case.
Example: (1) Consider the matrix

    A = (1 1; 0 1).

λ = 1 is an eigenvalue of A with algebraic multiplicity 2. A right eigenvector is x = (1, 0)^T and a left eigenvector is y = (0, 1), which means that κ_A(λ) = ∞. In fact, consider

    ∆A = (0 0; δ 0).

The matrix A + ∆A has eigenvalues 1 ± √δ so that |∆λ| = √|δ| but not O(‖∆A‖₂) = O(|δ|). In this example we have δ ↦ λ₁(δ) = 1 + √δ ∈ ℂ for the first eigenvalue, and this curve is continuous in δ = 0 but not differentiable.
(2) If A is Hermitian then the left and right eigenspaces coincide so that κA (λ) = 1 in this case.
Theorem 28.5.

(a) A vector x ∈ ℝⁿ \ {0} is an eigenvector of A with eigenvalue λ if and only if r_A(x) = λ and ∇r_A(x) = 0.

(b) If x ∈ ℝⁿ is an eigenvector of A then

    r_A(z) − r_A(x) = O(‖z − x‖₂²) as z → x.

Proof. (a) A direct computation gives

    ∇r_A(z) = 2(Az − r_A(z) z)/‖z‖₂²

for any z ≠ 0. If Ax = λx then r_A(x) = ⟨x, Ax⟩/⟨x, x⟩ = λ and ∇r_A(x) = 2(λx − r_A(x) x)/‖x‖₂² = 0 as claimed.

Vice versa, if ∇r_A(x) = 0 then Ax = r_A(x) x so that (x, r_A(x)) is an eigenpair of A.

(b) This follows from considering the Taylor expansion of r_A around x and using part (a).
Lecture 29
Power Iteration
Recall that we consider the case of A ∈ ℝ^{n×n} symmetric. Denote the eigenvalues by |λ₁| ≥ |λ₂| ≥ … ≥ |λ_n| with corresponding normalised eigenvectors x₁, …, x_n ∈ ℝⁿ.

Idea: iterate z^(k) = A z^(k−1) and hope that the iterates align with the eigenspace of λ₁.

Example: consider

    A = (1 0; 0 1/2),  z^(0) = (1, 1)^T.

Then

    z^(1) = (1, 2^{−1})^T,  z^(2) = (1, 2^{−2})^T,  …,  z^(k) = (1, 2^{−k})^T → (1, 0)^T,

which is an eigenvector to λ₁ = 1.
Observe that λ^(k−1) = r_A(z^(k−1)). When the goal is to compute the eigenvalue (rather than an eigenvector) then a criterion of the form |λ^(k) − λ^(k−1)| ≤ ε_r may be used.
For computing an eigenvector, a possible stopping criterion is ‖z^(k) − z^(k−1)‖₂ ≤ ε_r or ‖z^(k) + z^(k−1)‖₂ ≤ ε_r, where we have to distinguish the sign because λ₁ may be negative. In fact, for

    A = (−1 0; 0 1/2),  z^(0) = (1, 1)^T

we obtain the iterates

    z^(1) = (−1, 2^{−1})^T,  z^(2) = (1, 2^{−2})^T,  z^(3) = (−1, 2^{−3})^T,  …,  z^(k) = ((−1)^k, 2^{−k})^T

which do not converge, but the vectors (−1)^k z^(k) = (1, (−1)^k 2^{−k})^T converge to (1, 0)^T.
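The normalised iteration on the first example above can be sketched in a few lines (the helper names are ours); the Rayleigh quotient supplies the eigenvalue estimate λ^(k) = r_A(z^(k)).

```python
import math

def power_iteration(A, z, steps):
    # z^(k) = A z^(k-1), normalised; lam = r_A(z) at the end
    for _ in range(steps):
        w = [sum(a * zi for a, zi in zip(row, z)) for row in A]
        nrm = math.sqrt(sum(wi * wi for wi in w))
        z = [wi / nrm for wi in w]
    Az = [sum(a * zi for a, zi in zip(row, z)) for row in A]
    lam = sum(zi * yi for zi, yi in zip(z, Az))   # Rayleigh quotient, ||z|| = 1
    return lam, z

A = [[1.0, 0.0], [0.0, 0.5]]
lam, z = power_iteration(A, [1.0, 1.0], 30)
print(lam, z)   # lam ~ 1.0, z ~ (1, 0)
```

Swapping in the second matrix (−1 0; 0 1/2) makes the sign flip of z^(k) visible.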
Error Analysis
Theorem 29.1. Assume that |λ₁| > |λ₂| ≥ … and that α₁ := ⟨x₁, z^(0)⟩ ≠ 0. Then the sequences {z^(k)}_k and {λ^(k)}_k generated by PI satisfy

    ‖z^(k) − σ^(k) x₁‖₂ = O(|λ₂/λ₁|^k),  |λ^(k) − λ₁| = O(|λ₂/λ₁|^{2k})  as k → ∞,

where σ^(k) = α₁ λ₁^k / |α₁ λ₁^k| ∈ {±1}.
Proof. (recording available) Let us write z^(0) = Σ_{i=1}^n α_i x_i. Since α₁ ≠ 0

    A^k z^(0) = Σ_{i=1}^n α_i λ_i^k x_i = α₁ λ₁^k (x₁ + Σ_{i=2}^n (α_i/α₁)(λ_i/λ₁)^k x_i).

Therefore

    z^(k) = A^k z^(0) / ‖A^k z^(0)‖₂ = (A^k z^(0) / |α₁ λ₁^k|) · 1/√(1 + γ_k) = σ^(k) (x₁ + Σ_{i=2}^n (α_i/α₁)(λ_i/λ₁)^k x_i) · 1/√(1 + γ_k)

with γ_k := Σ_{i=2}^n |α_i/α₁|² |λ_i/λ₁|^{2k} and σ^(k) = α₁ λ₁^k / |α₁ λ₁^k|. Thus,

    ‖z^(k) − σ^(k) x₁‖₂ ≤ ‖z^(k) − A^k z^(0)/|α₁ λ₁^k|‖₂ + ‖A^k z^(0)/|α₁ λ₁^k| − σ^(k) x₁‖₂
        = (1 − 1/√(1 + γ_k)) ‖A^k z^(0)/|α₁ λ₁^k|‖₂ + ‖Σ_{i=2}^n (α_i/α₁)(λ_i/λ₁)^k x_i‖₂
        = (1 − 1/√(1 + γ_k)) √(1 + γ_k) + √γ_k
        = √(1 + γ_k) − 1 + √γ_k
        ≤ 2√γ_k.

Now,

    γ_k ≤ (1/|α₁|²) |λ₂/λ₁|^{2k} Σ_{i=2}^n |α_i|² |λ_i/λ₂|^{2k} ≤ (1/|α₁|²) |λ₂/λ₁|^{2k} Σ_{i=1}^n |α_i|² = (1/|α₁|²) |λ₂/λ₁|^{2k} ‖z^(0)‖₂²

using |λ_i/λ₂| ≤ 1 for i ≥ 2, so that

    √γ_k ≤ (‖z^(0)‖₂ / |α₁|) |λ₂/λ₁|^k,

from which we get the first assertion. The second follows with Theorem 28.5 (b):

    |λ^(k) − λ₁| = |r_A(z^(k)) − r_A(σ^(k) x₁)| = O(‖z^(k) − σ^(k) x₁‖₂²) = O(|λ₂/λ₁|^{2k}).
Computational Complexity

We proceed in the usual way for iterative methods, estimating the number of required steps and then multiplying with the cost per step.
The goal

    ‖z^(k) − σ^(k) x₁‖₂ ≤ ε_r    (29.1)

is achieved if

    2 (‖z^(0)‖₂ / |α₁|) |λ₂/λ₁|^k ≤ ε_r

which requires

    k^♯(n, ε_r) = (log(ε_r^{−1}) + log(2‖z^(0)‖₂) − log(|α₁|)) / log(|λ₁/λ₂|)

steps.
Assume now that

    |λ₁/λ₂| = 1 + Θ(n^{−β}) for some β > 0 as n → ∞,

that ‖z^(0)‖₂ and |α₁| are Θ(1) as n → ∞, and that the cost per iteration step is O(n) as n → ∞.
Theorem 29.2. Under the above assumptions, the cost to achieve (29.1) with PI is bounded by a function C(n, ε_r) satisfying

    C(n, ε_r) = O(n^{1+β} log(ε_r^{−1}) + n³) as (n, ε_r) → (∞, 0).
The proof is similar to the proof of Theorem 22.2, which explains the first term n^{1+β} log(ε_r^{−1}). Here, the β arises from the fact that λ₂/λ₁ converges to 1 as n → ∞, and the 1 comes from the cost per iteration step, which requires that the matrix-vector multiplication is O(n) as n → ∞. The difference to previous results is the additional term n³, which arises from transforming the initial matrix to a tridiagonal matrix.
Lecture 30

Simultaneous Iteration and the QR Method
If only the eigenvalues are required then there is a very elegant way to reformulate the simultaneous iteration. Define

    Q^(k) := (Z^(k−1))^T Z^(k).

Then

    Q^(k) R^(k) = (Z^(k−1))^T Z^(k) (Z^(k))^{−1} W^(k) = (Z^(k−1))^T W^(k) = Λ^(k)

and

    R^(k) Q^(k) = Λ^(k+1).

So when we have a QR factorisation of the current matrix approximating the eigenvalues, Λ^(k), then we only have to interchange the two factors to compute the next iterate Λ^(k+1).
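This "factorise, then interchange the factors" loop can be sketched directly (a small illustration of ours, using a Gram-Schmidt QR factorisation for small dense matrices); for a symmetric matrix the diagonal of the iterates approaches the eigenvalues.

```python
import math

def qr(M):
    # QR factorisation via modified Gram-Schmidt (small dense matrices)
    n = len(M)
    cols = [[M[i][j] for i in range(n)] for j in range(n)]
    q_cols, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j]
        for i in range(j):
            R[i][j] = sum(qi * vi for qi, vi in zip(q_cols[i], v))
            v = [vi - R[i][j] * qi for vi, qi in zip(v, q_cols[i])]
        R[j][j] = math.sqrt(sum(vi * vi for vi in v))
        q_cols.append([vi / R[j][j] for vi in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# QR method: factorise, interchange the factors, repeat
L = [[2.0, 1.0], [1.0, 2.0]]   # symmetric, eigenvalues 3 and 1
for _ in range(40):
    Q, R = qr(L)
    L = matmul(R, Q)
print(L[0][0], L[1][1])   # ~ 3.0 and ~ 1.0
```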
Bibliography
[1] Andrew Stuart and Jochen Voss, Matrix Analysis and Algorithms, Lecture notes.
[2] Roger A. Horn and Charles R. Johnson, Matrix Analysis, Cambridge University Press,
1985.
[3] Gene H. Golub and Charles F. van Loan, Matrix Computations, 3. ed., Johns Hopkins
University Press, 1996.
[4] Lloyd Trefethen and David Bau, Numerical Linear Algebra, SIAM, 1997.
[5] Nicholas Higham, Accuracy and Stability of Numerical Algorithms, SIAM, 2002.
[6] David Kincaid and Ward Cheney, Numerical Analysis, 3. ed., AMS, 2002.
[7] Thomas S. Shores, Applied Linear Algebra and Matrix Analysis, Springer, 2007.