Matrix Analysis and Algorithms Overview

The document outlines the MA398 Matrix Analysis and Algorithms course at the University of Warwick, covering key topics in numerical linear algebra such as Gaussian elimination, LU factorization, eigenvalue problems, and error analysis. It aims to provide a comprehensive understanding of algorithms for solving systems of linear equations, least squares problems, and eigenvalue problems, along with their computational costs and accuracy. The course includes various methods and algorithms, including iterative methods and matrix factorizations, to analyze and solve these mathematical problems.

MA398 Matrix Analysis and Algorithms

Mathematics Institute and Centre for Scientific Computing


University of Warwick
UNITED KINGDOM

December 13, 2021


Contents

1  Introduction
2  Gaussian Elimination
3  LU factorisation
4  Gaussian Elimination with Pivoting
5  Matrix Norms, Part I
6  Eigenvalues and Eigenvectors
7  Matrix Norms, Part II
8  Floating point representation
9  Error Analysis
10 Conditioning
11 Backward Error Analysis
12 Error Analysis of the Gaussian Elimination
13 Computational Cost
14 Divide & Conquer Algorithms
15 Least Squares Problems
16 More on the Singular Value Decomposition
17 Conditioning of LSQ
18 QR factorisation
19 QR factorisation with Householder reflections - continued
20 Linear Iterative Methods for SLE
21 The Jacobi Method
22 Computational Complexity of Linear Iterative Methods
23 Nonlinear Iterative Methods, Steepest Descent
24 Conjugate Gradient Method
25 More on CG
26 Error Analysis - Comparison of SD and CG
27 Computational Complexity and Preconditioning
28 Introduction to Eigenvalue Problems
29 Power Iteration
30 Simultaneous Iteration and the QR Method
Lecture 1

Introduction

The subject of this course is a collection of problems that are central to numerical linear algebra
and occur in many applications.

Systems of Linear Equations (SLE)


Given A ∈ Cn×n , b ∈ Cn
find x ∈ Cn such that Ax = b.
Much of the relevant theory has already been provided in courses on linear algebra, such as
solvability (non-singular matrices) and the characterisation of linear maps (Jordan canonical form).

Least SQuares problems (LSQ)


Given A ∈ Rm×n , b ∈ Rm (typically m ≥ n)
find x ∈ Rn such that ‖Ax − b‖2 is minimal.

Example: Regression (linear). Given points (ξi , yi ), i = 1, . . . , m, find a linear function
ξ 7→ x1 + x2 ξ such that

    g(x) := ( Σ_{i=1}^m (x1 + x2 ξi − yi )^2 )^{1/2}   is minimal.

In this case,

        ( 1  ξ1 )        ( y1 )
    A = ( ..  .. ) ,  b = ( ..  ) .
        ( 1  ξm )        ( ym )


EigenValue Problems (EVP)


Given A ∈ Cn×n
find (x, λ) ∈ (Cn \{0}) × C such that Ax = λx.
Sometimes, an eigenvector or an eigenvalue may be known, and the goal then is to compute the
corresponding eigenvalue or, respectively, a corresponding eigenvector.
Example: We consider a highly simplified weather model (available as supplementary recording
here). We distinguish only two states: sun and rain. If one day’s weather is s then the next
day’s weather is

• s with probability p ∈ (0, 1),

• r with probability 1 − p.

Similarly, there is a probability q ∈ (0, 1) for no change of state r and 1 − q for a change from r
to s.

[Diagram: two-state chain; s stays at s with probability p and moves to r with probability 1 − p;
r stays at r with probability q and moves to s with probability 1 − q.]
Let us denote the actual state by a vector µ ∈ R2 such that (1, 0) corresponds to s (i.e., we
associate state s with index 1) and (0, 1) corresponds to r (associate r with index 2). Let us
furthermore introduce the probability matrix P = (pi,j )2i,j=1 where pi,j denotes the probability
that the system being in state i changes to state j, i.e.,
 
    P = ( p      1 − p )
        ( 1 − q    q   ) .

Multiplying the actual state with P from the right then yields a vector of probabilities for the
next day’s state, for instance if we start with a sunny day, µ(0) := (1, 0), then
 
    µ(0) P = (1, 0) ( p      1 − p ) = (p, 1 − p) =: µ(1)
                    ( 1 − q    q   )

indeed contains the correct probabilities for s and r. The nice thing about this notation is
that we simply may go on multiplying with P from the right to obtain the probabilities for the
subsequent days, for instance

µ(2) := µ(1) P = (p2 + (1 − p)(1 − q), p(1 − q) + (1 − p)q)

contains the probabilities for s and r on the second day, and


 
    µ(k) := µ(k−1) P = µ(0) P^k                              (1.1)

the probabilities for the k-th day.


Experts will recognise this iteration as a discrete Markov chain. In applications, one often is
interested in stationary distributions π satisfying
    π = πP,    πi ≥ 0 ∀i,    Σi πi = 1,

which means that π is a left eigenvector of P for the eigenvalue 1. In fact, 1 always is an
eigenvalue of such probability matrices, so the goal is just to compute a corresponding
eigenvector. Of further interest is whether, and how fast, Markov chains converge to such
stationary states.
We will see iterations of the type of the Markov chain (1.1) again in the slightly more general
context of the so-called power iteration. Apart from convergence, another question that
immediately arises is how long one has to iterate in order to obtain an acceptable approximation
to a stationary state (as this is an iterative method, one will not expect to get an exact solution).
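The iteration (1.1) is easy to experiment with. A minimal pure-Python sketch, with illustrative transition probabilities p = 0.8 and q = 0.6 (these particular values are my own choice, not from the notes):

```python
# Two-state weather chain: index 0 = sun (s), index 1 = rain (r).
# P[i][j] is the probability of moving from state i to state j.
p, q = 0.8, 0.6               # illustrative values, not fixed in the notes
P = [[p, 1 - p],
     [1 - q, q]]

def step(mu, P):
    """One iteration mu -> mu P (row vector times matrix)."""
    return [sum(mu[i] * P[i][j] for i in range(2)) for j in range(2)]

mu = [1.0, 0.0]               # mu^(0): start with a sunny day
for _ in range(100):          # mu^(k) = mu^(k-1) P, as in (1.1)
    mu = step(mu, P)
```

For these values the iterates approach the stationary distribution π = (2/3, 1/3), which indeed satisfies π = πP.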

Aims and Objectives


The general goal is to understand the mathematical principles underlying the design of algo-
rithms for the previously mentioned problems in linear algebra, and their analysis with respect
to accuracy and cost.
At the end of the module you will be familiar with concepts and ideas related to:

• diverse matrix factorisations

– to obtain analytical results,


– as the theoretical basis for designing algorithms,

• assessing algorithms with respect to computational cost (efficiency),

• error analysis, splitting up into

– conditioning of problems,
– stability of algorithms,

• differences in the analysis of direct versus iterative methods.

Lecture 2

Gaussian Elimination

Problem: Given A ∈ Cn×n and b ∈ Cn find x ∈ Cn such that Ax = b.


Observation: If A = (ai,j )_{i,j=1}^n is (upper) triangular, i.e.,

        ( a1,1  a1,2  · · ·  a1,n )
    A = (   0   a2,2  · · ·  a2,n )
        (   ..         ·.     ..  )
        (   0   · · ·    0   an,n )
then the xi can be recursively computed as follows:

last row : an,n xn = bn ⇒ xn = bn /ann ,


row n − 1 : an−1,n−1 xn−1 + an−1,n xn = bn−1 ⇒ xn−1 = (bn−1 − an−1,n xn )/an−1,n−1 ,
.. ..
. .
n
X
first row: . . . ⇒ x1 = (b1 − a1,j xj )/a1,1 .
j=2

Algorithm 1 BS (backward substitution)


input: U = (ui,j )_{i,j=1}^n ∈ Cn×n upper triangular, ui,i ≠ 0 for all i = 1, . . . , n, b = (bi )_{i=1}^n ∈ Cn .
output: x ∈ Cn solution to U x = b.
1: xn := bn /un,n
2: for i = n − 1 to 1 do
3: h := 0
4: for j = i + 1 to n do
5: h := h + ui,j xj
6: end for
7: xi := (bi − h)/ui,i
8: end for

The algorithm is also discussed as part of a recording here.
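Algorithm BS transcribes almost line by line into Python. A sketch; the pseudocode's 1-based indices become 0-based:

```python
def backward_substitution(U, b):
    """Solve U x = b for upper triangular U with nonzero diagonal (Algorithm BS, 0-based)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                        # i = n-1 down to 0
        h = sum(U[i][j] * x[j] for j in range(i + 1, n))  # accumulate u_{i,j} x_j
        x[i] = (b[i] - h) / U[i][i]
    return x

# A small upper triangular system with solution (1, 1, 1)
U = [[2.0, 1.0, 1.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 2.0]]
b = [4.0, 2.0, 2.0]
x = backward_substitution(U, b)
```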


Assume now that A is an arbitrary matrix.
The following operations do not change the solution space:
• multiplying an equation with a number r ∈ C\{0},


• adding an equation to another equation.

Idea: Use such operations to transform A into a triangular matrix.


Example:

        ( 2 1 1 )        (  4 )
    A = ( 4 3 3 ) ,  b = ( 10 ) .
        ( 8 7 9 )        ( 24 )

    2x1 +  x2 +  x3 =  4
    4x1 + 3x2 + 3x3 = 10
    8x1 + 7x2 + 9x3 = 24

multiply the first equation with −2 = −a2,1 /a1,1 and add to the second equation,
multiply the first equation with −4 = −a3,1 /a1,1 and add to the third equation,

    2x1 +  x2 +  x3 = 4
            x2 +  x3 = 2
          3x2 + 5x3 = 8

multiply the second equation with −3 and add to the third equation,

    2x1 + x2 + x3 = 4
           x2 + x3 = 2
               2x3 = 2

Now, use BS to compute the solution x = (1, 1, 1)T .


Observation: The first step corresponds to multiplying Ax = b with

          (  1 0 0 )
    L̃1 := ( −2 1 0 )
          ( −4 0 1 )

from the left:

             (  1 0 0 ) ( 2 1 1 )     ( 2 1 1 )
    L̃1 Ax  = ( −2 1 0 ) ( 4 3 3 ) x = ( 0 1 1 ) x,
             ( −4 0 1 ) ( 8 7 9 )     ( 0 3 5 )

             (  1 0 0 ) (  4 )   ( 4 )
    L̃1 b  =  ( −2 1 0 ) ( 10 ) = ( 2 )
             ( −4 0 1 ) ( 24 )   ( 8 )

Similarly, the second step corresponds to a multiplication with

          ( 1  0 0 )
    L̃2 := ( 0  1 0 ) .
          ( 0 −3 1 )

A matrix A = (ai,j ) ∈ Cn×n is unit lower triangular if

1. ai,j = 0 if 1 ≤ i < j ≤ n,

2. ai,i = 1 for all i = 1, . . . , n.


Lemma 2.1. [5.1] (a) Let L ∈ Cn×n be unit lower triangular with non-zero entries below the
diagonal only in column k. Then L−1 is unit lower triangular with non-zero entries below the
diagonal only in column k, with entries

    (L−1 )i,k = −Li,k ,   i = k + 1, . . . , n.

(b) Let L, M ∈ Cn×n be unit lower triangular, k ∈ {1, . . . , n}, and assume that
L has non-zero entries below the diagonal only in columns 1, . . . , k,
M has non-zero entries below the diagonal only in columns k + 1, . . . , n.
Then LM is unit lower triangular with

    (LM )i,j = Li,j   if j ≤ k,      (LM )i,j = Mi,j   else.

Proof. Exercise.

Consequence:

              ( 1 0 0 )                    ( 1 0 0 )
    (L̃1 )−1 = ( 2 1 0 ) =: L1 ,  (L̃2 )−1 = ( 0 1 0 ) =: L2 ,
              ( 4 0 1 )                    ( 0 3 1 )

and we see that

    A = (L̃1 )−1 L̃1 A = L1 (L̃2 )−1 L̃2 L̃1 A = L1 L2 U = LU

with
        ( 1 0 0 )        ( 2 1 1 )
    L = ( 2 1 0 ) ,  U = ( 0 1 1 )
        ( 4 3 1 )        ( 0 0 2 )
The matrix A is factorised into a unit lower triangular matrix L and an upper triangular regular
(non-singular) matrix U . Given this, we may solve Ly = b by forward substitution (exercise,
formulate this in analogy to BS) and then U x = y by BS in order to compute the solution to
Ax = b.
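Combining a forward-substitution routine (one possible formulation of the exercise) with BS solves Ax = b from the factors; the matrices below are the L, U and b from the worked example:

```python
def forward_substitution(L, b):
    """Solve L y = b for unit lower triangular L (no division needed since l_{i,i} = 1)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    return y

def backward_substitution(U, b):
    """Solve U x = b for upper triangular U with nonzero diagonal (Algorithm BS)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# Factors of the example: A = L U
L = [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [4.0, 3.0, 1.0]]
U = [[2.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
b = [4.0, 10.0, 24.0]
y = forward_substitution(L, b)    # L y = b
x = backward_substitution(U, y)   # U x = y, hence A x = b
```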

Lecture 3

LU factorisation

The j th principal sub-matrix of A ∈ Cn×n is given by (Aj )k,l = ak,l , k, l = 1, . . . , j.

Theorem 3.1. Let A ∈ Cn×n .


(a) Assume that Aj is invertible for all j = 1, . . . , n. Then there is a unique factorisation
A = LU with L unit lower triangular and U regular and upper triangular.
(b) If Aj is singular for some j then there is no such factorisation.

Proof. (a) The proof is based on induction; similar proofs will follow → Exercise.
(b) Assume that A = LU exists but Aj is singular. In block form:

        ( A11  A12 )   ( L11   0  ) ( U11  U12 )   ( L11 U11  ? )
    A = (          ) = (          ) (          ) = (            )
        ( A21  A22 )   ( L21  L22 ) (  0   U22 )   (    ?     ? )

with A11 = Aj , and since L11 is unit lower triangular and U11 is regular and upper triangular,
Aj = L11 U11 is the LU factorisation of Aj . But then

    det(Aj ) = det(L11 U11 ) = det(L11 ) det(U11 ) = det(U11 ) ≠ 0

since det(L11 ) = 1 and det(U11 ) ≠ 0, in contradiction to the singularity of Aj .

Algorithm 2 LU
input: A = (ai,j )_{i,j=1}^n ∈ Cn×n with det(Ak ) ≠ 0, k = 1, . . . , n.
output: L ∈ Cn×n unit lower triangular, U ∈ Cn×n upper triangular and regular with A = LU .

1: U = A, L = I.
2: for k = 1 to n − 1 do
3: for j = k + 1 to n do
4: lj,k := uj,k /uk,k
5: uj,k := 0
6: for i = k + 1 to n do
7: uj,i := uj,i − lj,k uk,i
8: end for
9: end for
10: end for
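Algorithm LU transcribes directly into Python. A sketch with 0-based indexing, applied to the 3×3 example matrix of the previous lecture:

```python
def lu(A):
    """LU factorisation without pivoting (Algorithm LU, 0-based).

    Requires every principal sub-matrix of A to be regular."""
    n = len(A)
    U = [row[:] for row in A]            # U starts as a copy of A
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        for j in range(k + 1, n):
            L[j][k] = U[j][k] / U[k][k]  # multiplier l_{j,k}
            U[j][k] = 0.0
            for i in range(k + 1, n):    # subtract l_{j,k} times row k from row j
                U[j][i] -= L[j][k] * U[k][i]
    return L, U

A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
L, U = lu(A)
```

This reproduces the factors L and U computed by hand for this matrix.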


We have uk,k ≠ 0 in line 4:

After k − 1 steps we have the matrix

              ( ?  · · ·        ?    ?  · · ·  ? )
              ( 0   ·.          ..   ..       .. )
    U (k−1) = ( ..      ·.      ?    ?        .. )  = (L(k−1) )−1 A
              ( 0  · · ·  0   uk,k   ?  · · ·  ? )
              ( ..            ..     ..       .. )
              ( 0  · · ·  0    ?     ?  · · ·  ? )

with L(k−1) unit lower triangular. Therefore

    uk,k det(U^{(k−1)}_{k−1} ) = det(U^{(k−1)}_k ) = det((L^{(k−1)}_k )^{−1} ) det(Ak ) ≠ 0

since det((L^{(k−1)}_k )^{−1} ) = 1 and det(Ak ) ≠ 0 by assumption. Hence uk,k ≠ 0 and we can
divide in line 4.


Example:

        ( 0 0 1 )
    A = ( 1 1 0 )
        ( 0 2 1 )

is regular since det(A) = 2, but A1 = (0) and A2 are both singular.
Idea: Permutation of the rows of A, σ : (1, 2, 3) 7→ (3, 1, 2), which is equivalent to a multiplication
with the permutation matrix

        ( 0 1 0 )           ( 1 1 0 )
    P = ( 0 0 1 ) ,   P A = ( 0 2 1 )
        ( 1 0 0 )           ( 0 0 1 )

and P A has regular submatrices. In fact, as P A is upper triangular, the LU factorisation is
obtained by setting U = P A and choosing L to be the identity.

Definition 3.2. P ∈ Cn×n is a permutation matrix if every row and every column contains
n − 1 zeros and 1 one.

We may write P = (eσ(1) , eσ(2) , . . . , eσ(n) ) where ej is the j th vector of the standard basis of Cn .
With π = σ −1 we also may write

        ( — eπ(1) — )
    P = (     ..    ) .
        ( — eπ(n) — )

This means that if we write

        ( — a1 — )
    A = (   ..   )
        ( — an — )


with the row vectors ai ∈ Cn then

          ( — aπ(1) — )
    P A = (    ..     )
          ( — aπ(n) — )

hence, a multiplication with a permutation matrix from the left exchanges the rows according
to the associated permutation. Similarly, a multiplication with a permutation matrix from the
right exchanges the columns.
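These conventions can be checked mechanically. A small pure-Python sketch (0-based indices, with the 3×3 example from above):

```python
def perm_matrix(sigma):
    """P = (e_{sigma(1)}, ..., e_{sigma(n)}) with the e_j as columns (0-based sigma)."""
    n = len(sigma)
    P = [[0] * n for _ in range(n)]
    for j in range(n):
        P[sigma[j]][j] = 1        # column j is the basis vector e_{sigma(j)}
    return P

def matmul(A, B):
    n, m, p = len(A), len(B[0]), len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(p)) for j in range(m)] for i in range(n)]

sigma = [2, 0, 1]                 # the map (1,2,3) -> (3,1,2), written 0-based
P = perm_matrix(sigma)
A = [[0, 0, 1],
     [1, 1, 0],
     [0, 2, 1]]
PA = matmul(P, A)

pi = [0] * 3                      # pi = sigma^{-1}
for j in range(3):
    pi[sigma[j]] = j
```

Multiplication from the left indeed reorders the rows: row i of P A equals row pi[i] of A.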

Theorem 3.3. Let A ∈ Cn×n be regular. Then there are a permutation matrix P ∈ Cn×n ,
L ∈ Cn×n unit lower triangular, and U ∈ Cn×n upper triangular with P A = LU .

Proof. (Recording available here) By induction on n. The case n = 1 is trivial as one just has
to choose P = L = 1 and U = A.
Let n > 1 and assume that the assertion is true for n − 1. Choose a permutation matrix P1 such
that a := (P1 A)1,1 ≠ 0. Such a matrix exists because A is regular, whence the first column of A
will contain a nonzero entry. We write

           ( a  u∗ )   (    1    0 ) ( a  u∗ )
    P1 A = (       ) = (           ) (       )
           ( l  B  )   ( (1/a)l  I ) ( 0  Ã  )

where l, u ∈ C(n−1)×1 , u∗ is the adjoint of u, and Ã = B − (1/a) lu∗ ∈ C(n−1)×(n−1) . The
matrix Ã is regular since

                        (    1    0 )     ( a  u∗ )
    0 ≠ det(P1 A) = det (           ) det (       ) = a det(Ã),
                        ( (1/a)l  I )     ( 0  Ã  )

the first determinant being equal to 1.
By the induction hypothesis there are P̃ (permutation), L̃ (unit lower triangular) and Ũ (regular
upper triangular) with P̃ Ã = L̃Ũ . Therefore

           (    1    0 ) ( 1     0      ) ( a  u∗ )
    P1 A = (           ) (              ) (       )        (the last factor is =: U )
           ( (1/a)l  I ) ( 0  (P̃ )−1 L̃  ) ( 0  Ũ  )

           (    1         0    )
         = (                   ) U
           ( (1/a)l  (P̃ )−1 L̃  )

           ( 1    0     ) (     1      0 )
         = (            ) (              ) U               (the middle factor is =: L)
           ( 0  (P̃ )−1  ) ( (1/a)P̃ l   L̃ )

       ⇒   ( 1  0 )
           (      ) P1 A = LU,
           ( 0  P̃ )

so that with P := (1 0; 0 P̃ ) P1 we obtain P A = LU , where P is a permutation matrix and L
and U are of the desired structure, too.

Lecture 4

Gaussian Elimination with Pivoting

Before step k:

              ( ?  · · ·        ?    ?  · · ·  ? )
              ( 0   ·.          ..   ..       .. )
    U (k−1) = ( ..      ·.      ?    ?        .. )
              ( 0  · · ·  0   uk,k   ?  · · ·  ? )
              ( ..            ..     ..       .. )
              ( 0  · · ·  0    ?     ?  · · ·  ? )

and we have a problem if the pivot uk,k is zero.
In fact, a small uk,k is undesirable as it leads to stability problems → we will see this later on.
There are basically two strategies to overcome this problem.

1. GEPP, Gaussian Elimination with Partial Pivoting: Swap rows to maximise |uk,k | among
the entries ul,k , l = k, . . . , n.

2. GECP, Gaussian Elimination with Complete Pivoting: Swap rows and columns to max-
imise |uk,k | among the entries ul,m , l, m = k, . . . , n.

In the following we will only consider GEPP since, in practice, partial pivoting usually is
sufficient and the gain in stability due to complete pivoting is negligible.
With an appropriate permutation matrix Pk that realises the swap of the rows we then compute
U (k) = L̃k Pk U (k−1) so that in the end

U = L̃n−1 Pn−1 · · · L̃1 P1 A. (4.1)

The permutation associated with Pl exchanges l with some number il > l and leaves the other
entries unchanged,

(1, . . . , l − 1, l, l + 1, . . . , il − 1, il , il + 1, . . . , n) 7→ (1, . . . , l − 1, il , l + 1, . . . , il − 1, l, il + 1, . . . , n).


For some k < l, consider a unit lower triangular matrix whose non-zero entries below the
diagonal sit only in column k, i.e.,

    M = I + m e∗k   with   m = (0, . . . , 0, mk+1,k , . . . , mn,k )T ;

this is exactly the structure of the elimination matrices L̃k . Among the entries of m are, in
particular, ml,k and mil ,k .

Observation: Since k < l and k < il , the permutation associated with Pl fixes k, so Pl ek = ek
and e∗k Pl−1 = e∗k , hence

    Pl M Pl−1 = I + (Pl m) e∗k ,

i.e., Pl M Pl−1 is the same matrix except that the entries ml,k and mil ,k in column k have
changed places.

As a consequence, the matrix

    L′k := Pn−1 · · · Pk+1 L̃k (Pk+1 )−1 · · · (Pn−1 )−1

has the same structure as L̃k but just the entries in column k below the diagonal are permuted.
Since

    L′k Pn−1 · · · Pk+1 = Pn−1 · · · Pk+1 L̃k

we obtain from (4.1) that

    U = L′n−1 · · · L′1 Pn−1 · · · P1 A,   where L−1 := L′n−1 · · · L′1 and P := Pn−1 · · · P1 ,

⇒ LU = P A, which is a factorisation of the desired form.


Algorithm 3 LUPP
input: A = (aij )ni,j=1 ∈ Cn×n regular.
output: L ∈ Cn×n unit lower triangular, U ∈ Cn×n regular upper triangular, P ∈ Cn×n
permutation matrix with P A = LU .
1: U = A, L = I, P = I.
2: for k = 1 to n − 1 do
3: choose i ∈ {k, . . . , n} such that |ui,k | is maximal
4: exchange (uk,k , . . . , uk,n ) with (ui,k , . . . , ui,n )
5: exchange (lk,1 , . . . , lk,k−1 ) with (li,1 , . . . , li,k−1 )
6: exchange (pk,1 , . . . , pk,n ) with (pi,1 , . . . , pi,n )
7: for j = k + 1 to n do
8: lj,k := uj,k /uk,k
9: uj,k := 0
10: for i = k + 1 to n do
11: uj,i := uj,i − lj,k uk,i
12: end for
13: end for
14: end for

A more elegant variant of the algorithm stores the permutation in a vector of length n, initialised
with the numbers (1, . . . , n), that finally contains the permutation π associated with P , i.e., the
ith entry of that vector contains π(i). See the example below.
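A sketch of Algorithm LUPP using the permutation-vector variant just described (pure Python, 0-based indexing; row i of P A is then row pi[i] of A):

```python
def lupp(A):
    """LU factorisation with partial pivoting (Algorithm LUPP, 0-based).

    The permutation is stored as a vector pi with (P A)[i] = A[pi[i]]."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    pi = list(range(n))
    for k in range(n - 1):
        p = max(range(k, n), key=lambda r: abs(U[r][k]))   # pivot row index
        if p != k:
            U[k], U[p] = U[p], U[k]                        # swap rows k and p of U
            for c in range(k):                             # swap the computed part of L
                L[k][c], L[p][c] = L[p][c], L[k][c]
            pi[k], pi[p] = pi[p], pi[k]                    # record the swap
        for j in range(k + 1, n):
            L[j][k] = U[j][k] / U[k][k]
            U[j][k] = 0.0
            for c in range(k + 1, n):
                U[j][c] -= L[j][k] * U[k][c]
    return L, U, pi

A = [[-2.0, 2.0, 0.0, 0.0],
     [ 2.0, -4.0, 1.0, 1.0],
     [ 0.0, 4.0, -2.0, 0.0],
     [ 1.0, 1.0, 0.0, 1.0]]
L, U, pi = lupp(A)
```

Applied to the 4×4 example below, it reproduces the factors computed by hand.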

Algorithm 4 GEPP (Gaussian elimination with partial pivoting)


input: A ∈ Cn×n regular, b ∈ Cn .
output: x ∈ Cn with Ax = b.
1: find P A = LU with algorithm LUPP
2: solve Ly = P b with FS
3: solve U x = y with BS

Example: Consider

        ( −2  2  0  0 )        ( 0 )
    A = (  2 −4  1  1 ) ,  b = ( 0 )
        (  0  4 −2  0 )        ( 2 )
        (  1  1  0  1 )        ( 3 )

and execute GEPP. In what follows it is useful to keep track of the signs of the entries and of
the lower/upper triangular structure of the intermediate matrices.
Step 1, computation of the LU factorisation with LUPP.
The last vector will contain the permutation π:

           ( 1 0 0 0 )            ( −2  2  0  0 )           ( 1 )
    L(0) = ( 0 1 0 0 ) ,  U (0) = (  2 −4  1  1 ) ,  π (0) = ( 2 )
           ( 0 0 1 0 )            (  0  4 −2  0 )           ( 3 )
           ( 0 0 0 1 )            (  1  1  0  1 )           ( 4 )

Apparently, 2 = |U (0)1,1 | ≥ maxj≥1 |U (0)j,1 |, hence we need no permutation. One elimination step


leads to

           (   1   0 0 0 )            ( −2  2  0 0 )           ( 1 )
    L(1) = (  −1   1 0 0 ) ,  U (1) = (  0 −2  1 1 ) ,  π (1) = ( 2 )
           (   0   0 1 0 )            (  0  4 −2 0 )           ( 3 )
           ( −1/2  0 0 1 )            (  0  2  0 1 )           ( 4 )

Now, 4 = |U (1)3,2 | ≥ maxj≥2 |U (1)j,2 |, hence we permute rows 2 and 3:

               (   1   0 0 0 )                ( −2  2  0 0 )           ( 1 )
    (L(1) )′ = (   0   1 0 0 ) ,  (U (1) )′ = (  0  4 −2 0 ) ,  π (2) = ( 3 )
               (  −1   0 1 0 )                (  0 −2  1 1 )           ( 2 )
               ( −1/2  0 0 1 )                (  0  2  0 1 )           ( 4 )

The next elimination step yields

           (   1    0   0 0 )            ( −2  2  0 0 )           ( 1 )
    L(2) = (   0    1   0 0 ) ,  U (2) = (  0  4 −2 0 ) ,  π (2) = ( 3 )
           (  −1  −1/2  1 0 )            (  0  0  0 1 )           ( 2 )
           ( −1/2  1/2  0 1 )            (  0  0  1 1 )           ( 4 )

We have that 1 = |U (2)4,3 | ≥ maxj≥3 |U (2)j,3 |, hence we permute rows 3 and 4:

               (   1    0   0 0 )                ( −2  2  0 0 )           ( 1 )
    (L(2) )′ = (   0    1   0 0 ) ,  (U (2) )′ = (  0  4 −2 0 ) ,  π (3) = ( 3 )
               ( −1/2  1/2  1 0 )                (  0  0  1 1 )           ( 4 )
               (  −1  −1/2  0 1 )                (  0  0  0 1 )           ( 2 )

No further elimination step is required as (U (2) )′ already is upper triangular. Hence

           (   1    0   0 0 )            ( −2  2  0 0 )           ( 1 )
    L(3) = (   0    1   0 0 ) ,  U (3) = (  0  4 −2 0 ) ,  π (3) = ( 3 )
           ( −1/2  1/2  1 0 )            (  0  0  1 1 )           ( 4 )
           (  −1  −1/2  0 1 )            (  0  0  0 1 )           ( 2 )

One may now check that LU = P A with U = U (3) , L = L(3) , and

        ( 1 0 0 0 )
    P = ( 0 0 1 0 )
        ( 0 0 0 1 )
        ( 0 1 0 0 )

which is the permutation matrix associated with π = π (3) .


Step 2, solving Ly = P b. We have that

          ( eπ(1) · b )   ( bπ(1) )
    P b = (    ..     ) = (  ..   )
          ( eπ(n) · b )   ( bπ(n) )

which indicates how the action of the permutation matrix P on a vector can be computed given
the associated permutation π. In our example, the solution is y = (0, 2, 2, 1)T .
Step 3, solving U x = y.
One can check that x = (1, 1, 1, 1)T , which indeed is the solution to Ax = b.
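The identity LU = P A can be verified mechanically; a small pure-Python check with the factors computed above (0-based permutation vector):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    n, m, p = len(A), len(B[0]), len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(p)) for j in range(m)] for i in range(n)]

A = [[-2.0, 2.0, 0.0, 0.0],
     [ 2.0, -4.0, 1.0, 1.0],
     [ 0.0, 4.0, -2.0, 0.0],
     [ 1.0, 1.0, 0.0, 1.0]]
L = [[ 1.0, 0.0, 0.0, 0.0],
     [ 0.0, 1.0, 0.0, 0.0],
     [-0.5, 0.5, 1.0, 0.0],
     [-1.0, -0.5, 0.0, 1.0]]
U = [[-2.0, 2.0, 0.0, 0.0],
     [ 0.0, 4.0, -2.0, 0.0],
     [ 0.0, 0.0, 1.0, 1.0],
     [ 0.0, 0.0, 0.0, 1.0]]
pi = [0, 2, 3, 1]                   # the permutation of the example, written 0-based

PA = [A[pi[i]] for i in range(4)]   # applying P just reorders the rows of A
LU = matmul(L, U)                   # should agree with PA entry by entry
```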

Lecture 5

Matrix Norms, Part I

Some remarks on norms and inner products first.


A vector norm on Cn is a map ‖ · ‖ : Cn → R with

1. ‖x‖ ≥ 0 for all x ∈ Cn and ‖x‖ = 0 if and only if x = 0,
2. ‖αx‖ = |α|‖x‖ for all α ∈ C, x ∈ Cn ,
3. ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ Cn .
Theorem 5.1. All norms on Cn are equivalent, i.e., given norms ‖ · ‖a and ‖ · ‖b , there are
constants 0 < c1 ≤ c2 < ∞ such that

    c1 ‖x‖a ≤ ‖x‖b ≤ c2 ‖x‖a   ∀x ∈ Cn .

An inner product on Cn is a map ⟨·, ·⟩ : Cn × Cn → C with

1. ⟨x, x⟩ ∈ R+ for all x ∈ Cn , and ⟨x, x⟩ = 0 if and only if x = 0,
2. ⟨x, y⟩ = conj ⟨y, x⟩ for all x, y ∈ Cn ,
3. ⟨x, αy⟩ = α⟨x, y⟩ for all α ∈ C and x, y ∈ Cn ,
4. ⟨x, y + z⟩ = ⟨x, y⟩ + ⟨x, z⟩ for all x, y, z ∈ Cn .
Examples:

• p-norm: ‖x‖p := ( Σi |xi |^p )^{1/p} for p ∈ [1, ∞),

• maximum-norm: ‖x‖∞ := maxi |xi |,

• standard inner product: ⟨x, y⟩ := Σi x̄i yi .

Lemma 5.2. [1.5] (Cauchy-Schwarz) We have that

    |⟨x, y⟩|^2 ≤ ⟨x, x⟩⟨y, y⟩   ∀x, y ∈ Cn

with equality if and only if x = αy or y = αx for some α ∈ C.


Lemma 5.3. [1.6] Given an inner product ⟨·, ·⟩,

    x 7→ √⟨x, x⟩

is a norm on Cn .


Given A = (ak,l ) ∈ Cm×n , the adjoint A∗ ∈ Cn×m is the matrix with entries (A∗ )i,j = āj,i .
A vector x ∈ Cn can be considered as an n × 1 matrix, x ∈ Cn×1 . Then

    x∗ y = (x̄1 · · · x̄n ) (y1 , . . . , yn )T = Σ_{i=1}^n x̄i yi = ⟨x, y⟩.

We also write

    xy ∗ = (xi ȳj )_{i,j=1}^n = x ⊗ y ∈ Cn×n .

Moreover,

    ⟨Ax, y⟩ = (Ax)∗ y = x∗ A∗ y = ⟨x, A∗ y⟩

for all A ∈ Cm×n , x ∈ Cn and y ∈ Cm .
Some further definitions:

1. Q ∈ Cm×n is unitary if Q∗ Q = In ∈ Cn×n ,


Q ∈ Rm×n is orthogonal if QT Q = In ∈ Rn×n .

2. A ∈ Cn×n is Hermitian if A∗ = A,
A ∈ Rn×n is symmetric if AT = A.

3. A Hermitian matrix A is positive (semi-)definite if

    ⟨x, Ax⟩ = x∗ Ax > 0 (≥ 0)   ∀x ∈ Cn \{0}.

Q being unitary means that the columns are orthonormal. In the case m = n we have that
Q−1 = Q∗ and also QQ∗ = I.

Theorem 5.4. For any unitary Q ∈ Cm×n

    ⟨Qx, Qy⟩ = ⟨x, y⟩,   ‖Qx‖2 = ‖x‖2   ∀x, y ∈ Cn .

Proof. Exercise.

One can also show that if A ∈ Cn×n is positive definite then

    ⟨x, y⟩A := ⟨x, Ay⟩ is an inner product,                   (1.4)

    ‖x‖A := √⟨x, x⟩A is a vector norm.                        (1.5)

We can consider Cm×n as a C vector space of dimension mn; possible norms:

    ‖A‖max := maxi,j |ai,j |                          maximum norm,

    ‖A‖F := ( Σi,j |ai,j |^2 )^{1/2}                  Frobenius norm.

Another way of assigning a norm to a matrix is the operator norm: Given norms ‖ · ‖m̂ on Cm
and ‖ · ‖n̂ on Cn define

    ‖A‖(m̂,n̂) := max_{x ∈ Cn \{0}} ‖Ax‖m̂ / ‖x‖n̂ = max_{‖x‖n̂ = 1} ‖Ax‖m̂ .


Definition 5.5. [1.25] A matrix norm on Cn×n is a mapping ‖·‖ : Cn×n → R with the properties

1. ‖A‖ ≥ 0 for all A ∈ Cn×n , and ‖A‖ = 0 if and only if A = 0,

2. ‖αA‖ = |α|‖A‖ for all α ∈ C, A ∈ Cn×n ,

3. ‖A + B‖ ≤ ‖A‖ + ‖B‖ for all A, B ∈ Cn×n ,

4. ‖AB‖ ≤ ‖A‖‖B‖ for all A, B ∈ Cn×n .

The last condition makes the difference to a vector norm. It means that the norm is compatible
with the matrix-matrix product.

Definition 5.6. [1.26] Given a vector norm ‖ · ‖v on Cn , we define the induced (operator) norm
‖ · ‖m on Cn×n by

    ‖A‖m := max_{x ∈ Cn \{0}} ‖Ax‖v / ‖x‖v = max_{‖x‖v = 1} ‖Ax‖v .

Theorem 5.7 (1.27). The induced norm ‖ · ‖m of a vector norm ‖ · ‖v is a matrix norm with
‖In ‖m = 1 and

    ‖Ax‖v ≤ ‖A‖m ‖x‖v   ∀A ∈ Cn×n , x ∈ Cn .

Proof. (Recording available here) Clearly ‖A‖m ∈ R and ‖A‖m ≥ 0. We have

    ‖A‖m = 0 ⇔ ‖Ax‖v /‖x‖v = 0 ∀x ∈ Cn \{0} ⇔ ‖Ax‖v = 0 ∀x ∈ Cn ⇔ A = 0

which shows the first point. For the second point we observe that

    ‖αA‖m = max_{‖x‖v =1} ‖αAx‖v = max_{‖x‖v =1} |α|‖Ax‖v = |α| max_{‖x‖v =1} ‖Ax‖v = |α|‖A‖m ,

and similarly the third property can be deduced from the corresponding property of the vector
norm:

    ‖A + B‖m = max_{‖x‖v =1} ‖(A + B)x‖v ≤ max_{‖x‖v =1} (‖Ax‖v + ‖Bx‖v )
             ≤ max_{‖x‖v =1} ‖Ax‖v + max_{‖x‖v =1} ‖Bx‖v = ‖A‖m + ‖B‖m .

Clearly ‖In ‖m = max_{‖x‖v =1} ‖In x‖v = 1, and for any y ∈ Cn \{0}

    ‖A‖m = max_{x≠0} ‖Ax‖v /‖x‖v ≥ ‖Ay‖v /‖y‖v   ⇒   ‖Ay‖v ≤ ‖A‖m ‖y‖v .

Using this we can show the submultiplicativity, the fourth property of matrix norms:

    ‖AB‖m = max_{‖x‖v =1} ‖ABx‖v ≤ max_{‖x‖v =1} ‖A‖m ‖Bx‖v = ‖A‖m max_{‖x‖v =1} ‖Bx‖v = ‖A‖m ‖B‖m

Some remarks:

• We will use the same symbol / subscript for a vector norm and its induced norm, e.g. ‖x‖2
  and ‖A‖2 .


• Not every matrix norm is an induced norm.
  Exercise: Find an example. Hint: ‖In ‖ = 1 does not have to be fulfilled.

• ‖x‖0 := ‖Ax‖ is a vector norm if A is invertible. Theorem [1.27] yields

      (1/‖A−1 ‖) ‖x‖ ≤ ‖x‖0 ≤ ‖A‖‖x‖.

  Exercise: Show these inequalities.

Theorem 5.8. [1.28] The matrix norm induced by the infinity norm is the maximum row sum,

    ‖A‖∞ = max_{1≤i≤n} Σ_{j=1}^n |ai,j | ,   A ∈ Cn×n .

Proof. For any x ∈ Cn

    ‖Ax‖∞ = maxi | Σj ai,j xj | ≤ maxi Σj |ai,j ||xj | ≤ ( maxi Σj |ai,j | ) ‖x‖∞ .

To show the other inequality let k ∈ {1, . . . , n} be the row index with maximal sum,

    maxi Σj |ai,j | = Σj |ak,j |.

Define x ∈ Cn by

    xj := āk,j /|ak,j |   if ak,j ≠ 0,    xj := 0   else.

Then ‖x‖∞ = 1 and

    ‖A‖∞ ≥ ‖Ax‖∞ = maxi | Σj ai,j xj | ≥ Σj ak,j āk,j /|ak,j | = Σj |ak,j | = maxi Σj |ai,j |.

Theorem 5.9. [1.29] The matrix norm induced by the 1-norm is the maximum column sum,

    ‖A‖1 = ‖A∗ ‖∞ = max_{1≤j≤n} Σ_{i=1}^n |ai,j | ,   A ∈ Cn×n .
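Both closed forms are one-liners. A sketch with a small illustrative matrix (my own example, not from the notes):

```python
def norm_inf(A):
    """Induced infinity norm: maximum absolute row sum (Theorem 5.8)."""
    return max(sum(abs(a) for a in row) for row in A)

def norm_1(A):
    """Induced 1-norm: maximum absolute column sum (Theorem 5.9)."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))

A = [[1, -2],
     [3,  4]]
# row sums are 3 and 7, column sums are 4 and 6
```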

Lecture 6

Eigenvalues and Eigenvectors

Given A ∈ Cn×n , λ is an eigenvalue of A if there is an x ∈ Cn \{0} such that Ax = xλ.

Characteristic polynomial: ρA (z) = det(A − zI).
λ is an eigenvalue if and only if ρA (λ) = 0.
Algebraic multiplicity of an eigenvalue λ: largest integer q such that (z − λ)q is a factor of ρA (z).
Geometric multiplicity of an eigenvalue λ: dimension r of the kernel of A − λI.
Simple eigenvalue: r = q = 1.
Example:

        ( 2 0 0 0 )
    A = ( 0 1 0 0 )
        ( 0 0 1 1 )
        ( 0 0 0 1 )

has ρA (z) = (2 − z)(1 − z)^3 . Eigenvalue λ = 1: algebraic multiplicity q = 3, geometric
multiplicity r = 2 since the kernel of

              ( 1 0 0 0 )
    A − λIn = ( 0 0 0 0 )
              ( 0 0 0 1 )
              ( 0 0 0 0 )

is two-dimensional.
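The geometric multiplicity of the example can be checked numerically. The rank routine below is a generic Gaussian-elimination sketch (my own helper, not part of the notes); the geometric multiplicity is n minus the rank of A − λI:

```python
def rank(M, tol=1e-12):
    """Rank of a matrix via Gaussian elimination with partial pivoting (a generic sketch)."""
    M = [row[:] for row in M]
    m, n = len(M), len(M[0])
    r = 0
    for c in range(n):
        if r == m:
            break
        piv = max(range(r, m), key=lambda i: abs(M[i][c]))
        if abs(M[piv][c]) < tol:
            continue                     # no pivot in this column
        M[r], M[piv] = M[piv], M[r]
        for i in range(r + 1, m):
            f = M[i][c] / M[r][c]
            for j in range(c, n):
                M[i][j] -= f * M[r][j]
        r += 1
    return r

# The example matrix and the eigenvalue lambda = 1
A = [[2, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 1]]
lam = 1
AmI = [[A[i][j] - (lam if i == j else 0) for j in range(4)] for i in range(4)]
geom = 4 - rank(AmI)    # geometric multiplicity r = dim ker(A - lambda I)
```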
Theorem 6.1. r ≤ q.
Definition 6.2. The spectral radius of A ∈ Cn×n is

    ρ(A) = max{ |λ| : λ eigenvalue of A }.

Similarity transformation: B = XAX −1 with regular X ∈ Cn×n .


Theorem 6.3. If B ∈ Cn×n is similar to A ∈ Cn×n then A and B have the same eigenvalues
with the same multiplicities.
Definition 6.4. A matrix A ∈ Cn×n is normal if it has n orthogonal eigenvectors.
Normal matrices can be diagonalised: Let Q := (q1 , . . . , qn ) with qi orthonormal eigenvectors,
hence Q is unitary. Then AQ = QΛ with Λ = diag(λ1 , . . . , λn ) where the λi are the eigenvalues
corresponding to the qi . Hence
A = QΛQ−1 = QΛQ∗


i.e., A is similar to a diagonal matrix.


Further consequence:

A∗ A = QΛ∗ Q∗ QΛQ∗ = QΛ∗ ΛQ∗ = QΛΛ∗ Q∗ = QΛQ∗ QΛ∗ Q∗ = AA∗ .

We will see below that normal matrices indeed can be characterised by this relation! To see this,
we need

Theorem 6.5. [2.2] Given A ∈ Cn×n , there is a unitary Q ∈ Cn×n and an upper triangular
T ∈ Cn×n such that A = QT Q∗ .

Proof. By induction on n. In the case n = 1 we clearly may choose Q = 1 and T = A.


Let n ≥ 2. By the fundamental theorem of algebra, the characteristic polynomial of A has a
zero λ, which hence is an eigenvalue. Let y1 denote a corresponding eigenvector with ‖y1 ‖2 = 1
and extend it by {y2 , . . . , yn } to an orthonormal basis of Cn .
Setting U := (y1 , . . . , yn ) ∈ Cn×n , the identity Ay1 = λy1 leads to

           ( λ  r∗ )
    AU = U (       )   with some r ∈ Cn−1 , Ã ∈ C(n−1)×(n−1) .
           ( 0  Ã  )

Using the induction assumption, there is a factorisation Ã = V T̃ V ∗ with upper triangular
T̃ ∈ C(n−1)×(n−1) and unitary V ∈ C(n−1)×(n−1) . Therefore,

          ( λ    r∗    )         ( 1  0 ) ( λ  r∗ V ) ( 1  0  )
    A = U (            ) U ∗ = U (      ) (         ) (       ) U ∗
          ( 0  V T̃ V ∗ )         ( 0  V ) ( 0   T̃  ) ( 0  V ∗ )

where the first two factors form Q and the third factor is T .
Observing that Q is unitary since U and V are, this factorisation is of the desired structure.

Theorem 6.6. [2.3] If A ∈ Cn×n satisfies A∗ A = AA∗ then there is a unitary Q ∈ Cn×n and a
diagonal D ∈ Cn×n such that A = QDQ∗ .

Proof. We have from Th. [2.2] that A = QT Q∗ with T upper triangular. We will show that T
is diagonal.
Since QT ∗ T Q∗ = QT ∗ Q∗ QT Q∗ = A∗ A = AA∗ = QT Q∗ QT ∗ Q∗ = QT T ∗ Q∗ we obtain
T ∗ T = T T ∗ . Therefore

    (T ∗ T )i,i = Σ_{k=1}^n (T ∗ )i,k Tk,i = Σ_{k=1}^n T̄k,i Tk,i = Σ_{k=1}^i |Tk,i |^2    (?1 )

where we used that T is triangular in the last identity. Similarly,

    (T T ∗ )i,i = Σ_{k=i}^n |Ti,k |^2 .    (?2 )

We now show by induction on i that the off-diagonal entries of T vanish.
For i = 1 we get from (?1 ) and (?2 ) that |T1,1 |^2 = Σ_{k=1}^n |T1,k |^2 from which we conclude
that T1,k = 0 for k = 2, . . . , n.
Let now i > 1 and assume that Tk,j = 0 for 1 ≤ k ≤ i − 1 and all j > k. We need to show
that Ti,k = 0 for k = i + 1, . . . , n. Since, in particular, Tk,i = 0 for k = 1, . . . , i − 1, we obtain
from (?1 ) and (?2 ) that |Ti,i |^2 = Σ_{k=i}^n |Ti,k |^2 from which we can conclude that indeed
Ti,k = 0 for k = i + 1, . . . , n.


The proofs of the following two statements are left as exercises:

Theorem 6.7. [2.4] If A ∈ Cn×n is Hermitian then there is a unitary Q and a real Λ ∈ Rn×n
such that A = QΛQ∗ .

Lemma 6.8. [2.5] Let A ∈ Cn×n be positive definite. There is a positive definite A1/2 ∈ Cn×n
(the square root of A) such that A = A1/2 A1/2 .

Video material covering statements 6.3, 6.7 and 6.8 can be found here.

Lecture 7

Matrix Norms, Part II

Theorem 7.1. [1.30] For any matrix norm ‖ · ‖, A ∈ Cn×n , and k ∈ N

    ρ(A)^k ≤ ρ(A^k ) ≤ ‖A^k ‖ ≤ ‖A‖^k .

Proof. (Recording available here) Let B := A^k . If λ is an eigenvalue of A then λ^k is an
eigenvalue of B which implies the first inequality. The third inequality is a consequence of
matrix norms being sub-multiplicative. To prove the second inequality, let µ be the eigenvalue
of B such that ρ(B) = |µ|, let x ∈ Cn \ {0} be a corresponding eigenvector and set
X := (x, . . . , x) ∈ Cn×n . Then

    ‖B‖‖X‖ ≥ ‖BX‖ = ‖µX‖ = |µ|‖X‖ = ρ(B)‖X‖

from which we get the second inequality after dividing by ‖X‖.
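The chain of inequalities can be observed numerically. A sketch using the induced ∞-norm as the matrix norm and an upper triangular example matrix (my own choice), whose eigenvalues are its diagonal entries:

```python
def norm_inf(A):
    """Induced infinity norm (maximum absolute row sum), a valid matrix norm."""
    return max(sum(abs(a) for a in row) for row in A)

def matmul(A, B):
    n, m, p = len(A), len(B[0]), len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(p)) for j in range(m)] for i in range(n)]

A = [[1.0, 5.0],
     [0.0, 2.0]]          # upper triangular, so the eigenvalues are 1 and 2
rho_A = 2.0               # spectral radius of A
A2 = matmul(A, A)         # A^2, upper triangular with diagonal entries 1 and 4
rho_A2 = 4.0              # spectral radius of A^2

# Theorem 7.1 with k = 2: rho(A)^2 <= rho(A^2) <= ||A^2|| <= ||A||^2
chain = (rho_A ** 2, rho_A2, norm_inf(A2), norm_inf(A) ** 2)
```

Here the middle inequalities are strict, so none of the bounds is an equality in general.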

The inequalities in the above theorem become equalities if the matrix is normal and we use the
matrix norm induced by the Euclidean norm:

Theorem 7.2. [1.31] If A ∈ Cn×n is normal then ρ(A)^l = ‖A‖_2^l for all l ∈ N.

Proof. Let x1 , . . . , xn be an orthonormal basis of eigenvectors of A corresponding to the eigen-
values λ1 , . . . , λn where w.l.o.g. ρ(A) = |λ1 |. For any x ∈ Cn we can write

    x = Σ_{j=1}^n αj xj  with αj = ⟨xj , x⟩   ⇒   ‖x‖_2^2 = Σ_{j=1}^n |αj |^2 .

We have as well that

    Ax = Σ_{j=1}^n αj λj xj   ⇒   ‖Ax‖_2^2 = Σ_{j=1}^n |λj αj |^2 .

Therefore

    ‖Ax‖_2^2 / ‖x‖_2^2 = ( Σj |αj |^2 |λj |^2 ) / ( Σj |αj |^2 ) ≤ |λ1 |^2

from which we see that ‖A‖2 ≤ |λ1 |. Together with Theorem 7.1 the assertion follows.

A result for non-square matrices:

Theorem 7.3. [1.32] For all A ∈ Cm×n the equality ‖A‖_2^2 = ρ(A∗ A) holds true.


Proof. The matrix A∗ A is Hermitian, hence normal, and by Theorem 7.2

    ρ(A∗ A) = ‖A∗ A‖2 = max_{‖x‖2 =1} ‖A∗ Ax‖2 = max_{‖x‖2 =1} max_{‖y‖2 =1} ⟨y, A∗ Ax⟩    (?)

where we used a duality argument◦ for the last identity. On the one hand,

    (?) ≥ max_{‖x‖2 =1} ⟨x, A∗ Ax⟩ = max_{‖x‖2 =1} ⟨Ax, Ax⟩ = max_{‖x‖2 =1} ‖Ax‖_2^2 = ‖A‖_2^2 .

On the other hand, using Cauchy-Schwarz,

    (?) ≤ max_{‖x‖2 =1} max_{‖y‖2 =1} ⟨Ay, Ax⟩ ≤ max_{‖x‖2 =1} max_{‖y‖2 =1} ‖Ay‖2 ‖Ax‖2
        = ( max_{‖x‖2 =1} ‖Ax‖2 ) ( max_{‖y‖2 =1} ‖Ay‖2 ) = ‖A‖_2^2 .

◦ We will not lean heavily on duality as part of the module, but some knowledge thereof could
become useful in other contexts. As a quick introduction (you can find more in Section 1.3 of
Stuart & Voss): Given a norm ‖ · ‖ on Cn , the pair (Cn , ‖ · ‖) is a Banach space (a complete
normed vector space) B. The Banach space B′ , the dual of B, is the pair (Cn , ‖ · ‖B′ ), where
‖x‖B′ = max_{‖y‖=1} |⟨x, y⟩|. The usage of max here implicitly relies on the fact that a continuous
function on a closed, bounded set achieves its maximum value.

Exercise: Use the above theorem to show that ‖UA‖_2 = ‖A‖_2 = ‖AV‖_2 for all unitary matrices U ∈ C^{m×m} and V ∈ C^{n×n}.
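Theorem 7.3 can be sanity-checked numerically for a real 2 × 2 matrix: ρ(AᵀA) follows from the quadratic characteristic polynomial, while ‖A‖₂² can be approximated by maximising ‖Ax‖₂² over sampled unit vectors. A minimal Python sketch (the matrix is an arbitrary illustrative choice, not from the notes):

```python
import math

A = [[1.0, 2.0], [0.0, 3.0]]

# Entries of B = A^T A (symmetric 2x2)
b11 = A[0][0]**2 + A[1][0]**2
b12 = A[0][0]*A[0][1] + A[1][0]*A[1][1]
b22 = A[0][1]**2 + A[1][1]**2

# Largest eigenvalue of B via the quadratic formula -> spectral radius rho(A^T A)
tr, det = b11 + b22, b11*b22 - b12*b12
rho = (tr + math.sqrt(tr*tr - 4*det)) / 2

# ||A||_2^2 approximated by sampling unit vectors x = (cos t, sin t)
norm2_sq = max(
    (A[0][0]*math.cos(t) + A[0][1]*math.sin(t))**2
    + (A[1][0]*math.cos(t) + A[1][1]*math.sin(t))**2
    for t in (math.pi * i / 10000 for i in range(10000))
)

print(rho, norm2_sq)  # the two numbers agree up to the sampling accuracy
```

For this particular matrix both values come out near 13.3245, which is ρ(AᵀA) = 7 + 2√10.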

δ-Jordan canonical form


For any A ∈ C^{n×n} there is an invertible S ∈ C^{n×n} and a J ∈ C^{n×n} such that A = SJS^{−1} where J is a Jordan matrix, i.e., denoting by λ_1, …, λ_k the eigenvalues of A, J is block diagonal,

J = diag(J_{n_1}(λ_1), J_{n_2}(λ_2), …, J_{n_k}(λ_k)),

with δ-Jordan blocks

J_{n_l}(λ_l) = [ λ_l  δ    0    ···  0
                 0    λ_l  δ    ···  0
                 ⋮         ⋱    ⋱    ⋮
                 ⋮              ⋱    δ
                 0    ···  ···  0    λ_l ],

i.e. λ_l on the diagonal, δ on the first superdiagonal and zeros elsewhere.
This corresponds to the standard Jordan canonical form except that the 1s in the first off-diagonal are replaced by δs. But recall that the off-diagonal elements serve to characterise the dimensions of the eigenspaces and stand for discrete information, whence we may represent this information by a number different from 1. In fact, one can derive the above δ-Jordan canonical form from the usual one by an appropriate similarity transformation (Exercise).
Later on, we will pick an arbitrarily small δ and use the following lemma, which says that the spectral radius is not a matrix norm but can be approximated by matrix norms.


Lemma 7.4. For any A ∈ C^{n×n} and δ > 0 there is a vector norm ‖·‖_δ on C^n such that the induced norm fulfills ρ(A) ≤ ‖A‖_δ ≤ ρ(A) + δ.

Proof. The first inequality is true by Theorem 7.1.
Let A = S_δ J_δ S_δ^{−1} be the factorisation such that J_δ is the δ-Jordan canonical form and define the norm ‖·‖_δ by

‖x‖_δ := ‖S_δ^{−1} x‖_∞,  x ∈ C^n.

Then

‖A‖_δ = max_{x≠0} ‖Ax‖_δ / ‖x‖_δ = max_{x≠0} ‖S_δ^{−1}Ax‖_∞ / ‖S_δ^{−1}x‖_∞

and inserting y = S_δ^{−1}x this is

= max_{y≠0} ‖S_δ^{−1}AS_δ y‖_∞ / ‖y‖_∞ = max_{y≠0} ‖J_δ y‖_∞ / ‖y‖_∞ = ‖J_δ‖_∞,

and recalling that ‖·‖_∞ is the maximum row sum we obtain that this is

= max_i Σ_j |(J_δ)_{i,j}| ≤ max_i |λ_i| + δ = ρ(A) + δ

where we used the special structure of J_δ for establishing the inequality.

Lecture 8

Floating point representation

Any number x ∈ R can be represented with respect to a basis β ∈ N\{0, 1}:

x = σ Σ_{n∈Z} a_n β^n,  a_n ∈ {0, …, β − 1} ∀n ∈ Z,  σ ∈ {±1} the sign.

Computers are based on the dual system, β = 2. And as they are finite, the idea is to cut the infinite sum and to approximate x by

ξ = σ 2^e (1 + Σ_{n=1}^t a_n 2^{−n}) = σ 2^e × (1.a_1 … a_t)_2,  e = (b_1 … b_s)_2 − m.

The (a1 . . . at ) are called mantissa, here of length t with fraction bits an ∈ {0, 1}, and the
(b1 . . . bs ) represent the exponent of length s with exponent bits bi ∈ {0, 1}. The number m is
called bias or shift.
Example, IEEE Standard 754 Double Precision:
There are 64 bits to represent a number. The first bit is the sign bit, the next eleven bits are
the exponent bits, and the final 52 bits are the fraction bits:
(σb1 . . . b11 a1 . . . a52 )
The bias is fixed at m = 1023.
You can try this out on a particular numerical example, e.g. converting 286.75 into an IEEE
Standard 754 format. In what follows we can use single (rather than double) precision just to
simplify some of the algebra. Key steps:
1. Represent the decimal number in standard binary: (286.75)10 = (100011110.11)2 . This
is a good opportunity for a refresher, particularly for the fractional parts which require
contributions of 2−1 and 2−2 in this case in order to represent the 0.75 part of the original
decimal number.
2. Normalise the binary number via a binary shift (the so-called 1.m form) such that only one hidden one is left at the start: 1.0001111011 × 2^8.
3. Adjust with the bias for the single precision format, which for us is 2^{8−1} − 1 = 127. (You can try out the double precision version as an exercise.)
4. The exponent value (+8 for 2^8 as produced in step 2) is added to the bias (8 + 127 = (135)_10), which leads us to our 8-bit exponent structure being (10000111)_2.


5. Putting everything together (0 for the sign bit, 1000 0111 for the exponent bits (8 bits in total) and 0001 1110 1100 0000 0000 000 for the fraction bits (23 bits in total, with padding at the end)), we retrieve our final result: (286.75)_10 = (0100 0011 1000 1111 0110 0000 0000 0000)_{IEEE754 single-precision}.

Additional resource: if you would like to try out some examples (or verify your own calcula-
tions) there are nice converters out there, as well as other worked-out cases of various degrees
of complexity.
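The conversion above can also be checked against Python's standard struct module, which exposes the IEEE 754 single-precision bit pattern directly (a quick verification sketch, not part of the notes):

```python
import struct

def float_to_bits(x: float) -> str:
    """Return the 32-bit IEEE 754 single-precision pattern of x as a bit string."""
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))
    return format(as_int, "032b")

bits = float_to_bits(286.75)
sign, exponent, fraction = bits[0], bits[1:9], bits[9:]
print(sign, exponent, fraction)
# sign '0', exponent '10000111' (135 = 8 + 127), fraction '0001111011' + padding
```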
The relative error when approximating a number x by its nearest representable neighbour ξ satisfies

|ξ − x| / |x| ≤ ε_m ≈ 10^{−16}.

ε_m is called machine precision. For positive x, representable values ξ lie between ≈ 10^{−320} and ≈ 10^{308}. If x is bigger (smaller) then we have to deal with an overflow (underflow).
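Machine precision can be found experimentally as the smallest power of two that still changes 1.0 when added to it; a small sketch for double precision (where ε_m is of the order 10^{−16}):

```python
import sys

# Halve eps until 1.0 + eps/2 is rounded back to 1.0 in double precision.
eps = 1.0
while 1.0 + eps / 2 > 1.0:
    eps /= 2

print(eps)                      # 2.220446049250313e-16
print(sys.float_info.epsilon)   # the same value, as reported by Python
```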
Landau (Big O) Notation. Asymptotic notation is a useful tool in assessing algorithmic performance, as well as analysing (and indeed constraining) error levels in implementation. You may have come across descriptions of algorithms (in terms of operation count) as being linear/polynomial/exponential in the relevant variable n (e.g. the size of a matrix) or having a concrete estimate such as O(n), O(n^2), O(n log n) etc.
In general form: let f be a real- or complex-valued function and g a real-valued function, both defined on some unbounded subset of the positive real numbers, with g(x) strictly positive for all sufficiently large values of x. Then we write f(x) = O(g(x)) as x → ∞ if there exist a positive M ∈ R and an x_0 ∈ R such that |f(x)| ≤ M g(x) for all x ≥ x_0.
In more practical terms, O(g(n)) provides a worst-case estimate for the runtime of an algorithm (in that the runtime is bounded above by a multiple of g(n)). As a concrete example, if f(x) = 5x^3 + 6x + 2 then f = O(x^3), given that the cubic term provides the dominant route towards increase as x → ∞. Similarly, there exists a lower-bound analogue (denoted by Ω). Finally, this framework also provides the means to express an asymptotically tight bound using the Θ-notation, which sets a proportionality relationship between the two relevant functions. In other words, f(x) = Θ(g(x)) means that there exist positive constants c_1, c_2 and x_0 such that 0 ≤ c_1 g(x) ≤ f(x) ≤ c_2 g(x) for all x ≥ x_0.
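The definition can be made concrete for the example f(x) = 5x^3 + 6x + 2: with the (illustrative) witnesses M = 6 and x_0 = 4, the bound |f(x)| ≤ M x^3 holds, since 6x + 2 ≤ x^3 once x is large enough. A quick check:

```python
def f(x):
    return 5 * x**3 + 6 * x + 2

# Witnesses for f = O(x^3): |f(x)| <= M * x^3 for all x >= x0.
M, x0 = 6, 4
ok = all(abs(f(x)) <= M * x**3 for x in range(x0, 10_000))
print(ok)  # True
```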
Additional resources and examples: freeCodeCamp offers some nice background material and visualisations if you are keen to develop your intuition on the topic.

Lecture 9

Error Analysis

In an abstract way, a problem consists of input data of a specific type that are transformed to
output data of a specific type using an algorithm:

Input → Algorithm → Result

Example: solving f : GL(n) × C^n → C^n, f(A, b) = A^{−1}b with GEPP.

The subject of numerical mathematics is to analyse and estimate errors in the result in terms
of errors in the input data and occurring when performing operations:

Input → Algorithm → Output
(with errors entering in the input, within the algorithm, and hence in the output)

The question is: how close is the computed solution to the correct one?
Input Errors: We have two sources in mind,

1. floating point representation of numbers in computers,

2. input data uncertainty, for example parameters obtained from experimental measurements.

Algorithm Errors: Again, there are mainly two sources,

1. rounding errors when performing instructions on a computer, e.g., elementary operations


+, −, ×, / in floating point arithmetic,

2. approximation errors, for example


f′(x) ≈ (f(x + h) − f(x − h)) / (2h),   exp(x) ≈ Σ_{n=0}^N x^n / n!.


Before starting to talk about errors, two definitions: If ξ is an approximation to a datum x,

absolute error: ‖ξ − x‖,  relative error: ‖ξ − x‖ / ‖x‖.
By F ⊂ R we denote the floating point numbers that can be represented on our computing
machine (see Lecture 8).

Assumption 9.1. [3.19]

A1 For all x ∈ R there is an ε ∈ [−ε_m, ε_m] such that

ξ = fl(x) = x(1 + ε).

A2 For each elementary executable operation ∗ ∈ {+, −, ×, /, ·} and every x, y ∈ F there is a δ ∈ [−ε_m, ε_m] with

x ⊛ y = (x ∗ y)(1 + δ)

where ⊛ denotes the computed version of ∗.

In particular, overflow and underflow are neglected.
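Assumption A2 can be observed directly in double precision. Using exact rational arithmetic from Python's fractions module, the sketch below extracts the relative error δ of one computed product (the inputs are arbitrary illustrative floats):

```python
from fractions import Fraction

x, y = 0.1, 0.3                      # floats, i.e. elements of F
computed = x * y                     # the machine operation x ⊛ y
exact = Fraction(x) * Fraction(y)    # exact product of the stored values

# delta such that computed = (x * y)(1 + delta)
delta = (Fraction(computed) - exact) / exact
print(float(delta))   # tiny: |delta| is below the machine precision
```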

Forward Error Analysis


The subject is the analysis of Θ = φ(E) where φ is an algorithm for f. The set E represents input data that cannot be distinguished from x,

E := {ξ | ‖ξ − x‖ ≤ δ‖x‖} with a given accuracy / tolerance δ. (9.1)

The set R will be defined later on.

[Diagram: the exact map f sends x to y ∈ R; the algorithm φ sends a perturbed input ξ ∈ E to the computed output θ.]

Definition 9.2. [3.17] The forward error is defined by:

absolute: ‖φ(ξ) − f(x)‖ = ‖θ − y‖,  relative: ‖φ(ξ) − f(x)‖ / ‖f(x)‖ = ‖θ − y‖ / ‖y‖.

An algorithm is forward stable if

‖θ − y‖ / ‖y‖ = ‖φ(ξ) − f(x)‖ / ‖f(x)‖ = O(ε_m) whenever ‖ξ − x‖ / ‖x‖ = O(ε_m) as ε_m ↘ 0.


A straightforward way towards an error analysis is to take the errors of order ε_m for every input number and every elementary executable operation into account and to drop terms of order o(ε_m) as ε_m → 0.
Example: (Recording available here) Given three numbers x_1, x_2, x_3 ≠ 0 we want to compute f(x_1, x_2, x_3) = x_1x_2/x_3.
A possible algorithm is to first compute the product x_1x_2 and then to divide the result by x_3.
By assumption A1, there are floating point numbers ξ_i and small reals ε_i = O(ε_m) as ε_m → 0 such that ξ_i = x_i(1 + ε_i), i = 1, 2, 3. With assumption A2, the first step of the algorithm yields a number ξ_1ξ_2(1 + δ_1) with some δ_1 = O(ε_m). The second step involves a δ_2 = O(ε_m) associated with the small error due to the division and results in (ξ_1ξ_2/ξ_3)(1 + δ_1)(1 + δ_2) = (ξ_1ξ_2/ξ_3)(1 + δ_1 + δ_2 + O(ε_m^2)), where we used that δ_1δ_2 = O(ε_m^2) = o(ε_m). Together with the erroneous input data the algorithm computes

(x_1, x_2, x_3) ↦ (ξ_1ξ_2/ξ_3)(1 + δ_1 + δ_2 + O(ε_m^2))
= (x_1(1 + ε_1) x_2(1 + ε_2) / (x_3(1 + ε_3)))(1 + δ_1 + δ_2 + O(ε_m^2))
= (x_1x_2/x_3)(1 − ε_3 + O(ε_m^2))(1 + ε_1 + ε_2 + δ_1 + δ_2 + O(ε_m^2))
= (x_1x_2/x_3)(1 + ε_1 + ε_2 − ε_3 + δ_1 + δ_2 + O(ε_m^2)) =: f̃(x).
Hence we obtain for the relative forward error that

|f̃(x) − f(x)| / |f(x)| = |ε_1 + ε_2 − ε_3 + δ_1 + δ_2 + O(ε_m^2)| ≤ 5ε_m + O(ε_m^2) = O(ε_m) as ε_m → 0.
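The estimate can be observed in practice: the sketch below (not from the notes) measures the relative forward error of the two-step algorithm for x_1x_2/x_3 against an exact rational reference. Here the inputs are taken as exactly representable once stored, so only δ_1 and δ_2 contribute:

```python
from fractions import Fraction

def f_float(x1, x2, x3):
    return (x1 * x2) / x3            # product first, then division

x1, x2, x3 = 0.1, 0.7, 0.3
exact = Fraction(x1) * Fraction(x2) / Fraction(x3)
rel_err = abs(Fraction(f_float(x1, x2, x3)) - exact) / exact

eps_m = Fraction(1, 2**52)
print(float(rel_err), rel_err <= 5 * eps_m)   # well within the 5*eps_m budget
```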

One can imagine that this type of error analysis becomes quite tedious when the problems become large and the algorithms complicated. An advantage is that such analyses can in principle be carried out by a computer as well.
Nevertheless, it turns out that the estimates usually are quite rough and, even more importantly, yield no particular insight. As we will see, there are problems which intrinsically lead to relatively large errors when the input data involve small errors, independently of the algorithm used to solve them. The above method of forward error analysis does not account for this so-called conditioning of problems, let alone detect ill-conditioned problems.
A more sophisticated approach that is meanwhile common in numerical linear algebra is backward error analysis. But before turning to this idea we have a look at the notion of conditioning.

Lecture 10

Conditioning

Conditioning indicates how sensitive the solution of a problem f : x ↦ y = f(x) may be to small perturbations in the input data.

[Diagram: f maps the set E of indistinguishable inputs around x to the result set R around y.]

The result set is R = f(E). The conditioning depends on the size of R: the problem f is called well-conditioned if R is small and ill-conditioned if R is rather big.
Example: Consider a 2 × 2 linear system of equations or, as an equivalent geometric problem,
the cut of two lines.
[Figure: two problems f_1 and f_2, each the intersection η of two lines; in f_1 the lines meet at a wide angle, in f_2 at a shallow angle, so the intersection point η_2 moves much more than η_1 under a comparable perturbation.]

Errors in the input data correspond to changes of the lines. Problem f1 is better conditioned
than problem f2 since the change of the intersection point is less drastic when shifting one of
the lines by about the same amount. In this particular example, a small angle between the lines
is deemed unfavourable.
The general goal is to find a significant number to measure how well a problem is conditioned. Consider a simple scalar problem f : R → R. Using the Taylor expansion we have to first order f(x + h) − f(x) ≈ f′(x)h, from which we obtain

(f(x + h) − f(x)) / f(x) ≈ [x f′(x) / f(x)] × (h/x),


which relates the relative output error (on the left-hand side) to the relative input error h/x. A meaningful definition of a condition number (of the problem f at a point x) therefore is

κ_f(x) := x f′(x) / f(x).

For higher dimensional problems f : R^m → R^n this reads

κ_f(x) = ‖J(x)‖ ‖x‖ / ‖f(x)‖

where J(x) is the Jacobian of f and ‖J(x)‖ is a matrix norm induced by the vector norms used for x on C^m and f(x) on C^n.
Example: f(x) = arcsin(x); then f′(x) = 1/√(1 − x²) and κ_f(x) = x/(√(1 − x²) arcsin(x)), with κ_f → ∞ as x ↗ 1, hence the evaluation of arcsin close to x = 1 is an ill-conditioned problem.
Another example of a conditioning argument, for a simple 2 × 2 linear system solving scenario, is provided in the form of video material.
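The blow-up of the condition number for arcsin can be tabulated directly; a short sketch (sample points chosen arbitrarily):

```python
import math

def kappa_arcsin(x):
    # condition number x * f'(x) / f(x) for f = arcsin
    return x / (math.sqrt(1 - x * x) * math.asin(x))

for x in (0.5, 0.9, 0.99, 0.999999):
    print(x, kappa_arcsin(x))
# the values grow without bound as x approaches 1
```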

Conditioning of (SLE)
Consider the problem of computing the matrix-vector product f(A, x) = Ax =: b, x, b ∈ C^n, A ∈ C^{n×n}. Then f′(x) = A, and the condition number becomes κ_f(x) = ‖A‖‖x‖/‖Ax‖ (we omit the discussion of the trivial case x = 0). Assume now that A is invertible and observe that then

‖x‖ = ‖A^{−1}Ax‖ ≤ ‖A^{−1}‖‖Ax‖.

We deduce that

κ_f(x) = ‖A‖ ‖x‖/‖Ax‖ ≤ ‖A‖‖A^{−1}‖.

Similarly, when considering the problem g(A, b) = A^{−1}b =: x we just have to replace A by A^{−1} in the above analysis to obtain the same estimate for the condition number.

Definition 10.1. [3.2] The condition number of a square matrix A ∈ C^{n×n} is

κ(A) = ‖A‖‖A^{−1}‖ if A is regular,  and  κ(A) = ∞ otherwise.

One can show the following:

Proposition 10.2. [3.3] Let A be regular, Ax = b ≠ 0 and A(x + ∆x) = b + ∆b. Then

‖∆x‖ / ‖x‖ ≤ κ(A) ‖∆b‖ / ‖b‖.
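Proposition 10.2 can be illustrated numerically with the ∞-norm and a small matrix whose inverse is written out by the 2 × 2 formula (all concrete choices here are illustrative, not from the notes):

```python
# Illustrative 2x2 example: A and its inverse from the ad - bc formula.
A = [[1.0, 2.0], [3.0, 4.0]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]          # = -2
Ainv = [[ A[1][1] / det, -A[0][1] / det],
        [-A[1][0] / det,  A[0][0] / det]]

def norm_inf(M):
    # matrix infinity norm: maximum absolute row sum
    return max(abs(row[0]) + abs(row[1]) for row in M)

def vnorm_inf(v):
    return max(abs(v[0]), abs(v[1]))

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

kappa = norm_inf(A) * norm_inf(Ainv)        # condition number, here 21

b, db = [1.0, 1.0], [1e-8, -1e-8]
x, dx = matvec(Ainv, b), matvec(Ainv, db)   # x solves Ax = b, dx the change

lhs = vnorm_inf(dx) / vnorm_inf(x)
rhs = kappa * vnorm_inf(db) / vnorm_inf(b)
print(lhs <= rhs)   # True, as Proposition 10.2 guarantees
```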

In order to get an estimate with respect to perturbations of the matrix A we need the following lemma:

Lemma 10.3. [3.4] (and available via recording) Assume that B ∈ C^{n×n} satisfies ‖B‖ < 1. Then I + B is invertible and

‖(I + B)^{−1}‖ ≤ (1 − ‖B‖)^{−1}.


Proposition 10.4. [3.5] Let A be regular, Ax = b ≠ 0 and (A + ∆A)(x + ∆x) = b. If κ(A)‖∆A‖ < ‖A‖ then

‖∆x‖ / ‖x‖ ≤ κ(A) / (1 − κ(A)‖∆A‖/‖A‖) · ‖∆A‖/‖A‖.

Proof. Clearly

‖A^{−1}∆A‖ ≤ ‖A^{−1}‖‖∆A‖ = κ(A)‖∆A‖/‖A‖. (+)

From the assumptions (A + ∆A)∆x = −∆Ax, hence

(I + A^{−1}∆A)∆x = −A^{−1}∆Ax.

By Lemma 10.3 and (+), (I + A^{−1}∆A)^{−1} exists so that

∆x = −(I + A^{−1}∆A)^{−1}A^{−1}∆Ax.

Using (+) again yields the assertion:

‖∆x‖ ≤ (1 − ‖A^{−1}∆A‖)^{−1}‖A^{−1}∆A‖‖x‖ ≤ (1 − κ(A)‖∆A‖/‖A‖)^{−1} κ(A) (‖∆A‖/‖A‖) ‖x‖.

Bringing the results of Propositions 10.2 and 10.4 together one can show

Theorem 10.5. Let A, ∆A ∈ C^{n×n} with A regular and κ(A)‖∆A‖ < ‖A‖, and let b ∈ C^n\{0}, ∆b ∈ C^n. Assume that x ∈ C^n solves Ax = b and x̂ solves (A + ∆A)x̂ = b + ∆b. Then

‖x̂ − x‖ / ‖x‖ ≤ κ(A) / (1 − κ(A)‖∆A‖/‖A‖) (‖∆A‖/‖A‖ + ‖∆b‖/‖b‖).

Hints:

• One can start from Ax̂ = b + ∆b − ∆Ax̂, and subtract Ax and b (given they are equal to one another) from the left- and right-hand sides, respectively.

• Multiply from the left by A^{−1} and use norm properties (separate out terms as much as possible with relevant inequalities). It also helps at this stage to notice that ‖x̂‖ ≤ ‖x‖ + ‖x̂ − x‖ on the right-hand side. The extra terms can then be suitably shifted to the left-hand side and re-cast into a term involving A and ∆A.

• Division by ‖x‖ and using the definition of the condition number κ(A) to replace relevant terms involving A and its inverse leads us to an almost final result.

• At the final stage note one of our assumptions (i.e. κ(A)‖∆A‖ < ‖A‖) to ensure that a suitable division does not become problematic, and we reach our conclusion.

Lecture 11

Backward Error Analysis

Before we can proceed with an error analysis we have to say what an algorithm is. We give a
general definition first.
Definition 11.1. An algorithm is a process consisting of a set of elementary components (called
steps) to derive specific output data from specific input data where the following conditions have
to be fulfilled:
• finiteness: the whole process is described by a finite text,
• effectivity: each step can be executed on a computing machine,
• termination: the process terminates after a finite number of steps,
• determinacy: the course of the process is completely and uniquely prescribed.

Elementary executable operations are the operations {+, −, ×, /, ·}. An algorithm for a map f is a decomposition

f = f^{(l)} ∘ ··· ∘ f^{(1)},  l ∈ N,

into maps f^{(i)} involving at most one elementary executable operation.
Occasionally, a step will contain a reference to another algorithm.
In a realisation of an algorithm, elementary executable operations involve errors (assumption
A2), and we denote by φ(i) a realisation of f (i) so that φ = φ(l) ◦ · · · ◦ φ(1) is a realisation of f .

Backward Error Analysis: The basic idea is to assume that the computed value θ = φ(ξ) is the
exact result for perturbed input data ζ of ξ, i.e. θ = f (ζ). If ζ does not exist the algorithm is
termed not backward stable. If there are multiple choices for ζ (f not injective) we choose ζ
minimising k · k.
[Diagram: the exact map f sends x to y and ζ ∈ Z to θ; the algorithm φ sends ξ ∈ E to θ, so the computed output θ = φ(ξ) equals f(ζ) for a perturbed input ζ.]


Definition 11.2. [3.17],[3.20] The backward error is defined by:

absolute: ‖ζ − x‖,  relative: ‖ζ − x‖ / ‖x‖.

An algorithm is backward stable if

‖ζ − x‖ / ‖x‖ = O(ε_m) as ε_m ↘ 0. (11.1)

This notion of backward stability is useful in the following sense: obtaining the solution with an algorithm involving approximations instead of performing exact computation is converted to exactly solving the problem with perturbed input data. But then the conditioning of the problem can tell us how far the computed solution may deviate from the exact one.
Recall that the condition number was intended to yield an estimate of the form

‖θ − y‖/‖y‖ = ‖φ(ξ) − f(x)‖/‖f(x)‖ = ‖f(ζ) − f(x)‖/‖f(x)‖ ≤ κ_f(x) ‖ζ − x‖/‖x‖.

But this means that

relative forward error ≤ κ_f × relative backward error.

Recall that we finally are interested in estimating the forward error. The above outlined backward error analysis typically yields sharper estimates than the standard forward error analysis and, in addition, splits into a problem-intrinsic part (the conditioning) and an algorithm-intrinsic part (the backward error). For a deeper discussion on error analysis, see [5] (Higham, 2002).
Example 1: The subtraction is backward stable.
The exact version is x = (x_1, x_2)^T, f(x) = x_1 − x_2 = y. With some numbers |ε^{(i)}| ≤ ε_m, i = 1, 2, 3, the computed version is

θ = fl(x_1) ⊖ fl(x_2) = ξ_1 ⊖ ξ_2 = (ξ_1 − ξ_2)(1 + ε^{(3)})
  = (x_1(1 + ε^{(1)}) − x_2(1 + ε^{(2)}))(1 + ε^{(3)}) = x_1(1 + ε^{(4)}) − x_2(1 + ε^{(5)})

where ε^{(4)} = ε^{(1)} + ε^{(3)} + ε^{(1)}ε^{(3)} = O(ε_m) as ε_m ↘ 0, and ε^{(5)} similarly.
Hence f(ζ) = θ for ζ = (x_1(1 + ε^{(4)}), x_2(1 + ε^{(5)}))^T, and we conclude that

‖ζ − x‖ = ‖(ε^{(4)}x_1, ε^{(5)}x_2)^T‖ ≤ (2ε_m + O(ε_m^2))‖x‖.

Example 2: Let x, y ∈ C^n where, for simplicity, we assume that they are not defective. Computing the standard inner product recursively by

f_n(x, y) = ⟨(x_1, …, x_n)^T, (y_1, …, y_n)^T⟩ = x_n y_n + f_{n−1}(x^{n−1}, y^{n−1}),  x^{n−1} := (x_1, …, x_{n−1})^T,  y^{n−1} := (y_1, …, y_{n−1})^T,

is a backward stable algorithm in the sense that

φ_n(x, y) = ⟨ζ, y⟩

with a ζ ∈ C^n satisfying |ζ_k − x_k| ≤ (nε_m + o(ε_m))|x_k| as ε_m → 0, k = 1, …, n.


Proof. The recursive definition of the algorithm for the scalar product invites a proof by induction. If n = 1 we have the ordinary product on C, which is backward stable as can be shown similarly to the subtraction in the previous example.
Let n > 1 and assume that the assertion is true for n − 1. We then have

φ_n(x, y) = (x_n y_n (1 + ε^{(1)}) + φ_{n−1}(x^{n−1}, y^{n−1}))(1 + ε^{(2)})

with |ε^{(i)}| ≤ ε_m, i = 1, 2. By the induction hypothesis there is a ζ̃^{n−1} := (ζ̃_1, …, ζ̃_{n−1})^T ∈ C^{n−1} with |ζ̃_k − x_k| ≤ ((n − 1)ε_m + o(ε_m))|x_k| such that φ_{n−1}(x^{n−1}, y^{n−1}) = f_{n−1}(ζ̃^{n−1}, y^{n−1}). Setting

ζ_n := x_n(1 + ε^{(1)})(1 + ε^{(2)}),  ζ_k := ζ̃_k(1 + ε^{(2)}),  k = 1, …, n − 1,

we obtain φ_n(x, y) = f_n(ζ, y). Furthermore

|ζ_n − x_n| ≤ (2ε_m + O(ε_m^2))|x_n| ≤ (nε_m + o(ε_m))|x_n|,

as well as

|ζ_k − x_k| ≤ |ζ_k − ζ̃_k| + |ζ̃_k − x_k| ≤ ε_m|ζ̃_k| + ((n − 1)ε_m + o(ε_m))|x_k| = (nε_m + o(ε_m))|x_k|,  k = 1, …, n − 1,

which yields the asserted estimate of the relative backward error.
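The recursive inner product f_n can be transcribed directly; the sketch below (not from the notes) runs it in floating point for real data and compares with an exact rational reference:

```python
from fractions import Fraction

def inner(x, y):
    """Recursive inner product f_n from Example 2, evaluated in floats."""
    if not x:
        return 0.0
    return x[-1] * y[-1] + inner(x[:-1], y[:-1])

x = [0.1, 0.2, 0.3, 0.4]
y = [0.4, 0.3, 0.2, 0.1]

computed = inner(x, y)
exact = sum(Fraction(a) * Fraction(b) for a, b in zip(x, y))
budget = len(x) * Fraction(1, 2**52)      # ~ n * eps_m

print(computed, float(abs(Fraction(computed) - exact) / exact))
# the relative error stays within the n*eps_m backward-error budget
```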

Lecture 12

Error Analysis of the Gaussian Elimination

Some notation: If A ∈ C^{m×n} then

|A| := (|a_{i,j}|)_{i,j=1}^{m,n} ∈ C^{m×n},

and inequalities like |B| ≤ |A| for matrices have to be understood as valid for each element.

Theorem 12.1. [5.5] FS and BS are backward stable. The computed solution x̂ satisfies (T + ∆T)x̂ = b with some triangular matrix ∆T satisfying

|∆T| ≤ (nε_m + o(ε_m))|T| as ε_m → 0. (12.1)

Proof. We consider the forward substitution for a unit lower triangular T only. The algorithm can be recursively defined by

x_k = b_k − ⟨l^{k−1}, x^{k−1}⟩

for k = 1, …, n, where

x^{k−1} = (x_1, …, x_{k−1})^T,  l^{k−1} = (l_{k,1}, …, l_{k,k−1})^T

and the l_{i,j} are the entries of T. Using the result for the scalar product, a realisation in floating point arithmetic yields

x̂_k = (b_k − φ_{k−1}(l^{k−1}, x̂^{k−1}))(1 + ε^{(k)}) = (b_k − ⟨l̂^{k−1}, x̂^{k−1}⟩)(1 + ε^{(k)})

with a vector l̂^{k−1} ∈ C^{k−1} such that

|l̂_i^{k−1} − l_i^{k−1}| ≤ ((k − 1)ε_m + o(ε_m))|l_i^{k−1}| ≤ (nε_m + o(ε_m))|l_{k,i}|.

Setting (∆T)_{k,i} := l̂_i^{k−1} − l_i^{k−1} for i = 1, …, k − 1 and k = 2, …, n, and (∆T)_{k,k} := −ε^{(k)}/(1 + ε^{(k)}) (so that |(∆T)_{k,k}| ≤ ε_m + O(ε_m^2)), we indeed have that (T + ∆T)x̂ = b, and (12.1) holds true, too.

Remark: The result (12.1) can be improved to

|∆T| ≤ (nε_m / (1 − nε_m)) |T|.
With respect to the Gaussian elimination, the following result is proved in [5] (Higham, 2002).


Theorem 12.2. [5.6] Assume that the LU factorisation of a matrix A ∈ C^{n×n} exists and denote by L̂, Û the LU factors computed by LU. Then L̂Û = A + ∆A where

|∆A| ≤ (nε_m / (1 − nε_m)) |L̂||Û| ≤ (nε_m + o(ε_m)) |L̂||Û|.
As a consequence of Theorems 12.1 and 12.2 we obtain

Theorem 12.3. [5.7] Assume that the LU factorisation of a matrix A ∈ C^{n×n} exists and denote by L̂, Û the LU factors computed by LU. Then the solution x̂ ∈ C^n of Ax = b computed by GE satisfies

(A + ∆A)x̂ = b

with a matrix ∆A ∈ C^{n×n} satisfying

|∆A| ≤ (3nε_m / (1 − 3nε_m)) |L̂||Û| ≤ (3nε_m + o(ε_m)) |L̂||Û|.
In order to obtain an estimate for the relative backward error (something like ‖∆A‖/‖A‖) it apparently would be sufficient to estimate ‖|L̂||Û|‖ in terms of ‖A‖. To see that this can be problematic consider the matrix

A = [ δ  1        ⇒  L = [ 1    0        U = [ δ  1
      1  1 ]               1/δ  1 ],           0  1 − 1/δ ]

where 0 < δ ≪ 1 is small. The condition number is κ_∞(A) = 4 + O(δ), therefore the problem of solving Ax = b is well-conditioned. But

|L||U| = [ δ  1            ⇒  ‖|L||U|‖_∞ = O(1/δ) as δ → 0.
           1  2/δ − 1 ]

So we have no chance to estimate ‖|L̂||Û|‖_∞ in terms of ‖A‖_∞ = 2. But

PA = [ 1  1        ⇒  L = [ 1  0        U = [ 1  1
       δ  1 ]               δ  1 ],           0  1 − δ ]

with P being the permutation matrix exchanging rows 1 and 2. And then

|L||U| = [ 1  1    = PA,
           δ  1 ]

so pivoting can cure this problem.
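The blow-up of ‖|L||U|‖_∞ without pivoting, and its cure by the row exchange, can be checked numerically; a small sketch (not part of the notes):

```python
def lu_2x2(a):
    """LU factors of a 2x2 matrix without pivoting (requires a[0][0] != 0)."""
    l21 = a[1][0] / a[0][0]
    L = [[1.0, 0.0], [l21, 1.0]]
    U = [[a[0][0], a[0][1]], [0.0, a[1][1] - l21 * a[0][1]]]
    return L, U

def norm_inf_abs_product(L, U):
    """Infinity norm of |L||U| for 2x2 factors."""
    return max(sum(sum(abs(L[i][k]) * abs(U[k][j]) for k in range(2))
                   for j in range(2)) for i in range(2))

delta = 1e-8
A  = [[delta, 1.0], [1.0, 1.0]]
PA = [[1.0, 1.0], [delta, 1.0]]          # rows of A exchanged

big   = norm_inf_abs_product(*lu_2x2(A))    # ~ 2/delta, blows up
small = norm_inf_abs_product(*lu_2x2(PA))   # ~ 2, same size as ||A||_inf
print(big, small)
```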


For a matrix A ∈ Cn×n let L, U denote the LU factors computed by GEPP if all calculations
are carried out exactly. Assume that the permutation matrix P̂ computed by GEPP coincides
with the correct one P . Let b ∈ Cn .

Definition 12.4. [5.9] The growth factor of A is defined by

g_n(A) := ‖U‖_max / ‖A‖_max.


With this definition we obtain that

‖U‖_∞ ≤ n g_n(A) ‖A‖_∞.

Since all the entries of L have modulus at most 1 thanks to the pivoting, we also have that

‖L‖_∞ ≤ n.

Altogether, replacing A by PA = P̂A and using that ‖P‖_∞ = 1 we obtain a backward estimate:

Theorem 12.5. The algorithm GEPP computes a vector x̂ ∈ C^n solving

(A + ∆A)x̂ = b

where ∆A ∈ C^{n×n} satisfies

‖∆A‖_∞ ≤ C n^3 g_n(A) ε_m ‖A‖_∞ (12.2)

with a constant C ≥ 3 independent of n.

To conclude, we need an answer on how big the growth factor gn (A) can become.

Lemma 12.6. [5.10] g_n(A) ≤ 2^{n−1} for all A ∈ C^{n×n}, and this estimate is sharp.

Proof. See [1] (Stuart & Voss notes).

Remark: In view of the exponential growth of g_n(A) in n the stability estimate (12.2) is rather weak. A better estimate for the growth factor is obtained when applying complete pivoting GECP. However, the upper bound in Lemma 12.6 is a worst case and, in practice, the additional effort due to the higher cost of the pivot search in GECP does not pay off. Other, more costly algorithms for solving (SLE) with better stability properties will be discussed later on.
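The sharpness of Lemma 12.6 is classically demonstrated by the matrix with 1s on the diagonal and in the last column and −1s below the diagonal (this worst-case example is not spelled out in the notes). A sketch of GEPP measuring the growth factor:

```python
def growth_factor(a):
    """g_n(A) = ||U||_max / ||A||_max with U computed by GEPP (in-place sketch)."""
    n = len(a)
    a = [row[:] for row in a]
    a_max = max(abs(v) for row in a for v in row)
    for k in range(n - 1):
        # partial pivoting: bring the largest entry of column k into row k
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
    u_max = max(abs(a[i][j]) for i in range(n) for j in range(i, n))
    return u_max / a_max

n = 5
# Worst-case matrix: 1 on the diagonal and in the last column, -1 below.
W = [[1.0 if j == i or j == n - 1 else (-1.0 if j < i else 0.0)
      for j in range(n)] for i in range(n)]
print(growth_factor(W))  # 16.0 = 2**(n-1)
```

Each elimination step doubles the entries in the last column, so the final pivot reaches 2^{n−1}.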

Lecture 13

Computational Cost

Recall Definition 11.1: An algorithm for a map f is a decomposition

f = f^{(l)} ∘ ··· ∘ f^{(1)},  l ∈ N,

into maps f^{(i)} involving at most one elementary executable operation.

Definition 13.1. [4.1] The cost of an algorithm is

C(n) = number of elementary executable operations [= l]

where n is a representative number for the size of the input data.

Example: Consider the following algorithm for the standard matrix-vector product.

Algorithm 5 MatVecStd (standard matrix-vector product)
input: A = (a_{i,j})_{i,j=1}^{m,n} ∈ C^{m×n}, b = (b_i)_{i=1}^n ∈ C^n.
output: x = Ab ∈ C^m.
1: for i = 1 to m do
2:   x_i := 0
3:   for j = 1 to n do
4:     x_i := x_i + a_{i,j} b_j
5:   end for
6: end for

Line 4 involves 1 addition and 1 multiplication ⇒ 2 op. With the loop in line 3 we obtain n × 2 op = 2n op. The assignment in line 2 does not count as an operation, but from the loop in line 1 we get m × 2n op, whence Algorithm 5 has computational cost

C(m, n) = 2mn op.

Any algorithm for the matrix-vector product requires Ω(m) operations as m → ∞, which can be seen from the fact that m values have to be computed.
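Algorithm 5 with an explicit operation counter confirms the 2mn count; a direct transcription in Python (the concrete A and b are illustrative):

```python
def matvec_std(A, b):
    """Algorithm 5, instrumented to count elementary operations."""
    m, n = len(A), len(b)
    ops = 0
    x = []
    for i in range(m):
        xi = 0.0
        for j in range(n):
            xi = xi + A[i][j] * b[j]   # 1 addition + 1 multiplication
            ops += 2
        x.append(xi)
    return x, ops

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # m = 2, n = 3
b = [1.0, 1.0, 1.0]
x, ops = matvec_std(A, b)
print(x, ops)  # [6.0, 15.0] and 2 * m * n = 12 operations
```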
Some remarks:

• Estimating the computation time is not the aim as this depends too much on the computer
architecture. We rather want to get an idea of the complexity of the algorithm for which
the number of operations is a good measure.


• In older books there is a distinction between addition/subtraction (cheapest), multiplication, division, and the square root (most expensive), motivated by the way the operations were carried out by the processors. Nowadays, processors contain look-up tables so that the time to execute each of those operations is basically the same and this distinction is obsolete.

• High performance computers are parallel computers containing multiple processing units.
Issues here are not only the number of operations but also the data exchange and the
balancing of the work load. An algorithm is said to scale optimally if doubling the number
of processors halves the computation time.

Let us turn our attention to the computational cost of solving systems of linear equations with
Gaussian elimination, [5.1], [5.2].

Lemma 13.2. [5.2] LU has cost C_LU(n) = (2/3)n^3 + O(n^2) as n → ∞.

Proof. Recall Algorithm 2.
Line 7: 1 addition and 1 multiplication, 2 op.
Line 6: loop over i, Σ_{i=k+1}^n 2 op = 2(n − k) op.
Line 4: 1 division, 1 op; together with lines 6–7 each j-iteration costs 1 + 2(n − k) op.
Line 3: loop over j, Σ_{j=k+1}^n (2(n − k) + 1) op = (n − k)(2(n − k) + 1) op.
Line 2: loop over k, as n → ∞,

Σ_{k=1}^{n−1} (n − k)(2(n − k) + 1) = 2 Σ_{k=1}^{n−1} (n − k)^2 + Σ_{k=1}^{n−1} (n − k)
= 2 Σ_{l=1}^{n−1} l^2 + O(n^2)
= (2/6)(n − 1)n(2(n − 1) + 1) + O(n^2)
= (2/3)n^3 + O(n^2).
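The exact operation count from the proof can be tallied and compared with the leading-order term (a quick sketch, not from the notes):

```python
def lu_ops(n):
    # sum over k = 1..n-1 of (n-k)(2(n-k)+1), as derived in the proof
    return sum((n - k) * (2 * (n - k) + 1) for k in range(1, n))

n = 1000
exact = lu_ops(n)
leading = 2 * n**3 // 3
print(exact, leading)   # the difference is O(n^2)
```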

It is relatively straightforward to show that the cost of FS or BS is O(n^2), which leads to the following result:

Theorem 13.3. The computational cost of GE is

C_GE(n) ∼ (2/3)n^3 as n → ∞.

The cost of GEPP is the same since the pivoting does not affect the computational cost: exchanging rows does not contribute to the cost according to our definition. But it should be remarked that exchanging data is a big issue on parallel computers and has a huge influence on the overall performance.
The algorithms seen so far have been relatively easy to analyse. Things are getting more interesting when considering divide & conquer algorithms.

Lecture 14

Divide & Conquer Algorithms

Divide & Conquer is an important design paradigm for algorithms. The idea is to break down a problem into sub-problems which are recursively solved until the problems become small enough to be solved directly. Afterwards, the sub-solutions are combined in merging steps to yield the solution to the originating problem.
As an example, we will look at a method to compute the matrix-matrix product which goes back to Strassen (1969). Recall that the standard matrix-matrix product C = AB of two matrices A, B ∈ C^{n×n} computed via

c_{i,j} = Σ_{k=1}^n a_{i,k} b_{k,j},  i, j = 1, …, n,

has a computational cost of C_MMStd(n) = Θ(n^3) as n → ∞. Since n^2 entries are to be computed, any algorithm has cost Ω(n^2) as n → ∞.
Assume, just for convenience, that n = 2^k with some k ∈ N and write

A = [ A_11  A_12   ∈ C^{2^k × 2^k},  A_ij ∈ C^{2^{k−1} × 2^{k−1}},
      A_21  A_22 ]

and analogously for B, C. C = AB then means that

C_11 = A_11 B_11 + A_12 B_21,
C_12 = A_11 B_12 + A_12 B_22,
C_21 = A_21 B_11 + A_22 B_21,
C_22 = A_21 B_12 + A_22 B_22.

Computing C this way involves eight multiplications of 2^{k−1} × 2^{k−1} = n/2 × n/2 matrices. Using recursion, the cost of an algorithm based on this splitting satisfies

C_MMSplit(n) = 8 C_MMSplit(n/2) + n^2

where the n^2 contribution comes from the 4 additions of n/2 × n/2 matrices. One can show that this leads to a cost of Θ(n^3) as n → ∞, and the algorithm is not better than the standard one.


However, defining the seven 2^{k−1} × 2^{k−1} matrices

P1 := (A11 + A22 )(B11 + B22 ),


P2 := (A21 + A22 )B11 ,
P3 := A11 (B12 − B22 ),
P4 := A22 (B21 − B11 ),
P5 := (A11 + A12 )B22 ,
P6 := (A21 − A11 )(B11 + B12 ),
P7 := (A12 − A22 )(B21 + B22 ) (14.1)

one can show that

C11 = P1 + P4 − P5 + P7 ,
C12 = P3 + P5 ,
C21 = P2 + P4 ,
C22 = P1 + P3 − P2 + P6 . (14.2)

Using this recursively as in Algorithm 6 we obtain that the cost satisfies

C_MMStrassen(n) = 7^{k+1} − 6 · 4^k = 7n^{log_2(7)} − 6n^2,  n = 2^k, k ∈ N, (14.3)

and since log_2(7) ≈ 2.807 < 3 we have got an algorithm that asymptotically is of lower cost, Θ(n^{log_2(7)}) as n → ∞, than the previously presented algorithms. Let us briefly prove (14.3) (also available as recording):

Proof. The recursive definition of the Strassen multiplication invites an induction proof. For k = 0 there is only one multiplication of scalars, and formula (14.3) indeed yields one.
Assume now that (14.3) is true for k − 1. As there are 18 additions of 2^{k−1} × 2^{k−1} matrices in (14.1) and (14.2) we obtain the formula

C_MMStrassen(2^k) = 7 C_MMStrassen(2^{k−1}) + 18(2^{k−1})^2,

which yields, using the induction hypothesis,

= 7(7^k − 6 × 4^{k−1}) + 18 × 4^{k−1}
= 7^{k+1} − 24 × 4^{k−1}
= 7^{k+1} − 6 × 4^k.
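The recursion and the closed form (14.3) can be cross-checked for small k (a quick sketch):

```python
# C(1) = 1, C(2^k) = 7*C(2^(k-1)) + 18*(2^(k-1))**2, as in the proof.
def strassen_cost(k):
    if k == 0:
        return 1
    return 7 * strassen_cost(k - 1) + 18 * (2 ** (k - 1)) ** 2

for k in range(8):
    assert strassen_cost(k) == 7 ** (k + 1) - 6 * 4 ** k
print("closed form verified for k = 0..7")
```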

To deal with the case that n is not a power of 2 one may pick k ∈ N such that 2^k ≥ n > 2^{k−1} and then define

Ã = [ A  0   ∈ C^{2^k × 2^k}
      0  0 ]

by adding some blocks containing zeros, and similarly B̃. It turns out that the upper left n × n block of C̃ := ÃB̃ then contains the desired product C = AB. Applying the Strassen multiplication to Ã and B̃ involves a cost that is Θ((2^k)^{log_2(7)}) as k → ∞, but since 2^k ≤ 2n by the choice of k in dependence on n this is Θ(n^{log_2(7)}) as n → ∞. We may summarise the findings in


Theorem 14.1. Using the Strassen multiplication one can construct an algorithm for computing the product of n × n matrices, n ∈ N, with computational cost

C(n) = Θ(n^{log_2(7)}) as n → ∞.

An algorithm developed by Coppersmith and Winograd (1990) scales even better as it has a cost of about Θ(n^{2.376}), but it is not (yet?) practical as its advantage only becomes perceptible for such big n that the corresponding matrices are just too big for even the most modern supercomputers. Any algorithm will have a cost Ω(n^2) as n^2 is the number of elements to be computed.
What is this studying of the exponent useful for? Well, if one can multiply n × n matrices with an asymptotic cost of O(n^α) as n → ∞, α ≥ 2, then it is also possible to invert regular n × n matrices with cost O(n^α). For the proof one may use the Schur complement S = D_22 − D_12^* D_11^{−1} D_12 of a Hermitian matrix

D = [ D_11    D_12
      D_12^*  D_22 ]

and proceed recursively. The assertion on the cost then follows similarly as for the Strassen multiplication, which is why the proof is omitted.

Algorithm 6 MMStrassen (Strassen matrix-matrix multiplication)


input: A, B ∈ Cn×n with n = 2k for some k ∈ N.
output: C = AB ∈ Cn×n .
1: if n = 1 then
2: return C := AB
3: else
4: compute P1 , . . . , P7 as defined in (14.1) using recursion for the matrix-matrix products
5: compute C11 , C12 , C21 , C22 according to (14.2)
6: return C
7: end if
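Algorithm 6 can be sketched in a few lines of pure Python on nested lists. The helper names (`add`, `sub`, `split`) are my own, and the labelling of the seven products P₁, …, P₇ follows the standard Strassen formulas, whose ordering may differ from the one used in (14.1) of these notes.

```python
def add(X, Y):
    """Entry-wise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    """Entry-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(X):
    """Split a 2m x 2m matrix into four m x m blocks."""
    m = len(X) // 2
    return ([row[:m] for row in X[:m]], [row[m:] for row in X[:m]],
            [row[:m] for row in X[m:]], [row[m:] for row in X[m:]])

def strassen(A, B):
    """Strassen multiplication of n x n matrices with n a power of 2."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # the seven recursive products
    P1 = strassen(add(A11, A22), add(B11, B22))
    P2 = strassen(add(A21, A22), B11)
    P3 = strassen(A11, sub(B12, B22))
    P4 = strassen(A22, sub(B21, B11))
    P5 = strassen(add(A11, A12), B22)
    P6 = strassen(sub(A21, A11), add(B11, B12))
    P7 = strassen(sub(A12, A22), add(B21, B22))
    # assemble the four blocks of C from the 18 additions/subtractions
    C11 = add(sub(add(P1, P4), P5), P7)
    C12 = add(P3, P5)
    C21 = add(P2, P4)
    C22 = add(sub(add(P1, P3), P2), P6)
    return ([r1 + r2 for r1, r2 in zip(C11, C12)]
            + [r1 + r2 for r1, r2 in zip(C21, C22)])
```

For matrices whose size is not a power of 2 one would first pad with zero blocks as described above.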

Lecture 15

Least Squares Problems

Definition 15.1. Given a matrix A ∈ Rm×n and a vector b ∈ Rm , the least squares problem
LSQ consists of minimising the function
1
g : Rn → R, g(x) = kAx − bk22 .
2
Example: Recall the linear regression problem. Given points (ξ_i, y_i), i = 1, …, m, find a linear
function ξ ↦ x₁ + x₂ξ such that

  g(x) = ½ Σ_{i=1}^m (x₁ + x₂ξ_i − y_i)²  is minimal.

In this case,

  A = ( 1 ξ₁ ; ⋮ ⋮ ; 1 ξ_m ),  b = ( y₁ ; ⋮ ; y_m ).

Theorem 15.2. [7.1] x ∈ ℝⁿ solves the least squares problem if
and only if Ax − b ⊥ range(A), which is the case if and only if the normal equation

AT Ax = AT b (7.1)

is satisfied.


Proof. If x minimises g then for all y ∈ ℝⁿ

  0 = d/dε g(x + εy)|_{ε=0}
    = d/dε [ ½⟨Ax + εAy − b, Ax + εAy − b⟩ ]|_{ε=0}
    = d/dε [ ½‖Ax − b‖₂² + ε⟨Ay, Ax − b⟩ + ½ε²‖Ay‖₂² ]|_{ε=0}
    = ⟨Ay, Ax − b⟩,

which means that Ax − b ⊥ range(A). The other way round, if Ax − b ⊥ range(A) then ⟨Ax −
b, Ay − Ax⟩ = 0 for all y ∈ ℝⁿ, hence with Pythagoras

  2g(y) = ‖Ay − b‖₂² = ‖Ay − Ax‖₂² + ‖Ax − b‖₂² ≥ ‖Ax − b‖₂² = 2g(x).

For the second assertion we use that Ax − b ⊥ range(A) ⇔ Ax − b ⊥ a_i, where the a_i, i = 1, …, n,
are the column vectors of A. But this is equivalent to

  (⟨a_i, Ax⟩)_{i=1}^n = (⟨a_i, b⟩)_{i=1}^n ⇔ (7.1).

Example: In the linear regression problem the normal equation is a 2 × 2 system where

  AᵀA = ( m  Σ_i ξ_i ; Σ_i ξ_i  Σ_i ξ_i² ),  Aᵀb = ( Σ_i y_i ; Σ_i ξ_i y_i ).
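This 2 × 2 system can be solved directly. The following pure-Python sketch (function name and the use of Cramer's rule are my own choices) assembles and solves the normal equation for the regression example:

```python
def fit_line(xs, ys):
    """Least squares fit of y ~ x1 + x2*xi via the 2x2 normal equation
    A^T A x = A^T b for the linear regression problem."""
    m = len(xs)
    s1 = sum(xs)                              # sum of xi_i
    s2 = sum(x * x for x in xs)               # sum of xi_i^2
    t0 = sum(ys)                              # sum of y_i
    t1 = sum(x * y for x, y in zip(xs, ys))   # sum of xi_i * y_i
    # Solve [[m, s1], [s1, s2]] x = [t0, t1] by Cramer's rule.
    det = m * s2 - s1 * s1
    x1 = (t0 * s2 - t1 * s1) / det
    x2 = (m * t1 - s1 * t0) / det
    return x1, x2
```

For data lying exactly on a line, e.g. y = 1 + 2ξ, the fit recovers the coefficients exactly.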

It is certainly possible to use (7.1) to solve LSQ, for example one could use Cholesky since AT A
is positive definite provided that A has full rank. But, as we will see later on, we have

κ2 (AT A) = (κ2 (A))2

for the condition number, and even the condition number κ2 (A) (which is not yet defined for
m 6= n) can be big in practical applications. There are better approaches, based on the QR
factorisation (later) or on the singular value decomposition that we consider next.

Singular Value Decomposition (SVD) [2.3]


This is a factorisation of the form A = U ΣV T where U ∈ Rm×m and V ∈ Rn×n are orthogonal
matrices and Σ ∈ Rm×n is a diagonal matrix with entries σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0 where
p = min(m, n). Those values σi are called singular values.

[Figure: the full SVD A = UΣVᵀ, with left singular vectors u₁, …, u_m as columns of U, the singular values σ₁, …, σ_p on the diagonal of Σ, and right singular vectors v₁, …, v_n as columns of V.]

44
LECTURE 15. LEAST SQUARES PROBLEMS

Denoting by u_i, i = 1, …, m, the column vectors of U and by v_i, i = 1, …, n, the column
vectors of V, the SVD says that Av_i = σ_i u_i, i = 1, …, p.
In the case m ≥ n (= p), which is of most interest to us, the reduced singular value decomposition
is defined by dropping the last m − n columns of U and the last m − n rows of Σ. Defining
Û := (u₁, …, u_n) ∈ ℝ^(m×n) and Σ̂ := diag(σ₁, …, σ_n) ∈ ℝ^(n×n) we then have that A = ÛΣ̂Vᵀ.
[Figure: the reduced SVD A = ÛΣ̂Vᵀ with the columns u₁, …, u_n of Û.]

The SVD is a very powerful decomposition for analytical purposes as it provides much insight
into the properties of the linear map associated with the matrix. The rank is the number of
non-vanishing singular values, i.e., the biggest integer r ≤ p such that σr > 0 but σr+1 = 0 (if
r + 1 ≤ p). Moreover, the image of A is spanned by the first r left singular vectors,
range(A) = span{u1 , . . . , ur }
and the kernel is
kernel(A) = span{vr+1 , . . . , vp }.
In addition to this algebraic information the SVD also reveals some geometric information.
Recalling the identities Av_i = σ_i u_i, i = 1, …, p, we see that the unit sphere in ℝⁿ containing
the vectors v_i is mapped to an ellipsoid whose semi-axes point in the directions of the u_i and
have lengths σ_i.
[Figure: A maps the unit circle spanned by v₁, v₂ to an ellipse with semi-axes σ₁u₁ and σ₂u₂.]

Example: Images can be compressed using the SVD. Try the following matlab code (it’s
probably best to put it into an m-file):
load [Link];
figure;
image(X);
colormap(’gray’);
pause
[U,S,V] = svd(X);
figure;
for k=[Link]
image(U(:,1:k)*S(1:k,1:k)*V(:,1:k)’);
colormap(’gray’);
disp(k)
disp(’Reduction: ’)
disp(521*k / 64000)
pause
end

Lecture 16

More on the Singular Value


Decomposition

Theorem 16.1. [2.12] Every matrix A ∈ Rm×n has a SVD, and the singular values are uniquely
determined.
Proof. We first show the existence by induction on p = min(m, n).
For p = 1 we may choose u = 1, and if a11 6= 0 we may set v = a11 /|a11 | and σ1 = |a11 | whilst
in the case a11 = 0 we choose v = 1 and σ1 = 0.
Let now p > 1 and assume without loss of generality that A ≠ 0 (since the SVD is trivial
otherwise, with U and V being identity matrices and Σ = 0). The map x ↦ ‖Ax‖₂ on ℝⁿ is
continuous. When restricted to the compact unit sphere S^(n−1) = {x ∈ ℝⁿ : ‖x‖₂ = 1} it attains
its maximum at some point which we denote by v₁. Further, we define

  σ₁ := ‖Av₁‖₂ = max_{x ∈ S^(n−1)} ‖Ax‖₂ = ‖A‖₂.

Since A ≠ 0 we have that σ₁ > 0 and we can define

  u₁ := (1/σ₁) Av₁.
Let us now extend v1 to an orthonormal basis (ONB) {v1 , . . . , vn } of Rn , and u1 to an ONB
{u1 , . . . , um } of Rm . The matrices U1 := (u1 , . . . , um ) ∈ Rm×m and V1 := (v1 , . . . , vn ) ∈ Rn×n
then are orthogonal. Let
  C := U₁ᵀAV₁ =: ( σ₁ wᵀ ; 0 B )

with w ∈ ℝ^(n−1) and B ∈ ℝ^((m−1)×(n−1)), where the zeros in the first column below the diagonal arise
from the fact that Av₁ = σ₁u₁ is orthogonal to the columns u₂, …, u_m of U₁. Then

  ‖C‖₂ = max_{‖x‖₂=1} ‖U₁ᵀAV₁x‖₂ = max_{‖V₁x‖₂=1} ‖AV₁x‖₂ = ‖A‖₂ = σ₁.

Since

  ‖C‖₂ ‖(σ₁, w)ᵀ‖₂ ≥ ‖C(σ₁, w)ᵀ‖₂ = ‖(σ₁² + ‖w‖₂², Bw)ᵀ‖₂ ≥ σ₁² + ‖w‖₂² = √(σ₁² + ‖w‖₂²) ‖(σ₁, w)ᵀ‖₂

we conclude that

  σ₁ = ‖C‖₂ ≥ √(σ₁² + ‖w‖₂²)  ⇒  w = 0.


By the induction hypothesis there is a singular value decomposition B = U₂Σ₂V₂ᵀ of the
(m − 1) × (n − 1) matrix B. Writing Σ₂ = diag(σ₂, …, σ_p) we observe that

  σ₁ = max_{‖x‖₂=1} ‖Cx‖₂ ≥ max_{‖y‖₂=1, y ∈ ℝ^(n−1)} ‖C(0, y)ᵀ‖₂ = max_{‖y‖₂=1} ‖By‖₂ = σ₂.

Therefore

  A = U₁CV₁ᵀ = [U₁( 1 0 ; 0 U₂ )] ( σ₁ 0 ; 0 Σ₂ ) [( 1 0 ; 0 V₂ᵀ )V₁ᵀ] =: UΣVᵀ

is a SVD of A.
Coming to the second claim, we remark that for any SVD A = UΣVᵀ

  ‖A‖₂ = max_{‖x‖₂=1} ‖UΣVᵀx‖₂ = max_{‖Vᵀx‖₂=1} ‖ΣVᵀx‖₂ = max_{‖y‖₂=1} ‖Σy‖₂ = σ₁.

Using an induction argument again one can show the uniqueness of the singular values.
Corollary 16.2. kAk2 = σ1 .
Examples:
1. A symmetric matrix A ∈ Rn×n can be diagonalised in the form A = QΛQT with Q ∈ Rn×n
orthogonal and Λ = diag(λ1 , . . . , λn ) ∈ Rn×n . Without loss of generality we may assume
that |λ1 | ≥ · · · ≥ |λn |. Otherwise perform a similarity transformation of Λ with an
appropriate permutation matrix which then is absorbed into Q. Denote the columns of Q
by q1 , . . . , qn . A SVD of A is obtained by setting U := (u1 , . . . , un ) where ui = sign(λi )qi ,
Σ = diag(|λ1 |, . . . , |λn |), and V := Q.
2. Let A ∈ Rm×n with m ≥ n (so that p = min(m, n) = n). The matrix AT A ∈ Rn×n has
eigenvalues σi2 with corresponding eigenvectors vi , i = 1, . . . , p. The matrix AAT ∈ Rm×m
has eigenvectors {u1 , . . . , um } with corresponding eigenvalues {σ12 , . . . , σp2 , 0, . . . , 0} (with
m − n zeros).
To see this, let us consider the latter case as an example. We have that

  AAᵀ = UΣVᵀVΣᵀUᵀ ⇒ AAᵀU = UΣΣᵀ,

but ΣΣᵀ = diag(σ₁², …, σ_p², 0, …, 0) with exactly m − n zeros.
3. Assume now that A ∈ ℝ^(n×n) is regular. Then the matrix

  H := ( 0 Aᵀ ; A 0 )

has the 2n eigenvalues {σ₁, −σ₁, σ₂, −σ₂, …, σ_n, −σ_n} with corresponding eigenvectors
(v_i, u_i)ᵀ and (v_i, −u_i)ᵀ, i = 1, …, n.
To show this, assume that Hx = λx for some λ ∈ ℝ and some x ∈ ℝ^(2n)\{0}. Writing
x = (y, z)ᵀ with y, z ∈ ℝⁿ this means that

  Aᵀz = λy,  AAᵀz = λAy = λ²z,
  Ay = λz,  AᵀAy = λAᵀz = λ²y.

From the previous example we know that the eigenvalues of AᵀA and AAᵀ are {σ₁², …, σ_n²},
where σ_n > 0 by the regularity of A. Hence λ = ±σ_i for some i ∈ {1, …, n}. That (v_i, ±u_i)ᵀ
is a corresponding eigenvector is easy to show.
Remark 16.3. The above results hold true analogously for complex matrices if the transposed
matrices are replaced by the adjoint matrices.


Computation of the SVD (remarks only, [7.3])


The idea is to exploit the last example, which yields a relation between the eigenspaces of the
matrix H = ( 0 Aᵀ ; A 0 ) and the SVD of A. Later on in this course we will see methods for
computing eigenvalues and eigenvectors. These will be iterative methods requiring a matrix-vector
multiplication per step as the most costly ingredient (O(n²) as n → ∞). The matrix A therefore
is preprocessed: by applying some similarity transformations (that preserve the eigenvalues!)
it is transformed into bidiagonal form so that the matrix-vector multiplication with H has cost
≤ 6n. In [4] the total cost for computing a SVD of an m × n matrix is stated to be

  C_SVD(m, n) = 2mn² + 11n³ + Θ(mn + n²) as m, n → ∞.

Lecture 17

Conditioning of LSQ

Recall the normal equation (7.1) AT Ax = AT b for a solution to LSQ.
Introducing the pseudo-inverse or Moore-Penrose inverse

A† := (AT A)−1 AT

the solution is just x = A† b, provided A† exists. This is the case if and only if A has full rank
which is equivalent to AT A being regular.

Definition 17.1. [3.7] The condition number of a matrix A ∈ ℂ^(m×n) with respect to a norm ‖·‖
is

  κ(A) = ‖A‖ ‖A†‖ if A has full rank,  and κ(A) = ∞ otherwise.

We remark that A† = A−1 if m = n so that the above definition is consistent with the previous
one for n × n matrices.

Lemma 17.2. If A has full rank: κ2 (A) = σ1 /σn where σ1 and σn are the biggest and smallest
singular value, respectively.

Proof. From Corollary 16.2 we know that ‖A‖₂ = σ₁. Writing A = UΣVᵀ for a SVD, a short
calculation results in a SVD

  A† = V Σ̃ Ũᵀ (17.1)

with Σ̃ = diag(1/σ_n, …, 1/σ₁) (the singular vectors being reordered accordingly), so that
‖A†‖₂ = 1/σ_n.

Moreover ‖b‖₂² = ‖b − Ax‖₂² + ‖Ax‖₂² ≥ ‖Ax‖₂², which motivates introducing the angle θ ∈ [0, π/2]
via

  cos(θ) = ‖Ax‖₂ / ‖b‖₂.

To provide a geometric interpretation, this is the angle between b and the range of A:


[Figure: b, its orthogonal projection Ax onto range(A), and the angle θ between them.]

Theorem 17.3. Assume that x ≠ 0 solves LSQ for data (A, b) and x + Δx for (A, b + Δb).
Then

  ‖Δx‖₂/‖x‖₂ ≤ (κ₂(A)/(η cos(θ))) ‖Δb‖₂/‖b‖₂

where η := ‖A‖₂‖x‖₂/‖Ax‖₂ ≥ 1.

Proof. Assume that A has full rank (otherwise the assertion is trivial). From the assumptions
we furthermore have that Δx = A†Δb. Therefore

  ‖Δx‖₂/‖x‖₂ ≤ ‖A†‖₂‖Δb‖₂/‖x‖₂ = κ₂(A)‖Δb‖₂/(‖A‖₂‖x‖₂) = κ₂(A)‖Δb‖₂/(η‖Ax‖₂) = (κ₂(A)/(η cos(θ))) ‖Δb‖₂/‖b‖₂.

Imagine now that b is (almost) orthogonal to range(A), which means that the solution is close
to zero: ‖x‖ is small. A small error in the data b then may lead to a small absolute deviation in
the solution but can also lead to a big relative error in the solution. Example:

  A = ( 1 ; 1 ) ∈ ℝ^(2×1),  b = ( −1 + δ ; 1 );  x = δ/2.

Now, think of a small error in the data b, Δb = (ε, 0)ᵀ; then Δx = ε/2. Hence, ‖Δx‖₂/‖x‖₂ = ε/δ
may be large.
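This effect is easy to reproduce numerically. In the sketch below (pure Python; the values δ = 10⁻⁶ and ε = 10⁻⁸ and the function name are my own illustrative choices) we use that for A = (1, 1)ᵀ the least squares solution is simply the mean of the two entries of b:

```python
delta, eps = 1e-6, 1e-8

def lsq_column_of_ones(b):
    """Minimiser of ||(1,1)^T x - b||_2^2: the mean of the entries of b."""
    return (b[0] + b[1]) / 2.0

x  = lsq_column_of_ones([-1.0 + delta, 1.0])        # = delta/2
xp = lsq_column_of_ones([-1.0 + delta + eps, 1.0])  # data perturbed by (eps, 0)
rel_err = abs(xp - x) / abs(x)                      # = eps/delta
```

Although the absolute change eps/2 is tiny, the relative error eps/delta = 0.01 is four orders of magnitude larger.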

Solving LSQ using the SVD [7.3]

Algorithm 7 LSQ SVD (solving LSQ using a SVD)


input: A ∈ Rm×n with m ≥ n = rank(A), b ∈ Rm .
output: x ∈ ℝⁿ minimiser of g(x) = ½‖Ax − b‖₂².
1: compute the reduced SVD A = Û Σ̂V T
2: compute y = Û T b
3: solve Σ̂z = y
4: return x = V z

The thus computed x satisfies

  AᵀAx = VΣ̂ᵀÛᵀÛΣ̂Vᵀx = VΣ̂ᵀΣ̂z = VΣ̂ᵀy = VΣ̂ᵀÛᵀb = Aᵀb  (using ÛᵀÛ = I)


which is the normal equation (7.1) and, hence, x indeed solves LSQ.
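Algorithm 7 can be traced on a small example. Since computing an SVD is itself nontrivial, the pure-Python sketch below builds A from a hand-picked reduced SVD (the particular Û, Σ̂, V and the helper names are fabricated for illustration) and then runs the three steps of LSQ_SVD:

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

# A hand-built reduced SVD A = U_hat * diag(sigma) * V^T (values illustrative).
c, s = 3 / 5, 4 / 5
V = [[c, -s], [s, c]]                          # 2x2 rotation, orthogonal
U_hat = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # orthonormal columns in R^3
sigma = [3.0, 1.0]

Vt = transpose(V)
A = [[sum(U_hat[i][k] * sigma[k] * Vt[k][j] for k in range(2))
      for j in range(2)] for i in range(3)]

x_true = [2.0, -1.0]
b = matvec(A, x_true)
b[2] += 5.0   # component orthogonal to range(A); the minimiser is unchanged

# Algorithm LSQ_SVD: y = U_hat^T b, solve Sigma z = y, x = V z.
y = matvec(transpose(U_hat), b)
z = [yi / si for yi, si in zip(y, sigma)]
x = matvec(V, z)
```

The recovered x agrees with x_true: the part of b orthogonal to range(A) is discarded by the projection y = Ûᵀb.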
The cost of LSQ SVD is dominated by computing a reduced SVD of A as the subsequent steps
essentially are matrix-vector multiplications only. According to [4] we have

  C_LSQ_SVD(m, n) = 2mn² + 11n³ + Θ(mn + n²) as m, n → ∞.

This is more than for solving the normal equation with, for instance, GEPP (or Cholesky, a
special version for positive definite matrices). The benefit is better stability. Example:

  A = ( 1 1 ; δ 0 ; 0 δ ),  b = ( 2 ; δ ; δ );  x = ( 1 ; 1 ).

When using the normal equation we encounter problems for much larger values of δ than when
employing the method based on the SVD.
We will next learn about a method in between the two presented methods with respect to
stability and cost.

Lecture 18

QR factorisation



Let A ∈ ℝ^(m×n) with m ≥ n and denote the column vectors by a_i, i = 1, …, n.
To orthonormalise the columns of A one may proceed as follows:

  q̂₁ := a₁,  q₁ := q̂₁/r₁,₁  with r₁,₁ := ‖q̂₁‖₂,
  q̂₂ := a₂ − ⟨q₁, a₂⟩q₁  with r₁,₂ := ⟨q₁, a₂⟩,  q₂ := q̂₂/r₂,₂  with r₂,₂ := ‖q̂₂‖₂,
  q̂₃ := a₃ − ⟨q₁, a₃⟩q₁ − ⟨q₂, a₃⟩q₂  with r₁,₃ := ⟨q₁, a₃⟩, r₂,₃ := ⟨q₂, a₃⟩,  q₃ := q̂₃/r₃,₃  with r₃,₃ := ‖q̂₃‖₂,
  ⋮
  q̂_n := a_n − Σ_{j=1}^{n−1} ⟨q_j, a_n⟩q_j  with r_{j,n} := ⟨q_j, a_n⟩,  q_n := q̂_n/r_{n,n}  with r_{n,n} := ‖q̂_n‖₂.

This is the so-called Gram-Schmidt orthonormalisation. Equivalently to the above formulas, we
can write

  a_i = Σ_{k=1}^i q_k r_{k,i},  i = 1, …, n,  ⇔  A = Q̂R̂ (18.1)

where

  Q̂ = (q₁, …, q_n) ∈ ℝ^(m×n)  and  R̂ = ( r₁,₁ r₁,₂ ⋯ r₁,ₙ ; 0 r₂,₂ ⋱ ⋮ ; ⋮ ⋱ ⋱ r_{n−1,n} ; 0 ⋯ 0 r_{n,n} ) ∈ ℝ^(n×n).

A factorisation of the form (18.1) with Q̂ having orthonormal column vectors and R̂ upper
triangular is called reduced QR factorisation. Extending {q₁, …, q_n} by vectors {q_{n+1}, …, q_m}
to an orthonormal basis of ℝᵐ, defining Q = (q₁, …, q_m) ∈ ℝ^(m×m) and extending R̂ with an
(m − n) × n block of zeros to

  R := ( R̂ ; 0 ) ∈ ℝ^(m×n)


we obtain a
QR factorisation: A = QR. (18.2)
Theorem 18.1. Every matrix A ∈ Rm×n
with m ≥ n can be factorised in the form A = QR
with Q ∈ Rm×m orthogonal and R ∈ Rm×n upper triangular.
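The Gram-Schmidt construction above translates directly into code. A minimal pure-Python sketch of the reduced QR factorisation (function name mine; it assumes A has full column rank, as the division by r_{j,j} requires):

```python
from math import sqrt

def gram_schmidt_qr(A):
    """Reduced QR factorisation of an m x n matrix A (list of rows, m >= n,
    full column rank) via classical Gram-Schmidt as in (18.1)."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]  # columns a_1..a_n
    Q = []                                # orthonormal columns q_1..q_n
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = sum(qk * aj for qk, aj in zip(Q[k], cols[j]))  # <q_k, a_j>
            v = [vi - R[k][j] * qk for vi, qk in zip(v, Q[k])]
        R[j][j] = sqrt(sum(vi * vi for vi in v))              # ||q_hat_j||_2
        Q.append([vi / R[j][j] for vi in v])
    # assemble Q_hat as an m x n matrix whose columns are the q_j
    Qmat = [[Q[j][i] for j in range(n)] for i in range(m)]
    return Qmat, R
```

As noted below, this classical variant is not what one uses in practice for stability reasons, but it reproduces A = Q̂R̂ on well-conditioned examples.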
Solving LSQ using the QR factorisation [7.2]

Algorithm 8 LSQ QR (solving LSQ using a QR factorisation)


input: A ∈ Rm×n with m ≥ n = rank(A), b ∈ Rm .
output: x ∈ ℝⁿ minimiser of g(x) = ½‖Ax − b‖₂².
1: compute a reduced QR factorisation A = Q̂R̂
2: compute y = Q̂T b ∈ Rn
3: solve R̂x = y with BS

The thus computed x satisfies

  AᵀAx = R̂ᵀQ̂ᵀQ̂R̂x = R̂ᵀR̂x = R̂ᵀy = R̂ᵀQ̂ᵀb = Aᵀb  (using Q̂ᵀQ̂ = I)

which is the normal equation (7.1) and, hence, x indeed solves LSQ.
However, in practice, Gram-Schmidt is not used for computing the (reduced) QR factorisation,
mainly for stability reasons (though there are stabilised variants). Alternative (often better)
methods employing reflections or rotations are utilised instead.

Householder reflections
A QR factorisation may be computed by transforming the matrix to upper triangular form by
successively multiplying with appropriate orthogonal matrices as follows:
     
∗ ··· ··· ··· ∗ ∗ ··· ··· ··· ∗ ∗ ··· ··· ··· ∗
0 ∗ · · · · · · ∗ 0 ∗ · · · · · · ∗ 0 ∗ ··· ··· ∗
     
. . . . . .. .. .. 
 .. .. ..   .. 0  ..
 
   ∗ · · · ∗   . . . 
. . . . . . . . .. .. .. 
. . . . . . .  ..
 
Q1 ·()  .
 . .  Q2 ·()  .
 . . . . . .
A −→  . ..  −→  .. ..  → · · · →  ..
  
.. .. .. ..
 ..

. . 
.
 . . .  . . ∗
 .. .. ..   .. .. .. ..   ..
  

. . . . . . .
   . 0
 .. .. ..   .. .. .. ..   .. .. 
  
. . .   . . . .  . .
0 ∗ ··· ··· ∗ 0 0 ∗ ··· ∗ 0 ··· ··· ··· 0
| {z } | {z } | {z }
=:R(1) =:R(2) =:R(n)
(18.3)
so that R = Qn · · · Q1 A ⇔ A = QR with Q = QT1 · · · QTn . For the orthogonal matrices, reflections
(k−1) (k−1)
may be used. Consider the vector u(k) := (rk,k , . . . , rm,k )T ∈ Rm−k+1 . In step k, we want
(k)
to reflect it to −sign(u1 )ku(k) k2 e1 where e1 denotes the first standard basis vector in Rm−k+1 .
An appropriate reflection matrix is given by
Hk := Im−k+1 − 2v (k) (v (k) )T ∈ R(m−k+1)×(m−k+1)
v̂ (k) (k)
where v (k) := (k)
with v̂ (k) := sign(u1 )ku(k) k2 e1 + u(k)
kv̂ k2


where Im−k+1 denotes the identity matrix in R(m−k+1)×(m−k+1) .

[Figure: the Householder reflection maps u to −sign(u₁)‖u‖₂e₁; the reflection plane is orthogonal to v.]

Indeed, using that

  ‖v̂^(k)‖₂² = (sign(u^(k)₁)‖u^(k)‖₂e₁ + u^(k))ᵀ(sign(u^(k)₁)‖u^(k)‖₂e₁ + u^(k)) = 2(sign(u^(k)₁)u^(k)₁‖u^(k)‖₂ + ‖u^(k)‖₂²)

we obtain that

  H_k u^(k) = u^(k) − 2 v̂^(k)(v̂^(k))ᵀu^(k)/‖v̂^(k)‖₂²
           = u^(k) − (2/‖v̂^(k)‖₂²)(sign(u^(k)₁)u^(k)₁‖u^(k)‖₂ + ‖u^(k)‖₂²) v̂^(k)
           = u^(k) − v̂^(k) = −sign(u^(k)₁)‖u^(k)‖₂e₁.

Moreover, (H_k)⁻¹ = (H_k)ᵀ = H_k is orthogonal, so that

  Q_k := ( I_{k−1} 0 ; 0 H_k ) ∈ ℝ^(m×m)

is orthogonal, too, and this is an appropriate matrix to be used in (18.3).

Algorithm 9 QR HH (computing a QR factorisation with Householder reflections)


input: A = (a_{i,j}) ∈ ℝ^(m×n) with m ≥ n and full rank.
output: Q ∈ ℝ^(m×m) orthogonal, R ∈ ℝ^(m×n) upper triangular with A = QR.
1: R^(0) := A, Q^(0) := I_m.
2: for k = 1 to n − 1 (to n if m > n) do
3:   u^(k) := (r^(k−1)_{k,k}, …, r^(k−1)_{m,k})ᵀ ∈ ℝ^(m−k+1)
4:   v̂^(k) := sign(u^(k)₁)‖u^(k)‖₂e₁ + u^(k) ∈ ℝ^(m−k+1)
5:   v^(k) := v̂^(k)/‖v̂^(k)‖₂
6:   H_k := I_{m−k+1} − 2v^(k)(v^(k))ᵀ ∈ ℝ^((m−k+1)×(m−k+1))
7:   Q_k := diag(I_{k−1}, H_k) ∈ ℝ^(m×m)
8:   R^(k) := Q_k R^(k−1)
9:   Q^(k) := Q^(k−1) Q_k
10: end for
11: return Q^(n−1) and R^(n−1) (Q^(n) and R^(n) if m > n)

A reduced QR factorisation is obtained by dropping the last m − n columns of Q and the last
m − n rows of R (which contain zeros only).
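A pure-Python sketch of QR_HH follows (function name mine). Instead of forming the matrices Q_k explicitly, it applies H_k = I − 2v^(k)(v^(k))ᵀ to the trailing block of R and accumulates Q = Q₁ᵀ⋯Q_nᵀ on the fly, which is how the method is usually organised:

```python
from math import sqrt, copysign

def householder_qr(A):
    """QR factorisation via Householder reflections (sketch of QR_HH).
    A is an m x n list of rows with m >= n; returns (Q, R) with A = Q R."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(n if m > n else n - 1):
        # u = (r_kk, ..., r_mk)^T and v_hat = sign(u_1)||u||_2 e_1 + u
        u = [R[i][k] for i in range(k, m)]
        norm_u = sqrt(sum(ui * ui for ui in u))
        if norm_u == 0.0:
            continue
        v = u[:]
        v[0] += copysign(norm_u, u[0])
        norm_v = sqrt(sum(vi * vi for vi in v))
        v = [vi / norm_v for vi in v]
        # R <- H_k R: update the trailing block column by column
        for j in range(n):
            s = 2.0 * sum(v[i] * R[k + i][j] for i in range(m - k))
            for i in range(m - k):
                R[k + i][j] -= s * v[i]
        # Q <- Q Q_k (Q_k is symmetric, so this accumulates Q_1^T ... Q_n^T)
        for i in range(m):
            s = 2.0 * sum(Q[i][k + t] * v[t] for t in range(m - k))
            for t in range(m - k):
                Q[i][k + t] -= s * v[t]
    return Q, R
```

On the worked example of the next lecture (A = (3 3 2; 4 4 1; 0 6 2)) this reproduces R = (−5 −5 −2; 0 −6 −2; 0 0 1).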

Lecture 19

QR factorisation with Householder


reflections - continued

We will have a closer look at Algorithm 9 (QR_HH), discussing its computational complexity and
error analysis. But first, consider a concrete example with the following data:
 
3 3 2
A = 4 4 1 = R(0) .
0 6 2

k = 1:

  u^(1) = (3, 4, 0)ᵀ,  ‖u^(1)‖₂ = 5,
  v̂^(1) = (5, 0, 0)ᵀ + (3, 4, 0)ᵀ = (8, 4, 0)ᵀ,  ‖v̂^(1)‖₂ = √80,
  v^(1) = (1/√80)(8, 4, 0)ᵀ,
  Q₁ = H₁ = I₃ − 2v^(1)(v^(1))ᵀ = I₃ − (1/40)( 64 32 0 ; 32 16 0 ; 0 0 0 ) = I₃ − (1/5)( 8 4 0 ; 4 2 0 ; 0 0 0 ),
  R^(1) = Q₁R^(0) = ( 3 3 2 ; 4 4 1 ; 0 6 2 ) − (1/5)( 40 40 20 ; 20 20 10 ; 0 0 0 ) = ( −5 −5 −2 ; 0 0 −1 ; 0 6 2 ).


k = 2:

  u^(2) = (0, 6)ᵀ,  ‖u^(2)‖₂ = 6,
  v̂^(2) = (6, 0)ᵀ + (0, 6)ᵀ = (6, 6)ᵀ,  ‖v̂^(2)‖₂ = √72,
  v^(2) = (1/√72)(6, 6)ᵀ,
  H₂ = I₂ − (1/36)( 36 36 ; 36 36 ) = ( 0 −1 ; −1 0 ),
  Q₂ = ( 1 0 ; 0 H₂ ) = ( 1 0 0 ; 0 0 −1 ; 0 −1 0 ),
  R^(2) = Q₂R^(1) = ( −5 −5 −2 ; 0 −6 −2 ; 0 0 1 ) = R.

We end up with

  R = R^(2),  Q = Q₁Q₂ = ( −3/5 0 4/5 ; −4/5 0 −3/5 ; 0 −1 0 ).
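The result can be checked directly: multiplying the computed factors recovers A, and QᵀQ is the identity. A pure-Python verification of the numbers above:

```python
A = [[3.0, 3.0, 2.0], [4.0, 4.0, 1.0], [0.0, 6.0, 2.0]]
Q = [[-3/5, 0.0, 4/5], [-4/5, 0.0, -3/5], [0.0, -1.0, 0.0]]
R = [[-5.0, -5.0, -2.0], [0.0, -6.0, -2.0], [0.0, 0.0, 1.0]]

# Q * R should give back A, and Q^T * Q the identity.
QR = [[sum(Q[i][k] * R[k][j] for k in range(3)) for j in range(3)]
      for i in range(3)]
QtQ = [[sum(Q[k][i] * Q[k][j] for k in range(3)) for j in range(3)]
       for i in range(3)]
```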

Computational Complexity
Lemma 19.1. [5.11] The cost for QR_HH except line 9 is

  C_QR_HH(m, n) ∼ 2mn² − (2/3)n³ as m, n → ∞.

Proof. Let us first consider the cost for one k-step. Computing ‖u^(k)‖₂ involves m − k additions,
m − k + 1 multiplications, and one square root, hence constructing v̂^(k) requires 2(m − k + 1) + 1
operations (we here assume that obtaining sign(u₁) and changing the sign does not involve any
cost).
Line 8 essentially requires multiplying the lower right (m − k + 1) × (n − k + 1) block R̃^(k−1) of
R^(k−1) with H_k = I_{m−k+1} − (2/‖v̂^(k)‖₂²) v̂^(k)(v̂^(k))ᵀ from the left:

• The computation of (s^(k))ᵀ := (v̂^(k))ᵀR̃^(k−1) (these are n − k + 1 standard inner products
  of vectors of length m − k + 1) involves m − k additions and m − k + 1 multiplications for
  each column of R̃^(k−1), hence (n − k + 1)(2(m − k) + 1) operations.

• Computing c^(k) := ‖v̂^(k)‖₂²/2 requires 2(m − k + 1) operations.

• For computing t^(k) := v̂^(k)/c^(k), m − k + 1 divisions are needed.

• To get the (m − k + 1) × (n − k + 1) matrix t^(k)(s^(k))ᵀ (= (2/‖v̂^(k)‖₂²) v̂^(k)(v̂^(k))ᵀR̃^(k−1)) requires
  (m − k + 1)(n − k + 1) multiplications.

• To finalise the computation of H_kR̃^(k−1) = R̃^(k−1) − (2/‖v̂^(k)‖₂²) v̂^(k)(v̂^(k))ᵀR̃^(k−1) we have to
  perform (m − k + 1)(n − k + 1) subtractions.


Summing up the costs for each k-step we obtain

  C_QR_HH(m, n) = Σ_{k=1}^n [ 2(m − k + 1) + 1 + (n − k + 1)(2(m − k) + 1) + 3(m − k + 1) + 2(m − k + 1)(n − k + 1) ]
              = Σ_{k=1}^n [ 5(m − k + 1) + 1 + (n − k + 1)(4(m − k + 1) − 1) ]
              = 4 Σ_{k=1}^n (nm − mk − nk + k²) + l.o.t. (lower order terms)
              = 4 ( mn² − m·n(n + 1)/2 − n·n(n + 1)/2 + n(n + 1)(2n + 1)/6 ) + l.o.t.
              = 2mn² − (2/3)n³ + l.o.t.

Error Analysis
The following result is restated from [5] (Higham, 2002):

Theorem 19.2. Let x denote the solution to LSQ with data (A, b) and x̂ the solution computed
via LSQ_QR. Then x̂ minimises ĝ(y) = ½‖(A + ΔA)y − (b + Δb)‖₂² with

  ‖Δa_i‖₂ ≤ (Cmnε_m/(1 − Cmnε_m)) ‖a_i‖₂,  i = 1, …, n,
  ‖Δb‖₂ ≤ (Cmnε_m/(1 − Cmnε_m)) ‖b‖₂

where a_i and Δa_i denote the column vectors of A and ΔA, respectively.

In the case m = n we may use the QR factorisation to solve (SLE). Recalling that for GEPP
we had an error bound of the form

  ‖ΔA‖_∞ ≤ Cn³g_n(A)ε_m‖A‖_∞

we see from the previous theorem that LSQ_QR has much better stability properties.
However, the cost for LSQ_QR is C_LSQ_QR(n) ∼ (4/3)n³ as n → ∞, while for GEPP we only
had C_GEPP(n) ∼ (2/3)n³.
In practice, GEPP is preferred to solve (SLE). The exponential growth of the growth factor
g_n(A) is only a worst-case estimate and, in applications, matrices do not behave that badly.

Lecture 20

Linear Iterative Methods for SLE

Iterative methods aim at constructing a sequence {x^(k)}_{k∈ℕ} ⊂ ℂⁿ such that x^(k) → x. Within
this context, the two (typically competing) criteria are:
• the computation of x(k) from the data and previous iterates should be inexpensive (relative
to a direct SLE solve),

• the convergence to the exact solution x is fast.


Such methods can become advantageous if:
(1) an error O(ε) with ε much bigger than εm is acceptable (e.g. discretisation of differential
equation),

(2) the matrix is sparse so that the matrix-vector product, usually needed by iterative methods,
is cheap to compute (e.g. banded matrices),

(3) the computational resources are limited since from the actual iterate one may learn at
least something about the solution (e.g. real-time control).
[Figure: error versus computation time for a direct and an iterative method; the direct method delivers the solution (up to error ε_m) only at the end of its run, whereas the iterative method reduces the error gradually, which is favourable in situations (1)-(3).]

The basic idea of linear iterative methods is to split A = M + N and, given some initial guess
x(0) ∈ Cn for the solution, to solve

M x(k) = b − N x(k−1)

at every step. If x(k) → x then indeed

M x ← M x(k) = −N x(k−1) + b → −N x + b ⇒ Ax = b.


Assuming convergence we will need a criterion to stop the iteration. Let us define the
error e(k) := x − x(k) .
It would be best if we could ensure that ke(k) k is small enough, say, smaller than a given
tolerance. But since x is not available one has to estimate the error. This may be done using
the
residual (vector) r(k) := b − Ax(k) .
Observe that r(k) = Ae(k) from which we deduce that
ke(k) k = kA−1 r(k) k ≤ kA−1 kkr(k) k.
Similarly, kbk = kAxk ≤ kAkkxk, and provided that we are not in the trivial case x = b = 0 we
conclude that
1 kAk
≤ .
kxk kbk
Putting both inequalities together we obtain (similarly as Proposition 10.2) that

  ‖e^(k)‖/‖x‖ ≤ ‖A‖ ‖A⁻¹‖ ‖r^(k)‖/‖b‖ = κ(A) ‖r^(k)‖/‖b‖. (20.1)
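Inequality (20.1) is easy to test numerically, say in the ∞-norm. In the sketch below the matrix, the fictitious iterate x^(k), and the helper names are all illustrative choices of mine:

```python
def norm_inf(v):
    """Maximum norm of a vector."""
    return max(abs(t) for t in v)

def matnorm_inf(M):
    """Induced infinity norm: maximal absolute row sum."""
    return max(sum(abs(t) for t in row) for row in M)

def matvec(M, v):
    return [sum(m * t for m, t in zip(row, v)) for row in M]

A = [[2.0, 1.0], [1.0, 3.0]]
Ainv = [[3/5, -1/5], [-1/5, 2/5]]     # inverse of A (det A = 5)
x = [1.0, 1.0]
b = matvec(A, x)

xk = [1.1, 0.9]                        # a fictitious iterate
e = [xi - xki for xi, xki in zip(x, xk)]                 # error e^(k)
r = [bi - ci for bi, ci in zip(b, matvec(A, xk))]        # residual r^(k)

kappa = matnorm_inf(A) * matnorm_inf(Ainv)
lhs = norm_inf(e) / norm_inf(x)        # relative forward error
rhs = kappa * norm_inf(r) / norm_inf(b)  # bound from (20.1)
```

Here lhs = 0.1 and rhs = 0.16, so the computable right-hand side indeed bounds the unknown forward error.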

Error Analysis
Error analysis for iterative methods means analysing the convergence of algorithms (rather than
their stability).
Lemma 20.1. [6.1] Assume that e^(k) = Re^(k−1) (= R^k e^(0)) with some matrix R ∈ ℂ^(n×n). Then
e^(k) → 0 for all e^(0) if and only if ρ(R) < 1.
Proof. Assume that ρ(R) < 1. For every δ > 0 there is a matrix norm ‖·‖_δ with ‖R‖_δ ≤ ρ(R) + δ.
Choosing δ small enough such that ρ(R) + δ < 1 we obtain for all e^(0) that

  ‖e^(k)‖_δ = ‖R^k e^(0)‖_δ ≤ ‖R‖_δ^k ‖e^(0)‖_δ → 0 as k → ∞.

In turn, if ρ(R) ≥ 1 then Re^(0) = λe^(0) for some e^(0) ≠ 0 and |λ| ≥ 1. Hence ‖e^(k)‖ = |λ|^k ‖e^(0)‖
does not converge to zero.

In our case we have x(k) = −M −1 (N x(k−1) − b) and, as a consequence of Ax = b, we obtain


that x = −M −1 N x + M −1 b. Subtracting those two equations we see that e(k) = Re(k−1) with
R = −M −1 N . Since ρ(·) ≤ k·k for every matrix norm it is sufficient to show that k−M −1 N k < 1
to ensure convergence.
Remark 20.2. In the context of iterative methods, rounding errors are not discussed for the
following reason. We require stable operations in every iteration step so that the
rounding errors are O(ε_m) in every step. This error should be small in comparison with

  (‖e^(k−1)‖ − ‖e^(k)‖)/‖e^(k−1)‖  or  (‖r^(k−1)‖ − ‖r^(k)‖)/‖r^(k−1)‖.

If this assumption is not fulfilled then the iteration converges too slowly anyway and one should
abandon it. A good criterion is to require that in every step

  (‖r^(k−1)‖ − ‖r^(k)‖)/‖r^(k−1)‖ ≥ δ  ⇔  ‖r^(k)‖/‖r^(k−1)‖ ≤ 1 − δ  with some 1 > δ ≫ ε_m.

Lecture 21

The Jacobi Method

Let us split our matrix A ∈ ℂ^(n×n) in the form A = D + L + U where D = diag(a₁,₁, …, a_{n,n}) is
the diagonal part and L and U are the (strict) lower and upper triangular parts given by

  l_{i,j} := a_{i,j} if i > j and l_{i,j} := 0 else,  u_{i,j} := a_{i,j} if i < j and u_{i,j} := 0 else.

The Jacobi method is the linear iterative method that consists in choosing M = D and N =
L + U, whence

  x^(k) = D⁻¹( b − (L + U)x^(k−1) ).


Example: For the tridiagonal matrix

  A = ( 2 −1 0 ⋯ 0 ; −1 2 −1 ⋱ ⋮ ; 0 ⋱ ⋱ ⋱ 0 ; ⋮ ⋱ −1 2 −1 ; 0 ⋯ 0 −1 2 ) ∈ ℝ^(n×n)

we have D = 2·I_n, while L carries the entries −1 on the first subdiagonal and U carries the
entries −1 on the first superdiagonal (all other entries of L and U vanish), so that

  x^(k) = ½( b − (L + U)x^(k−1) ).

60
LECTURE 21. THE JACOBI METHOD

Writing x^(k) = (x^(k)₁, …, x^(k)_n)ᵀ, this means that

  x^(k)_i = ½(b_i + x^(k−1)_{i−1} + x^(k−1)_{i+1}),  2 ≤ i ≤ n − 1,
  x^(k)₁ = ½(b₁ + x^(k−1)₂),
  x^(k)_n = ½(b_n + x^(k−1)_{n−1}).
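For this tridiagonal matrix the Jacobi step needs no matrix storage at all. A minimal pure-Python sketch (function name mine, starting from x^(0) = 0):

```python
def jacobi_tridiag(b, num_steps):
    """Jacobi iteration x^(k) = D^{-1}(b - (L+U) x^(k-1)) for the
    tridiagonal matrix A = tridiag(-1, 2, -1), with x^(0) = 0."""
    n = len(b)
    x = [0.0] * n
    for _ in range(num_steps):
        # all components are updated from the *previous* iterate only
        x = [(b[i]
              + (x[i - 1] if i > 0 else 0.0)
              + (x[i + 1] if i < n - 1 else 0.0)) / 2.0
             for i in range(n)]
    return x
```

For example, for n = 3 and b = (0, 0, 4)ᵀ the exact solution of Ax = b is (1, 2, 3)ᵀ, which the iteration approaches geometrically.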

For convergence it is sufficient if the spectral radius of the iteration matrix R = −M⁻¹N =
−D⁻¹(L + U) is smaller than one.
Theorem 21.1. The Jacobi method is convergent if A satisfies
P
(1) |ai,i | > j6=i |ai,j | for all i (strong row sum criterion), or
P
(2) |aj,j | > i6=j |ai,j | for all j (strong column sum criterion).
Proof. In the first case we have that the entries of R satisfy r_{i,j} = −a_{i,j}/a_{i,i} if i ≠ j and r_{i,i} = 0
for all i. Therefore

  ρ(R) ≤ ‖R‖_∞ = max_i Σ_j |r_{i,j}| = max_i (1/|a_{i,i}|) Σ_{j≠i} |a_{i,j}| < 1.

The other case can be proved similarly.

One can weaken this criterion just a little bit so that it becomes much more useful for quite a
few applications. We need the notion of irreducibility for this purpose. A matrix A ∈ ℂ^(n×n) is
irreducible if there is no permutation matrix P such that

  PᵀAP = ( Ã₁,₁ Ã₁,₂ ; 0 Ã₂,₂ )

where Ã₁,₁ and Ã₂,₂ both are square blocks of size p × p and (n − p) × (n − p) with some
1 ≤ p ≤ n − 1.

Graph of a matrix
It can be very tedious to check this irreducibility criterion but, fortunately, there is another way based on studying the oriented graph
G(A) of the matrix A. It consists of the vertices 1, …, n, and there is an (oriented) edge from
vertex i to vertex j (denoted by i → j) if a_{i,j} ≠ 0.
Example:

  A₁ = ( 2 −1 0 ; 0 2 −1 ; 0 0 2 ),  A₂ = ( 2 −1 0 ; −1 2 −1 ; 0 −1 2 )

[Figure: the graphs G(A₁) and G(A₂) on the vertices 1, 2, 3; in G(A₁) the edges between distinct vertices only go downwards (1 → 2 → 3), while G(A₂) also contains the reverse edges.]


We say that two vertices i, j are connected if there is a chain of connecting edges (or direct
connections) i = i0 → i1 → · · · → ik = j with some k ∈ N. The graph G(A) then is called
connected if any two vertices i, j of it are connected. Irreducibility of the matrix A now may be
checked using the following lemma.

Lemma 21.2. A is irreducible if and only if G(A) is connected.

Proof. Exercise.

Back to the question of convergence of the Jacobi method:

Theorem 21.3. If A is irreducible and satisfies the weak row sum criterion
(1) |a_{i,i}| ≥ Σ_{j≠i} |a_{i,j}| for all i = 1, …, n, and

(2) |a_{k,k}| > Σ_{j≠k} |a_{k,j}| for at least one index k ∈ {1, …, n}

then the Jacobi method converges.

Proof. Recall that we need to prove that ρ(R) < 1. Let e :=
(1, …, 1)ᵀ ∈ ℂⁿ and |R| := (|r_{i,j}|)_{i,j}. Then thanks to the first condition

  0 ≤ (|R|e)_i = Σ_{j=1}^n |r_{i,j}| = Σ_{j≠i} |a_{i,j}|/|a_{i,i}| ≤ 1 = e_i

so that e ≥ |R|e ≥ |R|²e ≥ …, where the inequality for vectors here and in the following has to
be understood component-wise.
Let t^(l) := e − |R|^l e ≥ 0, l ∈ ℕ. Assume now that the set of non-vanishing components of t^(l)
becomes stationary with a positive number of such components. We may assume that these are the first m entries,
where m > 0 thanks to the second condition, i.e.,

  t^(l) = ( b^(l) ; 0 ),  t^(l+1) = ( b^(l+1) ; 0 )

where b^(l), b^(l+1) ∈ ℝᵐ have positive entries, b^(l) > 0, b^(l+1) > 0.
Suppose that m < n. Then

  ( b^(l+1) ; 0 ) = e − |R|^(l+1)e ≥ |R|e − |R|^(l+1)e = |R|(e − |R|^l e) = |R| ( b^(l) ; 0 ) = ( |R₁,₁| |R₁,₂| ; |R₂,₁| |R₂,₂| ) ( b^(l) ; 0 )
with R1,1 ∈ Rm×m and the other blocks accordingly. Since b(l) > 0 necessarily |R2,1 | = 0.
Therefore R is not irreducible. And since ri,j = ai,j /ai,i if i 6= j we obtain that A is not
irreducible in contradiction to the assumption. Hence m = n.
Consequently, t^(l) > 0 as long as l is big enough (using the above contradiction argument again
we see that l > n is sufficient). This means that e > |R|^l e, whence

  ρ(R)^l ≤ ρ(R^l) ≤ ‖R^l‖_∞ ≤ ‖ |R|^l ‖_∞ = max_i (|R|^l e)_i < max_i e_i = 1

so that ρ(R) < 1 as desired.

Lecture 22

Computational Complexity of Linear


Iterative Methods

In some applications, the goal will be to decrease the relative forward error ke(k) k/kxk below a
given threshold while in others it is sufficient to decrease the relative backward error kr(k) k/kbk.
We will concentrate on the latter goal but recall that by (20.1) the two goals are related. Of
course, the knowledge of the condition number is required to deduce an estimate for the forward
error from the backward error.
So our goal is
kr(k) k ≤ εr kbk (22.1)
where εr > 0 is a given tolerance. Using r(k) = Ae(k) = ARk e(0) it is sufficient to achieve that
 1 k kAkke(0) k
kAkkRkk ke(0) k ≤ εr kbk ⇔ ≥
kRk kbkεr
|{z}
>1

which is the case if an only if

log(kAk) + log(ke(0) k) − log(kbk) − log(εr )


k≥ =: k ] (n, εr ). (22.2)
log(kRk−1 )

In practice, the iteration matrix R often depends on n in a very unfavourable way while the
other data A, b and e(0) do not affect the number of steps that much.

Assumption 22.1.

1. The calculation of Rx involves a cost of Θ(nα ) as n → ∞ with some α > 0.

2. kRk = 1 − h(n) with a positive function h such that h(n) = Θ(n−β ) as n → ∞ with some
β > 0.

3. kAk, kbk, and ke(0) k are uniformly bounded in n.

Theorem 22.2. Under Assumption 22.1, the computational cost to achieve (22.1) is bounded
by a function C(n, ε_r) satisfying

  C(n, ε_r) = Θ(n^(α+β) log(ε_r⁻¹)) as (n, ε_r) → (∞, 0).


Proof. Recall that log(1/(1 − x)) = x + x²/2 + x³/3 + … . Thus, by
Assumption 22.1.2

  log(‖R‖⁻¹) = log(1/(1 − h(n))) = h(n) + higher order terms = Θ(n⁻ᵝ)

so that (log(‖R‖⁻¹))⁻¹ = Θ(nᵝ) as n → ∞. From (22.2) and Assumption 22.1.3 we get

  k♯(n, ε_r) = Θ(nᵝ log(ε_r⁻¹)) as (n, ε_r) → (∞, 0)

for the number of steps. Taking the cost per step into account (Assumption 22.1.1), the total
cost is

  k♯(n, ε_r) · C_one step(n) = Θ(n^(α+β) log(ε_r⁻¹)).

Assuming a polynomial dependence in Assumption 22.1.3 one would obtain additional log(n)
terms in the cost estimate.

Computational complexity of the Jacobi method


In each iteration step:

• computing (L + U )x(k−1) involves at most O(n2 ) operations,

• computing b − . . . is O(n),

• computing D−1 (. . . ) is O(n), too.

So the essential cost comes from the first item. If the matrix is sparse, i.e., the number of
non-vanishing entries in each row is ∼ n^η with some η < 1, then the number of operations is
O(n^{1+η}). For instance, η = 0 if A is tridiagonal, as the number of non-vanishing entries in each
row is bounded by a constant (namely 3). In any case, with some α ∈ [1, 2] the general result on the
computational cost in Theorem 22.2 is applicable.

Variants of the Jacobi method


The successive over-relaxation (SOR) method generalises the Jacobi method by setting

    M := L + ωD,    N := U + (1 − ω)D,    ω ∈ R.

For ω = 1 we obtain the Gauss-Seidel method.


Results on convergence criteria and computational cost read similarly. Yet the relaxation parameter
ω can have a massive influence on the spectral properties of the iteration matrix R and can speed up
the convergence a lot. The notion of over-relaxation refers to choosing ω > 1, hence bigger than
in the Gauss-Seidel method.
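To make the splitting concrete, here is a minimal plain-Python sketch of one SOR step (function name illustrative; it assumes A = L + D + U as above, so A = M + N with M = L + ωD lower triangular and the step solvable by forward substitution; ω = 1 recovers Gauss-Seidel):

```python
def sor_step(A, b, x, omega=1.0):
    """One SOR step with M = L + omega*D, N = U + (1 - omega)*D (A = M + N).

    M is lower triangular, so M x_new = b - N x_old is solved by
    forward substitution; omega = 1.0 recovers the Gauss-Seidel method.
    """
    n = len(A)
    x_new = [0.0] * n
    for i in range(n):
        rhs = b[i] - (1.0 - omega) * A[i][i] * x[i]          # diagonal part of -(N x_old)_i
        rhs -= sum(A[i][j] * x[j] for j in range(i + 1, n))  # -(U x_old)_i
        rhs -= sum(A[i][j] * x_new[j] for j in range(i))     # forward substitution with L
        x_new[i] = rhs / (omega * A[i][i])
    return x_new

# Gauss-Seidel (omega = 1) on a diagonally dominant system:
A = [[4.0, -1.0], [-1.0, 4.0]]
b = [3.0, 3.0]
x = [0.0, 0.0]
for _ in range(30):
    x = sor_step(A, b, x, omega=1.0)
# converges to the solution (1, 1) of Ax = b
```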

Lecture 23

Nonlinear Iterative Methods, Steepest Descent

We restrict our attention now to positive definite (symmetric) matrices A ∈ R^{n×n}.
Recall the notation hx, yiA := hx, Ayi and kxkA := sqrt(hx, xiA).
Clearly, x ∈ R^n solves Ax = b if and only if x is a minimiser of

    g : R^n → R,  g(y) = (1/2) kAy − bk_{A^{-1}}^2.    (23.1)
Recalling that e(k) = x − x(k) and r(k) = b − Ax(k) = Ae(k), one can easily show that

    g(x(k)) = (1/2) kr(k)k_{A^{-1}}^2 = (1/2) ke(k)k_A^2.    (23.2)
Hence, minimising g means
• minimising the residual in the k · kA−1 norm or
• minimising the error in the k · kA norm (energy norm).
We consider nonlinear iterative methods for solving SLEs that have the form

    x(k) = x(k−1) + α(k−1) d(k−1),    (23.3)

where d(k−1) ∈ R^n is the search direction and α(k−1) ∈ R is the step length. The step length is
chosen such that

    f(α) := g(x(k−1) + α d(k−1))

is minimal. This uniquely determines α(k−1) since f is convex and tends to infinity as |α| → ∞.
This also allows us to derive an explicit formula for the step length:
    f(α) = (1/2) hαAd(k−1) + Ax(k−1) − b, A^{-1}(αAd(k−1) + Ax(k−1) − b)i,    where Ax(k−1) − b = −r(k−1),
         = (1/2) α^2 hAd(k−1), d(k−1)i + (1/2) α hAd(k−1), −A^{-1} r(k−1)i
           + (1/2) α h−r(k−1), d(k−1)i + (1/2) h−r(k−1), −A^{-1} r(k−1)i
         = (1/2) α^2 kd(k−1)k_A^2 − α hd(k−1), r(k−1)i + (1/2) kr(k−1)k_{A^{-1}}^2    (23.4)

    ⇒ f′(α) = α kd(k−1)k_A^2 − hd(k−1), r(k−1)i.


As the minimiser fulfils f′(α(k−1)) = 0 we obtain

    α(k−1) = hr(k−1), d(k−1)i / kd(k−1)k_A^2.    (23.5)
Before we start looking at possible search directions we make two observations:
1. The residual is subject to the iterative formula

r(k) = b − Ax(k) = b − Ax(k−1) − α(k−1) Ad(k−1) = r(k−1) − α(k−1) Ad(k−1) . (23.6)

2. A computation similar to the one above shows that

       ∂xi g(x(k−1)) = (d/dα) g(x(k−1) + α ei) |α=0 = −hei, r(k−1)i,

   from which we conclude that ∇g(x(k−1)) = −r(k−1).

Steepest Descent Method


The idea of this method is to choose d(k−1) = r(k−1) = −∇g(x(k−1)). This choice is motivated
by the fact that the gradient points in the direction of the fastest increase, hence a sufficiently
small step in direction −∇g(x(k−1)) will decrease the value of our target function g that is to be
minimised. According to (23.5) the optimal step length then is α(k−1) = kr(k−1)k_2^2 / kr(k−1)k_A^2.

Algorithm 10 SD (steepest descent method)


input: A = (aij )ni,j=1 ∈ Rn×n positive definite, b, x(0) ∈ Rn , εr > 0.
output: x ∈ Rn with kAx − bk2 ≤ εr .
1: for k = 1, 2, . . . do
2: r(k−1) := b − Ax(k−1)
3: if kr(k−1) k2 ≤ εr then
4: return x(k−1)
5: else
6: α(k−1) := kr(k−1) k22 /kr(k−1) k2A
7: x(k) := x(k−1) + α(k−1) r(k−1)
8: end if
9: end for

Introducing a helper variable h(k−1) = Ar(k−1) it is possible to formulate the algorithm such that
only one matrix-vector multiplication per iteration step is required (exercise).
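A minimal plain-Python sketch of this single-matvec variant of Algorithm 10 (function name illustrative; the helper h = Ar(k−1) is reused for both the step length and the residual update (23.6)):

```python
def steepest_descent(A, b, x, eps_r=1e-10, max_iter=10000):
    """Steepest descent (Algorithm 10) with one matrix-vector product per step.

    The helper h = A r serves twice: in the step length
    alpha = ||r||_2^2 / ||r||_A^2 and in the residual update
    r <- r - alpha * A r, which is (23.6) with d = r.
    """
    n = len(A)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

    r = [bi - axi for bi, axi in zip(b, matvec(x))]
    for _ in range(max_iter):
        rr = dot(r, r)
        if rr ** 0.5 <= eps_r:
            break
        h = matvec(r)              # the only matvec in this iteration
        alpha = rr / dot(r, h)     # ||r||_2^2 / ||r||_A^2
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
        r = [ri - alpha * hi for ri, hi in zip(r, h)]
    return x

x = steepest_descent([[2.0, -1.0], [-1.0, 2.0]], [1.0, 0.0], [0.0, 0.0])
# converges (in a zig-zag fashion) towards the solution (2/3, 1/3)
```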
One observes that subsequent search directions are orthogonal with respect to the standard
scalar product: thanks to (23.6) and d(k−1) = r(k−1),

    hr(k−1), r(k)i = hr(k−1), r(k−1) − α(k−1) Ar(k−1)i
                   = hr(k−1), r(k−1)i − (kr(k−1)k_2^2 / kr(k−1)k_A^2) hr(k−1), Ar(k−1)i = 0.
The effect is a zig-zag path of the iterates when approaching the minimum of g as illustrated
in Figure 23.1. Since g is a quadratic function, the level sets of g are ellipsoids. The longer
the ellipsoids are stretched, the longer it takes for the algorithm to obtain an iterate close to
the minimum. It turns out that the main axes of the level set ellipsoids are the eigenspaces,
and the stretching of the ellipsoids depends on the ratios of the eigenvalues, most prominently
λmax/λmin. Recalling that for positive definite matrices kAk_2 = λmax and kA^{-1}k_2 = 1/λmin,
we see that it is exactly the condition number κ2(A) = kAk_2 kA^{-1}k_2 = λmax/λmin that can
serve as a measure of how elongated the level sets are. We will see later on how the condition
number influences the convergence of SD.

Figure 23.1: Behaviour of SD, zigzag path to the minimum due to orthogonal (w.r.t. the
Euclidean scalar product) search directions.

Lecture 24

Conjugate Gradient Method

As in the previous lecture, A ∈ Rn×n is positive definite throughout.


In SD, subsequent search directions were orthogonal with respect to the standard scalar product.
Now, we consider A-orthogonal (or conjugate) search directions d(k), i.e., hd(i), d(j)iA = 0 if i ≠ j.
This has the following advantage:
Lemma 24.1. Assume that the search directions d(0) , . . . , d(l−1) in the iteration (23.3) form an
A-orthogonal set. Then x(l) minimises g over the set x(0) + span{d(0) , . . . , d(l−1) }.
Proof. Consider the map

    h : R^l → R,  h(γ0, . . . , γl−1) = g(x(0) + Σ_{i=0}^{l−1} γi d(i)).

Since h is convex and tends to infinity as |(γ0, . . . , γl−1)| → ∞, it has a unique minimum γ̂.
Recalling that ∇g(x) = Ax − b = −r and using the A-orthogonality of the search directions we
obtain for all m = 0, . . . , l − 1 that

    0 = (∂/∂γm) h(γ̂) = h∇g(x(0) + Σ_{i=0}^{l−1} γ̂i d(i)), d(m)i
      = hAx(0) − b + Σ_{i=0}^{l−1} γ̂i Ad(i), d(m)i = −hr(0), d(m)i + Σ_{i=0}^{l−1} γ̂i hAd(i), d(m)i,

where hAd(i), d(m)i = kd(m)k_A^2 if i = m and = 0 otherwise, so that γ̂m = hr(0), d(m)i / kd(m)k_A^2.


On the other hand, the optimal step length is α(m) = hd(m), r(m)i / kd(m)k_A^2. But using the
iterative formula for the residual (23.6) and the A-orthogonality hd(m), Ad(j)i = 0 for j < m,

    hd(m), r(m)i = hd(m), r(m−1)i − α(m−1) hd(m), Ad(m−1)i = · · · = hd(m), r(0)i,

whence γ̂m = α(m), so that (α(0), . . . , α(l−1)) is the minimiser of h.


Consequently, x(l) = x(0) + Σ_{i=0}^{l−1} α(i) d(i) computed by the nonlinear iteration is the minimum
of g on x(0) + span{d(0), . . . , d(l−1)} as asserted.

For l = n and on assuming that d(k) ≠ 0 for all k = 0, . . . , n − 1 we have that x(0) +
span{d(0), . . . , d(n−1)} = R^n, so the above lemma then means that x(n) is the global minimiser
of g and the desired solution to Ax = b. Going back to the case n = 2 and Figure 23.1, choosing
d(0) = r(0) as the first search direction, this means that the second search direction would ensure
jumping from x(1) immediately to the minimum and we would avoid the zigzag path. So the big
question is: how can we obtain A-orthogonal search directions?
More precisely, given d(0) , . . . , d(k−1) (and the x(i) and r(i) ), how can an appropriate d(k) A-
orthogonal to all the previous search directions be obtained? The following ideas go back to
Hestenes and Stiefel:

1. Given any v ∉ span{d(0), . . . , d(k−1)}, such a vector can be computed via the Gram-Schmidt
   orthogonalisation method (applied with the A-scalar product, of course):

       d̃(k)(v) := v − (hd(k−1), viA / kd(k−1)k_A^2) d(k−1) − (hd(k−2), viA / kd(k−2)k_A^2) d(k−2) − · · · − (hd(0), viA / kd(0)k_A^2) d(0).

   Apart from stability issues, this becomes very expensive when k becomes big.

2. The choice v = r(k) = −∇g(x(k) ) is quite reasonable: If r(k) ∈ span{d(0) , . . . , d(k−1) } then
r(k) = 0 because x(k) = x already is the minimum by the preceding result in Lemma 24.1
and we would stop the iteration anyway. Moreover, since r(k) points in the direction of
the steepest descent it gives a good idea into which direction roughly to proceed next. So
set d(k) := d˜(k) (r(k) ).

3. Set d(0) = r(0) . Consequently, the first step of the iteration is the same as for SD.

The most amazing point with the above choices is that hd(i), r(k)iA = 0 for i = 0, . . . , k − 2, as
we shall prove later on. This means that

    d(k) = r(k) − (hd(k−1), r(k)iA / kd(k−1)k_A^2) d(k−1) − Σ_{i=0}^{k−2} (hd(i), r(k)iA / kd(i)k_A^2) d(i)
         = r(k) + β(k) d(k−1),    where β(k) := −hd(k−1), r(k)iA / kd(k−1)k_A^2,

since all terms in the sum vanish. Hence the update of the search direction is in fact cheap!


In the algorithm below the scalars α(k−1) and β(k) are computed with slightly different formulas
that will turn out to be algebraically equivalent (see Lemma 25.1) but in practice perform
somewhat better.


Algorithm 11 CG (conjugate gradient method)


input: A = (aij )ni,j=1 ∈ Rn×n positive definite, b, x(0) ∈ Rn , εr > 0.
output: x ∈ Rn with kAx − bk2 ≤ εr .
1: d(0) = r (0) = b − Ax(0)
2: if kr (0) k2 ≤ εr then
3: return x(0)
4: else
5: for k = 1, 2, . . . do
6: α(k−1) := kr(k−1) k22 /kd(k−1) k2A
7: x(k) := x(k−1) + α(k−1) d(k−1)
8: r(k) := r(k−1) − α(k−1) Ad(k−1)
9: if kr(k) k2 ≤ εr then
10: return x(k)
11: end if
12: β (k) := kr(k) k22 /kr(k−1) k22
13: d(k) := r(k) + β (k) d(k−1)
14: end for
15: end if

It is possible to formulate the algorithm such that only one matrix-vector multiplication per
iteration step is required (exercise).
Example: Consider the data

    A = [ 2  −1 ]      b = (1, 0)^T,    x(0) = (0, 0)^T.
        [ −1  2 ],

We then have d(0) = r(0) = b.

Step k = 1:

    h(0) := Ad(0) = (2, −1)^T
    α(0) := kr(0)k_2^2 / kd(0)k_A^2 = kr(0)k_2^2 / hd(0), h(0)i = 1/2
    x(1) := x(0) + α(0) d(0) = (1/2, 0)^T
    r(1) := r(0) − α(0) h(0) = (1, 0)^T − (1/2)(2, −1)^T = (0, 1/2)^T
    β(1) := kr(1)k_2^2 / kr(0)k_2^2 = 1/4
    d(1) := r(1) + β(1) d(0) = (0, 1/2)^T + (1/4)(1, 0)^T = (1/4, 1/2)^T

Step k = 2:

    h(1) := Ad(1) = (0, 3/4)^T
    α(1) := kr(1)k_2^2 / hd(1), h(1)i = (1/4)/(3/8) = 2/3
    x(2) := x(1) + α(1) d(1) = (1/2, 0)^T + (2/3)(1/4, 1/2)^T = (2/3, 1/3)^T
    r(2) := r(1) − α(1) h(1) = (0, 1/2)^T − (2/3)(0, 3/4)^T = (0, 0)^T

Lecture 25

More on CG

We have motivated the use of conjugate or A-orthogonal search directions in order to improve the
convergence in comparison with the straightforward steepest descent method. An open issue
has been the claim that the update of the search direction in each step, which is based on a
Gram-Schmidt orthogonalisation, is cheap as most of the terms drop out.

Lemma 25.1. Let x(1), . . . , x(k) be the iterates computed by CG and assume that
r(0), . . . , r(k), d(0), . . . , d(k−1) ≠ 0. Then

(1) kd(k)k_2 ≥ kr(k)k_2 > 0,

(2) hd(k−1), r(k)i = 0,

(3) hr(k−1), r(k)i = 0,

(4) α(k−1) = kr(k−1)k_2^2 / kd(k−1)k_A^2 > 0,

(5) β(k) = kr(k)k_2^2 / kr(k−1)k_2^2 > 0.

Proof. Using the update formula r(k) = r(k−1) − α(k−1) Ad(k−1) and the one for α(k−1) we obtain
that

    hd(k−1), r(k)i = hd(k−1), r(k−1)i − (hd(k−1), r(k−1)i / kd(k−1)k_A^2) hd(k−1), Ad(k−1)i = 0    (25.1)

which proves (2). A consequence of this Euclidean orthogonality is that

    kd(k)k_2^2 = kr(k) + β(k) d(k−1)k_2^2 = kr(k)k_2^2 + |β(k)|^2 kd(k−1)k_2^2 > kr(k)k_2^2 > 0

which is assertion (1). A further consequence of (25.1) is that for k > 1

    hd(k−1), r(k−1)i = hr(k−1) + β(k−1) d(k−2), r(k−1)i = kr(k−1)k_2^2,

and thanks to the choice d(0) = r(0) this is also true for k = 1. This implies assertion (4),

    α(k−1) = hd(k−1), r(k−1)i / kd(k−1)k_A^2 = kr(k−1)k_2^2 / kd(k−1)k_A^2 > 0.

To show (3) we first observe that for k > 1, thanks to the A-orthogonality of the d(i),

    hr(k−1), d(k−1)iA = hd(k−1), d(k−1)iA − β(k−1) hd(k−2), d(k−1)iA = kd(k−1)k_A^2,

where the second term vanishes,


and by the choice d(0) = r(0) this is also true for the case k = 1. Therefore, using the already
proved identity (4),

    hr(k−1), r(k)i = hr(k−1), r(k−1)i − α(k−1) hr(k−1), Ad(k−1)i
                   = hr(k−1), r(k−1)i − (kr(k−1)k_2^2 / kd(k−1)k_A^2) kd(k−1)k_A^2 = 0.
Finally we prove the update formula (5). Since α(k−1) > 0 we can write −Ad(k−1) =
(1/α(k−1))(r(k) − r(k−1)). Using this and the already shown identities (3) and (4) we get

    β(k) = −hd(k−1), r(k)iA / kd(k−1)k_A^2 = hr(k) − r(k−1), r(k)i / (α(k−1) kd(k−1)k_A^2) = kr(k)k_2^2 / kr(k−1)k_2^2 > 0.

The following lemma is central to CG:

Lemma 25.2. The vectors d(0), . . . , d(k) are A-orthogonal. Moreover,

    span{r(0), . . . , r(l)} = span{d(0), . . . , d(l)} = span{r(0), Ar(0), . . . , A^l r(0)}    (25.2)

for l = 0, . . . , k − 1.

Proof. We start with proving the second assertion by induction, where the case l = 0 is clear
thanks to the choice d(0) = r(0). So let l > 0 and assume that (25.2) is true for l − 1. Since
α(l−1) ≠ 0 we have that Ad(l−1) = (1/α(l−1))(r(l−1) − r(l)) ∈ span{r(l−1), r(l)}. Therefore, using
the induction hypothesis,

    A^l r(0) = A(A^{l−1} r(0)) ∈ span{Ad(0), . . . , Ad(l−1)} ⊆ span{r(0), . . . , r(l)}

which implies that

    span{r(0), Ar(0), . . . , A^l r(0)} ⊆ span{r(0), . . . , r(l)}.

Now, r(l) = d(l) − β(l) d(l−1) ∈ span{d(l−1), d(l)} which, with the induction hypothesis, yields that

    span{r(0), . . . , r(l)} ⊆ span{d(0), . . . , d(l)}.

Finally, since Ad(l−1) ∈ span{r(0), . . . , A^l r(0)} and using the induction hypothesis again,

    d(l) = r(l) + β(l) d(l−1) = r(l−1) − α(l−1) Ad(l−1) + β(l) d(l−1) ∈ span{r(0), . . . , A^l r(0)}

so that

    span{d(0), . . . , d(l)} ⊆ span{r(0), . . . , A^l r(0)}.
Let us now come to the assertion on the A-orthogonality. We show this by induction, too,
where the case k = 1 is trivially fulfilled. So let k > 1 and assume that d(0), . . . , d(k−1) are
A-orthogonal.
Consider an index i ≤ k − 1. Then for all j = i + 2, . . . , k,

    hd(i), r(j)i = hd(i), r(j−1)i − α(j−1) hd(i), Ad(j−1)i = hd(i), r(j−1)i,

where the second term vanishes thanks to the induction hypothesis (as i ≠ j − 1 ≤ k − 1). But
since hd(i), r(i+1)i = 0 by Lemma 25.1 we conclude that

    hd(i), r(j)i = 0,    j = i + 1, . . . , k,    (25.3)

where i = 0, . . . , k − 1.
Let l < k − 1. Then thanks to (25.2), Ad(l) ∈ span{r(l), r(l+1)} ⊂ span{d(0), . . . , d(l+1)}, so that,
using (25.3), hd(l), r(k)iA = hAd(l), r(k)i = 0. Consequently,

    hd(l), d(k)iA = hd(l), r(k)iA + β(k) hd(l), d(k−1)iA = 0,

where the first term vanishes as just shown and we used the induction hypothesis for the second
term.


In the only remaining case l = k − 1 it is the choice of β(k) which ensures A-orthogonality:

    hd(k−1), d(k)iA = hd(k−1), r(k)iA + β(k) hd(k−1), d(k−1)iA
                    = hd(k−1), r(k)iA − (hd(k−1), r(k)iA / kd(k−1)k_A^2) kd(k−1)k_A^2 = 0.

The spaces in (25.2), denoted by

    Kk(r(0), A) := span{r(0), . . . , A^{k−1} r(0)},

are called Krylov subspaces and play a prominent role in other iterative methods such as GMRES
and BiCGstab which, indeed, are even termed Krylov (sub)space methods. A characteristic
property of these methods is that the increment lies in the current Krylov subspace:

    x(k) − x(k−1) ∈ Kk(r(0), A).

A consequence of the previous results is

Theorem 25.3. The CG algorithm reaches the exact solution to SLEs in at most n steps for
any x(0) .

So in fact CG is a direct method. But in practice it is considered an iterative method because
εr usually is much bigger than the machine precision εm, so that the iteration terminates with
an approximation x(k) where k is much smaller than n.

Lecture 26

Error Analysis - Comparison of SD


and CG

Error analysis for iterative methods is convergence analysis; rounding errors are not taken into
account. The concepts for analysing SD and CG are similar, which is why they are presented
together. We start with a helpful lemma relating the energy and Euclidean norms.

Lemma 26.1. Assume that ke(k)kA ≤ c q^k ke(0)kA for all k ∈ N and some constants c, q > 0.
Then

    ke(k)k_2 ≤ sqrt(κ2(A)) c q^k ke(0)k_2    ∀k ∈ N.
Proof. Denoting the minimal and maximal eigenvalues of A by λmin and λmax, respectively, and
recalling that κ2(A) = λmax/λmin:

    ke(k)k_2^2 ≤ (1/λmin) ke(k)k_A^2 ≤ (1/λmin) c^2 q^{2k} ke(0)k_A^2 ≤ (λmax/λmin)(c q^k)^2 ke(0)k_2^2.

The first result concerns SD.

Theorem 26.2. The convergence rate of SD is

    ke(k)kA ≤ (1 − 1/κ2(A))^{k/2} ke(0)kA.

Proof. Recalling (23.4), which with d(k−1) = r(k−1) reads

    g(x(k−1) + α r(k−1)) = (1/2) α^2 kr(k−1)k_A^2 − α hr(k−1), r(k−1)i + (1/2) kr(k−1)k_{A^{-1}}^2,

we first observe that

    g(x(k)) = g(x(k−1) + α(k−1) d(k−1))
            = (kr(k−1)k_2^4 / (2 kr(k−1)k_A^4)) kr(k−1)k_A^2 − (kr(k−1)k_2^2 / kr(k−1)k_A^2) kr(k−1)k_2^2 + (1/2) kr(k−1)k_{A^{-1}}^2
            = (1/2) kr(k−1)k_{A^{-1}}^2 − kr(k−1)k_2^4 / (2 kr(k−1)k_A^2)
            = (1 − kr(k−1)k_2^4 / (kr(k−1)k_A^2 kr(k−1)k_{A^{-1}}^2)) · (1/2) kr(k−1)k_{A^{-1}}^2,

where (1/2) kr(k−1)k_{A^{-1}}^2 = g(x(k−1)). Using the estimates kvk_A^2 ≤ λmax kvk_2^2 and
kvk_{A^{-1}}^2 ≤ (1/λmin) kvk_2^2 we obtain that this is

    ≤ (1 − kr(k−1)k_2^4 / (λmax kr(k−1)k_2^2 · (1/λmin) kr(k−1)k_2^2)) g(x(k−1))
    = (1 − 1/κ2(A)) g(x(k−1)).

Therefore

    g(x(k)) ≤ (1 − 1/κ2(A))^k g(x(0)).

Using that g(x(l)) = (1/2) ke(l)k_A^2 for l = 0, k (see (23.2)) yields the assertion.

For CG we have the following result, where Pk denotes the set of polynomials p of degree ≤ k
with p(0) = 1 and Λ(A) is the set of eigenvalues of A:

Theorem 26.3. If CG has not yet converged after step k then

    ke(k)kA = inf_{p∈Pk} kp(A) e(0)kA ≤ inf_{p∈Pk} max_{λ∈Λ(A)} |p(λ)| ke(0)kA.

Proof. We only prove the first equality; the second one is an exercise.
As x(0) − x(k) is an element of the Krylov space Kk(A, r(0)) = span{r(0), . . . , A^{k−1} r(0)} we can
write

    e(k) = x − x(k) = x − x(0) + x(0) − x(k) = e(0) + Σ_{j=0}^{k−1} ηj+1 A^j r(0) = (1 + Σ_{j=1}^{k} ηj A^j) e(0)

with appropriate η1, . . . , ηk ∈ R, where we used r(0) = Ae(0). Let now q(λ) := 1 + Σ_{j=1}^{k} ηj λ^j ∈ Pk.
Then e(k) = q(A) e(0) so that

    kx − x(k)kA = ke(k)kA = kq(A) e(0)kA.    (26.1)

For any polynomial p ∈ Pk, p(λ) = 1 + Σ_{j=1}^{k} γj λ^j, we have

    p(A) e(0) = e(0) + Σ_{j=1}^{k} γj A^j e(0) = x − x(0) + Σ_{j=0}^{k−1} γj+1 A^j r(0),

where the sum lies in Kk(A, r(0)). But as we have seen in Lemma 24.1, the iterate x(k) minimises
y ↦ ky − xkA = sqrt(2 g(y)) on x(0) + Kk(A, r(0)). Therefore

    ke(k)kA = kx − x(k)kA ≤ kx − x(0) + Σ_{j=0}^{k−1} γj+1 A^j r(0)kA = kp(A) e(0)kA.

Together with (26.1) this proves the first equality.


Chebyshev polynomials
These polynomials are defined by

    Tn(x) := (1/2) ((x + sqrt(x^2 − 1))^n + (x − sqrt(x^2 − 1))^n),    x ∈ [−1, 1],

and fulfil the recursive formula

    T0(x) = 1,  T1(x) = x,  Tn+1(x) = 2x Tn(x) − Tn−1(x),  n = 1, 2, . . .

A further definition is

    Tn(x) = cos(n arccos(x)),    x ∈ (−1, 1).

We see that max_{|x|≤1} |Tn(x)| ≤ 1, and indeed the Chebyshev polynomials play an important role
when optimising with respect to the norm k · k∞.
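The recursive formula and the cosine representation can be checked against each other numerically (a small sketch):

```python
import math

def chebyshev(n, x):
    """T_n(x) via the three-term recurrence T_{n+1} = 2x T_n - T_{n-1}."""
    if n == 0:
        return 1.0
    t_prev, t = 1.0, x  # T_0, T_1
    for _ in range(n - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

# On (-1, 1) the recurrence agrees with T_n(x) = cos(n arccos(x)),
# and |T_n(x)| <= 1 there.
x = 0.3
vals = [chebyshev(n, x) for n in range(6)]
ref = [math.cos(n * math.acos(x)) for n in range(6)]
```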
Let λmax and λmin denote the maximal and minimal eigenvalues of A and consider the rescaled
Chebyshev polynomial

    p(x) := Tn(γ − 2x/(λmax − λmin)) / Tn(γ),

where

    γ = (λmax + λmin)/(λmax − λmin) = (λmax/λmin + 1)/(λmax/λmin − 1) = (κ2(A) + 1)/(κ2(A) − 1),    (26.2)

and where we recall that kAk_2 = λmax and kA^{-1}k_2 = 1/λmin so that κ2(A) = λmax/λmin. For
x ∈ [λmin, λmax] we have that γ − 2x/(λmax − λmin) ∈ [−1, 1], and since |Tn| ≤ 1 there we arrive at

    |p(x)| ≤ 1/Tn(γ),    x ∈ [λmin, λmax].    (26.3)
Writing κ := κ2(A) we first observe that

    (κ+1)/(κ−1) ± sqrt((κ+1)^2/(κ−1)^2 − 1) = (κ+1)/(κ−1) ± sqrt(((κ+1)^2 − (κ−1)^2)/(κ−1)^2)
        = (κ + 1 ± sqrt(4κ))/(κ − 1) = (sqrt(κ) ± 1)^2 / ((sqrt(κ) + 1)(sqrt(κ) − 1)) = ((sqrt(κ) + 1)/(sqrt(κ) − 1))^{±1}.

Thanks to this and the definition of Tn,

    Tn(γ) = (1/2) ((γ + sqrt(γ^2 − 1))^n + (γ − sqrt(γ^2 − 1))^n)
          = (1/2) (((sqrt(κ) + 1)/(sqrt(κ) − 1))^n + ((sqrt(κ) − 1)/(sqrt(κ) + 1))^n).

Using (26.3) we see that for all x ∈ [λmin, λmax]

    |p(x)| ≤ 2 (((sqrt(κ) + 1)/(sqrt(κ) − 1))^n + ((sqrt(κ) − 1)/(sqrt(κ) + 1))^n)^{-1}.    (26.4)

Since p ∈ Pk for k = n and since estimate (26.4) holds true for all x ∈ Λ(A) ⊂ [λmin, λmax], we
deduce from Theorem 26.3 the following result:

Theorem 26.4. If CG has not yet converged after step k then

    ke(k)kA ≤ 2 (((sqrt(κ2(A)) + 1)/(sqrt(κ2(A)) − 1))^k + ((sqrt(κ2(A)) − 1)/(sqrt(κ2(A)) + 1))^k)^{-1} ke(0)kA.

Lecture 27

Computational Complexity and Preconditioning

Computational Complexity
For the computational complexity we proceed analogously to the linear iterative
methods: get an estimate for the required number of steps in terms of the system size n and
the tolerance εr, and then multiply by the cost per step.
Assumption 27.1.
1. Computing Ax involves a cost of Θ(n^α) as n → ∞ with some α ∈ [1, 2].
2. κ2(A) = Θ(n^β) as n → ∞ with some β ≥ 0.
3. ke(0)kA is uniformly bounded in n.
Theorem 27.2. Under Assumption 27.1, the cost to achieve ke(k)kA ≤ εr with SD is bounded
by a function C(n, εr) satisfying

    C(n, εr) = Θ(n^{α+β} log(εr^{-1})) as (n, εr) → (∞, 0).

Proof. By Theorem 26.2, ke(k)kA ≤ (1 − 1/κ2(A))^{k/2} ke(0)kA, hence it is sufficient to achieve
that

    ke(0)kA / εr ≤ (1/sqrt(1 − 1/κ2(A)))^k
    ⇔  k ≥ (log(ke(0)kA) + log(εr^{-1})) / log(1/sqrt(1 − 1/κ2(A))) =: k♯(n, εr).

Using the Taylor expansion we see that

    log(1/sqrt(1 − x)) = (1/2) x + O(x^2) as x → 0,

and with x = 1/κ2(A) = Θ(n^{-β}) one can proceed as in the proof of Theorem 22.2 to show the
assertion.

Theorem 27.3. Under Assumption 27.1, the cost to achieve ke(k)kA ≤ εr with CG is bounded
by a function C(n, εr) satisfying

    C(n, εr) = Θ(n^{α+β/2} log(εr^{-1})) as (n, εr) → (∞, 0).


(Spot the difference to the previous theorem!)

Proof. As for the previous theorem, but based on the estimate in Theorem 26.4. The fact that
there only sqrt(κ2(A)) appears rather than κ2(A) as in the estimate for SD (see Theorem 26.2)
leads to the prefactor 1/2 in front of β.

Preconditioning
Apparently, the lower the condition number κ2 (A) the better the convergence of SD and CG,
and this holds true for Krylov space methods in general. Since for every regular B ∈ Rn×n

b = Ax = ABB −1 x = (AB)(B −1 x) =: AB x̃,

an appropriate choice of B may yield a system with κ(AB) < κ(A). But a problem emerges:
Even if B ∈ Rn×n is positive definite (which we assume from now on) there is no guarantee that
AB is positive definite. To solve this problem let us recall what we effectively need for CG. The
symmetry of A is equivalent to

hAx, yi = hx, Ayi ∀x, y ∈ R^n.

This means that A considered as a linear operator on Rn is self-adjoint with respect to the
standard inner product. Consider now the bilinear form

h·, ·iB : Rn × Rn → R, hx, yiB := hx, Byi

which, thanks to the positivity of B, is an inner product (recall the arguments around (1.4) on
page 15, and recall also the notation k · kB for the associated norm). It turns out that à = AB
is self-adjoint with respect to h·, ·iB : For all x, y ∈ Rn

    hABx, yiB = hABx, Byi = x^T B^T A^T B y = x^T B A B y = hx, B(AB)yi = hx, AByiB,

using that A and B are symmetric.

Hence, we may just replace the standard inner product by h·, ·iB in algorithm CG and thus
obtain a method for computing the solution to b = AB x̃. The essential lines read as follows
where we inserted the definition of h·, ·iB :
1: d(0) := r(0) := b − AB x̃(0)
2: l(0) := hr(0), Br(0)i
3: if l(0) ≤ εr then
4: return x̃(0)
5: else
6: for k = 1, 2, . . . do
7: h(k−1) := ABd(k−1)
8: α(k−1) := l(k−1) / hd(k−1), Bh(k−1)i  (= hr(k−1), Br(k−1)i / hBd(k−1), ABd(k−1)i)
9: x̃(k) := x̃(k−1) + α(k−1) d(k−1)
10: r(k) := r(k−1) − α(k−1) h(k−1)  (= r(k−1) − α(k−1) ABd(k−1))
11: l(k) := hr(k), Br(k)i
12: if l(k) ≤ εr then
13: return x̃(k)
14: end if
15: β(k) := l(k) / l(k−1)  (= hr(k), Br(k)i / hr(k−1), Br(k−1)i)


16: d(k) := r(k) + β (k) d(k−1)


17: end for
18: end if
As we are interested in x rather than x̃ we may multiply line 9 with B from the left to obtain
x(k) := x(k−1) + α(k−1) Bd(k−1) . Doing the same with line 16 we see that we may introduce the
vectors q (k) := Bd(k) to replace the vectors d(k) . The preconditioner B only acts on the residual,
and it is possible to formulate and implement the algorithm such that only one matrix-vector
multiplication with B per step is required. Introducing s(k) := Br(k) we arrive at:

Algorithm 12 PCG (preconditioned conjugate gradient method)


input: A, B ∈ Rn×n positive definite, b, x(0) ∈ Rn , εr > 0.
output: x ∈ Rn with kAx − bkB ≤ εr .
1: r (0) := b − Ax(0)
2: q (0) := s(0) := Br (0)
3: l(0) := hr (0) , s(0) i
4: if l(0) ≤ εr then
5: return x(0)
6: else
7: for k = 1, 2, . . . do
8: h(k−1) := Aq (k−1)
9: α(k−1) := l(k−1) /hq (k−1) , h(k−1) i
10: x(k) := x(k−1) + α(k−1) q (k−1)
11: r(k) := r(k−1) − α(k−1) h(k−1)
12: s(k) := Br(k)
13: l(k) := hr(k) , s(k) i
14: if l(k) ≤ εr then
15: return x(k)
16: end if
17: β (k) := l(k) /l(k−1)
18: q (k) := s(k) + β (k) q (k−1)
19: end for
20: end if

The big question now: how to choose B? One will want the computation of s(k) = Br(k) to be
cheap. But a good approximation of A^{-1} or, at least, keeping κ(AB) low, is desired as well.
Finding a good trade-off depends strongly on the actual problem and the computing hardware
used, so it usually requires some testing and trying. Some ideas for defining preconditioners are
stated below. Observe that in algorithm PCG we need the action of B on a vector, which may be
cheaper to compute than building the matrix B and employing the usual matrix-vector product
algorithm.

• Very simple but occasionally highly effective: Choose B := D−1 where D is the diagonal
of A.

• Computing the LU factorisation or, since A is symmetric, the Cholesky factorisation A =
  LL^T usually leads to a fill-in, i.e., zero entries in a sparse matrix A become non-zero in the
  triangular matrices of the factorisation. But often those entries are 'small' compared to
  the other entries. So one may neglect the fill-in and compute an incomplete factorisation
  A ≈ Ã = L̃L̃^T. The preconditioner then is B := (L̃L̃^T)^{-1}.

• The linear solvers can serve as preconditioners as well. For instance, the action of B on a
vector may correspond to a few steps of the Jacobi method.

• Modern general preconditioners also include (possibly algebraic) multigrid methods (not
discussed in this course).
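A minimal plain-Python sketch of Algorithm 12, using the first idea above (the diagonal preconditioner B = D^{-1}); the action of B is passed as a function, in line with the remark that only matrix-vector actions of B are needed:

```python
def pcg(A, b, x, precond, eps_r=1e-12, max_iter=200):
    """Preconditioned CG (Algorithm 12); precond(r) returns the action B r.

    One application of A and one of B per iteration; the stopping
    criterion uses l = <r, B r> as in the algorithm above.
    """
    n = len(A)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))

    r = [bi - axi for bi, axi in zip(b, matvec(x))]
    s = precond(r)
    q = s[:]
    l = dot(r, s)
    for _ in range(max_iter):
        if l <= eps_r:
            break
        h = matvec(q)
        alpha = l / dot(q, h)
        x = [xi + alpha * qi for xi, qi in zip(x, q)]
        r = [ri - alpha * hi for ri, hi in zip(r, h)]
        s = precond(r)
        l_new = dot(r, s)
        beta = l_new / l
        q = [si + beta * qi for si, qi in zip(s, q)]
        l = l_new
    return x

# Jacobi (diagonal) preconditioner B = D^{-1}, the first idea above:
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
jacobi = lambda r: [r[i] / A[i][i] for i in range(len(r))]
x = pcg(A, b, [0.0, 0.0], jacobi)
# converges to the solution (1/11, 7/11) of Ax = b
```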

Lecture 28

Introduction to Eigenvalue Problems

Let A ∈ C^{n×n}. Suppose that Ax = αx with some x ∈ C^n \ {0} and α ∈ C. Then hx, Axi =
α hx, xi. The fraction

    rA(x) := hx, Axi / hx, xi

is called the Rayleigh coefficient. From the preceding calculation we see that Ax = rA(x) x if x
is an eigenvector. Remarkably, this provides a method to compute the eigenvalue corresponding
to an eigenvector.
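For a real symmetric example this reads (a minimal sketch, function name illustrative):

```python
def rayleigh(A, x):
    """Rayleigh coefficient r_A(x) = <x, Ax> / <x, x>."""
    n = len(A)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(xi * yi for xi, yi in zip(x, Ax)) / sum(xi * xi for xi in x)

# x = (1, 1) is an eigenvector of [[2, 1], [1, 2]] with eigenvalue 3:
lam = rayleigh([[2.0, 1.0], [1.0, 2.0]], [1.0, 1.0])  # -> 3.0
```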
Finding the eigenvalues of A is equivalent to finding the roots of the characteristic polynomial.
However, there is the following result of Abel (1824):

Theorem 28.1. If n ≥ 5 then there is a polynomial of degree n with rational coefficients that
has a real root which cannot be expressed by only using rational numbers, +, −, ∗, /, and (·)^{1/k}
with k ∈ N.

Consequently, algorithms for solving EVPs will be iterative.

Conditioning
The question is what impact a small perturbation ∆A of A has on the eigenvalues.
Let λ(A) ∈ C^n denote the vector of eigenvalues, ordered by decreasing absolute value and repeated
according to their algebraic multiplicity. The coefficients of the characteristic polynomial ρA(z)
of A depend continuously on the matrix entries, whence so do the roots. Therefore, λ : C^{n×n} →
C^n is a continuous function.

Theorem 28.2. Let λ be a simple eigenvalue of A with associated right and left normalised
eigenvectors x, y. Then for all sufficiently small ∆A ∈ C^{n×n} the matrix A + ∆A has an
eigenvalue λ + ∆λ with

    ∆λ = (1/hx, yi) hy, ∆Axi + O(k∆Ak_2^2) as k∆Ak_2 → 0.
Definition 28.3. Given A ∈ C^{n×n} and an eigenvalue λ ∈ C, let x, y ∈ C^n denote normalised
right and left eigenvectors, respectively. Then the eigenvalue condition number is

    κA(λ) := min_{x,y} 1/|hx, yi|  over normalised right and left eigenvectors x, y with hx, yi ≠ 0,

and κA(λ) := ∞ if no such right and left eigenvectors x, y ∈ C^n exist.


Corollary 28.4. Let λ ∈ C be a simple eigenvalue of A ∈ C^{n×n} with corresponding right and
left normalised eigenvectors x ∈ C^n and y ∈ C^n. Then for all sufficiently small ∆A ∈ C^{n×n}
the matrix A + ∆A has an eigenvalue λ + ∆λ with

    |∆λ| ≤ κA(λ) k∆Ak_2 + O(k∆Ak_2^2) as ∆A → 0.




A proof of the above Theorem 28.2 is stated in Stuart & Voss [1], Theorem 3.15. We look at an
example with a non-simple eigenvalue in order to see why things go wrong in that case.
Example: (1) Consider the matrix

    A = [ 1 1 ]
        [ 0 1 ].

λ = 1 is an eigenvalue of A with algebraic multiplicity 2. A right eigenvector is x = (1, 0)^T and
a left eigenvector is y = (0, 1)^T, which means that κA(λ) = ∞. In fact, consider

    ∆A = [ 0 0 ]
         [ δ 0 ].

The matrix A + ∆A has eigenvalues 1 ± sqrt(δ), so that |∆λ| = sqrt(|δ|) but not O(k∆Ak_2) = O(|δ|).
In this example we have δ ↦ λ1(δ) = 1 + sqrt(δ) ∈ C for the first eigenvalue, and this curve is
continuous in δ = 0 but not differentiable.
(2) If A is Hermitian then the left and right eigenspaces coincide so that κA (λ) = 1 in this case.
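The square-root behaviour in example (1) is easy to observe numerically; the small sketch below computes the eigenvalues of the perturbed 2×2 matrix from its characteristic polynomial λ^2 − tr(A)λ + det(A) = 0:

```python
def eig2(a, b, c, d):
    """Eigenvalues of the 2x2 matrix [[a, b], [c, d]] from
    lambda^2 - (a + d) lambda + (ad - bc) = 0 (real discriminant assumed)."""
    tr, det = a + d, a * d - b * c
    disc = (tr * tr / 4.0 - det) ** 0.5
    return tr / 2.0 - disc, tr / 2.0 + disc

delta = 1e-6                                      # ||dA||_2 = delta
lam_minus, lam_plus = eig2(1.0, 1.0, delta, 1.0)  # A + dA = [[1, 1], [delta, 1]]
# |dlambda| = sqrt(delta) = 1e-3, three orders of magnitude larger than delta
```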

Real Symmetric Case


From now on we will restrict our attention to symmetric real matrices A ∈ R^{n×n}, where the
eigenvalues are denoted by |λ1| ≥ |λ2| ≥ · · · ≥ |λn| ≥ 0. Associated normalised eigenvectors are
denoted by x1, . . . , xn ∈ R^n, which then also form an orthonormal basis of R^n.
Before proceeding with the first method to compute an eigenvalue let us state one result that
will be used later on in the convergence proof.

Theorem 28.5.
(a) A vector x ∈ R^n \ {0} is an eigenvector of A with eigenvalue λ if and only if rA(x) = λ and
∇rA(x) = 0.
(b) If x ∈ R^n is an eigenvector of A then

    |rA(x) − rA(y)| = O(kx − yk_2^2) as y → x.

Proof. (a) A short calculation (exercise) shows that

    ∇rA(z) = (2/kzk_2^2)(Az − rA(z) z)

for any z ≠ 0. If Ax = λx then rA(x) = hx, Axi/hx, xi = λ and ∇rA(x) = 2(λx − rA(x)x)/kxk_2^2 =
0 as claimed.
Vice versa, if ∇rA(x) = 0 then Ax = rA(x)x so that (x, rA(x)) is an eigenpair of A.
(b) This follows from considering the Taylor expansion of rA around x and using part (a).

Lecture 29

Power Iteration

Recall that we consider the case of A ∈ Rn×n symmetric. Denote the eigenvalues by |λ1 | ≥
|λ2 | ≥ · · · ≥ |λn | with corresponding normalised eigenvectors x1 , . . . , xn ∈ Rn .
Idea: iterate z (k) = Az (k−1) and hope that the iterates align with the eigenspace of λ1 .
Example: consider

    A = [ 1  0  ]      z(0) = (1, 1)^T.
        [ 0  1/2 ],

Then

    z(1) = (1, 2^{-1})^T,  z(2) = (1, 2^{-2})^T,  . . . ,  z(k) = (1, 2^{-k})^T → (1, 0)^T,

which is an eigenvector to λ1 = 1.

Algorithm 13 PI (power iteration)


input: A ∈ Rn×n symmetric, z(0) ∈ Rn with kz(0)k2 ≠ 0 and hz(0), x1i ≠ 0.
output: z(k) ∈ Rn and λ(k) ∈ R, approximations to x1 and λ1.
1: k = 1
2: repeat
3: w(k) := Az (k−1)
4: λ(k−1) := hw(k) , z (k−1) i
5: z (k) := w(k) /kw(k) k2
6: k := k + 1
7: until Stopping criterion fulfilled

Observe that λ(k−1) = rA (z (k−1) ). When the goal is to compute the eigenvalue (rather than an
eigenvector) then a criterion of the form |λ(k) − λ(k−1) | ≤ εr may be used.
For computing an eigenvector, a possible stopping criterion is kz(k) − z(k−1)k2 ≤ εr or kz(k) +
z(k−1)k2 ≤ εr, where we have to distinguish the sign because λ1 may be negative. In fact, for

    A = [ −1  0  ]      z(0) = (1, 1)^T,
        [ 0   1/2 ],

we obtain the iterates

    z(1) = (−1, 2^{-1})^T,  z(2) = (1, 2^{-2})^T,  z(3) = (−1, 2^{-3})^T,  . . . ,  z(k) = ((−1)^k, 2^{-k})^T,

which do not converge, but the vectors (−1)^k z(k) = (1, (−1)^k 2^{-k})^T converge to (1, 0)^T.
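A minimal plain-Python sketch of Algorithm 13 on the second example (λ1 = −1): the Rayleigh coefficient λ(k) converges even though the sign of z(k) keeps alternating.

```python
def power_iteration(A, z, num_iter=100):
    """Power iteration (Algorithm 13) for symmetric A; z must be normalised
    and not orthogonal to the dominant eigenvector x1."""
    n = len(A)
    lam = 0.0
    for _ in range(num_iter):
        w = [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]  # w = A z
        lam = sum(wi * zi for wi, zi in zip(w, z))  # Rayleigh coefficient r_A(z)
        norm = sum(wi * wi for wi in w) ** 0.5
        z = [wi / norm for wi in w]
    return z, lam

A = [[-1.0, 0.0], [0.0, 0.5]]
z, lam = power_iteration(A, [2 ** -0.5, 2 ** -0.5])
# lam approaches lambda_1 = -1, while z alternates between roughly +-(1, 0)
```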


Error Analysis
Theorem 29.1. Assume that |λ1| > |λ2| ≥ . . . and that α1 := hx1, z(0)i ≠ 0. Then the sequences
{z(k)}k and {λ(k)}k generated by PI satisfy

    kz(k) − σ(k) x1k2 = O(|λ2/λ1|^k),    |λ(k) − λ1| = O(|λ2/λ1|^{2k})    as k → ∞,

where σ(k) = α1 λ1^k / |α1 λ1^k| ∈ {±1}.
Pn
Proof. (recording available) Let us write z^{(0)} = \sum_{i=1}^{n} \alpha_i x_i. Since α1 ≠ 0,
\[ A^k z^{(0)} = \sum_{i=1}^{n} \alpha_i \lambda_i^k x_i = \alpha_1 \lambda_1^k \Bigl( x_1 + \sum_{i=2}^{n} \frac{\alpha_i}{\alpha_1} \Bigl(\frac{\lambda_i}{\lambda_1}\Bigr)^{k} x_i \Bigr) \]
and, using the orthonormality of the x_i,
\[ \|A^k z^{(0)}\|_2^2 = |\alpha_1 \lambda_1^k|^2 \Bigl( 1 + \underbrace{\sum_{i=2}^{n} \Bigl|\frac{\alpha_i}{\alpha_1}\Bigr|^2 \Bigl|\frac{\lambda_i}{\lambda_1}\Bigr|^{2k}}_{=:\gamma_k \ge 0} \Bigr). \]
Therefore
\[ z^{(k)} = \frac{A^k z^{(0)}}{\|A^k z^{(0)}\|_2} = \frac{A^k z^{(0)}}{|\alpha_1 \lambda_1^k|} \, \frac{1}{\sqrt{1+\gamma_k}} = \underbrace{\frac{\alpha_1 \lambda_1^k}{|\alpha_1 \lambda_1^k|}}_{=\sigma^{(k)}} \Bigl( x_1 + \sum_{i=2}^{n} \frac{\alpha_i}{\alpha_1}\Bigl(\frac{\lambda_i}{\lambda_1}\Bigr)^{k} x_i \Bigr) \frac{1}{\sqrt{1+\gamma_k}}. \]
Thus,
\begin{align*}
\|z^{(k)} - \sigma^{(k)} x_1\|_2
&\le \Bigl\| z^{(k)} - \frac{A^k z^{(0)}}{|\alpha_1 \lambda_1^k|} \Bigr\|_2 + \Bigl\| \frac{A^k z^{(0)}}{|\alpha_1 \lambda_1^k|} - \sigma^{(k)} x_1 \Bigr\|_2 \\
&= \Bigl| \frac{1}{\sqrt{1+\gamma_k}} - 1 \Bigr| \, \frac{\|A^k z^{(0)}\|_2}{|\alpha_1 \lambda_1^k|} + \Bigl\| \sum_{i=2}^{n} \frac{\alpha_i}{\alpha_1}\Bigl(\frac{\lambda_i}{\lambda_1}\Bigr)^{k} x_i \Bigr\|_2 \\
&= \Bigl( 1 - \frac{1}{\sqrt{1+\gamma_k}} \Bigr) \sqrt{1+\gamma_k} + \sqrt{\gamma_k} \\
&= \sqrt{1+\gamma_k} - 1 + \sqrt{\gamma_k} \\
&\le 2\sqrt{\gamma_k}.
\end{align*}
Now,
\[ \gamma_k = \frac{1}{|\alpha_1|^2} \Bigl|\frac{\lambda_2}{\lambda_1}\Bigr|^{2k} \sum_{i=2}^{n} |\alpha_i|^2 \underbrace{\Bigl|\frac{\lambda_i}{\lambda_2}\Bigr|^{2k}}_{\le 1} \le \frac{1}{|\alpha_1|^2} \Bigl|\frac{\lambda_2}{\lambda_1}\Bigr|^{2k} \sum_{i=1}^{n} |\alpha_i|^2 = \frac{1}{|\alpha_1|^2} \Bigl|\frac{\lambda_2}{\lambda_1}\Bigr|^{2k} \|z^{(0)}\|_2^2, \]
so that
\[ \sqrt{\gamma_k} \le \frac{\|z^{(0)}\|_2}{|\alpha_1|} \Bigl|\frac{\lambda_2}{\lambda_1}\Bigr|^{k}, \]
from which we get the first assertion. The second follows with Theorem 28.5 (b):
\[ |\lambda^{(k)} - \lambda_1| = |r_A(z^{(k)}) - r_A(\sigma^{(k)} x_1)| = O(\|z^{(k)} - \sigma^{(k)} x_1\|_2^2) = O\Bigl(\Bigl|\frac{\lambda_2}{\lambda_1}\Bigr|^{2k}\Bigr). \]
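The two rates are easy to observe numerically; the following sketch (our own illustration, not part of the notes) runs PI on the diagonal example A = diag(1, 1/2), for which λ2/λ1 = 1/2, and records both errors:

```python
import math

# Convergence check for PI on A = diag(1, 1/2): lambda_1 = 1, x_1 = (1, 0)^T,
# so |lambda_2/lambda_1| = 1/2 and sigma^(k) = 1.
z = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]    # normalised z^(0)
vec_errs, val_errs = [], []
for k in range(1, 11):
    w = [z[0], 0.5 * z[1]]                           # w^(k) = A z^(k-1)
    wnorm = math.sqrt(w[0] ** 2 + w[1] ** 2)
    z = [w[0] / wnorm, w[1] / wnorm]                 # z^(k)
    lam = z[0] ** 2 + 0.5 * z[1] ** 2                # Rayleigh quotient r_A(z^(k))
    vec_errs.append(math.sqrt((z[0] - 1.0) ** 2 + z[1] ** 2))  # ||z^(k) - x_1||_2
    val_errs.append(abs(lam - 1.0))                  # |lambda^(k) - lambda_1|
# vec_errs shrink by a factor of about 1/2 per step,
# val_errs by about (1/2)^2 = 1/4, as Theorem 29.1 predicts.
```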


Remarks on the Implementation


The most expensive part in each step is the matrix-vector multiplication in line 3 (the other
lines involve only O(n) operations as n → ∞). To reduce the cost, the idea is to transform the
matrix to B = SAS −1 with a regular matrix S so that B contains many zeros. Recall that such
a similarity transformation does not change the eigenvalues, and eigenvectors corresponding to
λ1 , . . . , λn are given by Sx1 , . . . , Sxn .
One possibility is to transform A to upper Hessenberg form,
\[ B = \begin{pmatrix} * & \cdots & \cdots & \cdots & * \\ * & \ddots & & & \vdots \\ 0 & \ddots & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & * & * \end{pmatrix}, \]
using Householder reflections again. But in contrast to the computation of a QR factorisation
only the entries below the first lower diagonal are made to vanish. Let us start with
 
\[ A = \begin{pmatrix} * & u_1 & \cdots & u_{n-1} \\ u_1 & * & \cdots & * \\ \vdots & \vdots & & \vdots \\ u_{n-1} & * & \cdots & * \end{pmatrix}; \qquad Q_1 A = \begin{pmatrix} * & u_1 & \cdots & u_{n-1} \\ \pm\|u\|_2 & * & \cdots & * \\ 0 & \vdots & & \vdots \\ \vdots & \vdots & & \vdots \\ 0 & * & \cdots & * \end{pmatrix}, \]
where Q_1 = diag(1, H_1) with H_1 = I − 2vv^T ∈ R^{(n-1)×(n-1)} is a reflection that leaves the first row of A unchanged. When multiplying with Q_1^{-1} = Q_1^T = Q_1 from the right, the entries above the first upper diagonal in the first row vanish as well:
\[ Q_1 A Q_1^{-1} = \begin{pmatrix} * & \pm\|u\|_2 & 0 & \cdots & 0 \\ \pm\|u\|_2 & * & \cdots & \cdots & * \\ 0 & \vdots & & & \vdots \\ \vdots & \vdots & & & \vdots \\ 0 & * & \cdots & \cdots & * \end{pmatrix}. \]
Observe that this is due to the symmetry of A. For an arbitrary matrix we would keep some entries in the first row.
Since Q_1 A Q_1^{-1} = Q_1 A Q_1^T is symmetric, too, we may proceed in a similar fashion to transform A to a tridiagonal matrix
\[ Q_{n-1} \cdots Q_1 A Q_1^{-1} \cdots Q_{n-1}^{-1} = \begin{pmatrix} * & * & 0 & \cdots & 0 \\ * & * & \ddots & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & \ddots & \ddots & * \\ 0 & \cdots & 0 & * & * \end{pmatrix} = B. \]
The matrix-vector multiplication with B now has a cost of O(n), matching the other operations per step.
It should be remarked that PI then yields an eigenvector of B which, using Q_1, . . . , Q_{n-1}, needs to be transformed back to an eigenvector of A.
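The reduction described above can be sketched in pure Python (a minimal illustration; the function name is our own, and no attempt is made to exploit symmetry for efficiency as a real implementation would):

```python
import math

def tridiagonalize(A):
    """Reduce a symmetric matrix A (list of rows) to tridiagonal form by
    Householder similarity transforms B = Q_{n-1}...Q_1 A Q_1^T...Q_{n-1}^T."""
    n = len(A)
    B = [row[:] for row in A]                       # work on a copy
    for k in range(n - 2):
        u = [B[i][k] for i in range(k + 1, n)]      # column part below the diagonal
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            continue                                # column already reduced
        v = u[:]
        v[0] += norm if v[0] >= 0 else -norm        # cancellation-free Householder vector
        vnorm2 = sum(x * x for x in v)
        m = n - k - 1
        # apply H = I - 2 v v^T / |v|^2 from the left (rows k+1..n-1) ...
        for j in range(n):
            s = sum(v[i] * B[k + 1 + i][j] for i in range(m))
            for i in range(m):
                B[k + 1 + i][j] -= 2.0 * s * v[i] / vnorm2
        # ... and from the right (columns k+1..n-1), restoring symmetry
        for i in range(n):
            s = sum(B[i][k + 1 + j] * v[j] for j in range(m))
            for j in range(m):
                B[i][k + 1 + j] -= 2.0 * s * v[j] / vnorm2
    return B
```

Since B is orthogonally similar to A, the eigenvalues are unchanged, and PI applied to B costs only O(n) per step.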


Computational Complexity
We proceed in the usual way for iterative methods, estimating the number of required steps and multiplying it by the cost per step.
The goal
\[ \|z^{(k)} - \sigma^{(k)} x_1\|_2 \le \varepsilon_r \tag{29.1} \]
is achieved if
\[ 2 \, \frac{\|z^{(0)}\|_2}{|\alpha_1|} \Bigl|\frac{\lambda_2}{\lambda_1}\Bigr|^{k} \le \varepsilon_r, \]
which requires
\[ k^{\sharp}(n, \varepsilon_r) = \frac{\log(\varepsilon_r^{-1}) + \log(2\|z^{(0)}\|_2) - \log(|\alpha_1|)}{\log(|\lambda_1/\lambda_2|)} \]
steps.
Assume now that
\[ \frac{\lambda_1}{\lambda_2} = 1 + \Theta(n^{-\beta}) \quad \text{for some } \beta > 0 \text{ as } n \to \infty, \]
that ‖z^{(0)}‖_2 and |α1| are Θ(1) as n → ∞, and that the cost per iteration step is O(n) as n → ∞.

Theorem 29.2. Under the above assumptions, the cost to achieve (29.1) with PI is bounded by a function C(n, ε_r) satisfying
\[ C(n, \varepsilon_r) = \Theta\bigl( n^{1+\beta} \log(\varepsilon_r^{-1}) + n^3 \bigr) \quad \text{as } (n, \varepsilon_r) \to (\infty, 0). \]

The proof is similar to that of Theorem 22.2, which explains the first term n^{1+β} log(ε_r^{-1}). Here, the β arises from the fact that λ2/λ1 converges to 1 as n → ∞, and the exponent 1 comes from the cost per iteration step, which requires that the matrix-vector multiplication is O(n) as n → ∞. The difference from previous results is the additional term n^3, which arises from transforming the initial matrix to tridiagonal form.

Lecture 30

Simultaneous Iteration and the QR Method

A full video of this lecture can be found here.


Suppose that λ1 and x1 are known. If z^{(0)} ⊥ x1 then also Az^{(0)} ⊥ x1, at least up to numerical errors, which may make it necessary to project back onto span{x1}^⊥. As a consequence, PI with such a z^{(0)} will yield λ2 and x2 (under appropriate, not too severe assumptions). Afterwards, one could go on picking a z^{(0)} ⊥ span{x1, x2} in order to compute λ3 and x3, and so on. But such successive runs of PI eventually become disadvantageous.
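The projection idea can be illustrated as follows (a toy sketch with our own names, assuming the exact x1 is already known, which it would not be in practice; the projection in each step compensates the numerical drift mentioned above):

```python
import math

def pi_deflated(A, x1, z0, steps=60):
    """Power iteration on span{x1}^perp: project out the known normalised
    eigenvector x1 after every multiplication with A."""
    n = len(A)
    z = z0[:]
    for _ in range(steps):
        z = [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]  # z <- A z
        c = sum(zi * xi for zi, xi in zip(z, x1))
        z = [zi - c * xi for zi, xi in zip(z, x1)]                     # project onto span{x1}^perp
        norm = math.sqrt(sum(x * x for x in z))
        z = [x / norm for x in z]
    # Rayleigh quotient of the final iterate
    lam = sum(z[i] * sum(A[i][j] * z[j] for j in range(n)) for i in range(n))
    return lam, z

# A = diag(3, 2, 1) with x1 = e1 known: the deflated iteration
# recovers the second eigenpair lambda_2 = 2, x2 = e2.
lam2, z = pi_deflated([[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]],
                      [1.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```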
Idea: Perform PI with a set {z_1^{(0)}, . . . , z_n^{(0)}} of orthonormal vectors and re-orthonormalise after each multiplication with A. More precisely, defining Z^{(0)} := (z_1^{(0)}, . . . , z_n^{(0)}) ∈ C^{n×n}, let us consider the following simultaneous iteration:
for k = 1, 2, . . . do
    W^{(k)} := AZ^{(k-1)}
    compute a QR factorisation W^{(k)} = Z^{(k)} R^{(k)}
    Λ^{(k)} := (Z^{(k-1)})^T W^{(k)} = (Z^{(k-1)})^T AZ^{(k-1)}
end for
We write w_i^{(k)}, i = 1, . . . , n, for the column vectors of W^{(k)} in the following, and let us for simplicity assume that λ_i > 0 so that we do not have to discuss oscillations. Clearly, w_1^{(k)} = Az_1^{(k-1)}, and from the QR factorisation w_1^{(k)} = z_1^{(k)} r_{1,1}^{(k)}. Because ‖z_1^{(k)}‖_2 = 1 we obtain that r_{1,1}^{(k)} = ‖w_1^{(k)}‖_2. But this means that
\[ z_1^{(k)} = \frac{w_1^{(k)}}{\|w_1^{(k)}\|_2} = \frac{A z_1^{(k-1)}}{\|A z_1^{(k-1)}\|_2}, \]
which we recognise as the power iteration. We expect that z_1^{(k)} → x_1, and furthermore
\[ \Lambda_{1,1}^{(k)} = (z_1^{(k-1)})^T w_1^{(k)} = (z_1^{(k-1)})^T A z_1^{(k-1)} = r_A(z_1^{(k-1)}) \to \lambda_1 \quad \text{as } k \to \infty. \]
So far so good. Now, consider
\[ w_2^{(k)} = A z_2^{(k-1)} = z_1^{(k)} r_{1,2}^{(k)} + z_2^{(k)} r_{2,2}^{(k)}. \]
Taking the scalar product with z_1^{(k)} we see that r_{1,2}^{(k)} = ⟨z_1^{(k)}, w_2^{(k)}⟩, whence
\[ z_2^{(k)} r_{2,2}^{(k)} = \underbrace{w_2^{(k)} - \langle z_1^{(k)}, w_2^{(k)} \rangle z_1^{(k)}}_{\text{projection of } w_2^{(k)} \text{ onto } \operatorname{span}\{z_1^{(k)}\}^{\perp}}. \]


Since r_{2,2}^{(k)} = ‖w_2^{(k)} − ⟨z_1^{(k)}, w_2^{(k)}⟩ z_1^{(k)}‖_2 and recalling that w_2^{(k)} = Az_2^{(k-1)}, we obtain that
\[ z_2^{(k)} = \frac{A z_2^{(k-1)} - \langle z_1^{(k)}, A z_2^{(k-1)} \rangle z_1^{(k)}}{\| A z_2^{(k-1)} - \langle z_1^{(k)}, A z_2^{(k-1)} \rangle z_1^{(k)} \|_2}. \]
Given that z_1^{(k)} → x_1 we expect that lim_{k→∞} z_2^{(k)} ∈ span{x_1}^⊥. Altogether, the iterates z_2^{(k)} form an approximation to the power iteration on span{x_1}^⊥ and, as discussed at the beginning, can be expected to converge to x_2. We also see that then
\[ \Lambda_{2,2}^{(k)} = (z_2^{(k-1)})^T w_2^{(k)} = (z_2^{(k-1)})^T A z_2^{(k-1)} = r_A(z_2^{(k-1)}) \to \lambda_2 \quad \text{as } k \to \infty. \]
Furthermore, using the orthogonality of x_1 and x_2 we expect that
\[ \Lambda_{1,2}^{(k)} = (z_1^{(k-1)})^T w_2^{(k)} \to x_1^T A x_2 = x_1^T (\lambda_2 x_2) = 0 \quad \text{as } k \to \infty. \]
Arguing in an analogous way for the other z_i^{(k)} and entries of Λ^{(k)}, we altogether expect that
\[ Z^{(k)} \to (x_1, \ldots, x_n) \quad \text{and} \quad \Lambda^{(k)} \to \operatorname{diag}(\lambda_1, \ldots, \lambda_n) \quad \text{as } k \to \infty. \]
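The simultaneous iteration can be sketched compactly in Python (our own minimal implementation; the QR factorisation uses modified Gram-Schmidt, and all names are our own choices):

```python
import math

def qr_mgs(W):
    """Modified Gram-Schmidt QR of a square matrix given as a list of columns."""
    n = len(W)
    V = [col[:] for col in W]                 # work on copies of the columns
    Z, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            R[i][j] = sum(Z[i][t] * V[j][t] for t in range(n))
            V[j] = [V[j][t] - R[i][j] * Z[i][t] for t in range(n)]
        R[j][j] = math.sqrt(sum(x * x for x in V[j]))
        Z.append([x / R[j][j] for x in V[j]])
    return Z, R                               # Z: orthonormal columns, R: upper triangular

def simultaneous_iteration(A, steps=100):
    n = len(A)
    # columns of Z^(0) = identity (an orthonormal set)
    Z = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]
    for _ in range(n and steps):
        # W^(k) = A Z^(k-1), stored column by column
        W = [[sum(A[i][t] * z[t] for t in range(n)) for i in range(n)] for z in Z]
        # Lambda^(k) = (Z^(k-1))^T W^(k), computed before Z is updated
        Lam = [[sum(Z[i][t] * W[j][t] for t in range(n)) for j in range(n)] for i in range(n)]
        Z, _ = qr_mgs(W)                      # re-orthonormalise: W^(k) = Z^(k) R^(k)
    return [Lam[i][i] for i in range(n)]      # diagonal approximates the eigenvalues

# symmetric test matrix [[2,1],[1,2]] with eigenvalues 3 and 1
evals = simultaneous_iteration([[2.0, 1.0], [1.0, 2.0]])
```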

If only the eigenvalues are required then there is a very elegant way to reformulate the simultaneous iteration. Define
\[ Q^{(k)} := (Z^{(k-1)})^T Z^{(k)}. \]
Then
\[ Q^{(k)} R^{(k)} = (Z^{(k-1)})^T Z^{(k)} (Z^{(k)})^{-1} W^{(k)} = (Z^{(k-1)})^T W^{(k)} = \Lambda^{(k)} \]
and
\[ R^{(k)} Q^{(k)} = (Z^{(k)})^{-1} W^{(k)} (Z^{(k-1)})^T Z^{(k)} = (Z^{(k)})^T A Z^{(k-1)} (Z^{(k-1)})^{-1} Z^{(k)} = (Z^{(k)})^T A Z^{(k)} = \Lambda^{(k+1)}. \]
So when we have a QR factorisation of the current matrix Λ^{(k)} approximating the eigenvalues, we only have to interchange the two factors to compute the next iterate Λ^{(k+1)}.

Algorithm 14 QRI (QR iteration for eigenvalues)

input: A ∈ R^{n×n} symmetric and tridiagonal.
output: Λ ∈ R^{n×n} with diagonal entries approximating the eigenvalues of A.
1: Λ^{(0)} := A
2: for k = 1, 2, . . . do
3:   compute a QR factorisation Λ^{(k-1)} = Q^{(k-1)} R^{(k-1)} (with Givens rotations)
4:   Λ^{(k)} := R^{(k-1)} Q^{(k-1)}
5:   stop iteration if the diagonal entries of Λ^{(k)} are close to those of Λ^{(k-1)} (modulo sign)
6: end for

Example: Numerically investigate (and perhaps consider a bit of history):
\[ A = \begin{pmatrix} 2 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 2 \end{pmatrix}. \]
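A sketch of the suggested experiment (our own implementation; for simplicity it uses a Gram-Schmidt QR rather than the Givens rotations of Algorithm 14, ignoring the tridiagonal structure):

```python
import math

def qr(M):
    """QR factorisation via modified Gram-Schmidt; M is a list of rows."""
    n = len(M)
    V = [[M[i][j] for i in range(n)] for j in range(n)]   # columns of M
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            R[i][j] = sum(Q[i][t] * V[j][t] for t in range(n))
            V[j] = [V[j][t] - R[i][j] * Q[i][t] for t in range(n)]
        R[j][j] = math.sqrt(sum(x * x for x in V[j]))
        Q.append([x / R[j][j] for x in V[j]])
    return Q, R                                           # Q as list of columns

def qr_iteration(A, steps=200):
    """Algorithm 14 (QRI): factor, then multiply the factors in reverse order."""
    n = len(A)
    Lam = [row[:] for row in A]
    for _ in range(steps):
        Q, R = qr(Lam)
        # Lam <- R Q; (RQ)_{ij} = sum_t R[i][t] Q[j][t] since Q stores columns
        Lam = [[sum(R[i][t] * Q[j][t] for t in range(n)) for j in range(n)] for i in range(n)]
    return sorted(Lam[i][i] for i in range(n))

A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
evals = qr_iteration(A)
exact = sorted(2 - 2 * math.cos(k * math.pi / 5) for k in range(1, 5))
```

The computed diagonal matches the exact eigenvalues 2 − 2 cos(kπ/5), k = 1, . . . , 4, whose closed forms (3 ± √5)/2 and (5 ± √5)/2 involve √5 and hence the golden ratio, which may be the bit of history alluded to.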

