

Numerical
Linear
Algebra
William Layton
Myron Sussman
University of Pittsburgh, USA

World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Cataloging-in-Publication Data


Names: Layton, W. J. (William J.), author. | Sussman, Mike Myron, author.
Title: Numerical linear algebra / William Layton, Mike Myron Sussman, University of Pittsburgh.
Description: New Jersey : World Scientific, 2020. | Includes bibliographical references and index.
Identifiers: LCCN 2020025413 | ISBN 9789811223891 (hardcover) |
ISBN 9789811224843 (paperback) | ISBN 9789811223907 (ebook) |
ISBN 9789811223914 (ebook other)
Subjects: LCSH: Algebras, Linear. | Numerical analysis.
Classification: LCC QA184.2 .L395 2020 | DDC 518/.43--dc23
LC record available at https://lccn.loc.gov/2020025413

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

Copyright © 2020 by World Scientific Publishing Co. Pte. Ltd.


All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.

For any available supplementary material, please visit


https://www.worldscientific.com/worldscibooks/10.1142/11926#t=suppl

Desk Editor: Liu Yumeng

Printed in Singapore




Preface

“It is the mark of an educated mind to rest satisfied


with the degree of precision that the nature of the subject
permits and not to seek exactness when only an approxi-
mation is possible.”
— Aristotle (384 BCE)

This book presents numerical linear algebra for students from a diverse
audience of senior level undergraduates and beginning graduate students in
mathematics, science and engineering. Typical courses it serves include:
A one term, senior level class on Numerical Linear Algebra.
Typically, some students in the class will be good programmers but have
never taken a theoretical linear algebra course; some may have had many
courses in theoretical linear algebra but cannot find the on/off switch on a
computer; some have been using methods of numerical linear algebra for a
while but have never seen any of its background and want to understand
why methods fail sometimes and work sometimes.
Part of a graduate “gateway” course on numerical methods.
This course gives an overview in two terms of useful methods in compu-
tational mathematics and includes a computer lab teaching programming
and visualization connected to the methods.
Part of a one term course on the theory of iterative meth-
ods. This class is normally taken by students in mathematics who want to
study numerical analysis further or to see deeper aspects of multivariable
advanced calculus, linear algebra and matrix theory as they meet applica-
tions.
This wide but highly motivated audience presents an interesting chal-
lenge. In response, the material is developed as follows: Every topic in


numerical linear algebra can be presented algorithmically and theoretically


and both views of it are important. The early sections of each chapter
present the background material needed for that chapter, an essential step
since backgrounds are diverse. Next methods are developed algorithmically
with examples. Convergence theory is developed and the parts of the proofs
that provide immediate insight into why a method works or how it might
fail are given in detail. A few longer and more technically intricate proofs
are either referenced or postponed to a later section of the chapter.
Our first and central idea about learning is “to begin with the end in
mind ”. In this book the end is to provide a modern understanding of useful
tools. The choice of topics is thus made based on utility rather than beauty
or completeness. The theory of algorithms that have proven to be robust
and reliable receives less coverage than ones for which knowing something
about the method can make a difference between solving a problem and not
solving one. Thus, iterative methods are treated in more detail than direct
methods for both linear systems and eigenvalue problems. Among iterative
methods, the beautiful theory of SOR is abbreviated because conjugate
gradient methods are a (currently at least) method of choice for solving
sparse SPD linear systems. Algorithms are given in pseudocode based on
the widely used Matlab language. The pseudocode transparently presents
algorithmic steps and, at the same time, serves as a framework for computer
implementation of the algorithm.
The material in this book is constantly evolving. Welcome!

Contents

Preface v

1. Introduction 1
1.1 Sources of Arithmetical Error . . . . . . . . . . . . . . . . 4
1.2 Measuring Errors: The Trademarked Quantities . . . . . . 9

2. Linear Systems and Finite Precision Arithmetic 13


2.1 Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . 13
2.2 Eigenvalues and Singular Values . . . . . . . . . . . . . . . 17
2.2.1 Properties of eigenvalues . . . . . . . . . . . . . . 20
2.3 Error and Residual . . . . . . . . . . . . . . . . . . . . . . 22
2.4 When is a Linear System Solvable? . . . . . . . . . . . . . 27
2.5 When is an N×N Matrix Numerically Singular? . . . . . . 30

3. Gaussian Elimination 33
3.1 Elimination + Backsubstitution . . . . . . . . . . . . . . . 33
3.2 Algorithms and Pseudocode . . . . . . . . . . . . . . . . . 36
3.3 The Gaussian Elimination Algorithm . . . . . . . . . . . . 38
3.3.1 Computational Complexity and Gaussian
Elimination . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Pivoting Strategies . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Tridiagonal and Banded Matrices . . . . . . . . . . . . . . 48
3.6 The LU Decomposition . . . . . . . . . . . . . . . . . . . 52

4. Norms and Error Analysis 61


4.1 FENLA and Iterative Improvement . . . . . . . . . . . . . 61


4.2 Vector Norms . . . . . . . . . . . . . . . . . . . . . . . . . 64


4.2.1 Norms that come from inner products . . . . . . . 66
4.3 Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 A few proofs . . . . . . . . . . . . . . . . . . . . . 72
4.4 Error, Residual and Condition Number . . . . . . . . . . . 74
4.5 Backward Error Analysis . . . . . . . . . . . . . . . . . . . 79
4.5.1 The general case . . . . . . . . . . . . . . . . . . . 81

5. The MPP and the Curse of Dimensionality 89


5.1 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 1D Model Poisson Problem . . . . . . . . . . . . . . . . . 93
5.2.1 Difference approximations . . . . . . . . . . . . . . 93
5.2.2 Reduction to linear equations . . . . . . . . . . . . 95
5.2.3 Complexity of solving the 1D MPP . . . . . . . . 99
5.3 The 2D MPP . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4 The 3D MPP . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.5 The Curse of Dimensionality . . . . . . . . . . . . . . . . . 110
5.5.1 Computing a residual grows slowly with dimension 111

6. Iterative Methods 115


6.1 Introduction to Iterative Methods . . . . . . . . . . . . . . 115
6.1.1 Iterative methods three standard forms . . . . . . 121
6.1.2 Three quantities of interest . . . . . . . . . . . . . 122
6.2 Mathematical Tools . . . . . . . . . . . . . . . . . . . . . . 125
6.3 Convergence of FOR . . . . . . . . . . . . . . . . . . . . . 128
6.3.1 Optimization of ρ . . . . . . . . . . . . . . . . . . 130
6.3.2 Geometric analysis of the min-max problem . . . . 130
6.3.3 How many FOR iterations? . . . . . . . . . . . . . 134
6.4 Better Iterative Methods . . . . . . . . . . . . . . . . . . . 136
6.4.1 The Gauss–Seidel Method . . . . . . . . . . . . . 136
6.4.2 Relaxation . . . . . . . . . . . . . . . . . . . . . . 139
6.4.3 Gauss–Seidel with over-relaxation = Successive
Over Relaxation . . . . . . . . . . . . . . . . . . . 141
6.4.4 Three level over-relaxed FOR . . . . . . . . . . . . 143
6.4.5 Algorithmic issues: storing a large, sparse matrix . 145
6.5 Dynamic Relaxation . . . . . . . . . . . . . . . . . . . . . 147
6.6 Splitting Methods . . . . . . . . . . . . . . . . . . . . . . . 149
6.6.1 Parameter selection . . . . . . . . . . . . . . . . . 153

6.6.2 Connection to dynamic relaxation . . . . . . . . . 153


6.6.3 The ADI splitting . . . . . . . . . . . . . . . . . . 154

7. Solving Ax = b by Optimization 157


7.1 The Connection to Optimization . . . . . . . . . . . . . . 159
7.2 Application to Stationary Iterative Methods . . . . . . . . 164
7.3 Application to Parameter Selection . . . . . . . . . . . . . 168
7.4 The Steepest Descent Method . . . . . . . . . . . . . . . . 171

8. The Conjugate Gradient Method 179


8.1 The CG Algorithm . . . . . . . . . . . . . . . . . . . . . . 179
8.1.1 Algorithmic options . . . . . . . . . . . . . . . . . 183
8.1.2 CG’s two main convergence theorems . . . . . . . 184
8.2 Analysis of the CG Algorithm . . . . . . . . . . . . . . . . 187
8.3 Convergence by the Projection Theorem . . . . . . . . . . 187
8.3.1 The Gram–Schmidt algorithm . . . . . . . . . . . 191
8.3.2 Orthogonalization of moments instead of
Gram–Schmidt . . . . . . . . . . . . . . . . . . . . 192
8.3.3 A simplified CG method . . . . . . . . . . . . . . 196
8.4 Error Analysis of CG . . . . . . . . . . . . . . . . . . . . . 198
8.5 Preconditioning . . . . . . . . . . . . . . . . . . . . . . . . 202
8.6 CGN for Non-SPD Systems . . . . . . . . . . . . . . . . . 204

9. Eigenvalue Problems 211


9.1 Introduction and Review of Eigenvalues . . . . . . . . . . 211
9.1.1 Properties of eigenvalues . . . . . . . . . . . . . . 215
9.2 Gershgorin Circles . . . . . . . . . . . . . . . . . . . . . . 217
9.3 Perturbation Theory of Eigenvalues . . . . . . . . . . . . . 219
9.3.1 Perturbation bounds . . . . . . . . . . . . . . . . . 220
9.4 The Power Method . . . . . . . . . . . . . . . . . . . . . . 221
9.4.1 Convergence of the power method . . . . . . . . . 222
9.4.2 Symmetric matrices . . . . . . . . . . . . . . . . . 223
9.5 Inverse Power, Shifts and Rayleigh Quotient Iteration . . 224
9.5.1 The inverse power method . . . . . . . . . . . . . 225
9.5.2 Rayleigh Quotient Iteration . . . . . . . . . . . . . 226
9.6 The QR Method . . . . . . . . . . . . . . . . . . . . . . . 227

Appendix A An Omitted Proof 231



Appendix B Tutorial on Basic Matlab Programming 233


B.1 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
B.2 MATLAB Files . . . . . . . . . . . . . . . . . . . . . . . . 234
B.3 Variables, Values and Arithmetic . . . . . . . . . . . . . . 236
B.4 Variables Are Matrices . . . . . . . . . . . . . . . . . . . . 238
B.5 Matrix and Vector Operations . . . . . . . . . . . . . . . . 241
B.6 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . 246
B.7 Script and Function Files . . . . . . . . . . . . . . . . . . 248
B.8 MATLAB Linear Algebra Functionality . . . . . . . . . . . 250
B.8.1 Solving matrix systems in MATLAB . . . . . . . . 250
B.8.2 Condition number of a matrix . . . . . . . . . . . 251
B.8.3 Matrix factorizations . . . . . . . . . . . . . . . . 251
B.8.4 Eigenvalues and singular values . . . . . . . . . . 252
B.9 Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
B.10 Execution Speed . . . . . . . . . . . . . . . . . . . . . . . 254
B.10.1 Initializing vectors and matrices in MATLAB . . . 255
B.10.2 Array notation and efficiency in MATLAB . . . . . 255

Bibliography 259

Index 261

Chapter 1

Introduction

There is no such thing as the Scientific Revolution, and


this is a book about it.
— Steven Shapin, The Scientific Revolution.

This book presents numerical linear algebra. The presentation is intended
for the first exposure to the subject for students from mathematics,
computer science, and engineering. Numerical linear algebra studies several
problems:
Linear systems: Ax = b : Solve the N × N linear system.
Eigenvalue problems: Aφ = λφ : Find all the eigenvalues and eigen-
vectors or a selected subset.
Ill-posed problems and least squares: Find a unique useful solution
(that is as accurate as possible given the data errors) of a linear system that
is underdetermined, overdetermined or nearly singular with noisy data.
We focus on the first, treat the second lightly and omit the third. This
choice reflects the order the algorithms and theory are built, not the impor-
tance of the three. Broadly, there are two types of subproblems: small to
medium scale and large scale. "Large" in large scale problems can be defined
as follows: a problem is large if memory management and turnaround time
are central challenges. Thus, a problem is not large if one can simply call
a canned linear algebra routine and solve the problem reliably within time
and resource constraints with no special expertise. Small to medium scale
problems can also be very challenging when the systems are very sensitive
to data and roundoff errors and data errors are significant. The latter is
typical when the coefficients and RHS come from experimental data, which
always come with noise. It also occurs when the coefficients depend on
physical constants which may be known to only one significant digit.


The origin of numerical linear algebra lies in a 1947 paper of von Neu-
mann and Goldstine [von Neumann and Goldstine (1947)]. Its table of
contents, given below, is quite modern in all respects except for the omis-
sion of iterative methods:

NUMERICAL INVERTING OF MATRICES OF HIGH


ORDER
JOHN VON NEUMANN AND H. H. GOLDSTINE

ANALYTIC TABLE OF CONTENTS

PREFACE
CHAPTER I. The sources of errors in a computation
1.1. The sources of errors.
(A) Approximations implied by the mathematical model.
(B) Errors in observational data.
(C) Finitistic approximations to transcendental and implicit
mathematical formulations.
(D) Errors of computing instruments in carrying out elemen-
tary operations: “Noise.” Round off errors. “Analogy”
and digital computing. The pseudo-operations.
1.2. Discussion and interpretation of the errors (A)–(D). Stability.
1.3. Analysis of stability. The results of Courant, Friedrichs, and
Lewy.
1.4. Analysis of “noise” and round off errors and their relation to
high speed computing.
1.5. The purpose of this paper. Reasons for the selection of its
problem.
1.6. Factors which influence the errors (A)–(D). Selection of the
elimination method.
1.7. Comparison between “analogy” and digital computing meth-
ods.
CHAPTER II. Round off errors and ordinary algebraical processes.
2.1. Digital numbers, pseudo-operations. Conventions regarding
their nature, size and use: (a), (b).
2.2. Ordinary real numbers, true operations. Precision of data.
Conventions regarding these: (c), (d).
2.3. Estimates concerning the round off errors:

(e) Strict and probabilistic, simple precision.


(f) Double precision for expressions $\sum_{i=1}^{n} x_i y_j$.
2.4. The approximative rules of algebra for pseudo-operations.
2.5. Scaling by iterated halving.
CHAPTER III. Elementary matrix relations.
3.1. The elementary vector and matrix operations.
3.2. Properties of |A|, |A| and N (A).
3.3. Symmetry and definiteness.
3.4. Diagonality and semi-diagonality.
3.5. Pseudo-operations for matrices and vectors. The relevant es-
timates.
CHAPTER IV. The elimination method.
4.1. Statement of the conventional elimination method.
4.2. Positioning for size in the intermediate matrices.
4.3. Statement of the elimination method in terms of factoring A
into semidiagonal factors C, B  .
4.4. Replacement of C, B  by B, C, D.
4.5. Reconsideration of the decomposition theorem. The unique-
ness theorem.
CHAPTER V. Specialization to definite matrices.
5.1. Reasons for limiting the discussion to definite matrices A.
5.2. Properties of our algorithm (that is, of the elimination
method) for a symmetric matrix A. Need to consider posi-
tioning for size as well.
5.3. Properties of our algorithm for a definite matrix A.
5.4. Detailed matrix bound estimates, based on the results of the
preceding section.
CHAPTER VI. The pseudo-operational procedure.
6.1. Choice of the appropriate pseudo-procedures, by which the
true elimination will be imitated.
6.2. Properties of the pseudo-algorithm.
6.3. The approximate decomposition of A, based on the pseudo-
algorithm.
6.4. The inverting of B and the necessary scale factors.
6.5. Estimates connected with the inverse of B.

6.6. Continuation.
6.7. Continuation.
6.8. Continuation. The estimates connected with the inverse of A.
6.9. The general AI . Various estimates.
6.10. Continuation.
6.11. Continuation. The estimates connected with the inverse of
AI .
CHAPTER VII. Evaluation of the results.
7.1. Need for a concluding analysis and evaluation.
7.2. Restatement of the conditions affecting A and AI : (A) − (D).
7.3. Discussion of (A), (B): Scaling of A and AI .
7.4. Discussion of (C): Approximate inverse, approximate singu-
larity.
7.5. Discussion of (D): Approximate definiteness.
7.6. Restatement of the computational prescriptions. Digital char-
acter of all numbers that have to be formed.
7.7. Number of arithmetical operations involved.
7.8. Numerical estimates of precision.

1.1 Sources of Arithmetical Error

Errors using inadequate data are much less than those us-
ing no data at all.
— Babbage, Charles (1792–1871)
On two occasions I have been asked [by members of Par-
liament], ‘Pray, Mr. Babbage, if you put into the machine
wrong figures, will the right answers come out?’ I am not
able rightly to apprehend the kind of confusion of ideas
that could provoke such a question.
— Babbage, Charles (1792–1871)

Numerical linear algebra is strongly influenced by the experience of solv-


ing a linear system by Gaussian elimination and getting an answer that is
absurd. One early description was in von Neumann and Goldstine [von
Neumann and Goldstine (1947)]. They gave 4 sources of errors of types
A,B,C,D and a model for computer arithmetic that could be used to track
the sources and propagation of roundoff error, the error of type D. In order
to understand this type of error, it is necessary to have some understanding
of how numbers are represented in computers and the fact that computer
arithmetic is only a close approximation to exact arithmetic. Integers, for
example, are typically represented in a computer in binary form, with a
finite number of binary digits (bits), most commonly 32 or 64 bits, with one
bit reserved for the sign of the integer. Exceeding the maximum number of
digits can result in anomalies such as the sum of two large positive integers
being a negative integer.

Table 1.1  Common precisions for real numbers.

  Common name            Bits   Decimal digits   Max exponent
  Single precision        32        8                  38
  Double precision        64        ≈16               308
  Quadruple precision    128        ≈34              4931
Real numbers are typically stored in computers in essentially scientific
notation, base 2. As with integers, real numbers are limited in precision
by the necessity of storing them with a limited number of bits. Typical
precisions are listed in Table 1.1. In Fortran, single precision numbers are
called “real”, double precision numbers are called “double precision”, and
quadruple and other precisions are specified without special names. In C
and related languages, single precision numbers are called “float”, double
precision numbers are called “double”, and quadruple precision numbers
are called “long double”. In Matlab, numbers are double precision by
default; other precisions are also available when required.
Machine epsilon. The finite precision of computer numbers means
that almost all computer operations with numbers introduce additional
numerical errors. For example, there are numbers that are so small that
adding them to the number 1.0 will not change its value! The largest of
these is often called "machine epsilon" ε and satisfies the property that
1 + ε = 1
in computer arithmetic.1 This error and other consequences of the finite
length of computer numbers are called “roundoff errors ”. Generation and
propagation of these roundoff errors contains some unpleasant surprises.
Everyone who writes computer programs should be aware of the possibilities
of roundoff errors as well as other numerical errors.
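The following is a minimal Matlab sketch (not from the original text) that illustrates this property. Matlab's built-in eps is one common definition of machine epsilon, namely the gap between 1.0 and the next larger double precision number.

    % Machine epsilon in Matlab double precision arithmetic.
    disp(eps)               % about 2.2204e-16
    disp(1 + eps == 1)      % 0 (false): adding eps does change 1
    disp(1 + eps/2 == 1)    % 1 (true): eps/2 is absorbed; 1 + eps/2 rounds back to 1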
Common sources of numerical errors. The following five types of
error are among the most common sources of numerical errors in computer
programs; a short Matlab illustration of several of them follows the list.
1 The precise definition of machine epsilon varies slightly among sources. Some include
a factor of 2 so machine epsilon represents the smallest number that changes 1 when
added to it instead of the largest which doesn’t change the value of 1.

(1) Input errors.


One of the most common sources of errors is input errors. Typically,
you are faced with a program written using double precision numbers
but, no matter how hard you try to increase accuracy, only 2 significant
digits of accuracy come out. In this case, one likely culprit is an error
early in the program where the various constants are defined.

Example 1.1. Somewhere you might find a statement like:


pi = 3.1416
WRONG!
pi = 22.0/7.0
To preserve the program’s accuracy π must be input to the full sixteen
digit accuracy2 of a double precision number.
pi = 3.1415926535897932
A sneaky way around this is:
pi = 4.0 ∗ atan(1.0)

(2) Mixed mode arithmetic.


It is generally true that computer arithmetic between two integers will
yield an integer, computer arithmetic between single precision numbers
will yield a single precision number, etc. This convention gives rise to
the surprising result that 5/2 = 2 in computer integer arithmetic! It also
introduces the question of how to interpret an expression containing two
or more different types of numbers (called “mixed-mode” arithmetic).

Example 1.2. Suppose that the variable X is a single precision variable


and the variable Y is a double precision variable. Forming the sum (X +
Y) requires first a “promotion” of X temporarily to a double precision
value and then adding this value to Y. The promotion does not really
add precision because the additional decimal places are arbitrary. For
example, the single precision value 3.1415927 might be promoted to
the double precision value 3.1415927000000000. Care must be taken
when writing programs using mixed mode arithmetic.

Another error can arise when performing integer arithmetic and, espe-
cially, when mixing integer and real arithmetic. The following Fortran
example program seems to be intended to print the value 0.5, but it
will print the value 0.0 instead. Analogous programs written in C,
C++ and Java would behave in the same way. An analogous Matlab
program will print the value 0.5.

2 In C, numbers are assumed to be double precision, but in Fortran, numbers must
have their precision specified in order to be sure of their precision.
As an example of mixed mode arithmetic, consider this Fortran pro-
gram.

Example 1.3.
integer j,k
real x
j=1
k=2
x=j/k
print *,x
end
This program will first perform the quotient 1/2, which is chopped to
zero because integer division results in an integer. Then it will set x=0,
so it will print the value 0.0 even though the programmer probably
expected it to print the value 0.5.

A good way to cause this example program to print the value 0.5 would
be to replace the line x=j/k with the line x=real(j)/real(k) to con-
vert the integers to single precision values before performing the divi-
sion. Analogous programs written in C, C++ and Java can be modified
in an analogous way.
(3) Subtracting nearly equal numbers.
This is a frequent cause of roundoff error since subtraction causes a loss
of significant digits. This source arises in many applications, such as
numerical differentiation.

Example 1.4. For example, in a 4-digit mantissa base 10 computer,
suppose we do
\[ 1.234 \times 10^{1} - 1.233 \times 10^{1} = 1.000 \times 10^{-2}. \]
We go from four significant digits to one. Suppose that the first term
is replaced with 1.235 × 10^1, a difference of approximately 0.1%. This
gives
\[ 1.235 \times 10^{1} - 1.233 \times 10^{1} = 2.000 \times 10^{-2}. \]
Thus, a 0.1% error in 1.234 × 10^1 can become a 100% error in the
answer!

(4) Adding a large number to a small one.


This causes the effect of the small number to be completely lost. This
can have profound effects when summing a series or applying a method
like the trapezoid rule to evaluate an integral numerically.

Example 1.5. For example, suppose that in our 4-digit computer we
perform
\[ X = 0.1234 \times 10^{3} + 0.1200 \times 10^{-2}. \]
This is done by making the exponents alike and adding the mantissas:
\[ 0.1234 \times 10^{3} \;+\; 0.0000\,|\,01200 \times 10^{3} \;=\; 0.1234 \times 10^{3}, \]
where the second addend is chopped (or rounded) to 4 digits at the indicated bar.

Thus the effect of the small addend is lost on the calculated value of
the sum.
(5) Dividing by a small number.
This has the effect of magnifying errors: a small percent error can
become a large percent error when divided by a small number.

Example 1.6. Suppose we compute, using four significant digits, the
following:
\[ x = A - B/C, \]
where
\[ A = 0.1120 \times 10^{9}, \quad B = 0.1000 \times 10^{6}, \quad C = 0.9000 \times 10^{-3}. \]
We obtain B/C = 0.1111 × 10^9 and x = 0.9000 × 10^6.
Suppose instead that there is a 0.01% error in calculating C, namely
\[ C = 0.9001 \times 10^{-3}. \]
Then we calculate instead
\[ B/C = 0.1110 \times 10^{9} \quad\text{so}\quad x = 0.1000 \times 10^{7}. \]
Thus we have an 11% error in the result!

Testing before division. When writing a computer program, in cases


where a denominator value can possibly be unrealistically smaller than
the numerator, it should be tested before doing the division. For ex-
ample, by choosing a tiny value appropriate to the quotient at hand,
possibly a small multiple of machine epsilon, and testing in the follow-
ing manner:

    tiny = 10*eps;     % eps is Matlab's machine epsilon
    if abs(denominator) < tiny*abs(numerator)
        error('Division by zero.')
    else
        x = numerator/denominator;
    end
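The effects described in items (2)-(4) above are easy to reproduce. The following minimal Matlab sketch (ours, not from the text) shows them in double precision arithmetic, with the expected output noted in the comments.

    % A brief illustration of error sources (2)-(4) above in Matlab doubles.

    % (2) Division of "integers": Matlab variables are doubles by default,
    %     so j/k gives 0.5 here, unlike the Fortran integer division above.
    j = 1;  k = 2;
    disp(j/k)                 % 0.5

    % (3) Subtracting nearly equal numbers: catastrophic cancellation.
    disp((1 + 1e-15) - 1)     % prints about 1.1102e-15, not 1e-15

    % (4) Adding a large number to a small one: the small addend is absorbed.
    disp(1e16 + 1 == 1e16)    % 1 (true): the added 1 is lost entirely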

1.2 Measuring Errors: The Trademarked Quantities

Mathematics is not a careful march down a well-cleared high-


way, but a journey into a strange wilderness, where the explorers
often get lost. Rigor should be a signal to the historian that the
maps have been made, and the real explorers have gone elsewhere.
— Anglin, W.S. in: “Mathematics and History”, Mathematical
Intelligencer, v. 4, no. 4.

Since every arithmetic operation induces roundoff error it is useful to


come to grips with it on a quantitative basis. Suppose a quantity is cal-
culated by some approximate process. The result, xcomputed , is seldom the
exact or true result, xtrue . Thus, we measure errors by the following con-
venient standards. These are “trademarked” terms.

Definition 1.1. Let || · || denote a norm. Then
\[
\begin{aligned}
\vec{e} = \text{the error} &:= \vec{x}_{\rm TRUE} - \vec{x}_{\rm COMPUTED},\\
e_{\rm ABSOLUTE} &:= \|\vec{x}_{\rm TRUE} - \vec{x}_{\rm COMPUTED}\|,\\
e_{\rm RELATIVE} &:= \|\vec{x}_{\rm TRUE} - \vec{x}_{\rm COMPUTED}\| \,/\, \|\vec{x}_{\rm TRUE}\|,\\
e_{\rm PERCENT} &:= e_{\rm RELATIVE} \cdot 100.
\end{aligned}
\]

We generally have:

• Error: essential but unknowable. Indeed, if we know the error and the
approximate value, adding them gives the true value. If the true value
were knowable, then we wouldn't be approximating to start with. If x is


a vector with 100,000 components then the error has 100,000 numbers
and is also thus beyond understanding in most cases.
• Absolute error: This replaces many numbers with one number: the
magnitude of the error vector. If the absolute error is reduced from
1.0 to .001, then we know for sure that the approximation is improved.
This is why we mostly look at error magnitudes and not errors.
• Relative error: An absolute error of 0.2 might be very bad or very good
depending on how big the true solution is. The relative error calibrates
the error against the true solution. If the relative error is 10^{-5} then
the approximation has 5 significant digits of accuracy.
• Percent error: This gives another way to think of relative errors for
those comfortable with percentages.
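As a minimal Matlab sketch (ours, with made-up vectors xtrue and xcomputed used only for illustration), the quantities of Definition 1.1 take one line each using the built-in norm function:

    % Hypothetical data for illustration only.
    xtrue     = [1; 2; 3];
    xcomputed = [1.01; 1.98; 3.02];

    e          = xtrue - xcomputed;         % error vector
    e_absolute = norm(xtrue - xcomputed);   % absolute error
    e_relative = e_absolute/norm(xtrue);    % relative error
    e_percent  = 100*e_relative;            % percent error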

Of course, we seldom know the true solution so it is useful to get a


“ballpark” estimate of error sizes. Here are some universally standard ways
to estimate roundoff errors:

(1) (Experimental roundoff errors test) repeat the calculation in


higher precision. The digit where the two results differ represents the
place where roundoff error has influenced the lower precision calcula-
tion. This also gives an estimate of how many digits are lost in the
lower precision calculation. From that one estimates how many are
lost in higher precision and thus how many to believe are correct.
(2) (Estimating model errors in the arithmetic model) Solve the
problem at hand twice: once with a given model and a second time with a
more "refined" or accurate arithmetic model. The difference between
the two can be taken as a ballpark measure for the error in the less
accurate discrete model.
(3) (Interval Arithmetic for estimating roundoff and other errors)
As a calculation proceeds, we track not only the arithmetic result but
also a "confidence interval", predicted via a worst case type of cal-
culation at every step. Unfortunately, for long calculations, interval
arithmetic often gives a worst case confidence interval so wide that it
is not very useful.
(4) (Significant Digit Arithmetic) Similarly to Interval Arithmetic, the
number of significant digits is tracked through each computation.
(5) (Backward error analysis for studying sensitivity of problem
to roundoff error) For many types of computations, it has been shown
rigorously that “the solution computed using finite precision is precisely


the exact solution in exact arithmetic to a perturbation of the original
problem”. Thus the sensitivity of a calculation to roundoff error can
be examined by studying the sensitivity of the continuous problem to
perturbations in its data.

Exercise 1.1. What are the 5 main causes of serious roundoff error? Give
an example of each.

Exercise 1.2. Consider approximating the derivative f'(a) by
\[ f'(a) \approx \frac{f(a+h) - f(a)}{h} \]
for h small. How can this introduce serious roundoff error?


Chapter 2

Linear Systems and Finite Precision Arithmetic

“Can you do addition?” the White Queen asked. “What’s one


and one and one and one and one and one and one and one and
one and one?” “I don’t know,” said Alice. “I lost count.”
— Lewis Carroll, Through the Looking Glass.

2.1 Vectors and Matrices

So as they eddied past on the whirling tides,


I raised my voice: “O souls that wearily rove,
Come to us, speak to us — if it be not denied.”
Dante Alighieri, L’Inferno, Canto V c. 1300,
(translation of Dorothy L. Sayers).

A vector is an ordered collection of n real numbers, an n-tuple:
\[ \vec{x} = (x_1, x_2, \ldots, x_n)^t = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}. \]

Vectors are often denoted by an over arrow, by being written in bold or


(most commonly herein) understood from the context in which the vector
is used. A matrix is a rectangular array of real numbers
\[ A_{m\times n} = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{bmatrix}. \]


The transpose of a matrix, denoted At is an n × m matrix with the rows


and columns of A interchanged
\[ (A^t)_{n\times m} = \begin{bmatrix} a_{11} & \ldots & a_{m1} \\ a_{12} & \ldots & a_{m2} \\ \vdots & \ddots & \vdots \\ a_{1n} & \ldots & a_{mn} \end{bmatrix}. \]
In other words, if a matrix A_{m×n} = (a_{ij}), i = 1,...,m, j = 1,...,n, then its transpose
is (A^t)_{n×m} = (a_{ji}), j = 1,...,n, i = 1,...,m. For example,
\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}^t = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}. \]
Vector operations of scalar multiplication and vector addition are de-
fined so that vector addition is equivalent to forming the resultant of the
two (force) vectors by the parallelogram rule. Thus, if the vectors x, y rep-
resent forces, the sum x + y is defined so that x + y is the resultant force
of x and y. Conveniently, it means componentwise addition.
Definition 2.1. If α ∈ R and x, y are vectors,
\[ \alpha\vec{x} = (\alpha x_1, \alpha x_2, \ldots, \alpha x_n)^t, \qquad \vec{x} + \vec{y} = (x_1 + y_1, x_2 + y_2, \ldots, x_n + y_n)^t. \]

Vector addition and scalar multiplication share many of the usual prop-
erties of addition and multiplication of real numbers. One of the most
important vector operations is the dot product or the scalar product of two
vectors.
Definition 2.2. Given vectors x, y, the dot product or scalar product is
the real number
\[ \vec{x}\cdot\vec{y} \;=\; \langle\vec{x},\vec{y}\rangle \;=\; (\vec{x},\vec{y}) \;:=\; x_1 y_1 + x_2 y_2 + \ldots + x_n y_n \]
(the three notations are used interchangeably), and the usual (euclidean) length of the vector x is
\[ \|x\|_2 = \sqrt{x\cdot x} = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}. \]

With the definition of matrix multiplication (below) the dot product


can also be written x · y = xt y. Recall that the dot product is related to
the angle1 θ between two vectors by the formula
\[ \cos\theta = \frac{\langle x, y\rangle}{\|x\|_2\,\|y\|_2}. \]
Actually, this formula shows that as long as any two quantities of the three
(θ, the dot product ⟨·, ·⟩ and the length || · ||_2) are defined, the third is
completely determined by the formula. Thus, existence of a dot product is
equivalent to being able to define angles between vectors.
If a linear system is to be equivalent to writing Matrix A times vector x
= vector b, then there is only one consistent way to define the matrix-vector
product. Matrix vector products are row × column. This means that the ith
component of Ax is equal to (row i of A) dot product (the vector x). Matrix
matrix multiplication is a direct extension of matrix-vector multiplication.

Definition 2.3. If A_{m×n} is a matrix and x_{n×1} is an n-vector, the product
Ax is an m-vector given by
\[ (Ax)_i := \sum_{j=1}^{n} a_{ij} x_j . \]
If y is an m × 1 vector we can multiply y^t A to obtain the transpose
of an n-vector given by
\[ (y^t A)_j = \sum_{i=1}^{m} a_{ij} y_i . \]

Matrix multiplication is possible for matrices of compatible sizes. Thus


we can multiply AB if the number of columns of A equals the number of
rows of B:
\[ A_{m\times n} B_{n\times p} = C_{m\times p} \]
and, in this case,
\[ (AB)_{ij} := \sum_{\ell=1}^{n} A_{i\ell} B_{\ell j}, \qquad i = 1, \ldots, m, \quad j = 1, \ldots, p. \]
In words this is:

The i,j entry in AB is the dot product: (The ith row vector in A)· (the jth
column vector in B).
1 The same formula is also interpreted as the correlation between x and y, depending

on intended application.
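In Matlab, the language on which this book's pseudocode is based, these products are built in. The short sketch below (with arbitrarily chosen entries) checks the row-times-column rule for one entry of a matrix-vector product and of a matrix-matrix product.

    % Small Matlab check of the row-times-column rule (entries chosen arbitrarily).
    A = [1 2 3; 4 5 6];           % 2 x 3
    B = [1 0; 2 1; 0 3];          % 3 x 2
    x = [1; -1; 2];               % 3 x 1

    disp(A*x)                     % matrix-vector product, a 2-vector
    disp(A(2,:)*x)                % (row 2 of A) dot x = the 2nd entry of A*x
    C = A*B;                      % 2 x 2 product
    disp(C(1,2) - A(1,:)*B(:,2))  % 0: the (1,2) entry is (row 1 of A) dot (column 2 of B)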

For example, a pair of linear systems can be combined into a single
system:
\[ A_{N\times N}\, x = b: \qquad \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix} \]
and
\[ A_{N\times N}\, y = c: \qquad \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} c_1 \\ \vdots \\ c_n \end{bmatrix} \]
can be combined into the single, block system
\[ \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 & y_1 \\ \vdots & \vdots \\ x_n & y_n \end{bmatrix} = \begin{bmatrix} b_1 & c_1 \\ \vdots & \vdots \\ b_n & c_n \end{bmatrix}. \]
Often this is written as
\[ AX = B \quad\text{where } X := [x\,|\,y]_{n\times 2}, \; B = [b\,|\,c]_{n\times 2}. \]
In sharp contrast with multiplication of real numbers, multiplication of
a pair of N × N matrices is generally not commutative!

Exercise 2.1.

a. Pick two 2 × 2 matrices A, B by filling in the digits of your phone


number. Do the resulting matrices commute? Test if the matrices
commute with their own transposes.
b. [a more advanced exercise] Do a literature search for conditions under
which two N × N matrices commute. If the entries in the matrices are
chosen at random, what is the probability they commute? This can be
calculated for the 2 × 2 case directly.

Exercise 2.2. Let x(t), y(t) be N vectors that depend smoothly on t.


g(t) := x(t) · y(t) is a differentiable function : R → R. By using the

definition of derivative and dot product, prove the following version of the product
rule of differentiation:
\[ g'(t) = x'(t)\cdot y(t) + x(t)\cdot y'(t). \]

Exercise 2.3. Pick two (nonzero) 3-vectors and calculate xt y and xy t .


Notice that the first is a number while the second is a 3 × 3 matrix. Show
that the dimension of the range of that matrix is, aside from special cases
where the range is just the zero vector, 1.

Exercise 2.4. Find two 2×2 matrices A and B so that AB = 0 but neither
A = 0 nor B = 0.

Exercise 2.5. Let x(t), y(t) be N vectors that depend smoothly on t. For
A an N × N matrix, g(t) := x(t)^t A y(t) is a differentiable function : R → R.
Prove the following version of the product rule of differentiation
\[ g'(t) = x'(t)^t A\, y(t) + x(t)^t A\, y'(t). \]

2.2 Eigenvalues and Singular Values

“. . . treat Nature by the sphere, the cylinder and the cone . . . ”


— Cézanne, Paul (1839–1906)

One of the three fundamental problems of numerical linear algebra is to


find information about the eigenvalues of an N × N matrix A. There are
various cases depending on the structure of A (large and sparse vs. small
and dense, symmetric vs. non-symmetric) and the information sought (the
largest or dominant eigenvalue, the smallest eigenvalue vs. all the eigenval-
ues).

Definition 2.4 (eigenvalue and eigenvector). Let A be an N × N matrix.
The complex number λ is an eigenvalue of A if there is at least one
nonzero, possibly complex, vector φ ≠ 0 with
\[ A\vec{\varphi} = \lambda\vec{\varphi}. \]
φ is an eigenvector associated with the eigenvalue λ. The eigenspace of
λ is the set of all linear combinations of eigenvectors of that λ.

Calculating λ, φ by hand (for small matrices typically) is a three step


process which is simple in theory but seldom practicable.



Finding λ, φ for an N × N real matrix A by hand:

• Step 1: Calculate exactly the characteristic polynomial of A. p(λ) :=


det(A − λI) is a polynomial of degree N with real coefficients.
• Step 2: Find the N (counting multiplicities) real or complex roots of
p(λ) = 0. These are the eigenvalues
λ1 , λ2 , λ3 , · · ·, λN
• Step 3: For each eigenvalue λi , use Gaussian elimination to find a
non-zero solution of


\[ [A - \lambda_i I]\,\vec{\varphi}_i = 0, \qquad i = 1, 2, \cdots, N \]

Example 2.1. Find the eigenvalues and eigenvectors of the 2 × 2 matrix
\[ A = \begin{bmatrix} 1 & 1 \\ 4 & 1 \end{bmatrix}. \]
We calculate the degree 2 polynomial
\[ p_2(\lambda) = \det(A - \lambda I) = \det\begin{bmatrix} 1-\lambda & 1 \\ 4 & 1-\lambda \end{bmatrix} = (1-\lambda)^2 - 4. \]
Solving p_2(λ) = 0 gives
\[ (1-\lambda)^2 - 4 = 0, \qquad \lambda_1 = 3, \quad \lambda_2 = -1. \]
The eigenvector φ1 of λ1 = 3 is found by solving
\[ (A - \lambda I)\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad \begin{bmatrix} -2 & 1 \\ 4 & -2 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
Solving gives
\[ y = t, \quad -2x + y = 0, \quad\text{or}\quad x = \tfrac{1}{2}t, \ \text{for any } t \in \mathbb{R}. \]
Thus, (x, y)^t = (t/2, t)^t for any t ≠ 0 is an eigenvector. For example, t = 2
gives
\[ \text{eigenvalue: } \lambda_1 = +3, \qquad \text{eigenvector: } \vec{\varphi}_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \qquad \text{eigenspace: } \left\{ t \begin{bmatrix} 1 \\ 2 \end{bmatrix} : -\infty < t < \infty \right\}. \]
Similarly, we solve for φ2
\[ (A - \lambda I)\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 4 & 2 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
or (x, y)^t = (−t/2, t)^t. Picking t = 2 gives
\[ \text{eigenvalue: } \lambda_2 = -1, \qquad \text{eigenvector: } \vec{\varphi}_2 = \begin{bmatrix} -1 \\ 2 \end{bmatrix}, \qquad \text{eigenspace: } \left\{ t \begin{bmatrix} -1 \\ 2 \end{bmatrix} : -\infty < t < \infty \right\}. \]
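The hand computation in Example 2.1 can be checked numerically. A small sketch (ours) using Matlab's built-in eig; the columns of V are eigenvectors scaled to unit length, and D carries the eigenvalues on its diagonal.

    % Numerical check of Example 2.1.
    A = [1 1; 4 1];
    [V, D] = eig(A);
    disp(diag(D))      % the eigenvalues 3 and -1, in some order
    disp(V)            % each column is a unit-length multiple of (1,2)^t or (-1,2)^t
    disp(A*V - V*D)    % zero up to roundoff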
It is sometimes true that there are not N independent eigenvectors.

Example 2.2. Find the eigenvalues and eigenvectors of the 2 × 2 matrix2
\[ A = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}. \]
The characteristic polynomial is given by
\[ p(\lambda) = (2 - \lambda)^2 \]
and there is a single root λ = 2 of multiplicity 2. To find one eigenvector
φ1, solve the system
\[ (A - \lambda I)\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
All solutions of this system of equations satisfy y = 0 with x arbitrary.
Hence an eigenvector is given by
\[ \vec{\varphi}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}. \]
A second eigenvector, φ2, would satisfy the same system, so there is no
linearly independent second eigenvector!
2 This matrix is easily recognized as a Jordan block.

Example 2.3. Let
\[ A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}. \]
We calculate as above and find
\[ \det[A - \lambda I] = \lambda^2 + 1 = 0, \qquad \lambda_1 = i, \quad \lambda_2 = -i. \]
The eigenvector of λ = +i is calculated by Gaussian elimination to be
\[ \lambda = +i, \qquad \vec{\varphi} = (-i, 1)^T. \]

Exercise 2.6. Find the eigenvector of λ = −i.

Exercise 2.7. Find the eigenvalues λ(ε) of
\[ A = \begin{bmatrix} +\varepsilon & 1 \\ -1 & -\varepsilon \end{bmatrix}. \]

2.2.1 Properties of eigenvalues

Eigenvalues and eigenvectors are mathematically interesting and important


because they give geometric facts about the matrix A. Two of these facts
are given in the following theorem.

Theorem 2.1. (i) Let A be an N × N matrix. If x is any vector in the


eigenspace of the eigenvalue λ then Ax is just multiplication of x by λ:
Ax = λx.
(ii) A is invertible if and only if no eigenvalue of A is zero.

It is much harder to connect properties of eigenvalues to values of specific


entries in A. In particular, the eigenvalues of A are complicated nonlinear
functions of the entries in A. Thus, the eigenvalues of A + B can have no
general correlation with those of A and B. In particular, eigenvalues are
not additive: generally λ(A + B) ≠ λ(A) + λ(B).
Another geometric fact is given in the following exercise.

Exercise 2.8. Given two commuting matrices A and B, so that AB = BA,


show that if x is an eigenvector of A then it is also an eigenvector of B, but
with a possibly different eigenvalue.

Proposition 2.1 (Eigenvalues of triangular matrices). If A is diag-


onal, upper triangular or lower triangular, then the eigenvalues are on the
diagonal of A.

Proof. Let A be upper triangular. Then, using ∗ to denote a generic non-zero entry,
\[ \det[A - \lambda I] = \det \begin{bmatrix} a_{11}-\lambda & * & * & * \\ 0 & a_{22}-\lambda & * & * \\ 0 & 0 & \ddots & * \\ 0 & 0 & 0 & a_{nn}-\lambda \end{bmatrix} \]
(expand down column 1 and repeat)
\[ = (a_{11}-\lambda)(a_{22}-\lambda)\cdot\ldots\cdot(a_{nn}-\lambda) = p_n(\lambda). \]
The roots of p_n are obviously a_{ii}.
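A quick numerical illustration of Proposition 2.1 in Matlab, with an arbitrarily chosen upper triangular matrix:

    % The eigenvalues of a triangular matrix are its diagonal entries.
    A = [2 5 -1; 0 -3 4; 0 0 7];   % upper triangular, entries chosen arbitrarily
    disp(sort(eig(A)))             % -3, 2, 7
    disp(sort(diag(A)))            % the same numbers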

When the matrix A is symmetric, its eigenvalues and eigenvectors have


special, and very useful, properties.

Proposition 2.2 (Eigenvalues of symmetric matrices). If A is sym-


metric (and real) (A = At ), then:
(i) all the eigenvalues and eigenvectors are real.

− →

(ii) there exists N orthonormal3 eigenvectors φ 1 , . . . , φ N of A:


− → − 1, if i = j,
 φ i, φ j  =
0, if i = j.


(iii) if C is the N × N matrix with eigenvector φ j in the j th column
then
⎡ ⎤
λ1 0 . . . 0
⎢ 0 λ2 . . . 0 ⎥
⎢ ⎥
C −1 = C t and C −1 AC = ⎢ . . . . ⎥.
⎣ .. .. . . .. ⎦
0 0 . . . λN
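These properties are easy to observe numerically; a minimal Matlab sketch with an arbitrarily chosen symmetric matrix:

    % For a real symmetric matrix, eig returns real eigenvalues and
    % orthonormal eigenvectors (the columns of C below).
    A = [2 1 0; 1 3 1; 0 1 2];     % symmetric, entries chosen arbitrarily
    [C, D] = eig(A);
    disp(norm(C'*C - eye(3)))      % near 0: the eigenvectors are orthonormal
    disp(norm(C'*A*C - D))         % near 0: C^(-1) A C = C' A C is diagonal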

In the case that A is not symmetric, the eigenvalues and eigenvectors


might not be real. In addition, there might be fewer than N eigenvectors.
Example 2.4. The matrix A below has eigenvalues given by λ1 = +i and
λ2 = −i:
\[ A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}. \]
3 “Orthonormal” means that the vectors are orthogonal (mutually perpendicular so their

dot products give zero) and normal (their lengths are normalized to be one).

For some calculations, the so called singular values of a matrix are of


greater importance than its eigenvalues.

Definition 2.5 (Singular values). The singular values of a real N × N
matrix A are √λ(AᵗA).

The square root causes no problem in the definition of singular values.
The matrix AᵗA is symmetric so its eigenvalues are real. Further, they are
also nonnegative since AᵗAφ = λφ, both λ, φ are real and thus
⟨AᵗAφ, φ⟩ = λ⟨φ, φ⟩, so
\[ \lambda = \frac{\langle A^t A\varphi, \varphi\rangle}{\langle\varphi,\varphi\rangle} = \frac{\langle A\varphi, A\varphi\rangle}{\langle\varphi,\varphi\rangle} = \frac{|A\varphi|_2^2}{|\varphi|_2^2} \ \ge\ 0. \]
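A numerical check of Definition 2.5 using Matlab's built-in svd, with an arbitrarily chosen matrix:

    % The singular values of A are the square roots of the eigenvalues of A'*A.
    A = [1 2; 0 3];                          % entries chosen arbitrarily
    disp(svd(A))                             % singular values, in decreasing order
    disp(sqrt(sort(eig(A'*A), 'descend')))   % the same numbers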

Exercise 2.9. Prove that4
\[ \det \begin{bmatrix} a_{11} & * & * & * \\ 0 & a_{22} & * & * \\ 0 & 0 & \ddots & * \\ 0 & 0 & 0 & a_{nn} \end{bmatrix} = a_{11}\cdot a_{22}\cdot\ldots\cdot a_{nn}. \]

Exercise 2.10. Pick two (nonzero) 3-vectors and calculate the 3×3 matrix
xy t . Find its eigenvalues. You should get 0,0, and something nonzero.

Exercise 2.11. Let
\[ A = \begin{bmatrix} 1 & t \\ -t & 1 \end{bmatrix}. \]
Find its eigenvalues and eigenvectors explicitly as a function of t. Determine
if they are differentiable functions of t.

2.3 Error and Residual

“The errors of definitions multiply themselves according as the


reckoning proceeds; and lead men into absurdities, which at last
they see but cannot avoid, without reckoning anew from the begin-
ning.”
— Hobbes, Thomas, In J. R. Newman (ed.), The World of
Mathematics, New York: Simon and Schuster, 1956.
4 Here “∗” denotes a generic non-zero real number. This is a common way to represent

the non-zero entries in a matrix in cases where either their exact value does not affect
the result or where the non-zero pattern is the key issue.

Numerical linear algebra is concerned with solving the eigenvalue problem Aφ = λφ (considered in Chapter 9) and solving the linear system
Ax = b (which we begin considering now). Computer solutions for these
problems are always wrong because we cannot solve either exactly to infinite precision. For the linear system we thus produce an approximation,
x̂, to the exact solution, x = A⁻¹b. We are concerned, then, with "how
wrong" x̂ is. Two useful measures are:

Definition 2.6. Let Ax = b and let x̂ be any vector. The error (vector)
is e := x − x̂ and the residual (vector) is r := b − Ax̂.

Obviously, the error is zero if and only if the residual is also zero. Errors
and residuals have a geometric interpretation:

The size of the error is a measure of the distance between the
exact solution, x, and its approximation x̂.
The size of the residual is a measure of how close x̂ is to
satisfying the linear equations.

Example 2.5 (Error and residual for 2 × 2 systems). Consider the


2 × 2 linear system
x−y = 0
−0.8x + y = 1/2.
This system represents two lines in the plane, plotted in Figure 2.1, and
the solution of the system is the intersection of the two lines.
Consider the point P = (0.5, 0.7) which is on the plot as well. The size
of the error is the distance from P to the intersection of the two lines. The
error is thus relatively large in Figure 2.1. However, the size of the residual
is the distance of P to the two lines. For this example, the point P is close
to both lines so this is a case where the residual is expected to be smaller
than the error.

For general N ×N systems, the error is essential but, in a very real sense
unknowable. Indeed, if we knew the exact error then we could recover the
exact solution by x = x̂ + e. If we could find the exact solution, then
we wouldn’t be approximating it in the first place! The residual is easily
computable so it is observable. It also gives some indication about the
error as whenever r = 0, then necessarily e = 0. Thus much of numerical
linear algebra is about using the observable residual to infer the size of
the unknowable error. The connection between residual and error is given
in the following theorem, the Fundamental Equation of Numerical Linear
Algebra (FENLA).

Fig. 2.1 Lines L1, L2 and the point P.

Theorem 2.2 (FENLA). Given a square N × N linear system A_{N×N} x = b
and x̂, let e := x − x̂ and r := b − Ax̂ be the error and residual
respectively. Then
\[ Ae = r. \]

Proof. This is an identity so it is proven by expanding and rearranging:
\[ Ae = A(x - \hat{x}) = Ax - A\hat{x} = b - A\hat{x} = r. \]
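FENLA can be verified numerically in a few lines; a minimal Matlab sketch (ours) with arbitrarily chosen data:

    % Numerical check of FENLA: A*e equals the residual r.
    A    = [4 1; 1 3];
    b    = [1; 2];
    x    = A\b;             % "exact" solution (up to roundoff)
    xhat = [0.1; 0.6];      % some approximation, chosen arbitrarily
    e    = x - xhat;
    r    = b - A*xhat;
    disp(norm(A*e - r))     % zero up to roundoff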

In pursuit of error estimates from residuals, the most common vec-


tor and matrix operations include residual calculations, triad calculations,
quadratic form calculations, and norm calculations.

Residual Calculation. Given a square N × N linear system Ax = b and
a candidate for its solution, the N-vector x̂, compute the residual
\[ r := b - A\hat{x}. \]
Triad Calculation. Given n-vectors x, y and z, compute the vector
\[ \vec{x} + (\vec{x}\cdot\vec{y})\,\vec{z}. \]

Quadratic Form Calculation. Given a square N × N matrix A_{N×N} and
n-vectors x and y, compute the number
\[ \vec{y}\cdot(A\vec{x}) = \vec{y}^{\,tr} A\, \vec{x} = \sum_{i,j=1}^{n} y_i A_{ij} x_j . \]

The quadratic form reduces a lot of information (n² + 2n real numbers)
to one real number.
Norm Calculation. For an n-vector x compute norms (weighted averages) such as the RMS (root mean square) norm
\[ \|\vec{x}\|_{\rm RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} |x_i|^2 }. \]
Often, norms of residuals are computed:
\[ \text{Step 1: } r := b - A\hat{x}, \qquad \text{Step 2: } \|r\|_{\rm RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} |r_i|^2 }. \]

This last calculation is an example of a vector norm. In approximating


the solution (a vector) to a linear system, the error must be measured. The
error is typically measured by an appropriate norm (a generalization of the
idea of the length of a vector). Some typical vector norms are given in the
following definition.

Definition 2.7 (Three vector norms). Three commonly used vector
norms are given as follows.
\[ \|\vec{x}\|_1 := \sum_{i=1}^{n} |x_i|, \qquad \|\vec{x}\|_2 := \sqrt{\sum_{i=1}^{n} |x_i|^2}, \qquad \|\vec{x}\|_\infty := \max_{1\le i\le n} |x_i|. \]
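In Matlab all three norms are available through the built-in norm function; a minimal sketch with an arbitrarily chosen vector:

    % The three standard vector norms of Definition 2.7.
    x = [3; -4; 1];
    disp(norm(x, 1))      % 8: sum of absolute values
    disp(norm(x, 2))      % about 5.0990: Euclidean length
    disp(norm(x, inf))    % 4: largest absolute entry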

When solving problems with large numbers of unknowns (large n) it is


usually a good idea to scale the answers to be O(1). This can be done by
computing relative errors. It is sometimes5 done by scaling the norms so
the vector of all 1's has norm equal to 1, as follows:
\[ \|\vec{x}\|_{\rm average} := \frac{1}{n}\sum_{i=1}^{n} |x_i|, \qquad\text{and}\qquad \|\vec{x}\|_{\rm RMS} := \sqrt{\frac{1}{n}\sum_{i=1}^{n} |x_i|^2 }. \]

5 Computer languages such as Matlab have built-in functions to compute these norms.
These built-in functions do not compute the scaled form.

Exercise 2.12. Consider the 2 × 2 linear system with solution (1, 1)

1.01x + 0.99y = 2
0.99x + 1.01y = 2.

Let the approximate solution be (2, 0). Compute the following quantities.

(1) The error vector,


(2) The 2 norm of the error vector,
(3) The relative error (norm of the error vector divided by norm of the
exact solution),
(4) The residual vector, and,
(5) The 2 norm of the residual vector.

Exercise 2.13. Suppose you are given a matrix, A, a right hand side vec-
tor, b, and an approximate solution vector, x. Write a computer program
to compute each of the following quantities.

(1) The error vector,


(2) The 2 norm of the error vector,
(3) The relative error (norm of the error vector divided by norm of the
exact solution),
(4) The residual vector, and,
(5) The 2 norm of the residual vector.

Test your program with numbers from the previous exercise. Hint: If
you are using Matlab, the norm function can be used to compute the
unscaled quantity || · ||2 .

Exercise 2.14. Given a point (x_0, y_0) and two lines in the plane, calculate
the distance to the lines and relate it to the residual vector. Show that
\[ \|r\|_2^2 = (1 + m_1^2)\,d_1^2 + (1 + m_2^2)\,d_2^2, \]
where m_i, d_i are the slopes of and distances to the lines indicated.



2.4 When is a Linear System Solvable?

Mathematics is written for mathematicians.


— Copernicus, Nicholaus (1473–1543), “De Revolutionibus or-
bium coelestium”

“Of my 57 years I’ve applied at least 30 to forgetting most of


what I’ve learned or read, and since I’ve succeeded in this I have
acquired a certain ease and cheer which I should never again like to
be without. ... I have stored little in my memory but I can apply
that little and it is of good use in many and varied emergencies...”
— Emanuel Lasker

Much of the theory of linear algebra is dedicated to giving conditions


on the matrix A that can be used to test if an N × N linear system

Ax = b

has a unique solution for every right hand side b. The correct condition is
absolutely clear for 2 × 2 linear systems. Consider therefore a 2 × 2 linear
system

\[ \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}. \]

The reason to call the variables x and y (and not x1 , x2 ) is that the 2 × 2
case is equivalent to looking in the x − y plane for the intersection of the
two lines (and the “solution” is the x − y coordinates of the intersection
point of the 2 lines)

Line L1: a11 x + a12 y = b1


Line L2: a21 x + a22 y = b2 .

Plotting two lines in the x − y plane the three possible cases are shown in
Figure 2.2.

(a) If L1 and L2 are not parallel, then a unique solution exists for all RHS.
(b) If L1 is on top of L2, then an infinite number of solutions exist for that
particular RHS and no solution for any other RHS.
(c) If L1 is parallel to L2 and they are not the same line, then no solution
exists. Otherwise, there are an infinite number of solutions.
(a) Lines L1, L2 are not parallel: unique intersection.
(b) L1: x + y = 2, L2: 2x + 2y = 4 are on top of one another: infinite
number of solutions.
(c) Lines L1, L2 are parallel: no intersection.

Fig. 2.2 Three possibilities for two lines.

Unique solvability thus depends on the angle between the two lines: If
it is not 0 or 180 degrees a unique solution exists for every possible right
hand side.
For the general N × N linear system, the following is known.

Theorem 2.3 (Unique solvability of Ax = b). The N × N linear sys-


tem Ax = b has a unique solution x for every right hand side b if and only
if any of the following equivalent conditions holds.


1. [The null space of A is trivial] The only solution of Ax = 0 is


the zero vector, x = 0 .
2. [Uniqueness implies existence] Ax = b has at most one solution
for every RHS b.
3. [Existence implies uniqueness] Ax = b has at least one solution
for every RHS b.
4. [A restatement of trivial null space] The kernel or null space of


A is { 0 }:


N (A) := {x : Ax = 0} = { 0 }.

5. [A restatement of existence implies uniqueness] The range of


A is RN :

Range(A) := {y : y = Ax for some x} = RN .

6. [Nonzero determinant condition] The determinant of A satisfies

det(A) ≠ 0.

7. [Nonzero eigenvalue condition] No eigenvalue of A is equal to


zero:

λ ≠ 0 for all eigenvalues λ of A.

There are many more.

“The well does not walk to the thirsty man.”


Transuranian proverb (J. Burkardt)

Exercise 2.15. Consult reference sources in theoretical linear algebra


(books or online) and find 10 more unique solvability conditions.

2.5 When is an N×N Matrix Numerically Singular?

To your care and recommendation am I indebted for having


replaced a half-blind mathematician with a mathematician with
both eyes, which will especially please the anatomical members of
my Academy.
— Frederick the Great (1712–1786), [To D’Alembert about La-
grange. Euler had vacated the post.] In D. M. Burton, Elementary
Number Theory, Boston: Allyn and Bacon, Inc., 1976.

Many of the theoretical conditions for unique solvability are conditions


for which no numerical value can be assigned to see how close a system
might be to being singular. The search for a way to quantify how close to
singular a system might be has been an important part of numerical linear
algebra.

Example 2.6 (Determinant does not measure singularity).


Consider the two lines
−x + y = 1 and
−1.0000001x + y = 2.
Their slopes are m = 1 and m = 1.0000001. Thus the angle between them
is very small and the matrix below must be almost singular
[ −1          1 ]
[ −1.0000001  1 ] .
Many early researchers had conjectured that the determinant was a good
measure of this (for example, the above determinant is 0.0000001 which is
indeed small). However, multiplying the second equation through by 107
does not change the 2 lines, now written as
−x + y = 1 and
−10000001x + 10000000y = 20000000,
or (obviously) the angle between them. The new coefficient matrix is now
[ −1         1        ]
[ −10000001  10000000 ] .
The linear system is still as approximately singular as before but the new
determinant is now exactly 1. Thus:
How close det(A) is to zero is not a measure of how close a
matrix is to being singular.
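A quick numerical check of this example is a few lines in Matlab (a sketch; the variable names are ours). It confirms that rescaling a row changes the determinant wildly while the angle between the two lines, the quantity that actually measures near-singularity here, is unchanged:

    m1 = 1;  m2 = 1.0000001;              % slopes of the two lines
    atan(m2) - atan(m1)                   % angle between them: roughly 5e-8 radians
    A  = [-1 1; -1.0000001 1];            % original coefficient matrix
    As = [-1 1; -10000001 10000000];      % second row multiplied by 10^7, same two lines
    det(A)                                % about 1e-7
    det(As)                               % exactly 1, although the lines are unchanged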

Goldstine, von Neumann and Wilkinson found the correct path by look-
ing at 2×2 linear systems (we have been following their example). Consider
therefore a 2 × 2 linear system
[ a11  a12 ] [ x ]   [ b1 ]
[ a21  a22 ] [ y ] = [ b2 ]

Line L1: a11 x + a12 y = b1
Line L2: a21 x + a22 y = b2 .
Plotting two lines in the x − y plane, geometrically it is clear that the right
definition for 2 × 2 systems of almost singular or numerically singular is as
follows.

Definition 2.8. For the 2 × 2 linear system above, the matrix A is almost
or numerically singular if the angle between the lines L1 and L2 is almost
zero or zero to numerical precision.

Exercise 2.16.

(1) For ε > 0 a small number, consider the 2 × 2 system:


    x + y = 1,
    (1 − 2ε)x + y = 2,

    and let A = [ 1        1 ]
                [ 1 − 2ε   1 ] .
(2) Find the eigenvalues of the coefficient matrix A.
(3) Sketch the two lines in the x − y plane the system represents. On the
basis of your sketch, explain if A is ill conditioned and why.

In the following chapter, numerical methods for solving linear systems


are discussed and, along the way, the notion of numerical singularity will
be refined and methods to estimate numerical singularity of large systems
will be given.



Chapter 3

Gaussian Elimination

One of the main virtues of an electronic computer from the point


of view of the numerical analyst is its ability to “do arithmetic fast”.
— James Wilkinson, 1971.

Gaussian elimination is the basic algorithm of linear algebra and the


workhorse of computational mathematics. It is an algorithm for solving
exactly (in exact arithmetic) the N × N system:
A_{N×N} x_{N×1} = b_{N×1} , where det(A) ≠ 0.    (3.1)
It is typically used on all matrices with mostly non-zero entries (so called
dense matrices) and on moderate sized, for example N ≤ 10, 000,1 matrices
which have only a few non zero entries per row that occur in some regular
pattern (these are called banded and sparse matrices). Larger matrices,
especially ones without some regular pattern of non zero entries, are solved
using iterative methods.

3.1 Elimination + Backsubstitution

Luck favors the prepared mind.


— Louis Pasteur

The N × N system of equations Ax = b is equivalent to


a11 x1 + a12 x2 + . . . +a1N xN = b1 ,
a21 x1 + a22 x2 + . . . +a2N xN = b2 ,
.. (3.2)
.
aN 1 x1 + aN 2 x2 + . . . +aN N xN = bN .
1 The number N = 10, 000 dividing small from large is machine dependent and will

likely be incorrect (too small) by a year after these notes appear.


Gaussian elimination solves it in two phases: elimination followed by


backsubstitution.
The Elimination Step: The elimination step reduces the matrix A
to upper form by operations which do not alter the solution of (3.1), (3.2).
These “Elementary Row Operations”2 are:

(1) Multiply a row of (3.2) by a non-zero scalar.


(2) Add a multiple of one row of (3.2) to another.
(3) Interchange two rows of (3.2).

To show how these are used, consider (3.1):


[ a11  a12  ...  a1N ] [ x1 ]   [ b1 ]
[ a21  a22  ...  a2N ] [ x2 ]   [ b2 ]
[ ...  ...  ...  ... ] [ .. ] = [ .. ]
[ aN1  aN2  ...  aNN ] [ xN ]   [ bN ]
Gaussian elimination proceeds as follows.

Substep 1: Examine the entry a11 . If it is zero or too small, find another
matrix entry and interchange rows or columns to make this the entry
a11 . This process is called “pivoting” and a11 is termed the “pivot
entry”. Details of pivoting will be discussed in a later section, so for
now, just assume a11 is already suitably large.
With the pivot entry non-zero, add a multiple of row 1 to row 2 to
make a21 zero:
Compute: m21 := a21 / a11 ,
Then compute: Row 2 ⇐ Row 2 − m21 · Row 1.
This zeroes out the 2, 1 entry and gives
[ a11  a12              ...  a1N              ] [ x1 ]   [ b1           ]
[ 0    a22 − m21 a12    ...  a2N − m21 a1N    ] [ x2 ]   [ b2 − m21 b1  ]
[ a31  a32              ...  a3N              ] [ x3 ] = [ b3           ]
[ ...  ...              ...  ...              ] [ .. ]   [ ..           ]
[ aN1  aN2              ...  aNN              ] [ xN ]   [ bN           ]
Note that the 2, 1 entry (and all the entries in the second row and
second component of the RHS) are now replaced by new values. Often
2 It is known that operation 1 multiplies det(A) by the scalar, operation 2 does not

change the value of det(A) and operation 3 multiplies det(A) by −1.



the replacement is written with an arrow, such as a22 ⇐ a22 − m21 a12 ; often it is denoted by an equals sign. This is not a mathematical equality but an assignment, meaning "replace the LHS by the RHS", as in a22 = a22 − m21 a12 . Since this replacement of values is what is really done on the computer, we have the system
[ a11  a12  ...  a1N ] [ x1 ]   [ b1 ]
[ 0    a22  ...  a2N ] [ x2 ]   [ b2 ]
[ a31  a32  ...  a3N ] [ x3 ] = [ b3 ]
[ ...  ...  ...  ... ] [ .. ]   [ .. ]
[ aN1  aN2  ...  aNN ] [ xN ]   [ bN ]

where the second row now contains different numbers than before
step 1.
Substep 1 continued: Continue down the first column, zeroing out the
values below the diagonal (the pivot) in column 1:
Compute: m31 := a31 / a11 ,
Then compute: Row 3 ⇐ Row 3 − m31 · Row 1,
... ...
Compute: mN1 := aN1 / a11 ,
Then compute: Row N ⇐ Row N − mN1 · Row 1.

The linear system now has the structure:


[ a11  a12  ...  a1N ] [ x1 ]   [ b1 ]
[ 0    a22  ...  a2N ] [ x2 ]   [ b2 ]
[ ...  ...  ...  ... ] [ .. ] = [ .. ]
[ 0    aN2  ...  aNN ] [ xN ]   [ bN ]

Step 2: Examine the entry a22 . If it is zero or too small, find another
matrix entry below (and sometimes to the right of) a22 and interchange
rows (or columns) to make this entry a22 . Details of this pivoting
process will be discussed later.
With the pivot entry non zero, add a multiple of row 2 to row 3 to
make a32 zero:
Compute the 3, 2 multiplier: m32 := a32 / a22 ,
Then compute: Row 3 ⇐ Row 3 − m32 · Row 2.

Step 2 continued: Continue down column 2, zeroing out the values below
the diagonal (the pivot):
Compute: m42 := a42 / a22 ,
Then compute: Row 4 ⇐ Row 4 − m42 · Row 2,
... ...
Compute: mN2 := aN2 / a22 ,
Then compute: Row N ⇐ Row N − mN2 · Row 2.

The linear system now has the structure:


[ a11  a12  ...  a1N ] [ x1 ]   [ b1 ]
[ 0    a22  ...  a2N ] [ x2 ]   [ b2 ]
[ 0    0    ...  ... ] [ .. ] = [ .. ]
[ 0    0    ...  aNN ] [ xN ]   [ bN ]
Substeps 3 through N: Proceed as above for column 2, for each of
columns 3 through N. The diagonal entries a33 . . . aN N become pivots
(and must not be too small). When Gaussian elimination terminates,
the linear system has the structure (here depicted only for the case
N = 4, or 4 × 4 matrix):
[ a11  a12  a13  a14 ] [ x1 ]   [ b1 ]
[ 0    a22  a23  a24 ] [ x2 ]   [ b2 ]
[ 0    0    a33  a34 ] [ x3 ] = [ b3 ]
[ 0    0    0    a44 ] [ x4 ]   [ b4 ]
The Backsubstitution Step: We now have reduced the linear system to
an equivalent upper triangular system with the same solution. That
solution is now quickly found by back substitution as follows.
Substep 1: aN N xN = bN so xN = bN /aN N
Substep 2: xN −1 = (bN −1 − aN −1,N xN ) /aN −1,N −1
Substep 3: xN −2 = (bN −2 − aN −2,N −1 xN −1 − aN −2,N xN ) /aN −2,N −2
Substeps 4–(N – 1): Continue as above.
Substep N: x1 = (b1 − a12 x2 − a13 x3 − · · · − a1N xN ) /a11 .
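For example (our numbers, chosen only for illustration), consider the 2 × 2 system 2 x1 + x2 = 3, x1 + 3 x2 = 4. Elimination computes m21 = 1/2 and replaces Row 2 by Row 2 − (1/2) Row 1, giving 2.5 x2 = 2.5. Backsubstitution then gives x2 = 1 and x1 = (3 − 1 · x2)/2 = 1.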

3.2 Algorithms and Pseudocode

Careful analysis of algorithms requires some way to make them more pre-
cise. While the description of the Gaussian elimination algorithm provided

in the previous section is clear and complete, it does not provide a straight-
forward roadmap to writing a computer program. Neither does it make
certain aspects of the algorithm obvious: for example it is hard to see why
the algorithm requires O(N 3 ) time for an N × N matrix.
In contrast, a computer program would provide an explicit implementa-
tion of the algorithm, but it would also include details that add nothing to
understanding the algorithm itself. For example, the algorithm would not
change if the matrix were written using single precision or double precision
numbers, but the computer program would. Further, printed computer
code is notoriously difficult for readers to understand. What is needed
is some intermediate approach that marries the structural precision of a
computer program with human language descriptions and mathematical
notation.
This intermediate approach is termed “pseudocode”. A recent
Wikipedia article3 describes pseudocode in the following way.

“Pseudocode is a compact and informal high-level description of


a computer programming algorithm that uses the structural con-
ventions of a programming language, but is intended for human
reading rather than machine reading. Pseudocode typically omits
details that are not essential for human understanding of the al-
gorithm, such as variable declarations . . . . The programming lan-
guage is augmented with natural language descriptions of the de-
tails, where convenient, or with compact mathematical notation.
The purpose of using pseudocode is that it is easier for humans
to understand than conventional programming language code, and
that it is a compact and environment-independent description of
the key principles of an algorithm. It is commonly used in text-
books and scientific publications that are documenting various al-
gorithms, and also in planning of computer program development,
for sketching out the structure of the program before the actual
coding takes place.”

The term “pseudocode” does not refer to a specific set of rules for ex-
pressing and formatting algorithms. Indeed, the Wikipedia article goes
on to give examples of pseudocode based on the Fortran, Pascal, and C
3 From: Wikipedia contributors, “Pseudocode”, Wikipedia, The Free Encyclopedia.

http://en.wikipedia.org/w/index.php?title=Pseudocode&oldid=564706654 (accessed
July 18, 2013). This article cites: Justin Zobel (2004). “Algorithms” in Writing for
Computer Science (second edition). Springer. ISBN 1-85233-802-4.

computer languages. The goal of pseudocode is to provide a high-level


(meaning: understandable by a human) algorithm description with suffi-
cient detail to facilitate both analysis and conversion to a computer pro-
gram. A pseudocode description of an algorithm should:

• Expose the underlying algorithm;


• Hide unnecessary detail;
• Use programming constructs where appropriate, such as looping and
testing; and,
• Use natural and mathematical language where appropriate.

In this book, a pseudocode based on Matlab programming will be


used, and Matlab keywords and variables will be displayed in a special
font. In particular, a loop with index ranging from 1 through N will be
enclosed in the pair of statements for k=1:N and end, and a test will be
enclosed with the pair if ... and end. Subscripted variables are denoted
using parentheses, so that Aij would be denoted A(i,j). Although Mat-
lab statements without a trailing semicolon generally cause printing, the
trailing semicolon will be omitted here. If the pseudocode is used as a
template for Matlab code, this trailing semicolon should not be forgotten.

3.3 The Gaussian Elimination Algorithm

Algorithms are human artifacts. They belong to the world of


memory and meaning, desire and design.
— David Berlinski
“Go ahead and faith will come to you.”
— D’Alembert.

Notice that Gaussian elimination does not use the x values in computa-
tions in any way. They are only used in the final step of back substitution
to store the solution values. Thus we work with the augmented matrix:
an N × (N + 1) matrix with the RHS vector in the last column

W_{N×(N+1)} := [ a11  a12  ...  a1N  b1 ]
               [ a21  a22  ...  a2N  b2 ]
               [ ...  ...  ...  ...  .. ]
               [ aN1  aN2  ...  aNN  bN ] .
Further, its backsubstitution phase does not refer to any of the zeroed out
values in the matrix W . Because these are not referred to, their positions
can be used to store the multipliers mij .

Evaluation of determinants using Gaussian elimination. It is


known that the elementary row operation 1 multiplies det(A) by the scalar,
the elementary row operation 2 does not change the value of det(A) and
the elementary row operation 3 multiplies det(A) by −1. Based on this
observation, Gaussian elimination is a very efficient way to calculate deter-
minants. If Elimination is performed and the number of row interchanges
counted we then have (after W is reduced to upper triangular)
det(A) = (−1)^s w11 · w22 · ... · wNN ,
s = total number of swaps of rows and columns.
In contrast, evaluation of a determinant by cofactor expansions takes roughly N! floating point operations whereas doing it using Gaussian elimination only requires (2/3) N^3 .
We shall see that backsubstitution is much cheaper and faster to perform
than elimination. Because of this, the above combination of elimination to
upper triangular form followed by backsubstitution is much more efficient
than complete reduction of A to the identity (so called, Gauss-Jordan elim-
ination).
If the pivot entry at some step is zero we interchange the pivot row or
column with a row or column below or to the right of it so that the zero
structure created by previous steps is not disturbed.
Exploiting the above observations, and assuming that pivoting is not
necessary, Gaussian elimination can be written in the following algorithm.

Algorithm 3.1 (Gaussian elimination without pivoting).


Given a N × (N + 1) augmented matrix W ,
for i=1:N-1
Pivoting would go here if it were required.
for j=i+1:N
if Wi,i is too small
error(’divisor too small, cannot continue’)
end
m = Wji /Wii
for k=(i+1):(N+1)
Wjk = Wjk − mWik
end
end
end
if WN,N is too small

error(’singular!’)
end

Gaussian elimination has 3 nested loops. Inside each loop, roughly


N (and on average N/2) arithmetic operations are performed. Thus, it’s
pretty clear that about O(N 3 ) floating point operations are done inside
Gaussian elimination for an N × N matrix.
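Algorithm 3.1 translates almost line for line into runnable Matlab. The following is a minimal sketch for an N × (N + 1) augmented matrix W with no pivoting; the function name and the tolerance used to interpret "too small" are our own choices, not part of any standard library:

    function W = gauss_elim(W, N)
    % Elimination phase (Algorithm 3.1) on an N x (N+1) augmented matrix W.
    for i = 1:N-1
        for j = i+1:N
            if abs(W(i,i)) < 100*eps      % one crude interpretation of "too small"
                error('divisor too small, cannot continue')
            end
            m = W(j,i) / W(i,i);
            W(j,i+1:N+1) = W(j,i+1:N+1) - m*W(i,i+1:N+1);
            W(j,i) = 0;                   % this entry is never used again
        end
    end
    if abs(W(N,N)) < 100*eps
        error('singular!')
    end
    end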

Exercise 3.1. Consider two so-called “magic square” matrices.


A = [ 8  1  6 ]                 B = [ 16   2   3  13 ]
    [ 3  5  7 ]      and            [  5  11  10   8 ]
    [ 4  9  2 ]                     [  9   7   6  12 ]
                                    [  4  14  15   1 ] .

Each of the rows, columns and diagonals of A sum to the same values, and
similarly for B. Gaussian elimination is written above for an augmented
matrix W that is N × N + 1. Modify it so that it can be applied to a square
matrix. Then write a computer program to do Gaussian elimination on
square matrices, apply it to the matrices A and B, and use the resulting
reduced matrix to compute the determinants of A and B. (det(A) = −360
and det(B) = 0.)

The backsubstitution algorithm is below. Backsubstitution proceeds


from the last equation up to the first, and Matlab notation for this “re-
verse” looping is for i=(N-1): -1:1.

Algorithm 3.2 (Backsubstitution). Given an N -vector x for storing


solution values, perform the following:

x(N)=W(N,N+1)/W(N,N)
for i=(N-1):-1:1
Compute the sum s = sum_{j=i+1}^{N} W(i,j) x(j)
x(i)=(W(i,N+1)-s)/W(i,i)
end

The sum s = sum_{j=i+1}^{N} W(i,j) x(j) can be accumulated using a loop; a standard programming approach to computing a sum is the following algorithm.

Algorithm 3.3 (Accumulating the sum s = sum_{j=i+1}^{N} W(i,j) x(j)).

s=0
for j=(i+1):N
s=s+W(i,j)*x(j)
end

Thus the backsubstitution is given as:

Algorithm 3.4 (Backsubstitution-more detail).

x(N)=W(N,N+1)/W(N,N)
for i=(N-1):-1:1
Next, accumulate s = sum_{j=i+1}^{N} W(i,j) x(j)
s=0
for j=(i+1):N
s=s+W(i,j)*x(j)
end
x(i)=(W(i,N+1)-s)/W(i,i)
end

Backsubstitution has two nested loops. The innermost loop contains one add and one multiply, for two operations per pass, and there are roughly N (N − 1)/2 passes through this innermost loop. Thus, on the whole, O(N^2) floating point operations are done inside backsubstitution for an N × N matrix.
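In Matlab the accumulation loop of Algorithm 3.3 can be replaced by a single vector product; a sketch equivalent to Algorithm 3.4 (the function name is ours) is:

    function x = backsub(W, N)
    % Backsubstitution on the N x (N+1) augmented matrix W, assumed
    % already reduced to upper triangular form by elimination.
    x = zeros(N,1);
    x(N) = W(N,N+1) / W(N,N);
    for i = N-1:-1:1
        s = W(i,i+1:N) * x(i+1:N);       % the sum of W(i,j)*x(j) for j = i+1..N
        x(i) = ( W(i,N+1) - s ) / W(i,i);
    end
    end

Combined with the elimination sketch above, x = backsub(gauss_elim([A b], N), N) solves A x = b (with b a column vector) whenever no pivoting is needed.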

Exercise 3.2. Show that the complexity of computing det(A) for an N ×N


matrix by repeated expansion by cofactors is at least N !.

3.3.1 Computational Complexity and Gaussian


Elimination

“In mathematics, you don’t understand things. You just get


used to them”.
— J. von Neumann (1903–1957), quoted in: G. Zukov, The
dancing Wu Li masters, 1979.

Computers perform several types of operations:

• Additions of real numbers


• Subtraction of real numbers
• Multiplication of real numbers
• Division of real numbers
• Arithmetic of integers
• Tests (such as “Test if the number X > 0?”)
• Other logical tests
• Accessing memory to find a number to operate upon
• “Loops” meaning operations of the above type performed repeatedly until
some condition is met.

The cost (in time to execute) of each of these is highly computer de-
pendent. Traditionally arithmetic operations on real numbers have been
considered to take the most time. Memory access actually takes much
more time than arithmetic and there are elaborate programming strate-
gies to minimize the effect of memory access time. Since each arithmetic
operation generally requires some memory access, numerical analysts tra-
ditionally have rolled an average memory access time into the time for the
arithmetic for the purpose of estimating run time. Thus one way to estimate
run time is to count the number of floating point operations performed (or
even just the number of multiply’s and divides). This is commonly called
a “FLOP count” for FLoating point OPeration count. More elegantly it
is called “estimating computational complexity”. Counting floating point
operations gives

• Backsubstitution for an N × N linear system takes N (N − 1)/2 multiplies and N (N − 1)/2 adds. This is often summarized as N^2 FLOPS, dropping the lower order terms.
• Gaussian elimination for an N × N linear system takes (N^3 − N )/3 multiplies, (N^3 − N )/3 adds and N (N − 1)/2 divides. This is often summarized as (2/3) N^3 FLOPS, dropping the lower order terms.

As an example, for a 1000 × 1000 linear system, Gaussian elimination takes about 1000 times as long as backsubstitution. Doubling the size to 2000 × 2000 requires 8 times as long to run (as (2N)^3 = 8 N^3 ).
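This growth is easy to observe directly. The following rough Matlab timing sketch uses the backslash operator (which performs Gaussian elimination with pivoting); the exact times are machine dependent, but doubling N should multiply the elapsed time by roughly 8:

    for N = [500 1000 2000]
        A = rand(N);  b = rand(N,1);
        tic;  x = A \ b;  t = toc;
        fprintf('N = %4d   elapsed time = %8.3f seconds\n', N, t);
    end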

Exercise 3.3. Estimate the computational complexity of computing a dot


product of two N -vectors.

Exercise 3.4. Estimate the computational complexity of computing a


residual and then the norm of a residual.

Exercise 3.5. Verify the claimed FLOP count for Gaussian elimination
and back substitution.

3.4 Pivoting Strategies

“Perhaps the history of the errors of mankind, all things consid-


ered, is more valuable and interesting than that of their discoveries.
Truth is uniform and narrow; it constantly exists, and does not
seem to require so much an active energy as a passive aptitude of
the soul in order to encounter it. But error is endlessly diversified;
it has no reality, but it is the pure and simple creation of the mind
that invents it.”
— Benjamin Franklin,
Report of Dr. B. Franklin and other commissioners, Charged
by the King of France with the examination of Animal Magnetism,
as now practiced in Paris, 1784.

Gaussian elimination performs the operation

Wjk = Wjk − (Wji / Wii) Wik
many times. This can cause serious roundoff error by division by small
numbers and subtraction of near equals. Pivoting strategies are how this
roundoff is minimized and its cascade through subsequent calculations con-
trolled.
We introduce the topic of pivoting strategies with an example (likely
due to Wilkinson) with exact solution (10, 1)
0.0003x + 1.566y = 1.569
0.3454x − 2.436y = 1.018 (3.3)
Solution = (10, 1)
In 4 significant digit base 10 arithmetic we calculate:
m = 0.3454/0.0003 = 1151
and solving first for y, we find:
0.0003x + 1.566y = 1.569
−1802y = −1805

so that y = 1.001. Solving further for x gives


.0003x = 1.569 − (1.566 · 1.001) = 1.569 − 1.568 = .001
so that the solution is
(x, y) = (3.333, 1.001).
This is very far from the exact solution and it seems likely that the error
is due to dividing by a small number in backsolving for x after getting the
approximate value of 1.001 for y. We consider two strategies for overcoming
this: rescaling before division (which FAILS) and swapping rows (which
works).
Attempt: (Rescaling FAILS)
Multiply equation (3.3) by 1000. This gives
0.3000x + 1566y = 1569
0.3454x − 2.436y = 1.018.
We find
m = 0.3454/0.3000 = 1.151
but, however it again fails:
(x, y) = (3.333, 1.001).
Again, the failure occurs during backsubstitution in the step
x = [1569 − 1566y]/0.3000
because the divisor is small with respect to both numerators.
Attempt: (Swapping rows SUCCEEDS)
0.3454x − 2.436y = 1.018
0.0003x + 1.566y = 1.569.
We find m = 8.686·10−4 and, again y = 1.001. This time, using the first
equation for the backsolve yields x = 10.00, a much better approximation.
This example suggests that pivoting, meaning to swap rows or columns,
is the correct approach. The choice of which rows or columns to swap is
known as a pivoting strategy. Common ones include:

Mathematical partial pivoting: Interchange only the rows, not the


columns, when the pivot entry Wii = 0. This strategy is not sufficient
to eliminate roundoff errors. Even if pivoting is done when Wii = 0 to
numerical precision, this strategy is not sufficient.

Simple partial pivoting: Interchange rows to maximize |Wji | over j ≥ i.

Algorithm 3.5 (Simple partial pivoting). Given a column i


Find row j ≥ i so that |Wji | is maximized.
Swap rows j and i.

Simple partial pivoting is a common strategy, but there are better ones.
Scaled partial pivoting: Interchange only rows so that the pivot entry
Wii is the element in column i on or below the diagonal which is largest
relative to the size of the whole row that entry is in.

Algorithm 3.6 (Scaled partial pivoting). Given a column i


(1) Compute dj := max_{i≤k≤N} |Wjk| .
(2) Find row j ≥ i so that |Wji |/di is maximized.
(3) Swap rows i and j.

The following refinement of Algorithm 3.6 breaks the steps of that


algorithm into detailed pseudocode. The pseudocode in this algorithm
is intended to stand alone so that it can be “called” by name from
another, larger algorithm. Separate groups of code of this nature are
often called “functions”, “subroutines”, or “procedures”. The Matlab
syntax for a function is:
function [“return” values] = function name(arguments)

There may be zero or more return values and zero or more arguments.
If there are zero or one return values, the brackets (“[” and “]”) can
be omitted.

Algorithm 3.7 (Scaled partial pivoting (detailed)).


Given a row number, i, an N ×N matrix W , and the value
N , return the row number pivotrow with which it should be swapped.

function pivotrow = scaled_partial_pivoting(i,W,N)


% First, find the maximum in each row.
for j=i:N
d(j) = abs(W(j,i))
for k=i+1:N
if d(j) < abs(W(j,k))
d(j) = abs(W(j,k))
end

end
end

% Second, find the pivot row


pivotrow=i
pivot = abs(W(i,i))/d(i)
for j=i+1:N
if pivot < abs(W(j,i))/d(j)
pivot = abs(W(j,i))/d(j)
pivotrow = j
end
end

Scaled partial pivoting is a very commonly used strategy. It gives a


good balance between stability and computational cost.

Exercise 3.6. Give a detailed elaboration of the partial pivoting algorithm


(at a similar level of detail as the scaled partial pivoting algorithm).

Exercise 3.7. Multiply the first equation in (3.3) by 10,000. Show that
Scaled partial pivoting yields the correct answer in four-digit arithmetic,
but partial pivoting does not.

Full pivoting: Interchange rows and columns to maximize |Wik | over i ≥ j


and k ≥ j. Full pivoting is less common because interchanging columns
reorders the solution variables. The reordering must be stored as an
extra N ×N matrix to recover the solution in the correct variables after
the process is over.

Example 3.1. Suppose at one step of Gaussian elimination, the augmented


system is
W = [ 1.0  2.0       3.0  4.0    1.7      ]
    [ 0    10^(-10)  2.0  3.0    6.0      ]
    [ 0    2.0       3.0  1.0    −1000.0  ]
    [ 0    3.0       2.0  −5.0   35.0     ] .
The RHS vector and active submatrix are partitioned with lines (that are
not stored in W ). The pivot entry is now the 2, 2 entry (currently W (2, 2) =
10−10 ). For the different pivoting strategies we would have

• Mathematical pivoting: no swapping since 10−10 = 0.



• Partial pivoting: Row 2 swap with Row 4 since 3.0 is the largest
entry below 10−10 .
• Scaled partial pivoting: Row 2 swap with Row 3 since 2.0/3.0 >
3.0/5.0.
• Full pivoting: Row 2 swap with Row 4 and Column 2 swap with
column 4 since −5.0 is the largest entry in absolute value in the active
submatrix.

Putting scaled partial pivoting into the Gaussian Elimination Algo-


rithm 3.1 yields the following algorithm. In this algorithm, a vector, p, is
also computed to keep track of row interchanges, although it is not needed
when applying Gaussian Elimination to an augmented matrix. This al-
gorithm is written so that it can be applied to any square or rectangular
N × M matrix with M ≥ N .

Algorithm 3.8 (Gaussian elimination with scaled partial pivoting).


Given a N × M (M ≥ N ) matrix W ,
for i = 1:N
p(i) = i
end
for i = 1:(N-1)
j = scaled_partial_pivoting(i,W,N)
Interchange p(i) and p(j)
Interchange rows i and j of W
for j=(i+1):N
m = W(j,i)/W(i,i)
for k =(i+1):M
W(j,k) = W(j,k) - m*W(i,k)
end
end
end
if WN,N is too small
error(’Matrix is singular!’)
end

Interchanging two components of p is accomplished by:



Algorithm 3.9 (Interchange components of p). Given a vector p and


two indices i and j.

temporary = p(i)
p(i) = p(j)
p(j) = temporary

Interchanging two rows of W is similar, but requires a loop.

Exercise 3.8. Write detailed pseudocode for interchanging two rows of W .

Exercise 3.9. Solve the 3 × 3 linear system with augmented matrix given
below by hand executing the Gaussian elimination with scaled partial piv-
oting algorithm:
W = [ −1   2  −1  0 ]
    [  0  −1   2  1 ]
    [  2  −1   0  0 ] .

Exercise 3.10. Suppose you are performing Gaussian elimination on a


square matrix A. Suppose that in your search for a pivot for column i
using simple partial pivoting you discover that maxj≥i |W (j, i)| is exactly
zero. Show that the matrix A must be singular. Would the same fact be
true if you were using scaled partial pivoting?

Exercise 3.11. In Algorithm 3.8, the matrix W is an N × M matrix and it


employs Algorithm 3.6. In that algorithm dj is constructed for i ≤ k ≤ N
and not i ≤ k ≤ M . When M > N , explain why it is a reasonable to ignore
some columns of W when pivoting.

3.5 Tridiagonal and Banded Matrices

“The longer I live, the more I read, the more patiently I think,
and the more anxiously I inquire, the less I seem to know.”
— John Adams

Gaussian elimination is much faster for banded matrices (in general)


and especially so for tridiagonal ones (in particular).

Definition 3.1. An N ×N matrix A is tridiagonal if Aij = 0 for |i−j| > 1.



Thus a tridiagonal matrix is one that takes the form:


[ d1   c1   0    0     ...    0     ] [ x1   ]   [ b1   ]
[ a2   d2   c2   0     ...    0     ] [ x2   ]   [ b2   ]
[ 0    a3   d3   c3    ...    0     ] [ x3   ]   [ b3   ]
[ ...  ...  ...  ...   ...    ...   ] [ ...  ] = [ ...  ]
[ 0    ...  0    aN−1  dN−1   cN−1  ] [ xN−1 ]   [ bN−1 ]
[ 0    ...  0    0     aN     dN    ] [ xN   ]   [ bN   ]
Performing elimination without pivoting does not alter the tridiagonal
structure. Thus the zeroes need not be stored (saving lots of storage: from
O(N 2 ) to O(N )). There is no point in doing arithmetic on those zeroes,
either, reducing FLOPS from O(N 3 ) to O(N ). There are two common
ways to store a tridiagonal linear system.
Method 1: storage as 4 vectors by:

a := (0, a2 , a3 , · · ·, aN−1 , aN )
d := (d1 , d2 , d3 , · · ·, dN−1 , dN )
c := (c1 , c2 , c3 , · · ·, cN−1 , 0)
b := (b1 , b2 , b3 , · · ·, bN−1 , bN ).
Stored in this form the elimination and backsubstitution algorithms are
as follows.

Algorithm 3.10 (Tridiagonal Elimination). Given 4 N -vectors a, d,


c, b, satisfying a(1) = 0.0 and c(N ) = 0.0
for i = 2:N
if d(i-1) is zero
error(’the matrix is singular or pivoting is required’)
end
m = a(i)/d(i-1)
d(i) = d(i) - m*c(i-1)
b(i) = b(i) - m*b(i-1)
end
if d(N) is zero
error(’the matrix is singular’)
end

Clearly, tridiagonal Gaussian elimination has one loop. Inside the loop,
roughly five arithmetic operations are performed. Thus, it is clear that, on

the whole, O(N ) floating point operations (more precisely 5N-5 FLOPS)
are done inside tridiagonal Gaussian elimination for an N × N matrix.
The backsubstitution algorithm is as follows.4

Algorithm 3.11 (Tridiagonal Backsubstitution). Given an extra N -


vector x to store the solution values, perform the following:
x(N) = b(N)/d(N)
for i = N-1:-1:1
x(i)=( b(i) - c(i)*x(i+1) )/d(i)
end
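A small driver showing how the two tridiagonal algorithms are used together, written in Matlab with the 4-vector storage of Method 1 (the test matrix is the tridiag(−1, 2, −1) example below; the right hand side is chosen so the exact solution is all ones):

    N = 5;
    a = [0; -ones(N-1,1)];               % subdiagonal, a(1) unused
    d = 2*ones(N,1);                     % diagonal
    c = [-ones(N-1,1); 0];               % superdiagonal, c(N) unused
    b = [1; zeros(N-2,1); 1];            % row sums of tridiag(-1,2,-1), so x = (1,...,1)^t
    for i = 2:N                          % tridiagonal elimination (Algorithm 3.10)
        m = a(i)/d(i-1);
        d(i) = d(i) - m*c(i-1);
        b(i) = b(i) - m*b(i-1);
    end
    x = zeros(N,1);                      % tridiagonal backsubstitution (Algorithm 3.11)
    x(N) = b(N)/d(N);
    for i = N-1:-1:1
        x(i) = ( b(i) - c(i)*x(i+1) )/d(i);
    end
    x                                    % should print a vector of ones, up to roundoff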

Example 3.2. A tridiagonal matrix that frequently occurs is


tridiag(-1,2,-1) or
[  2  −1   0   0  ...   0 ]
[ −1   2  −1   0  ...   0 ]
[  0  −1   2  −1  ...   0 ]
[          ...            ]
[  0  ...  0  −1   2   −1 ]
[  0  ...  0   0  −1    2 ] .
The first step in GE for this matrix is to replace:
Row2 <= Row2 − (−1/2)Row1.
This gives
[  2  −1    0   0  ...   0 ]
[  0  1.5  −1   0  ...   0 ]
[  0  −1    2  −1  ...   0 ]
[           ...            ]
[  0  ...   0  −1   2   −1 ]
[  0  ...   0   0  −1    2 ] .
This zeroes out the entire first column.
Backsubstitution for tridiagonal matrices is also an O(N ) algorithm
since there is one loop with a subtraction, a multiplication, and a division.
In summary, tridiagonal system solution without pivoting requires:

• 5N − 5 adds, multiplies and divides for elimination, and,


• 3N − 2 adds, multiplies and divides for backsubstitution.
4 Recall that the syntax “for i=N-1:-1:1” means that the loop starts at i=N-1 and i

decreases to 1.

More generally, for a banded, sparse matrix with half bandwidth p


(and thus full bandwidth 2p + 1) banded sparse Gaussian elimination takes
O(p2 N ) FLOPS.
Method 2: Storage as a banded matrix with bandwidth three or half
bandwidth p = 1.
In this case we store the augmented matrix as
W_{4×N} := [ b1  b2  ...  bN ]
           [ c1  c2  ...  0  ]
           [ d1  d2  ...  dN ]
           [ 0   a2  ...  aN ] .
Modification of the Gaussian elimination algorithm for this alternative
storage method is given below.

Algorithm 3.12 (Tridiagonal Gaussian Elimination: Band Storage).


Given a 4 × N augmented matrix W , with W(4,1) = 0.0 and W(2,N) = 0.0.

for i = 2:N
if W(3,i-1) is zero
error(’the matrix is singular or pivoting is required’)
end
m = W(4,i)/W(3,i-1)
W(3,i) = W(3,i) - m*W(2,i-1)
W(1,i) = W(1,i) - m*W(1,i-1)
end
if W(3,N) is zero
error(’the matrix is singular.’)
end

Exercise 3.12. Give a pseudocode algorithm for backsubstitution for tridi-


agonal matrices stored in band form.

Exercise 3.13. Extend the algorithms given here to general banded sys-
tems with half bandwidth p < N/2.

Exercise 3.14. What is the operation count for an N × N system for


Gaussian elimination? Back substitution? If the matrix is tridiagonal,
then what are the operation counts?

Exercise 3.15. If a 10000×10000 tridiagonal linear system takes 2 minutes


to solve using tridiagonal elimination plus backsubstitution, estimate how

long it would take to solve using the full GE plus full backsubstitution
algorithms. (This explains why it makes sense to look at the special case
of tridiagonal matrices.)

Exercise 3.16. Verify that Gaussian elimination requires O(p2 N ) FLOPS


for an N × N banded matrix with half-bandwidth p.

Exercise 3.17. The above algorithms for tridiagonal Gaussian elimination


contain lines such as “if W(3,N) is zero” or “if d(N) is zero”. If you were
writing code for a computer program, how would you interpret these lines?

Exercise 3.18. Write a computer program to solve the tridiagonal system


A = tridiag(−1, 2, −1) using tridiagonal Gaussian elimination with band
storage, Algorithm 3.12. Test your work by choosing the solution vector of
all 1 s and RHS containing the row sums of the matrix. Test your work for
system sizes N = 3, and N = 1000.

3.6 The LU Decomposition

“Measure what is measurable, and make measurable what is


not so.”
— Galilei, Galileo (1564–1642), Quoted in H. Weyl “Mathemat-
ics and the Laws of Nature” in I Gordon and S. Sorkin (eds.) The
Armchair Science Reader, New York: Simon and Schuster, 1959.

“Vakmanschap is meesterschap.” (Craftsmanship is mastery.)
— (Motto of Royal Grolsch NV, brewery.)

Suppose we could factor the N × N matrix A as the product

A = LU, L : lower triangular , U : upper triangular.

Then, we can solve the linear system Ax = b without Gaussian elimination


in two steps:

(1) Forward substitution: solve Ly = b.


(2) Backward substitution: solve U x = y.

Step 1. Forward solve for y


Ly = b

[ l11      0        0       ...  0          0   ] [ y1   ]   [ b1   ]
[ l21      l22      0       ...  0          0   ] [ y2   ]   [ b2   ]
[ ...      ...      ...     ...  ...        ... ] [ ...  ] = [ ...  ]
[ lN−1,1   lN−1,2   lN−1,3  ...  lN−1,N−1   0   ] [ yN−1 ]   [ bN−1 ]
[ lN1      lN2      lN3     ...  lN,N−1     lNN ] [ yN   ]   [ bN   ]

so

l11 y1 = b1 ⇒ y1 = b1 / l11 ,
l21 y1 + l22 y2 = b2 ⇒ y2 = (b2 − l21 y1) / l22 ,

and so on.
Step 2. Backward solve U x = y for x
Ux = y

[ u11  u12  u13  ...  u1,N−1     u1N    ] [ x1   ]   [ y1   ]
[ 0    u22  u23  ...  u2,N−1     u2N    ] [ x2   ]   [ y2   ]
[ ...  ...  ...  ...  ...        ...    ] [ ...  ] = [ ...  ]
[ 0    0    0    ...  uN−1,N−1   uN−1,N ] [ xN−1 ]   [ yN−1 ]
[ 0    0    0    ...  0          uNN    ] [ xN   ]   [ yN   ] ,
so
u N N x N = yN ⇒ xN = yN /uN N
and

uN −1,N −1 xN −1 + uN −1,N xN = yN −1 ⇒
xN −1 = (yN −1 − uN −1,N xN )/uN −1,N −1 .
Thus, once we compute a factorization A = LU we can solve linear systems
relatively cheaply. This is especially important if we must solve many linear
systems with the same A = LU and different RHS’s b. First consider the
case without pivoting.
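A minimal Matlab sketch of the two triangular solves, assuming the factors L and U are already in hand (the function name is ours; pivoting is not handled here):

    function x = lu_solve(L, U, b)
    % Solve A x = b given A = L U: forward solve L y = b, then backsolve U x = y.
    N = length(b);
    y = zeros(N,1);
    for i = 1:N                          % forward substitution
        y(i) = ( b(i) - L(i,1:i-1)*y(1:i-1) ) / L(i,i);
    end
    x = zeros(N,1);
    for i = N:-1:1                       % backward substitution
        x(i) = ( y(i) - U(i,i+1:N)*x(i+1:N) ) / U(i,i);
    end
    end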

Theorem 3.1 (Remarkable Algorithmic Fact). If no pivoting is used


and the Gaussian elimination algorithm stores the multipliers mij below the

diagonal of A the algorithm computes the LU factorization of A where L


and U are given by
L = [ 1        0        0       ...  0        0 ]
    [ m21      1        0       ...  0        0 ]
    [ ...      ...      ...     ...  ...    ... ]
    [ mN−1,1   mN−1,2   mN−1,3  ...  1        0 ]
    [ mN1      mN2      mN3     ...  mN,N−1   1 ] ,

U = [ u11  u12  u13  ...  u1,N−1     u1N    ]
    [ 0    u22  u23  ...  u2,N−1     u2N    ]
    [ ...  ...  ...  ...  ...        ...    ]
    [ 0    0    0    ...  uN−1,N−1   uN−1,N ]
    [ 0    0    0    ...  0          uNN    ] .

Exercise 3.19. Prove Theorem 3.1 for the 3 × 3 case using the following
steps.

(1) Starting with a 3 × 3 matrix A, perform one column of row reductions, resulting in a matrix

        L̃1 = [  1    0  0 ]
              [ −m21  1  0 ]
              [ −m31  0  1 ]

    with −m_{n,1} for n = 2, 3 denoting the multipliers used in the row reduction.
(2) Consider the matrix

        L1 = [ 1    0  0 ]
             [ m21  1  0 ]
             [ m31  0  1 ]

    and show that
    (a) L̃1 A = U1 , where U1 is the result of the row reduction in step (1), and
    (b) L1 L̃1 = I, where I is the identity matrix, so L1 = (L̃1)^{−1} .
    Hence, A = L1 U1 .
(3) Similarly, perform one column of row reductions on the second column of U1 and construct the matrix

        L̃2 = [ 1   0    0 ]
              [ 0   1    0 ]
              [ 0  −m32  1 ] .

(4) Show that
    (a) L̃2 U1 = U2 , where U2 is upper triangular, and
    (b) L2 L̃2 = I, where L2 is L̃2 with −m32 replaced by m32 and I is the identity matrix, so L2 = (L̃2)^{−1} , and
    (c) (this is the surprising part)

        L1 L2 = [ 1    0    0 ]
                [ m21  1    0 ]
                [ m31  m32  1 ]

    so that A = L1 L2 U2 .

Exercise 3.20. Prove Theorem 3.1 for the general case, using Exercise 3.19
as a model.
Remark 3.1. When solving systems with multiple RHS’s, it is common
to compute L and U in double precision and store in the precision sought
in the answer (either single or double). This gives extra accuracy without
extra storage. Precisions beyond double are expensive, however, and are
used sparingly.
Remark 3.2. Implementations of Gaussian Elimination combine the two
matrices L and U together instead of storing the ones on the diagonal of
L and all the zeros of L and U , a savings of storage for N 2 real numbers.
The combined matrix is
W = [ u11      u12      u13     ...  u1,N−1      u1N    ]
    [ m21      u22      u23     ...  u2,N−1      u2N    ]
    [ ...      ...      ...     ...  ...         ...    ]
    [ mN−1,1   mN−1,2   mN−1,3  ...  uN−1,N−1    uN−1,N ]
    [ mN1      mN2      mN3     ...  mN,N−1      uNN    ] .
Example 3.3. Suppose A is the 4 × 4 matrix below.
A = [ 3   1  −2  −1 ]
    [ 2  −2   2   3 ]
    [ 1   5  −4  −1 ]
    [ 3   1   2   3 ] .
Performing Gauss elimination without pivoting (exactly as in the algo-
rithm) and storing the multipliers gives
W = [ 3     1     −2     −1    ]
    [ 2/3   −8/3  10/3   11/3  ]
    [ 1/3   −7/4  5/2    23/4  ]
    [ 1     0     8/5    −26/5 ] .

Thus, A = LU where
L = [ 1    0     0    0 ]            U = [ 3   1     −2    −1    ]
    [ 2/3  1     0    0 ]                [ 0   −8/3  10/3  11/3  ]
    [ 1/3  −7/4  1    0 ]     and        [ 0   0     5/2   23/4  ]
    [ 1    0     8/5  1 ]                [ 0   0     0     −26/5 ] .
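Such a factorization is easy to check numerically; a short Matlab sketch using the matrices of Example 3.3:

    A = [3 1 -2 -1; 2 -2 2 3; 1 5 -4 -1; 3 1 2 3];
    L = [1 0 0 0; 2/3 1 0 0; 1/3 -7/4 1 0; 1 0 8/5 1];
    U = [3 1 -2 -1; 0 -8/3 10/3 11/3; 0 0 5/2 23/4; 0 0 0 -26/5];
    norm(A - L*U)                        % should be zero up to roundoff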

Exercise 3.21. Algorithm 3.12 describes tridiagonal Gaussian elimination.


Modify that algorithm to store the multipliers in the matrix W so that it
computes both the lower and upper tridiagonal factors.

Exercise 3.22. Suppose A has been factored as A = LU with L and U


given below. Use this factorization to solve Ax = e3 , where e3 = (0, 0, 1)t .
L = [ 1  0  0 ]            U = [ 2  3  4 ]
    [ 2  1  0 ] ,              [ 0  1  0 ] .
    [ 3  4  1 ]                [ 0  0  3 ]

Exercise 3.23. Algorithm 3.8 describes the algorithm for Gaussian elimi-
nation with scaled partial pivoting for an augmented matrix W , but it does
not employ the combined matrix factor storage described in Remark 3.2.
Modify Algorithm 3.8 so that

(1) It applies to square matrices; and,


(2) It employs combined matrix factor storage.

The question remains: What happens when pivoting is required? To


help answer this question, we need to introduce the concept of a “permu-
tation” to make the notion of swapping rows clear.

Definition 3.2. A permutation vector is a rearrangement of the vector




p = [1, 2, 3, · · ·, N ]t .
A permutation matrix is an N × N matrix whose columns are rearrange-
ments of the columns of the N × N identity matrix. This means that there
is a permutation vector such that
j th column of P = p(j)th column of I.

For example, if N = 2, the permutation matrices are


P1 = [ 0  1 ]            P2 = [ 1  0 ]
     [ 1  0 ]   and           [ 0  1 ] .

Note that

P1 [ x1 ]   [ x2 ]                P1^{−1} [ x1 ]   [ x2 ]
   [ x2 ] = [ x1 ]     and                [ x2 ] = [ x1 ] .

If p = (2, 1) then we compute y = P^{−1} x by

for i = 1:2
    y(i) = x( p(i) )
end

More generally, we compute y = P^{−1} x by

for i=1:N
    y(i)=x( p(i) )
end

Theorem 3.2 (A = P LU factorization). Gaussian elimination with


partial pivoting, as presented in Algorithm 3.8 and modified in Exer-
cise 3.23, computes the permutation vector, p, as part of the elimination
process and stores both the multipliers and the upper triangular matrix in
the combined matrix W . Thus, it constructs the factorization
A = P LU,
where P is the permutation matrix corresponding to the vector p.

Proof. The essential part of the proof can be seen in the 3 × 3 case, so
that case will be presented here.
The first step in Algorithm 3.8 is to find a pivot for the first column.
Call this pivot matrix P1 . Then row reduction is carried out for the first
column, with the result
A = (P1−1 P1 )A = P1−1 (P1 A) = P1−1 L1 U1 ,
where
L1 = [ 1    0  0 ]            U1 = [ u11  ∗  ∗ ]
     [ m21  1  0 ]    and          [ 0    ∗  ∗ ]
     [ m31  0  1 ]                 [ 0    ∗  ∗ ]
where the asterisks indicate entries that might be non-zero.
The next step is to pivot the second column of U1 from the diagonal
down, and then use row-reduction to factor it:
A = P1−1 L1 (P2−1 L2 U )

where
L2 = [ 1  0    0 ]            U = [ u11  u12  u13 ]
     [ 0  1    0 ]    and         [ 0    u22  u23 ]
     [ 0  m32  1 ]                [ 0    0    u33 ] .

Finally, it must be shown that L1 P2−1 L2 can be expressed as P̃ −1 L. It


is clear that

L1 P2−1 L2 = (P2−1 P2 )L1 P2−1 L2 = P2−1 (P2 L1 P2−1 )L2 .

There are only two possibilities for the permutation matrix P2 . It can be
the identity, or it can be
P2 = [ 1  0  0 ]
     [ 0  0  1 ]
     [ 0  1  0 ] .

If P2 is the identity, then P2−1 (P2 L1 P2−1 )L2 = L1 L2 and is easily seen to
be lower triangular. If not,
P2 L1 P2^{−1} = [ 1  0  0 ] [ 1    0  0 ] [ 1  0  0 ]
                [ 0  0  1 ] [ m21  1  0 ] [ 0  0  1 ]
                [ 0  1  0 ] [ m31  0  1 ] [ 0  1  0 ]

              = [ 1    0  0 ]
                [ m31  1  0 ]
                [ m21  0  1 ] .
Hence,
P2 L1 P2^{−1} L2 = [ 1    0    0 ]
                   [ m31  1    0 ]
                   [ m21  m32  1 ] ,
a lower triangular matrix.

Exercise 3.24. Complete the proof of Theorem 3.2 for N × N matrices.


Hint: It is important to realize that the fact that P2 L1 P2−1 L2 turns out to
be lower triangular depends strongly on the permutation matrix P2 involv-
ing indices greater than the indices of columns of L1 with non-zeros below
the diagonal.

Given the factorization A = P LU , the solution of Ax = b is then found


in three steps.

Algorithm 3.13 (Solving P LU x = b).



(1) Compute d = P^{−1} b, i.e., rearrange b by:

    for k=1:N
        d(k)=b(p(k))
    end

(2) Forward solve L y = d.
(3) Backsolve U x = y.

Example 3.4. Suppose that, in solving a 3 × 3 system, elimination swaps row 1 and row 2. Then p = (1, 2, 3) is changed to p = (2, 1, 3) at the end of elimination. Let b = (1, 3, 7)^t , p = (2, 1, 3)^t . Then d = P^{−1} b = (3, 1, 7)^t .
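In Matlab the rearrangement step is a single line, since b(p) picks out the entries b(p(k)); a sketch using the data of Example 3.4 (with the two triangular solves done, for example, by the lu_solve sketch given earlier):

    b = [1; 3; 7];
    p = [2; 1; 3];
    d = b(p)                             % gives (3, 1, 7)^t, as in Example 3.4
    % With the factors L and U in hand, the solution is then
    % x = lu_solve(L, U, d);             % forward solve L y = d, backsolve U x = y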

Remark 3.3.

(1) Factoring A = P LU takes O(N 3 ) FLOPS but using it thereafter for


backsolves only takes O(N 2 ) FLOPS.
(2) If A is symmetric and positive definite then the P LU decomposition can
be further refined into an LLt decomposition known as the Cholesky
decomposition. It takes about 1/2 the work and storage of P LU . Gaus-
sian elimination for SPD matrices does not require pivoting, an important savings in time and storage.

Exercise 3.25. Find the LU decomposition of


[ 3  9 ]
[ 2  7 ] .

Exercise 3.26. Given the LU decomposition


[ 2  3  ]   [ 1  0 ] [ 2   3 ]
[ 8  11 ] = [ 4  1 ] [ 0  −1 ]
use it to solve the linear system

2x + 3y = 0
8x + 11y = 1.

Exercise 3.27. Write a computer program to perform Gaussian elimina-


tion on a square matrix, A, using partial pivoting (Algorithm 3.8 as modi-
fied in Exercise 3.23).

(1) At the end of the algorithm, reconstruct the matrices P , L, and U and compute a relative norm ||A − P LU|| / ||A||. (You can use the Frobenius norm, ||A||_fro^2 = sum_{ij} |Aij|^2 .) The norm should be zero or nearly zero. (Alternatively, perform the calculation ||A − P LU|| / ||A|| without explicitly constructing P , L, and U .)
(2) To help debug your work, at the beginning of each of the column reduction steps (the second for i= loop), reconstruct the matrices P , L, and U and compute a norm ||A − P LU||. The norm should be zero or nearly zero each time through the loop. (Alternatively, compute the norm without reconstructing the matrix factors.) Once you are confident your code is correct, you can eliminate this debugging code.
(3) Test your code on the N × N matrix consisting of all ones everywhere
except that the diagonal values are zero. Use several different values of
N as tests. The 3 × 3 case looks like
A = [ 0  1  1 ]
    [ 1  0  1 ] .        (3.4)
    [ 1  1  0 ]

Exercise 3.28. (This exercise continues Exercise 3.27.) Write a computer


program to perform the backsubstitution steps, given the compressed ma-
trix W arising from Gaussian elimination with scaled partial pivoting. Test
your work by applying it to the N × N matrix A described in Exercise 3.27
with right side given by b = (N − 1, N − 1, . . . , N − 1)t , whose solution is
x = (1, 1, . . . , 1)t . Use several values of N for your tests.

Chapter 4

Norms and Error Analysis

“Fallor ergo sum.” (I err, therefore I am.)

— Augustine.

In Gaussian elimination there are a large number of calculations. Each


operation depends upon all previous ones. Thus, round off error occurs and
propagates. It is critically important to understand and quantify precisely
what “numerically singular” or “ill conditioned” means, to quantify it and
to predict its effect on solution cost and accuracy. We begin this study in
this chapter.

4.1 FENLA and Iterative Improvement

An expert is someone who knows some of the worst mistakes


that can be made in his subject, and how to avoid them.
— Heisenberg, Werner (1901–1976), Physics and Beyond. 1971.

If the matrix is numerically singular or ill-conditioned, it can be diffi-


cult to obtain the accuracy needed in the solution of the system Ax = b
by Gaussian elimination alone. Iterative improvement is an algorithm to
increase the accuracy of a solution to Ax = b. The basic condition needed
is that Ax = b can be solved to at least one significant digit of accuracy.
Iterative improvement is based on the Fundamental Equation of Numerical
Linear Algebra (the “FENLA”).

Theorem 4.1 (FENLA). Let A be N × N , b be N × 1 , and let x be the true solution
to Ax = b. Let x̂ be some other vector. The error e := x − x̂ and residual
r := b − Ax̂ are related by
Ae = r.


Proof. Since e = x − x̂, Ae = A(x − x̂) = Ax − Ax̂. Then, since Ax = b,


Ae = Ax − Ax̂ = b − Ax̂ = r.

Given a candidate for a solution x̂, if we could find its error ê(= x − x̂),
then we would recover the true solution
x = x̂ + ê (since x̂ + ê = x̂ + (x − x̂) = x).
Thus we can say the following two problems are equivalent:

Problem 1: Solve Ax = b.
Problem 2: Guess x̂, compute r̂ = b−Ax̂, solve Aê = r̂, and set x = x̂+ê.

This equivalence is the basis of iterative improvement.

Algorithm 4.1 (Iterative Improvement). Given a matrix A, a RHS


vector b, and a precision t, find an approximate solution x to the equation
Ax = b with at least t correct significant digits.

Compute the A = LU factorization of A in working precision


Solve Ax = b for candidate solution x̂ in working precision
for k=1:maxNumberOfIterations
Calculate the residual

r = b − Ax̂ (4.1)

in extended precision
Solve Ae = r (by doing 2 backsolves in working precision) for
an approximate error ê
Replace x̂ with x̂ + ê
if ||ê|| ≤ 10^(−t) ||x̂||
return
end
end
error(‘The iteration did not achieve the required error.’)

It is not critical to perform the residual calculation (4.1) in higher preci-


sion than that used to store the matrix A and vector x, but it substantially
improves the algorithm’s convergence and, in cases with extremely large
condition numbers, is required for convergence.

Using extended precision for the residual may require several iteration
steps, and the number of steps needed increases as A becomes more ill-
conditioned, but in all cases, it is much cheaper than computing the LU
decomposition of A itself in extended precision. Thus, it is almost always
performed in good packages.
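A sketch of Algorithm 4.1 in Matlab, using the built-in lu factorization and computing the residual in working (double) precision rather than extended precision; the function name and arguments are ours:

    function x = iterative_improvement(A, b, t, maxit)
    % Iterative improvement for A x = b, aiming for t correct significant digits.
    [L, U, P] = lu(A);                   % factor once: P*A = L*U
    x = U \ ( L \ (P*b) );               % candidate solution from two backsolves
    for k = 1:maxit
        r = b - A*x;                     % residual (ideally in extended precision)
        e = U \ ( L \ (P*r) );           % approximate error, two more backsolves
        x = x + e;
        if norm(e) <= 10^(-t) * norm(x)
            return
        end
    end
    error('The iteration did not achieve the required error.')
    end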

Example 4.1. Suppose the matrix A is so ill conditioned that solving with
it only gives 2 significant digits of accuracy. Stepping through iterative
improvement we have:
x̂ : 2 significant digits.
Calculate r = b − A x̂.
Solve A ê = r : ê approximates the error to 2 significant digits.
Then x̂ ⇐ x̂ + ê : 4 significant digits.
Repeat: calculate r = b − A x̂ and solve A ê = r : ê again good to 2 significant digits.
Then x̂ ⇐ x̂ + ê : 6 significant digits, and so on until the desired accuracy is attained.

Example 4.2. On a 3 significant digit computer, suppose we solve Ax = b


where (to 3 significant digits)

b = [5.90 7.40 10.0]t

and
A = [ 1.00  1.20  1.50 ]
    [ 1.20  1.50  2.00 ]
    [ 1.50  2.00  3.00 ] .
The exact solution of Ax = b is

x = [2.00 2.00 1.00]t .

Step 1: Computing A = LU in working precision (using 3 significant digits


in this example) gives
A = [ 1.00  1.20  1.50 ]   [ 1.00  0.00  0.00 ] [ 1.00  1.20    1.50   ]
    [ 1.20  1.50  2.00 ] = [ 1.20  1.00  0.00 ] [ 0.00  0.0600  0.200  ]
    [ 1.50  2.00  3.00 ]   [ 1.50  3.33  1.00 ] [ 0.00  0.00    0.0840 ] .
Step 2: Solving Ax = b in 3 significant digit arithmetic using 2 backsolves
(Ly = b and U x = y) gives

x̂ = [1.87 2.17 0.952]t .



Step 3: A “double precision” (6 digit) calculation of the residual gives


r = [−0.00200 − 0.00300 − 0.00100]t .
Step 4: The single precision (3 digit) solution of LU ê = r is
ê = [0.129 − 0.168 0.0476]t .
Step 5: Update solution
x̂ = x̂OLD + ê = [2.00 2.00 1.00]t
which is accurate to the full 3 significant digits!

Exercise 4.1. Algorithm 4.1 describes iterative improvement. For each


step, give the estimate of its computational complexity (its “FLOP count”).

Exercise 4.2. Show that, when double precision is desired, it can be more
efficient for large N to compute the LU factorization in single precision and
use iterative refinement instead of using double precision for the factoriza-
tion and solution. The algorithm can be described as:

(1) Convert the matrix A to single precision from double precision.


(2) Find the factors L and U in single precision.
(3) Use Algorithm 4.1 to improve the accuracy of the solution. Use A in
double precision to compute the double precision residual.

Estimate the FLOP count for the algorithm as outlined above, assuming
ten iterations are necessary for convergence. Count each double precision
operation as two FLOPs and count each change of precision as one FLOP.
Compare this value with the operation count for double precision factor-
ization with a pair of double precision backsolves.

4.2 Vector Norms

“Intuition is a gift. . . . Rare is the expert who combines an


informed opinion with a strong respect for his own intuition.”
— G. de Becker, 1997.

Iterative improvement introduces interesting questions like:

• How to measure improvement in an answer?


• How to measure residuals?
• How to quantify ill-conditioning?

The answer to all these questions involves norms. A norm is a gener-


alization of length and is used to measure the size of a vector or a matrix.

Definition 4.1. Given x ∈ R^N , a norm of x, ||x||, is a nonnegative real number satisfying

• (Definiteness) ||x|| ≥ 0 and ||x|| = 0 if and only if x = 0.
• (Homogeneity) For any real number α and all x ∈ R^N

    ||α x|| = |α| ||x||.

• (The triangle inequality) For all x, y ∈ R^N

    ||x + y|| ≤ ||x|| + ||y||.

Example 4.3 (Important norms). (i) The Euclidean, or l2 , norm:

    ||x||_2 = sqrt( x · x ) = ( |x1|^2 + |x2|^2 + . . . + |xN|^2 )^{1/2} .

(ii) The 1-norm or l1 norm:

    ||x||_1 := |x1| + |x2| + . . . + |xN| .

(iii) The max norm or l∞ norm:

    ||x||_∞ := max_{1≤j≤N} |xj| .

(iv) The p-norm or lp norm: for 1 ≤ p < ∞,

    ||x||_p = ( |x1|^p + |x2|^p + . . . + |xN|^p )^{1/p} .

The max norm is called the l∞ norm because

    ||x||_p → max_j |xj| ,  as p → ∞.

Proposition 4.1 (Norm Equivalence). For all x ∈ R^N we have:

    ||x||_∞ ≤ ||x||_1 ≤ N ||x||_∞ .

If the number of variables N is large, it is common to redefine these norms to make them independent of N by requiring ||(1, 1, . . . , 1)|| = 1. This gives the perfectly acceptable modifications of (i)–(iv) below:

(i) ||x||_RMS := sqrt( (1/N) sum_{j=1}^{N} x_j^2 ) , (the “Root Mean Square” norm).
(ii) ||x||_AVG := (1/N) ( |x1| + . . . + |xN| ) , the average size of the entries.

The weighted norms || · ||_1 (the average), || · ||_2 (the root mean square) and || · ||_∞ (the maximum) are by far the most important. Only the || · ||_2 or

RMS norm comes from an inner product. Other weights are possible, such
as

|||x||| := sqrt( sum_{j=1}^{N} ω_j x_j^2 ) ,  where ω_j > 0 and sum_{j=1}^{N} ω_j = 1.

Weighted norms are used in cases where different components have different
significance, uncertainty, impact on the final answer, etc.
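For a concrete vector each of these norms is a single line in Matlab, whose norm function computes ||x||_p directly:

    x = [3; -4; 12];
    norm(x, 1)                           % l1 norm: 19
    norm(x, 2)                           % Euclidean norm: 13
    norm(x, inf)                         % max norm: 12
    sqrt( mean(x.^2) )                   % RMS norm: 13/sqrt(3), about 7.51
    mean( abs(x) )                       % average norm: 19/3, about 6.33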

Exercise 4.3. Show that ||x||_p → max_j |xj| , as p → ∞.

4.2.1 Norms that come from inner products


The Euclidean or l2 norm comes from the usual dot product by

||x||_2^2 = x · x = sum_i |x_i|^2 .

Dot products open geometry as a tool for analysis and for understanding
since the angle1 between two vectors x, y can be defined through the dot
product by
cos(θ) = (x · y) / ( ||x||_2 ||y||_2 ).
Thus norms that are induced by dot products are special because they
increase the number of tools available for analysis.

Definition 4.2. An inner product on R^N is a map (x, y) → ⟨x, y⟩_* , mapping R^N × R^N → R and satisfying the following:

• (definiteness) ⟨x, x⟩_* ≥ 0 and ⟨x, x⟩_* = 0 if and only if x = 0.
• (bilinearity) For any real numbers α, β and all x, y, z ∈ R^N

    ⟨α x + β y, z⟩_* = α ⟨x, z⟩_* + β ⟨y, z⟩_* .

• (symmetry) For all x, y ∈ R^N

    ⟨x, y⟩_* = ⟨y, x⟩_* .

Proposition 4.2 (Inner product induces a norm). If ⟨·, ·⟩_* is an inner product then ||x||_* = sqrt( ⟨x, x⟩_* ) is the norm induced by the inner product.
1 In statistics this is called the correlation between x and y. If the value is 1 the vectors
point the same way and are thus perfectly correlated. If it is −1 they are said to be
anti-correlated.

Since an inner product is a generalization of the usual euclidean dot
product, it is no surprise that norms and angles can be defined
through any given dot product by

  Induced norm: ||x||_∗ = √⟨x, x⟩_∗,
  Induced angle: cos_∗(θ) = ⟨x, y⟩_∗ / (||x||_∗ ||y||_∗).
The following definition shows that orthogonality has the expected
meaning.
Definition 4.3. Vectors x, y are orthogonal in the inner product ⟨·, ·⟩_∗
if ⟨x, y⟩_∗ = 0 (and thus the induced angle between them is π/2). Vectors
x, y are orthonormal if they are orthogonal and have induced norm one:
||x||_∗ = ||y||_∗ = 1. A set of vectors is orthogonal (respectively orthonormal)
if its elements are pairwise orthogonal (respectively orthonormal).
We have used the subscript ∗ as a placeholder in our definition of
inner product because it will be convenient to reserve ⟨x, y⟩ for the usual
euclidean dot product:
  ⟨x, y⟩ := Σ_{j=1}^N x_j y_j = x · y.
Vectors are operated upon by matrices, so the question of how angles can
change under that action can be important. There is one special case with an easy
answer.
Theorem 4.2. Let ⟨x, y⟩ denote the usual euclidean inner product. If A is
an N × N real, symmetric matrix (i.e., if A = A^t or a_ij = a_ji) then for all
x, y
  ⟨Ax, y⟩ = ⟨x, Ay⟩.
More generally, for all x, y and any N × N matrix A,
  ⟨Ax, y⟩ = ⟨x, A^t y⟩.
Proof. We calculate (switching the double sum²)

  ⟨Ax, y⟩ = Σ_{i=1}^N ( Σ_{j=1}^N a_ij x_j ) y_i
          = Σ_{j=1}^N ( Σ_{i=1}^N a_ij y_i ) x_j = ⟨x, A^t y⟩.

2 Every proof involving a double sum seems to be done by switching their order then

noticing what you get.



The property that ⟨Ax, y⟩ = ⟨x, Ay⟩ is called self-adjointness with re-
spect to the given inner product ⟨·, ·⟩. If the inner product changes, the ma-
trices that are self-adjoint change and must be redetermined from scratch.
Often the problem under consideration will induce the norm one is
forced to work with. One common example occurs with SPD matrices.
Definition 4.4. An N × N matrix A is symmetric positive definite, SPD
for short, if

• A is symmetric: A^t = A, and
• A is positive definite: for all x ≠ 0, x^t A x > 0.

SPD matrices can be used to induce inner products and norms as follows.
Definition 4.5. Suppose A is SPD. The A inner product and A norm are

  ⟨x, y⟩_A := x^t A y, and ||x||_A := √⟨x, x⟩_A.

The A inner product is of special importance for solutions of Ax =
b when A is SPD. Indeed, using the equation Ax = b, ⟨x, y⟩_A can be
calculated when A is SPD without knowing the vector x as follows:
  ⟨x, y⟩_A = x^t A y = (Ax)^t y = b^t y.
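As a quick numerical illustration (added here, not from the original; NumPy and the randomly generated SPD matrix are assumptions), the identity ⟨x, y⟩_A = b^t y can be checked without using x on the right-hand side:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 5
    M = rng.standard_normal((N, N))
    A = M @ M.T + N * np.eye(N)      # an SPD matrix: M M^t plus a diagonal shift

    x = rng.standard_normal(N)
    y = rng.standard_normal(N)
    b = A @ x                        # so x solves Ax = b

    lhs = x @ A @ y                  # <x, y>_A = x^t A y
    rhs = b @ y                      # b^t y, computable without knowing x
    print(abs(lhs - rhs))            # agrees up to roundoff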

Exercise 4.4. Prove that if ⟨·, ·⟩_∗ is an inner product then ||x||_∗ = √⟨x, x⟩_∗
is a norm.
Exercise 4.5. Prove that if A is SPD then ⟨x, y⟩_A := x^t A y is an inner
product. Show that A, A², A³, · · · are self-adjoint with respect to the A
inner product: ⟨A^k x, y⟩_A = ⟨x, A^k y⟩_A.
Exercise 4.6. If ⟨·, ·⟩_∗ satisfies two but not all three conditions of an inner
product, find which conditions in the definition of a norm are satisfied and
which are violated. Apply your analysis to ⟨x, y⟩_A := x^t A y when A is not
SPD.
Exercise 4.7. The unit ball is {x : ||x||_∗ ≤ 1}. Sketch the unit ball in R²
for the 1, 2 and infinity norms. Note that the only ball that looks ball-like
is the one for the 2-norm. Sketch the unit ball in the weighted 2-norm
induced by the inner product ⟨x, y⟩ := (1/4)x_1 y_1 + (1/9)x_2 y_2.
Exercise 4.8. An N × N matrix is orthogonal if its columns are N
orthonormal (with respect to the usual euclidean inner product) vec-
tors. Show that if O is an orthogonal matrix then O^t O = I, and that
||Ox||_2 = ||x||_2.

4.3 Matrix Norms

“Wir müssen wissen.


Wir werden wissen.”
— David Hilbert (1862–1943) [Engraved on his tombstone in
Göttingen.]

It is easy to define a norm on matrices by thinking of an N × N matrix
as just an ordered collection of N² real numbers. For example, max_{i,j} |a_ij|
is a norm, as is the so-called Frobenius norm,

  ||A||_Frobenius := ( Σ_{j=1}^n Σ_{i=1}^n a_ij² )^{1/2}.

However, most such norms are not useful. Matrices multiply vectors.
Thus, a useful norm is one which can be used to bound how much a vector
grows when multiplied by A. Thus, under y = Ax we seek a notion of ||A||
under which
  ||y|| = ||Ax|| ≤ ||A|| ||x||.
Starting with the essential function a matrix norm must serve and working
backwards gives the following definition.

Definition 4.6 (Matrix Norm). Given an N × N matrix A and a vector
norm || · ||, the induced matrix norm of A is defined by

  ||A|| = max_{x ∈ R^N, x ≠ 0} ||Ax|| / ||x||.

By this definition, ||A|| is the smallest number such that
  ||Ax|| ≤ ||A|| ||x|| for all x ∈ R^N.
The property that ||Ax|| ≤ ||A|| ||x|| for all x ∈ R^N is the key to using matrix
norms. It also follows easily from the definition of matrix norms that
  ||I|| = 1.
Many features of the induced matrix norm follow immediately from prop-
erties of the starting vector norm, such as the following.

Theorem 4.3 (A norm on matrices). The induced matrix norm is a


norm on matrices because if A, B are N × N matrices and α a scalar,
then

(1) ||αA|| = |α| ||A||.
(2) ||A|| ≥ 0 and ||A|| = 0 if and only if A ≡ 0.
(3) ||A + B|| ≤ ||A|| + ||B||.

Proof. Exercise!

Other features follow from the fact that matrix norms split products
apart, such as the following.

Theorem 4.4 (Properties of Matrix Norms). Let A, B be N × N ma-
trices and α a scalar. Then

(1) ||Ax|| ≤ ||A|| ||x||.
(2) ||AB|| ≤ ||A|| ||B||.
(3) If A is invertible, then for all x
    ||x|| / ||A^{-1}|| ≤ ||Ax|| ≤ ||A|| ||x||.
(4) ||A|| = max_{||x||=1, x ∈ R^N} ||Ax||.
(5) ||A^{-1}|| ≥ 1/||A||.
(6) For any N × N matrix A and any matrix norm || · ||:
    |λ(A)| ≤ ||A||.

Proof. We will prove some of these to show how ||Ax|| ≤ ||A|| ||x|| is used in
getting bounds on the action of a matrix. For example, note that A^{-1}Ax =
x. Thus
  ||x|| ≤ ||A^{-1}|| ||Ax||,
so
  ||x|| / ||A^{-1}|| ≤ ||Ax|| ≤ ||A|| ||x||.
For (5), A^{-1}A = I so ||I|| = 1 ≤ ||A^{-1}|| ||A|| (using (2)), and hence ||A^{-1}|| ≥
1/||A||. For (6), let Aφ = λφ. Then |λ| ||φ|| = ||Aφ|| ≤ ||A|| ||φ||.

Remark 4.1. The fundamental property that ||AB|| ≤ ||A|| ||B|| for all A, B
is the key to using it to structure proofs. As a first example, consider
the above proof of ||x|| / ||A^{-1}|| ≤ ||Ax||. How is one to arrive at this proof? To
begin, rearrange so it becomes ||x|| ≤ ||A^{-1}|| ||Ax||. The top (upper) side of
such an inequality must come from splitting a product apart. This suggests
starting with ||A^{-1}Ax|| ≤ ||A^{-1}|| ||Ax||. Next observe the LHS is just ||x||.

The matrix norm is defined in a nonconstructive way. However, there
are a few special cases when the norm can be calculated:

• ||A||_∞ is calculable; it is the maximum row sum. It has value
  ||A||_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_ij|.

• ||A||_1 is calculable; it is the maximum column sum. It has value
  ||A||_1 = max_{1≤j≤n} Σ_{i=1}^n |a_ij|.

• If A = A^t is symmetric then ||A||_2 is calculable: ||A||_2 = max{|λ| : λ is
  an eigenvalue of A}.
• For general A_{N×N}, ||A||_2 is calculable. It is the largest singular value
  of A, or the square root of the largest eigenvalue of A^t A:
  ||A||_2 = ( max{λ : λ is an eigenvalue of A^t A} )^{1/2}.

Example 4.4. Let A be the 2 × 2 matrix

  A = [ +1  −2 ; −3  +4 ].

We calculate
  ||A||_∞ = max{|1| + |−2|, |−3| + |4|} = max{3, 7} = 7,
  ||A||_1 = max{|1| + |−3|, |−2| + |4|} = max{4, 6} = 6.
Since A is not symmetric, we can calculate the 2-norm either from the
singular values of A or directly from the definition of ||A||_2. For the 2 × 2 case
we can do the latter. Recall
  ||A||_2 = max{||Ax||_2 : ||x||_2 = 1}.
Every unit vector in the plane can be written as
  x = (cos θ, sin θ)^t for some θ.
We compute
  Ax = (cos θ − 2 sin θ, −3 cos θ + 4 sin θ)^t,
  ||Ax||_2² = (cos θ − 2 sin θ)² + (−3 cos θ + 4 sin θ)² ≡ f(θ).
Thus
  ||A||_2 = max_{0≤θ≤2π} √f(θ),
which is a calculus problem (Exercise 4.9).



Alternately we can compute

  A^t A = [ 10  −14 ; −14  20 ].

Then we calculate the eigenvalues of A^t A by
  0 = det [ 10 − λ  −14 ; −14  20 − λ ]  ⇒  (10 − λ)(20 − λ) − 14² = 0,
whereupon
  ||A||_2 = √( max{λ_1, λ_2} ).
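The calculations of Example 4.4 can be cross-checked numerically; the sketch below is an addition (NumPy assumed) and uses numpy.linalg.norm, whose ord = 1, ord = inf and ord = 2 options return the maximum column sum, maximum row sum and largest singular value, respectively:

    import numpy as np

    A = np.array([[1.0, -2.0],
                  [-3.0, 4.0]])

    print(np.linalg.norm(A, np.inf))   # 7.0, the maximum row sum
    print(np.linalg.norm(A, 1))        # 6.0, the maximum column sum

    # ||A||_2 = sqrt(largest eigenvalue of A^t A) = largest singular value
    eigs = np.linalg.eigvalsh(A.T @ A)
    print(np.sqrt(eigs.max()), np.linalg.norm(A, 2))   # the two values agree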

Exercise 4.9. Complete the above two calculations of ||A||2 .

Exercise 4.10. Calculate the 1, 2 and ∞ norms of
  A = [ 1  −3 ; −4  7 ]
and
  A = [ 1  −3 ; −3  7 ].
Exercise 4.11. Show that an orthogonal change of variables preserves the
2-norm: ||Ox||_2 = ||x||_2 if O^t O = I.

Exercise 4.12. Prove that, for any induced matrix norm,
  ||I|| = 1, and
  ||A^{-1}|| ≥ 1/||A||.

4.3.1 A few proofs


We next prove these claimed formulas.

Theorem 4.5 (Calculation of the 2-norm of a symmetric matrix). If
A = A^t is symmetric then ||A||_2 is given by
  ||A||_2 = max{|λ| : λ is an eigenvalue of A}.

Proof. If A is symmetric then it is diagonalizable by a real orthogonal
matrix³ O:
  A = O^t Λ O, where O^t O = I and Λ = diag(λ_i).
3 Recall that an orthogonal matrix is one where the columns are mutually orthonormal.

This implies Ot O = I.

We then have by direct calculation

  ||A||_2² = max_{x ∈ R^N, x ≠ 0} ( ||Ax||_2 / ||x||_2 )²
           = max_{x ∈ R^N, x ≠ 0} ⟨Ax, Ax⟩ / ⟨x, x⟩ = max_{x ∈ R^N, x ≠ 0} ⟨O^t Λ O x, O^t Λ O x⟩ / ⟨x, x⟩
           = max_{x ∈ R^N, x ≠ 0} ⟨Λ O x, Λ O x⟩ / ⟨x, x⟩.

Now change variables on the RHS by y = Ox, x = O^t y. An elementary
calculation (Exercise 4.11) shows that an orthogonal change of variables
preserves the 2-norm: ||Ox||_2 = ||x||_2. This gives

  ||A||_2² = max_{y ∈ R^N, y = Ox ≠ 0} ⟨Λy, Λy⟩ / ||x||_2² = max_{y ∈ R^N, y ≠ 0} ⟨Λy, Λy⟩ / ⟨y, y⟩
           = max_{y ∈ R^N, y ≠ 0} ( Σ_i λ_i² y_i² ) / ( Σ_i y_i² ) = (|λ|_max)².

The proof of the formulas for the 1 and ∞ norms is a calculation.

Theorem 4.6 (Matrix 1-norm and ∞-norm). We have

  ||A||_∞ = max_{1≤i≤N} Σ_{j=1}^N |a_ij|,
  ||A||_1 = max_{1≤j≤N} Σ_{i=1}^N |a_ij|.

Proof. Consider ||A||_1. Partition A by column vectors (so a_j denotes the
jth column) as
  A = [ a_1 | a_2 | a_3 | · · · | a_N ].
Then we have
  Ax = x_1 a_1 + · · · + x_N a_N.
Thus, by the triangle inequality
  ||Ax||_1 ≤ |x_1| · ||a_1||_1 + · · · + |x_N| · ||a_N||_1
          ≤ ( Σ_i |x_i| ) max_j ||a_j||_1 = ||x||_1 ( max_{1≤j≤N} Σ_{i=1}^N |a_ij| ).
i i=1

Dividing by ||x||_1 we thus have

  ||A||_1 ≤ max_{1≤j≤N} Σ_{i=1}^N |a_ij|.

To prove equality, we take j* to be the index (of the largest column vector)
for which max_j ||a_j||_1 = ||a_{j*}||_1 and choose x = e_{j*}. Then
  A e_{j*} = 1 · a_{j*} and
  ||A e_{j*}||_1 / ||e_{j*}||_1 = ||a_{j*}||_1 / ||e_{j*}||_1 = ||a_{j*}||_1 = max_{1≤j≤N} Σ_{i=1}^N |a_ij|.

We leave the proof of ||A||_∞ = max_{1≤i≤N} Σ_{j=1}^N |a_ij| as an exercise.

“He (Gill) climbed up and down the lower half of the rock over
and over, memorizing the moves... He says that ‘... going up and
down, up and down eventually... your mind goes blank and you
climb by well cultivated instinct’. ”
— J. Krakauer, from his book Eiger Dreams.
Exercise 4.13. Show that ||A||_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_ij|. Hint:

  |(Ax)_i| = | Σ_{j=1}^n a_ij x_j | ≤ Σ_{j=1}^n |a_ij| |x_j|
           ≤ ( max_j |x_j| ) Σ_{j=1}^n |a_ij| = ||x||_∞ · (sum of row i).

Exercise 4.14. Show that for A an N × N symmetric matrix
  ||A||²_Frobenius = trace(A^t A) = Σ_i λ_i².

Exercise 4.15. If || · || is a vector norm and U an N × N nonsingular


matrix, show that ||x||∗ := ||U x|| is a vector norm. When || · || = || · ||2 , find
a formula for the matrix norm induced by || · ||∗ .

4.4 Error, Residual and Condition Number

We [he and Halmos] share a philosophy about linear algebra:


we think basis-free, we write basis-free, but when the chips are
down we close the office door and compute with matrices like fury.

— Kaplansky, Irving, Paul Halmos: Celebrating 50 Years of


Mathematics.
“What is now proved was once only imagin’d.”
— W. Blake, The Marriage of Heaven and Hell, 1790-3.

If we solve Ax = b and produce an approximate solution x̂, then the
fundamental equation of numerical linear algebra, Ae = r, links error and
residual, where

  error: e = x − x̂,
  residual: r = b − Ax̂.

Recall that, while e = 0 if and only if r = 0, there are cases where small
residuals and large errors coexist. For example, the point P = (0.5, 0.7)
and the 2 × 2 linear system

x−y = 0
−0.8x + y = 1/2

are plotted in Figure 4.1.

Fig. 4.1 Lines L1, L2 and the point P.

The point P is close to both lines so the residual of P is small. However,
it is far from the solution (the lines’ intersection). We have seen the
following qualitative features of this linkage:

• If A is invertible then r = 0 if and only if e = 0.
• The residual is computable but e is not exactly computable in a useful
  sense since solving Ae = r for e is comparable to solving Ax = b for x.
• If A is well conditioned then ||r|| and ||e|| are comparable in size.
• If A is ill conditioned then ||r|| can be small while ||e|| is large.
• det(A) cannot be the right way to quantify this connection. Starting
  with Ae = r we have (αA)e = αr and rescaling can make det(αA) = 1
  using det(αA) = α^n det(A).

This section makes a precise quantitative connection between the size
of errors and residuals and, in the process, quantifies conditioning.

Definition 4.7. Let || · || be a matrix norm. Then the condition number
of the matrix A induced by the vector norm || · || is
  cond_{||·||}(A) := ||A^{-1}|| ||A||.

Usually the norm in question will be clear from the context in which
cond(A) occurs, so the subscript of the norm is usually omitted. The con-
dition number of a matrix is also often denoted by the Greek letter kappa:
  κ(A) = cond(A) = condition number of A.

Theorem 4.7 (Relative Error ≤ cond(A) × Relative Residual). Let
Ax = b and let x̂ be an approximation to the solution x. With r = b − Ax̂,
  ||x − x̂|| / ||x|| ≤ cond(A) ||r|| / ||b||.    (4.2)

Proof. Begin with Ae = r; then e = A^{-1}r. Thus,
  ||e|| = ||A^{-1}r|| ≤ ||A^{-1}|| ||r||.    (4.3)
Since Ax = b we also know ||b|| = ||Ax|| ≤ ||A|| ||x||. Dividing the smaller
side of (4.3) by the larger quantity and the larger side of (4.3) by the smaller
gives
  ||e|| / (||A|| ||x||) ≤ ||A^{-1}|| ||r|| / ||b||.
Rearrangement proves the theorem.

Equation (4.2) quantifies ill-conditioning: the larger cond(A) is, the


more ill-conditioned the matrix A.

Remark 4.2. The manipulations in the above proof are typical of ones in
numerical linear algebra and the actual result is a cornerstone of the field.

Note the pattern: we desire an inequality ||error|| ≤ ||terms||. Thus, we


begin with an equation error = product then take norms of both sides.

Example 4.5. ||I|| = 1 so cond(I) = 1 in any induced matrix norm. Simi-
larly, for an orthogonal matrix, cond_2(O) = 1.

Example 4.6. Let

  A = [ 1.01  0.99 ; 0.99  1.01 ];

then ||A||_∞ = 2. Since for any 2 × 2 matrix⁴

  [ a  b ; c  d ]^{-1} = (1/det(A)) [ d  −b ; −c  a ],

we can calculate A^{-1} exactly:

  A^{-1} = [ 25.25  −24.75 ; −24.75  25.25 ].

Hence, ||A^{-1}||_∞ = 50. Thus cond(A) = 100 and errors can be (at most)
100× larger than residuals.

Example 4.7. Let A be as above and b = (2, 2)^t:

  A = [ 1.01  0.99 ; 0.99  1.01 ],  b = (2, 2)^t,

so x = (1, 1)^t. Consider x̂ = (2, 0)^t. The error is e = x − x̂ = (−1, 1)^t and
||e||_∞ = 1. The residual of x̂ is
  r = b − Ax̂ = (−0.02, 0.02)^t.
As ||r||_∞ = 0.02 we see
  ||r|| / ||b|| = 0.02/2 = 0.01 and ||e|| / ||x|| = 1/1 = 1,
so the relative error is exactly 100× larger than the relative residual!
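Example 4.7 is easy to reproduce in a few lines; the following sketch is an addition (NumPy assumed) and confirms both cond_∞(A) = 100 and that the bound of Theorem 4.7 is attained here:

    import numpy as np

    A = np.array([[1.01, 0.99],
                  [0.99, 1.01]])
    b = np.array([2.0, 2.0])
    x = np.array([1.0, 1.0])          # exact solution
    x_hat = np.array([2.0, 0.0])      # the approximation from Example 4.7

    r = b - A @ x_hat                 # residual
    e = x - x_hat                     # error

    rel_err = np.linalg.norm(e, np.inf) / np.linalg.norm(x, np.inf)
    rel_res = np.linalg.norm(r, np.inf) / np.linalg.norm(b, np.inf)

    print(np.linalg.cond(A, np.inf))  # 100.0
    print(rel_err / rel_res)          # 100.0: relative error = cond(A) x relative residual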

Example 4.8 (The Hilbert Matrix). The N × N Hilbert matrix H_{N×N}
is the matrix with entries
  H_ij = 1/(i + j − 1), 1 ≤ i, j ≤ N.
This matrix is extremely ill conditioned even for quite moderate values
of N.
4 This formula for the inverse of a 2 × 2 matrix is handy for constructing explicit 2 × 2

examples with various features. It does not hold for 3 × 3 matrices.



Example 4.9. Let x = (1.0, 1.0)^t and

  A = [ 1.000  2.000 ; .499  1.001 ],  b = (3.00, 1.50)^t.

Given x̂ = (2.00, 0.500)^t we calculate r = b − Ax̂ = (0, 0.0015)^t, which is
“small” in both an absolute and relative sense. As in the previous example
we can find A^{-1} and then calculate ||A^{-1}||_∞. We find

  ||A||_∞ = 3,  ||A^{-1}||_∞ = || [ 333.67  −666.67 ; −166.33  333.33 ] ||_∞ = 1000.34.

Thus,
  cond(A) = 3001.02
and the relative residual can be 3000× smaller than the relative error. In-
deed, we find
  ||x − x̂||_∞ / ||x||_∞ = 1, and ||r||_∞ / ||b||_∞ = 0.0005.
Example 4.10. Calculate cond(HN ), for N = 2, 3, · · · , 13. (This is easily
done in Matlab.) Plot cond(H) vs N various ways and try to find its
growth rate. Do a literature search and find it.

Exercise 4.16. Suppose Ax = b and that x̂ is an approximation to x.
Prove the result of Theorem 4.7 that
  ||x − x̂|| / ||x|| ≤ cond_{||·||}(A) ||r|| / ||b||.
Prove the associated lower bound
  ||x − x̂|| / ||x|| ≥ [1/cond_{||·||}(A)] ||r|| / ||b||.
Hint: Think about which way the inequalities must go to have the error on
top.

Exercise 4.17. Let A be the following 2 × 2 symmetric matrix. Find the
eigenvalues of A, the 2-norm of A, and cond_2(A):
  A = [ 1  2 ; 2  4 ].

Exercise 4.18. Show that for any square matrix (not necessarily symmet-
ric) cond2 (A) ≥ |λ|max /|λ|min .

Exercise 4.19.

(1) If cond(A) = 10^6 and you solve Ax = b on a computer with 7 significant
    digits (base 10), what is the expected number of significant digits of
    accuracy of the solution?
(2) Let Ax = b and let x̂ be some approximation to x, e = x − x̂, r = b − Ax̂.
(3) Show that Ae = r and explain at least 3 places where this equation
    is useful or important in numerical linear algebra.
(4) Show that:
    ||e||/||x|| ≤ cond(A) ||r||/||b||.

4.5 Backward Error Analysis

One of the principal objects of theoretical research in my de-


partment of knowledge is to find the point of view from which the
subject appears in its greatest simplicity.
— Gibbs, Josiah Willard (1839–1903)

For many problems in numerical linear algebra results of the following


type have been proven by meticulously tracking through computer arith-
metic step by step in the algorithm under consideration. It has been verified
in so many different settings that it has become something between a meta
theorem and a philosophical principle of numerical linear algebra.
Basic Result of Backward Error Analysis.5 The result x̂ of
solving Ax = b by Gaussian elimination in finite precision arith-
metic subject to rounding errors is precisely the same as the exact
solution of a perturbed problem
(A + E)x̂ = b + f (4.4)
where
  ||E||/||A||, ||f||/||b|| = O(machine precision).
First we consider the effect of perturbations to the matrix A. Since the
entries in the matrix A are stored in finite precision, these errors occur even
if all subsequent calculations in Gaussian elimination were done in infinite
precision arithmetic.
5 This has been proven for most matrices A. There are a few rare types of matrices for

which it is not yet proven and it is an open question if it holds for all matrices (i.e., for
those rare examples) without some adjustments.

Theorem 4.8 (Effect of storage errors in A). Let A be an N × N ma-
trix. Suppose Ax = b and (A + E)x̂ = b. Then,
  ||x − x̂|| / ||x̂|| ≤ cond(A) ||E|| / ||A||.
Proof. The proof has a well defined strategy that we shall use in other
proofs:
Step 1: By subtraction get an equation for the error driven by the
perturbation:
  Ax = b
  −(Ax̂ = b − Ex̂)
  ------------------  (subtract)
  A(x − x̂) = Ex̂
  x − x̂ = A^{-1}Ex̂.
Step 2: Bound the error by the RHS:
  ||x − x̂|| = ||A^{-1}Ex̂|| ≤ ||A^{-1}|| · ||E|| · ||x̂||.
Step 3: Rearrange to write in terms of relative quantities and condition
numbers:
  ||x − x̂|| / ||x̂|| ≤ cond(A) ||E|| / ||A||.

Theorem 4.9 (Effect of storage errors in b). Let Ax = b and Ax̂ =
b + f. Then
  ||x − x̂|| / ||x|| ≤ cond(A) ||f|| / ||b||.

Proof. Since Ax̂ = Ax + f, x − x̂ = −A^{-1}f, so ||x − x̂|| ≤ ||A^{-1}|| ||f|| =
cond(A) ||f|| / ||A|| ≤ cond(A) (||f|| / ||b||) ||x||, because ||b|| ≤ ||A|| ||x||.

Remark 4.3 (Interpretation of cond(A)). When E is due to roundoff
errors, ||E||/||A|| = O(machine precision). Then these results say that
cond(A) tells you how many significant digits are lost (worst case) when
solving Ax = b. As an example, if machine precision carries 7 significant
digits, ||E||/||A|| = O(10^{-7}), and if cond(A) = 10^5 then x̂ will have at least
7 − 5 = 2 significant digits.
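This loss of digits is easy to observe with the Hilbert matrix of Example 4.8; the sketch below is an illustrative addition (NumPy assumed, the matrix built directly from its definition) that solves Hx = b with known solution x = (1, . . . , 1)^t in double precision (about 16 digits):

    import numpy as np

    for n in (4, 8, 12):
        i = np.arange(1, n + 1)
        H = 1.0 / (i[:, None] + i[None, :] - 1.0)   # Hilbert matrix, H_ij = 1/(i+j-1)
        x_exact = np.ones(n)
        b = H @ x_exact
        x_hat = np.linalg.solve(H, b)

        rel_err = np.linalg.norm(x_hat - x_exact) / np.linalg.norm(x_exact)
        digits_lost = np.log10(np.linalg.cond(H))
        print(n, rel_err, digits_lost)
    # The observed relative error is roughly 10**(digits_lost - 16),
    # consistent with the rule of thumb above.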

Other properties of cond(A):

• cond(A) ≥ 1 and cond(I) = 1.
• Scaling A does not influence cond(A):
  cond(αA) = cond(A), for any α ≠ 0.
• cond(A) depends on the norm chosen but usually it is of the same order
  of magnitude for different norms.
• cond(A) is not related to det(A). For example, scaling changes det(A)
  but not cond(A):
  det(αA) = α^n det(A) but cond(αA) = cond(A).
• If A is symmetric then
  cond_2(A) = |λ|_max / |λ|_min.
• If A is symmetric, positive definite and || · || = || · ||_2, then cond(A)
  equals the spectral condition number, λ_max/λ_min:
  cond_2(A) = λ_max / λ_min.
• cond(A) = cond(A^{-1}).
• We shall see in a later chapter that the error in eigenvalue and eigen-
  vector calculations is also governed by cond(A).

The most important other result involving cond(A) is for the perturbed
system when there are perturbations in both A and b.

4.5.1 The general case


Say what you know, do what you must, come what may.
— Sonja Kovalevsky, [Motto on her paper “On the Problem of
the Rotation of a Solid Body about a Fixed Point”].

We show next that the error in


(A + E)x̂ = b + f compared to the true system: Ax = b
is also governed by cond(A). This requires some technical preparation.

Lemma 4.1 (Spectral localization). For any N × N matrix B and any
matrix norm || · ||:
  |λ(B)| ≤ ||B||.

Proof. Bφ = λφ. Thus |λ| ||φ|| = ||Bφ|| ≤ ||B|| ||φ||.

This result holds for any matrix norm. Thus, various norms of A can be
calculated and the smallest used as an inclusion radius for the eigenvalues
of A.

Theorem 4.10. Let B be N × N. Then we have
  lim_{n→∞} B^n = 0 if and only if there exists a norm || · || with ||B|| < 1.

Proof. We prove ||B|| < 1 ⇒ ||B^n|| → 0. This is easy. The other direction
will be proven later.⁶ We have that ||B²|| = ||B · B|| ≤ ||B|| · ||B|| = ||B||².
By induction it follows that ||B^n|| ≤ ||B||^n and thus
  ||B^n|| ≤ ||B||^n → 0 as n → ∞.

We shall use the following special case of the spectral mapping theorem.

Lemma 4.2. The eigenvalues of (I − B)^{-1} are (1 − λ)^{-1} where λ is an
eigenvalue of B.

Proof. Let Bφ = λφ. Then, φ − Bφ = φ − λφ and (I − B)φ = (1 − λ)φ.
Inverting, we see that (1 − λ)^{-1} is an eigenvalue of (I − B)^{-1}. Working
backwards (with details left as an exercise) it follows similarly that (1 − λ)^{-1}
an eigenvalue of (I − B)^{-1} implies λ is an eigenvalue of B.

Theorem 4.11 (The Neumann Lemma). Let B be N × N, with
||B|| < 1. Then (I − B)^{-1} exists and
  (I − B)^{-1} = lim_{N→∞} Σ_{ℓ=0}^N B^ℓ.

Proof. IDEA OF PROOF: Just like summing a geometric series:

  S = 1 + α + α² + · · · + α^N
  αS = α + α² + · · · + α^N + α^{N+1}
  (1 − α)S = 1 − α^{N+1}.
6 Briefly: Exercise 4.22 shows that given a matrix B and any ε > 0, there exists a norm
|| · || with ||B|| within ε of spr(B). With this result, if there does not exist a norm with ||B|| < 1, then
there is a λ(B) with |λ| = spr(B) ≥ 1. Picking x = an eigenvector of λ, we calculate:
||B^n x|| = |λ|^n ||x||, which does not tend to 0.

To apply this idea, note that since |λ| ≤ ||B||, |λ| < 1. Further, λ(I − B) =
1 − λ(B) by the spectral mapping theorem.⁷ Since |λ(B)| < 1, λ(I − B) ≠ 0
and (I − B)^{-1} exists. We verify that the inverse is as claimed. To begin,
note that
  (I − B)(I + B + · · · + B^N) = I − B^{N+1}.
Since B^N → 0 as N → ∞,
  I + B + · · · + B^N = (I − B)^{-1}(I − B^{N+1}) = (I − B)^{-1} − (I − B)^{-1}B B^N → (I − B)^{-1}.
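The Neumann series is also easy to check numerically; the following sketch is an addition (NumPy assumed) that sums I + B + B² + · · · for a matrix scaled so that ||B||_2 < 1 and compares with (I − B)^{-1}:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 4
    B = rng.standard_normal((N, N))
    B *= 0.5 / np.linalg.norm(B, 2)        # rescale so that ||B||_2 = 0.5 < 1

    S = np.zeros((N, N))
    term = np.eye(N)
    for _ in range(60):                    # partial sums I + B + B^2 + ...
        S += term
        term = term @ B

    exact = np.linalg.inv(np.eye(N) - B)
    print(np.linalg.norm(S - exact))       # tiny: the series converges to (I - B)^{-1}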

As an application of the Neumann lemma we have the following.

Corollary 4.1 (Perturbation Lemma). Suppose A is invertible and
||A^{-1}|| ||E|| < 1. Then A + E is invertible and
  ||(A + E)^{-1}|| ≤ ||A^{-1}|| / (1 − ||A^{-1}|| ||E||).

Exercise 4.20. Prove this corollary.

The ingredients are now in place. We give the proof of the general case.

Theorem 4.12 (The General Case). Let
  Ax = b, (A + E)x̂ = b + f.
Assume A^{-1} exists and
  ||A^{-1}|| ||E|| = cond(A) ||E||/||A|| < 1.
Then
  ||x − x̂|| / ||x|| ≤ [ cond(A) / (1 − ||A^{-1}|| ||E||) ] ( ||E||/||A|| + ||f||/||b|| ).
7 An elementary proof: the eigenvalues λ(B) are roots of the polynomial
det(λI − B) = (−1)^N det((1 − λ)I − (I − B)).



Proof. The proof uses the same ideas but is a bit more delicate in the order
of steps. First,⁸
  Ax = b ⟺ (A + E)x = b + Ex,
  (A + E)x̂ = b + f,
  (A + E)e = Ex − f,
  e = (A + E)^{-1}(Ex − f),
  ||e|| ≤ ||(A + E)^{-1}|| (||E|| ||x|| + ||f||).
Now
  Ax = b, so x = A^{-1}b, ||x|| ≤ ||A^{-1}|| ||b||,
  and ||b|| ≤ ||A|| ||x||, so ||x|| ≥ ||b|| / ||A||.
Thus,
  ||e|| / ||x|| ≤ ||(A + E)^{-1}|| ( ||E|| + ||f|| / ||x|| )
              ≤ ||(A + E)^{-1}|| ( ||E|| + ||A|| ||f|| / ||b|| )
              ≤ ||A|| ||(A + E)^{-1}|| ( ||E|| / ||A|| + ||f|| / ||b|| )    (factor out ||A||).
Finally, rearrange terms after using
  ||(A + E)^{-1}|| ≤ ||A^{-1}|| / (1 − ||A^{-1}|| ||E||).

Remark 4.4 (How big is the RHS?). If ||A^{-1}|| ||E|| ≪ 1, we can esti-
mate (e.g., 1/(1 − α) ≈ 1 + α)
  1 / (1 − ||A^{-1}|| ||E||) ≈ 1 + ||A^{-1}|| ||E|| = 1 + small,
so that up to O(||A^{-1}|| ||E||) the first order error is governed by cond(A).
8 The other natural way to start is to rewrite
  Ax = b ⟺ (A + E)x = b + Ex,
  Ax̂ = b + f − Ex̂,
  e = A^{-1}(Ex̂ − f).
Since there are 2 natural starting points, the strategy is to try one and, if it fails, figure
out why, then try the other.

Remark 4.5 (Non-symmetric matrices). The spectral condition num-
ber can be deceptive for non-symmetric matrices. Since ||A|| ≥ |λ(A)|
for each of the eigenvalues λ(A) of A, ||A|| ≥ |λ|_max(A) and ||A^{-1}|| ≥
|λ(A^{-1})|_max = 1/|λ(A)|_min. We thus have
  cond(A) ≥ |λ(A)|_max / |λ(A)|_min,
i.e., spectral condition number ≤ condition number. For example, for A
and B below, cond_2(A) = cond_2(B) = O(10^5), but we calculate

  A = [ 1  −1 ; 1  −1.00001 ],  and  B = [ 1  −1 ; −1  1.00001 ],

  |λ|_max(A) / |λ|_min(A) ∼ 1, while |λ|_max(B) / |λ|_min(B) ∼ 4 · 10^5.

There are many other results related to Theorem 4.12. For example, all
the above upper bounds on relative errors can be complemented by lower
bounds, such as the following.

Theorem 4.13. Let Ax = b. Given x̂ let r = b − Ax̂. Then,
  ||x − x̂|| / ||x|| ≥ [1/cond(A)] ||r|| / ||b||.
Exercise 4.21. Prove the theorem.

The relative distance of a matrix A to the closest non-invertible ma-


trix is also related to cond(A). A proof due to Kahan9 is presented in
Exercise 4.22.

Theorem 4.14 (Distance to nearest singular matrix). Suppose A^{-1}
exists. Then,
  1/cond(A) = min{ ||A − B|| / ||A|| : det(B) = 0 }.

Exercise 4.22. Theorem 4.14 is a remarkable result. One proof due to


Kahan depends on ingenious choices of particular vectors and matrices.
9 W. Kahan, Numerical linear algebra, Canadian Math. Bulletin, 9 (1966), pp. 757–

801. Kahan attributes the theorem to “Gastinel” without reference, but does not seem
to be attributing the proof. Possibly the Gastinel reference is: Noël Gastinel, Matrices
du second degré et normes générales en analyse numérique linéaire, Publ. Sci. Tech.
Ministére de l’Air Notes Tech. No. 110, Paris, 1962.

(1) Show that if B is singular, then
      1/cond(A) ≤ ||A − B|| / ||A||.
    Hint: If B is singular, there is an x so that Bx = 0 and (A − B)x = Ax.
    Hence ||A − B|| ≥ ||(A − B)x|| / ||x||.
(2) Show that there is a matrix B with
      1/cond(A) = ||A − B|| / ||A||.
    Hint: Show that it is possible to choose a vector y so that ||A^{-1}y|| =
    ||A^{-1}|| ||y|| ≠ 0, set w = (A^{-1}y)/||A^{-1}y||_2² and set B = A − yw^t.¹⁰

Example 4.11 (Perturbations of the right hand side b). Consider
the linear system with exact solution (1, 1):
  3x_1 + 4x_2 = 7
  5x_1 − 2x_2 = 3
and let f = (0.005, −0.009)^t so the RHS is changed to b + f = (7.005, 2.991).
The solution is now
  x̂ = (0.999, 1.002)^t.
Since a small change in the RHS produced a corresponding small change in
the solution, we have evidence that
  [ 3  4 ; 5  −2 ]
is well-conditioned.
Now modify the matrix to get the system (with exact solution still
(1, 1)^t)
  x_1 + x_2 = 2
  1.01x_1 + x_2 = 2.01.
This system is ill-conditioned. Indeed, change the RHS a little bit to
  b + f = (2.005, 2.005)^t.
The new solution x̂ is changed a lot to
  x̂ = (0, 2.005)^t.
10 Recall that w^t y = ⟨w, y⟩ is a scalar but that yw^t, the “outer product,” is a matrix.

Example 4.12 (Changes in the coefficients). Suppose the coefficients
of the system (solution = (1, 1)^t)
  1x_1 + 1x_2 = 2
  1.01x_1 + 1x_2 = 2.01
are changed slightly to read
  1x_1 + 1x_2 = 2
  1.0001x_1 + 1x_2 = 2.001.
Then, the exact solution changes wildly to
  x̂ = (100, −98)^t.
We still have a very small residual in the perturbed system:
  r_1 = 2 − (1 · 100 + 1 · (−98)) = 0,
  r_2 = 2.001 − (1.0001 · 100 + 1 · (−98)) = −0.009.

Example 4.13 (cond(A) and det(A) not related). Let ε denote a
small positive number. The matrix A below is ill conditioned and its de-
terminant has magnitude ε, thus near zero:
  [ 1  1 ; 1 + ε  1 ].
Rescaling the first row gives
  [ ε^{-1}  ε^{-1} ; 1 + ε  1 ].
This matrix, for an equivalent linear system, has a determinant of magnitude 1 but cond(A) =
2ε^{-1}(ε^{-1} + 1 + ε), which can be high for ε small.

To summarize:

• If cond(A) = 10^t then at most t significant digits are lost when solving
  Ax = b.
• cond(A) = ||A|| ||A^{-1}|| is the correct measure of ill-conditioning; in par-
  ticular, it is scale invariant whereas det(A) is not.
• For 2 × 2 linear systems representing two lines in the x_1, x_2 plane,
  cond(A) is related to the angle between the lines.
• The effects of roundoff errors and finite precision arithmetic can be
  reduced to studying the sensitivity of the problem to perturbations.

Exercise 4.23. Let Ax = b be a square linear system and suppose you


are given an approximate solution. Define the error and residual. State
and prove an inequality relating the relative error, relative residual and
cond(A).

Exercise 4.24. If A is a 2 × 2 matrix that is symmetric and positive def-
inite then cond_2(A) = λ_max(A)/λ_min(A). If A is not symmetric there
can be very little connection between the condition number and the so-
called spectral condition number. Your goal in this exercise is to find
an example illustrating this. Specifically, find a 2 × 2 matrix A with
|λ|_max(A)/|λ|_min(A) = O(1), in other words of moderate size, but cond_2(A)
very, very large, cond_2(A) ≫ 1. HINT: The matrix obviously cannot be
symmetric. Try writing down the matrix in Jordan canonical form
  A = [ a  b ; 0  c ].

Exercise 4.25. If cond(A) = 10^5 and one solves Ax = b on a computer
with 8 significant digits (base 10), what is the expected number of significant
digits of accuracy in the answer? Explain how you got the result.

Exercise 4.26. Often it is said that “The set of invertible matrices is an


open set under the matrix norm.” Formulate this sentence as a mathemat-
ical theorem and prove it.

Exercise 4.27. For B an N × N matrix, show that for a > 0 small enough,
I − aB is invertible. What is the infinite sum in that case:
  Σ_{n=0}^∞ a^n B^n ?

Exercise 4.28. Verify that the determinant gives no insight into condi-
tioning. Calculate the determinant of the coefficient matrix of the system
1x1 + 1x2 = 2
10.1x1 + 10x2 = 20.1.
Recalculate after the first equation is multiplied by 10:
10x1 + 10x2 = 20
10.1x1 + 10x2 = 20.1.

For more information see the articles and books of Wilkinson [W61],
[W63].

Chapter 5

The MPP and the Curse of Dimensionality

“What we know is not much. What we do not know is im-


mense.”
— Pierre-Simon de Laplace (1749–1827) (Allegedly his last
words). From: DeMorgan’s Budget of Paradoxes.

5.1 Derivation

ceiiinosssttuv, (“Ut tensio, sic vis.”).


— Robert Hooke

The Poisson problem is the model problem in mechanics and applied


mathematics and the discrete Poisson problem is the model problem in
numerical linear algebra. Since practitioners of the mad arts of numerical
linear algebra will spend much of their (professional) lives solving it, it is
important to understand where it comes from. Suppose “something” is
being studied and its distribution is not uniform. Thus, the density of that
“something” will be variable in space and possibly change with time as well.
Thus, let
u(x, t) := density, where x = (x1 , x2 , x3 ).
For example, if something is heat, then the
heat density = ρCp T (x, t),
where ρ = material density, Cp = specific heat and T (x, t) = temperature
at point x and time t.
To avoid using “something” too much, call it Q; keep the example of
heat in mind. Since Q is variable, it must be changing and hence undergoing
flow with “flux” defined as its rate of flow. Thus, define

F := flux of Q at a point x at a time t.

89

Assumption: Q is conserved.

The mathematical realization of conservation is:


For any region B
d
{total amount of Q in region B} =
dt
Total flux of Q through ∂B +
Total contribution of any sources or sinks of Q inside B.
Let, thus
f (x, t) := sources or sinks of Q per unit volume.
Mathematically, conservation becomes
  (d/dt) ∫_B u(x, t) dx + ∫_{∂B} F(x, t) · n̂ dσ = ∫_B f(x, t) dx,
where n̂ is the outward unit normal to B. We shall fix x and fix the region B
to be the ball about x of radius ε (denoted Bε (x)). Recall the following fact
from calculus about continuous functions as well as the divergence theorem.

Lemma 5.1 (Averaging of continuous functions). If v(x) is a con-
tinuous function then
  lim_{ε→0} (1/vol(B_ε)) ∫_{B_ε} v(x′) dx′ = v(x).

The divergence theorem applies in regions with smooth boundaries, with


polyhedral boundaries, with rough boundaries without cusps, and many
more regions. A domain must have a very exotic boundary for the diver-
gence theorem not to hold in it. Usually, in applied math the question of
“How exotic?” is sidestepped, as we do here, by just assuming the diver-
gence theorem holds for the domain. As usual, define a “regular domain”
as one to which the divergence theorem applies. We shall use it for spheres,
which are certainly regular domains.

Theorem 5.1 (The Divergence Theorem). If B is a regular domain
(in particular, B has no internal cusps) and if v(x) is a C¹ vector function,
then
  ∫_B div v dx = ∫_{∂B} v · n̂ dσ.
B ∂B

The divergence theorem implies
  ∫_{∂B} F · n̂ dσ = ∫_B div F dx
and thus (after dividing by vol(B_ε)) conservation becomes:
  (d/dt) (1/vol(B_ε)) ∫_{B_ε} u(x, t) dx + (1/vol(B_ε)) ∫_{B_ε} div(F) dx = (1/vol(B_ε)) ∫_{B_ε} f dx.
Letting ε → 0 and using Lemma 5.1 gives the equation
  ∂u(x, t)/∂t + div(F) = f.    (5.1)
This is one equation for four variables (u, F1 , F2 , F3 ). A connection between

flux F and density is needed. One basic description of physical phenomena
due to Aristotle is

“Nature abhors a vacuum”.

This suggests that Q often will flow from regions of high concentration
to low concentration. For example, Fourier’s law of heat conduction and
Newton’s law of cooling state that
  Heat flux = −k∇T,
where k is the (material dependent) thermal conductivity. The analogous
assumption for Q is

Assumption: (Q flows downhill) The flux of Q is given by

  F = −k∇u.

Inserting this for F in (5.1) closes the system for u(x, t):
  ∂u/∂t − div(k∇u) = f(x, t).
Recall that
  Δu = div grad u = ∂²u/∂x_1² + ∂²u/∂x_2² + ∂²u/∂x_3².
For simple materials, the value of the material parameter, k, can be taken
constant. Thus,
  ∂u/∂t − kΔu = f(x, t).

If the process is at equilibrium (i.e., u = u(x), independent of t, and f =
f(x)) we have the model problem: find u(x) defined on a domain Ω in
R^d (d = 1, 2 or 3) satisfying

  −Δu = f(x) inside Ω,  u = 0 on the boundary ∂Ω.    (5.2)

Remark 5.1. The boundary condition, u = 0 on ∂Ω, is the clearest one
that is interesting; it can easily be modified. Also, in R¹, R² and R³ we
have
  Δu = d²u/dx² in R¹,
  Δu = ∂²u/∂x_1² + ∂²u/∂x_2² in R²,
  Δu = ∂²u/∂x_1² + ∂²u/∂x_2² + ∂²u/∂x_3² in R³.

Problem (5.2) is the model problem. What about the time dependent
problem, however? One common way to solve it is by the “method of lines”
or time stepping. Pick a Δt (small) and let u^n(x) ∼ u(x, t)|_{t=nΔt}.
Then
  ∂u/∂t (t_n) ≈ (u^n(x) − u^{n−1}(x)) / Δt.
Replacing ∂u/∂t by the difference approximation on the above RHS gives a
sequence of problems
  (u^n − u^{n−1}) / Δt − kΔu^n = f^n
or, solving for n = 1, 2, 3, · · ·,
  −Δu^n + (1/(kΔt)) u^n = (1/k) f^n + (1/(kΔt)) u^{n−1},

which is a sequence of many shifted Poisson problems.
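As a rough computational sketch of what this implies (an addition, not the book's algorithm; it borrows the 1D tridiagonal discrete Laplacian developed in the next section and assumes SciPy), each time step costs one solve with the same shifted matrix, so the matrix can be factored once and reused:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Backward-Euler time stepping for u_t - k u_xx = f on (0,1), u = 0 at both ends.
    N, k, dt, nsteps = 100, 1.0, 1e-3, 1000
    h = 1.0 / (N + 1)
    x = h * np.arange(1, N + 1)

    minus_laplacian = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N)) / h**2
    shifted = (minus_laplacian + sp.identity(N) / (k * dt)).tocsc()
    solve = spla.factorized(shifted)          # factor once, reuse every step

    u = np.sin(np.pi * x)                     # initial condition
    f = np.ones(N)                            # a steady source term, for illustration
    for n in range(nsteps):                   # thousands of shifted Poisson solves
        u = solve(f / k + u / (k * dt))
    print(u.max())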


We shall see that:

• Solving a time dependent problem can require solving the Poisson prob-
lem (or its ilk) thousands or tens of thousands of times.
• The cost of solving the Poisson problem increases exponentially in the
dimension from 1D to 2D to 3D.

5.2 1D Model Poisson Problem

Mechanics is the paradise of the mathematical sciences, because


by means of it one comes to the fruits of mathematics.
— da Vinci, Leonardo (1452–1519), Notebooks, v. 1, ch. 20.

The 1D Model Poisson Problem (MPP henceforth) is to find u(x)
defined on an interval Ω = (a, b) satisfying
  −u″(x) = f(x), a < x < b,
  u(a) = g_1, u(b) = g_2,    (5.3)
where f(x), g_1 and g_2 are given. If, for example, g_1 = g_2 = 0, then u(x)
describes the deflection of an elastic string weighted by a distributed load
f(x). As noted in Section 5.1, u(x) can also be the equilibrium temperature
distribution in a rod with external heat sources f(x) and fixed temperatures
at the two ends. Although it is easy to write down the solution of (5.3), there
are many related problems that must be solved for which exact solutions
are not attainable. Thus, we shall develop methods for solving all such
problems.

5.2.1 Difference approximations


“Finite arithmetical differences have proved remarkably suc-
cessful in dealing with differential equations, ... in this book it
is shown that similar methods can be extended to the very compli-
cated system of differential equations which express the changes in
the weather.”
— Richardson, Lewis Fry (1881–1953), page 1 from the book
Lewis F. Richardson, Weather prediction by numerical process,
Dover, New York, 1965. (originally published in 1922)

Recall from basic calculus that
  u′(a) = lim_{h→0} (u(a + h) − u(a))/h = lim_{h→0} (u(a) − u(a − h))/h.
Thus, we can approximate by taking h nonzero but small, as in:
  u′(a) ≈ (u(a + h) − u(a))/h =: D_+u(a),
  u′(a) ≈ (u(a) − u(a − h))/h =: D_−u(a).

Fig. 5.1 A curve with tangent and two chords (slopes D_+u(a), D_−u(a) and D_0u(a)).

Graphically, we visualize these approximations as slopes of secant lines
approximating the sought slope of the tangent line, as in Figure 5.1.
It seems clear that often one of D_+u(a), D_−u(a) will underestimate
u′(a) and the other overestimate u′(a). Thus averaging is expected to
increase accuracy (and indeed it does). Define thus
  D_0u(a) = (D_+u(a) + D_−u(a))/2 = (u(a + h) − u(a − h))/(2h).

To reduce the model BVP to a finite set of equations, we need an approxi-
mation to u″ in (5.3). The standard one is
  u″(a) ≈ D_+D_−u(a) = (u(a + h) − 2u(a) + u(a − h))/h².

The accuracy of each approximation is found by using Taylor series.¹ The
accuracy is (for smooth u)
  u′(a) = D_+u(a) + O(h),
  u′(a) = D_−u(a) + O(h),
  u′(a) = D_0u(a) + O(h²),
  u″(a) = D_+D_−u(a) + O(h²).
Here, for example,
  error in difference approximation = u″(a) − D_+D_−u(a) = O(h²).
The expression f(h) = O(h²) means that there is a constant C so that if
h is small enough, then |f(h)| ≤ Ch². If h is cut in half, then |f(h)| is
cut by approximately a factor of four; and if h is cut by a factor of 10, then |f(h)| is cut by
approximately a factor of 100.
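These rates are easy to observe; the sketch below is an addition (NumPy assumed) that applies D_+, D_0 and D_+D_− to u(x) = sin x at a = 1 and halves h repeatedly:

    import numpy as np

    u, du, ddu = np.sin, np.cos, lambda x: -np.sin(x)
    a = 1.0

    for h in [0.1 / 2**kk for kk in range(5)]:
        Dp   = (u(a + h) - u(a)) / h                      # D_+ u(a), error O(h)
        D0   = (u(a + h) - u(a - h)) / (2 * h)            # D_0 u(a), error O(h^2)
        DpDm = (u(a + h) - 2 * u(a) + u(a - h)) / h**2    # D_+D_- u(a), error O(h^2)
        print(h, abs(Dp - du(a)), abs(D0 - du(a)), abs(DpDm - ddu(a)))
    # Halving h roughly halves the first error and quarters the other two.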

5.2.2 Reduction to linear equations


Although this may seem a paradox, all exact science is domi-
nated by the idea of approximation.
— Russell, Bertrand (1872–1970), in W. H. Auden and L. Kro-
nenberger (eds.) The Viking Book of Aphorisms, New York: Viking
Press, 1966.

Divide [a, b] into N + 1 equal subintervals with breakpoints denoted x_j.
Thus, we define
  h := (b − a)/(N + 1), x_j = a + jh, j = 0, 1, · · · , N + 1,
so we have the subdivision
  a = x_0 < x_1 < x_2 < · · · < x_N < x_{N+1} = b.
At each meshpoint x_j we will compute a u_j ∼ u(x_j). We will, of course,
need one equation for each variable, meaning one equation for each mesh-
point. Approximate −u″ = f(x) at each x_j by using D_+D_−u_j:
  −D_+D_−u_j = f(x_j),
or equivalently
  −(u_{j+1} − 2u_j + u_{j−1})/h² = f(x_j), for j = 1, 2, 3, · · ·, N.
1 See any general numerical analysis book for this; it is not hard but would delay our

presentation to take this detour.



Thus, the system of linear equations is
  u_0 = g_0,
  −u_{j+1} + 2u_j − u_{j−1} = h²f(x_j), j = 1, 2, . . . , N,
  u_{N+1} = g_1.
Writing this out is instructive. It is
  1u_0                        = g_0
  −1u_0 + 2u_1 − 1u_2         = h²f(x_1)
  −1u_1 + 2u_2 − 1u_3         = h²f(x_2)
    ···                         ···
  −1u_{N−1} + 2u_N − u_{N+1}  = h²f(x_N)
  1u_{N+1}                    = g_1.
This is N + 2 equations in N + 2 variables:

  [  1   0   0  · · ·  0   0 ] [ u_0     ]   [ g_0        ]
  [ −1   2  −1  · · ·  0   0 ] [ u_1     ]   [ h²f(x_1)   ]
  [  0   ⋱   ⋱    ⋱        ⋮ ] [   ⋮     ] = [    ⋮       ]
  [  ⋮        ⋱   ⋱    ⋱   0 ] [   ⋮     ]   [    ⋮       ]
  [  0  · · ·  0  −1   2  −1 ] [ u_N     ]   [ h²f(x_N)   ]
  [  0   0  · · ·  0   0   1 ] [ u_{N+1} ]   [ g_1        ]

The first and last equations can be eliminated (or not, as you prefer) to give

  [  2  −1   0  · · ·  0 ] [ u_1     ]   [ h²f(x_1) + g_0 ]
  [ −1   2  −1           ] [ u_2     ]   [ h²f(x_2)       ]
  [  0   ⋱   ⋱    ⋱    ⋮ ] [   ⋮     ] = [    ⋮           ]
  [  ⋮        ⋱   ⋱    0 ] [   ⋮     ]   [    ⋮           ]
  [          −1   2  −1 ] [ u_{N−1} ]   [ h²f(x_{N−1})   ]
  [  0  · · ·  0  −1   2 ] [ u_N     ]   [ h²f(x_N) + g_1 ]
Because of its structure this matrix is often written as A = tridiag
(−1, 2, −1). The first important question for A is:

Does this linear system have a solution?

We will investigate invertibility of A.

Lemma 5.2 (Observation about averaging). Let a be the average of
x and y,
  a = (x + y)/2;

then a must be between x and y:

  If x < y, then x < a < y,
  If x > y, then x > a > y, and
  If x = y, then a = x = y.

More generally, the same holds for weighted averages with positive
weights: if
  a = αx + βy
where
  α + β = 1, α ≥ 0, β ≥ 0,
then a must be between x and y.

Exercise 5.1. Prove this lemma about averaging.

We will use this observation about averaging to prove that A−1 exists.

Theorem 5.2. Let A = tridiag(−1, 2, −1). Then A^{-1} exists.

Proof. Suppose not. Then Au = 0 has a nonzero solution u. Let u_J be the
component of u that is largest in absolute value:
  |u_J| = max_j |u_j| ≡ u_max.
We can also assume u_J > 0; if u_J < 0 then note that A(−u) = 0 has Jth
component (−u_J) > 0. Then, if J is not 1 or N, the Jth equation in Au = 0
is
  −u_{J+1} + 2u_J − u_{J−1} = 0
or
  u_J = (u_{J+1} + u_{J−1})/2,
implying that u_J is between u_{J−1} and u_{J+1}. Thus either they are all zero
or
  u_J = u_{J−1} = u_{J+1} ≡ u_max.
Continuing across the interval (a, b) we get
  u_1 = u_2 = . . . = u_N ≡ u_max.
Consider the equation at x_1: 2u_1 − u_2 = 0. Thus 2u_max − u_max = 0, so
u_max ≡ 0 and u ≡ 0. We leave the cases J = 1 and J = N as
exercises.

Exercise 5.2. An alternative proof that the matrix given by A =
tridiag(−1, 2, −1) is invertible involves a direct calculation of its determi-
nant to show it is not zero. Use row-reduction operations and induction to
show that det(A) = n + 1 ≠ 0.

Remark 5.2. The eigenvalues of the matrix A = tridiag(−1, 2, −1) have,
remarkably, been calculated exactly. They are
  λ_n = 4 sin²( nπ/(2(N + 1)) ), n = 1, . . . , N.
Thus, λ_max(A) ≈ 4 and λ_min ≈ constant · h², so that
  cond_2(A) = O(h^{-2}).

Exercise 5.3. Assume that λ_n = 4 sin²( nπ/(2(N + 1)) ), n = 1, . . . , N. From this
show that cond_2(A) = O(h^{-2}) and calculate the hidden constant.

Exercise 5.4. This exercise will calculate the eigenvalues of the 1D matrix
A = tridiag(−1, 2, −1) exactly, based on methods for solving difference
equations. If Au = λu then we have the difference equation
  u_0 = 0,
  −u_{j+1} + 2u_j − u_{j−1} = λu_j, j = 1, 2, . . . , N,
  u_{N+1} = 0.
Solutions to difference equations of this type are power functions. It is
known that the exact solution to the above is
  u_j = C_1 R_1^j + C_2 R_2^j
where R_{1/2} are the roots of the quadratic equation
  −R² + 2R − 1 = λR.
For λ to be an eigenvalue this quadratic equation must have two real roots
and there must be nonzero values of C_{1/2} for which u_0 = u_{N+1} = 0. Now
find the eigenvalues!

Exercise 5.5. Consider the 1D convection diffusion equation (CDEqn for
short): for ε > 0 a small number, find u(x) defined on an interval Ω = (a, b)
satisfying
  −εu″(x) + u′(x) = f(x), a < x < b,
  u(a) = g_1, u(b) = g_2,    (5.4)
where f(x), g_1 and g_2 are given. Let u″(a), u′(a) be replaced by the differ-
ence approximations
  u″(a) ≈ D_+D_−u(a) = (u(a + h) − 2u(a) + u(a − h))/h²,
  u′(a) ≈ D_0u(a) = (u(a + h) − u(a − h))/(2h).
With these approximations, the CDEqn is reduced to a linear system in
the same way as the MPP. Divide [a, b] into N + 1 subintervals, h := (b − a)/(N + 1),
  x_j = a + jh, j = 0, 1, · · · , N + 1,
  a = x_0 < x_1 < x_2 < · · · < x_N < x_{N+1} = b.
At each meshpoint x_j we will compute a u_j ∼ u(x_j). We will, of course,
need one equation for each variable, meaning one equation for each mesh-
point. Approximate the equation at each x_j by
  −εD_+D_−u_j + D_0u_j = f(x_j), for j = 1, 2, 3, · · ·, N.
(a) Find the system of linear equations that results. (b) Investigate invert-
ibility of the matrix A that results. Prove invertibility under the condition
  Pe := h/(2ε) < 1.
Pe is called the cell Peclet number.

Exercise 5.6. Repeat the analysis of the 1D discrete CDEqn from the last
exercise. This time use the approximation
  u′(a) ≈ D_−u(a) = (u(a) − u(a − h))/h.

5.2.3 Complexity of solving the 1D MPP

To solve the model problem we need only store a tridiagonal matrix A and
the RHS b, then solve Au = b using tridiagonal Gaussian elimination.
Storage Costs: ∼ 4h^{-1} real numbers: 4 × N = 4N real numbers ∼ 4h^{-1}.
Solution Costs: ∼ 5h^{-1} FLOPs: 3(N − 1) floating point operations for
elimination, 2(N − 1) floating point operations for backsubstitution.

This is a perfect result: the cost in both storage and floating point
operations is proportional to the resolution sought. If we want to see the
solution on scales 10× finer (so h ⇐ h/10) the total cost increases by a
factor of 10.
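A minimal sketch of this O(N) process (an addition; NumPy assumed, the helper name solve_1d_mpp is mine, and the elimination written out is the standard tridiagonal "Thomas" algorithm) for −u″ = f on (0, 1) with u(0) = u(1) = 0:

    import numpy as np

    def solve_1d_mpp(f, N):
        """Solve -u'' = f on (0,1), u(0) = u(1) = 0, with A = tridiag(-1,2,-1)."""
        h = 1.0 / (N + 1)
        x = h * np.arange(1, N + 1)
        lower = -np.ones(N); diag = 2.0 * np.ones(N); upper = -np.ones(N)
        rhs = h**2 * f(x)
        # Forward elimination: about 3(N-1) floating point operations
        for j in range(1, N):
            m = lower[j] / diag[j - 1]
            diag[j] -= m * upper[j - 1]
            rhs[j] -= m * rhs[j - 1]
        # Back substitution: about 2(N-1) floating point operations
        u = np.empty(N)
        u[-1] = rhs[-1] / diag[-1]
        for j in range(N - 2, -1, -1):
            u[j] = (rhs[j] - upper[j] * u[j + 1]) / diag[j]
        return x, u

    # Check against the exact solution of -u'' = pi^2 sin(pi x): u = sin(pi x)
    x, u = solve_1d_mpp(lambda x: np.pi**2 * np.sin(np.pi * x), N=100)
    print(np.max(np.abs(u - np.sin(np.pi * x))))   # O(h^2) discretization error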

5.3 The 2D MPP

The two-dimensional model problem is the first one that reflects some
complexities of real problems. The domain is taken to be the unit square
(to simplify the problem), Ω = (0, 1) × (0, 1). The problem is, given f(x, y)
and g(x, y), to approximate the solution u(x, y) of

  −Δu = f(x, y), in Ω,
  u(x, y) = g(x, y), on ∂Ω.    (5.5)

You can think of u(x, y) as the deflection of a membrane stuck at its edges
and loaded by f(x, y). Figure 5.2 below gives a solution where g(x, y) = 0
and where f(x, y) > 0 and so pushes up on the membrane.

Fig. 5.2 An example solution of the 2D MPP.

Here we use (x, y) instead of (x_1, x_2) because it is more familiar, and we will
take g(x, y) ≡ 0 (to simplify the notation). Recall that
  −Δu = −(u_xx + u_yy).
Different boundary conditions and more complicated domains and opera-
tors are important and interesting. However, (5.5) is the important first
step to understand so we consider only (5.5) in this section.
To reduce (5.5) to a finite set of linear equations, we need to introduce
a mesh and approximate u_xx and u_yy as in the 1D problem by their second

differences in the x and y directions, respectively:
  u_xx(a, b) ≈ (u(a + Δx, b) − 2u(a, b) + u(a − Δx, b))/Δx²,    (5.6)
  u_yy(a, b) ≈ (u(a, b + Δy) − 2u(a, b) + u(a, b − Δy))/Δy².    (5.7)
To use these we introduce a mesh on Ω. For simplicity, take a uniform mesh
with N + 1 points in both directions. Choose thus Δx = Δy = 1/(N + 1) =: h.
Then set
  x_i = ih, y_j = jh, i, j = 0, 1, . . . , N + 1.
We let u_ij denote the approximation to u(x_i, y_j) we will compute at each
mesh point. A 10 × 10 mesh (h = 1/10) is depicted in Figure 5.3.

Fig. 5.3 A coarse mesh on the unit square, with interior nodes indicated by larger dots
and boundary nodes by smaller ones.

To have a square linear system, we need one equation for each variable.
There is one unknown (uij ) at each mesh point on Ω. Thus, we need one

equation at each mesh point. The equation for each mesh point on the
boundary is clear:
  u_ij = g(x_i, y_j) (here g ≡ 0) for each (x_i, y_j) on ∂Ω.    (5.8)
Thus, we need an equation for each (x_i, y_j) inside Ω. For a typical (x_i,
y_j) inside Ω we use the approximations (5.6) and (5.7). This gives
  −( (u_{i+1,j} − 2u_ij + u_{i−1,j})/h² + (u_{i,j+1} − 2u_ij + u_{i,j−1})/h² ) = f(x_i, y_j)    (5.9)
for all (x_i, y_j) inside of Ω.
The equations (5.8) and (5.9) give a square (N + 2)² × (N + 2)² linear
system for the u_ij’s. Before developing the system, we note that (5.9) can
be simplified to read
  −u_{i+1,j} − u_{i−1,j} + 4u_ij − u_{i,j+1} − u_{i,j−1} = h²f(x_i, y_j).
This is often denoted using the “difference molecule” represented by Fig-
ure 5.4,
  −u(N) − u(S) + 4u(P) − u(E) − u(W) = h²f(P),

Fig. 5.4 A difference molecule with weights: 4 at the center and −1 at each of the four neighbors.

and by Figure 5.5 using the “compass” notation, where P is the mesh point
and N, S, E and W are the mesh points immediately above, below, to the
right and to the left of the point P.
The equation, rewritten in terms of the stencil notation, becomes
  −u(N) − u(S) + 4u(C) − u(E) − u(W) = h²f(C).
The discrete Laplacian, denoted Δ_h, in 2D is thus
  −Δ_h u_ij := (−u_{i,j+1} − u_{i,j−1} + 4u_ij − u_{i+1,j} − u_{i−1,j}) / h²,

Fig. 5.5 A sample mesh showing interior points and indicating a five-point Poisson
equation stencil and the “compass” notation, where C is the mesh point and N, S, E, W
are the mesh points immediately above, below, to the right and the left of C.

so the equations can be written compactly as
  −Δ_h u_ij = f(x_i, y_j), at all (x_i, y_j) inside Ω,
  u_ij = g(x_i, y_j) (≡ 0) at all (x_i, y_j) on ∂Ω.    (5.10)
The boundary unknowns can be eliminated so (5.10) becomes an N² × N²
linear system for the N² unknowns:
  A_{N²×N²} u_{N²×1} = f_{N²×1}.    (5.11)
Since each equation couples u_ij to its four nearest neighbors in the mesh,
A will typically have only 5 nonzero entries per row. To actually find A
we must order the unknowns u_ij into a vector u_k, k = 1, 2, . . . , N². A
lexicographic ordering is depicted in Figure 5.6.
Thus, through the difference stencil, if u_ij is the kth entry in u, u_ij is
linked to u_{k−1}, u_{k+1}, u_{k+N} and u_{k−N}, as in Figure 5.7.

Fig. 5.6 Node numbering in a lexicographic order for a 10 × 10 mesh.

Thus, the typical kth row (associated with an interior mesh point (x_i, y_j)
not adjacent to a boundary point) of the matrix A will read:
  (0, 0, . . . , 0, −1, 0, 0, . . . , 0, −1, 4, −1, 0, . . . , 0, −1, 0, . . . , 0).    (5.12)

Conclusion: A is an N² × N² (≈ h^{-2} × h^{-2}) sparse, banded matrix. It
will typically have only 5 nonzero entries per row. Its bandwidth is 2N + 1
and its half bandwidth is thus p = N (≈ h^{-1}).
Complexity Estimates: For a given resolution h or given N, storing A
as a banded matrix requires storing
  2(N + 1) × N² real numbers ≈ 2h^{-3} real numbers.
Solving Au = f by banded Gaussian elimination requires
  O((2N + 1)² N²) = O(N⁴) ≈ O(h^{-4}) FLOPs.



Fig. 5.7 The 2D difference stencil at the kth point in a lexicographic ordering with N
numbered points in each direction.

The exact structure of A can easily be written down because all the choices
were made to keep A as simple as possible. A is an N² × N² block tridiagonal
matrix (N × N blocks with each block an N × N matrix) of the form:

  A = [ T   −I         0
        −I   T    ⋱
             ⋱    ⋱   −I
        0        −I    T ]    (5.13)

where I is the N × N identity matrix and
  T = tridiag(−1, 4, −1) (an N × N matrix).
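The block structure (5.13) also makes A easy to assemble with Kronecker products; the sketch below is an addition (SciPy's sparse module assumed, the helper name poisson_2d_matrix is mine) that builds A and checks the "5 nonzeros per row" and half bandwidth p = N claims:

    import numpy as np
    import scipy.sparse as sp

    def poisson_2d_matrix(N):
        """A of (5.13): block tridiagonal, T = tridiag(-1,4,-1) on the diagonal, -I off it."""
        T1 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(N, N))   # 1D tridiag(-1,2,-1)
        I = sp.identity(N)
        return (sp.kron(I, T1) + sp.kron(T1, I)).tocsr()       # N^2 x N^2 matrix

    N = 10
    A = poisson_2d_matrix(N)
    print(A.shape)                          # (100, 100) = (N^2, N^2)
    print(A.nnz / A.shape[0])               # about 5 nonzero entries per row
    i, j = A.nonzero()
    print(int(np.abs(i - j).max()))         # N = 10: half bandwidth p = N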

Exercise 5.7. Consider the 2D MPP with RHS f(x, y) = x − 2y and
boundary condition g(x, y) = x − y. Take h = 1/2 and write down the
difference approximation to u(1/2, 1/2). Compute its value.

Exercise 5.8. Consider the 2D MPP with RHS f(x, y) = x − 2y and
boundary condition g(x, y) = x − y. Take h = 1/3 and write down the 4 × 4
linear system for the unknown values of u_ij.

Exercise 5.9. The N² eigenvalues and eigenvectors of the matrix A in
(5.13) have been calculated exactly, just as in the 1D case. They are, for
n, m = 1, . . . , N,
  λ_{n,m} = 4 ( sin²( nπ/(2(N + 1)) ) + sin²( mπ/(2(N + 1)) ) ),

and
  (u_{n,m})_{j,k} = sin( jnπ/(N + 1) ) sin( kmπ/(N + 1) ),
where j and k vary from 1, . . . , N. Verify these expressions by calculating
A u_{n,m} and showing it is equal to λ_{n,m} u_{n,m}.

Exercise 5.10. Let the domain be the triangle with vertices at (0, 0), (1, 0),
and (0, 1). Write down the linear system arising from the MPP on this
domain with f(x, y) = x + y, g = 0 and N = 5.

5.4 The 3D MPP

The 3D model Poisson problem is to find
  u = u(x, y, z)
defined for (x, y, z) in the unit cube
  Ω := {(x, y, z) | 0 < x, y, z < 1}
satisfying
  −Δu = f(x, y, z), in Ω,
  u = g(x, y, z), on the boundary ∂Ω.
The Laplace operator in 3D is (writing it out)
  Δu = div grad u = ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²,
and a discrete Laplacian is obtained by approximating each term by the
1D difference in the x, y and z directions. We shall now develop the linear
system arising from the usual central difference model of this problem,
making the simplest choice at each step. First, take
  h = 1/(N + 1), and set
  Δx = Δy = Δz = h = 1/(N + 1).
Define the mesh points in the cube as
  x_i = ih, y_j = jh, z_k = kh for 0 ≤ i, j, k ≤ N + 1.
Thus a typical mesh point is the triple (x_i, y_j, z_k). There are (N + 2)³ of
these but some are known from boundary values; there are exactly N³ of
these that need to be calculated. Thus we must have an N³ × N³ linear

system: one equation for each unknown variable! Let the approximation at
the meshpoint (xi , yj , zk ) be denoted (as usual) by
uijk := approximation to u(xi , yj , zk ).
The discrete Laplacian in 3D is

Δh uijk := (ui+1jk − 2uijk + ui−1jk)/h² + (uij+1k − 2uijk + uij−1k)/h² + (uijk+1 − 2uijk + uijk−1)/h².

Collecting terms we get

Δh uijk := (ui+1jk + uij+1k + uijk+1 − 6uijk + ui−1jk + uij−1k + uijk−1)/h².
The 3D discrete model Poisson problem is thus
−Δh uijk = f (xi , yj , zk ), at all meshpoints (xi , yj , zk ) inside Ω
uijk = g(xi , yj , zk ) = 0, at all meshpoints (xi , yj , zk ) on ∂Ω.
In the above, “at all meshpoints (xi , yj , zk ) inside Ω” means for 1 ≤ i, j, k ≤
N and “at all meshpoints (xi , yj , zk ) on ∂Ω” means for i or j or k = 0 or
N + 1. We thus have the following square (one variable for each meshpoint
and one equation at each meshpoint) system of linear equations (where
fijk := f (xi , yj , zk )).
For 1 ≤ i, j, k ≤ N ,
−ui+1jk − uij+1k − uijk+1 + 6uijk − ui−1jk − uij−1k − uijk−1 = h²fijk .
And for i or j or k = 0 or N + 1,
uijk = 0. (5.14)
The associated difference stencil is sketched in Figure 5.8.
Counting is good!
This system has one unknown per meshpoint and one equation per mesh-
point. In this form it is a square (N + 2)³ × (N + 2)³ linear system. Since
uijk = 0 for all boundary meshpoints we can also eliminate these degrees
of freedom and get a reduced² N³ × N³ linear system, for 1 ≤ i, j, k ≤ N:
−ui+1jk − uij+1k − uijk+1 + 6uijk − ui−1jk − uij−1k − uijk−1 = h²fijk ,
2 We shall do this reduction herein. However, there are serious reasons not to do it if
you are solving more general problems: keeping these unknowns makes the system only
negligibly larger and it is easy to change the boundary conditions. If one eliminates these
unknowns, then changing the boundary conditions can mean reformatting all the matrices
and programming again from scratch. On the other hand, this reduction results in a
symmetric matrix while keeping Dirichlet boundary conditions in the matrix destroys
symmetry and complicates the solution method.

Fig. 5.8 The difference molecule or stencil in 3D: the center point carries the value 6
and each of its six nearest neighbors the value −1.

where subscripts of i, j, or k equal to 0 or N + 1 are taken to mean that


uijk = 0.
The complexity of these connections is revealed by considering the
nearest neighbors on the physical mesh that are linked in the system. A
uniform mesh is depicted in Figures 5.9 and 5.10.

Fig. 5.9 A 3D mesh.


Fig. 5.10 Geometry of a 3D uniform mesh. Each point has six neighbors, indicated by
heavy dots: (i ± 1, j, k), (i, j ± 1, k) and (i, j, k ± 1).

A typical row in the matrix (when a lexicographic ordering of meshpoints
is used) looks like

(0, . . . , 0, −1, [N² − N − 1 zeros], −1, [N − 1 zeros], −1, 6, −1, [N − 1 zeros], −1, [N² − N − 1 zeros], −1, 0, . . . , 0)

where the value 6 is the diagonal entry. If the mesh point is adjacent to the
boundary then this row is modified. (In 3D adjacency happens often.)
To summarize, some basic facts about the coefficient matrix A of the linear
system derived from the 3D model Poisson problem on an N × N × N mesh
with h = 1/(N + 1):

• A is N³ × N³ (huge if N = 100, say!).
• A has at most 7 nonzero entries in each row.
• N³ equations with bandwidth = 2N² + 1, or half bandwidth p = N².
• Storage as a banded sparse matrix requires storing N³ × (2N² + 1) ≈ 2N⁵ real numbers.
• Solution using banded sparse Gaussian elimination requires about O((2N² + 1)² × N³) = O(N⁷) FLOPS.
• Suppose you need 10× more resolution (so Δx ← Δx/10, Δy ← Δy/10
and Δz ← Δz/10). Then h → h/10 and thus N → 10N. It follows
that
Storage requirements increase 100,000 times, and
Solution by banded sparse GE takes 10,000,000 times longer!

The 3D matrix has certain mathematical similarities to the 1D and 2D
matrices. Exploiting these, one can show the following.

Theorem 5.3 (Eigenvalues of the discrete MPP). Let A denote the
coefficient matrix arising from the 3D model Poisson problem on a uniform
mesh with h = 1/(N + 1). Then A is nonsingular. Furthermore, the
eigenvalues of A are given by

λpqr = 4 [ sin²(pπ/(2(N + 1))) + sin²(qπ/(2(N + 1))) + sin²(rπ/(2(N + 1))) ],   for 1 ≤ p, q, r ≤ N.

Thus,

λmax(A) ≈ 12,   λmin(A) ≈ h²,   cond₂(A) = O(h⁻²).
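A quick way to see the conditioning estimate numerically is to evaluate the eigenvalue formula above; a minimal MATLAB/Octave sketch (the value N = 50 is an arbitrary choice made here):

N = 50; h = 1/(N+1);
s = sin((1:N)*pi*h/2).^2;      % sin^2(k*pi/(2(N+1))), k = 1..N
lammin = 12*s(1);              % smallest eigenvalue: p = q = r = 1
lammax = 12*s(N);              % largest eigenvalue:  p = q = r = N
cond2A = lammax/lammin         % grows like h^(-2) as N increases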

Exercise 5.11. Prove the claimed estimates of cond(A) from the formula
for the eigenvalues of A.

5.5 The Curse of Dimensionality

“Let us first understand the facts, and then we may seek for the
causes.”
— Aristotle.

The right way to compare the costs in storage and computer time of
solving a BVP is in terms of the resolution desired, i.e., in terms of the
meshwidth h. The previous estimates for storage and solution are sum-
marized in Table 5.1. Comparing these we see the curse of dimensionality
clearly: As the dimension increases, the exponent increases rather than the
constant or parameter being raised to the exponent. In other words:

The cost of storing the data and solving the linear system using
direct methods for the model problem increases exponentially
with the dimension.
Table 5.1 Costs of banded sparse GE.

                                   1D        2D        3D
Storage cost (# real numbers)      4h⁻¹      2h⁻³      2h⁻⁵
Solution cost (# FLOPs)            5h⁻¹      O(h⁻⁴)    O(h⁻⁷)

To put some concrete numbers to this observation, on a typical inexpensive
PC in 2012, one is able to store the data for a 2D model problem
on a mesh with h ≈ 1/500. In other words, one can store roughly

in 2D: 2 · 500³ = 250,000,000 double precision numbers.

If one is solving the 1D problem instead with this computer it could store
a matrix with h = hmin where

4hmin⁻¹ = 250,000,000,   or   hmin = 1.6 × 10⁻⁸,

which is an exceedingly small meshwidth. On the other hand, if you were
solving a 3D problem, the finest mesh you can store is

2hmin⁻⁵ = 250,000,000,   or   hmin ≈ 1/40,

which is exceedingly coarse.
Using the same kind of estimates, suppose storage is not an issue and
that for h = 1/1000 solving the 2D problem takes 100 minutes. This means
the completely hypothetical computer is doing roughly 10⁻¹⁰ minute/flop.
From the above table we would expect the time required to solve the 1D
and 3D problems for the same resolution, h = 1/1000, to be:

in 1D: 10⁻⁷ minutes,
in 3D: 10⁺¹¹ minutes!

This is the curse of dimensionality in turnaround times. In practical


settings, often programs are used for design purposes (solve, tweak one
design or input parameter, solve again, see what changes) so to be useful
one needs at least 1 run per day and 3 runs per day are desired.
How is one to break the curse of dimensionality? We start with one key
observation.

5.5.1 Computing a residual grows slowly with dimension


Though this be madness, yet there is method in’t.
— Shakespeare, William (1564–1616)
Given an approximate solution to the model Poisson problem we can


compute a residual cheaply since A only has a few nonzero entries per row.
The 1D case: We have3

for i=1:N
r(i) = h^2*f(i)-(-u(i+1)+2*u(i)-u(i-1))
end

This takes 3 multiplications and 3 additions per row giving 6h⁻¹
FLOPS. Notice that the matrix does not need to be stored—only the vectors
f, u and r, of length N ≈ 1/h.
The 2D case: The 2D case takes 5 multiplies and 5 adds per row for
h⁻² rows by:

for i=1:N
for j=1:N
r(i,j)=h^2*f(i,j)-(-u(i,j+1)-u(i,j-1)+4*u(i,j) ...
-u(i+1,j)-u(i-1,j) )
end
end

This gives a total of only 10h⁻² FLOPS and requires only 2h⁻² real
numbers to be stored.
The 3D case: The 3D case takes 7 multiplies and 7 adds per row for
h⁻³ rows by:

for i=1:N
for j=1:N
for k=1:N
r(i,j,k)=h^2*f(i,j,k) - ( ...
-u(i+1,j,k)-u(i,j+1,k)-u(i,j,k+1) ...
+6*u(i,j,k) ...
-u(i-1,j,k)-u(i,j-1,k)-u(i,j,k-1) )
end
end
end

This gives a total of only 14h⁻³ FLOPS and requires only 2h⁻³ real
numbers to be stored.
3 When i=1, the expression “u(i-1) ” is to be interpreted as the boundary value at

the left boundary, and when i=N, the expression “u(i+1)” is to be interpreted as the
boundary value at the right boundary.
To summarize,

Costs for computing residual

dimension of model          1D        2D        3D
# real numbers storage      2h⁻¹      2h⁻²      2h⁻³
# FLOPS                     6h⁻¹      10h⁻²     14h⁻³

The matrix A does not need to be stored for the MPP since we already
know the nonzero values and the components they multiply. More generally
we would only need to store the nonzero entries and a pointer vector to tell
which entry in the matrix is to be multiplied by that value. Thus the only
hope to break the curse of dimensionality is to use algorithms where the
work involves computing residuals instead of elimination! These special
methods are considered in the next chapter.
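To make the "nonzero entries plus pointers" idea concrete, here is a minimal MATLAB/Octave sketch of a sparse matrix-vector product; the storage layout and the names val, rowind, colind, M and n are assumptions made for this sketch, not a prescribed format:

y = zeros(n,1);                % n = number of rows of A (assumed given)
for m = 1:M                    % M = number of stored nonzero entries
    y(rowind(m)) = y(rowind(m)) + val(m)*x(colind(m));
end

This computes y = Ax using only the stored nonzeros; Exercise 5.15 below asks for a careful version of exactly this idea.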

Exercise 5.12. Write a program to create, as an array, the matrix which


arises from the 2D model Poisson problem on a uniform N ×N mesh. Start
small and increase the dimension (N and the dimension of the matrices).

(1) Find the smallest h (largest N ) for which the program will execute
without running out of memory.
(2) Next, from this, estimate (and explain how you did it) the smallest h
(largest N) for which this can be done in 2D in banded sparse storage
mode. The same question in 1D. The same question in 3D.
(3) Make a chart of your findings and draw conclusions.

Exercise 5.13. If solving a 2D MPP program takes 30 minutes with N =


20000, estimate how long it would take to solve the problem with the same
value of h in 1D and in 3D. Explain.

Exercise 5.14. Same setting as the last problem. Now however, estimate
how long it would take to compute a residual in 1D, 2D and 3D. Explain
how you did the estimate.

Exercise 5.15. Think about the problem of computing Ax where A is large
and sparse but with a nonzero structure less regular than for the MPP.
Thus, the nonzero entries in A must be stored as well as (for each) somehow
the row and column number in which that entry appears. Formalize one
way to store A in this manner then write down in pseudocode how to
compute x → Ax. Many people have worked on sparse matrix storage
schemes so it is unlikely that your solution will be best possible. However,
after finding one answer, you will be able to quickly grasp the point of
the various sparse storage schemes. Next look in the Templates book,
Barrett, Berry, et al. [Barrett et al. (1994)] and compare your method to
Compressed Row Storage. Explain the differences.
Chapter 6

Iterative Methods

“The road to wisdom? Well, it is plain


And simple to express:
Err and err and err again,
But less and less and less.”
— Piet Hein1

6.1 Introduction to Iterative Methods

Iterative methods for solving Ax = b are rapidly becoming the workhorses


of parallel and large scale computational mathematics. Unlike Gaussian
elimination, using them reliably depends on knowledge of the methods and
the matrix and there is a wide difference in the performance of different
methods on different problems. They are particularly important, and often
the only option, for problems where A is large and sparse. The key observation
for these is that to compute a matrix-vector multiply, x → Ax,
one only needs to store the nonzero entries Aij and their indices i and
j. These are typically stored in some compact data structure that does
not need space for the zero entries in A. If the nonzero structure of A is
regular, as for the model Poisson problem on a uniform mesh, even i and j
need not be stored!
Consider the problem of solving linear systems
Ax = b, A : large and sparse.
As in chapter 5, computing the residual
r = b − Ax̂,   x̂ : an approximate solution,
1 Piet Hein is a famous Danish mathematician, writer and designer. He is perhaps most

famous as a designer. Interestingly, he is a descendant of another Piet Hein from the


Netherlands who is yet more famous in that small children there still sing songs praising
him.

is cheap in both operations and storage. Iterative methods take a form


exploiting this, generally resembling:

Algorithm 6.1 (Basic iterative method). Given an approximate solution
x̂ and a maximum number of steps itmax:

Compute residual: r = b − Ax̂
for i = 1:itmax
    Use r to improve x̂
    Compute residual using improved x̂: r = b − Ax̂
    Use residual and update to estimate accuracy
    if accuracy is acceptable, exit with converged solution
end
Signal failure if accuracy is not acceptable.

As an example (that we will analyze in the next section) consider the


method known as first order Richardson, or FOR. In FOR, we pick the
number ρ > 0, rewrite Ax = b as

ρ(x − x) = b − Ax,

then guess x0 and iterate using

ρ(xn+1 − xn) = b − Axn ,   or   xn+1 = (I − (1/ρ)A)xn + (1/ρ)b.
Algorithm 6.2 (FOR = First Order Richardson). Given ρ > 0, target
accuracy tol, maximum number of steps itmax and initial guess x0:

Compute residual: r0 = b − Ax0
for n = 0:itmax
    Compute update Δn = (1/ρ)rn
    Compute next approximation xn+1 = xn + Δn
    Compute residual rn+1 = b − Axn+1
    Estimate residual accuracy criterion ‖rn+1‖/‖b‖ < tol
    Estimate update accuracy criterion ‖Δn‖/‖xn+1‖ < tol
    if both residual and update are acceptable
        exit with converged solution
    end
end
Signal failure if accuracy is not acceptable.
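For concreteness, a minimal MATLAB/Octave version of this algorithm is sketched below (the function and variable names are choices made for this sketch; for the MPP one would replace the products A*x by a stencil routine so that A is never stored):

function [x, converged] = richardson(A, b, rho, x, tol, itmax)
% First Order Richardson for Ax = b, in residual-update form.
converged = false;
r = b - A*x;
for n = 1:itmax
    dx = r/rho;                                    % update
    x  = x + dx;                                   % next approximation
    r  = b - A*x;                                  % new residual
    if norm(r) <= tol*norm(b) && norm(dx) <= tol*norm(x)
        converged = true;                          % both stopping tests passed
        return
    end
end
end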
We shall see that FOR is a terrific iterative method for introducing


the ideas and mathematics of the area but a very slow one for actually
solving Ax = b. Nevertheless, if there were no faster ones available, then it
would still be very widely used because of the curse of dimensionality. To
understand why, let us return to the example of the model Poisson problem
in 3D discussed in the previous chapter. Recall that, for a typical point
(xi , yj , zk ) ∈ Ω, the equation becomes
6uijk − ui+1jk − ui−1jk − uij+1k − uij−1k − uijk+1 − uijk−1 = h²fijk ,
where uijk = u(xi , yj , zk ) and fijk = f (xi , yj , zk ) and if any point lies on
the boundary, its value is set to zero. Picking ρ = 6 (called the Jacobi
method) makes FOR particularly simple, and it is given by

Algorithm 6.3 (Jacobi iteration in 3D). Given a tolerance tol, a
maximum number of iterations itmax and arrays uold, unew and f, each
of size (N+1,N+1,N+1), with boundary values of uold and unew filled with
zeros (boundaries are locations for which i or j or k take on the values 1 or N+1):

h=1/N
for it=1:itmax
% initialize solution, delta, residual and rhs norms
delta=0
unorm=0
resid=0
bnorm=0

for i=2:N
for j=2:N
for k=2:N
% compute increment
au=-( uold(i+1,j,k) + uold(i,j+1,k) ...
+ uold(i,j,k+1) + uold(i-1,j,k) ...
+ uold(i,j-1,k) + uold(i,j,k-1) )
unew(i,j,k)=(h^2*f(i,j,k) - au)/6

% add next term to norms


delta=delta + (unew(i,j,k) - uold(i,j,k))^2
unorm=unorm + (unew(i,j,k))^2
resid=resid + (h^2*f(i,j,k) ...
- au - 6*uold(i,j,k))^2
bnorm=bnorm + (h^2*f(i,j,k))^2
end
end
end

uold=unew % set uold for next iteration

% complete norm calculation


delta=sqrt(delta)
unorm=sqrt(unorm)
resid=sqrt(resid)
bnorm=sqrt(bnorm)

% test for convergence


if resid<tol*bnorm & delta<tol*unorm
’solution converged’
return
end

end
error(’convergence failed’)

Remark 6.1. If Algorithm 6.3 were written to be executed on a computer,


the calculation of bnorm would be done once, before the loop began. Cal-
culating it on each iteration is a waste of computer time because it never
changes.
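In code, the change suggested in this remark is simply to hoist the bnorm computation above the iteration loop of Algorithm 6.3; a sketch of the hoisted piece:

bnorm=0;
for i=2:N
  for j=2:N
    for k=2:N
      bnorm=bnorm + (h^2*f(i,j,k))^2;
    end
  end
end
bnorm=sqrt(bnorm);
% the bnorm lines inside the iteration loop are then deleted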

Programming the Jacobi method in this way is particularly simple for


a uniform mesh (as above). The computations reflect the underlying mesh.
Each approximate solution is computed at a given mesh point by averaging
the values of the points 6 nearest neighbors then adding this to the right
hand side. This style of programming is obviously parallel (think of 1 CPU
at each point on the mesh with nearest neighbor connections). Unfortu-
nately, the program must be rewritten from scratch whenever the geometry
of the mesh connectivity changes.
The Jacobi method requires that only three N × N × N arrays be
stored: f(i,j,k), containing the value f (xi , yj , zk ), and uold(i,j,k)
and unew(i,j,k) containing the values of the old and new (or updated)
approximations. Remarkably, this does not require that the coefficient ma-
trix be stored at all! Thus, provided it converges rapidly enough, we have
a method for overcoming the curse of dimensionality. Unfortunately, this
“provided” is the key question: an iterative method's utility depends on its speed of
convergence and, double unfortunately, we shall see that the Jacobi method
does not converge fast enough, as the next example begins to indicate.

Example 6.1 (FOR for the 1D model Poisson problem). The 1D


model problem

−u″ = f(x), 0 < x < 1, u(0) = 0 = u(1)

with h = 1/5 and f(x) = x/5 leads to the 4 × 4 tridiagonal linear system

2u1 − u2 = 1
−u1 + 2u2 − u3 = 2
−u2 + 2u3 − u4 = 3
−u3 + 2u4 = 4.
The true solution is

u1 = 4, u2 = 7, u3 = 8, u4 = 6.

Taking ρ = 2 in FOR gives the iteration

u1NEW = 1/2 + u2OLD /2
u2NEW = 2/2 + (u1OLD + u3OLD )/2
u3NEW = 3/2 + (u2OLD + u4OLD )/2
u4NEW = 4/2 + u3OLD /2.
(This is also known as the Jacobi iteration.) Taking u0 = uOLD =
(0, 0, 0, 0)t we easily compute the iterates
u1 = (1/2, 1, 3/2, 2)t ,   u10 = (3.44, 6.07, 7.09, 5.42)t ,   u20 = (3.93, 6.89, 7.89, 5.93)t ,   u35 = (4.00, 7.00, 8.00, 6.00)t .
This problem is only a 4 × 4 linear system and can be very quickly solved
exactly by hand. To solve to 2 digits by the Jacobi method took 35 steps
which is much slower.
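These iterates are easy to check by machine; a short MATLAB/Octave sketch (written directly from the 4 × 4 system above):

A = [2 -1 0 0; -1 2 -1 0; 0 -1 2 -1; 0 0 -1 2];
b = [1; 2; 3; 4];
u = zeros(4,1); rho = 2;
for n = 1:35
    u = u + (b - A*u)/rho;     % one FOR (= Jacobi) step
end
u                               % approximately (4, 7, 8, 6)^t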
Exercise 6.1. For the choices below do 2 steps of FOR:

ρ = 2,   A = [ 2 1 ; 1 2 ],   b = (1, −1)t ,   x0 = (0, 0)t .
Exercise 6.2.

(1) Write a computer program to apply FOR with ρ = 2 (Jacobi iteration)


to the 1D Model Poisson Problem as described in Example 6.1, per-
forming only a fixed number of iterations, not checking for convergence.
Check your work by verifying the four iterates given in Example 6.1.
Warning: The four values u1 , u2 , u3 and u4 in Example 6.1 would
refer to the four values u(2), u(3), u(4) and u(5) if Algorithm 6.3
were to be written for 1D. This is because u(1)=0 and u(6)=0.
(2) Add convergence criteria as described in Algorithm 6.3 to your code.
For the same problem you did in part 1, how many iterations would be
required to attain a convergence tolerance of 10−8 ?
(3) For the 1D Model Poisson Problem that you coded above in part 1,
but with h = 1/200 and f (x) = x/200, how many iterations would be
required to attain a convergence tolerance of 10−8 ?
(4) It is easy to find the analytic solution of the 1D Model Poisson Problem by
integrating twice. Compare the analytic solution with your computed
solution with h = 1/200 by computing the relative difference between
the two at each of the 200 mesh points and finding the square root of
the sum of squares of these differences divided by the square root of
the sum of squares of either solution (the relative two norm).

Exercise 6.3. Consider the 2D MPP with RHS f (x, y) = x − 2y and


boundary condition g(x, y) = x − y. Take h = 1/3 and write down the
4 × 4 linear system for the unknown values of uij . Take ρ = 2 and initial
guess the zero vector and do 2 steps of FOR. It will be easier to draw a big
picture of the physical mesh and do the calculations on the picture than to
write it all out as a matrix.
Exercise 6.4. Consider the 2D MPP with RHS f (x, y) = 0 and with
boundary condition g(x, y) = x − y.

(1) Write a computer program to solve this problem for h = 1/N using
FOR with ρ = 4 (Jacobi iteration). This amounts to modifying Algo-
rithm 6.3 for 2D instead of 3D.
(2) It is easy to see that the solution to this problem is u(x, y) = x −
y. Remarkably, this continuous solution is also the discrete solution.
Verify that your code reproduces the continuous solution to within the
convergence tolerance for the case h = 1/3 (N = 3).
(3) Verify that your code reproduces the continuous solution to within the
convergence tolerance for the case h = 1/100 (N = 100).

6.1.1 Iterative methods: three standard forms

“Truth emerges more readily from error than from confusion.”


— Francis Bacon

FOR can be generalized by replacing the value ρ with a matrix M . It


can be written as

Algorithm 6.4 (Stationary iterative method). Given a N ×N matrix


A, another N × N matrix M , a right side vector b and an initial guess x0 ,

n=0
while convergence is not satisfied
Obtain xn+1 as the solution of M (xn+1 − xn ) = b − Axn
n=n+1
end

The matrix M does not depend on the iteration counter n, hence the
name “stationary.” This algorithm results in a new iterative method for
each new choice of M , called a “preconditioner.” For FOR (which takes
very many steps to converge) M = ρI. At the other extreme, if we
pick M = A then the method converges in 1 step but that one step is just
solving a linear system with A so no simplification is obtained. From these
two extreme examples, it is expected that some balance must be struck
between the cost per step (less with simpler M ) and the number of steps
(the closer M is to A the fewer steps expected).

Definition 6.1. Given an N × N matrix A, an N × N matrix M that


approximates A in some useful sense and for which the linear system M y =
d is easy to solve is a preconditioner of A.

Definition 6.2. For a function Φ, a fixed point of Φ(x) is any x satisfying


x = Φ(x) and a fixed point iteration is an algorithm approximating x
by guessing x0 and repeating xn+1 = Φ(xn ) until convergence.

There are three standard ways to write any stationary iterative method.
(1) Residual-Update Form: rn = b − Axn is the residual and Δn =


xn+1 − xn is the update. Thus, the residual-update form is: given xn ,
rn = b − Axn
Δn = M −1 rn
xn+1 = xn + Δn .
This is often the way the methods are programmed; a short sketch follows this list.
(2) Fixed Point Iteration Form: A stationary iterative method can be
easily rewritten as a fixed point iteration. Define T = I − M −1 A then
we have
xn+1 = M −1 b + T xn =: Φ(xn ).
T is the iteration operator. This is the form used to analyze conver-
gence and rates of convergence.
(3) Regular Splitting Form: This form is similar to the last one for
FOR. Rewrite
A = M − N, so N = M − A.
Then Ax = b can be written M x = b + N x. The iteration is then
M xn+1 = b + N xn .
For FOR, M = ρI and N = ρI − A so the regular splitting form becomes
ρIxn+1 = b + (ρI − A)xn .
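A minimal MATLAB/Octave sketch of the residual-update form above; here the preconditioner is passed as a function handle Msolve that returns the solution of M*dx = r (a choice made for this sketch):

function x = stationary(A, b, Msolve, x, tol, itmax)
% Generic stationary iterative method in residual-update form.
for n = 1:itmax
    r  = b - A*x;              % residual
    dx = Msolve(r);            % update: solve M*dx = r
    x  = x + dx;
    if norm(r) <= tol*norm(b) && norm(dx) <= tol*norm(x)
        return                 % converged
    end
end
end

For example, FOR is recovered with Msolve = @(r) r/rho, and the Jacobi method of Remark 6.2 with Msolve = @(r) r./diag(A).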

Remark 6.2 (Jacobi method). As an example of a regular splitting
form, consider the splitting A = D − L − U , where D is the diagonal
part of A, −L is the strictly lower triangular part of A, and −U is the
strictly upper triangular part of A. The iteration in this case takes the form

Dxn+1 = b + (L + U )xn .

This is called the Jacobi method.

6.1.2 Three quantities of interest


Detelina’s Law:
“If your program doesn’t run, that means it has an error in it.
If your program does run, that means it has two errors in it.”

There are three important quantities to track in the iterative methods:


(1) The error, en = x − xn , which is unknown but essential;


(2) The residual, rn = b − Axn , which is computable; and,
(3) The update, Δn = xn+1 − xn , which is computable.

The residual and updates are used to give indications of the error and
to decide when to stop an iterative method.

Theorem 6.1. For first order Richardson (T = TFOR = I − ρ⁻¹A), en,
rn and Δn all satisfy the same iteration

en+1 = T en ,   rn+1 = T rn ,   and   Δn+1 = T Δn .

Proof. This is by subtraction. Since

x = ρ⁻¹b + T x   and   xn+1 = ρ⁻¹b + T xn ,

subtraction gives

(x − xn+1) = T (x − xn)   and   en+1 = T en .

For the update iteration, note that

xn+1 = ρ⁻¹b + T xn   and   xn = ρ⁻¹b + T xn−1 .

Subtraction gives

(xn+1 − xn) = T (xn − xn−1)   and   Δn+1 = T Δn .

The residual update is a little trickier to derive. Since

ρxn+1 = ρxn + b − Axn = ρxn + rn ,

multiply by −A and add ρb:

ρ(b − Axn+1) = ρ(b − Axn) − Arn ,
ρrn+1 = ρrn − Arn ,   and thus   rn+1 = (I − ρ⁻¹A)rn = T rn .

Remark 6.3. For other iterations, the typical result is

en+1 = T en ,   Δn+1 = T Δn ,   and   rn+1 = AT A⁻¹rn .

The matrices AT A⁻¹ and T are similar. For FOR, because of the
special form T = I − ρ⁻¹A, we have AT A⁻¹ = T since

AT A⁻¹ = A(I − ρ⁻¹A)A⁻¹ = AA⁻¹ − ρ⁻¹AAA⁻¹ = I − ρ⁻¹A = T.

Thus, rn+1 = T rn .
This theorem has an important interpretation:

Δn , r n and en −→ 0 at the same rate.

It is entirely possible that residuals rn and errors en can be of widely


different sizes. However, since they both go to zero at the same rate, if the
residuals improve by k significant digits from the initial residual, the errors
will have typically also improved by k significant digits over the initial error.
The third big question is When to stop? (Alternately, how to measure
“satisfaction” with a computed answer.) The theorem is important because
it says that monitoring the (computable) residuals and updates is a valid
way to test if the (incomputable) error has been improved enough to stop.
Stopping Criteria: Every iteration should include three (!) tests of
stopping criteria.

(1) Too Many Iterations: If n ≥ itmax (a user supplied maximum


number of iterations), then the method is likely not converging to the
true solution: stop, and signal failure.
(2) Small Residual: With a preset tolerance tol1 (e.g., tol1 = 10⁻⁶),
test if:
    ‖rn‖/‖b‖ ≤ tol1.
(3) Small Update: With tol2 a preset tolerance, test if:
    ‖Δn‖/‖xn‖ ≤ tol2.

The program should terminate if either the first test or both the second
and third tests are satisfied. Usually other computable heuristics are also
monitored to check for convergence and speed of convergence. One example
is the experimental contraction constant

αn := ‖rn+1‖/‖rn‖   or   ‖Δn+1‖/‖Δn‖.
This is monitored because αn > 1 suggests divergence and αn < 1 but very
close to 1 suggests very slow convergence.
To summarize, the important points

• Iterative methods require minimal storage. They are essential for 3D problems!
• Basic iterative methods are easy to program.3 The programs are short
and easy to debug and often are inherently parallel.
• An iterative method's convergence can be fast, slow, or can fail altogether. The questions
of convergence (at all) and speed of convergence are essential ones that
determine if an iterative method is practical or not.

3 Of course, in Matlab direct solution is an intrinsic operator (\), so even the simplest
iterative methods are more complicated. Further, often the requirement of rapid convergence
adds layers of complexity to what started out as a simple implementation of a
basic iterative method.

6.2 Mathematical Tools

“The Red Queen: ‘Why, sometimes I’ve believed as many as six


impossible things before breakfast...’
Alice: ‘Perhaps, but surely not all at the same time!’
The Red Queen: ‘Of course all at the same time, or where’s the
fun?’
Alice: ‘But that’s impossible!’
The Red Queen: ‘There, now that’s seven!’ ”
— Lewis Carroll, Alice in Wonderland

To analyze the critically important problem of convergence of iterative


methods, we will need to develop some mathematical preliminaries. We
consider an N × N iteration matrix T , a fixed point x, and the iteration xn :

x = b + T x,   xn+1 = b + T xn .

Subtracting we get the error equation:

en = x − xn   satisfies   en+1 = T en .

Theorem 6.2. Given the N × N matrix T , a necessary and sufficient
condition that for any initial guess

en −→ 0   as   n → ∞

is that there exists a matrix norm ‖ · ‖ with

‖T ‖ < 1.

Proof. Sufficiency is easy. Indeed, if ‖T ‖ < 1 we have

‖en‖ = ‖T en−1‖ ≤ ‖T ‖‖en−1‖.

Since en−1 = T en−2 , ‖en−1‖ ≤ ‖T ‖‖en−2‖ so ‖en‖ ≤ ‖T ‖²‖en−2‖. Continuing
backward (for a strict proof, this means: using an induction argument)
we find

‖en‖ ≤ ‖T ‖ⁿ‖e0‖.

Since ‖T ‖ < 1, ‖T ‖ⁿ → 0 as n → ∞.
en  ≤ T n e0 .
Since T  < 1, T n → 0 as n → ∞.
Proving that convergence implies existence of the required norm is
harder and will follow from the next two theorems that complete the circle
of ideas.

The proof that ‖T ‖ < 1 for some norm ‖ · ‖ is also mathematically
interesting and important. It is implied by the next theorem.

Theorem 6.3. For any N × N matrix T , a matrix norm ‖ · ‖ exists for
which ‖T ‖ < 1 if and only if for all eigenvalues λ(T )

|λ(T )| < 1.

Definition 6.3 (spectral radius). The spectral radius of an N × N


matrix T , spr(T ), is the size of the largest eigenvalue of T
spr(T ) = max{|λ| : λ = λ(T )}.

Theorem 6.4. Given any N × N matrix T and any ε > 0 there exists a
matrix norm ‖ · ‖ with ‖T ‖ ≤ spr(T ) + ε.

Proof. See Appendix A

Using this result, the following fundamental convergence theorem holds.

Theorem 6.5. A necessary and sufficient condition that


en → 0 as n→∞ for any e0 ,
is that spr(T ) < 1.

Proof. That it suffices follows from the previous two theorems. It is easy to
prove that it is necessary. Indeed, suppose spr(T ) ≥ 1 so T has an eigenvalue λ,

T φ = λφ   with   |λ| ≥ 1.

Pick e0 = φ. Then, e1 = T e0 = T φ = λφ, e2 = T e1 = λ²φ, . . . ,
en = λⁿφ.
Since |λ| ≥ 1, en clearly does not approach zero as n → ∞.
Since the eigenvalues of T determine if the iteration converges, it is


useful to know more about eigenvalues.

Definition 6.4 (similar matrices). B and P BP −1 are said to be similar


matrices.

Lemma 6.1 (Similar matrices have the same eigenvalues). Let B


be any N × N matrix and P any invertible matrix then the similar ma-
trices

B and P BP −1

have the same eigenvalues.

Proof. The proof is based on interpreting a similarity transformation as a


change of variable:

Bφ = λφ holds if and only if


P B(P −1 P )φ = λP φ, if and only if
(P BP −1 )Ψ = λΨ, where Ψ = P φ.

For many functions f (x) we can insert an N × N matrix A for x and
f (A) will still be well defined as an N × N matrix. Examples include

f (x) = 1/x  ⇒  f (A) = A⁻¹,
f (x) = 1/(1 + x)  ⇒  f (A) = (I + A)⁻¹,
f (x) = eˣ = 1 + x + x²/2! + · · ·  ⇒  f (A) = e^A = ∑_{n=0}^∞ Aⁿ/n! ,
f (x) = x² − 1  ⇒  f (A) = A² − I.

In general, f (A) is well defined (by its power series as in eA ) for any analytic
function. The next theorem, known as the Spectral Mapping Theorem, is
extremely useful. It says that

the eigenvalues of the matrix f (A) are f (the eigenvalues of A):

λ(f (A)) = f (λ(A)).


Theorem 6.6 (Spectral Mapping Theorem). Let f : C → C be an


analytic function.4 If (λ, φ) is an eigenvalue, eigenvector pair for A then
(f (λ), φ) is an eigenvalue, eigenvector pair for f (A).
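A quick numerical illustration in MATLAB/Octave for f(x) = eˣ (the test matrix is an arbitrary choice made for this sketch):

A = [2 1; 1 3];
sort(eig(expm(A)))              % eigenvalues of e^A
sort(exp(eig(A)))               % e^(eigenvalues of A); the two lists agree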

Exercise 6.5. (a) Let A be a 2 × 2 matrix with eigenvalues 2, −3. Find


the eigenvalues of eA , A3 , (2A + 2I)−1 . For what values of a, b is the matrix
B = aI + bA invertible? Explain. (b) If the eigenvalues of a symmetric
matrix A satisfy 1 ≤ λ(A) ≤ 200, find an interval (depending on ρ) that
contains the eigenvalues λ(T ) of T = I − (1/ρ)A. For what values of ρ
are |λ(T )| < 1? (c) For the same matrix, find an interval containing the
eigenvalues of (I + A)−1 (I − A).

Exercise 6.6. If A is symmetric, show that cond2 (At A) = (cond2 (A))2 .

Exercise 6.7. Let A be invertible and f (z) = 1/z. Give a direct proof of
the SMT for this particular f (·). Repeat for f (z) = z 2 .

6.3 Convergence of FOR

I was a very early believer in the idea of convergence.


— Jean-Marie Messier

This section gives a complete and detailed proof that the First Order
Richardson iteration

ρ(xn+1 − xn ) = b − Axn (6.1)

converges, for any initial guess, to the solution of

Ax = b.

The convergence is based on two essential assumptions: that the matrix A


is symmetric, positive definite (SPD) and the parameter ρ is chosen large
enough. The convergence proof will also give important information on

• how large is “large enough”,


• the optimal choice of ρ,
• the expected number of steps of FOR required.
4 Analytic is easily weakened to analytic in a domain, an open connected set, including

the spectrum of A.
Theorem 6.7 (Convergence of FOR). Suppose A is SPD. Then FOR


converges for any initial guess x0 provided

ρ > λmax (A)/2.

Proof of Theorem 6.7. Rewrite Ax = b as ρ(x − x) = b − Ax. Subtract-


ing (6.1) from this gives the error equation

ρ(en+1 − en ) = −Aen , en = x − x n ,

or

en+1 = T en , T = (I − ρ−1 A).

From Section 6.2, we know that en → 0 for any e0 if and only if |λ(T )| < 1
for every eigenvalue λ of the matrix T . If f (x) = 1 − x/ρ, note that
T = f (A). Thus, by the spectral mapping theorem

λ(T ) = 1 − λ(A)/ρ.

Since A is SPD, its eigenvalues are real and positive:

0 < a = λmin (A) ≤ λ(A) ≤ λmax (A) = b < ∞.

We know en → 0 provided |λ(T )| < 1, or

−1 < 1 − λ(A)/ρ < +1.

Since λ(A) ∈ [a, b], this is implied by

−1 < 1 − x/ρ < +1,   for a ≤ x ≤ b.

This is true if and only if (for ρ > 0)

−ρ < ρ − x < +ρ,   or   −2ρ < −x < 0,   or   0 < x < 2ρ,   or

ρ > x/2   for 0 < a ≤ x ≤ b.

This is clearly equivalent to

ρ > b/2 = λmax (A)/2.
6.3.1 Optimization of ρ
If you optimize everything, you will always be unhappy.
— Donald Knuth

Clearly, the smaller ‖T ‖₂ the faster en → 0. Now, from the above proof

‖T ‖₂ = max |λ(T )| = max |1 − λ(A)/ρ|.

The eigenvalues λ(A) are a discrete set on [a, b]. A simple sketch (see the
next subsection) shows that

α(ρ) := ‖T ‖₂ = max{|1 − λ(A)/ρ| : all λ(A)} = max{|1 − λmin/ρ|, |1 − λmax/ρ|}.

We see that ‖T ‖₂ < 1 for ρ > b/2, as proven earlier. Secondly, we easily
calculate that the “optimal” value of ρ, at which α = αmin, is

ρ = (a + b)/2 = (λmin + λmax)/2.

Let

κ = λmax/λmin

denote the spectral condition number of A. Then, if we pick

• ρ = λmax (A),   ‖T ‖₂ = 1 − 1/κ,
• ρ = (λmax + λmin)/2,   ‖T ‖₂ = 1 − 2/(κ + 1).

Getting an estimate of λmax (A) is easy; we could take, for example,

λmax ≤ ‖A‖ for any matrix norm, e.g., max over 1 ≤ i ≤ N of the row sums Σ_{j=1}^{N} |aij |.

However, estimating λmin (A) is often difficult. The shape of α(ρ) also
suggests that it is better to overestimate ρ than to underestimate it. Thus,
often one simply takes ρ = ‖A‖ rather than the “optimal” value of ρ. The
cost of this choice is that it roughly doubles the number of steps required.
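In code this choice is a single line; a sketch (assuming the matrix A is available as an array):

rho = max(sum(abs(A), 2));      % maximum absolute row sum, an upper bound for lambda_max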

6.3.2 Geometric analysis of the min-max problem


The problem of selecting an optimal parameter for SPD matrices A is a one
parameter min-max problem. There is an effective and insightful way to
solve all such (one parameter min-max) problems by drawing a figure and
saying “Behold!”.5 In this sub-section we shall solve the optimal parameter


problem by this geometric approach. We shall give the steps in detail
(possibly excruciating detail even) with apologies to the many readers for
whom the curve sketching problem is an easy one.
Following the previous sections, we have that the error satisfies

en+1 = Tρ en = (I − (1/ρ)A)en .

FOR converges provided

|λ|max (T ) = max{|1 − λ(A)/ρ| : λ an eigenvalue of A} < 1.

The parameter optimization problem is then to find ρoptimal by

min over ρ of max over λ = λ(A) of |1 − λ/ρ|.

To simplify this we suppose that only the largest and smallest eigenvalues
(or estimates thereof) are known. Thus, let

0 < a = λmin (A) ≤ λ ≤ b = λmax (A) < ∞

so that the simplified parameter optimization problem is

min over ρ of max over a ≤ λ ≤ b of |1 − λ/ρ|.

Fix one eigenvalue λ and consider in the y − ρ plane the curve y = 1 − λ/ρ,
as in Figure 6.1. The plot also has small boxes on the ρ axis indicating
a = λmin (A), b = λmax (A) and the chosen intermediate value of λ.


Fig. 6.1 Plot of y = 1 − λ/ρ for one value of λ.

5 All the better if one can say it in Greek.



The next step is for this same one eigenvalue λ to consider in the y − ρ
plane the curve y = |1 − λ/ρ|, as shown in Figure 6.2. This just reflects up
the portion of the curve below the rho axis in the previous figure. We also
begin including the key level y = 1.


Fig. 6.2 Plot of y = |1 − λ/ρ| for one value of λ.

The next step is to plot y = max over a ≤ λ ≤ b of |1 − λ/ρ|. This means plotting


the same curves for a few more values of λ and taking the upper envelope
of the family of curves once the pattern is clear. We do this in two steps.
First Figure 6.3 includes more examples of y = |1 − λ/ρ|.


Fig. 6.3 The family y = |1 − λ/ρ| for four values of λ.

The upper envelope is just whichever curve is on top of the family. We


plot it in Figure 6.4 with the two curves that comprise it.

Fig. 6.4 The dark curve is y = max over a ≤ λ ≤ b of |1 − λ/ρ|.

The dark curve in Figure 6.4 is our target. It is

‖T (ρ)‖₂ = max over a ≤ λ ≤ b of |1 − λ/ρ|.

Checking which individual curve is the active one in the maximum, we find:

• Convergence: ‖T (ρ)‖₂ < 1 if and only if ρ is bigger than the value
of rho at the point where 1 = −(1 − λmax/ρ). Solving this equation for
rho we find the condition

convergence if and only if ρ > λmax/2.

• Parameter selection: The optimal value of rho is where
the min over ρ of max over a ≤ λ ≤ b of |1 − λ/ρ| is attained. This is the value of rho where
the dark, upper envelope curve is smallest. Checking the active constraints,
it is where the two dashed curves cross and thus where
(1 − λmin/ρ) = −(1 − λmax/ρ). Solving for rho gives the value

ρoptimal = (λmin + λmax)/2,

the y value is given by

‖T (ρoptimal)‖₂ = 1 − 2/(κ + 1),

and the condition number is

κ = cond₂(A) = λmax (A)/λmin (A).
6.3.3 How many FOR iterations?

“Computing is no more about computers than astronomy is


about telescopes.”
— Edsger Dijkstra

The above analysis also gives insight on the expected number of iterations
for FOR to converge. Since

en = T en−1 ,   we have   en = T ⁿe0 .

Because of the multiplicative property of norms

‖en‖ ≤ ‖T ⁿ‖‖e0‖ ≤ ‖T ‖ⁿ‖e0‖.

Thus, the relative improvement in the error is

‖en‖/‖e0‖ ≤ ‖T ‖ⁿ.

If we want the initial error to be improved by some factor ε then we want

‖en‖/‖e0‖ < ε.

Since ‖en‖/‖e0‖ ≤ ‖T ‖ⁿ, it suffices that ‖T ‖ⁿ < ε or (taking logs and
solving for n)

n ≥ ln(1/ε) / ln(1/‖T ‖).

Usually, we take ε = 10⁻¹ and speak of the number of iterations for each
significant digit of accuracy. This is

n ≥ ln(10) / ln(1/‖T ‖).

We can estimate how big this is using

‖T ‖ = 1 − α, where α is small,

and ln(1 − α) = −α + O(α²) (by Taylor series). This gives

1 / ln(1/‖T ‖) ≈ α⁻¹,   where ‖T ‖ = 1 − α.

Recall the values of α from the previous subsection: α = 1/κ(A) for ρ = λmax (A)
and α = 2/(κ(A) + 1) for ρ = (λmax (A) + λmin (A))/2.


Conclusions

(1) For ρ = λmax (A), FOR requires approximately n ≈ ln(10) · κ(A) iterations
per significant digit of accuracy.
(2) For ρ = (λmax (A) + λmin (A))/2, FOR requires approximately n ≈ ln(10) · (κ(A) + 1)/2 iterations
per significant digit of accuracy.
(3) For the model Poisson problem, κ(A) = O(h⁻²), so this gives n = O(h⁻²) iterations
per significant digit of accuracy; a numerical illustration follows this list.
(4) The problem of FOR is that it is too slow. On, e.g., 100 × 100 meshes
it requires tens of thousands of iterations for each significant digit
sought. Thus, in the hunt for “better” iterative methods, it is clear
“better” means “faster”, which means fewer iterations per significant
digit, which means: find an iteration whose iteration operator T has
spr(T ) smaller than that of FOR!
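A small MATLAB/Octave sketch putting numbers to Conclusions (1)-(2) for the 2D MPP; the value κ ≈ 4/(π²h²) used below is an estimate that follows from the 2D eigenvalue formulas and is taken here as an assumption:

h = 1/100;
kappa = 4/(pi^2*h^2);                 % estimated condition number of the 2D MPP matrix
n_rho_max = log(10)*kappa             % iterations per digit for rho = lambda_max
n_rho_opt = log(10)*(kappa + 1)/2     % iterations per digit for the optimal rho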

Exercise 6.8. Let A be N × N and SPD. Consider FOR for solving Ax = b.
Define the A-norm by:

|x|A := √(xᵗAx) = √⟨Ax, x⟩ = √⟨x, Ax⟩.

Give a complete convergence analysis of the FOR error in the A norm
(paralleling our analysis). What is the optimal ρ? In particular show that
for the optimal value of ρ

|x − xn|A ≤ ((κ − 1)/(κ + 1)) |x − xn−1|A .

What is the number of iterations per significant digit for the MPP? If
you prefer, you can explore this computationally instead of theoretically
[Choose one approach: analysis or computations, not both].

Exercise 6.9. Consider error in FOR yet again. Suppose one chooses
2 values of ρ and alternates with ρ1 , ρ2 , ρ1 , ρ2 , etc. Relabel the steps as
follows:
ρ1 (xn+1/2 − xn ) = b − Axn
ρ2 (xn+1 − xn+1/2 ) = b − Axn+1/2 .
Eliminate the half step to write this as a stationary iterative method [i.e.,
relate xn+1 to xn ]. Analyze convergence for SPD A. Can this converge
faster with two different values of ρ than with 2 steps of one value of ρ?
If you prefer, you can explore this computationally instead of theoretically
[Choose one approach: analysis or computations. It will be most exciting if
you work with someone on this problem with one person doing the analysis
and the other the numerical explorations].

Exercise 6.10. Consider the 2D Model Poisson Problem on a uniform


mesh with h = 1/(N + 1), boundary condition: g(x, y) = 0, right hand
side: f (x, y) = x + y.

(1) Take h = 1/3 and write down the 4 × 4 linear system in matrix vector
form.
(2) Given an N × N mesh, let u(i, j) denote an N × N array of approximations
at each (xi , yj ). Give pseudocode for computing the residual
r(i, j) (an N × N array) and its norm.
(3) Suppose the largest N for which the coefficient matrix can be stored
in banded sparse form (to be solved by Gaussian elimination) is N = 150.
Estimate the largest value of N for which the problem can be stored to be solved
by First Order Richardson. Explain carefully!

6.4 Better Iterative Methods

“Computer scientists want to study computing for its own sake;


computational scientists want to build useful things.”
— Greg Wilson

FOR has a huge savings in storage over Gaussian elimination but not in
time to calculate the solution. There are many better iterative methods; we
consider a few such algorithms in this section: the Gauss–Seidel method,
over-relaxation and the SOR method. These are still used today — not
as solvers but as preconditioners for the Conjugate Gradient method of
Chapter 7.

6.4.1 The Gauss–Seidel Method

“Euler published 228 papers after he died, making the deceased


Euler one of history’s most prolific mathematicians.”
— William Dunham
The Gauss–Seidel Method is easiest to understand for the 2D model


problem by comparing it with the Jacobi method (which is FOR with ρ =
4). The Jacobi or ρ = 4 FOR method is:

Algorithm 6.5 (Jacobi Algorithm for the 2D MPP). Given an ar-


ray uold of size N+1 by N+1 with boundary values filled with zeros, a maxi-
mum number of iterations itmax, and a tolerance tol,
h=1/N
for it=1:itmax
for i=2:N
for j=2:N
(∗) unew(i,j)=( h^2*f(i,j) + uold(i+1,j)+uold(i-1,j) ...
+ uold(i,j+1)+uold(i,j-1) )/4
end
end
if convergence is satisfied, exit
uold=unew
end

The idea of Gauss–Seidel is to use the best available information instead:


if unew is known at a neighbor in step (∗), why not use it instead of uold?
This even makes it simpler to program and reduces the storage needed
because we no longer have to track old and new values and simply use the
most recent one.

Algorithm 6.6 (Gauss–Seidel algorithm for the 2D MPP). Given


an array u of size N+1 by N+1 with boundary values filled with zeros, a
maximum number of iterations itmax, and a tolerance tol,
h=1/N
for it=1:itmax
for i=2:N
for j=2:N
(∗) u(i,j)=( h^2*f(i,j) + u(i+1,j)+u(i-1,j) ...
+ u(i,j+1)+u(i,j-1) )/4
end
end
if convergence is satisfied, exit
end
Algebraically, Jacobi and Gauss–Seidel for a general linear system are


equivalent to splitting A into
A = D + L + U,
D = diagonal of A,
L = lower triangular part of A
U = upper triangular part of A.
Ax = b is rewritten as (L + D + U )x = b. The Jacobi iteration for Ax = b
is

D(xn+1 − xn) = b − Axn ,   equivalently   Dxn+1 = b − (L + U )xn .   (Jacobi for Ax = b)

The Gauss–Seidel iteration for Ax = b is

(D + L)(xn+1 − xn) = b − Axn ,   equivalently   (D + L)xn+1 = b − U xn .   (Gauss–Seidel for Ax = b)

Both take the general form

pick M , then:   M (xn+1 − xn) = b − Axn .

There is a general theory for stationary iterative methods in this general


form. The heuristics that are derived from this theory are easy to summa-
rize:

• M must be chosen so that spr(I − M −1 A) < 1 and so that M Δx = r


is easy to solve.
• The closer M is to A the faster the iteration converges.

Costs for the MPP: In most cases, the Gauss–Seidel iteration takes
approximately 1/2 as many steps as Jacobi iteration. This is because, intuitively
speaking, each time (∗) in Algorithm 6.6 is executed, it involves half
old values and half updated values. Thus, using Gauss–Seidel over FOR
cuts execution time roughly in half. However, the model problem still needs
(1/2)O(h⁻²) iterations. Cutting costs by 50% is always good. However, the
essential problem is how the costs grow as h → 0. In other words, the goal
should be to cut the exponent as well as the constant!
6.4.2 Relaxation

“Taniyama was not a very careful person as a mathematician.


He made a lot of mistakes, but he made mistakes in a good direc-
tion, and so eventually, he got right answers. I tried to imitate him,
but I found out that it is very difficult to make good mistakes.”
— Goro Shimura, of the Shimura–Taniyama Conjecture

“The time to relax is when you don’t have time for it.”
— Sydney J. Harris

Relaxation is an ingenious idea. It is appealing because

• it is algorithmically easy to put into any iterative method program.


• it introduces a parameter that must be chosen. With the right choice,
often it can reduce the number of iterations required significantly.

The second point (cost reduction) happens in cases where the number
of steps is sensitive to the precise choice of the parameter. However, it is
not appealing because

• it introduces a parameter which must be chosen problem by problem


and the number of steps can increase dramatically for slightly non-
optimal choices.

The idea is simple: pick the relaxation parameter ω, then add one line
to an existing iterative solver as follows.

Algorithm 6.7 (Relaxation Step). Given ω > 0, a maximum number
of iterations itmax and x0:

for n=1:itmax
    Compute a temporary value x_temp by some iterative method
    Compute xn+1 = ω x_temp + (1 − ω)xn
    if xn+1 is acceptable, exit
end

Since the assignment operator “=” means “replace the value on the left
with the value on the right”, in a computer program there is sometimes
no need to allocate extra storage for the temporary variable x_temp. Under-relaxation
means 0 < ω < 1 and is a good choice if the underlying iteration
undershoots and overshoots in an alternating manner. Small positive val-


ues of ω can slow convergence to an impractical level, but rarely cause
divergence. Over-relaxation means ω > 1 and is a good choice when the
underlying iteration is progressing slowly in a single direction. The right
choice of ω can drastically improve convergence, but can cause divergence
if ω is too big. This is because under relaxation is just linear interpolation
between the past two values while over-relaxation is linear extrapolation
from the past two values. For matrices arising from the MPP and similar
problems, a theory for finding the optimal value of ω during the course of
the iteration is well-established, see Hageman and Young [Hageman and
Young (1981)].

Exercise 6.11. In this exercise, you will see a simple example of how over-
relaxation or under-relaxation can accelerate convergence of a sequence.
For a number r with |r| < 1, consider the sequence6 {en = (r)n }∞ n=0 .
This sequence satisfies the recursion
en = ren−1 (6.2)
and converges to zero at a rate r. Equation (6.2) can be relaxed as
en = ωren−1 + (1 − ω)en−1 = (1 + ω(r − 1))en−1 . (6.3)

(1) Assume that 0 < r < 1 is real, so that the sequence {en } is of one
sign. Show that there is a value ω0 so that if 1 < ω < ω0 , then (6.3)
converges more rapidly than (6.2).
(2) Assume that −1 < r < 0 is real, so that the sequence {en } is of
alternating sign. Show that there is a value ω0 so that if 0 < ω0 < ω <
1, then (6.3) converges more rapidly than (6.2).
(3) Assume that r is real, find the value ω0 and show that, in this very
special case, the relaxed expression converges in a single iteration.

Exercise 6.12. Show that FOR with relaxation does not improve conver-
gence. It just corresponds to a different value of ρ in FOR.

Exercise 6.13. Consider Gauss–Seidel plus relaxation (which is the SOR


method studied next). Eliminate the intermediate (temporary) variable
and show that the iteration operator is
T (ω) = ((1/ω)D + L)⁻¹ (((1 − ω)/ω)D − U ).
6 The notation (r)n means the nth power of r as distinct from en , meaning the nth

iterate.
6.4.3 Gauss–Seidel with over-relaxation = Successive Over


Relaxation

“The researches of many commentators have already thrown


much darkness on this subject, and it is probable that if they con-
tinue we shall soon know nothing at all about it.”
— Mark Twain

SOR = Successive Over Relaxation is one of the most famous algorithms


in numerical analysis. It is simply Gauss–Seidel plus over relaxation. For
many years it was the method of choice for solving problems like the model
Poisson problem and its theory is both lovely and complete. Unfortunately,
it also includes a tuning parameter that must be chosen. For the MPP it is
tabulated and for a class of problems including MPP, methods for closely
approximating the optimal ω are well-known. For more complex problems
finding the optimal ω, while theory assures us that it exists, presents prac-
tical difficulties.
Heuristics exist for choosing good guesses for optimal ω, sometimes
equation by equation. In industrial settings, there is a long history of
refined heuristics based on theoretical results. In one case at Westinghouse,
Dr. L. A. Hageman (personal communication) was asked to devise automated methods for finding
optimal ω values for a large computer program used for design of nuclear
reactors. This program typically ran for hours and would often fail in the
middle of the night, prompting a telephone call to the designer who had
submitted the problem. The designer could sometimes get the program
running again by reducing the chosen value of ω, otherwise, a late-night
trip to work was required. In addition to the inconvenience, these failures
caused loss of computer time, a limited and valuable resource at the time.
Dr. Hageman relieved the users of estimating ω and failures of the program
were largely limited to modelling errors instead of solution errors. Methods
for estimating ω can be found in Hageman and Young [Hageman and Young
(1981)].
For A = L + D + U SOR is as follows:

Algorithm 6.8 (SOR for Ax=b). Given ω > 0, a maximum number of


iterations itmax and x0 :

Compute r0 = b − Ax0
for n=1:itmax
    Compute x_temp by one GS step:
        (D + L)(x_temp − xn) = b − Axn
    Compute xn+1 = ω x_temp + (1 − ω)xn
    Compute rn+1 = b − Axn+1
    if xn+1 and rn+1 are acceptable, exit
end

For the 2D MPP the vector x is the array u(i, j) and the action of D, L
and A can be computed directly using the stencil. That D + L is lower
triangular means just use the most recent value for any u(i, j). It thus
simplifies as follows.

Algorithm 6.9 (SOR algorithm for the 2D MPP). Given an array


u of size N+1 by N+1 with boundary values filled with zeros, a maximum
number of iterations itmax, a tolerance tol, and an estimate for the opti-
mal omega =omega (see below):

h=1/N
for it=1:itmax
for i=2:N
for j=2:N
uold=u(i,j)
u(i,j)=( h^2*f(i,j) ...
+ u(i+1,j)+u(i-1,j)+u(i,j+1)+u(i,j-1) )/4
u(i,j)=omega*u(i,j)+(1-omega)*uold
end
end
if convergence is satisfied, exit
end

Convergence results for SOR are highly developed. For example, the fol-
lowing is known.

Theorem 6.8 (Convergence of SOR). Let A be SPD and let TJacobi =
D⁻¹(L + U ) be the iteration matrix for Jacobi (not SOR). If spr(TJacobi) <
1, then SOR converges for any ω with 0 < ω < 2 and there is an optimal
choice of ω, known as ωoptimal, given by

ωoptimal = 2 / (1 + √(1 − (spr(TJacobi))²)).
For ω = ωoptimal and TSOR, TGaussSeidel the iteration matrices for SOR
and Gauss–Seidel respectively, we have

spr(TSOR) = ωoptimal − 1 < spr(TGaussSeidel) ≤ (spr(TJacobi))² < 1.

The dramatic reason SOR was the method of choice for ω = ωoptimal is
that it reduces the exponent in the complexity estimate for the MPP
from O(h⁻²) to O(h⁻¹).
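A short MATLAB/Octave sketch comparing the three spectral radii for the 2D MPP; it assumes spr(TJacobi) = cos(πh), which follows from the 2D eigenvalue formula quoted in Exercise 6.14 below:

N = 100; h = 1/(N+1);
sprJ   = cos(pi*h);                     % Jacobi
sprGS  = sprJ^2;                        % Gauss-Seidel
omega  = 2/(1 + sqrt(1 - sprJ^2));      % omega_optimal
sprSOR = omega - 1;                     % SOR at omega_optimal
[sprJ, sprGS, sprSOR]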

Exercise 6.14. Theorem 5.3 presents the eigenvalues of the 3D MPP matrix,
and the analogous expression for the 2D MPP (A) is

λpq = 4 [ sin²(pπ/(2(N + 1))) + sin²(qπ/(2(N + 1))) ],   for 1 ≤ p, q ≤ N.
Using this expression along with the observation that the diagonal of A is a
multiple of the identity, find spr(TJacobi ) and spr(TSOR ) for ω = ωoptimal .
How many iterations will it take to reduce the error from 1 to 10−8 using:
(a) Jacobi, and (b) SOR with ω = ωoptimal for the case that N = 1000?

6.4.4 Three level over-relaxed FOR


“I guess I should warn you if I turn out to be particularly clear,
you’ve probably misunderstood what I said.”
— Alan Greenspan, at his 1988 confirmation hearings.

Adding a relaxation step to FOR just results in FOR with a changed


value of ρ. It is interesting that if a relaxation step is added to a 2 stage
version of FOR, it can dramatically decrease the number of steps required.
The resulting algorithm is often called Second Order Richardson and it
works like this:

Algorithm 6.10 (Second order Richardson). Given the matrix A,


initial vector u0 , values of ρ and ω, and a maximum number of steps itmax:
Do one FOR step:
r0 = b − Au0
u1 = u0 + r0 /ρ
for n = 1:itmax
rn = b − Aun
    u_TEMP = un + rn/ρ
    un+1 = ω u_TEMP + (1 − ω)un−1
if converged, exit, end


end

It has some advantages over SOR on some parallel architectures
(and some disadvantages as well, such as having to optimize over two
parameters). It can reduce the number of iterations as much as SOR. It takes more
work to program and requires more storage than SOR. However, it is parallel
for the model problem while SOR is less so. In 2 stage algorithms, it is
usual to name the variables uOLD, uNOW and uNEW.

Algorithm 6.11 (Second order Richardson for the MPP). Given a


maximum number of iterations itmax, an (N + 1) × (N + 1) mesh on a
square, starting with guesses uold(i,j), unow(i,j) and choices of ρ =
rho, and ω = omega

h=1/N
for its=1:itmax
for i=2:N
for j=2:N
au = - unow(i+1,j) - unow(i-1,j) ...
    + 4.0*unow(i,j) ...
    - unow(i,j+1) - unow(i,j-1)
r(i,j) = h^2*f(i,j) - au
utemp = unow(i,j) + (1/rho)*r(i,j)
unew(i,j) = omega*utemp + (1-omega)*uold(i,j)
end
end
Test for convergence
if convergence not satisfied
Copy unow to uold and unew to unow
for i=2:N
for j=2:N
uold(i,j)=unow(i,j)
unow(i,j)=unew(i,j)
end
end
else
Exit with converged result

end

end

Convergence analysis has been performed for two-stage methods; see N.K. Nichols, On the convergence of two-stage iterative processes for solving linear equations, SIAM Journal on Numerical Analysis, 10 (1973), 460–469.

6.4.5 Algorithmic issues: storing a large, sparse matrix

“Just as there are wavelengths that people cannot see, and


sounds that people cannot hear, computers may have thoughts that
people cannot think.”
— Richard Hamming, a pioneer numerical analyst

Neo: The Matrix.


Morpheus: Do you want to know what it is?
Neo: Yes.
Morpheus: The Matrix is everywhere.
— an irrelevant quote from the film “The Matrix”

If you are using an iterative method to solve Ax = b, most typically


the method will be written in advance but all references to A will be made
through a function or subroutine that performs the product Ax. It is in
this subroutine that the storage for the matrix A is determined. The “best”
storage scheme for very large systems is highly computer dependent. And,
there are problems for which A need not be stored at all.
For example, for the 2D Model Poisson problem, a residual (and its
norm) can be calculated on the physical mesh as follows.

Algorithm 6.12 (Calculating a residual: MPP). Given an array u of


size N+1 by N+1 containing the current values of the iterate, with correct
values in boundary locations,

rnorm=0
for i=2:N
for j=2:N
au=4*u(i,j)-(u(i+1,j)+u(i-1,j)+u(i,j+1)+u(i,j-1))
r(i,j)=h^2*f(i,j)-au

rnorm=rnorm+r(i,j)^2
end
end
rnorm=sqrt(rnorm/(N-1)^2)

Note that, because the nonzero entries are known and regular, the above did not even need to store the nonzero entries of A. We give one important example of a storage scheme for more irregularly patterned matrices: CRS = Compressed Row Storage.

Example 6.2 (CRS=Compressed Row Storage). Consider a sparse


storage scheme for the following matrix A below (where “· · · ” means all
the rest zeroes).
$$A = \begin{bmatrix}
2 & -1 & 0 & 0 & 3 & 0 & 0 & \cdots\\
0 & 2 & 0 & 1 & 0 & 0 & 5 & \cdots\\
-1 & 2 & -1 & 0 & 1 & 0 & 0 & \cdots\\
0 & 0 & 3 & 2 & 1 & 0 & 1 & \cdots\\
 & & & & \cdots & & &
\end{bmatrix}.$$

To use A we need to first store the nonzero entries. In CRS this is done,
row by row, in a long vector. If the matrix has M nonzero entries we store
them in an array of length M named value
value = [2, −1, 3, 2, 1, 5, −1, 2, −1, 1, 3, 2, 1, 1, . . . ].
Next we need to know the index in the above vector where each row starts. For example, the first 3 entries, 2, −1, 3, come from row 1 of A, and row 2 starts with the next (4th, in this example) entry. This metadata can be stored in an array named row with one entry per matrix row: the first row always starts with the first entry of value, so there is no need to store that index; instead row(i) records where row i + 1 begins and, by convention, the final entry of row is M + 1 (one past the end of value):
row = [4, 7, 11, . . . , M + 1].
Now that we know row 1 contains entries 1, 2, 3 (because row 2 starts with entry 4), we also need to store the column number to which each entry of value corresponds in the global matrix A. This information can be stored in a vector of length M named col:
col = [1, 2, 5, 2, 4, 7, . . . ].
With these three arrays we can calculate the matrix vector product as
follows.

Algorithm 6.13 (Matrix-Vector product with CRS). Given the N -


vector x and the N ×N matrix A stored in CRS, this computes the N -vector
y = Ax.

first=1
for i=1:N
y(i)=0
for j=first:row(i)-1
k=col(j)
y(i)= y(i) + value(j)*x(k)
end
first=row(i)
end
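As a small concrete check of the storage convention and of Algorithm 6.13, the following Matlab-style sketch (our own illustration, with a 4 × 4 matrix chosen for the purpose) builds the three CRS arrays and compares the result against the ordinary dense product:

% A small CRS example. Following the convention above, row(i) holds the
% index in value where row i+1 begins, and the final entry of row is M+1.
A     = [ 2 -1  0  0; -1  2 -1  0;  0 -1  2 -1;  0  0 -1  2 ];
value = [ 2 -1  -1  2 -1  -1  2 -1  -1  2 ];  % M = 10 nonzeros, row by row
col   = [ 1  2   1  2  3   2  3  4   3  4 ];  % column of each stored entry
row   = [ 3  6  9  11 ];                      % rows 2,3,4 start here; last = M+1
N = 4;  x = (1:N)';  y = zeros(N,1);
first = 1;                                    % Algorithm 6.13
for i = 1:N
    y(i) = 0;
    for j = first:row(i)-1
        k = col(j);
        y(i) = y(i) + value(j)*x(k);
    end
    first = row(i);
end
disp(norm(y - A*x))                           % prints 0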

Exercise 6.15. Write a pseudocode routine for calculating x → At x when


A is stored in CRS. Compare your routine to the one given in the Templates book, Barrett, Berry, et al. [Barrett et al. (1994)], Section 4.3.2.

Remark 6.4. The Matlab program uses a variant of CRS to implement


its sparse vector and matrix capabilities. See Gilbert, Moler and Schreiber
[Gilbert et al. (1992)] for more information.

6.5 Dynamic Relaxation

There is one last approach related to stationary iterative methods we need


to mention: Dynamic Relaxation. Dynamic relaxation is very commonly
used in practical computing. In many cases, for a given practical problem, both the evolutionary problem (x′(t) + Ax(t) = b) and the steady state problem (Ax = b) must eventually be solved. In this case it saves programmer time to simply code up a time stepping method for the evolutionary problem and time step to steady state to get the solution of the steady problem Ax = b. The approach is also roughly the same for both linear and nonlinear systems, a highly valuable feature. Excluding programmer effort, however, it is almost never competitive with standard iterative methods for solving linear systems Ax = b. To explain in the simplest case, suppose
A is SPD and consider the linear system Ax = b. We embed this into the
time dependent system of ODEs

$$x'(t) + Ax(t) = b \ \text{ for } 0 < t < \infty, \qquad x(0) = x_0 \ \text{ the initial guess.} \tag{IVP}$$

Since A is SPD it is not hard to show that as t → ∞ the solution x(t) →


A−1 b.

Theorem 6.9. Let A be SPD. Then for any initial guess the unique so-
lution to (IVP) x(t) converges to the unique solution of the linear system
Ax = b:
x(t) → A−1 b as t → ∞.

Thus one way to solve the linear system is to use any explicit method
for the IVP and time step to steady state. There is in fact a 1 − 1 corre-
spondence between time stepping methods for some initial value problem
associated with Ax = b and stationary iterative methods for solving Ax = b.
While this sounds like a deep meta-theorem it is not. Simply identify the it-
eration number n with a time step number and the correspondence emerges.
For example, consider FOR
$$\rho(x^{n+1} - x^n) = b - Ax^n.$$
Rearrange FOR as follows:
$$\frac{x^{n+1} - x^n}{\Delta t} + Ax^n = b, \quad\text{where } \Delta t := \rho^{-1}.$$
This shows that FOR is exactly the forward Euler method for (IVP) with timestep and pseudo-time
$$\Delta t := \rho^{-1}, \quad t_n = n\Delta t, \quad x^n \approx x(t_n).$$
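A minimal numerical check of this correspondence (our own illustration on a small SPD system) is:

% FOR with parameter rho produces exactly the forward Euler iterates
% for x'(t) + A x(t) = b with timestep dt = 1/rho.
A = [4 -1; -1 4];  b = [1; 2];  rho = 5;  dt = 1/rho;
xFOR = [0; 0];  xFE = [0; 0];
for n = 1:20
    xFOR = xFOR + (b - A*xFOR)/rho;   % one FOR step
    xFE  = xFE  + dt*(b - A*xFE);     % one forward Euler step
end
disp(norm(xFOR - xFE))                % zero up to roundoff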
Similarly, the linear system Ax = b can be embedded into a second order equation with damping
$$x''(t) + ax'(t) + Ax(t) = b \ \text{ for } a > 0 \text{ and } 0 < t < \infty, \qquad x(0) = x_0, \ x'(0) = x_1 \ \text{ the initial guesses.}$$
Timestepping gives an iterative method with 2 parameters (a, Δt) and thus resembles second order Richardson:
$$\frac{x^{n+1} - 2x^n + x^{n-1}}{\Delta t^2} + a\,\frac{x^{n+1} - x^{n-1}}{2\Delta t} + Ax^n = b.$$
The reasons this approach is not competitive, if programmer time is not
counted, include:

• The evolutionary problem is forced to compute with physical time


whereas an iterative method can choose some sort of pseudo-time that
leads to steady state faster.

• The evolutionary problem seeks time accuracy for the prescribed problem, whereas iterative methods only seek to get to the steady state solution as fast as possible.

Exercise 6.16. Find the IVP associated with the stationary iterative
methods Gauss–Seidel and SOR.

Exercise 6.17. Complete the connection between second order Richardson


and the second order IVP.

Exercise 6.18. Show that the solution of both IVP’s converges to the
solution of Ax = b as t → ∞.

6.6 Splitting Methods

The classic and often very effective use of dynamic relaxation is in splitting
methods. Splitting methods have a rich history; entire books have been
written to develop aspects of them so we shall give one central and still
important example, the Peaceman–Rachford method. Briefly, the N × N
matrix A is split as

A = A1 + A2

where the subsystems A1 y = RHS1 and A2 y = RHS2 are “easy to solve”.


Usually easy to solve means easy either in computer time or in programmer
effort; often A is split so that A1 and A2 are tridiagonal (or very close
to tridiagonal) or so that you already have a code written to solve the
subsystems that is highly adapted to their specific features. Given that the
uncoupled problems can be solved, splitting methods then are applied to
solve the coupled problems such as9

$$(A_1 + A_2)x = b, \qquad \frac{d}{dt}x(t) + (A_1 + A_2)x(t) = f(t).$$
We consider the first two problems. We stress that splitting methods in-
volve two separate steps and each is important for the success of the whole
method:

• Pick the actual splitting A = A1 + A2 .


• Pick the splitting method to be used with that splitting.
9 Another possibility would be F1(x) + F2(x) = 0, with Fi playing the role of Ai.

The first splitting method and in many ways still the best is the
Peaceman–Rachford method.

Algorithm 6.14 (Peaceman–Rachford Method). Pick parameter ρ > 0. Pick initial guess x0.
Until satisfied: given xn
Solve
(ρI + A1 )xn+1/2 = b + (ρI − A2 )xn , (PR, step 1)
(ρI + A2 )xn+1 = b + (ρI − A1 )xn+1/2 . (PR, step 2)
Test for convergence.
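For a generic splitting the two half steps can be transcribed directly; the following Matlab-style sketch (ours; the sub-solves are written with backslash, whereas in practice ρI + A1 and ρI + A2 would be tridiagonal-like and solved accordingly) performs one full PR sweep:

% One Peaceman-Rachford sweep for A = A1 + A2 (a sketch).
function x = pr_sweep(A1, A2, b, x, rho)
    I = eye(length(b));
    xhalf = (rho*I + A1) \ (b + (rho*I - A2)*x);      % PR, step 1
    x     = (rho*I + A2) \ (b + (rho*I - A1)*xhalf);  % PR, step 2
end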

Each step is consistent with Ax = b. (If xn+1/2 = xn = x then rear-


ranging (ρI + A1 )xn+1/2 = b + (ρI − A2 )xn gives Ax = b.) The first half
step is A1 implicit and A2 explicit while the second half step reverses and
is A2 implicit and A1 explicit. The classic questions are:

• When does it converge?


• How fast does it converge?
• How to pick the methods parameter?

We attack these by using the fundamental tools of numerical linear


algebra.

Lemma 6.2. The Peaceman–Rachford method satisfies


$$x^{n+1} = \tilde b + Tx^n, \quad\text{for a vector } \tilde b \text{ depending on } b, \rho, A_1, A_2,$$
where the iteration operator T = TPR is
$$T_{PR} = T_{PR}(\rho) = (\rho I + A_2)^{-1}(\rho I - A_1)(\rho I + A_1)^{-1}(\rho I - A_2).$$

Proof. Eliminating the intermediate variable xn+1/2 gives this immedi-


ately.

Thus Peaceman–Rachford converges if and only if10 spr(TP R ) < 1. This


is a product of four terms. It can be simplified to a product of the form
F (A1 ) · F (A2 ) by commuting the (non-commutative) terms in the product
using the following observation.

Lemma 6.3 (AB similar to BA). Let A, B be N ×N matrices. If either


A or B is invertible, then AB is similar to BA
AB ∼ BA.
10 Recall the spectral radius is spr(T ) :=max{|λ|: λ is an eigenvalue of T }.

Thus
spr(AB) = spr(BA).

Proof. Exercise!

Define the function11 T : C → C by


$$T(z) = \frac{\rho - z}{\rho + z}.$$
Using the property that we can commute matrices without altering the spectral radius of the product, we find
$$\mathrm{spr}(T_{PR}(\rho)) = \mathrm{spr}\left[(\rho I - A_1)(\rho I + A_1)^{-1}(\rho I - A_2)(\rho I + A_2)^{-1}\right] = \mathrm{spr}\left[T(A_1)T(A_2)\right].$$
We are now ready to prove one of the most famous results in iterative
methods.

Theorem 6.10 (Kellogg’s lemma). Let B be an N × N real matrix. If


$$x^T Bx > 0 \ \text{ for all } 0 \ne x \in \mathbb{R}^N,$$
then
$$\|T(B)\|_2 < 1.$$

Proof. Let x ≠ 0 be given. Then
$$\frac{\|T(B)x\|_2^2}{\|x\|_2^2} = \frac{\langle T(B)x,\ T(B)x\rangle}{\langle x, x\rangle} = \frac{\langle(\rho I - B)(\rho I + B)^{-1}x,\ (\rho I - B)(\rho I + B)^{-1}x\rangle}{\langle x, x\rangle}.$$
Now change variables by y = (ρI + B)−1x, so x = (ρI + B)y. We then have
$$\frac{\|T(B)x\|_2^2}{\|x\|_2^2} = \frac{\langle(\rho I - B)y,\ (\rho I - B)y\rangle}{\langle(\rho I + B)y,\ (\rho I + B)y\rangle} = \frac{\rho^2\|y\|_2^2 - 2\rho\, y^TBy + \|By\|_2^2}{\rho^2\|y\|_2^2 + 2\rho\, y^TBy + \|By\|_2^2}.$$
Checking the numerator against the denominator and recalling yᵀBy > 0 (the hypothesis applied to y ≠ 0), they agree term by term with one minus sign on top and the corresponding sign a plus on the bottom. Thus
$$\frac{\|T(B)x\|_2^2}{\|x\|_2^2} < 1 \quad\text{and}\quad \|T(B)\|_2 \le 1.$$
11 This is an abuse of notation to use T for so many things. However, it is not too

confusing and standard in the area (so just get used to it).

To prove strict inequality, assume equality holds. Then, if ‖T(B)‖2 = 1, there must exist at least one x ≠ 0 for which ‖T(B)x‖2² = ‖x‖2². The same argument shows that for this x, xᵀBx = 0, a contradiction.

Using Kellogg’s lemma we thus have a very strong convergence result


for (PR).

Theorem 6.11 (Convergence of Peaceman–Rachford). Let ρ > 0,


xᵀAᵢx ≥ 0 for i = 1, 2, with xᵀAᵢx > 0 for all 0 ≠ x ∈ ℝᴺ for one of i = 1 or 2. Then (PR) converges.

Proof. We have
spr(TP R ) = spr [T (A1 )T (A2 )] .
By Kellogg’s lemma we have for both i = 1, 2, ||T (Ai )||2 ≤ 1 with one of
||T (Ai )||2 < 1. Thus, ||T (A1 )T (A2 )||2 < 1 and
spr(T (A1 )T (A2 )) ≤ ||T (A1 )T (A2 )||2 < 1.

Remark 6.5 (What does the condition xT Ai x > 0 mean?).


Peaceman–Rachford is remarkable in that its convergence is completely
insensitive to the skew symmetric part of A. To see this, recall that any
matrix can be decomposed into the sum of its symmetric part and its skew-
symmetric part A = As + Ass by
$$A^s = \frac{1}{2}(A + A^T), \ \text{ so } (A^s)^T = A^s, \qquad A^{ss} = \frac{1}{2}(A - A^T), \ \text{ so } (A^{ss})^T = -A^{ss}.$$
For x a real vector it is not hard to check12 that xT Ass x ≡ 0. Thus,
xT Ai x > 0 means:
The symmetric part of A is positive definite. The skew symmetric part
of A is arbitrary and can thus be arbitrarily large.

Exercise 6.19. The following iteration is silly in that each step costs as
much as just solving Ax = b. Nevertheless, (and ignoring this aspect of it)
prove convergence for matrices A with xT Ax > 0 and analyze the optimal
parameter:
$$(\rho I + A)x^{n+1} = b + \rho x^n.$$
12 a := xᵀAˢˢx = (xᵀAˢˢx)ᵀ = xᵀ(Aˢˢ)ᵀx ≡ −xᵀAˢˢx. Thus a = −a so a = 0.

Exercise 6.20. Analyze convergence of the Douglas–Rachford method


given by:
(ρI + A1 )xn+1/2 = b + (ρI − A2 )xn ,
(ρI + A2 )xn+1 = A2 xn + ρxn+1/2 .

6.6.1 Parameter selection


The Peaceman–Rachford method requires a one parameter optimization
problem be solved to pick ρ. We shall use exactly the same method as for
FOR to solve the problem for SPD A. The solution process is exactly the
same as for FOR but its conclusion is quite different, as we shall see. In
this section we shall thus assume further that
A, A1 , A2 are SPD.
We shall actually solve the following problem where B plays the role of
A1 , A2 :

Problem 6.1. Given the N × N SPD matrix B, find


$$\rho_{optimal} = \arg\min_{\rho}\|T(B)\|_2, \quad\text{or}\quad \rho_{optimal} = \arg\min_{\rho}\ \max_{\lambda_{min}(B)\le\lambda\le\lambda_{max}(B)}\left|\frac{\rho-\lambda}{\rho+\lambda}\right|.$$

Consider φ(ρ) = |(ρ − λ)/(ρ + λ)|. We follow the same steps as for FOR and sketch the curves below for several values of λ. We do this in two steps. In Figure 6.5 we plot examples of y = |(ρ − λ)/(ρ + λ)|. The upper envelope is the curve lying on top of the family; we plot it in Figure 6.6 with the two curves that comprise it. Solving for the optimal value by calculating the intersection point of the two curves comprising the upper envelope, we find
$$\rho_{optimal} = \sqrt{\lambda_{max}\lambda_{min}}, \qquad \|T_{PR}(\rho_{optimal})\|_2 = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} = 1 - \frac{2}{\sqrt{\kappa}+1}, \quad\text{where } \kappa = \lambda_{max}/\lambda_{min}.$$

6.6.2 Connection to dynamic relaxation


The PR method can be written in residual-update form by eliminating the
intermediate step and rearranging. There results
$$\frac{1}{2\rho}(\rho I + A_1)(\rho I + A_2)\left(x^{n+1} - x^n\right) = b - Ax^n.$$

Fig. 6.5 The family y = |(ρ − λ)/(ρ + λ)| for four values of λ.

Fig. 6.6 The dark curve is y = max_{a≤λ≤b} |(ρ − λ)/(ρ + λ)|.

6.6.3 The ADI splitting

To motivate the first (and still important) splitting A = A1 + A2 , we recall


a remark from the Gaussian elimination chapter.

Remark 6.6 (How fast is tridiagonal Gaussian elimination?).


Tridiagonal Gaussian elimination has 1 loop. Inside each loop roughly 3
arithmetic operations are performed. Thus, O(N ) floating point operations
are done inside tridiagonal Gaussian elimination for an N × N matrix. If
we solve an N × N linear system with a diagonal matrix, it will take N
divisions (one for each diagonal entry). The operation count of 3N − 3
multiplies and divides for tridiagonal elimination is remarkable. Solving a
tridiagonal linear system is almost as fast as solving a diagonal (completely
uncoupled) linear system.

Indeed, consider the 2D MPP. Recall that the domain is the unit square,
Ω = (0, 1) × (0, 1). Approximate uxx and uyy by
$$u_{xx}(a,b) \doteq \frac{u(a+\Delta x, b) - 2u(a,b) + u(a-\Delta x, b)}{\Delta x^2}, \tag{6.4}$$
$$u_{yy}(a,b) \doteq \frac{u(a, b+\Delta y) - 2u(a,b) + u(a, b-\Delta y)}{\Delta y^2}. \tag{6.5}$$
Introduce a uniform mesh on Ω with N interior points in each direction: Δx = Δy = 1/(N + 1) =: h and
$$x_i = ih, \quad y_j = jh, \quad i, j = 0, 1, \ldots, N+1.$$
Let u_{ij} denote the approximation to u(x_i, y_j) we will compute at each mesh point. On the boundary use
$$u_{ij} = g(x_i, y_j) \ (\text{here } g \equiv 0) \ \text{ for each } x_i, y_j \text{ on } \partial\Omega$$
and eliminate the boundary points from the linear system. For a typical (x_i, y_j) inside Ω we use
$$-\left(\frac{u_{i+1\,j} - 2u_{ij} + u_{i-1\,j}}{h^2} + \frac{u_{i\,j+1} - 2u_{ij} + u_{i\,j-1}}{h^2}\right) = f(x_i, y_j) \quad\text{for all } (x_i, y_j) \text{ inside of } \Omega, \tag{6.6}$$
$$u_{ij} = g(x_i, y_j) \ (\equiv 0) \ \text{ at all } (x_i, y_j) \text{ on } \partial\Omega. \tag{6.7}$$
The boundary unknowns can be eliminated giving an N² × N² linear system for the N² unknowns:
$$A_{N^2\times N^2}\, u_{N^2\times 1} = f_{N^2\times 1}.$$
To split A with the ADI = Alternating Direction Implicit splitting we use the directional splitting already given above: A = A_1 + A_2, where
$$(A_1 u)_{ij} = -\frac{u_{i+1\,j} - 2u_{ij} + u_{i-1\,j}}{h^2}, \qquad (A_2 u)_{ij} = -\frac{u_{i\,j+1} - 2u_{ij} + u_{i\,j-1}}{h^2}.$$

Remark 6.7. Solving (ρI + Ai )v = RHS requires solving one N × N ,


tridiagonal linear system per horizontal mesh line (when i = 1 ) or ver-
tical mesh line (when i = 2). Solving tridiagonal linear systems is very
efficient in both time and storage; one Peaceman–Rachford step with the
ADI splitting is of comparable cost to 6 FOR steps.
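To make the cost comparison concrete, here is a Matlab-style sketch of the first ADI/PR half-sweep (our own illustration; it assumes u and f are (N+1) × (N+1) arrays with boundary values stored in u, h the meshwidth and rho the PR parameter). Each horizontal mesh line j requires one tridiagonal solve:

e = ones(N-1,1);
T = spdiags([-e 2*e -e], -1:1, N-1, N-1)/h^2;  % 1D operator A1 along a line
uhalf = u;                                     % u^{n+1/2}; boundary values kept
for j = 2:N
    rhs = zeros(N-1,1);
    for i = 2:N
        A2u      = (-u(i,j+1) + 2*u(i,j) - u(i,j-1))/h^2;  % (A2 u^n) at (i,j)
        rhs(i-1) = f(i,j) + rho*u(i,j) - A2u;              % b + (rho I - A2)u^n
    end
    uhalf(2:N,j) = (rho*speye(N-1) + T) \ rhs;  % tridiagonal solve, O(N) work
end
% The second half-sweep is analogous: sweep the vertical lines i, solving
% (rho I + A2) systems with right side b + (rho I - A1) uhalf.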

Exercise 6.21. If one full PR-ADI step costs the same as 6 FOR steps,
is it worth doing PR-ADI? Answer this question using results on condition
numbers of tridiag(−1, 2, −1) and the estimates of number of steps per
significant digit for each method.

Chapter 7

Solving Ax = b by Optimization

“According to my models, we are doubling the paradigm shift


rate approximately every decade.”
— From a letter to Scientific American by Ray Kurzweil

“Fundamentals, fundamentals. If you don’t have them you’ll


run into someone else’s.”
— Virgil Hunter (Boxing trainer)

Powerful methods exist for solving Ax = b when A is SPD based on a deep


connection to an optimization problem. These methods are so powerful that
often the best methods for solving a general linear system Bx = f is to
pass to the least squares equations (B t B) x = B t f in which the coefficient
matrix A := B t B is now coerced to be SPD (at the expense of squaring
its condition number). We begin to develop them in this chapter, starting
with some background.

Definition 7.1 (SPD matrices). AN ×N is SPD if it is symmetric,


that is A = Aᵗ, and positive definite, that is xᵗAx > 0 for x ≠ 0.
If A and B are symmetric we say A > B if A − B is SPD, i.e., if xᵗAx > xᵗBx for all x ≠ 0.
A is negative definite if −A is positive definite.
A is nonsymmetric if A ≠ Aᵗ, skew-symmetric if Aᵗ = −A and indefinite if there are choices of x for which xᵗAx is both positive and negative.
A nonsymmetric (real) matrix A satisfying xᵗAx > 0 for all real vectors x ≠ 0 is called positive real.


It is known that a symmetric matrix A is positive definite if and only if


all λ(A) > 0.

Lemma 7.1 (The A-inner product). If A is SPD then


$$\langle x, y\rangle_A = x^t Ay$$
is a weighted inner product on ℝᴺ, the A-inner product, and
$$\|x\|_A = \sqrt{\langle x, x\rangle_A} = \sqrt{x^t Ax}$$
is a weighted norm, the A-norm.

Proof. ⟨x, y⟩_A is bilinear:
$$\langle u + v, y\rangle_A = (u+v)^t Ay = u^t Ay + v^t Ay = \langle u, y\rangle_A + \langle v, y\rangle_A$$
and
$$\langle \alpha u, y\rangle_A = (\alpha u)^t Ay = \alpha\left(u^t Ay\right) = \alpha\langle u, y\rangle_A.$$
⟨x, y⟩_A is positive:
$$\langle x, x\rangle_A = x^t Ax > 0 \ \text{ for } x \ne 0, \ \text{ since } A \text{ is SPD.}$$
⟨x, y⟩_A is symmetric:
$$\langle x, y\rangle_A = x^t Ay = (x^t Ay)^t = y^t A^t x = y^t Ax = \langle y, x\rangle_A.$$
Thus ⟨x, y⟩_A is an inner product and, as a result, ‖x‖_A = √⟨x, x⟩_A is an induced norm on ℝᴺ.

We consider two examples that show that the A-norm is a weighted ℓ²-type norm.

Example 7.1. Let
$$A = \begin{bmatrix} 1 & 0\\ 0 & 2\end{bmatrix}.$$
Then the A-norm of [x₁, x₂]ᵗ satisfies
$$\|x\|_A^2 = [x_1, x_2]\begin{bmatrix} 1 & 0\\ 0 & 2\end{bmatrix}\begin{bmatrix} x_1\\ x_2\end{bmatrix} = x_1^2 + 2x_2^2,$$
which is exactly a weighted ℓ² norm.

Example 7.2. Let A₂ₓ₂ be SPD with eigenvalues λ₁, λ₂ (both positive) and orthonormal eigenvectors φ₁, φ₂:
$$A\varphi_j = \lambda_j\varphi_j.$$
Let x ∈ ℝ² be expanded as
$$x = \alpha\varphi_1 + \beta\varphi_2.$$
Then, by orthogonality, the ℓ² norm is calculable from either set of coordinates (x₁, x₂) or (α, β) the same way:
$$\|x\|^2 = x_1^2 + x_2^2 = \alpha^2 + \beta^2.$$
On the other hand, consider the A-norm:
$$\|x\|_A^2 = (\alpha\varphi_1 + \beta\varphi_2)^t A(\alpha\varphi_1 + \beta\varphi_2) = (\alpha\varphi_1 + \beta\varphi_2)^t(\alpha\lambda_1\varphi_1 + \beta\lambda_2\varphi_2) = \lambda_1\alpha^2 + \lambda_2\beta^2,$$
by orthogonality of φ₁, φ₂. Comparing the ℓ² norm and the A-norm,
$$\|x\|^2 = \alpha^2 + \beta^2 \quad\text{and}\quad \|x\|_A^2 = \lambda_1\alpha^2 + \lambda_2\beta^2,$$
we see that ‖·‖_A is again exactly a weighted ℓ² norm, weighted by the eigenvalues of A.
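A quick numerical confirmation of this weighted-norm interpretation (our own illustration with an arbitrary small SPD matrix):

% The A-norm squared equals the eigenvalue-weighted sum of squares of the
% coordinates of x in the orthonormal eigenbasis of A.
A = [3 1; 1 2];                      % a small SPD matrix
[Phi, Lam] = eig(A);                 % orthonormal eigenvectors and eigenvalues
x = [2; -1];
c = Phi' * x;                        % coordinates of x in the eigenbasis
disp([x'*A*x, sum(diag(Lam).*c.^2)]) % the two numbers agree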

Exercise 7.1. For A either (i) not symmetric, or (ii) indefinite, consider

$$x, y \mapsto \langle x, y\rangle_A.$$

In each case, what properties of an inner product fail?

Exercise 7.2. If A is skew symmetric show that for real vectors xt Ax = 0.


Given an N × N matrix A, split A into its symmetric and skew-symmetric
parts by
$$A^{symmetric} = \frac{A + A^t}{2}, \qquad A^{skew} = \frac{A - A^t}{2}.$$
Verify that A = Asymmetric + Askew . Use this splitting to show that any
positive real matrix is the sum of an SPD matrix and a skew symmetric
matrix.

7.1 The Connection to Optimization

“Nature uses as little as possible of anything.”


— Kepler, Johannes (1571–1630)

We consider the solution of Ax = b for SPD matrices A. This system


has a deep connection to an associated optimization problem. For A an
N × N SPD matrix, define the function J : ℝᴺ → ℝ by
$$J(x) = \frac{1}{2}x^t Ax - x^t b.$$
Theorem 7.1. Let A be SPD. The solution of Ax = b is the unique mini-
mizer of J(x). There holds

$$J(x + y) = J(x) + \frac{1}{2}y^t Ay > J(x) \quad\text{for any } y \ne 0.$$
Further, if x̃ is any other vector in ℝᴺ then
$$\|x - \tilde x\|_A^2 = 2\left(J(\tilde x) - J(x)\right). \tag{7.1}$$

Proof. This is an identity. First note that, since A is SPD, if y ≠ 0 then yᵗAy > 0. We use x = A⁻¹b and the symmetry of A. Expand and collect terms:
$$J(x + y) = \frac{1}{2}(x+y)^t A(x+y) - (x+y)^t b = \frac{1}{2}x^t Ax + y^t Ax + \frac{1}{2}y^t Ay - x^t b - y^t b$$
$$= J(x) + y^t(Ax - b) + \frac{1}{2}y^t Ay = J(x) + \frac{1}{2}y^t Ay > J(x).$$
This is the first claim. The second claim is also an identity: we expand the LHS and RHS and cancel terms until we reach something equal. The formal proof is then this verification in reverse. Indeed, expanding,
$$\|x - \tilde x\|_A^2 = (x - \tilde x)^t A(x - \tilde x) = x^t Ax - \tilde x^t Ax - x^t A\tilde x + \tilde x^t A\tilde x$$
$$= (\text{since } Ax = b) = x^t b - \tilde x^t b - \tilde x^t b + \tilde x^t A\tilde x = \tilde x^t A\tilde x - 2\tilde x^t b + x^t b$$
and
$$2\left(J(\tilde x) - J(x)\right) = \tilde x^t A\tilde x - 2\tilde x^t b - x^t Ax + 2x^t b = (\text{since } Ax = b) = \tilde x^t A\tilde x - 2\tilde x^t b + x^t b,$$
which are obviously equal. Each step is reversible so the result is proven.

Thus, for SPD A we can write.

Corollary 7.1. For A SPD the following problems are equivalent:


solve : Ax = b,
minimize : J(y).

The equivalence can be written using the terminology of optimization


as x = A−1 b is the argument that minimizes J(y):
$$x = \arg\min_{y\in\mathbb{R}^N} J(y).$$
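The identity (7.1) behind this equivalence is easy to check numerically; the following sketch (ours, with an arbitrary small SPD example) compares its two sides:

% Check of (7.1): ||x - xtilde||_A^2 = 2( J(xtilde) - J(x) ).
A = [4 1; 1 3];  b = [1; 2];
x  = A \ b;                          % exact solution
xt = [0.3; -0.7];                    % any other vector
J  = @(v) 0.5*v'*A*v - v'*b;
disp([ (x-xt)'*A*(x-xt),  2*(J(xt) - J(x)) ])   % equal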

Example 7.3 (The 2 × 2 case). Consider the 2 × 2 linear system A\vec{x} = \vec{b}. Let A be the symmetric 2 × 2 matrix
$$A = \begin{bmatrix} a & c\\ c & d\end{bmatrix}.$$
Calculating the eigenvalues, it is easy to check that A is SPD if and only if
$$a > 0, \quad d > 0, \quad\text{and}\quad c^2 - ad < 0.$$
Consider the energy functional J(x). Since \vec{x} is a 2-vector, denote it by (x, y)ᵗ. Since the range is scalar, z = J(x, y) is an energy surface:
$$z = J(x,y) = \frac{1}{2}(x, y)\begin{bmatrix} a & c\\ c & d\end{bmatrix}\begin{pmatrix} x\\ y\end{pmatrix} - (x, y)\begin{pmatrix} b_1\\ b_2\end{pmatrix} = \frac{1}{2}\left(ax^2 + 2cxy + dy^2\right) - (b_1 x + b_2 y).$$
This surface, shown in Figure 7.1, is a paraboloid opening upward if and only if the above condition on the eigenvalues holds: a > 0, d > 0, and c² − ad < 0. One example is plotted below. The solution of Ax = b is the point in the x–y plane where z = J(x, y) attains its minimum value.

Minimization problems have the added advantage that it is easy to cal-


culate if an approximate solution has been improved: If J(the new value) <
J(the old value) then it has! It is important that the amount by which J(·) decreases correlates exactly with the decrease in the A-norm error, as follows.
Equation (7.1) shows clearly that solving Ax = b (so x̃ = x) is equivalent to minimizing J(·) (since J(x̃) ≥ J(x), with equality only when x̃ ≡ x).
Theorem 7.1 and Corollary 7.1 show that powerful tools from optimization
can be used to solve Ax = b when A is SPD. There is a wide class of iterative
methods from optimization that take advantage of this equivalence: descent
methods. The prototypical descent method is as follows.

Fig. 7.1 An example of the energy surface z = J(x, y).

Algorithm 7.1 (General descent method). Given Ax = b, a qua-


dratic functional J(·) that is minimized at x = A−1 b, a maximum num-
ber of iterations itmax and an initial guess x0 :

Compute r0 = b − Ax0
for n=1:itmax
(∗) Choose a direction vector dn
Find α = αn by solving the 1D minimization problem:
(∗∗) αn = arg minα Φ(xn + αdn )
xn+1 = xn + αn dn
rn+1 = b − Axn+1
if converged, exit, end
end

The most common examples of steps (∗) and (∗∗) are:

• Functional: Φ(x) := J(x) = ½xᵗAx − xᵗb,


• Descent direction: dn = −∇J(xn ).

These choices yield the steepest descent method. Because the functional
J(x) is quadratic, there is a very simple formula for αn in step (∗∗) for
steepest descent:
$$\alpha_n = \frac{d^n\cdot r^n}{d^n\cdot Ad^n}, \quad\text{where } r^n = b - Ax^n. \tag{7.2}$$

It will be convenient to use the ⟨·, ·⟩ notation for dot products, so this formula is equivalently written
$$\alpha_n = \frac{\langle d^n, r^n\rangle}{\langle d^n, Ad^n\rangle} = \frac{\langle d^n, r^n\rangle}{\langle d^n, d^n\rangle_A}.$$
The difference between descent methods arises from:

(1) The functional minimized, and most commonly,


(2) The choice of descent direction.

Many choices of descent direction and functionals have been tried. Ex-
amples of other choices include the following:
Choice of descent direction dn :

• Steepest descent direction: dn = −∇J(xn ).


• Random directions: dn = a randomly chosen vector
• Gauss–Seidel like descent: dn cycles through the standard basis of unit
vectors e1 , e2 , · · ·, eN and repeats if necessary.
• Conjugate directions: dn cycles through an A-orthogonal set of vectors.

Choice of Functionals to minimize:

• If A is SPD the most common choice is
$$J(x) = \frac{1}{2}x^t Ax - x^t b.$$
• Minimum residual methods: for general A,
$$J(x) := \frac{1}{2}\langle b - Ax,\ b - Ax\rangle.$$
• Various combinations such as residuals plus updates:
$$J(x) := \frac{1}{2}\langle b - Ax,\ b - Ax\rangle + \frac{1}{2}\langle x - x^n,\ x - x^n\rangle.$$
Exercise 7.3. Prove (7.2), that αₙ = (dⁿ · rⁿ)/(dⁿ · Adⁿ). Hint: Set d/dα (J(x + αd)) = 0 and solve.

Exercise 7.4. Consider solving Ax = b by a descent method for a general


non-SPD matrix A. Rewrite the above descent algorithm to minimize at
each step ||b − Axn ||22 := rn · rn . Find a formula for αn . Find the steepest
descent direction for ||b − Axn ||22 .

Exercise 7.5. If A is an N × N SPD matrix and one has access to a


complete set of A-orthogonal vectors φ1 , · · ·, φN show that the solution to
Ax = b can be written down in closed form (but using inner products).
Find the number of FLOPs required to get the solution by just calculating
the closed form solution.

Exercise 7.6. For A SPD and C an N ×N matrix and ε a small parameter,


consider the minimization problem:
$$x_\varepsilon = \arg\min_x J_\varepsilon(x) := \frac{1}{2}x^t Ax + \frac{1}{2\varepsilon}\|Cx\|_2^2 - x^t b.$$
Find the linear system xε satisfies. Prove the coefficient matrix is SPD. Show that xε → A⁻¹b as ε → ∞. Next consider the case ε → 0 and show xε → Nullspace(C), i.e., Cxε → 0.

Exercise 7.7. Let A be the symmetric 2 × 2 matrix


$$A = \begin{bmatrix} a & c\\ c & d\end{bmatrix}.$$
Find a necessary and sufficient condition on trace(A) and det(A) for A to
be SPD.

7.2 Application to Stationary Iterative Methods

“As a historian, I cannot believe how low the standards are in


mathematics! In my field, no one would put forth an argument
without at least ten proofs, whereas in mathematics they stop as
soon as they have found a single one!”
— An irate historian berating Andrey Kolmogorov.
“How long will you delay to be wise?” — Epictetus

Consider a stationary iterative method based on decomposing A by


A = M − N. With this additive decomposition, Ax = b is equivalent to
M x = b + N x. The induced iterative method is then
$$M(x^{n+1} - x^n) = b - Ax^n \quad\text{or}\quad Mx^{n+1} = b + Nx^n. \tag{7.3}$$
Obviously, if M = A this converges in one step, but that one step is just solving Ax = b. The matrix M must approximate A and yet systems Mxⁿ⁺¹ = RHS must also be very easy to solve. Sometimes such a matrix M is called a preconditioner of the matrix A and A = M − N is often called a regular splitting of A. Examples include

• FOR: M = ρI, N = ρI − A
• Jacobi: M = diag(A)
• Gauss–Seidel: M = D + L (the lower triangular part of A).

Householder (an early giant in numerical linear algebra and matrix the-
ory) proved a very simple identity for (7.3) when A is SPD.

Lemma 7.2 (Householder lemma). Let A be SPD and let xn be given


by (7.3). With eₙ = x − xⁿ,
$$e_n^t Ae_n - e_{n+1}^t Ae_{n+1} = (x^{n+1} - x^n)^t P (x^{n+1} - x^n), \tag{7.4}$$
where P = M + Mᵗ − A.

Proof. This is an identity: expand both sides and cancel to check that is
true. Next reverse the steps to give the proof.
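The identity (7.4) can also be verified numerically; the sketch below (our own illustration, using one Jacobi step on a small SPD system) evaluates both sides:

% Check of the Householder identity (7.4) for one step of (7.3).
A = [4 1 0; 1 5 2; 0 2 6];  b = [1; 2; 3];   % a small SPD example
M = diag(diag(A));  P = M + M' - A;          % Jacobi splitting, P = 2 diag(A) - A
x  = A \ b;  x0 = zeros(3,1);
x1 = x0 + M \ (b - A*x0);                    % one step of (7.3)
e0 = x - x0;  e1 = x - x1;  d = x1 - x0;
disp([ e0'*A*e0 - e1'*A*e1,  d'*P*d ])       % the two numbers agree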

Corollary 7.2 (Convergence of FOR, Jacobi, GS and SOR).


For A SPD, if P is positive definite then (7.3) converges. The convergence is monotonic in the A-norm:
$$\|e_n\|_A > \|e_{n+1}\|_A > \cdots \longrightarrow 0 \quad\text{as } n \to \infty.$$

Proof. The proof is easy, but there are so many tools at hand that it is also easy to start on the wrong track and get stuck there. Note that ‖eₙ‖_A is monotone decreasing and bounded below by zero. Thus it has a non-negative limit. Since it converges to something, the Cauchy criterion implies that
$$e_n^t Ae_n - e_{n+1}^t Ae_{n+1} \to 0.$$
Now reconsider the Householder relation (7.4). Since the LHS → 0 we must have the RHS → 0 too.1 Since P > 0, this means
$$\|x^{n+1} - x^n\|_P \to 0.$$
Finally the iteration itself,
$$M\left(x^{n+1} - x^n\right) = b - Ax^n,$$
implies that if xⁿ⁺¹ − xⁿ → 0 (the LHS), then the RHS does also: b − Axⁿ → 0, so convergence follows.
1 This step is interesting to the study of human errors. Since we spend our lifetime

reading and writing L to R, top to bottom, it is common for our eyes and brain to
process the mathematics = sign as a one directional relation ⇒ when we are in the
middle of a proof attempt.

Let us apply this result to the above examples.

Theorem 7.2 (Convergence of FOR, GS, Jacobi, SOR).2

• FOR converges monotonically in ‖·‖_A if P = M + Mᵗ − A = 2ρI − A > 0, i.e., if ρ > ½λmax(A).
• Jacobi converges monotonically in the A-norm if diag(A) > ½A.
• Gauss–Seidel converges monotonically in the A-norm for SPD A in all cases.
• SOR converges monotonically in the A-norm if 0 < ω < 2.

Proof. This follows easily from Householder's result as follows. For Jacobi, M = diag(A), so
$$P = M + M^t - A = 2\,\mathrm{diag}(A) - A > 0 \quad\text{if } \mathrm{diag}(A) > \tfrac{1}{2}A.$$
For GS, since A = D + L + U where (since A is symmetric) U = Lᵗ,
$$P = M + M^t - A = (D + L) + (D + L)^t - A = D + L + D + L^t - (D + L + L^t) = D > 0$$
for SPD A. For SOR we calculate (as above, using Lᵗ = U)
$$P = M + M^t - A = M + M^t - (M - N) = M^t + N = \omega^{-1}D + L^t + \frac{1-\omega}{\omega}D - U = \frac{2-\omega}{\omega}D > 0,$$
for 0 < ω < 2. Convergence of FOR in the A-norm is left as an exercise.

“WRITE. FINISH THINGS. KEEP WRITING.”


— Neil Gaiman

Exercise 7.8. Consider the 2D MPP on a uniform N ×N mesh. Divide the


domain in half Ω = Ω1 ∪ Ω2 (not through any meshpoints) partitioning the
mesh into two subsets of equal numbers. This then partitions the solution
and RHS accordingly as (if we first order the mesh points in Ω1 then in Ω2 )
u = (u₁, u₂). Show that the MPP then takes the block form
$$\begin{bmatrix} A_1 & -C\\ -C & A_2\end{bmatrix}\begin{pmatrix} u_1\\ u_2\end{pmatrix} = \begin{pmatrix} f_1\\ f_2\end{pmatrix}.$$
2 Here A > B means (A − B) is positive definite, i.e., xᵗAx > xᵗBx for all x ≠ 0. Also, monotonic convergence in the A-norm means the errors satisfy ‖eₙ₊₁‖_A < ‖eₙ‖_A for all n.

Find the form of A1 and A2 . Show that they are diagonally semi-dominant.
Look up the definition and show they are also irreducibly diagonally dom-
inant. Show that the entries in C are nonnegative.

Exercise 7.9 (Convergence of block Jacobi). Continuing the last


problem, consider the block Jacobi method given below
$$\begin{bmatrix} A_1 & 0\\ 0 & A_2\end{bmatrix}\left(\begin{pmatrix} u_1^{n+1}\\ u_2^{n+1}\end{pmatrix} - \begin{pmatrix} u_1^{n}\\ u_2^{n}\end{pmatrix}\right) = \begin{pmatrix} f_1\\ f_2\end{pmatrix} - \begin{bmatrix} A_1 & -C\\ -C & A_2\end{bmatrix}\begin{pmatrix} u_1^{n}\\ u_2^{n}\end{pmatrix}.$$
Use Householder's lemma (Lemma 7.2) to prove this converges.

Exercise 7.10. Repeat the above for block FOR and for Block Gauss–
Seidel.

Exercise 7.11 (Red-Black block methods). Consider the 2D MPP on


a uniform N × N mesh. Draw a representative mesh and color the mesh-
points by red-black like a typical checkerboard (chess players should think of green and buff). Note that the 5 point star stencil links red points only to
black and black only to red. Order the unknowns as first red then black, par-
titioning the mesh vertices into two subsets of about equal numbers. This
then partitions the solution and RHS accordingly as u = (uRED , uBLACK ).
Show that the MPP then takes the block form
$$\begin{bmatrix} A_1 & -C\\ -C & A_2\end{bmatrix}\begin{pmatrix} u_{RED}\\ u_{BLACK}\end{pmatrix} = \begin{pmatrix} f_{RED}\\ f_{BLACK}\end{pmatrix}.$$
Find the form of A1,2 . It will be best to do this for a fixed, e.g., 4 × 4
mesh before jumping to the general mesh. Analyze the structure of the
submatrices. Based on their structure, propose and analyze convergence of
a block iterative method. Again, try it on a 4 × 4 mesh first.

Exercise 7.12. Let A be N × N and SPD. Consider FOR for solving


Ax = b. Show that for optimal ρ we have
$$J(x^n) - J(x) \le \left(\frac{\lambda_{max}(A) - \lambda_{min}(A)}{\lambda_{max}(A) + \lambda_{min}(A)}\right)\left(J(x^{n-1}) - J(x)\right).$$
Express the multiplier (λmax − λmin)/(λmax + λmin) in terms of cond₂(A).

Exercise 7.13. Consider the proof of convergence when P > 0. This proof
goes back and forth between the minimization structure of the iteration and
the algebraic form of it. Try to rewrite the proof entirely in terms of the
functional J(x) and ∇J(x).

Exercise 7.14. Give a complete and detailed proof of the Householder


lemma. Give the details of the proof that P > 0 implies convergence.

7.3 Application to Parameter Selection

“The NSA is a self-licking ice cream cone.”


— An anonymous senior official of the National Security
Agency.

Consider Richardson’s method FOR for A an SPD matrix:


ρ(xn+1 − xn ) = b − Axn , or xn+1 = xn + ρ−1 rn
where rn = b − Axn . We have an idea of “optimal” value of ρ
ρoptimal = (λmax (A) + λmin (A)) /2
which minimizes the maximum error over all possible initial conditions. It
is, alas, hard to compute.
We consider here another idea of optimal:

Given xn , find ρ = ρn which will make xn+1 as accurate as


possible on that step.

The algorithm would be:

Algorithm 7.2. Given x0 and a maximum number of iterations, itmax:


for n=0:itmax
Compute rn = b − Axn
Compute ρn via a few auxiliary calculations
xn+1 = xn + (ρn )−1 rn
if converged, exit, end
end

Lemma 7.3. In exact arithmetic, the residuals rn = b − Axn of FOR


satisfy
$$r^{n+1} = r^n - \rho^{-1}Ar^n.$$

Proof. Since xⁿ⁺¹ = xⁿ + ρ⁻¹rⁿ, multiply by "−A" and add b to both sides. This gives
$$b - Ax^{n+1} = b - Ax^n - \rho^{-1}Ar^n,$$
which is the claimed iteration.

Exercise 7.15. In Chapter 6, Exercise 6.4 (page 120) required a computer


program implementing FOR for the 2D MPP with ρ = 4 (the Jacobi method). Modify this computer program so that it can use an arbitrary ρ. The 2D analog of Theorem 5.3 (page 110) in Chapter 5 shows that λmax ≐ 8 and λmin ≐ h², so a reasonably good choice is ρ = (8 + h²)/2.
Test your program by solving the 2D MPP with h = 1/10, RHS f = 0,
and with boundary conditions g(x, y) = x−y. Use as initial guess the exact
solution u = x − y. You should observe convergence in a single iteration.
If it takes more than five iterations, or if it does not converge, you have an
error in your program.
How many iterations are required to reach a convergence criterion of
1.e − 4 when h = 1/100 and the initial guess is u(x, y) = 0 in the interior
and u(x, y) = x − y on the boundary?

Different formulas for selecting ρ emerge from different interpretations


of what “as accurate as possible” means.
Option 1: Residual minimization: Pick ρₙ to minimize ‖rⁿ⁺¹‖₂. By the last lemma,
$$\|r^{n+1}\|_2^2 = \|r^n - \rho^{-1}Ar^n\|_2^2 = \langle r^n - \rho^{-1}Ar^n,\ r^n - \rho^{-1}Ar^n\rangle.$$
Since rⁿ is fixed, this is a simple function of ρ,
$$\widetilde J(\rho) = \langle r^n - \rho^{-1}Ar^n,\ r^n - \rho^{-1}Ar^n\rangle = \langle r^n, r^n\rangle - 2\langle r^n, Ar^n\rangle\rho^{-1} + \langle Ar^n, Ar^n\rangle\rho^{-2}.$$
Setting \widetilde J'(\rho) = 0 and solving for ρ = ρoptimal gives
$$\rho_{optimal} = \frac{\langle Ar^n, Ar^n\rangle}{\langle r^n, Ar^n\rangle}.$$
The cost of using this optimal value at each step: two extra dot products per step.
Option 2: J minimization: Pick ρₙ to minimize J(xⁿ⁺¹). In this case we define
$$\varphi(\rho) = J(x^{n+1}) = J(x^n + \rho^{-1}r^n) = \frac{1}{2}(x^n + \rho^{-1}r^n)^t A(x^n + \rho^{-1}r^n) - (x^n + \rho^{-1}r^n)^t b.$$
Expanding, setting φ′(ρ) = 0 and solving, as before, gives
$$\rho_{optimal} = \frac{\langle r^n, Ar^n\rangle}{\langle r^n, r^n\rangle}.$$

Option 2 is only available for SPD A. However, for such A it is preferable


to Option 1. It gives the algorithm

Algorithm 7.3. Given x0 the matrix A and a maximum number of itera-


tions, itmax:

r1 = b − Ax1
for n=1:itmax
ρn = ⟨Arn, rn⟩/⟨rn, rn⟩
xn+1 = xn + (ρn)−1 rn
if satisfied, exit, end
rn+1 = b − Axn+1
end
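A dense-matrix transcription of Algorithm 7.3 is short; the sketch below (ours; for the 2D MPP the products Ax and Ar would instead be computed on the mesh) may help when testing:

% Algorithm 7.3: FOR with the Option 2 choice of rho at every step.
function x = for_adaptive(A, b, x, itmax, tol)
    r = b - A*x;
    for n = 1:itmax
        Ar  = A*r;
        rho = (Ar'*r)/(r'*r);            % Option 2 value of rho
        x   = x + r/rho;
        r   = b - A*x;
        if norm(r) <= tol*norm(b), return, end
    end
end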

Exercise 7.16. In Exercise 7.15 you wrote a computer program to use


FOR for the 2D MPP with arbitrary ρ, defaulting to the optimal value for
the 2D MPP. In this exercise, you will modify that program to make two
other programs: (a) One for Algorithm 7.3 for Option 2, and, (b) One for
Option 1.
In each of these cases, test your program by solving the 2D MPP with
h = 1/10, RHS f = 0, and with boundary conditions g(x, y) = x−y. Use as
initial guess the exact solution u = x − y. You should observe convergence
in a single iteration. If it takes more than five iterations, or if it does not
converge, you have an error in your program.
To implement Algorithm 7.3, you will have to write code defining the
vector variable r for the residual rn and in order to compute the matrix-
vector product Arn, you will have to write code similar to the code for au
(giving the product Au), but defining a vector variable. This is best done
in a separate loop from the existing loop.
To implement Option 1, the residual minimization option, all you need
to do is change the expression for ρ.
In each case, how many iterations are required for convergence when
h = 1/100 when the initial guess is u(x, y) = 0 in the interior and u(x, y) =
x − y on the boundary?

The connection between Options 1 and 2 is through the (celebrated)


“normal” equations. Since Ae = r, minimizing ‖r‖₂² = rᵗr is equivalent to minimizing ‖Ae‖₂² = eᵗAᵗAe = ‖e‖²_{AᵗA}. Since AᵗA is SPD, minimizing ‖e‖²_{AᵗA} is equivalent to minimizing the quadratic functional associated with

(AᵗA)x = Aᵗb:
$$\widetilde J(x) = \frac{1}{2}x^t A^t Ax - x^t A^t b.$$
If we are solving Ax = b with A an N × N nonsingular matrix then we can
convert it to the normal equations by multiplication by At :
(At A)x = At b.
Thus, minimizing the residual is equivalent to passing to the normal equa-
tions and minimizing J(·). Unfortunately, the bandwidth of At A is (typi-
cally) double the bandwidth of A. Further, passing to the normal equations
squares the condition number of the associated linear system.

Theorem 7.3 (The normal equations). Let A be N ×N and invertible.


Then AᵗA is SPD. If A is SPD then λ(AᵗA) = λ(A)² and
$$\mathrm{cond}_2(A^tA) = \left[\mathrm{cond}_2(A)\right]^2.$$

Proof. Symmetry: (AᵗA)ᵗ = AᵗAᵗᵗ = AᵗA. Positivity: xᵗ(AᵗA)x = (Ax)ᵗAx = |Ax|² > 0 for x nonzero, since A is invertible. If A is SPD, then AᵗA = A² and, by the spectral mapping theorem,
$$\mathrm{cond}_2(A^tA) = \mathrm{cond}_2(A^2) = \frac{\lambda_{max}(A^2)}{\lambda_{min}(A^2)} = \frac{\lambda_{max}(A)^2}{\lambda_{min}(A)^2} = \left(\frac{\lambda_{max}(A)}{\lambda_{min}(A)}\right)^2 = \left[\mathrm{cond}_2(A)\right]^2.$$

2
The relation cond2 (At A) = [cond2 (A)] explains why Option 2 is bet-
ter. Option 1 implicitly converts the system to the normal equations and
thus squares the condition number of the system being solved then applies
Option 2. This results in a very large increase in the number of iterations.
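The squaring of the condition number is easy to observe; for instance (our own small illustration):

% cond(A'A) equals cond(A)^2 (here A is SPD, so this is exact).
A = [2 -1 0; -1 2 -1; 0 -1 2];
disp([cond(A), cond(A'*A), cond(A)^2])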

7.4 The Steepest Descent Method

A small error in the former will produce an enormous error in


the latter.
— Henri Poincaré

We follow two rules in the matter of optimization:


Rule 1. Don’t do it.
Rule 2 (for experts only). Don’t do it yet - that is, not until
you have a perfectly clear and unoptimized solution.
— M. A. Jackson

“Libenter homines et id quod volunt credunt.” — an old saying.

The steepest descent method is an algorithm for minimizing a functional


in which, at each step, the choice of descent direction is made which makes
the functional decrease as much as possible at that step. Suppose that a
functional J(x) is given. The direction in which J(·) decreases most rapidly
at a point xn is

(∗) d = −∇J(xn ).

Consider the line L in direction d passing through xn . For α ∈ R, L is


given by the equation

x = xn + αd.

Steepest descent involves choosing α so that J(·) is maximally decreased


on L,

$$(**)\qquad J(x^n + \alpha_n d) = \min_{\alpha\in\mathbb{R}} J(x^n + \alpha d).$$

When A is an SPD matrix and J(x) = ½xᵗAx − xᵗb, each step can be written down explicitly. For example, simple calculations show
$$d^n := -\nabla J(x^n) = r^n = b - Ax^n,$$
and for αₙ we solve for α = αₙ in
$$\frac{d}{d\alpha}J(x^n + \alpha d^n) = 0$$
to give
$$\alpha_n = \frac{\langle d^n, r^n\rangle}{\langle d^n, Ad^n\rangle} \ \text{ in general, and with } d^n = r^n:\quad \alpha_n = \frac{\langle r^n, r^n\rangle}{\langle r^n, Ar^n\rangle} = \frac{\|r^n\|_2^2}{\|r^n\|_A^2}.$$

Algorithm 7.4 (Steepest descent). Given an SPD A, x0 , r0 = b − Ax0


and a maximum number of iterations itmax

for n=0:itmax
rn = b − Axn
αn = ⟨rn, rn⟩/⟨rn, Arn⟩
xn+1 = xn + αn rn
if converged, exit, end
end

Comparing the above with FOR with “optimal” parameter selection


we see that Steepest descent (Algorithm 7.3 corresponding to Option 2) is
exactly FOR with αn = 1/ρn where ρn is picked to minimize J(·) at each
step.
How does it really work? It gives only marginal improvement over
constant α. We conclude that better search directions are needed. The
next example shows graphically why Steepest Descent can be so slow:

Example 7.4. N = 2, i.e., \vec{x} = (x, y)ᵗ. Let
$$A = \begin{bmatrix} 2 & 0\\ 0 & 50\end{bmatrix}, \qquad b = \begin{pmatrix} 2\\ 0\end{pmatrix}.$$
Then
$$J(\vec x) = \frac{1}{2}[x, y]\begin{bmatrix} 2 & 0\\ 0 & 50\end{bmatrix}\begin{pmatrix} x\\ y\end{pmatrix} - [x, y]\begin{pmatrix} 2\\ 0\end{pmatrix} = \frac{1}{2}(2x^2 + 50y^2) - 2x = \underbrace{x^2 - 2x + 25y^2 + 1}_{\text{ellipse}} - 1 = \frac{(x-1)^2}{1^2} + \frac{y^2}{(1/5)^2} - 1.$$
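Running steepest descent on this example reproduces the slow zig-zag seen in Figure 7.2 below; a minimal sketch (ours) is:

% Steepest descent on the 2 x 2 example above, starting from x0 = (11,1).
A = [2 0; 0 50];  b = [2; 0];  x = [11; 1];
for n = 1:4
    r     = b - A*x;
    alpha = (r'*r)/(r'*A*r);
    x     = x + alpha*r;
end
disp(x')     % approximately (6.37, 0.54): still far from the limit (1, 0)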

Convergence of Steepest Descent


The fundamental convergence theorem of steepest descent (given next) as-
serts a worst case rate of convergence that is no better than that of FOR.
Unfortunately, the predicted rate of convergence is sharp.

Theorem 7.4 (Convergence of SD). Let A be SPD and κ =


λmax (A)/λmin (A). The steepest descent method converges to the solution of
Ax = b for any x0 . The error x − xn satisfies
$$\|x - x^n\|_A \le \left(\frac{\kappa-1}{\kappa+1}\right)^n\|x - x^0\|_A$$

Fig. 7.2 The first minimization steps for Example 7.4, starting from x⁰ = (11, 1), with limit (1, 0) and x⁴ = (6.37, 0.54). The points x⁰, . . . , x⁴, . . . are indicated with dots, the level curves of J are ellipses centered at (1, 0) and construction lines indicate search directions and tangents.

and
$$J(x^n) - J(x) \le \left(\frac{\kappa-1}{\kappa+1}\right)^n\left(J(x^0) - J(x)\right).$$
Proof. We shall give a short proof that for one step of steepest descent
$$\|x - x^n\|_A \le \left(\frac{\kappa-1}{\kappa+1}\right)\|x - x^{n-1}\|_A.$$
If this holds for one step then the claimed result follows for n steps. We observe that this result has already been proven! Indeed, since steepest descent picks ρ to reduce J(·) maximally, and thus the A-norm of the error maximally, going from xⁿ⁻¹ to xⁿ, it must also reduce it more than for any other choice of ρ, including ρoptimal for FOR. Let xⁿ_FOR be the result of one step from xⁿ⁻¹ of First Order Richardson with optimal parameter. We have proven that
$$\|x - x^n_{FOR}\|_A \le \left(\frac{\kappa-1}{\kappa+1}\right)\|x - x^{n-1}\|_A.$$
Thus
$$\|x - x^n\|_A \le \|x - x^n_{FOR}\|_A \le \left(\frac{\kappa-1}{\kappa+1}\right)\|x - x^{n-1}\|_A,$$
completing the proof for the error. The second result, for J(xⁿ) − J(x), is left as an exercise.

We note that (κ − 1)/(κ + 1) = 1 − 2/(κ + 1). For the model Poisson problem, typically λmax = O(1) while λmin = O(h²), and thus κ = O(h⁻²), so steepest descent requires O(h⁻²) iterations to converge.

Theorem 7.5. The convergence rate (κ − 1)/(κ + 1) of steepest descent is sharp. It is exactly the rate of convergence when the initial error is e⁰ = λmin⁻¹φ₁ + λmax⁻¹φ₂ and when e⁰ = λmin⁻¹φ₁ − λmax⁻¹φ₂, where φ₁,₂ are orthonormal eigenvectors belonging to λmin(A) and λmax(A) respectively.

Proof. Let φ₁,₂ be orthonormal eigenvectors belonging to λmin(A) and λmax(A) respectively. Consider two possible selections of initial guesses: pick
$$x^0 = x - \left(\lambda_{min}^{-1}\varphi_1 + \lambda_{max}^{-1}\varphi_2\right) \quad\text{or}\quad x^0 = x - \left(\lambda_{min}^{-1}\varphi_1 - \lambda_{max}^{-1}\varphi_2\right).$$
We proceed by direct calculations (which are not short but are routine step by step): if we choose the first, then e⁰ = λmin⁻¹φ₁ + λmax⁻¹φ₂ and r⁰ = Ae⁰ = φ₁ + φ₂. We find
$$x^1 = x^0 + \alpha_0(b - Ax^0) = x^0 + \alpha_0 Ae^0 \quad(\text{since } Ae = r),$$
so that
$$e^1 = e^0 - \alpha_0 Ae^0 = \left(\lambda_{min}^{-1} - \alpha_0\right)\varphi_1 + \left(\lambda_{max}^{-1} - \alpha_0\right)\varphi_2.$$
Next calculate
$$\alpha_0 = \frac{\langle r^0, r^0\rangle}{\langle r^0, Ar^0\rangle} = \frac{\langle \varphi_1 + \varphi_2,\ \varphi_1 + \varphi_2\rangle}{\langle \varphi_1 + \varphi_2,\ \lambda_{min}\varphi_1 + \lambda_{max}\varphi_2\rangle} = \frac{2}{\lambda_{min} + \lambda_{max}}.$$
We thus have
$$e^1 = \left(\frac{1}{\lambda_{min}} - \frac{2}{\lambda_{min}+\lambda_{max}}\right)\varphi_1 + \left(\frac{1}{\lambda_{max}} - \frac{2}{\lambda_{min}+\lambda_{max}}\right)\varphi_2 = (\text{rearranging}) = \left(\frac{\kappa-1}{\kappa+1}\right)\left(\lambda_{min}^{-1}\varphi_1 - \lambda_{max}^{-1}\varphi_2\right).$$
The rest of the calculations are exactly as above. They show that, in the two cases,
$$e^1 = \left(\frac{\kappa-1}{\kappa+1}\right)\left(\lambda_{min}^{-1}\varphi_1 \mp \lambda_{max}^{-1}\varphi_2\right), \qquad e^2 = \left(\frac{\kappa-1}{\kappa+1}\right)^2\left(\lambda_{min}^{-1}\varphi_1 \pm \lambda_{max}^{-1}\varphi_2\right).$$

Proceeding by induction,
$$e^n = \left(\frac{\kappa-1}{\kappa+1}\right)^n\left(\lambda_{min}^{-1}\varphi_1 \ \text{either } + \text{ or } - \ \lambda_{max}^{-1}\varphi_2\right)$$
in the two cases, which is exactly the predicted rate of convergence.

Exercise 7.17. Suppose you must solve a very large sparse linear system
Ax = b by some iterative method. Often one does not care about the in-
dividual millions of entries in the solution vector but one only wants a few
statistics [i.e., numbers] such as the average. Obviously, the error in the
averages can be much smaller than the total error in every component or
just as large as the total error. Your goal is to try to design iterative meth-
ods which will produce accurate statistics more quickly than an accurate
answer.
To make this into a math problem, let the (to fix ideas) statistic be a
linear functional of the solution. Define a vector l and compute

$$L = l^t x = \langle l, x\rangle;$$
if, e.g., L = average(x), then l = (1/N, 1/N, . . . , 1/N)ᵗ.

Problem:

Solve : Ax = b,
Compute : L = l, x

or: compute L = . . . while solving Ax = b approximately. There are many iterative methods you have studied; develop/adapt/optimize one [YOUR CHOICE OF IDEAS] for this problem! You must either [YOUR CHOICE] analyze it or give comprehensive numerical tests. Many approaches are possible, e.g., note that this can be written as an (N + 1) × (N + 1) system for (x, L):
$$\begin{bmatrix} A & 0\\ -l^t & 1\end{bmatrix}\begin{pmatrix} x\\ L\end{pmatrix} = \begin{pmatrix} b\\ 0\end{pmatrix}.$$
You will have to negotiate with this problem as well. There is no set
answer! Every method can be adapted to compute L faster and no method
will always be best.

Exercise 7.18. The standard test problem for nonsymmetric systems is the
2D CDEqn = 2D model discrete Convection Diffusion equation. Here ε is a

small to very small positive parameter. (Recall that you have investigated
the 1D CDEqn in Exercise 5.5, page 98.)
$$-\varepsilon\Delta u + u_x = f \ \text{ inside } (0,1)\times(0,1), \qquad u = g \ \text{ on the boundary.}$$
Discretize the Laplacian by the usual 5-point star and approximate u_x by
$$u_x(x_I, y_J) \approx \frac{u(I+1, J) - u(I-1, J)}{2h}.$$
Find the associated difference stencil. This problem has 2 natural parameters:
$$h = \frac{1}{N+1}, \ \text{the meshwidth; and}\quad Pe := \frac{h}{\varepsilon}, \ \text{the “cell Péclet number”.}$$
The interesting case is when the cell Péclet3 number Pe ≫ 1, i.e., when ε ≪ h.
Hint: You have already written programs for the 2D MPP in Exercises 7.15
and 7.16. You can modify one of those programs for this exercise.

(1) Debug your code using h = 1/5, g(x, y) = x − y, and f (x, y) = 1. The
exact solution in this case is u(x, y) = x − y. Starting from the exact
solution, convergence to 1.e-3 should be achieved in a single iteration
of a method such as Jacobi (FOR with ρ = 4).
(2) Fix h = 1/50, f (x, y) = x + y, and g(x, y) = 0. Pick three iterative
methods (your choice). Solve the nonsymmetric linear system for a
variety of values4 of ε = 1, 1/10, 1/100, 1/1000, 1/10000, starting from
u(x, y) = 0, to an accuracy of 10−3 . Report the results, consisting of
convergence with the number of iterations or nonconvergence. Describe
the winners and losers for small cell P e and for large cell P e.

Exercise 7.19. For A a large, sparse matrix and ‖ · ‖ the Euclidean or ℓ² norm, consider a general iterative method below for solving Ax = b,
starting from a guess vector x0 .
3 The Péclet number is named after the French physicist Jean Claude Eugène Péclet. It

is given by Length × Velocity / Diffusion coefficient. In our simple example, the velocity
is the vector (1,0) and the diffusion coefficient is ε. The cell Peclet number, also denoted
by Pe, is the Peclet number associated with one mesh cell so the length is taken to be
the meshwidth.
4 For ε = 1, your solution should appear much like the MPP2D solution with the same

right side and boundary conditions. For smaller ε, the peak of the solution is pushed to
larger x locations. Nonconvergence is likely for very small ε.

r0 = b − Ax0
for n=0:itmax
Choose dn
(∗) Pick αn to minimize ‖b − A(xn + αdn)‖2
xn+1 = xn + αn dn
(∗∗) rn+1 = b − Axn+1
if converged, return, end
end

(1) Show that step (∗∗) can be replaced by: (∗∗) rn+1 = rn − αn Adn .
(2) Find an explicit formula for the optimal value of α in step (∗).

Chapter 8

The Conjugate Gradient Method

“The cook was a good cook,


as cooks go,
and as cooks go,
she went.”
— Saki

The conjugate gradient method was proposed by Hestenes and Stiefel


in 1952. Initially it was considered a direct method for solving Ax = b for
A SPD since (in exact arithmetic) it gives the exact solution in N steps or
less. Soon it was learned that often a very good solution is obtained after
many fewer than N steps. Each step requires a few inner products, and one
matrix multiply. Like all iterative methods, its main advantage is when the
matrix vector multiply can be done quickly and with minimal storage of A.

8.1 The CG Algorithm

Why not the best? — Jimmy Carter

The conjugate gradient method is the best possible method1 for solving
Ax = b for A an SPD matrix. We thus consider the solution of

Ax = b, where A is large, sparse and SPD.

1 “Best possible” has a technical meaning here with equally technical qualifiers. We shall

see that the kth step of the CG method computes the projection (the best approximation)
with respect to the A-norm into a k dimensional subspace.


First we recall some notation.

Definition 8.1. Assume that A is an SPD matrix. ⟨x, y⟩ denotes the Euclidean inner product:
$$\langle x, y\rangle = x^t y = x_1y_1 + x_2y_2 + \ldots + x_Ny_N.$$
⟨x, y⟩_A denotes the A-inner product
$$\langle x, y\rangle_A = x^t Ay = \sum_{i,j=1}^{N} x_i A_{ij} y_j.$$
The A-norm is
$$\|x\|_A = \sqrt{\langle x, x\rangle_A} = \sqrt{x^t Ax}.$$
The quadratic functional associated with Ax = b is
$$J(x) = \frac{1}{2}x^t Ax - x^t b.$$
The conjugate gradient method (hereafter: CG) is a descent method.
Thus, it takes the general form.

Algorithm 8.1 (Descent Method for solving Ax = b with A SPD).


Given an SPD A, x0 and a maximum number of iterations itmax
r0 = b − Ax0
for n=0:itmax
(∗) Choose a descent direction dn
αn := arg minα J(xn + αdn) = ⟨dn, rn⟩/⟨dn, Adn⟩
xn+1 = xn + αn dn
rn+1 = b − Axn+1
if converged, stop, end
end

CG differs from the slow steepest descent method by step (∗) the choice
of search directions. In Steepest Descent dn = rn while in CG dn is calcu-
lated by a two term recursion that A orthogonalizes the search directions.
The CG algorithm is very simple to write down and easy to program.
It is given as follows:2
2 We shall use fairly standard conventions in descent methods; we will use roman letters

with superscripts to denote vectors, d, r, x, · · ·, and greek letters, α, β, · · ·, with subscripts


to denote scalars. For example, we denote the nth descent direction vector dn and the nth
scalar multiplier by αn . One exception is that eigenvectors will commonly be denoted
by φ.

Algorithm 8.2 (Conjugate Gradient Algorithm). Given an SPD A,


x0 and a maximum number of iterations itmax

r0 = b − Ax0
d0 = r 0
for n=1:itmax
αn−1 = ⟨dn−1, rn−1⟩/⟨dn−1, Adn−1⟩
xn = xn−1 + αn−1 dn−1
rn = b − Axn
if converged, stop, end
βn = ⟨rn, rn⟩/⟨rn−1, rn−1⟩
dn = rn + βn dn−1
end
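A direct Matlab-style transcription of Algorithm 8.2 (a sketch; for mesh problems the products A*x and A*d would be replaced by stencil routines) is:

% Conjugate gradient, following Algorithm 8.2 line by line.
function x = cg(A, b, x, itmax, tol)
    r = b - A*x;  d = r;
    for n = 1:itmax
        alpha = (d'*r)/(d'*(A*d));
        x     = x + alpha*d;
        rnew  = b - A*x;               % or: rnew = r - alpha*(A*d)
        if norm(rnew) <= tol*norm(b), return, end
        beta  = (rnew'*rnew)/(r'*r);
        d     = rnew + beta*d;
        r     = rnew;
    end
end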

CG has the following features:

• In steepest descent, dn is chosen to be a locally optimal search direction.


• In CG, dn is chosen to be a globally optimal search direction. The
problem of finding dn is thus a global problem: in principle, dn depends
on all the previous search directions d0 , d1 , d2 , . . . , dn−2 and dn−1 . CG,
however, has an amazing property:
• For SPD A, the dependence on d0 , . . . , dn−3 drops out and dn depends
only on dn−1 and dn−2 .
• CG is the fastest convergent iterative method in the A-norm.
• CG can be written as a three term recursion or a coupled two term recursion.
• CG typically requires O(√cond(A)) iterations per significant digit of accuracy.
• CG requires barely more work per step than steepest descent. As stated above, it takes 2 matrix-vector multiplies per step plus a few dot products and triads. If the residual is calculated by the update rn+1 = rn − αn Adn (Lemma 8.1 below), then it only requires one matrix-vector multiply per step.
• In exact arithmetic, CG reaches the exact solution of an N × N system
in N steps or less.
• The CG method has many orthogonality properties. Thus, there are
many ways to write the algorithm that are mathematically equivalent
(in exact arithmetic) and an apparent (not real) multitude of CG meth-
ods.

For general nonsymmetric matrices, there is no iterative method with


all the above good properties of CG. Two methods that are popular now
are GMRES which has a full recursion3 and thus is very expensive when
a lot of iterates are required, and CGN — which is just CG for the normal
equations
At Ax = At b, At A is SPD.
This typically requires O (cond(A)) iterates per significant digit of accuracy
sought.

Example 8.1. As a concrete example, consider solving the 2D model Pois-


son problem on a 100 × 100 mesh. Thus h = 1/101 and we solve
$$A\,u = f \ \text{ where } A \text{ is } 10{,}000 \times 10{,}000.$$
Note that cond(A) ≈ O(h⁻²) = O(10,000). Thus, we anticipate:

• Steepest descent requires ≈ 50,000 to 100,000 iterations to obtain 6 significant digits of accuracy.
• CG will produce the exact solution, in the absence of round off error, in at most 10,000 iterations, however,
• Since √cond(A) ≈ 100, CG will produce an approximate solution with 6 significant digits of accuracy in ≈ 500–1000 iterations!
• With simple preconditioners (a topic that is coming) we get 6 digits in ≈ 30–40 iterations!

Exercise 8.1. Write a computer program implementing Algorithm 8.2. Write the program so it can be applied to any given matrix A, with any given initial guess x^0 and right side b. Assume the iteration is converged when both the conditions ||r^n|| < ε||b|| and ||u^n − u^{n−1}|| < ε||u^n|| are satisfied for given tolerance ε. Consider the matrix

A_1 = [  2  −1   0   0   0
        −1   2  −1   0   0
         0  −1   2  −1   0
         0   0  −1   2  −1
         0   0   0  −1   2 ].

(1) Apply your program to the matrix A_1 using the exact solution x_exact = [1, 2, 3, 4, 5]^t and b_1 = A_1 x_exact, starting with x^0 = x_exact. Demonstrate convergence to x_exact in a single iteration with ε = 10^{−4}.
3 All previous dn must be stored and used in order to compute xn+1 .
(2) Apply your program to A_1 and b_1 with tolerance ε = 10^{−4} but with initial guess x^0 = 0. Demonstrate convergence to x_exact in no more than five iterations.
(3) Repeat the previous two cases with the matrix

A_2 = [  2  −1   0  −1   0
        −1   3  −1   0  −1
         0  −1   2  −1   0
        −1   0  −1   3  −1
         0  −1   0  −1   3 ].

Exercise 8.2. Recall that Exercises 7.15 (page 169) and 7.16 (page 170) had you write programs implementing iterative methods for the 2D MPP.
Write a computer program to solve the 2D MPP using conjugate gradients, Algorithm 8.2. How many iterations does it take to converge when N = 100 and ε = 1.e−8?
Recommendation: You have already written and tested a conjugate
gradient code in Exercise 8.1 and a 2D MPP code in Exercise 7.16. If
you replace the matrix-vector products Axn appearing in your conjugate
gradient code with function or subroutine calls that use 2D MPP code to
effectively compute the product without explicitly generating the matrix
A, you can leverage your earlier work and save development and debugging
time.
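Following this recommendation, here is one possible shape for such a matrix-free product, as a sketch only: it assumes the 2D MPP matrix acts as the usual 5-point stencil (4 on the diagonal, −1 for each of the four mesh neighbors, the structure referred to again in Exercise 8.19) on an N × N grid of interior unknowns stored as a flat vector; the function name and scaling convention are our own.

import numpy as np

def mpp_matvec(u, N):
    # Matrix-free A*u for the 2D MPP, assuming the 5-point stencil
    # (A u)_{ij} = 4 u_{ij} - u_{i-1,j} - u_{i+1,j} - u_{i,j-1} - u_{i,j+1}
    # with homogeneous Dirichlet values outside the N x N grid of unknowns.
    U = u.reshape(N, N)
    AU = 4.0 * U
    AU[1:, :]  -= U[:-1, :]    # neighbor above
    AU[:-1, :] -= U[1:, :]     # neighbor below
    AU[:, 1:]  -= U[:, :-1]    # neighbor to the left
    AU[:, :-1] -= U[:, 1:]     # neighbor to the right
    return AU.reshape(-1)

# In the CG code, every product "A @ x" would be replaced by "mpp_matvec(x, N)".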

8.1.1 Algorithmic options

There are many algorithmic options (we will list two below) but the above
is a good, stable and efficient form of CG.

(1) An equivalent expression for α_n is
α_n = ⟨r^n, r^n⟩ / ⟨d^n, Ad^n⟩.

(2) The expression rn+1 = b − Axn+1 is equivalent to rn+1 = rn − αn Adn


in exact arithmetic. To see it is equivalent, we note that the residuals
satisfy their own iteration.

Lemma 8.1. In exact arithmetic, the CG residuals satisfy

rn+1 = rn − αn Adn .
Proof. Since x^{n+1} = x^n + α_n d^n, multiply by −A and add b to both sides. This gives
b − Ax^{n+1} = (b − Ax^n) − α_n Ad^n,  i.e.,  r^{n+1} = r^n − α_n Ad^n,
as claimed above.

Thus, the residual can be calculated 2 ways: directly at the cost of an


extra matrix-vector multiply and via the above step. Direct calculation
doubles the number of matrix-vector multiplies per step over rn+1 = rn −
αn Adn . Some have reported that for highly ill-conditioned systems, it can
be preferable to calculate the residual directly, possibly even in extended
precision.

Exercise 8.3. Consider the CG method, Algorithm 8.2. Show that it can
be written as a three term recursion of the general form xn+1 = αn xn +
βn xn−1 + cn .

Exercise 8.4. In Exercise 8.1, you wrote a computer program to implement


Algorithm 8.2. Double-check it on the 5 × 5 SPD matrix A1 from that
exercise by choosing any vector b and verify that the system Ax = b can be
solved in five iterations.
Make a copy of your program and modify it to include the alternative
expressions for αn and rn described above. Verify that the modified pro-
gram gives rise to the same sequence of coefficients αn and βn , iterates xn
and residuals rn as the original.

8.1.2 CG’s two main convergence theorems

“All sorts of computer errors are now turning up. You’d be sur-
prised to know the number of doctors who claim they are treating
pregnant men.”
— Anonymous Official of the Quebec Health Insurance Board,
on Use of Computers in Quebec Province’s Comprehensive Medical-
care system. F. 19, 4:5. In Barbara Bennett and Linda Am-
ster, Who Said What (and When, and Where, and How) in 1971:
December–June, 1971 (1972), Vol. 1, 38.
The global optimality properties of the CG method depend on a specific family of subspaces, the Krylov subspaces. First recall.

Definition 8.2. Let z^1, · · ·, z^m be m vectors. Then span{z^1, · · ·, z^m} is the set of all linear combinations of z^1, · · ·, z^m, i.e., the subspace
span{z^1, · · ·, z^m} = { x = Σ_{i=1}^{m} α_i z^i : α_i ∈ R }.

It will be important to know the form of the CG iterates and search directions. To find the correct subspaces, we step through the algorithm:
d^0 = r^0,
x^1 = x^0 + αr^0 ∈ x^0 + span{r^0}.
From Lemma 8.1 we have
r^1 = r^0 − αAr^0 ∈ r^0 + A · span{r^0},
so
d^1 = r^1 + βr^0 = r^0 − αAr^0 + βr^0 ∈ span{r^0, Ar^0}.
Thus
x^2 = x^1 + α̃d^1 = x^0 + αr^0 + α̃{r^0 − αAr^0 + βr^0},  so
x^2 ∈ x^0 + span{r^0, Ar^0},  and similarly
r^2 ∈ r^0 + A · span{r^0, Ar^0}.
Continuing, we easily find the following.

Proposition 8.1. The CG iterates x^j, residuals r^j and search directions d^j satisfy
x^j ∈ x^0 + span{r^0, Ar^0, · · ·, A^{j−1}r^0},
r^j ∈ r^0 + A · span{r^0, Ar^0, · · ·, A^{j−1}r^0}
and
d^j ∈ span{r^0, Ar^0, · · ·, A^{j−1}r^0}.

Proof. Induction.

The subspace and affine subspaces, known as Krylov subspaces, are


critical to the understanding of the method.
Definition 8.3. Let x0 be given and r0 = b − Ax0 . The Krylov subspace
determined by r0 and A is
Xn = Xn (A; r0 ) = span{r0 , Ar0 , . . . , An−1 r0 }
and the affine Krylov space determined by r0 and A is
Kn = Kn (A; x0 ) = x0 + Xn = {x0 + x : x ∈ Xn }.
The first important theorem of the CG method is the following. It explains the global error minimization linked to the choice of search directions.

Theorem 8.1. Let A be SPD. Then the CG method satisfies the following:
(i) The nth residual is globally optimal over the affine subspace K_n in the A^{−1}-norm
||r^n||_{A^{−1}} = min_{r ∈ r^0 + AX_n} ||r||_{A^{−1}}.
(ii) The nth error is globally optimal over K_n in the A-norm
||e^n||_A = min_{e ∈ K_n} ||e||_A.
(iii) J(x^n) is the global minimum over K_n
J(x^n) = min_{x ∈ K_n} J(x).
(iv) Furthermore, the residuals are orthogonal and search directions are A-orthogonal:
r^k · r^l = 0, for k ≠ l,
⟨d^k, d^l⟩_A = 0, for k ≠ l.

These are algebraic properties of CG iteration, proven by induction.


Part (iv) already implies the finite termination property.

Exercise 8.5. Prove the theorem by induction, starting from Algo-


rithm 8.2. You may find Lemma 8.1 helpful.

Corollary 8.1. Let A be SPD. Then in exact arithmetic CG produces the


exact solution to an N × N system in N steps or fewer.

Proof. Since the residuals {r0 , r1 , . . . , rN −1 } are orthogonal they are lin-
early independent. Thus, rl = 0 for some l ≤ N .

Using the properties (i) through (iv), the error in the nth CG step will be linked to an analytic problem: the error in Chebychev interpolation. Its main result is the second big convergence theorem for CG.

Theorem 8.2. Let A be SPD. Given any ε > 0, for
n ≥ (1/2) √cond(A) ln(2/ε) + 1
the error in the CG iterations is reduced by ε:
||x^n − x||_A ≤ ε ||x^0 − x||_A.
8.2 Analysis of the CG Algorithm

Art has a double face, of expression and illusion, just like science
has a double face: the reality of error and the phantom of truth.
— René Daumal
‘The Lie of the Truth’. (1938) translated by Phil Powrie (1989).
In Carol A. Dingle, Memorable Quotations (2000).

The form of the CG algorithm presented in the last section is quite


computationally efficient. It has developed over some years as many equiv-
alences and identities have been derived for the method. We give two
different (but of course equivalent if you look deeply enough) analytical
developments of CG. The first is a straightforward application of the
Pythagorean theorem. CG sums an orthogonal series and the orthogonal
basis vectors are generated by a special method, the Orthogonalization of
Moments algorithm. Putting these two together immediately gives a sim-
plified CG method which has the essential and remarkable features claimed
for it.
The second approach is indirect and more geometric. In this second
approach, we shall instead define the CG method by (CG as n dimensional
minimization). This definition makes the global optimization property ob-
vious. However, it also suggests that the nth step requires an n dimensional
optimization calculation. Thus the work in this approach will be to show
that the n dimensional optimization problem can be done by a 1 dimen-
sional line search. In other words, it will be to show that the n dimensional
optimization problem can be done by either one 3 term recursion or two
coupled 2-term recursions. This proves that (CG as n dimensional mini-
mization) can be written in the general form announced in the introduction.
The key to this second approach is again the Orthogonalization of Moments
algorithm.
Since any treatment will adopt one or the other, the Orthogonalization
of Moments algorithm will be presented twice.

8.3 Convergence by the Projection Theorem

Fourier is a mathematical poem.


— Thomson, [Lord Kelvin] William (1824–1907)

We begin with some preliminaries. The best approximation under a


norm given by an inner product in a subspace is exactly the orthogonal
projection with respect to that inner product. Thus we start by recalling some fundamental properties of these best approximations. Let X ⊂ R^N be an n (for n < N) dimensional subspace and x ∈ R^N. Given an inner product and associated norm^4 ⟨·,·⟩_*, ||·||_* = ⟨·,·⟩_*^{1/2}, the best approximation x^n ∈ X to x is the unique x^n ∈ X satisfying:
||x − x^n||_* = min_{x̃ ∈ X} ||x − x̃||_*.
If K is the affine space K = x^0 + X (where x^0 is fixed), then the best approximation in K is the solution to
||x − x^K||_* = min_{x̃ ∈ K} ||x − x̃||_*.
The two best approximations are related. Given x, the best approximation x^K in K is given by x^K = x^0 + x^X, where x^X is the best approximation in X to x − x^0.

Theorem 8.3 (Pythagorean or Projection Theorem). Let X be a subspace of R^N and x ∈ R^N. Then, the best approximation to x in X, x^n ∈ X,
||x − x^n||_* = min_{x̃ ∈ X} ||x − x̃||_*
is determined by
⟨x − x^n, x̃⟩_* = 0, ∀ x̃ ∈ X.
Further, we have
||x||_*^2 = ||x^n||_*^2 + ||x − x^n||_*^2.
Let x^0 ∈ R^N and let K be an affine subspace K = x^0 + X. Given x ∈ R^N there exists a unique best approximation x^n ∈ K to x:
||x − x^n||_* = min_{x̃ ∈ K} ||x − x̃||_*.
The error is orthogonal to X:
⟨x − x^n, x̃⟩_* = 0, ∀ x̃ ∈ X.

Proof. See any book on linear algebra!


4 The inner product is “fixed but arbitrary”. Think of the usual dot product and the A-inner product for concrete examples.


The best approximation in K = x^0 + X is determined by ⟨x − x^n, x̃⟩_* = 0, ∀ x̃ ∈ X. Let e^1, . . . , e^n be a basis for X. Expand x^n = x^0 + Σ_j c_j e^j ∈ K; then the vector of undetermined coefficients satisfies the linear system
Gc = f,  G_{ij} = ⟨e^i, e^j⟩_*,  f_j = ⟨x − x^0, e^j⟩_*.
Here G is called the “Gram matrix” or “Gramian”.

Definition 8.4. Let {φ^1, φ^2, . . . , φ^n} be a basis for X and ⟨·,·⟩_* an inner product on X. The associated Gram matrix G of the basis is
G_{ij} = ⟨φ^i, φ^j⟩_*.

Thus, the general way to calculate the best approximation in an n-


dimensional affine subspace K = x0 + X is to pick a basis for X, assemble
the n × n Gram matrix and solve an n × n linear system. Two questions
naturally arise:

• How to calculate all the inner products fj if x is the sought but unknown
solution of Ax = b?
• Are there cases when the best approximation in K can be computed
at less cost than constructing a basis, assembling G and then solving
Gc = f ?

For the first question there is a clever finesse that works when A is SPD. Indeed, if we pick ⟨·,·⟩_* = ⟨·,·⟩_A then for Ax = b,
⟨x, y⟩_A = x^t A y = (Ax)^t y = b^t y = ⟨b, y⟩,
which is computable without knowing x. For the second question, there is a case when calculating the best approximation is easy: when an orthogonal basis is known for X. This case is central to many mathematical algorithms including CG.

Definition 8.5 (Orthogonal basis). Let {φ^1, φ^2, . . . , φ^n} be a basis for X and ⟨·,·⟩_* an inner product on X. An orthogonal basis for X is a basis {φ^1, φ^2, . . . , φ^n} satisfying additionally
⟨φ^i, φ^j⟩_* = 0 whenever i ≠ j.

If the basis {φ^1, φ^2, . . . , φ^n} is orthogonal then its Gram matrix G_{ij} is diagonal. The best approximation in X can then simply be written down explicitly:
x^n = Σ_{j=1}^{n} ( ⟨x, φ^j⟩_* / ⟨φ^j, φ^j⟩_* ) φ^j.
Similarly, the best approximation in the affine subspace K can also be written down explicitly as
x^n = y^0 + Σ_{j=1}^{n} ( ⟨x − y^0, φ^j⟩_* / ⟨φ^j, φ^j⟩_* ) φ^j.
Summing this series for the best approximation in an affine subspace can
be expressed as an algorithm that looks like (and is) a descent method.

Algorithm 8.3 (Summing an orthogonal series). Given x ∈ R^N, an n-dimensional subspace X with orthogonal basis {φ^1, · · ·, φ^n}, and a vector y^0

x^0 = y^0
for j=0:n-1
d^j = φ^{j+1}
α_j = ⟨x − x^0, d^j⟩_* / ⟨d^j, d^j⟩_*
x^{j+1} = x^j + α_j d^j
end
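A minimal NumPy sketch of Algorithm 8.3, assuming the inner product ⟨·,·⟩_* is the A-inner product ⟨u, v⟩_A = u^t A v and that the columns of Phi are already A-orthogonal. The finesse described above is not used here, so the exact x must be supplied; the sketch is only meant to make the summation concrete.

import numpy as np

def sum_orthogonal_series(A, x, y0, Phi):
    # Algorithm 8.3 with <.,.>_* taken to be the A-inner product.
    # Phi: array whose columns phi^1,...,phi^n are A-orthogonal.
    xj = y0.astype(float).copy()
    for j in range(Phi.shape[1]):
        d = Phi[:, j]                                   # d^j = phi^{j+1}
        alpha = ((x - y0) @ (A @ d)) / (d @ (A @ d))    # <x - x^0, d^j>_A / <d^j, d^j>_A
        xj = xj + alpha * d
    return xj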

The usual descent method (general directions) produces at each step an


approximation optimal in the (1-dimensional) line x = xj + αdj , α ∈ R.
Since the descent directions are orthogonal this produces at the jth step
an approximation that is optimal over the j-dimensional affine subspace
x0 + span{φ1 , . . . , φj }.

Theorem 8.4. If {φ^1, . . . , φ^j} are A-orthogonal and ⟨·,·⟩_* is the A-inner product, then the approximations produced by the descent method choosing {φ^1, . . . , φ^j} for descent directions (i.e., if choosing d^i = φ^i) are the same as those produced by summing the orthogonal series in Algorithm 8.3 above. Thus, with A-orthogonal search directions, the approximations produced by the descent algorithm satisfy
||x − x^j||_A = min_{x̃ ∈ x^0 + span{φ^1,...,φ^j}} ||x − x̃||_A,
J(x^j) = min_{x̃ ∈ x^0 + span{φ^1,...,φ^j}} J(x̃).

Proof. Consider the claim of equivalence of the two methods. The general step of each takes the form x^{j+1} = x^j + αd^j with the same x^j, d^j. We thus need to show equivalence of the two formulas for the stepsize:
descent: α_j = ⟨r^j, φ^j⟩ / ⟨φ^j, φ^j⟩_A
orthogonal series: α_j = ⟨x − y^0, φ^j⟩_A / ⟨φ^j, φ^j⟩_A.
Since the denominators are the same we begin with the first numerator and show it is equal to the second. Indeed,
⟨r^j, φ^j⟩ = ⟨b − Ax^j, φ^j⟩ = ⟨Ax − Ax^j, φ^j⟩ = ⟨x − x^j, φ^j⟩_A.
Consider the form of x^j produced by the descent algorithm. We have (both obvious and easily proven by induction) that x^j takes the general form
x^j = x^0 + a_1φ^1 + · · · + a_{j−1}φ^{j−1}.
Thus, by A-orthogonality of {φ^1, . . . , φ^j},
⟨x^j, φ^j⟩_A = ⟨x^0 + a_1φ^1 + · · · + a_{j−1}φ^{j−1}, φ^j⟩_A = ⟨x^0, φ^j⟩_A.
Thus we have
⟨r^j, φ^j⟩ = ⟨x − x^j, φ^j⟩_A = ⟨x − x^0, φ^j⟩_A,
which proves equivalence. The error estimate is just restating the error estimate of the Pythagorean theorem. From the work on descent methods we know that A-norm optimality of the error is equivalent to minimization of J(·) over the same space. Hence the last claim follows.

Thus:

• Algorithm 8.3 does 1-dimensional work at each j th step (a 1-


dimensional optimization) and attains a j-dimensional optimum error
level;
• Equivalently, if the descent directions are chosen A-orthogonal, a j-
dimensional minimizer results.

The focus now shifts to how to generate the orthogonal basis. The
classic method is the Gram–Schmidt algorithm.

8.3.1 The Gram–Schmidt algorithm


The Gram–Schmidt algorithm is not used in CG for SPD systems. It is
important for understanding the method actually used (orthogonalization
of moments which is coming) and becomes important in generalized conju-
gate gradient methods for nonsymmetric systems. For example, GS is used
to generate search directions in the method GMres.

Algorithm 8.4 (Gram–Schmidt orthogonalization). Given a basis


{e1 , e2 , . . . , eN } for RN ,
φ^1 = e^1
for j=1:n
for i=1:j
α_i = ⟨e^{j+1}, φ^i⟩_* / ⟨φ^i, φ^i⟩_*
end
φ^{j+1} = e^{j+1} − Σ_{i=1}^{j} α_i φ^i
end
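A NumPy sketch of Algorithm 8.4 for a general inner product supplied as a function; with ip(u, v) = u @ v this is classical Gram–Schmidt in the Euclidean inner product, while ip(u, v) = u @ (A @ v) produces an A-orthogonal basis. The function name and calling convention are our own.

import numpy as np

def gram_schmidt(E, ip):
    # E: array whose columns e^1,...,e^N form a basis of R^N.
    # ip(u, v): the inner product <u, v>_*.
    N = E.shape[1]
    Phi = np.zeros_like(E, dtype=float)
    Phi[:, 0] = E[:, 0]
    for j in range(N - 1):
        v = E[:, j + 1].astype(float).copy()
        for i in range(j + 1):
            alpha = ip(E[:, j + 1], Phi[:, i]) / ip(Phi[:, i], Phi[:, i])  # alpha_i
            v -= alpha * Phi[:, i]
        Phi[:, j + 1] = v        # phi^{j+1} = e^{j+1} - sum_i alpha_i phi^i
    return Phi

Note the nested loop: the jth step costs j inner products, which is exactly the growing, n-dimensional work that the Orthogonalization of Moments algorithm below avoids.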

Theorem 8.5. Given a basis {e^1, e^2, . . . , e^N} for R^N, the Gram–Schmidt Algorithm 8.4 constructs a new, ⟨·,·⟩_*-orthogonal basis φ^1, . . . , φ^N for R^N so that:

(1) span{e^1, . . . , e^j} = span{φ^1, . . . , φ^j} for each j = 1, · · ·, N; and,
(2) ⟨φ^i, φ^j⟩_* = 0 whenever i ≠ j.

The nth step of Gram–Schmidt obtains an orthogonal basis for an n-


dimensional subspace as a result of doing n-dimensional work calculating
the n coefficients αi , i = 1, · · ·, n. There is exactly one case where this work
can be reduced dramatically and that is the case relevant for the conjugate
gradient method. Since summing an orthogonal series is globally optimal
but has 1-dimensional work at each step, the problem shifts to finding
an algorithm for constructing an A-orthogonal basis which, unlike Gram–
Schmidt, requires 1-dimensional work at each step. There is exactly one
such method which only works in exactly one special case (for the Krylov
subspace of powers of A times a fixed vector) called “Orthogonalization of
moments”.

Exercise 8.6. Prove that the Gram matrix G_{ij} = ⟨e^i, e^j⟩_* is SPD provided e^1, . . . , e^n is a basis for X and diagonal provided e^1, . . . , e^n are orthogonal.

Exercise 8.7. Give an induction proof that the Gram–Schmidt Algo-


rithm 8.4 constructs a new basis with the two properties claimed in Theo-
rem 8.5.

8.3.2 Orthogonalization of moments instead of


Gram–Schmidt
A great part of its [higher arithmetic] theories derives an ad-
ditional charm from the peculiarity that important propositions,
with the impress of simplicity on them, are often easily discovered
by induction, and yet are of so profound a character that we can-


not find the demonstrations till after many vain attempts; and even
then, when we do succeed, it is often by some tedious and artificial
process, while the simple methods may long remain concealed.
— Gauss, Karl Friedrich (1777–1855).
In H. Eves Mathematical Circles Adieu, Boston: Prindle, We-
ber and Schmidt, 1977.

The CG method at the nth step computes an A-norm optimal approximation in an n-dimensional subspace. In general this requires solving an n × n linear system with the Gram matrix. The only case, and the case of the CG method, when it can be done with much less expense is when an A-orthogonal basis is known for the subspace, and this is known only for a Krylov subspace:
X_n := span{r^0, Ar^0, A^2r^0, · · ·, A^{n−1}r^0},
K_n := x^0 + X_n.

CG hinges on an efficient method of determining an A-orthogonal


basis for Xn . With such a method, CG takes the general form:

Algorithm 8.5. Given SPD A, initial guess x0 , and maximum number of


iterations itmax,

r0 = b − Ax0
d0 = r 0
for n=1:itmax
Descent step:
α_{n−1} = ⟨r^{n−1}, d^{n−1}⟩ / ⟨d^{n−1}, d^{n−1}⟩_A
xn = xn−1 + αn−1 dn−1
rn = b − Axn
OM step:
Calculate new A-orthogonal search direction dn so that
span{d0 , d1 , . . . , dn } = span{r0 , Ar0 , A2 r0 , . . . , An r0 }
end

The key (OM step) is accomplished by the “Orthogonalization of mo-


ments” algorithm, so-called because moments of an operator A are powers
of A acting on a fixed vector. This algorithm takes a set of moments
{e1 , e2 , e3 , · · · , ej } where ej = Aj−1 e1 and generates an A-orthogonal basis
{φ1 , φ2 , φ3 , · · · , φj } spanning the same subspace.
Algorithm 8.6 (Orthogonalization of moments algorithm). Let A


be SPD, and e1 ∈ RN be a given vector.

φ^1 = e^1
for n=1:N-1
α = ⟨φ^n, Aφ^n⟩_A / ⟨φ^n, φ^n⟩_A
if n==1
φ^2 = Aφ^1 − αφ^1
else
β = ⟨φ^{n−1}, Aφ^n⟩_A / ⟨φ^{n−1}, φ^{n−1}⟩_A
φ^{n+1} = Aφ^n − αφ^n − βφ^{n−1}
end
end
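A NumPy sketch of Algorithm 8.6, with one liberty taken: each new basis vector is normalized, purely as a numerical safeguard (compare footnote 5 and Algorithm 8.7 below); the rescaling does not affect the spans or the A-orthogonality. The few lines at the end are only a numerical spot-check of the A-orthogonality claimed in Theorem 8.6 below.

import numpy as np

def orthogonalize_moments(A, e1):
    # Algorithm 8.6: A-orthogonal basis of span{e1, A e1, ..., A^{N-1} e1}
    # built by a three term recursion (normalization added for safety only).
    ipA = lambda u, v: u @ (A @ v)            # <u, v>_A = u^t A v
    N = len(e1)
    phi = [np.asarray(e1, dtype=float) / np.linalg.norm(e1)]
    for n in range(N - 1):
        p = phi[n]
        Ap = A @ p
        alpha = ipA(p, Ap) / ipA(p, p)        # <phi^n, A phi^n>_A / <phi^n, phi^n>_A
        new = Ap - alpha * p
        if n > 0:
            q = phi[n - 1]
            beta = ipA(q, Ap) / ipA(q, q)     # <phi^{n-1}, A phi^n>_A / <phi^{n-1}, phi^{n-1}>_A
            new -= beta * q
        phi.append(new / np.linalg.norm(new))
    return np.column_stack(phi)

# numerical spot-check on a random SPD test matrix
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6.0 * np.eye(6)
Phi = orthogonalize_moments(A, np.ones(6))
G = Phi.T @ A @ Phi                            # should be (nearly) diagonal
print(np.max(np.abs(G - np.diag(np.diag(G)))))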

In comparison with Gram–Schmidt, this algorithm produces an A-


orthogonal basis of the Krylov subspace through a three term recursion.

Theorem 8.6 (Orthogonalization of moments). Let A be an SPD matrix. The sequence {φ^j}_{j=1}^{N} produced by the Orthogonalization of Moments Algorithm 8.6 is A-orthogonal. Further, for e^j = A^{j−1}e^1, at each step 1 ≤ j ≤ N
span{e^1, e^2, e^3, · · ·, e^j} = span{φ^1, φ^2, φ^3, · · ·, φ^j}.

Proof. Preliminary remarks: First note that the equation for φn+1
takes the form

φn+1 = Aφn + αφn + βφn−1 . (8.1)

Consider the RHS of this equation. We have, by the induction hypothesis

Aφ^n ∈ span{e^1, e^2, e^3, · · ·, e^{n+1}},
αφ^n + βφ^{n−1} ∈ span{e^1, e^2, e^3, · · ·, e^n},
and ⟨φ^n, φ^{n−1}⟩_A = 0.

The step φn+1 = Aφn + αφn + βφn−1 contains two parameters. It is easy
to check that the parameters α and β are picked (respectively) to make the
two equations hold:

⟨φ^{n+1}, φ^n⟩_A = 0,
⟨φ^{n+1}, φ^{n−1}⟩_A = 0.
Indeed
0 = ⟨φ^{n+1}, φ^n⟩_A = ⟨Aφ^n + αφ^n + βφ^{n−1}, φ^n⟩_A
  = ⟨Aφ^n, φ^n⟩_A + α⟨φ^n, φ^n⟩_A + β⟨φ^{n−1}, φ^n⟩_A
  = ⟨Aφ^n, φ^n⟩_A + α⟨φ^n, φ^n⟩_A
and the same for ⟨φ^{n+1}, φ^{n−1}⟩_A = 0 gives two equations for α, β whose solutions are exactly the values chosen in Orthogonalization of Moments. The key issue in the proof is thus to show that
⟨φ^{n+1}, φ^j⟩_A = 0, for j = 1, 2, · · ·, n − 2.   (8.2)
This will hold precisely because span{e1 , e2 , e3 , · · · , ej } is a Krylov subspace
determined by moments of A.
Details of the proof: We show from (8.1) that (8.2) holds. The proof
is by induction. To begin, from the choice of α, β it follows that the theorem
holds for j = 1, 2, 3. Now suppose the theorem holds for j = 1, 2, · · · , n.
From (8.1) consider ⟨φ^{n+1}, φ^j⟩_A. By the above preliminary remarks, this is zero for j = n, n − 1. Thus consider j ≤ n − 2. We have
⟨φ^{n+1}, φ^j⟩_A = ⟨Aφ^n, φ^j⟩_A + α⟨φ^n, φ^j⟩_A + β⟨φ^{n−1}, φ^j⟩_A  for j = 1, 2, · · ·, n − 2.
By the induction hypothesis
⟨φ^n, φ^j⟩_A = ⟨φ^{n−1}, φ^j⟩_A = 0,
thus it simplifies to
⟨φ^{n+1}, φ^j⟩_A = ⟨Aφ^n, φ^j⟩_A.
Consider thus ⟨Aφ^n, φ^j⟩_A. Note that A is self adjoint with respect to the A-inner product. Indeed, we calculate
⟨Aφ^n, φ^j⟩_A = (Aφ^n)^t Aφ^j = (φ^n)^t A^t Aφ^j = (φ^n)^t AAφ^j = ⟨φ^n, Aφ^j⟩_A.
Thus, ⟨Aφ^n, φ^j⟩_A = ⟨φ^n, Aφ^j⟩_A. By the induction hypothesis (and because we are dealing with a Krylov subspace): for j ≤ n − 2
φ^j ∈ span{e^1, e^2, e^3, · · ·, e^{n−2}},
thus
Aφ^j ∈ span{e^1, e^2, e^3, · · ·, e^{n−2}, e^{n−1}}.
Further, by the induction hypothesis
span{e^1, e^2, e^3, · · ·, e^{n−1}} = span{φ^1, φ^2, φ^3, · · ·, φ^{n−1}}.
Finally by the induction hypothesis {φ^1, φ^2, φ^3, · · ·, φ^n} are A-orthogonal, so
⟨φ^n, something in span{φ^1, φ^2, φ^3, · · ·, φ^{n−1}}⟩_A = 0.
Putting the steps together:
⟨φ^{n+1}, φ^j⟩_A = ⟨Aφ^n, φ^j⟩_A = ⟨φ^n, Aφ^j⟩_A = ⟨φ^n, something in span{φ^1, φ^2, φ^3, · · ·, φ^{n−1}}⟩_A = 0.
All that remains is to show that
span{e^1, e^2, e^3, · · ·, e^{n+1}} = span{φ^1, φ^2, φ^3, · · ·, φ^{n+1}}.
This reduces to showing e^{n+1} = Ae^n ∈ span{φ^1, φ^2, φ^3, · · ·, φ^{n+1}}. It follows from the induction hypothesis and (8.1) and is left as an exercise.

The orthogonalization of moments algorithm is remarkable. Using it to


find the basis vectors [which become the descent directions] and exploit-
ing the various orthogonality relations, we shall see that the CG method
simplifies to a very efficient form.

Exercise 8.8. (An exercise in looking for similarities in different


algorithms). Compare the orthogonalization of moments algorithm to
the one (from any comprehensive numerical analysis book) which produces
orthogonal polynomials. Explain their similarities.

Exercise 8.9. If A is not symmetric, where does the proof break down? If
A is not positive definite, where does it break down?

Exercise 8.10. Write down the Gram–Schmidt algorithm for producing


an A-orthogonal basis. Calculate the complexity of Gram–Schmidt and
Orthogonalization of moments. (Hint: Count matrix-vector multiplies and
inner products, ignore other operations.) Compare.

Exercise 8.11. Complete the proof of the Orthogonalization of Moments


Theorem.

8.3.3 A simplified CG method


We can already present a CG method that attains the amazing properties
claimed of CG in the first section. The method is improvable in various
ways, but let us first focus on the major advances made by just descent
(equivalent to summing an orthogonal series) with the directions generated
by orthogonalization of moments. Putting the two together in a very simple
way gives the following version of the CG method.
Algorithm 8.7 (A version of CG). Given SPD A and initial guess x^0,

r^0 = b − Ax^0
d^0 = r^0/||r^0||
First descent step:
α_0 = ⟨d^0, r^0⟩ / ⟨d^0, d^0⟩_A
x^1 = x^0 + α_0 d^0
r^1 = b − Ax^1
First step of OM:
γ_0 = ⟨d^0, Ad^0⟩_A / ⟨d^0, d^0⟩_A
d^1 = Ad^0 − γ_0 d^0
d^1 = d^1/||d^1||   (normalize^5 d^1)
for n=1:∞
Descent Step:
α_n = ⟨r^n, d^n⟩ / ⟨d^n, d^n⟩_A
x^{n+1} = x^n + α_n d^n
r^{n+1} = b − Ax^{n+1}
if converged, STOP, end
OM step:
γ_n = ⟨d^n, Ad^n⟩_A / ⟨d^n, d^n⟩_A
β_n = ⟨d^{n−1}, Ad^n⟩_A / ⟨d^{n−1}, d^{n−1}⟩_A
d^{n+1} = Ad^n − γ_n d^n − β_n d^{n−1}
d^{n+1} = d^{n+1}/||d^{n+1}||   (normalize^5 d^{n+1})
end

Algorithm 8.7, while not the most efficient form for computations, cap-
tures the essential features of the method. The differences between the
above version and the highly polished one given in the first section, Algo-
rithm 8.2, take advantage of the various orthogonality properties of CG.
These issues, while important, will be omitted to move on to the error
analysis of the method.

5 Normalizing d is not theoretically necessary, but on a computer, d could grow large

enough to cause the calculation to fail.


Exercise 8.12. Consider the above version of CG. Show that it can be writ-
ten as a three term recursion of the general form xn+1 = an xn +bn xn−1 +cn .

Exercise 8.13. In Exercise 8.2, you wrote a program implementing CG for


the 2D MPP and compared it with other iterative methods. Make a copy
of that program and modify it to implement the simplified version of CG
given in Algorithm 8.7. Show by example that the two implementations are
equivalent in the sense that they generate essentially the same sequence of
iterates.

8.4 Error Analysis of CG

But as no two (theoreticians) agree on this (skin friction) or


any other subject, some not agreeing today with what they wrote
a year ago, I think we might put down all their results, add them
together, and then divide by the number of mathematicians, and
thus find the average coefficient of error.
— Sir Hiram Maxim, In Artificial and Natural Flight (1908),
3. Quoted in John David Anderson, Jr., Hypersonic and High
Temperature Gas Dynamics (2000), 335.

“To err is human but it feels divine.”


— Mae West

We shall show that the CG method takes O(√cond(A)) steps per significant digit (and, as noted above, at most N steps). This result is built
up in several steps that give useful detail on error behavior. The first step
is to relate the error to a problem in Chebychev polynomial approximation.
Recall that we denote the polynomials of degree ≤ n by

Πn := {p(x) : p(x) is a real polynomial of degree ≤ n}.

Theorem 8.7. Let A be SPD. Then the CG method’s error en = x − xn


satisfies:
(i) en ∈ e0 + span{Ae0 , A2 e0 , · · · , An e0 }.
(ii) ||en ||A = min{||e||A : e ∈ e0 + span{Ae0 , A2 e0 , · · · , An e0 }}.
(iii) The error is bounded by
||x − x^n||_A ≤ ( min_{p_n ∈ Π_n, p(0)=1}  max_{λ_min ≤ x ≤ λ_max} |p(x)| ) ||e^0||_A.
Proof. As Ae = r and
r^n ∈ r^0 + span{Ar^0, A^2r^0, A^3r^0, · · ·, A^nr^0},
this implies
Ae^n ∈ A( e^0 + span{Ae^0, A^2e^0, A^3e^0, · · ·, A^ne^0} ),
which proves part (i). For (ii) note that since ||e^n||_A^2 = 2(J(x^n) − J(x)), minimizing J(x) is equivalent to minimizing the A-norm of e. Thus, part (ii) follows. For part (iii), note that part (i) implies
e^n = [I + a_1A + a_2A^2 + · · · + a_nA^n]e^0 = p(A)e^0,
where p(x) is a real polynomial of degree ≤ n and p(0) = 1. Thus, from this observation and part (ii),
||x − x^n||_A = min_{p_n ∈ Π_n, p(0)=1} ||p(A)e^0||_A ≤ ( min_{p_n ∈ Π_n, p(0)=1} ||p(A)||_A ) ||e^0||_A.
The result follows by calculating, using the spectral mapping theorem, that
||p(A)||_A = max_{λ ∈ spectrum(A)} |p(λ)| ≤ max_{λ_min ≤ x ≤ λ_max} |p(x)|.

To determine the rate of convergence of the CG method, the question now is:
How big is the quantity
min_{p_n ∈ Π_n, p(0)=1}  max_{λ ∈ spectrum(A)} |p(λ)| ?
Fortunately, this is a famous problem of approximation theory, solved long ago by Chebychev.
The idea of Chebychev’s solution is to pick points x_j in the interval λ_min ≤ x ≤ λ_max and let p̃_n(x) interpolate zero at those points; p̃_n(x) solves the interpolation problem:
p̃_n(0) = 1,
p̃_n(x_j) = 0, j = 1, 2, . . . , n, where λ_min ≤ x_j ≤ λ_max.
By making p̃_n(x) zero at many points on λ_min ≤ x ≤ λ_max we therefore force p̃_n(x) to be small over all of λ_min ≤ x ≤ λ_max. We have then
min_{p_n ∈ Π_n, p(0)=1}  max_{λ ∈ spectrum(A)} |p(λ)|  ≤  min_{p_n ∈ Π_n, p(0)=1}  max_{λ_min ≤ x ≤ λ_max} |p(x)|  ≤  max_{λ_min ≤ x ≤ λ_max} |p̃_n(x)|.
The problem now shifts to finding the “best” points to interpolate zero, best
being in the sense of the min-max approximation error. This problem is a
classical problem of approximation theory and was also solved by Cheby-
chev, and the resulting polynomials are called “Chebychev polynomials”,
one of which is depicted in Figure 8.1.

Fig. 8.1 We make p(x) small by interpolating zero at points on the interval. In this illustration, the minimum and maximum values of p(x) are computed on the interval [λ_min, λ_max].

Theorem 8.8 (Chebychev polynomials, min-max problem). The points x_j for which p̃_n(x) attains
min_{p_n ∈ Π_n, p(0)=1}  max_{λ_min ≤ x ≤ λ_max} |p(x)|
are the zeroes of the Chebychev polynomial
T_n( (b + a − 2x)/(b − a) ) / T_n( (b + a)/(b − a) )
on [a, b] (where a = λ_min, b = λ_max). We then have
min_{p_n ∈ Π_n, p(0)=1}  max_{a ≤ x ≤ b} |p(x)| = max_{a ≤ x ≤ b} | T_n((b + a − 2x)/(b − a)) / T_n((b + a)/(b − a)) | = 2 σ^n / (1 + σ^{2n}),   σ := (1 − √(a/b)) / (1 + √(a/b)).

Proof. For the proof and development of the beautiful theory of Chebychev approximation see any general approximation theory book.
To convert this general result to a prediction about the rate of convergence of CG, simply note that
√(a/b) = √(λ_min/λ_max) = 1/√cond(A).
Thus we have the error estimate for CG:

Theorem 8.9. Let A be SPD. Then

(1) The nth CG residual is optimal over K_n in the A^{−1} norm:
    ||r^n||_{A^{−1}} = min_{r ∈ K_n} ||r||_{A^{−1}};
(2) The nth CG error is optimal over K_n in the A norm:
    ||e^n||_A = min_{e ∈ K_n} ||e||_A;
(3) The nth CG energy functional is optimal over K_n:
    J(x^n) = min_{x ∈ K_n} J(x);
(4) We have the orthogonality relations
    ⟨r^k, r^j⟩ = 0, k ≠ j,
    ⟨d^k, d^j⟩_A = 0, k ≠ j;
(5) Given any ε > 0, for
    n ≥ (1/2) √cond(A) ln(2/ε) + 1
    the error in the CG iterations is reduced by ε:
    ||x^n − x||_A ≤ ε ||x^0 − x||_A.
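As a quick sanity check of item (5), a few lines of Python evaluate the predicted iteration count for the model problem of Example 8.1 (cond(A) ≈ 10^4) with a six-digit reduction ε = 10^{−6}; the answer, roughly 727, is consistent with the 500–1000 iterations anticipated there.

import math

def cg_iterations(cond, eps):
    # n >= (1/2) sqrt(cond(A)) ln(2/eps) + 1, from Theorem 8.9 (5)
    return math.ceil(0.5 * math.sqrt(cond) * math.log(2.0 / eps) + 1)

print(cg_iterations(1.0e4, 1.0e-6))   # 727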

Exercise 8.14. Construct an interpolating polynomial of degree ≤ N with


p(0) = 1 and p(λj ) = 0, 1 ≤ j ≤ N . Use this to give a second proof that
CG gives the exact answer in N steps or less.

Exercise 8.15. Show that if A has M < N distinct eigenvalues then the CG method converges to the exact solution in at most M (< N) steps. Recall that
||x − x^n||_A = ( min_{p_n ∈ Π_n, p(0)=1}  max_{λ ∈ spectrum(A)} |p(λ)| ) ||e^0||_A.
Then construct a polynomial p(x) of degree M with p(λ) = 0 for all λ ∈ spectrum(A).
8.5 Preconditioning

“Oh, what a difficult task it was.


To drag out of the marsh
the hippopotamus”
— Korney Ivanovic’ Chukovsky

The idea of preconditioning is to “preprocess” the linear system to re-


duce the condition number of A. We pick an SPD matrix M , for which it
is very easy to solve M y = f , and replace
Ax = b ⇐ M −1 Ax = M −1 b.
PCG for Ax = b is CG for M −1 Ax = M −1 b.6 Of course we never invert M
explicitly; every time M −1 is written it means “solve a linear system with
coefficient matrix M”. A few simple examples of preconditioners:

• M = a diagonal matrix with M_ii = Σ_{j=1}^{N} |a_ij|,
• M = the tridiagonal part of A:
  M_ij = a_ij for j = i, i − 1, i + 1, and
  M_ij = 0 otherwise.
• If A = A_0 + B where A_0 is simpler than A and easy to invert, then pick M = A_0. Instead of picking A_0, we can also pick B: the entries in A to discard to get the preconditioner M.
• M = L̃L̃^t, a simple and cheap approximation to the LL^t decomposition of A.
• Any iterative method indirectly determines a preconditioner. Indeed,
since M approximates A the solution of M ρ = r approximates the
solution of Aρ = r. If some other iterative method is available as a
subroutine then an approximate solution to Aρ = r can be calculated
by doing a few steps of some (other) iterative method for the equation
Aρ = r. This determines a matrix M (indirectly of course).

If we are given an effective preconditioner M , PCG can be simplified to


the following attractive form.
6 To be very precise, A SPD and M SPD does not imply M^{−1}A is SPD. However, M^{−1}A is similar to the SPD matrix M^{−1/2}AM^{−1/2}. We think of PCG for Ax = b as CG for M^{−1}Ax = M^{−1}b. Again, to be very picky, in actual fact it is CG for the system (M^{−1/2}AM^{−1/2})y = M^{−1/2}b with SPD coefficient matrix M^{−1/2}AM^{−1/2}. The first step is, after writing down CG for this system, to reverse the change of variable y ⇐ M^{1/2}x everywhere and eliminate all the M^{±1/2}.
Algorithm 8.8 (PCG Algorithm for solving Ax = b). Given an SPD matrix A, preconditioner M, initial guess vector x^0, right side vector b, and maximum number of iterations itmax

r^0 = b − Ax^0
Solve M d^0 = r^0
z^0 = d^0
for n=0:itmax
α_n = ⟨r^n, z^n⟩/⟨d^n, Ad^n⟩
x^{n+1} = x^n + α_n d^n
r^{n+1} = b − Ax^{n+1}   (∗)
if converged, stop, end
Solve M z^{n+1} = r^{n+1}
β_{n+1} = ⟨r^{n+1}, z^{n+1}⟩/⟨r^n, z^n⟩
d^{n+1} = z^{n+1} + β_{n+1} d^n   (∗∗)
end

Note that the extra cost is exactly one solve with M each step. There is
a good deal of art in picking preconditioners that are inexpensive to apply
and that reduce cond(A) significantly.
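A NumPy sketch of Algorithm 8.8. The preconditioner solve M z = r is passed in as a function so that any of the choices listed in this section can be tried; the commented example uses the diagonal part of A, as in Exercise 8.19(a) below. The stopping test is, again, our own choice.

import numpy as np

def pcg(A, b, x0, Msolve, itmax, tol=1e-8):
    # Algorithm 8.8 (PCG). Msolve(r) returns the solution z of M z = r.
    x = x0.copy()
    r = b - A @ x
    z = Msolve(r)                        # solve M d^0 = r^0; z^0 = d^0
    d = z.copy()
    rz = r @ z                           # <r^n, z^n>
    for n in range(itmax):
        Ad = A @ d
        alpha = rz / (d @ Ad)            # alpha_n = <r^n, z^n>/<d^n, A d^n>
        x = x + alpha * d
        r = b - A @ x                    # step (*)
        if np.linalg.norm(r) < tol * np.linalg.norm(b):   # stopping test: our choice
            return x, n + 1
        z = Msolve(r)                    # solve M z^{n+1} = r^{n+1}
        rz_new = r @ z
        beta = rz_new / rz               # beta_{n+1} = <r^{n+1}, z^{n+1}>/<r^n, z^n>
        d = z + beta * d                 # step (**)
        rz = rz_new
    return x, itmax

# Example preconditioner: the diagonal part of A (Exercise 8.19(a))
# jacobi = lambda r: r / np.diag(A)
# x, its = pcg(A, b, np.zeros_like(b), jacobi, itmax=1000)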

Exercise 8.16. Let Ax = b be converted to a fixed point problem (I −


T )x = f (associated with a simple iterative method). If I − T is SPD we
can apply CG to this equation resulting in using a simple iterative method
to precondition A. (a) Suppose ||T || < 1. Estimate cond(I − T ) in terms of
1 − ||T ||. (b) For the MPP pick a simple iterative method and work out for
that method (i) if I − T is SPD, and (ii) whether cond(I − T ) < cond(A).
(c) Construct a 2 × 2 example where ||T|| > 1 but cond(I − T) ≪ cond(A).

Exercise 8.17. (a) Find 2 × 2 examples where A SPD and M SPD does not imply M^{−1}A is SPD. (b) Show however that M^{−1/2}AM^{−1/2} is SPD. (c) Show that if A or B is invertible then AB is similar to BA. Using this, show that M^{−1/2}AM^{−1/2} has the same eigenvalues as M^{−1}A.
 
Exercise 8.18. Write down CG for (M^{−1/2}AM^{−1/2})y = M^{−1/2}b. Reverse the change of variable y ⇐ M^{1/2}x everywhere and eliminate all the M^{±1/2} to give the PCG algorithm as stated.

Exercise 8.19. In Exercise 8.1 (page 182), you wrote a program to apply
CG to the 2D MPP. Modify that code to use PCG, Algorithm 8.8. Test
it carefully, including one test using the identity matrix as preconditioner,


making sure it results in exactly the same results as CG.
Choose as preconditioner (a) the diagonal part of the matrix A (4 times
the identity matrix); (b) the tridiagonal part of A (the tridiagonal matrix
with 4 on the diagonal and -1 on the super- and sub-diagonals); and, (c) a
third preconditioner of your choosing. Compare the numbers of iterations
required and the total computer time required for each case.

8.6 CGN for Non-SPD Systems

You must take your opponent into a deep dark forest where
2+2=5, and the path leading out is only wide enough for one.
— Mikhail Tal

If A is SPD then the CG method is provably the best possible one. For
a general linear system the whole beautiful structure of the CG method
collapses. In the SPD case CG has the key properties that

• it is given by one 3 term (or two coupled 2 term) recursion,


• it has the finite termination property producing the exact solution in
N steps for an N × N system,
• the nth step produces an approximate solution that is optimal over an
n-dimensional affine subspace,
• it never breaks down,

• it takes at most O(√cond(A)) steps per significant digit.

There is a three-step, globally optimal, finite terminating CG method in the SPD case. In the non-symmetric case there is a fundamental result of Faber, Manteuffel and Voevodin [Faber and Manteuffel (1984)] and [Voevodin (1983)].
They proved a nonexistence theorem: no general extension of CG exists which retains these properties. The following is a summary.

Theorem 8.10 (Faber, Manteuffel, and Voevodin). Let A be an N ×


N real matrix. An s term conjugate gradient method exists for the matrix
A if and only if either A is 3 by 3 or A is symmetric or A has a complete
set of eigenvectors and the eigenvalues of A lie along a line in the complex
plane.
Thus, in a mathematically precise sense CG methods cannot exist for


general nonsymmetric matrices. This means various extensions of CG to
nonsymmetric systems seek to retain some of the above properties by giving
up the others. Some generalized CG methods drop global optimality (and
this means finite termination no longer holds) and some drop the restriction
of a small recursion length (e.g., some have full recursions- the nth step has
k = n). Since nothing can be a general best solution, there naturally have
resulted many generalized CG methods which work well for some problems
and poorly for others when A is nonsymmetric. (This fact by itself hints
that none work well in all cases.) Among the popular ones today there are:

• biCG = biconjugate gradient method: biCG is based on an extension


of the Orthogonalization of Moments algorithm to nonsymmetric ma-
trices. It does not produce an orthogonal basis but rather two, a basis
and a so-called shadow basis: {φ_i : i = 1, ···, N} and {φ̃_i : i = 1, ···, N}. The pair have the bi-orthogonality property that
φ̃_i^t A φ_j = 0 for i ≠ j.
• CGS = conjugate gradient squared (which does not require At ): CGS
is an idea of Sonneveld that performed very well but resisted rigor-
ous understanding for many years. Motivated by biCG, Sonneveld
tried (loosely speaking) replacing the use of At by A in the algorithm
wherever it occurred. This is of course very easy to test once biCG
is implemented. The result was a method that converged in practice
twice as fast as biCG.
• GMRes = generalized minimum residual method: GMRes was based
on two modifications to CG. First the residual minimized at each step
is ||b − Axn+1 ||22 . This produces a method with no breakdowns at this
step. Next orthogonalization of moments is replaced by the full Gram–
Schmidt algorithm. The result is a memory expensive method which
is optimal and does not break down.
• CGNE and CGNR = different realizations of CG for the normal
equations
At Ax = At b.
Of course an explicit or implicit change to the normal equations squares
the condition number of the system being solved and requires At .

None in general work better than CGNE so we briefly describe CGNE.


Again, we stress that for nonsymmetric systems, the “best” generalized CG
method will vary from one system to another. We shall restrict ourselves
to the case where A is square (N × N ). The following is known about the
normal equations.

Theorem 8.11 (The normal equations). Let A be N × N and invertible. Then A^tA is SPD. If A is SPD then
cond_2(A^tA) = [cond_2(A)]^2.

Proof. Symmetry: (A^tA)^t = A^tA^{tt} = A^tA. Positivity: x^t(A^tA)x = (Ax)^tAx = |Ax|^2 > 0 for x nonzero since A is invertible. If A is SPD, then A^tA = A^2 and
cond_2(A^tA) = cond_2(A^2) = λ_max(A^2)/λ_min(A^2) = λ_max(A)^2/λ_min(A)^2 = ( λ_max(A)/λ_min(A) )^2 = [cond_2(A)]^2.

Thus, any method using the normal equation will pay a large price in
increasing condition numbers and numbers of iterations required. Beyond
that, if A is sparse, forming At A directly shows that At A will have roughly
double the number of nonzero entries per row as A. Thus, any algorithm
working with the normal equations avoids forming them explicitly. Resid-
uals are calculated by multiplying by A and then multiplying that by At .

Algorithm 8.9 (CGNE = CG for the normal equations).
Given preconditioner M, matrix A, initial vector x^0, right side vector b, and maximum number of iterations itmax

r^0 = b − Ax^0
z^0 = A^t r^0
d^0 = z^0
for n=0:itmax
α_n = ⟨z^n, z^n⟩/⟨d^n, A^t(Ad^n)⟩ = ⟨z^n, z^n⟩/⟨Ad^n, Ad^n⟩
x^{n+1} = x^n + α_n d^n
r^{n+1} = b − Ax^{n+1}
if converged, exit, end
z^{n+1} = A^t r^{n+1}
β_{n+1} = ⟨z^{n+1}, z^{n+1}⟩/⟨z^n, z^n⟩
d^{n+1} = z^{n+1} + β_{n+1} d^n
end
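A NumPy sketch of Algorithm 8.9 (the preconditioner M listed among the inputs is not used in the steps as printed, so it is omitted here). Residuals of the normal equations are formed by multiplying by A and then by A^t, so A^tA is never assembled; the stopping test is our own choice.

import numpy as np

def cgne(A, b, x0, itmax, tol=1e-8):
    # Algorithm 8.9: CG applied to A^t A x = A^t b, without forming A^t A.
    x = x0.copy()
    r = b - A @ x                       # residual of A x = b
    z = A.T @ r                         # z^0 = A^t r^0, residual of the normal equations
    d = z.copy()
    zz = z @ z
    for n in range(itmax):
        Ad = A @ d
        alpha = zz / (Ad @ Ad)          # <z^n, z^n>/<A d^n, A d^n>
        x = x + alpha * d
        r = b - A @ x
        if np.linalg.norm(r) < tol * np.linalg.norm(b):   # stopping test: our choice
            return x, n + 1
        z = A.T @ r                     # z^{n+1} = A^t r^{n+1}
        zz_new = z @ z
        beta = zz_new / zz              # <z^{n+1}, z^{n+1}>/<z^n, z^n>
        d = z + beta * d
        zz = zz_new
    return x, itmax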

Applying the convergence theory of CG, we have that CGN takes roughly the following number of steps per significant digit:
(1/2) √cond_2(A^tA) ≈ (1/2) √([cond_2(A)]^2) = (1/2) cond_2(A).
Since this is much larger than the SPD case of (1/2) √cond_2(A) steps, preconditioning becomes much more important in the non SPD case than the SPD case. Naturally, much less is known in the non SPD case about construction of good preconditioners.
For the other variants, let us recall the good properties of CG as a way
to discuss in general terms some of them. CG for SPD matrices A has the
key properties that

• For A SPD CG is given by one 3 term recursion or, equiva-


lently, two coupled 2 term recursions: This is an important prop-
erty for efficiency. For A not SPD, GMRes drops this property and
computes the descent directions by Gram–Schmidt orthogonalization.
Thus for GMRes it is critical to start with a very good preconditioner
and so limit the number of steps required. CGS retains a 3 term recur-
sion for the search directions as does biCG and CGNE.
• For A SPD it has the finite termination property producing
the exact solution in at most N steps for an N × N system:
For A not SPD, biCG and full GMRes retain the finite termination
property while CGS does not.
• For A SPD the nth step produces an approximate solution that
is optimal over a n-dimensional affine subspace: For A not SPD,
biCG and full GMRes retain this property while CGS does not.
• For A SPD it never breaks down: For A not SPD, breakdowns
can occur. One method of dealing with them is to test for zero denom-
inators and when one appears the algorithm is simply restarted taking
the last approximation as the initial guess. biCG and CGS can have
breakdowns. Full GMRes is reliable. Breakdowns can occur when the
full Gram–Schmidt orthogonalization procedure is truncated to a fixed
number of steps. 
• For A SPD it takes at most O(√cond(A)) steps per significant
digit: For A not SPD, the question of the number of steps required
is very complex. On the one hand, one can phrase the question (indirectly): if method X is applied and A happens to be SPD, does
method X reduce to CG? Among the methods mentioned, only biCG
has this property. General (worst case) convergence results for these
methods give no improvement over CGNE: they predict O(cond2 (A))
steps per significant digit. Thus the question is usually studied by com-
putational tests which have shown that there are significant examples
of nonsymmetric systems for which each of the methods mentioned is
the best and requires significantly fewer than the predicted worst case
number of steps.

Among the generalized CG methods for nonsymmetric systems, GMRes


is the one currently most commonly used. It also seems likely that CGS is
a method that is greatly underappreciated and underused.

Exercise 8.20. The goal of this exercise is for you to design and analyze
(reconstruct as much of the CG theory as you can) your own Krylov sub-
space iterative method that will possibly be better than CG. So consider
solving Ax = b where A is nonsingular. Given xn , dn the new iterate is
computed by

xn+1 = xn + αn dn
αn = arg min ||b − Axn+1 ||22 .

(a) Find a formula for αn . Can this formula ever break down? Is there a
zero divisor ever? Does the formula imply that xn+1 is a projection
[best approximation] with respect to some inner product and norm?
Prove it.
(b) Next consider your answer to part (a) carefully. Suppose the search
directions are orthogonal with respect to this inner product. Prove a
global optimality condition for your new method.
(c) What is the appropriate Krylov subspace to consider for the new
method? Reconsider the Orthogonalization of Moments algorithm.
Adapt it to give an algorithm and its proof for generating such an or-
thogonal basis.
(d) For this part you may choose: Either test the method and compare it
with CG for various h’s for the MPP or complete the error estimate for
the method adapting the one for CG.
Exercise 8.21. Consider the nonsymmetric 2 × 2 block system
[ A_1  C ; −C^t  A_2 ] [ x ; y ] = [ b_1 ; b_2 ].
Suppose A_1 and A_2 are SPD and all blocks are N × N. Check in a 2 × 2 example that the eigenvalues of this matrix are not real. Consider preconditioning by the 2 × 2 block SPD system as follows:
[ A_1  0 ; 0  A_2 ]^{−1} [ A_1  C ; −C^t  A_2 ] [ x ; y ] = [ A_1  0 ; 0  A_2 ]^{−1} [ b_1 ; b_2 ].
Show that the eigenvalues of the preconditioned matrix
[ A_1  0 ; 0  A_2 ]^{−1} [ A_1  C ; −C^t  A_2 ]
lie on a line in the complex plane.

Chapter 9

Eigenvalue Problems

“Why is eigenvalue like liverwurst?”


— C. A. Cullen

9.1 Introduction and Review of Eigenvalues

By relieving the brain of all unnecessary work, a good notation


sets it free to concentrate on more advanced problems, and, in effect,
increases the mental power of the race.
— Whitehead, Alfred North (1861–1947), In P. Davis and
R. Hersh, The Mathematical Experience, Boston: Birkhäuser, 1981.

One of the three fundamental problems of numerical linear algebra is to


find information about the eigenvalues of an N × N matrix A. There are
various cases depending on the structure of A (large and sparse vs. small
and dense, symmetric vs. non-symmetric) and the information sought (the
largest or dominant eigenvalue, the smallest eigenvalue vs. all the eigenval-
ues).

Definition 9.1 (eigenvalue-eigenvector). Let A be an N × N matrix. λ is an eigenvalue of A if there is a nonzero vector φ ≠ 0 with
Aφ = λφ.
φ is an eigenvector associated with the eigenvalue λ.

Eigenvalues are important quantities and often the figure of interest in


physical problems. A few examples:


Vibration problems: Let x(t) : [0, ∞) → R^N satisfy
x''(t) + Ax(t) = 0.
For such problems vibratory or oscillatory motions at a fundamental frequency ω are critical to the observed dynamics of x(t). Often the problem is to design a system (the design results in the matrix A) to fix its fundamental frequencies. Such an oscillatory solution takes the form
x(t) = cos(ωt) φ.
Inserting this into the ODE x''(t) + Ax(t) = 0 gives
−ω^2 cos(ωt) φ + A cos(ωt) φ = 0  ⟺  Aφ = ω^2 φ.
Thus, ω is a fundamental frequency (and the resulting motion is nonzero and a persistent vibration) if and only if ω^2 is an eigenvalue of A. Finding fundamental frequencies means finding eigenvalues.
Buckling of a beam: The classic model for buckling of a thin beam
is a yard stick standing and loaded on its top. Under light loads (and
carefully balanced) it will stand straight. At a critical load it will buckle.
The problem is to find the critical load. The linear elastic model for the
displacement is the ODE
y''''(x) + λy''(x) = 0, 0 < x < b,
y(0) = 0 = y''(0),
y(b) = 0 = y''(b).
The critical load can be inferred from the smallest value of λ for which the
above has a nonzero solution.1 If the derivatives in the ODE are replaced by
difference quotients on a mesh, this leads to an eigenvalue problem for the
resulting matrix. Finding the critical load under which buckling happens
means finding an eigenvalue.
Stability of equilibria: If x(t) : [0, ∞) → R^N satisfies
x'(t) = F(x(t)),   F : R^N → R^N,
1 This simple problem can be solved by hand using the general solution of the ODE. If the

problem becomes 1/2 step closer to a real problem from science or engineering, such as
buckling of a 2D shell, it cannot be solved exactly. Then the only recourse is to discretize,
replace it by an EVP for a matrix and solve that.
then an equilibrium solution is a vector x_0 satisfying F(x_0) = 0. If x(t) is another solution near the equilibrium solution, we can expand in a Taylor series near the equilibrium. The deviation from equilibrium x(t) − x_0 satisfies
(x(t) − x_0)' = F(x(t)) − F(x_0)
= [F(x_0) + F'(x_0)(x(t) − x_0) + O(||x(t) − x_0||^2)] − F(x_0)
= F'(x_0)(x(t) − x_0) + (small terms).
Thus whether x(t) approaches x_0 or not depends on the real parts of the eigenvalues of the N × N derivative matrix evaluated at the equilibrium. The equilibrium x_0 is locally stable provided the eigenvalues λ of F'(x_0) satisfy Re(λ) < 0. Determining stability of rest states means finding eigenvalues.
Finding eigenvalues. Calculating λ, φ by hand (for small matrices) is a three-step process which is simple in theory but seldom practicable.


Finding λ, φ for A an N × N real matrix by hand:

• Step 1: Calculate exactly the characteristic polynomial of A. p(λ) :=


det(A − λI) is a polynomial of degree N with real coefficients.
• Step 2: Find the N (counting multiplicities) real or complex roots of
p(λ) = 0. These are the eigenvalues
λ1 , λ2 , λ3 , · · ·, λN .
• Step 3: For each eigenvalue λi , using Gaussian elimination find a non-
zero solution of


[A − λi I] φ i = 0, i = 1, 2, · · ·, N.

Example 9.1. Find the eigenvalues and eigenvectors of the 2 × 2 matrix
A = [ 1  1 ; 4  1 ].
We calculate the degree 2 polynomial
p_2(λ) = det(A − λI) = det[ 1−λ  1 ; 4  1−λ ] = (1 − λ)^2 − 4.
Solving p_2(λ) = 0 gives
p_2(λ) = 0 ⇔ (1 − λ)^2 − 4 = 0 ⇔ λ_1 = 3, λ_2 = −1.
The eigenvector φ_1 of λ_1 = 3 is found by solving
(A − λI) [ x ; y ] = [ −2  1 ; 4  −2 ] [ x ; y ] = [ 0 ; 0 ].
Solving gives (for any t ∈ R)
y = t,  −2x + y = 0,  or x = (1/2)t.
Thus, (x, y)^t = ((1/2)t, t)^t for any t ≠ 0 is an eigenvector. For example, t = 2 gives
λ_1 = +3,  φ_1 = (1, 2)^t.
Similarly, we solve for φ_2
(A − λI) [ x ; y ] = [ 2  1 ; 4  2 ] [ x ; y ] = [ 0 ; 0 ]
or (x, y)^t = (−(1/2)t, t)^t. Picking t = 2 gives
λ_2 = −1,  φ_2 = (−1, 2)^t.
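In practice these steps are rarely carried out by hand; a library eigensolver does the work. The following few lines check Example 9.1 numerically. Note that numpy.linalg.eig returns the eigenvalues in no guaranteed order and scales each eigenvector to unit length, so the computed columns are scalar multiples of the φ's found above.

import numpy as np

A = np.array([[1.0, 1.0],
              [4.0, 1.0]])
lam, V = np.linalg.eig(A)        # columns of V are eigenvectors
print(lam)                        # approximately 3 and -1
for k in range(2):                # each column satisfies A v = lambda v
    print(np.allclose(A @ V[:, k], lam[k] * V[:, k]))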

Example 9.2 (An example of Wilkinson). The matrices
A = [ 2  1 ; 0  2 ],   A(ε) = [ 2  1 ; −ε  2 ]
are close to each other for ε small. However their eigenspaces differ qualitatively. A has a double eigenvalue λ = 2 which only has one eigenvector. (The matrix A and eigenvalue λ = 2 are called defective.) The other matrix A(ε) has distinct eigenvalues (and thus a complete set of eigenvectors)
λ_1(ε) = 2 + √(−ε),
λ_2(ε) = 2 − √(−ε).
What is the sensitivity of the eigenvalues of A to perturbations? We calculate
Sensitivity of λ_i(ε) := (d/dε) λ_i(ε) |_{ε=0} = ± (1/2) ε^{−1/2} |_{ε=0} = ∞.
Thus small changes of the coefficients of a defective matrix can produce large relative changes of its eigenvalues.
Exercise 9.1. Analyze the sensitivity as ε → 0 of the eigenvalues of the two matrices
A = [ 2  0 ; 0  2 ],   A(ε) = [ 2  ε ; 0  2 ].

Exercise 9.2. Find the smallest in norm perturbation (A → A(ε)) of the


2 × 2 diagonal matrix A = diag(λ1 , λ2 ) that merges its eigenvalues to a
double eigenvalue of A(ε) having value (λ1 + λ2 )/2.

9.1.1 Properties of eigenvalues

Some important properties of eigenvalues and eigenvectors are given below.

Proposition 9.1. If A is diagonal, upper triangular or lower triangular,


then the eigenvalues are on the diagonal of A.

Proof. Let A be upper triangular. Then, using ∗ to denote a generic non-


zero entry,
⎡ ⎤
a11 − λ ∗ ∗ ∗
⎢ 0 a22 − λ ∗ ∗ ⎥
⎢ ⎥
det [A − λI] = det ⎢ . ⎥
⎣ 0 0 . . ∗ ⎦
0 0 0 ann − λ
= (a11 − λ)(a22 − λ) · . . . · (ann − λ) = pn (λ).

The roots of pn are obvious!

Proposition 9.2. Suppose A is N × N and λ_1, . . . , λ_k are distinct eigenvalues of A. Then,
det(A − λI) = (λ_1 − λ)^{p_1} (λ_2 − λ)^{p_2} · . . . · (λ_k − λ)^{p_k}
where p_1 + p_2 + . . . + p_k = N.
Each λ_j has at least one eigenvector φ_j and possibly as many as p_j linearly independent eigenvectors.
If each λ_j has p_j linearly independent eigenvectors then all the eigenvectors together form a basis for R^N.

Proposition 9.3. If A is symmetric (and real) (A = At ), then:


(i) all the eigenvalues and eigenvectors are real,
(ii) there exist N orthonormal^2 eigenvectors φ_1, . . . , φ_N of A:
⟨φ_i, φ_j⟩ = 1 if i = j, and 0 if i ≠ j;
(iii) if C is the N × N matrix with eigenvector φ_j in the jth column, then
C^{−1} = C^t and C^{−1}AC = diag(λ_1, λ_2, . . . , λ_N).


Proposition 9.4. If an eigenvector φ is known, the corresponding eigen-
value is given by the Rayleigh quotient
−→ → − -→
φ∗ A φ −
→∗ − .tr
λ= − − , where φ = conjugate transpose = φ
→∗ → .
φ φ

− →

Proof. If A φ = λ φ , we have

→ → − −
→→ −
φ ∗ A φ = λφ ∗ φ
from which the formula for λ follows.
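A one-line numerical check of Proposition 9.4, using the eigenvector φ_1 = (1, 2)^t from Example 9.1; for real vectors the conjugate transpose is just the transpose.

import numpy as np

A = np.array([[1.0, 1.0], [4.0, 1.0]])
phi = np.array([1.0, 2.0])                   # eigenvector of lambda = 3 from Example 9.1
rayleigh = (phi @ A @ phi) / (phi @ phi)     # phi* A phi / phi* phi
print(rayleigh)                               # 3.0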

Proposition 9.5. If ||·|| is any matrix norm (induced by the vector norm ||·||),
|λ| ≤ ||A||.

Proof. Since
λφ = Aφ,
we have
|λ| ||φ|| = ||λφ|| = ||Aφ|| ≤ ||A|| ||φ||.

Remark 9.1. The eigenvalues of A are complicated, nonlinear functions


of the entries in A. Thus, the eigenvalues of A+B have no correlation with
those of A and B. In general, λ(A + B) ≠ λ(A) + λ(B).
2 “Orthonormal” means orthogonal (meaning mutually perpendicular so their dot prod-

ucts give zero) and normal (meaning their length is normalized to be one).
9.2 Gershgorin Circles

The question we consider in this section is:

What can we tell about the eigenvalues of A from the entries in the
matrix A?

Eigenvalues are very important yet they are complicated, nonlinear


functions of the entries of A. Thus, results that allow us to look at the
entries of A and get information about where the eigenvalues live are useful
results indeed. We have seen two already for N × N real matrices:

• If A = At then λ(A) is real (and, by a similar argument, if At = −A


then the eigenvalues are purely imaginary).
• |λ| ≤ ||A|| for any norm ||·||; in particular this means |λ| ≤ min{||A||_1, ||A||_∞}.

Definition 9.2. The spectrum of A, σ(A), is


σ(A) := {λ|λ is an eigenvalue of A}.
The numerical range R(A) := {x∗ Ax| for all x ∈ CN , ||x||2 = 1}.
This question is often called “spectral localization”. The two classic
spectral localization results are σ(A) ⊂ R(A) and the Gershgorin circle
theorem.

Theorem 9.1 (Properties of Numerical Range). For A an N × N


matrix,

• σ(A) ⊂ R(A)
• R(A) is compact and convex (and hence simply connected).
• If A is a normal matrix (i.e., A commutes with At ) then R(A) is the
convex hull of σ(A).

Proof. The second claim is the celebrated Toeplitz-Hausdorff theorem. We


only prove the first claim. Picking x = the eigenvector of λ of unit length
gives: x∗ Ax = the eigenvalue λ.

Theorem 9.2 (The Gershgorin Circle Theorem). Let A = (aij), i, j = 1, . . . , N. Define the row and column sums which exclude the diagonal entry:
\[
r_k = \sum_{j=1,\, j \neq k}^{N} |a_{kj}|, \qquad c_k = \sum_{j=1,\, j \neq k}^{N} |a_{jk}|.
\]
Define the closed disks in C:
\[
R_k = \{ z \in \mathbb{C} : |z - a_{kk}| \le r_k \}, \qquad C_k = \{ z \in \mathbb{C} : |z - a_{kk}| \le c_k \}.
\]
Then, if λ is an eigenvalue of A

(1) λ ∈ Rk for some k.


(2) λ ∈ Ck for some k.
(3) If Ω is a union of precisely k disks that is disjoint from all other disks
then Ω must contain k eigenvalues of A.

Fig. 9.1 Three Gershgorin disks in Example 9.3.

Example 9.3. Let
\[
A_{3\times 3} = \begin{bmatrix} 1 & 2 & -1 \\ 2 & 7 & 0 \\ -1 & 0 & 5 \end{bmatrix}.
\]
We calculate
r1 = 2 + 1 = 3, r2 = 2 + 0 = 2, r3 = 1 + 0 = 1,
c1 = 2 + 1 = 3, c2 = 2 + 0 = 2, c3 = 1 + 0 = 1.
The eigenvalues must belong to the three disks in Figure 9.1. Since A = Aᵗ, they must also be real. Thus
−2 ≤ λ ≤ 9.
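As a quick numerical check, the disks and the eigenvalues can be compared in a few lines of Matlab. The sketch below is not from the text; it simply recomputes the Gershgorin radii for the matrix of Example 9.3 and lists the eigenvalues, which all fall inside the union of the disks.

% Gershgorin disks for the matrix of Example 9.3 (illustrative sketch)
A = [ 1 2 -1
      2 7  0
     -1 0  5 ];
centers = diag(A);
radii = sum(abs(A),2) - abs(centers);  % row sums excluding the diagonal entries
disp([centers radii])                  % one disk per row: center and radius
disp(eig(A)')                          % the eigenvalues lie in the union of the disks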

Exercise 9.3. If B is a submatrix of A (constructed by deleting the same rows and columns), show that R(B) ⊂ R(A).

9.3 Perturbation Theory of Eigenvalues

Whatever method is used to calculate approximate eigenvalues, in finite


precision arithmetic what is actually calculated are the eigenvalues of a
nearby matrix. Thus, the first question is “How are the eigenvalues changed
under small perturbations?” It is known that the eigenvalues of a matrix
are continuous functions of the entries of a matrix. However, the modulus
of continuity can be large; small changes in some matrices can produce large
changes in the eigenvalues. One class of matrices that is well conditioned with respect to its eigenvalues is the class of real symmetric matrices.

Example 9.4 (An example of Forsythe). Let A, E be the N × N matrices, for a > 0 and ε > 0 small:
\[
A = \begin{bmatrix}
a & 0 & 0 & \cdots & 0 \\
1 & a & 0 & \cdots & 0 \\
0 & 1 & a & & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 1 & a
\end{bmatrix},
\qquad
E = \begin{bmatrix}
0 & 0 & \cdots & 0 & \varepsilon \\
0 & 0 & \cdots & 0 & 0 \\
\vdots & & & \vdots & \vdots \\
0 & 0 & \cdots & 0 & 0
\end{bmatrix}.
\]

Then the characteristic equations of A and A + E are, respectively,
\[
(a - \lambda)^N = 0 \quad\text{and}\quad (a - \lambda)^N + \varepsilon(-1)^{N+1} = 0.
\]

Thus, the eigenvalues of A are λk = a, a, . . . , a while those of A + E are
\[
\mu_k = a + \omega^{k}\,\varepsilon^{1/N}, \qquad k = 0, 1, \ldots, N-1,
\]
where ω is a primitive N-th root of unity. Thus the perturbation A → A + E changes one real eigenvalue of multiplicity N into N distinct complex eigenvalues spread around a circle of radius ε^{1/N} about a. Now suppose that

• ε = 10^{−10}, N = 10; then the error in the eigenvalues is 0.1. It has been magnified by a factor of 10^9!
• ε = 10^{−100}, N = 100; then the error in the eigenvalues is 0.1 = 10^{99} × the error in A!
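This sensitivity is easy to observe numerically. The following Matlab sketch is illustrative only (the values a = 2, N = 10 and ε = 10⁻¹⁰ are arbitrary choices, not from the text): it builds the Forsythe matrices A and A + E and measures how far the computed eigenvalues of A + E move from a.

% Forsythe example: a tiny perturbation moves a multiple eigenvalue a lot
N = 10; a = 2; epsilon = 1e-10;
A = a*eye(N) + diag(ones(N-1,1),-1);   % a on the diagonal, 1 on the subdiagonal
E = zeros(N); E(1,N) = epsilon;        % single entry epsilon in the upper right corner
mu = eig(A+E);                         % eigenvalues of the perturbed matrix
max(abs(mu - a))                       % spread is about epsilon^(1/N) = 0.1, not epsilon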
9.3.1 Perturbation bounds

For simplicity, suppose A is diagonalizable,
\[
H^{-1} A H = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_N),
\]
and let λ, μ denote the eigenvalues of A and A + E respectively: Ax = λx, (A + E)y = μy.

Theorem 9.3. Let A be diagonalizable by the matrix H and let λi denote the eigenvalues of A. Then, for each eigenvalue μ of A + E,
\[
\min_i |\mu - \lambda_i| \le \|H\|_2\,\|H^{-1}\|_2\,\|E\|_2 = \mathrm{cond}_2(H)\,\|E\|_2.
\]

Proof. The eigenvector y of A + E satisfies the equation
(μI − A)y = Ey.
If μ is an eigenvalue of A then the result holds, since the left-hand side of the claimed inequality is zero. Otherwise, we have
H⁻¹(μI − A)HH⁻¹y = H⁻¹EHH⁻¹y, or
(μI − Λ)w = (H⁻¹EH)w, where w = H⁻¹y.
In this case (μI − Λ) is invertible. Thus
w = (μI − Λ)⁻¹(H⁻¹EH)w,
so that
‖w‖₂ ≤ ‖(μI − Λ)⁻¹‖₂ ‖H⁻¹‖₂ ‖E‖₂ ‖H‖₂ ‖w‖₂.
The result follows since
\[
\|(\mu I - \Lambda)^{-1}\|_2 = \max_i \frac{1}{|\mu - \lambda_i|} = \Big(\min_i |\mu - \lambda_i|\Big)^{-1}.
\]

Definition 9.3. Let λ, μ denote the eigenvalues of A and A + E respectively. The eigenvalues of A are called "well conditioned" when
\[
\min_i |\mu - \lambda_i| \le \|E\|_2.
\]
The eigenvalues of a real, symmetric matrix A are well conditioned in this sense.

Proof. In this case the diagonalizing matrix H can be chosen orthogonal, so ‖H‖₂ = ‖H⁻¹‖₂ = 1 and the bound follows from Theorem 9.3.
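A bound of this kind can be tested numerically. The sketch below is illustrative only (a random test matrix and perturbation, not from the text); it compares the worst-case eigenvalue movement with the bound cond₂(H)‖E‖₂ of Theorem 9.3.

% Numerical check of the perturbation bound of Theorem 9.3 (illustrative)
N = 50;
A = randn(N);                 % a test matrix, generically diagonalizable
E = 1e-8*randn(N);            % a small perturbation
[H,D] = eig(A);               % columns of H are eigenvectors, eigenvalues on diag(D)
lambda = diag(D);
mu = eig(A+E);
moved = 0;
for j = 1:N
    moved = max(moved, min(abs(mu(j) - lambda)));  % distance from mu_j to nearest lambda_i
end
[moved, cond(H)*norm(E)]      % the first number should not exceed the second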
Other results are known such as:

Theorem 9.4. Let A be real and symmetric. There is an ordering of the eigenvalues of A and A + E under which
\[
\max_i |\mu_i - \lambda_i| \le \|E\|_2,
\qquad
\sum_{i=1}^{N} |\mu_i - \lambda_i|^2 \le \|E\|_{\mathrm{Frobenius}}^2.
\tag{9.1}
\]

For more information see the book of Wilkinson:

J. Wilkinson, The Algebraic Eigenvalue Problem, Oxford Univ. Press, 1965.

9.4 The Power Method

The power method is used to find the dominant (meaning the largest in complex modulus) eigenvalue of a matrix A. It is especially appropriate when A is large and sparse, so that multiplying by A is cheap in both storage and in floating point operations. If a complex eigenvalue is sought, then the initial guess in the power method must also be complex. In this case the inner product of complex vectors uses the conjugate transpose:
\[
\langle x, y\rangle := x^* y := \bar{x}^{T} y = \sum_{i=1}^{N} \bar{x}_i\, y_i.
\]

Algorithm 9.1 (Power Method for Dominant Eigenvalue).
Given a matrix A, an initial vector x^0 ≠ 0, and a maximum number of iterations itmax,

for n=0:itmax
    x̃^{n+1} = A x^n
    x^{n+1} = x̃^{n+1}/‖x̃^{n+1}‖
    % estimate the eigenvalue by
    (∗) λ = (x^{n+1})* A x^{n+1}
    if converged, stop, end
end

Remark 9.2. The step (∗) in which the eigenvalue is recovered can be rewritten as
λn+1 = (x̃^{n+2})* x^{n+1}
since x̃^{n+2} = A x^{n+1}. Thus it can be computed without additional cost.
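A direct Matlab transcription of Algorithm 9.1 takes only a few lines. The function below (the name powermethod, the convergence test and the tolerance are illustrative choices of our own) is a sketch, not code from the text.

function [lambda,x] = powermethod(A,x0,itmax,tol)
% [lambda,x] = powermethod(A,x0,itmax,tol)
% Power method (Algorithm 9.1): approximates the dominant eigenvalue lambda
% and an eigenvector x, starting from a nonzero vector x0.
x = x0/norm(x0);
lambda = x'*A*x;
for n = 1:itmax
    xtilde = A*x;                  % x tilde = A x
    x = xtilde/norm(xtilde);       % normalize the new iterate
    lambda = x'*A*x;               % Rayleigh-quotient estimate of the eigenvalue
    if norm(A*x - lambda*x) < tol  % converged?
        break
    end
end

Normalizing at every step keeps the iterates at unit length, so ‖A^k x^0‖ never overflows or underflows during the iteration.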
9.4.1 Convergence of the power method

In this section we examine convergence of the power method for the case that the dominant eigenvalue is simple. First we note that the initial guess must have some component in the direction of the eigenvector of the dominant eigenvalue. We shall show that the power method converges rapidly when the dominant eigenvalue is well separated from the rest:
|λ1| ≫ |λj|, j = 2, . . . , N.
In order to illuminate the basic idea, we shall analyze its convergence under the additional assumption that A has N linearly independent eigenvectors and that the dominant eigenvalue is simple and real.

With x^0 ∈ R^N and eigenvectors φ1, . . . , φN of A (Aφj = λjφj) we can expand the initial guess in terms of the eigenvectors of A as follows:
\[
x^0 = c_1\varphi_1 + c_2\varphi_2 + \ldots + c_N\varphi_N.
\]
If the initial guess has some component in the first eigenspace then c1 ≠ 0. Then we calculate the normalized³ iterates x̃^1 = Ax̃^0, x̃^2 = Ax̃^1 = A²x̃^0, etc.:
\[
\begin{aligned}
\tilde{x}^1 &= A\tilde{x}^0 = c_1 A\varphi_1 + \ldots + c_N A\varphi_N = c_1\lambda_1\varphi_1 + c_2\lambda_2\varphi_2 + \ldots + c_N\lambda_N\varphi_N,\\
\tilde{x}^2 &= A\tilde{x}^1 = c_1\lambda_1 A\varphi_1 + \ldots + c_N\lambda_N A\varphi_N = c_1\lambda_1^2\varphi_1 + c_2\lambda_2^2\varphi_2 + \ldots + c_N\lambda_N^2\varphi_N,\\
&\;\;\vdots\\
\tilde{x}^k &= A\tilde{x}^{k-1} = A^{k}\tilde{x}^0 = c_1\lambda_1^k\varphi_1 + c_2\lambda_2^k\varphi_2 + \ldots + c_N\lambda_N^k\varphi_N.
\end{aligned}
\]
Since |λ1| > |λj| the largest contribution to ‖x̃^k‖ is the first term. Thus, normalize x̃^k by the size of the first term so that
\[
\frac{1}{\lambda_1^k}\,\tilde{x}^k
= c_1\varphi_1
+ c_2\Big(\frac{\lambda_2}{\lambda_1}\Big)^{k}\varphi_2
+ \ldots
+ c_N\Big(\frac{\lambda_N}{\lambda_1}\Big)^{k}\varphi_N.
\]

³The x̃ here are different from those in the algorithm because the normalization is different.
Each term except the first tends to 0 since |λ2/λ1| < 1. Thus,
\[
\frac{1}{\lambda_1^k}\,\tilde{x}^k = c_1\varphi_1 + (\text{terms that} \to 0 \text{ as } k \to \infty),
\]
or
\[
\tilde{x}^k \simeq c_1\lambda_1^k\varphi_1, \qquad \varphi_1 = \text{eigenvector of } \lambda_1,
\]
so Ax̃^k ≃ A(c1λ1^k φ1) = c1λ1^{k+1}φ1, or
\[
A\tilde{x}^k \simeq \lambda_1\,\tilde{x}^k,
\]
and so we have found λ1 and φ1 approximately.
Example 9.5. Let
\[
A = \begin{bmatrix} 2 & 4 \\ 3 & 13 \end{bmatrix}, \qquad x^0 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.
\]
Then
\[
\tilde{x}^1 = Ax^0 = \begin{bmatrix} 2 & 4 \\ 3 & 13 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \end{bmatrix},
\qquad
x^1 = \frac{\tilde{x}^1}{\|\tilde{x}^1\|} = \frac{[2,3]^t}{\sqrt{2^2+3^2}} = \begin{bmatrix} .5547 \\ .8321 \end{bmatrix},
\]
\[
\tilde{x}^2 = Ax^1 = \begin{bmatrix} 4.438 \\ 12.48 \end{bmatrix},
\qquad
x^2 = \frac{\tilde{x}^2}{\|\tilde{x}^2\|} = \ldots, \text{ and so on.}
\]
Exercise 9.4. Write a computer program to implement the power method, Algorithm 9.1. Regard the algorithm as converged when ‖Ax^{n+1} − λx^{n+1}‖ < ε, with ε = 10⁻⁴. Test your program by computing x̃^1, x^1 and x̃^2 in Example 9.5 above. What are the converged eigenvalue and eigenvector in this example? How many steps did it take?

9.4.2 Symmetric matrices

The Power Method converges twice as fast for symmetric matrices as for non-symmetric matrices because of some extra error cancellation that occurs due to the eigenvectors of symmetric matrices being orthogonal. To see this, suppose A = Aᵗ and calculate the (unnormalized) iterates
\[
\tilde{x}^{k+1} = A\tilde{x}^{k} \;(= \ldots = A^{k+1}x^0).
\]
Then the k-th approximation to λ is μk given by
\[
\mu_k = \frac{(\tilde{x}^k)^t A\tilde{x}^k}{(\tilde{x}^k)^t\tilde{x}^k}
      = \frac{(\tilde{x}^k)^t\tilde{x}^{k+1}}{(\tilde{x}^k)^t\tilde{x}^k}.
\]

If x^0 = c1φ1 + c2φ2 + . . . + cNφN then, as in the previous case,
\[
\tilde{x}^k = c_1\lambda_1^k\varphi_1 + \ldots + c_N\lambda_N^k\varphi_N,
\quad\text{and thus}\quad
\tilde{x}^{k+1} = c_1\lambda_1^{k+1}\varphi_1 + \ldots + c_N\lambda_N^{k+1}\varphi_N.
\]
In the symmetric case the eigenvectors are mutually orthogonal (and may be taken orthonormal):
\[
\varphi_i^{\,t}\varphi_j = 0, \quad i \neq j.
\]
Using orthogonality we calculate
\[
(\tilde{x}^k)^t\tilde{x}^{k+1}
= \big(c_1\lambda_1^k\varphi_1 + \ldots + c_N\lambda_N^k\varphi_N\big)^t
  \big(c_1\lambda_1^{k+1}\varphi_1 + \ldots + c_N\lambda_N^{k+1}\varphi_N\big)
= \ldots = c_1^2\lambda_1^{2k+1} + c_2^2\lambda_2^{2k+1} + \ldots + c_N^2\lambda_N^{2k+1}.
\]
Similarly
\[
(\tilde{x}^k)^t\tilde{x}^k = c_1^2\lambda_1^{2k} + \ldots + c_N^2\lambda_N^{2k},
\]
and we find
\[
\mu_k = \frac{c_1^2\lambda_1^{2k+1} + \ldots + c_N^2\lambda_N^{2k+1}}
             {c_1^2\lambda_1^{2k} + \ldots + c_N^2\lambda_N^{2k}}
      = \frac{c_1^2\lambda_1^{2k+1}}{c_1^2\lambda_1^{2k}}
        + O\!\left(\left|\frac{\lambda_2}{\lambda_1}\right|^{2k}\right)
      = \lambda_1 + O\!\left(\left|\frac{\lambda_2}{\lambda_1}\right|^{2k}\right),
\]
which is twice as fast as the non-symmetric case!

Exercise 9.5. Take the 2 × 2 matrix A given below. Find the eigenvalues of A. Take x^0 = (1, 2)ᵗ and do 2 steps of the power method. If it is continued, to which eigenvalue will it converge? Why?
\[
A = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}.
\]

9.5 Inverse Power, Shifts and Rayleigh Quotient Iteration

The idea behind variants of the power method is to replace A by a matrix


whose largest eigenvalue is the one sought, find that by the power method
for the modified matrix and then recover the sought eigenvalue of A.

9.5.1 The inverse power method


Although this may seem a paradox, all exact science is domi-
nated by the idea of approximation.
— Russell, Bertrand (1872–1970), in W. H. Auden and L. Kro-
nenberger (eds.) “The Viking Book of Aphorisms”, New York:
Viking Press, 1966.

The inverse power method computes the eigenvalue of A closest to the origin, that is, the eigenvalue of A of smallest modulus. The inverse power method is equivalent to the power method applied to A⁻¹ (since the eigenvalues of A⁻¹ are the reciprocals of those of A, the smallest eigenvalue of A corresponds to the largest eigenvalue of A⁻¹).

Algorithm 9.2 (Inverse power method). Given a matrix A, an initial vector x^0 ≠ 0, and a maximum number of iterations itmax,

for n=0:itmax
    (∗) Solve A x̃^{n+1} = x^n
    x^{n+1} = x̃^{n+1}/‖x̃^{n+1}‖
    if converged, break, end
end
% The converged eigenvalue is given by
μ = (x̃^{n+1})* x^n
λ = 1/μ

For large sparse matrices step (∗) is done by using some other iterative method for solving a linear system with coefficient matrix A. Thus the total cost of the inverse power method is:

(number of steps of the inverse power method) × (number of iterations per step required to solve A x̃^{n+1} = x^n).

This product can be large. Thus various ways to accelerate the inverse power method have been developed. Since the number of steps depends on the separation of the dominant eigenvalue from the other eigenvalues, most methods do this by using shifts to get further separation. If α is fixed, then the largest eigenvalue of (A − αI)⁻¹ is related to the eigenvalue of A closest to α, denoted λα, by
\[
\lambda_{\max}\big((A - \alpha I)^{-1}\big) = \frac{1}{\lambda_\alpha(A) - \alpha}.
\]
The inverse power method with shift finds the eigenvalue closest to α.

Algorithm 9.3 (Inverse power method with shifts). Given a matrix A, an initial vector x^0 ≠ 0, a shift α, and a maximum number of iterations itmax,

for n=0:itmax
    (∗) Solve (A − αI) x̃^{n+1} = x^n
    x^{n+1} = x̃^{n+1}/‖x̃^{n+1}‖
    if converged, break, end
end
% The converged eigenvalue is given by
μ = (x̃^{n+1})* x^n
λ = α + 1/μ
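A Matlab sketch of Algorithm 9.3 follows; it is illustrative only, and the function name invpowershift is our own. Since the shift is fixed, A − αI is factored once with lu and the factors are reused at every step, one common way to organize step (∗) for matrices of moderate size.

function [lambda,x] = invpowershift(A,x0,alpha,itmax,tol)
% [lambda,x] = invpowershift(A,x0,alpha,itmax,tol)
% Inverse power method with shift (Algorithm 9.3): approximates the
% eigenvalue of A closest to alpha and a corresponding eigenvector.
N = size(A,1);
[L,U,P] = lu(A - alpha*eye(N));   % factor the shifted matrix once
x = x0/norm(x0);
for n = 1:itmax
    xtilde = U\(L\(P*x));         % solve (A - alpha*I)*xtilde = x
    mu = xtilde'*x;               % estimate of 1/(lambda - alpha)
    x = xtilde/norm(xtilde);
    lambda = alpha + 1/mu;
    if norm(A*x - lambda*x) < tol % converged?
        break
    end
end

For large sparse A the lu factorization would be replaced by an iterative solver, as discussed above.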

9.5.2 Rayleigh Quotient Iteration

Fourier is a mathematical poem.


— Thomson, [Lord Kelvin] William (1824–1907)

The Power Method and the Inverse Power Method are related to (and
combine to form) Rayleigh Quotient Iteration. Rayleigh Quotient Iteration
finds very quickly the eigenvalue closest to the initial shift for symmetric
matrices. It is given by:

Algorithm 9.4 (Rayleigh Quotient Iteration). Given a matrix A, an initial vector x^0 ≠ 0, an initial eigenvalue estimate λ0, and a maximum number of iterations itmax,

for n=0:itmax
    (∗) Solve (A − λnI) x̃^{n+1} = x^n
    x^{n+1} = x̃^{n+1}/‖x̃^{n+1}‖
    λn+1 = (x^{n+1})ᵗ A x^{n+1}
    if converged, return, end
end

It can be shown that for symmetric matrices
\[
\|x^{n+1} - \varphi_1\| \le C\,\|x^{n} - \varphi_1\|^{3},
\]
i.e., the number of significant digits triples at each step in Rayleigh quotient iteration.
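The following Matlab sketch of Algorithm 9.4 is for illustration only (the function name rqi is our own choice); the initial shift lambda0 largely determines which eigenvalue is found, and a warning about near-singularity of A − λnI close to convergence is expected and harmless, as the remark below explains.

function [lambda,x] = rqi(A,x0,lambda0,itmax,tol)
% [lambda,x] = rqi(A,x0,lambda0,itmax,tol)
% Rayleigh quotient iteration (Algorithm 9.4) for a symmetric matrix A.
N = size(A,1);
x = x0/norm(x0);
lambda = lambda0;
for n = 1:itmax
    xtilde = (A - lambda*eye(N))\x;  % shifted solve; the shift changes every step
    x = xtilde/norm(xtilde);
    lambda = x'*A*x;                 % Rayleigh quotient of the new iterate
    if norm(A*x - lambda*x) < tol    % converged?
        break
    end
end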

Remark 9.3. The matrix A − λn I will become ill-conditioned as the itera-


tion converges and λn approaches an eigenvalue of A. This ill-conditioning
helps the iteration rather than hinders it because roundoff errors accumu-
late fastest in the direction of the eigenvector.

9.6 The QR Method

“But when earth had covered this generation also, Zeus the son of Cronos
made yet another, the fourth, upon the fruitful earth, which was nobler
and more righteous, a god-like race of hero-men who are called demi-gods,
the race before our own, throughout the boundless earth. Grim war and
dread battle destroyed a part of them, some in the land of Cadmus at
seven-gated Thebe when they fought for the flocks of Oedipus, and some,
when it had brought them in ships over the great sea gulf to Troy for
rich-haired Helen’s sake: there death’s end enshrouded a part of them.
But to the others father Zeus the son of Cronos gave a living and an abode
apart from men, and made them dwell at the ends of earth. And they live
untouched by sorrow in the islands of the blessed along the shore of deep
swirling Ocean, happy heroes for whom the grain-giving earth bears
honey-sweet fruit flourishing thrice a year, far from the deathless
gods. . . ”
— Hesiod, Works and Days

The QR algorithm is remarkable because if A is a small, possibly dense


matrix the algorithm gives a reliable calculation of all the eigenvalues of A.
The algorithm is based on the observation that the proof of existence of a
QR factorization is constructive. First we recall the theorem of existence.

Theorem 9.5. Let A be an N × N matrix. Then there exists

• a unitary matrix Q and


• an upper triangular matrix R

such that

A = QR.

Moreover, R can be constructed so that the diagonal entries satisfy Rii ≥ 0.


If A is invertible then there is a unique factorization with Rii ≥ 0.
Proof. Sketch of proof: Suppose A is invertible; then the columns of A span R^N. Let ai, qi denote the column vectors of A and Q respectively. With R = (rij) upper triangular, writing out the equation A = QR gives the following system:
\[
\begin{aligned}
a_1 &= r_{11} q_1,\\
a_2 &= r_{12} q_1 + r_{22} q_2,\\
&\;\;\vdots\\
a_N &= r_{1N} q_1 + r_{2N} q_2 + \cdots + r_{NN} q_N.
\end{aligned}
\]

Thus, in this form, the QR factorization takes a spanning set ai, i = 1, · · · , N and from it constructs an orthogonal set qi, i = 1, · · · , N with

span{a1, a2, · · · , ak} = span{q1, q2, · · · , qk} for every k.

Thus the entries in R are just the coefficients generated by the Gram–Schmidt process! This proves existence when A is invertible.

Remark 9.4. We remark that the actual calculation of the QR factor-


ization is done stably by using Householder transformations rather than
Gram–Schmidt.
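For illustration only, the classical Gram–Schmidt construction used in the proof can be written out in Matlab as below (the function name is our own). It is a sketch of the idea, not a robust implementation: in floating point arithmetic it can lose orthogonality, which is why, as just remarked, the factorization is computed with Householder transformations (Matlab's qr) in practice.

function [Q,R] = gramschmidt_qr(A)
% [Q,R] = gramschmidt_qr(A)
% Classical Gram-Schmidt QR factorization of an invertible matrix A.
% Mirrors the constructive proof above; not numerically robust.
N = size(A,1);
Q = zeros(N); R = zeros(N);
for k = 1:N
    v = A(:,k);
    for i = 1:k-1
        R(i,k) = Q(:,i)'*A(:,k);   % coefficient of a_k along q_i
        v = v - R(i,k)*Q(:,i);     % remove the component along q_i
    end
    R(k,k) = norm(v);
    Q(:,k) = v/R(k,k);             % normalize to get the next orthonormal column
end

A quick check is that norm(A - Q*R) and norm(Q'*Q - eye(N)) are both small for a well-conditioned A.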

The QR algorithm to calculate eigenvalues is built upon repeated construction of QR factorizations. Its cost is

cost of the QR algorithm ≈ 1 to 4 LU decompositions ≈ (4/3)N³ FLOPs.

Algorithm 9.5 (Simplified QR algorithm). Given a square matrix A1


and a maximum number of iterations itmax,

for n=1:itmax
Factor An = Qn Rn
Form An+1 = Rn Qn
if converged, return, end
end

This algorithm converges in an unusual sense:

• Ak is similar to A1 for every k, and


• (Ak )ij → 0 for i > j,
• diagonal(Ak ) → eigenvalues of A,
• (Ak )ij for i < j does not necessarily converge to anything.
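A Matlab sketch of Algorithm 9.5 is shown below for illustration (the function name is our own). It monitors the size of the strictly lower triangular part of An as a simple stopping test; for real matrices with complex eigenvalues some subdiagonal entries do not tend to zero, and practical codes use shifted, deflated variants instead.

function lam = qr_eigenvalues(A,itmax,tol)
% lam = qr_eigenvalues(A,itmax,tol)
% Simplified QR algorithm (Algorithm 9.5), without shifts.
for n = 1:itmax
    [Q,R] = qr(A);                    % factor A_n = Q_n R_n
    A = R*Q;                          % form A_{n+1} = R_n Q_n, similar to A_n
    if norm(tril(A,-1),'fro') < tol   % nearly upper triangular?
        break
    end
end
lam = diag(A);                        % approximate eigenvalues on the diagonal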

Various techniques are used to speed up convergence of QR such as


using shifts.

Exercise 9.6. Show that Ak is similar to A1 for every k.

Exercise 9.7. Show that if A is real and A = QR then so are Q and R.

Exercise 9.8. Show A = QDS for invertible A with Q unitary and D


diagonal and positive and S upper triangular with diagonal entries all 1.

Exercise 9.9. Show that if A = QR and A is real then so are Q and R.

Exercise 9.10. Find the QR factorization of
\[
\begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{bmatrix}.
\]

For more information see [Watkins (1982)].

Angling may be said to be so like mathematics that it can never


be fully learned.
— Walton, Izaak, The Compleat Angler, 1653.

Appendix A

An Omitted Proof

Whatever regrets may be, we have done our best.


— Sir Ernest Shackleton, January 9, 1909, 88°23′ South

The proof of Theorem 6.4 depends on the well-known Jordan canonical


form of the matrix.

Theorem A.1 (Jordan canonical form). Given an N × N matrix T, an invertible matrix C can be found so that
T = CJC⁻¹
where J is a block diagonal matrix
\[
J = \begin{bmatrix}
J_1 & & & \\
 & J_2 & & \\
 & & \ddots & \\
 & & & J_K
\end{bmatrix}
\tag{A.1}
\]
and each of the ni × ni diagonal blocks Ji has the form
\[
J_i = \begin{bmatrix}
\lambda_i & 1 & & & \\
 & \lambda_i & 1 & & \\
 & & \ddots & \ddots & \\
 & & & \lambda_i & 1 \\
 & & & & \lambda_i
\end{bmatrix}
\tag{A.2}
\]
where λi is an eigenvalue of T. The λi need not be distinct eigenvalues.

The proof of this theorem is beyond the scope of this text, but can be found in any text that goes beyond elementary linear algebra, such as the beautiful book of Herstein [Herstein (1964)].
Theorem 6.4 is restated here:

Theorem A.2. Given any N × N matrix T and any ε > 0 there exists a matrix norm ‖·‖ with ‖T‖ ≤ ρ(T) + ε.

Proof. Without loss of generality, assume ε < 1. Consider the matrix
\[
E_\varepsilon = \begin{bmatrix}
1 & & & & \\
 & 1/\varepsilon & & & \\
 & & 1/\varepsilon^{2} & & \\
 & & & \ddots & \\
 & & & & 1/\varepsilon^{N-1}
\end{bmatrix}
\]
and the product EεJEε⁻¹. The first block of this product can be seen to be
\[
\begin{bmatrix}
\lambda_1 & \varepsilon & & & \\
 & \lambda_1 & \varepsilon & & \\
 & & \ddots & \ddots & \\
 & & & \lambda_1 & \varepsilon \\
 & & & & \lambda_1
\end{bmatrix}
\]
and each of the other blocks is similar. It is clear that ‖EεJEε⁻¹‖∞ ≤ ρ(T) + ε. Defining the norm ‖T‖ = ‖EεJEε⁻¹‖∞ completes the proof.

Remark A.1. If it happens that each of the eigenvalues with |λi| = ρ(T) is simple, then each of the corresponding Jordan blocks is 1 × 1 and ‖T‖ = ρ(T).
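The effect of the scaling Eε is easy to see numerically. The following fragment is illustrative only (the block size, eigenvalue and ε are arbitrary choices): it builds one Jordan block, applies the similarity transformation by Eε, and evaluates the infinity norm.

% Scaling a Jordan block (illustrative)
N = 5; lambda = 0.9; epsilon = 0.01;
J = lambda*eye(N) + diag(ones(N-1,1),1);   % a single N x N Jordan block
E = diag((1/epsilon).^(0:N-1));            % E = diag(1, 1/eps, ..., 1/eps^(N-1))
norm(E*J/E, inf)                           % the superdiagonal 1's become epsilon,
                                           % so the norm is |lambda| + epsilon = 0.91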
Appendix B

Tutorial on Basic MATLAB Programming

. . . Descartes, a famous philosopher, author of the celebrated


dictum, Cogito ergo sum — whereby he was pleased to suppose he
demonstrated the reality of human existence. The dictum might be
improved, however, thus: Cogito cogito ergo cogito sum — “I think
that I think, therefore I think that I am”; as close an approach to
certainty as any philosopher has yet made.
— Ambrose Bierce, “The Devil’s Dictionary”

B.1 Objective

The purpose of this appendix is to introduce the reader to the basics of the
Matlab programming (or scripting) language. By “basics” is meant the
basic syntax of the language for arithmetical manipulations. The intent of
this introduction is twofold:

(1) Make the reader sufficiently familiar with Matlab that the pseudocode
used in the text is transparent.
(2) Provide the reader with sufficient syntactical detail to expand pseu-
docode used in the text into fully functional programs.

In addition, pointers to some of the very powerful Matlab functions


that implement the algorithms discussed in this book are given.
The Matlab language was chosen because, at the level of detail pre-
sented here, it is sufficiently similar to other languages such as C, C++,
Fortran, and Java, that knowledge of one can easily be transferred to the
others. Except for a short discussion of array syntax and efficiency, all

of the programming constructs discussed in this appendix can be simply1


translated into the other programming languages.
Matlab is available as a program of the same name from
The Mathworks, Natick, MA. The company operates a web site,
http://www.mathworks.com, from which purchasing information is avail-
able. Many institutions have Matlab installed on computers in computer
laboratories and often make Matlab licenses available for their members’
use on personally owned computers. At the level of detail described and
used here, a computer program called “GNU Octave”, conceived and writ-
ten by John W. Eaton (and many others), is freely available on the internet
at (http://www.gnu.org/software/octave/index.html). It can be used
to run Matlab programs without modification.

B.2 MATLAB Files

For our purposes, the best way to use Matlab is to use its scripting facility.
With sequences of Matlab commands contained in files, it is easy to see
what calculations were done to produce a certain result, and it is easy to
show that the correct values were used in producing a result. It is terribly
embarrassing to produce a very nice plot that you show to your teacher or
advisor only to discover later that you cannot reproduce it or anything like
it for similar conditions or parameters. When the commands are in clear
text files, with easily read, well-commented code, you have a very good idea
of how a particular result was obtained. And you will be able to reproduce
it and similar calculations as often as you please.
The Matlab comment character is a percent sign (%). That is, lines
starting with % are not read as Matlab commands and can contain any
text. Similarly, any text on a line following a % can contain textual com-
ments and not Matlab commands.
A Matlab script file is a text file with the extension .m. Matlab script
files should start off with comments that identify the author, the date, and
a brief description of the intent of the calculation that the file performs.
Matlab script files are invoked by typing their names without the .m at
the Matlab command line or by using their names inside another Matlab
file. Invoking the script causes the commands in the script to be executed,
in order.

¹For example, if the variable A represents a matrix A, its components Aij are represented by A(i,j) in Matlab and Fortran but by A[i][j] in C, C++ and Java.

Matlab function files are also text files with the extension .m, but the
first non-comment line must start with the word function and be of the
form

function output variable = function name (parameters)

This defining line is called the “signature” of the function. More than
one input parameter requires they be separated by commas. If a func-
tion has no input parameters, they, and the parentheses, can be omitted.
Similarly, a function need not have output variables. A function can have
several output variables, in which case they are separated by commas and
enclosed in brackets as

function [out1,out2,out3]=function name(in1,in2,in3,in4)

The name of the function must be the same as the file name. Comment
lines can appear either before or after the signature line, but not both, and
should include the following.

(1) The first line following the signature (or the first line of the file) should
repeat the signature (I often leave out the word “function”) to provide
a reminder of the usage of the function.
(2) Brief description of the mathematical task the function performs.
(3) Description of all the input parameters.
(4) Description of all the output parameters.

Part of the first of these lines is displayed in the “Current directory”


windowpane, and the lines themselves comprise the response to the Mat-
lab command help function name.
The key difference between function and script files is that

• Functions are intended to be used repetitively,


• Functions can accept parameters, and,
• Variables used inside a function are invisible outside the function.

This latter point is important: variables used inside a function (except


for output variables) are invisible after the function completes its tasks
while variables in script files remain in the workspace.
The easiest way to produce script or function files is to use the editor
packaged with the Matlab program. Alternatively, any text editor (e.g.,
emacs, notepad) can be used. A word processor such as Microsoft Word or

Wordpad is not appropriate because it embeds special formatting characters


in the file and Matlab cannot interpret them.
Because function files are intended to be used multiple times, it is a bad
idea to have them print or plot things. Imagine what happens if you have
a function that prints just one line of information that you think might be
useful, and you put it into a loop that is executed a thousand times. Do
you plan to read those lines?
Matlab commands are sometimes terminated with a semicolon (;) and
sometimes not. The difference is that the result of a calculation is printed
to the screen when there is no semicolon but no printing is done when
there is a semicolon. It is a good idea to put semicolons at the ends of all
calculational lines in a function file. When using pseudocode presented in
this book to generate Matlab functions or scripts, you should remember
to insert semicolons in order to minimize extraneous printing.

B.3 Variables, Values and Arithmetic

Values in Matlab are usually2 double precision numbers. When Matlab


prints values, however, it will round a number to about four digits to the
right of the decimal point, or less if appropriate. Values that are integers are
usually printed without a decimal point. Remember, however, that when
Matlab prints a number, it may not be telling you all it knows about that
number.
When Matlab prints values, it often uses a notation similar to scientific
notation, but written without the exponent. For example, Avogadro’s num-
ber is 6.022 · 10²³ in usual scientific notation, but Matlab would display this as 6.022e+23, where the e marks the power of 10. Similarly, Matlab would display the fraction 1/2048 as 4.8828e-04. You can change the number of digits
displayed with the format command. (See help format for details.)
Matlab uses variable names to represent data. A variable name rep-
resents a matrix containing complex double-precision data. Of course, if
you simply tell Matlab x=1, Matlab will understand that you mean a
1 × 1 matrix and it is smart enough to print x out without its decimal and
imaginary parts, but make no mistake: they are there. And x can just as
easily turn into a matrix.
A variable can represent some important value in a program, or it can
represent some sort of dummy or temporary value. Important quantities
²It is possible to have single precision numbers or integers or other formats, but this requires special declarations.

should be given names longer than a few letters, and the names should indi-
cate the meaning of the quantity. For example, if you were using Matlab
to generate a matrix containing a table of squares of numbers, you might
name the table, for example, tableOfSquares or table of squares.
Once you have used a variable name, it is bad practice to re-use it to
mean something else. It is sometimes necessary to do so, however, and the
statement

clear varOne varTwo

should be used to clear the two variables varOne and varTwo before they
are re-used. This same command is critical if you re-use a variable name
but intend it to have smaller dimensions.
Matlab has a few reserved names. You should not use these as variable
names in your files. If you do use such variables as i or pi, they will lose
their special meaning until you clear them. Reserved names include:

ans: The result of the previous calculation.


computer: The type of computer you are on.
eps: The smallest positive number ε that can be represented on the computer and that satisfies the expression 1 + ε > 1. Be warned that this usage is different from the use of eps in the text.
i, j: The imaginary unit (√−1). Using i or j as subscripts or loop indices when you are also using complex numbers can generate incorrect answers.
inf: Infinity (∞). This will be the result of dividing 1 by 0.
NaN: “Not a Number.” This will be the result of dividing 0 by 0, or inf
by inf, multiplying 0 by inf, etc.
pi: π
realmax, realmin: The largest and smallest real numbers that can be
represented on this computer.
version: The version of Matlab you are running. (The ver command
gives more detailed information.)

Arithmetic operations can be performed on variables. These operations


include the following. In each case, the printed value would be suppressed
if a semicolon were used.

Some Matlab operations


= Assignment x=4 causes variable x to have value 4.
+ Addition x+1 prints the value 5.
- Subtraction x-1 prints the value 3.
* Multiplication 2*x prints the value 8.
/ Division 6/x prints the value 1.5.
^ Exponentiation x^3 prints the value 64.
() Grouping (x+2)/2 prints the value 3.

Matlab has a vast number of mathematical functions. Matlab func-


tions are called using parentheses, as in log(5).

Exercise B.1. Start up Matlab or Octave and use it to answer the fol-
lowing questions.

(a) What are the values of the reserved variables pi, eps, realmax, and
realmin?
(b) Use the “format long” command to display pi in full precision and
“format short” to return Matlab to its default, short, display.
(c) Set the variable a=1, the variable b=1+eps, the variable c=2, and the
variable d=2+eps. What is the difference in the way that Matlab
displays these values?
(d) Do you think the values of a and b are different? Is the way that
Matlab formats these values consistent with your idea of whether
they are different or not?
(e) Do you think the values of c and d are different? Explain your answer.
(f) Choose a value and set the variable x to that value.
(g) What is the square of x? Its cube?
(h) Choose an angle θ and set the variable theta to its value (a number).
(i) What is sin θ? cos θ? Angles can be measured in degrees or radians.
Which of these has Matlab used?

B.4 Variables Are Matrices

Matlab treats all its variables as though they were matrices. Important
subclasses of matrices include row vectors (matrices with a single row and
possibly several columns) and column vectors (matrices with a single col-
umn and possibly several rows). One important thing to remember is that

you don’t have to declare the size of your variable; Matlab decides how big
the variable is when you try to put a value in it. The easiest way to define
a row vector is to list its values inside of square brackets, and separated by
spaces or commas:

rowVector = [ 0, 1, 3, 6, 10 ]

The easiest way to define a column vector is to list its values inside of square
brackets, separated by semicolons or line breaks.

columnVector1 = [ 0; 1; 3; 6; 10 ]
columnVector2 = [ 0
1
9
36
100 ]

(It is not necessary to line the entries up, but it makes it look nicer.) Note
that rowVector is not equal to columnVector1 even though each of their
components is the same.
Matlab has a special notation for generating a set of equally spaced
values, which can be useful for plotting and other tasks. The format is:

start : increment : finish

or

start : finish

in which case the increment is understood to be 1. Both of these expressions


result in row vectors. So we could define the even values from 10 to 20 by:

evens = 10 : 2 : 20

Sometimes, you’d prefer to specify the number of items in the list, rather
than their spacing. In that case, you can use the linspace function, which
has the form

linspace( firstValue, lastValue, numberOfValues )

in which case we could generate six even numbers with the command:

evens = linspace ( 10, 20, 6 )

or fifty evenly-spaced points in the interval [10,20] with



points = linspace ( 10, 20, 50 )

As a general rule, use the colon notation when the increment is an integer
or when you know what the increment is and use linspace when you know
the number of values but not the increment.
Another nice thing about Matlab vector variables is that they are
flexible. If you decide you want to add another entry to a vector, it’s very
easy to do so. To add the value 22 to the end of our evens vector:

evens = [ evens, 22 ]

and you could just as easily have inserted a value 8 before the other entries,
as well.
Even though the number of elements in a vector can change, Matlab
always knows how many there are. You can request this value at any time
by using the numel function. For instance,

numel ( evens )

should yield the value 7 (the 6 original values of 10, 12, ... 20, plus the value
22 tacked on later). In the case of matrices with more than one nontrivial
dimension, the numel function returns the product of the dimensions. The
numel of the empty vector is zero. The size function returns a vector
containing two values: the number of rows and the number of columns (or
the numbers along each of the dimensions for arrays with more than two
dimensions). To get the number of rows of a variable v, use size(v,1) and
to get the number of columns use size(v,2). For example, since evens is
a row vector, size( evens, 1)=1 and size( evens, 2)=7, one row and
seven columns.
To specify an individual entry of a vector, you need to use index no-
tation, which uses round parentheses enclosing the index of an entry. The
first element of an array has index 1 (as in Fortran, but not C and Java).
Thus, if you want to alter the third element of evens, you could say

evens(3) = 7

Exercise B.2. Start up Matlab or Octave and use it to do the following


tasks:

(a) Use the linspace function to create a row vector called meshPoints
containing exactly 500 values with values evenly spaced between -1 and
1. Do not print all 500 values!

(b) What expression will yield the value of the 55th element of meshPoints?
(c) Use the numel function to confirm the vector has length 500.
(d) Produce a plot of a sinusoid on the interval [−1, 1] using the command
plot(meshPoints,sin(2*pi*meshPoints))
In its very simplest form, the signature of the plot function is
plot(array of x values, array of y values)
The arrays, of course, need to have the same numbers of elements.
The plot function has more complex forms that give you considerable
control over the plot. Use doc plot for further documentation.

B.5 Matrix and Vector Operations

Matlab provides a large assembly of tools for matrix and vector manip-
ulation. The following exercise illuminates the use of these operations by
example.

Exercise B.3. Open up Matlab or Octave and use it to perform the


following tasks.

Define the following vectors and matrices:


rowVec1 = [ -1 -4 -9]
colVec1 = [ 2
9
8 ]
mat1 = [ 1 3 5
7 -9 2
4 6 8 ]

(a) You can multiply vectors by constants. Compute


colVec2 = (pi/4) * colVec1
(b) The cosine function can be applied to a vector to yield a vector of
cosines. Compute
colVec2 = cos( colVec2 )
Note that the values of colVec2 have been overwritten.
(c) You can add vectors and multiply by scalars. Compute
colVec3 = colVec1 + 2 * colVec2

(d) The sum of a row vector and a column vector makes no sense in the
context of this book. Old versions of Matlab flag attempts to add
row vectors to column vectors as errors, but current versions allow
it. The result is a matrix. In the context of this book, this matrix
result is always wrong and is a recurring source of programming errors.
Compute

wrongMatrix = colVec1 + rowVec1;

Look carefully at the result. When you are testing a program, be alert
for unexpected appearances of matrices and search for sums of row and
column vectors that might cause these appearances!
(e) You can do row-column matrix multiplication. Compute

colvec4 = mat1 * colVec1

(f) A single quote following a matrix or vector indicates a (Hermitian)


transpose.

mat1Transpose = mat1'
rowVec2 = colVec3'

Warning: The single quote means the Hermitian adjoint or complex-


conjugate transpose. If you want a true transpose applied to a complex
matrix you must use “.’”.
(g) Transposes allow the usual operations. You might find uᵗv a useful
expression to compute the dot (inner) product u · v (although there is
a dot function in Matlab).

mat2 = mat1 * mat1' % mat2 is symmetric
rowVec3 = rowVec1 * mat1
dotProduct = colVec3' * colVec1
euclideanNorm = sqrt(colVec2' * colVec2)

(h) Matrix operations such as determinant and trace are available, too.

determinant = det( mat1 )


traceOfMat1 = trace( mat1 )

(i) You can pick certain elements out of a vector, too. Use the following
command to find the smallest element in a vector rowVec1.

min(rowVec1)

(j) The min and max functions work along one dimension at a time. They
produce vectors when applied to matrices.

max(mat1)

(k) You can compose vector and matrix functions. For example, use the
following expression to compute the max norm of a vector.

max(abs(rowVec1))

(l) How would you find the single largest element of a matrix?
(m) As you know, a magic square is a matrix all of whose row sums, column
sums and the sums of the two diagonals are the same. (One diagonal of
a matrix goes from the top left to the bottom right, the other diagonal
goes from top right to bottom left.) Show by direct computation that
if the matrix A is given by

A=magic(100); % do not print all 10,000 entries.

Then it has 100 row sums (one for each row), 100 column sums (one
for each column) and two diagonal sums. These 202 sums should all
be exactly the same, and you could verify that they are the same by
printing them and “seeing” that they are the same. It is easy to miss
small differences among so many numbers, though. Instead, verify that
A is a magic square by constructing the 100 column sums (without
printing them) and computing the maximum and minimum values of
the column sums. Do the same for the 100 row sums, and compute the
two diagonal sums. Check that these six values are the same. If the
maximum and minimum values are the same, the flyswatter principle
says that all values are the same.
Hints:
• Use the Matlab min and max functions.
• Recall that sum applied to a matrix yields a row vector whose values
are the sums of the columns.
• The Matlab function diag extracts the diagonal of a matrix, and
the composition of functions
sum(diag(fliplr(A))) computes the sum of the other diagonal.
(n) Suppose we want a table of integers from 0 to 9, their squares and
cubes. We could start with

integers = 0 : 9

but now we’ll get an error when we try to multiply the entries of
integers by themselves.
squareIntegers = integers * integers
Realize that Matlab deals with vectors, and the default multiplication
operation with vectors is row-by-column multiplication. What we want
here is element-by-element multiplication, so we need to place a period
in front of the operator:
squareIntegers = integers .* integers
Now we can define cubeIntegers and fourthIntegers in a similar
way.
cubeIntegers = squareIntegers .* integers
fourthIntegers = squareIntegers .* squareIntegers
Finally, we would like to print them out as a table. integers,
squareIntegers, etc. are row vectors, so make a matrix whose columns
consist of these vectors and allow Matlab to print out the whole ma-
trix at once.
tableOfPowers=[integers', squareIntegers', ...
cubeIntegers', fourthIntegers']
(The “. . . ” tells Matlab that the command continues on the next
line.)
(o) Compute the squares of the values in integers alternatively using the
exponentiation operator as:
sqIntegers = integers .^ 2
and check that the two calculations agree with the command
norm(sqIntegers-squareIntegers)
that should result in zero.
(p) You can add constants to vectors and matrices. Compute
squaresPlus1=squareIntegers+1;
(q) Watch out when you use vectors. The multiplication, division and
exponentiation operators all have two possible forms, depending on
whether you want to operate on the arrays, or on the elements in the
arrays. In all these cases, you need to use the period notation to force

elementwise operations. Fortunately, as you have seen above, using


multiplication or exponentiation without the dot will often produce an
error. The same cannot be said of division. Compute

squareIntegers./squaresPlus1

and also

squareIntegers/squaresPlus1

This latter value uses the Moore–Penrose pseudo-inverse and is almost


never what you intend. You have been warned! Remark: Addition,
subtraction, and division or multiplication by a scalar never require the
dot in front of the operator, although you will get the correct result if
you use one.
(r) The index notation can also be used to refer to a subset of elements
of the array. With the start:increment:finish notation, we can refer to
a range of indices. Smaller vectors and matrices can be constructed by leaving out some elements of larger ones. For example, submatrices can be constructed from tableOfPowers.
(The end function in Matlab means the last value of that dimension.)

tableOfCubes = tableOfPowers(:,[1,3])
tableOfOddCubes = tableOfPowers(2:2:end,[1,3])
tableOfEvenFourths = tableOfPowers(1:2:end,1:3:4)

(s) You have already seen the Matlab function magic(n). Use it to con-
struct a 10 × 10 matrix.

A = magic(10)

What commands would be needed to generate the four 5×5 matrices in


the upper left quarter, the upper right quarter, the lower left quarter,
and the lower right quarter of A?

Repeated Warning: Although multiplication of vectors is illegal without


the dot, division of vectors is legal! It will be interpreted in terms of the
Moore–Penrose pseudo-inverse. Beware!

B.6 Flow Control

It is critical to be able to ask questions and to perform repetitive calcu-


lations in m-files. These topics are examples of “flow control” constructs
in programming languages. Matlab provides two basic looping (repeti-
tion) constructs: for and while, and the if construct for asking ques-
tions. These statements each surround several Matlab statements with
for, while or if at the top and end at the bottom.

Remark B.1. It is an excellent idea to indent the statements between the


for, while, or if lines and the end line. This indentation strategy makes
code immensely more readable. Code that is hard to read is hard to debug,
and debugging is hard enough as it is.

The syntax of a for loop is


for control-variable=start : increment : end
Matlab statement . . .
...
end

The syntax of a while loop is


Matlab statement initializing a control variable
while logical condition involving the control variable
Matlab statement . . .
...
Matlab statement changing the control variable
end

The syntax of a simple if statement is


if logical condition
Matlab statement . . .
...
end

The syntax of a compound if statement is



if logical condition
Matlab statement . . .
...
elseif logical condition
...
else
...
end
Note that elseif is one word! Using two words else if changes the
statement into two nested if statements with possibly a very different
meaning, and a different number of end statements.

Exercise B.4. The “max” or “sup” or “infinity” norm of a vector is given


as the maximum of the absolute values of the components of the vector.
Suppose v = {vn}n=1,...,N is a vector in R^N; then the infinity norm is given as
\[
\|v\|_\infty = \max_{n=1,\ldots,N} |v_n|. \tag{B.1}
\]

If v is a Matlab vector, then the Matlab function numel gives its number
of elements, and the following code will compute the infinity norm. Note
how indentation helps make the code understandable. (Matlab already
has a norm function to compute norms, but this is how it could be done.)

% find the infinity norm of a vector v

N = numel(v);
nrm = abs(v(1));
for n=2:N
if abs(v(n)) > nrm
nrm=abs(v(n)); % largest value up to now
end
end
nrm % no semicolon: value is printed

(a) Define a vector as


v=[ -5 2 0 6 8 -1 -7 -10 -10];
(b) How many elements does v have? Does that agree with the result of
the numel function?
(c) Copy the above code into the Matlab command window and execute
it.

(d) What is the first value that nrm takes on? (5)
(e) How many times is the statement with the comment “largest value
up to now” executed? (3)
(f) What are all the values taken by the variable nrm? (5,6,8,10)
(g) What is the final value of nrm? (10)

B.7 Script and Function Files

If you have to type everything at the command line, you will not get very
far. You need some sort of scripting capability to save the trouble of typing,
to make editing easier, and to provide a record of what you have done. You
also need the capability of making functions or your scripts will become too
long to understand. In the next exercise, you will write a script file.

Exercise B.5.

(a) Copy the code given above for the infinity norm into a file
named infnrm.m. Recall you can get an editor window from the
File→New→M-file menu or from the edit command in the command
windowpane. Don’t forget to save the file.
(b) Redefine the vector
v = [-35 -20 38 49 4 -42 -9 0 -44 -34];
(c) Execute the script m-file you just created by typing just its name
(infnrm) without the .m extension in the command windowpane. What
is the infinity norm of this vector? (49)
(d) The usual Euclidean or 2-norm is defined as
\[
\|v\|_2 = \sqrt{\sum_{n=1}^{N} v_n^2}. \tag{B.2}
\]

Copy the following Matlab code to compute the 2-norm into a file
named twonrm.m.
% find the two norm of a vector v
% your name and the date

N = numel(v);
nrm = v(1)^2;
for n=2:N
nrm = nrm + v(n)^2;

end
nrm=sqrt(nrm) % no semicolon: value is printed

(e) Using the same vector v, execute the script twonrm. What are the first four values the variable nrm takes on inside the loop? (1625, 3069, 5470, 5486) What is its final value? (102.0931)
(f) Look carefully at the mathematical expression (B.2) and the Matlab
code in twonrm.m. The way one translates a mathematical summation into Matlab code is to follow these steps:

(i) Set the initial value of the sum variable (nrm in this case) to zero
or to the first term.
(ii) Put an expression adding subsequent terms, one at a time, inside
a loop. In this case it is of the form nrm=nrm+something.

Script files are very convenient, but they have drawbacks. For example,
if you had two different vectors, v and w, for which you wanted norms, it
would be inconvenient to use infnrm or twonrm. It would be especially
inconvenient if you wanted to get, for example, v2 + 1/w∞ . This in-
convenience is avoided by using function m-files. Function m-files define
your own functions that can be used just like Matlab functions such as
sin(x), etc. In the following exercise, you will write two function m-files.

Exercise B.6.

(a) Copy the file infnrm.m to a file named infnorm.m. (Look carefully, the
names are different! You can use “save as” or cut-and-paste to do the
copy.) Add the following lines to the beginning of the file:

function nrm = infnorm(v)


% nrm = infnorm(v)
% v is a vector
% nrm is its infinity norm

(b) The first line of a function m-file is called the “signature” of the function.
The first comment line repeats the signature in order to explain the
“usage” of the function. Subsequent comments explain the parameters
(such as v) and the output (such as nrm) and, if possible, briefly explain
the methods used. The function name and the file name must agree.
(c) Place a semicolon on the last line of the file so that nothing will normally
be printed by the function.

(d) Use the Matlab “help” command:

help infnorm

This command will repeat the first lines of comments (up to a blank
line or a line of code) and provides a quick way to refresh your memory
of how the function is to be called and what it does.
(e) Invoke the function in the command windowpane by typing

infnorm(v)

(f) Repeat the above steps to define a function named twonorm.m from the
code in twonrm.m. Be sure to put comments in.
(g) Define two vectors

a = [ -43 -37 24 27 37 ];
b = [ -5 -4 -29 -29 30 ];

and find the value of infinity norm of a and the two norm of b with the
commands

aInfinity = infnorm(a)
bTwo = twonorm(b)

Note that you no longer need to use the letter v to denote the vector,
and it is easy to manipulate the values of the norms.

B.8 MATLAB Linear Algebra Functionality

Matlab was originally conceived as a “matrix laboratory” and has consid-


erable linear algebra functionality available. This section presents a small
sample of those Matlab functions that implement algorithms similar to
those discussed in this book. Detailed instructions on use and implemen-
tation can be found in the Matlab “help” facility that is part of the dis-
tribution package or on the Mathworks web site.

B.8.1 Solving matrix systems in MATLAB


Matlab provides a collection of direct solvers for matrix systems rolled into
a single command: "\". If a Matlab variable A is an N×N matrix and b is an N×1 vector, then the solution of the system Ax = b is computed in Matlab with the command x=A\b. Although this looks unusual to a person used to mathematical notation, it is equivalent to x = A⁻¹b and it respects the

order of matrix operations. There is a named function, mldivide (“matrix


left divide") that is equivalent to the symbolic operation: x=A\b is identical to x=mldivide(A,b). In informal speech, this capability is often
called simply “backslash”.
Warning:
The command mldivide is more general than ordinary matrix inversion.
If the matrix A is not N×N or if b is not an N-vector, mldivide provides a
least-squares best approximate solution, and no warning message is given.
This is equivalent to the Moore–Penrose pseudoinverse. Care must be taken
that typing errors do not lead to incorrect numerical results.
The underlying numerical methods for the mldivide command cur-
rently come from umfpack [Davis (2004)]. They work for both dense and
sparse matrices and are among the most efficient methods known.
In addition to mldivide (backslash), Matlab provides implementa-
tions of several iterative solution algorithms, only two of which are men-
tioned here.

pcg uses the CG and preconditioned CG methods.


gmres uses the generalized minimum residual method.
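As a brief illustration of these solvers (the matrix below is just a convenient sparse test case, not an example from the text), the same symmetric positive definite system can be solved directly and iteratively:

% Solve a small SPD system directly and iteratively (illustrative)
N = 100;
A = gallery('tridiag',N);        % sparse SPD tridiagonal test matrix
b = ones(N,1);
x1 = A\b;                        % direct solve ("backslash")
x2 = pcg(A,b,1e-10,500);         % conjugate gradient iteration
norm(x1-x2)/norm(x1)             % the two solutions agree to the requested tolerance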

B.8.2 Condition number of a matrix


Matlab has three functions to find the condition number of a matrix, using
three different methods.

(1) The function cond computes the condition number of a matrix as pre-
sented in Definition 4.7.
(2) The function condest is an estimate of the 1-norm condition number.
(3) The function rcond is an estimate of the reciprocal of the condition
number.

B.8.3 Matrix factorizations


Matlab has functions to compute several standard matrix factorizations,
as well as “incomplete” factorizations that are useful as preconditioners for
iterative methods such as conjugate gradients.

chol computes the Cholesky factorization, an LLt factorization for sym-


metric positive definite matrices.
ichol computes the incomplete Cholesky factorization.

lu computes the LU factorization as discussed in Section 3.6.


ilu computes the incomplete LU factorization.
qr computes the QR factorization as discussed in Section 9.6.
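The fragment below sketches how these factorizations are typically used; the test matrix is an arbitrary illustrative choice, not one from the text.

% Using the factorizations (illustrative)
N = 100;
A = gallery('tridiag',N);         % sparse SPD test matrix
b = ones(N,1);
R = chol(A);                      % Cholesky factor: A = R'*R with R upper triangular
x = R\(R'\b);                     % two triangular solves give the solution of A*x = b
L = ichol(A);                     % incomplete Cholesky factor of a sparse matrix ...
x2 = pcg(A,b,1e-10,500,L,L');     % ... used here as a preconditioner for pcg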

B.8.4 Eigenvalues and singular values


eig computes all the eigenvalues and eigenvectors of a matrix, as discussed
in Chapter 9. It can handle the “generalized eigenvalue” problem also.
It is primarily used for relatively small dense matrices.
eigs computes some of the largest or smallest eigenvalues and eigenvectors,
or those nearest a shift, σ. It is most appropriate for large, sparse
matrices.
svd computes the singular value decomposition of a matrix (Definition 2.5).
svds computes some of the largest singular values or those nearest a shift.
It is most appropriate for large, sparse matrices.
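A small illustration follows (the matrix size and shift are arbitrary choices):

% Dense and sparse eigenvalue calls (illustrative)
A = gallery('tridiag',500);      % sparse SPD test matrix
lamAll  = eig(full(A));          % all eigenvalues of the dense copy
lamBig  = eigs(A,3);             % the 3 eigenvalues of largest magnitude
lamNear = eigs(A,3,1.0);         % the 3 eigenvalues nearest the shift sigma = 1.0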

B.9 Debugging

A programming error in a computer program is called a “bug”. It is com-


monly believed that the use of the term “bug” in this way dates to a prob-
lem in a computer program that Rear Admiral Grace Hopper, [Patterson
(2011)], was working with. It turned out that a moth had become trapped
in a relay in the early computer, causing it to fail. This story is true and
pictures of the deceased moth taped into her notebook can be found on the
internet. The term “bug” in reference to errors in mechanical devices had
been in use for many years at the time, but the story is so compelling that
people still believe it was the source of the term.
Finding and eliminating bugs in computer programs is called “debug-
ging”. Debugging is one of the most difficult, time consuming and least
rewarding activities you are likely to engage in. Software engineers teach
that absolutely any programming habit you develop that reduces the likeli-
hood of creating a bug is worth the trouble in saved debugging time. Among
these habits are:

• Indent your code inside loops and if statements;
• Use long, descriptive variable names;
• Write shorter functions with fewer branches; and,
• Never reuse the same variable for two different quantities.

You are urged to adopt these and any other practices that you find help
you avoid bugs.

One of the most powerful debugging tools a programmer has is a


“source-level debugger”, or just “debugger”. Matlab, like all other mod-
ern programming environments, includes such a debugger, integrated into
its window environment. This tool can be used to follow the execution of a
Matlab function or script line by line, by which you can understand how
the code works, thereby helping to find errors. Matlab provides an ex-
cellent tutorial on its debugger. Search the documentation for “Debugging
Process and Features”. If you are using another programming language,
you should learn to use an appropriate debugger: the time spent learning
it will be paid back manyfold as you use it.
It is beyond the scope of this book to provide tutorials on the various
debuggers available for various languages. It is true, however, that there is
a core functionality that all debuggers share. Some of those core functions
are listed below, using Matlab terminology. Other debuggers may have
other terminology for similar functions.

Values All debuggers allow you to query the current value of variables in
the current function. In Matlab and in several other debuggers, this
can be accomplished by placing the cursor over the variable and holding
it stationary.
Step Execute one line of source code from the current location. If the line
is a function call, complete the function call and continue in the current
function.
Step in If the next line of source code is a function call, step into that
function, so that the first line of the function is the line that is displayed.
You would normally use this for functions you suspect contribute to the
bug but not for Matlab functions or functions you are confident are
correct.
Breakpoints It is usually inconvenient to follow a large program from its
beginning until the results of a bug become apparent. Instead, you set
“breakpoints”, which are places in the code that cause the program to
stop and display source code along with values of variables. If you find
a program stopping in some function, you can set a breakpoint near the
beginning of that function and then track execution from that point on.
Conditional breakpoints Matlab provides for breakpoints based on
conditions. Numerical programs sometimes fail because the result of
some calculation is unreasonable or unexpected, for example when the
forbidden values inf or NaN are generated unexpectedly. (Division by
zero results in a special illegal value denoted inf; the result of 0/0
and most arithmetic performed on inf is a different illegal value
denoted NaN, for “Not a Number”.) Setting a breakpoint that will be
activated as soon as such a value is generated, no matter what line of
code is involved, may expose the bug. It is also possible to set a
breakpoint based on a condition such as x becoming equal to 1.
Continue Continue from the current line until the next breakpoint, or
until it loops back to this breakpoint.
Call stack Most programs call many functions and often call the same
function from different places. If, for example, your debugger shows
that the program has just computed inf inside log(x) with x=0, you
need to know where the call to log(x) occurred. The call stack is the
list of function calls culminating with the current one.
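The same features are also available from the command line through the family of db commands; a brief, illustrative sketch follows (the function name myfun, the line number, and the variable k are placeholders, not references to any code above):
dbstop in myfun at 25            % set a breakpoint at line 25 of myfun.m
dbstop in myfun at 25 if k==1    % a conditional breakpoint
dbstop if naninf                 % stop as soon as an inf or NaN is produced
dbstep                           % execute one line (Step)
dbstep in                        % step into a function call (Step in)
dbcont                           % continue to the next breakpoint
dbstack                          % display the call stack
dbquit                           % leave debug mode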

Finally, one remarkably effective strategy to use with a debugger is to
examine the source code, querying the current values of relevant variables.
Then it is possible to predict the effect of the next line of code. Stepping the
debugger one line will confirm your prediction or surprise you. If it surprises
you, you probably have found a bug. If not, go on to the following line.

B.10 Execution Speed

The remarks in this section are specific to Matlab and, to some extent,
Octave. These remarks cannot be generalized to languages such as C, C++
and Fortran, although Fortran shares the array notation and use of the
colon with Matlab.
It is sometimes possible to substantially reduce execution times for some
Matlab code by reformulating it in a mathematically equivalent manner
or by taking advantage of Matlab’s array notation. In this section, a few
strategies are presented for speeding up programs similar to the pseudocode
examples presented in this book.
The simplest timing tools in Matlab are the tic and toc commands.
These commands are used by calling tic just before the segment of code or
function that is being timed, and toc just after the code is completed. The
toc call results in the elapsed time since the tic call being printed. Care
must be taken to place them inside a script or function file or on the same
line as the code to be timed, or else it will be your typing speed that is
measured. A second point to remember is that the first time a function is
called it must be read from disk, a slow process. If you plan to measure the
speed of a function, you should do it twice. You will find that the second
value is much more reliable (and often much smaller) than the first.
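As a minimal sketch (the test problem is arbitrary and the times you see will differ):
A = rand(1000,1000); b = rand(1000,1);
tic; x = A\b; toc   % first timing may include one-time startup costs
tic; x = A\b; toc   % the second timing is usually the more reliable one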

B.10.1 Initializing vectors and matrices in MATLAB


Matlab vectors are not fixed in length, but can grow dynamically. They
do not shrink. The first time Matlab encounters a vector, it allocates
some amount of storage. As soon as Matlab encounters an index larger
than it has already allocated, it stops, allocates a new, longer, vector and
(if necessary) copies all old information into the new vector. This operation
involves calls to the operating system for the allocation and then (possibly)
a copy. All this work can take a surprising amount of time. Passing through
an array in reverse direction can avoid some of this work.
For example, on a 2012-era computer running Kubuntu Linux, the fol-
lowing command
tic; for i=1:2000; for j=1:2000; G(i,j)=i+j;end;end;toc
takes about 4.65 seconds. Executing the command a second time, so G has
already been allocated, takes only 0.37 seconds. Similarly, executing the
command
tic; for i=2000:-1:1; for j=2000:-1:1; G(i,j)=i+j;end;end;toc
(passing through the array in reverse order) takes 0.40 seconds. (The dif-
ference between 0.37 seconds and 0.40 seconds is not significant.)
In many computer languages, you are required to declare the size of an
array before it is used. Such a declaration is not required in Matlab, but
a common strategy in Matlab is to initialize a matrix to zero using the
zeros command. It turns out that such a strategy carries a substantial
advantage in computer time:
tic;G=zeros(2000,2000);
for i=1:2000;for j=1:2000;G(i,j)=i+j;end;end;toc
(typed on a single line) takes only 0.08 seconds!

B.10.2 Array notation and efficiency in MATLAB


Matlab allows arithmetic and function evaluation to be done on entire
matrices at once instead of using loops. Addition, subtraction, and (row-
column) multiplication can be represented in the usual manner. In addition,
componentwise multiplication, division, exponentiation and function calls
can also be done on matrices. These are summarized in Table B.1.
Table B.1 Selected Matlab array operations.
(A)ij = aij for 1 ≤ i ≤ NA and 1 ≤ j ≤ MA, similarly for B and C.

Operation    Interpretation                      Restrictions

C=A+B        cij = aij + bij                     NA = NB = NC, MA = MB = MC
C=A-B        cij = aij − bij                     NA = NB = NC, MA = MB = MC
C=A*B        cij = Σ_{k=1}^{MA} aik bkj          NC = NA, MC = MB, MA = NB
C=A^n        C=A*A*A*· · ·*A (n factors)         A is square
C=A.*B       cij = aij ∗ bij                     NA = NB = NC, MA = MB = MC
C=A./B       cij = aij /bij                      NA = NB = NC, MA = MB = MC
C=A.^n       cij = (aij)^n                       NA = NC, MA = MC
C=n.^A       cij = n^(aij)                       NA = NC, MA = MC
C=f(A)       cij = f(aij)                        NA = NC, MA = MC, f a function

Warning: A careful examination of Table B.1 shows that the expression
exp(A) is not the same as the matrix exponential e^A (= Σ_{n=0}^{∞} A^n/n!).
The array operations described in Table B.1 are generally faster than the
equivalent loops. When the matrices are large, the speed improvement for
using array operations can be dramatic. On a 2012-era computer running
Kubuntu Linux, a loop for adding two 4000×4000 matrices took 41 seconds,
but the same matrix operation took less than 0.06 seconds!
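The comparison meant here can be sketched as follows (the matrices are random examples; timings will vary with the machine and the Matlab version):
N = 4000; A = rand(N,N); B = rand(N,N);
tic                          % loop version of C = A + B
C = zeros(N,N);
for i = 1:N
    for j = 1:N
        C(i,j) = A(i,j) + B(i,j);
    end
end
toc
tic; D = A + B; toc          % built-in array addition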
It is often possible to obtain a speedup simply by replacing a loop with
equivalent array operations, even when the operations are not built-in Mat-
lab operations. For example, consider the following loop, for N=4000*4000.

g=zeros(N,1);
for i=1:N
g(i)=sin(i);
end

This loop takes about 3.93 seconds on the computer mentioned above. A
speedup of almost a factor of two is available with the simple trick of cre-
ating a vector, i=(1:N), consisting of the consecutive integers from 1 to N,
as in the following code.
g=zeros(N,1);
i=1:N; % i is a vector
g(i)=sin(i); % componentwise application of sin
This code executes in 2.04 seconds. Once the loop has been eliminated, the
code can be streamlined to pick up another 10%.
g=sin(1:N);
and this code executes in 1.82 seconds, for a total improvement of more
than a factor of two.
Sometimes dramatic speed improvements are available through careful
consideration of what the code is doing. The MPP2D matrix is available in
Matlab through the gallery function. This function provides a “rogues
gallery” of matrices that can be used for testing algorithms. Recall that the
MPP2D matrix is tridiagonal, and hence quite sparse. An LU factorization
is available using Matlab’s lu function, and, given a right hand side vector
b, the forward and backward substitutions can be done using mldivide (the
“\” operator).
Consider the following code:
N=4000;
A=gallery(’tridiag’,N);
[L,U]=lu(A);
b=ones(N,1);
tic;x=U\L\b;toc
tic;y=U\(L\b);toc
On the same computer mentioned above, the computation of x takes 1.06
seconds, dramatically slower than the 0.0005 seconds needed to compute
y. The reason is that U\L\b means the same as (U\L)\b. In this case,
both U and L are bidiagonal, but the matrix (U\L) has nonzeros everywhere
above the diagonal and also on the lower subdiagonal. It has many nonzero
entries and takes a long time to compute. Once it is computed, a further
solve with the right hand side b produces the vector x. In contrast,
U\(L\b) first computes the vector L\b by a simple bidiagonal forward
substitution and then computes the vector y by another bidiagonal back
substitution.

Bibliography

Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J.,
Eijkhout, V., Pozo, R., Romine, C., and van der Vorst, H. (1994). Templates
for the Solution of Linear Systems: Building Blocks for Iterative Methods
(SIAM, Philadelphia, PA).
Davis, T. (2004). Algorithm 832: UMFPACK v4.3 — an unsymmetric-pattern mul-
tifrontal method, ACM Transactions on Mathematical Software (TOMS)
30, 2, pp. 196–199.
Faber, V. and Manteuffel, T. (1984). Necessary and sufficient conditions for the
existence of a conjugate gradient method, SIAM Journal on Numerical
Analysis 21, 2, pp. 352–362.
Gilbert, J. R., Moler, C., and Schreiber, R. (1992). Sparse matrices in MATLAB:
Design and implementation, SIAM Journal on Matrix Analysis and Appli-
cations 13, 1, pp. 333–356.
Hageman, L. A. and Young, D. M. (1981). Applied iterative methods (Academic
Press, New York), ISBN 0123133408; 9780123133403.
Herstein, I. N. (1964). Topics in algebra, 1st edn. (Blaisdell Pub. Co, New York,
N.Y).
Patterson, M. R. (2011). Grace Murray Hopper, Rear Admiral, United States Navy,
http://www.arlingtoncemetery.net/ghopper.htm.
Voevodin, V. V. (1983). The question of non-self-adjoint extension of the con-
jugate gradients method is closed, USSR Computational Mathematics and
Mathematical Physics 23, 2, pp. 143–144.
von Neumann, J. and Goldstine, H. H. (1947). Numerical inverting of matrices of
high order, Bulletin of the American Mathematical Society 53, 11, pp. 1021–
1100.
Watkins, D. S. (1982). Understanding the QR algorithm, SIAM Review 24, 4,
pp. 427–440.


Index

absolute error, 10 descent method, 162, 163, 180


ADI, 154, 155 difference approximations, 93
alternating direction implicit, 155 difference molecule, 102
augmented matrix, 38 divergence theorem, 90
Douglas–Rachford, 153
backsubstitution, 33, 36, 40–42, 60
backward error, 79 eigenspace, 17
backward error analysis, 10 eigenvalue, 17, 29, 72, 81, 82, 105,
banded matrices, 48, 52 110, 127, 252
biCG, 205 eigenvector, 17, 105, 175
biconjugate gradient, 205 elimination, 33, 34
block Jacobi, 167 error, 9, 23, 74, 76, 123, 198
buckling, 212
FENLA, 24, 61
CG, 180, 186, 198 first order Richardson, 116, 123
CGNE, 205 fixed point, 121
CGNR, 205 fixed point iteration, 121
CGS, 205 FLOPS, 42, 49, 51, 52, 59, 64, 112,
Chebychev, 186 113, 164, 228
Chebychev polynomials, 200 FOR, 116, 120, 123, 129, 134, 135,
commute, 16 140, 173
compressed row storage, 146
computational complexity, 41, 42 Gauss–Seidel, 136, 137
cond(A), 80, 81, 83, 85, 87, 88 Gaussian elimination, 38–40, 42, 47,
condition number, 74, 76, 81, 133, 251 57, 59
conjugate gradient, 181 GCN, 182, 204
conjugate gradient squared, 205 generalized minimum residual, 205
convection diffusion, 98, 176 Gershgorin, 217
convergence, 62, 120, 125, 126, 128, GMRES, 182, 205
133, 135, 140, 153, 167, 184, 222 Gram matrix, 189
curse of dimensionality, 110 Gram–Schmidt, 191, 196


Gramian, 189 Peclet number, 99, 177


percent error, 10
Hilbert matrix, 77 permutation, 56
permutation vector, 57
ill conditioned, 31, 76, 77, 86 perturbation lemma, 83
inner products, 66 pivot, 48
inverse power, 225 pivoting, 34, 35, 43–47, 57, 59
iterative improvement, 62, 64 P LU , 57–59
Poisson problem, 89
Jacobi method, 117, 119, 120, 122, positive definite, 157
137 power method, 221, 223
Jordan, 231 preconditioning, 121, 164, 202
pseudocode, 36–38
Krylov, 185, 192
QR, 227, 228
Laplacian, 102, 106
LU , 52–54, 56, 59, 62–64 Rayleigh quotient, 216, 226
red-black, 167
machine epsilon, 5 regular splitting, 122, 164
matrix norm, 69
relative error, 10
method of lines, 92
relaxation step, 139
min-max problem, 130
residual, 23, 74–76, 123
minimization, 169
residual-update form, 122
model Poisson problem, 93, 107, 110,
roundoff errors, 10
119, 120, 136
Moore–Penrose pseudoinverse, 245,
second order Richardson, 143, 144
251
self-adjointness, 68
MPP, 89, 93, 100, 105, 106, 110, 120,
similar matrices, 127
140, 144, 166, 169, 170, 183, 198,
203 singular values, 22, 71, 252
skew symmetric, 159
norm, 65, 66, 69, 73 skew-symmetric, 157
normal equations, 171, 206 SOR, 141, 142
SPD, 68, 157, 158, 160, 161, 180
optimization, 160 spectral condition number, 85, 130
orthogonal, 67, 68, 72, 77 spectral localization, 81
orthogonal basis, 189 spectral mapping theorem, 83, 128
orthogonalization of moments, 187, spectral radius, 126
192–196 spectrum, 217
orthonormal, 21, 67, 68 splitting, 149
over-relaxation, 140 stationary iterative method, 121
steepest descent, 162, 172, 173, 175
parallel, 118, 125 stopping criteria, 124
PCG, 202, 203 successive over relaxation, 141
Peaceman–Rachford, 149, 150, 152, symmetric, 157
153 symmetric matrices, 21

transpose, 14 unique solvability, 29


triangular, 122, 215 update, 123
triangular matrices, 21
tridiag(−1, 2, −1), 97, 98
tridiagonal, 48–51, 56, 155 vibration, 212

under-relaxation, 139 well conditioned, 76, 220
