
Introduction to Numerical Analysis

Hector D. Ceniceros
© Draft date July 6, 2020
Contents

Preface

1 Introduction
  1.1 What is Numerical Analysis?
  1.2 An Illustrative Example
    1.2.1 An Approximation Principle
    1.2.2 Divide and Conquer
    1.2.3 Convergence and Rate of Convergence
    1.2.4 Error Correction
    1.2.5 Richardson Extrapolation
  1.3 Super-algebraic Convergence

2 Function Approximation
  2.1 Norms
  2.2 Uniform Polynomial Approximation
    2.2.1 Bernstein Polynomials and Bézier Curves
    2.2.2 Weierstrass Approximation Theorem
  2.3 Best Approximation
    2.3.1 Best Uniform Polynomial Approximation
  2.4 Chebyshev Polynomials

3 Interpolation
  3.1 Polynomial Interpolation
    3.1.1 Equispaced and Chebyshev Nodes
  3.2 Connection to Best Uniform Approximation
  3.3 Barycentric Formula
    3.3.1 Barycentric Weights for Chebyshev Nodes
    3.3.2 Barycentric Weights for Equispaced Nodes
    3.3.3 Barycentric Weights for General Sets of Nodes
  3.4 Newton's Form and Divided Differences
  3.5 Cauchy Remainder
  3.6 Hermite Interpolation
  3.7 Convergence of Polynomial Interpolation
  3.8 Piece-wise Linear Interpolation
  3.9 Cubic Splines
    3.9.1 Solving the Tridiagonal System
    3.9.2 Complete Splines
    3.9.3 Parametric Curves

4 Trigonometric Approximation
  4.1 Approximating a Periodic Function
  4.2 Interpolating Fourier Polynomial
  4.3 The Fast Fourier Transform

5 Least Squares Approximation
  5.1 Continuous Least Squares Approximation
  5.2 Linear Independence and Gram-Schmidt Orthogonalization
  5.3 Orthogonal Polynomials
    5.3.1 Chebyshev Polynomials
  5.4 Discrete Least Squares Approximation
  5.5 High-dimensional Data Fitting

6 Computer Arithmetic
  6.1 Floating Point Numbers
  6.2 Rounding and Machine Precision
  6.3 Correctly Rounded Arithmetic
  6.4 Propagation of Errors and Cancellation of Digits

7 Numerical Differentiation
  7.1 Finite Differences
  7.2 The Effect of Round-off Errors
  7.3 Richardson's Extrapolation

8 Numerical Integration
  8.1 Elementary Simpson Quadrature
  8.2 Interpolatory Quadratures
  8.3 Gaussian Quadratures
    8.3.1 Convergence of Gaussian Quadratures
  8.4 Computing the Gaussian Nodes and Weights
  8.5 Clenshaw-Curtis Quadrature
  8.6 Composite Quadratures
  8.7 Modified Trapezoidal Rule
  8.8 The Euler-Maclaurin Formula
  8.9 Romberg Integration

9 Linear Algebra
  9.1 The Three Main Problems
  9.2 Notation
  9.3 Some Important Types of Matrices
  9.4 Schur Theorem
  9.5 Norms
  9.6 Condition Number of a Matrix
    9.6.1 What to Do When A is Ill-conditioned?

10 Linear Systems of Equations I
  10.1 Easy to Solve Systems
  10.2 Gaussian Elimination
    10.2.1 The Cost of Gaussian Elimination
  10.3 LU and Choleski Factorizations
  10.4 Tridiagonal Linear Systems
  10.5 A 1D BVP: Deformation of an Elastic Beam
  10.6 A 2D BVP: Dirichlet Problem for the Poisson's Equation
  10.7 Linear Iterative Methods for Ax = b
  10.8 Jacobi, Gauss-Seidel, and S.O.R.
  10.9 Convergence of Linear Iterative Methods

11 Linear Systems of Equations II
  11.1 Positive Definite Linear Systems as an Optimization Problem
  11.2 Line Search Methods
    11.2.1 Steepest Descent
  11.3 The Conjugate Gradient Method
    11.3.1 Generating the Conjugate Search Directions
  11.4 Krylov Subspaces
  11.5 Convergence of the Conjugate Gradient Method

12 Eigenvalue Problems
  12.1 The Power Method
  12.2 Methods Based on Similarity Transformations
    12.2.1 The QR method

13 Non-Linear Equations
  13.1 Introduction
  13.2 Bisection
    13.2.1 Convergence of the Bisection Method
  13.3 Rate of Convergence
  13.4 Interpolation-Based Methods
  13.5 Newton's Method
  13.6 The Secant Method
  13.7 Fixed Point Iteration
  13.8 Systems of Nonlinear Equations
    13.8.1 Newton's Method

14 Numerical Methods for ODEs
  14.1 Introduction
  14.2 A First Look at Numerical Methods
  14.3 One-Step and Multistep Methods
  14.4 Local and Global Error
  14.5 Order of a Method and Consistency
  14.6 Convergence
  14.7 Runge-Kutta Methods
  14.8 Adaptive Stepping
  14.9 Embedded Methods
  14.10 Multistep Methods
    14.10.1 Adams Methods
    14.10.2 Zero-Stability and Dahlquist Theorem
  14.11 A-Stability
  14.12 Stiff ODEs

List of Figures

  2.1 The Bernstein weights bk,n(x) for x = 0.25 (◦) and x = 0.75 (•), n = 50 and k = 1, . . . , n.
  2.2 Quadratic Bézier curve.
  2.3 If the error function en does not equioscillate at least twice we could lower ‖en‖∞ by an amount c > 0.
  4.1 S8(x) for f(x) = sin(x) e^{cos x} on [0, 2π].
  5.1 The function f(x) = e^x on [0, 1] and its Least Squares Approximation p1(x) = 4e − 10 + (18 − 6e)x.
  5.2 Geometric interpretation of the solution Xa of the Least Squares problem as the orthogonal projection of f on the approximating linear subspace W.

List of Tables

  1.1 Composite Trapezoidal Rule for f(x) = e^x in [0, 1].
  1.2 Composite Trapezoidal Rule for f(x) = 1/(2 + sin x) in [0, 2π].
  14.1 Butcher tableau for a general RK method.
  14.2 Improved Euler.
  14.3 Midpoint RK.
  14.4 Classical fourth order RK.
  14.5 Backward Euler.
  14.6 Implicit mid-point rule RK.
  14.7 Hammer and Hollingworth DIRK.
  14.8 Two-stage order 3 SDIRK (γ = (3 ± √3)/6).
Preface

These notes were prepared by the author for use in the upper division under-
graduate course of Numerical Analysis (Math 104 ABC) at the University of
California at Santa Barbara. They were written with the intent to emphasize
the foundations of Numerical Analysis rather than to present a long list of
numerical methods for different mathematical problems.
We begin with an introduction to Approximation Theory and then use
the different ideas of function approximation in the derivation and analysis
of many numerical methods.
These notes are intended for undergraduate students with a strong math-
ematics background. The prerequisites are Advanced Calculus, Linear Alge-
bra, and introductory courses in Analysis, Differential Equations, and Com-
plex Variables. The ability to write computer code to implement the nu-
merical methods is also a necessary and essential part of learning Numerical
Analysis.
These notes are not in finalized form and may contain errors, misprints,
and other inaccuracies. They cannot be used or distributed without written
consent from the author.

Chapter 1

Introduction

1.1 What is Numerical Analysis?


This is an introductory course in Numerical Analysis, which comprises the
design, analysis, and implementation of constructive methods and algorithms
for the solution of mathematical problems.
Numerical Analysis has vast applications both in Mathematics and in
modern Science and Technology. In the areas of the Physical and Life Sci-
ences, Numerical Analysis plays the role of a virtual laboratory by providing
accurate solutions to the mathematical models representing a given physical
or biological system in which the system’s parameters can be varied at will, in
a controlled way. The applications of Numerical Analysis also extend to more
modern areas such as data analysis, web search engines, social networks, and
basically anything where computation is involved.

1.2 An Illustrative Example: Approximating a Definite Integral
The main principles and objectives of Numerical Analysis are better illus-
trated with concrete examples and this is the purpose of this chapter.
Consider the problem of calculating a definite integral
    I[f] = ∫_a^b f(x) dx.    (1.1)


In most cases we cannot find an exact value of I[f ] and very often we only
know the integrand f at a finite number of points in [a, b]. The problem is
then to produce an approximation to I[f ] as accurate as we need and at a
reasonable computational cost.

1.2.1 An Approximation Principle


One of the central ideas in Numerical Analysis is to approximate a given
function or data by simpler functions which we can analytically evaluate,
integrate, differentiate, etc. For example, we can approximate the integrand
f in [a, b] by the segment of the straight line, a linear polynomial p1 (x), that
passes through (a, f (a)) and (b, f (b)). That is

    f(x) ≈ p1(x) = f(a) + [(f(b) − f(a))/(b − a)] (x − a)    (1.2)

and

    ∫_a^b f(x) dx ≈ ∫_a^b p1(x) dx = f(a)(b − a) + (1/2)[f(b) − f(a)](b − a)
                                   = (1/2)[f(a) + f(b)](b − a).    (1.3)

That is,

    ∫_a^b f(x) dx ≈ [(b − a)/2] [f(a) + f(b)].    (1.4)

The right hand side is known as the simple Trapezoidal Rule Quadrature. A
quadrature is a method to approximate an integral. How accurate is this
approximation? Clearly, if f is a linear polynomial or a constant then the
Trapezoidal Rule would give us the exact value of the integral, i.e. it would
be exact. The underlying question is: how well does a linear polynomial p1 ,
satisfying

p1 (a) = f (a), (1.5)


p1 (b) = f (b), (1.6)

approximate f on the interval [a, b]? We can almost guess the answer. The
approximation is exact at x = a and x = b because of (1.5)-(1.6) and it is

exact for all polynomials of degree ≤ 1. This suggests that f(x) − p1(x) =
C f''(ξ)(x − a)(x − b), where C is a constant. But where is f'' evaluated?
It cannot be at x, for then f would be the solution of a second order ODE,
while f is an arbitrary (but sufficiently smooth, C²[a, b]) function; so it
has to be at some undetermined point ξ(x) in (a, b). Now, if we take the
particular case f(x) = x² on [0, 1], then p1(x) = x, f(x) − p1(x) = x(x − 1),
and f''(x) = 2, which implies that C would have to be 1/2. So our conjecture
is

    f(x) − p1(x) = (1/2) f''(ξ(x))(x − a)(x − b).    (1.7)
There is a beautiful 19th Century proof of this result by A. Cauchy. It goes
as follows. If x = a or x = b (1.7) holds trivially. So let us take x in (a, b)
and define the following function of a new variable t as
    φ(t) = f(t) − p1(t) − [f(x) − p1(x)] [(t − a)(t − b)] / [(x − a)(x − b)].    (1.8)

Then φ, as a function of t, is C²[a, b] and φ(a) = φ(b) = φ(x) = 0. Since
φ(a) = φ(x) = 0, by Rolle's theorem there is ξ1 ∈ (a, x) such that φ'(ξ1) = 0
and similarly there is ξ2 ∈ (x, b) such that φ'(ξ2) = 0. Because φ is C²[a, b] we
can apply Rolle's theorem one more time, observing that φ'(ξ1) = φ'(ξ2) = 0,
to get that there is a point ξ(x) between ξ1 and ξ2 such that φ''(ξ(x)) = 0.
Consequently,

    0 = φ''(ξ(x)) = f''(ξ(x)) − [f(x) − p1(x)] · 2 / [(x − a)(x − b)]    (1.9)

and so

    f(x) − p1(x) = (1/2) f''(ξ(x))(x − a)(x − b),  ξ(x) ∈ (a, b). □    (1.10)
We can use (1.10) to find the accuracy of the simple Trapezoidal Rule. As-
suming the integrand f is C²[a, b],

    ∫_a^b f(x) dx = ∫_a^b p1(x) dx + (1/2) ∫_a^b f''(ξ(x))(x − a)(x − b) dx.    (1.11)

Now, (x − a)(x − b) does not change sign in [a, b] and f'' is continuous so
by the Weighted Mean Value Theorem for Integrals we have that there is

η ∈ (a, b) such that

    ∫_a^b f''(ξ(x))(x − a)(x − b) dx = f''(η) ∫_a^b (x − a)(x − b) dx.    (1.12)

The last integral can be easily evaluated if we shift to the midpoint, i.e.,
changing variables to x = y + (a + b)/2; then

    ∫_a^b (x − a)(x − b) dx = ∫_{−(b−a)/2}^{(b−a)/2} [ y² − ((b − a)/2)² ] dy = −(1/6)(b − a)³.    (1.13)

Collecting (1.11) and (1.13) we get

    ∫_a^b f(x) dx = [(b − a)/2][f(a) + f(b)] − (1/12) f''(η)(b − a)³,    (1.14)

where η is some point in (a, b). So in the approximation

    ∫_a^b f(x) dx ≈ [(b − a)/2][f(a) + f(b)]

we make the error

    E[f] = −(1/12) f''(η)(b − a)³.    (1.15)

1.2.2 Divide and Conquer


The error (1.15) of the simple Trapezoidal Rule grows cubically with the
length of the interval of integration so it is natural to divide [a, b] into smaller
subintervals, apply the Trapezoidal Rule on each of them, and sum up the
result.
Let us divide [a, b] in N subintervals of equal length h = (b − a)/N, deter-
mined by the points x0 = a, x1 = x0 + h, x2 = x0 + 2h, . . . , xN = x0 + Nh = b.
Then

    ∫_a^b f(x) dx = ∫_{x0}^{x1} f(x) dx + ∫_{x1}^{x2} f(x) dx + . . . + ∫_{xN−1}^{xN} f(x) dx
                  = Σ_{j=0}^{N−1} ∫_{xj}^{xj+1} f(x) dx.    (1.16)

But we know

    ∫_{xj}^{xj+1} f(x) dx = (1/2)[f(xj) + f(xj+1)] h − (1/12) f''(ξj) h³    (1.17)

for some ξj ∈ (xj, xj+1). Therefore, we get

    ∫_a^b f(x) dx = h [ (1/2) f(x0) + f(x1) + . . . + f(xN−1) + (1/2) f(xN) ] − (1/12) h³ Σ_{j=0}^{N−1} f''(ξj).

The first term on the right hand side is called the Composite Trapezoidal
Rule Quadrature (CTR):

    Th[f] := h [ (1/2) f(x0) + f(x1) + . . . + f(xN−1) + (1/2) f(xN) ].    (1.18)

The error term is

    Eh[f] = −(1/12) h³ Σ_{j=0}^{N−1} f''(ξj) = −(1/12)(b − a) h² [ (1/N) Σ_{j=0}^{N−1} f''(ξj) ],    (1.19)

where we have used that h = (b − a)/N. The term in brackets is a mean
value of f'' (it is easy to prove that it lies between the maximum and the
minimum of f''). Since f'' is assumed continuous (f ∈ C²[a, b]), then by the
Intermediate Value Theorem there is a point ξ ∈ (a, b) such that f''(ξ) is
equal to the quantity in the brackets, so we obtain

    Eh[f] = −(1/12)(b − a) h² f''(ξ),    (1.20)

for some ξ ∈ (a, b).

1.2.3 Convergence and Rate of Convergence


We do not know what the point ξ is in (1.20). If we knew, the error could
be evaluated and we would know the integral exactly, at least in principle,
because

I[f ] = Th [f ] + Eh [f ]. (1.21)

But (1.20) gives us two important properties of the approximation method
in question. First, (1.20) tells us that Eh[f] → 0 as h → 0. That is, the
quadrature rule Th[f] converges to the exact value of the integral as h → 0
(neglecting round-off errors introduced by finite precision number representation
and computer arithmetic). Recall h = (b − a)/N, so as we increase N our
approximation to the integral gets better and better. Second, (1.20) tells us
how fast the approximation converges, namely quadratically in h. This is the
approximation's rate of convergence. If we double N (or equivalently halve h)
then the error decreases by a factor of 4. We also say that the error is order
h² and write Eh[f] = O(h²). The Big 'O' notation is used frequently in
Numerical Analysis.

Definition 1.1. We say that g(h) is order h^α, and write g(h) = O(h^α), if
there is a constant C and h0 such that |g(h)| ≤ C h^α for 0 ≤ h ≤ h0, i.e. for
sufficiently small h.

Example 1.1. Let's check the Trapezoidal Rule approximation for an integral
we can compute exactly. Take f(x) = e^x in [0, 1]. The exact value of the
integral is e − 1. Observe how the error |I[e^x] − T1/N[e^x]| decreases by a
factor of (approximately) 1/4 as N is doubled, in accordance with (1.20).

Table 1.1: Composite Trapezoidal Rule for f(x) = e^x in [0, 1].

    N     T1/N[e^x]            |I[e^x] − T1/N[e^x]|           Decrease factor
    16    1.718841128579994    5.593001209489579 × 10^{-4}
    32    1.718421660316327    1.398318572816137 × 10^{-4}    0.250012206406039
    64    1.718316786850094    3.495839104861176 × 10^{-5}    0.250003051723810
    128   1.718290568083478    8.739624432374526 × 10^{-6}    0.250000762913303
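A minimal computational sketch of the Composite Trapezoidal Rule (1.18), written in Python with NumPy (this code is not part of the original notes); it can be used to reproduce the decrease factors of Table 1.1:

    import numpy as np

    def composite_trapezoid(f, a, b, N):
        """Composite Trapezoidal Rule T_h[f] with N subintervals, h = (b - a)/N."""
        x = np.linspace(a, b, N + 1)
        h = (b - a) / N
        return h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))

    I_exact = np.e - 1.0                      # exact value of the integral of e^x on [0, 1]
    errors = []
    for N in [16, 32, 64, 128]:
        T = composite_trapezoid(np.exp, 0.0, 1.0, N)
        errors.append(abs(I_exact - T))
        print(N, T, errors[-1])
    # successive error ratios should be approximately 1/4, as predicted by (1.20)
    print([errors[k + 1] / errors[k] for k in range(len(errors) - 1)])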

1.2.4 Error Correction


We can get an upper bound for the error using (1.20) and the fact that f'' is
bounded in [a, b], i.e. |f''(x)| ≤ M2 for all x ∈ [a, b] for some constant M2. Then

    |Eh[f]| ≤ (1/12)(b − a) h² M2.    (1.22)

However, this bound does not in general provide an accurate estimate of the
error. It could grossly overestimate it. This can be seen from (1.19). As
N → ∞ the term in brackets converges to a mean value of f'', i.e.

    (1/N) Σ_{j=0}^{N−1} f''(ξj) −→ (1/(b − a)) ∫_a^b f''(x) dx = (1/(b − a)) [f'(b) − f'(a)],    (1.23)

as N → ∞, which could be significantly smaller than the maximum of |f''|.
Take for example f(x) = e^{100x} on [0, 1]. Then max |f''| = 10000 e^{100}, whereas
the mean value (1.23) is equal to 100(e^{100} − 1), so the error bound (1.22)
overestimates the actual error by two orders of magnitude. Thus, (1.22) is
of little practical use.
Equations (1.19) and (1.23) suggest that asymptotically, that is for suffi-
ciently small h,

    Eh[f] = C2 h² + R(h),    (1.24)

where

    C2 = −(1/12) [f'(b) − f'(a)]    (1.25)

and R(h) goes to zero faster than h² as h → 0, i.e.

    lim_{h→0} R(h)/h² = 0.    (1.26)

We say that R(h) = o(h²) (little 'o' of h²).


Definition 1.2. A function g(h) is little 'o' of h^α if

    lim_{h→0} g(h)/h^α = 0

and we write g(h) = o(h^α).


We then have

    I[f] = Th[f] + C2 h² + R(h)    (1.27)

and, for sufficiently small h, C2 h² is an approximation of the error. If it
is possible and computationally efficient to evaluate the first derivative of
f at the end points of the interval, then we can compute C2 h² directly and
use this leading order approximation of the error to obtain the improved
approximation

    T̃h[f] = Th[f] − (1/12) [f'(b) − f'(a)] h².    (1.28)

This is called the (composite) Modified Trapezoidal Rule. It then follows from
(1.27) that the error of this "corrected approximation" is R(h), which goes to
zero faster than h². In fact, we will prove later that the error of the Modified
Trapezoidal Rule is O(h⁴).
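As a hedged illustration (Python/NumPy again, not from the notes), the correction (1.28) can be applied on top of the composite trapezoidal sum whenever f' is available at the end points; the snippet recomputes Th[f] internally so that it is self-contained:

    import numpy as np

    def modified_trapezoid(f, fprime, a, b, N):
        """Composite Modified Trapezoidal Rule (1.28): T_h[f] - (1/12)[f'(b) - f'(a)] h^2."""
        x = np.linspace(a, b, N + 1)
        h = (b - a) / N
        Th = h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))
        return Th - (h ** 2 / 12.0) * (fprime(b) - fprime(a))

    # For f(x) = e^x on [0, 1] the error should now decrease roughly like h^4
    # (a factor of about 1/16 each time N is doubled).
    for N in [16, 32, 64]:
        print(N, abs((np.e - 1.0) - modified_trapezoid(np.exp, np.exp, 0.0, 1.0, N)))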
Often, we only have access to values of f and/or it is difficult to evaluate
f'(a) and f'(b). Fortunately, we can compute a sufficiently good approxi-
mation of the leading order term of the error, C2 h², so that we can use the
same error correction idea that we did for the Modified Trapezoidal Rule.
Roughly speaking, the error can be estimated by comparing two approxima-
tions obtained with different h.
Consider (1.27). If we halve h we get

    I[f] = Th/2[f] + (1/4) C2 h² + R(h/2).    (1.29)
Subtracting (1.29) from (1.27) we get

    C2 h² = (4/3) [ Th/2[f] − Th[f] ] + (4/3) [ R(h/2) − R(h) ].    (1.30)

The last term on the right hand side is o(h²). Hence, for h sufficiently small,
we have

    C2 h² ≈ (4/3) [ Th/2[f] − Th[f] ]    (1.31)

and this could provide a good, computable estimate for the error, i.e.

    Eh[f] ≈ (4/3) [ Th/2[f] − Th[f] ].    (1.32)

The key here is that h has to be sufficiently small to make the asymptotic
approximation (1.31) valid. We can check this by working backwards. If h
is sufficiently small, then evaluating (1.31) at h/2 we get

    C2 (h/2)² ≈ (4/3) [ Th/4[f] − Th/2[f] ]    (1.33)

and consequently the ratio

    q(h) = [ Th/2[f] − Th[f] ] / [ Th/4[f] − Th/2[f] ]    (1.34)

should be approximately 4. Thus, q(h) offers a reliable, computable indicator
of whether or not h is sufficiently small for (1.32) to be an accurate estimate
of the error.
We can now use (1.31) and the idea of error correction to improve the
accuracy of Th[f] with the following approximation (the symbol := means
equal by definition):

    Sh[f] := Th[f] + (4/3) [ Th/2[f] − Th[f] ].    (1.35)

1.2.5 Richardson Extrapolation


We can view the error correction procedure as a way to eliminate the
leading order (in h) contribution to the error. Multiplying (1.29) by 4 and
subtracting (1.27) from the result we get

    I[f] = [ 4 Th/2[f] − Th[f] ] / 3 + [ 4 R(h/2) − R(h) ] / 3.    (1.36)

Note that Sh[f] is exactly the first term in the right hand side of (1.36) and
that the last term converges to zero faster than h². This very useful and
general procedure, in which the leading order component of the asymptotic
form of the error is eliminated by a combination of two computations performed
with two different values of h, is called Richardson's Extrapolation.
Example 1.2. Consider again f(x) = e^x in [0, 1]. With h = 1/16 we get

    q(1/16) = [ T1/32[e^x] − T1/16[e^x] ] / [ T1/64[e^x] − T1/32[e^x] ] ≈ 3.9998    (1.37)

and the improved approximation is

    S1/16[e^x] = T1/16[e^x] + (4/3) [ T1/32[e^x] − T1/16[e^x] ] = 1.718281837561771,    (1.38)

which gives us nearly 8 digits of accuracy (error ≈ 9.1 × 10^{−9}). S1/32 gives
us an error ≈ 5.7 × 10^{−10}, i.e. the error decreased by approximately a factor
of 1/16. This would correspond to a fourth order rate of convergence. We will
see in Chapter 8 that indeed this is the case.
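A small computational sketch of this error indicator and extrapolation step (Python/NumPy, not part of the notes); the trapezoidal sum from the earlier sketch is repeated here so the snippet is self-contained:

    import numpy as np

    def composite_trapezoid(f, a, b, N):
        """Composite Trapezoidal Rule T_h[f] with N subintervals."""
        x = np.linspace(a, b, N + 1)
        h = (b - a) / N
        return h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))

    f, a, b, N = np.exp, 0.0, 1.0, 16          # h = 1/16
    Th, Th2, Th4 = (composite_trapezoid(f, a, b, k * N) for k in (1, 2, 4))

    q = (Th2 - Th) / (Th4 - Th2)               # ratio (1.34), should be close to 4
    Sh = Th + 4.0 / 3.0 * (Th2 - Th)           # Richardson extrapolation (1.35)
    print(q, Sh, abs((np.e - 1.0) - Sh))       # error of S_{1/16} is about 9.1e-9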

It appears that Sh[f] gives us superior accuracy to that of Th[f] but at
roughly twice the computational cost. If we group together the common
terms in Th[f] and Th/2[f], we can compute Sh[f] at about the same compu-
tational cost as that of Th/2[f]:

    4 Th/2[f] − Th[f] = 4 (h/2) [ (1/2) f(a) + Σ_{j=1}^{2N−1} f(a + jh/2) + (1/2) f(b) ]
                        − h [ (1/2) f(a) + Σ_{j=1}^{N−1} f(a + jh) + (1/2) f(b) ]
                      = (h/2) [ f(a) + f(b) + 2 Σ_{k=1}^{N−1} f(a + kh) + 4 Σ_{k=1}^{N} f(a + (2k − 1)h/2) ].

Therefore

    Sh[f] = (h/6) [ f(a) + 2 Σ_{k=1}^{N−1} f(a + kh) + 4 Σ_{k=1}^{N} f(a + (2k − 1)h/2) + f(b) ].    (1.39)

The resulting quadrature formula Sh [f ] is known as the Composite Simpson’s


Rule and, as we will see in Chapter 8, can be derived by approximating the
integrand by quadratic polynomials. Thus, based on cost and accuracy, the
Composite Simpson’s Rule would be preferable to the Composite Trapezoidal
Rule, with one important exception: periodic smooth integrands integrated
over their period.
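A hedged sketch of (1.39) in Python/NumPy (not part of the notes); the node bookkeeping follows the derivation above:

    import numpy as np

    def composite_simpson(f, a, b, N):
        """Composite Simpson's Rule S_h[f] (1.39) with N subintervals of length h."""
        h = (b - a) / N
        nodes = a + h * np.arange(1, N)                      # interior nodes a + k h, k = 1,...,N-1
        midpoints = a + h * (np.arange(1, N + 1) - 0.5)      # midpoints a + (2k-1)h/2, k = 1,...,N
        return (h / 6.0) * (f(a) + 2.0 * np.sum(f(nodes))
                            + 4.0 * np.sum(f(midpoints)) + f(b))

    # The error for f(x) = e^x on [0, 1] should decrease roughly like h^4.
    for N in [4, 8, 16]:
        print(N, abs((np.e - 1.0) - composite_simpson(np.exp, 0.0, 1.0, N)))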

Example 1.3. Consider the integral

    I[1/(2 + sin x)] = ∫_0^{2π} dx / (2 + sin x).    (1.40)

Using Complex Variables techniques (Residues) the exact integral can be com-
puted: I[1/(2 + sin x)] = 2π/√3. Note that the integrand is smooth (it has
an infinite number of continuous derivatives) and periodic in [0, 2π]. If we
use the Composite Trapezoidal Rule to find approximations to this integral,
we obtain the results shown in Table 1.2.
The approximations converge amazingly fast. With N = 32, we have already
reached machine precision (with double precision we get about 16 digits).

Table 1.2: Composite Trapezoidal Rule for f(x) = 1/(2 + sin x) in [0, 2π].

    N    T2π/N[1/(2 + sin x)]   |I[1/(2 + sin x)] − T2π/N[1/(2 + sin x)]|
    8    3.627791516645356      1.927881769203665 × 10^{-4}
    16   3.627598733591013      5.122577029226250 × 10^{-9}
    32   3.627598728468435      4.440892098500626 × 10^{-16}

1.3 Super-Algebraic Convergence of the CTR for Smooth Periodic Integrands
Integrals of periodic integrands appear in many applications, most notably,
in Fourier Analysis.
Consider the definite integral

    I[f] = ∫_0^{2π} f(x) dx,

where the integrand f is periodic in [0, 2π] and has m > 1 continuous deriva-
tives, i.e. f ∈ C^m[0, 2π] and f(x + 2π) = f(x) for all x. Due to periodicity
we can work in any interval of length 2π and if the function has a different
period, with a simple change of variables, we can reduce the problem to one
in [0, 2π].
Consider the equally spaced points in [0, 2π], xj = jh for j = 0, 1, . . . , N
and h = 2π/N. Because f is periodic, f(x0 = 0) = f(xN = 2π) and the CTR
becomes

    Th[f] = h [ f(x0)/2 + f(x1) + . . . + f(xN−1) + f(xN)/2 ] = h Σ_{j=0}^{N−1} f(xj).    (1.41)

Since f is smooth and periodic in [0, 2π], it has a uniformly convergent Fourier
Series:

    f(x) = a0/2 + Σ_{k=1}^{∞} (ak cos kx + bk sin kx),    (1.42)

where

    ak = (1/π) ∫_0^{2π} f(x) cos kx dx,  k = 0, 1, . . .    (1.43)
    bk = (1/π) ∫_0^{2π} f(x) sin kx dx,  k = 1, 2, . . .    (1.44)

Using the Euler formula

    e^{ix} = cos x + i sin x    (1.45)

(here i² = −1, and if c = a + ib with a, b ∈ ℝ, then its complex conjugate is
c̄ = a − ib), we can write

    cos x = (e^{ix} + e^{−ix}) / 2,    (1.46)
    sin x = (e^{ix} − e^{−ix}) / (2i),    (1.47)

and the Fourier series can be conveniently expressed in complex form in terms
of the functions e^{ikx} for k = 0, ±1, ±2, . . ., so that (1.42) becomes

    f(x) = Σ_{k=−∞}^{∞} ck e^{ikx},    (1.48)

where

    ck = (1/(2π)) ∫_0^{2π} f(x) e^{−ikx} dx.    (1.49)

We are assuming that f is real-valued, so the complex Fourier coefficients
satisfy c̄k = c−k, where c̄k is the complex conjugate of ck. We have the
relations 2c0 = a0 and 2ck = ak − i bk for k = ±1, ±2, . . . between the complex
and real Fourier coefficients.
Using (1.48) in (1.41) we get

    Th[f] = h Σ_{j=0}^{N−1} ( Σ_{k=−∞}^{∞} ck e^{ik xj} ).    (1.50)

Justified by the uniform convergence of the series, we can exchange the finite
and the infinite sums to get

    Th[f] = (2π/N) Σ_{k=−∞}^{∞} ck Σ_{j=0}^{N−1} e^{ik(2π/N)j}.    (1.51)

But

    Σ_{j=0}^{N−1} e^{ik(2π/N)j} = Σ_{j=0}^{N−1} ( e^{ik(2π/N)} )^j.    (1.52)

Note that e^{ik(2π/N)} = 1 precisely when k is an integer multiple of N, i.e. k = lN,
l ∈ ℤ, and if so

    Σ_{j=0}^{N−1} ( e^{ik(2π/N)} )^j = N  for k = lN.    (1.53)

Otherwise, if k ≠ lN, then

    Σ_{j=0}^{N−1} ( e^{ik(2π/N)} )^j = [ 1 − ( e^{ik(2π/N)} )^N ] / [ 1 − e^{ik(2π/N)} ] = 0  for k ≠ lN.    (1.54)

Using (1.53) and (1.54) we thus get that

    Th[f] = 2π Σ_{l=−∞}^{∞} clN.    (1.55)

On the other hand,

    c0 = (1/(2π)) ∫_0^{2π} f(x) dx = (1/(2π)) I[f].    (1.56)

Therefore

    Th[f] = I[f] + 2π [ cN + c−N + c2N + c−2N + . . . ],    (1.57)

that is,

    |Th[f] − I[f]| ≤ 2π [ |cN| + |c−N| + |c2N| + |c−2N| + . . . ].    (1.58)

So now, the relevant question is how fast the Fourier coefficients clN of f
decay with N . The answer is tied to the smoothness of f . Doing integration
by parts in the formula (4.11) for the Fourier coefficients of f we have
    ck = (1/(2π)) (1/(ik)) [ ∫_0^{2π} f'(x) e^{−ikx} dx − f(x) e^{−ikx} |_0^{2π} ],  k ≠ 0,    (1.59)

and the last term vanishes due to the periodicity of f(x) e^{−ikx}. Hence,

    ck = (1/(2π)) (1/(ik)) ∫_0^{2π} f'(x) e^{−ikx} dx,  k ≠ 0.    (1.60)

Integrating by parts m times we obtain

    ck = (1/(2π)) (1/(ik))^m ∫_0^{2π} f^{(m)}(x) e^{−ikx} dx,  k ≠ 0,    (1.61)

where f^{(m)} is the m-th derivative of f. Therefore, for f ∈ C^m[0, 2π] and
periodic,

    |ck| ≤ Am / |k|^m,    (1.62)

where Am is a constant (depending only on m). Using this in (1.58) we get

    |Th[f] − I[f]| ≤ 2π Am [ 2/N^m + 2/(2N)^m + 2/(3N)^m + . . . ]
                  = (4π Am / N^m) [ 1 + 1/2^m + 1/3^m + . . . ],    (1.63)

and so for m > 1 we can conclude that

    |Th[f] − I[f]| ≤ Cm / N^m.    (1.64)
Thus, in this particular case, the rate of convergence of the CTR at equally
spaced points is not fixed (to 2). It depends on the number of derivatives
of f and we say that the accuracy and convergence of the approximation is
spectral. Note that if f is smooth, i.e. f ∈ C ∞ [0, 2π] and periodic, the CTR
converges to the exact integral at a rate faster than any power of 1/N (or
h)! This is called super-algebraic convergence.
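A brief numerical check of this super-algebraic behavior (a Python/NumPy sketch, not part of the notes), using the smooth periodic integrand of Example 1.3:

    import numpy as np

    def ctr_periodic(f, N):
        """CTR (1.41) on [0, 2*pi] for a 2*pi-periodic f: h times the sum of f at N equally spaced points."""
        h = 2.0 * np.pi / N
        x = h * np.arange(N)
        return h * np.sum(f(x))

    f = lambda x: 1.0 / (2.0 + np.sin(x))
    I_exact = 2.0 * np.pi / np.sqrt(3.0)

    # The error drops far faster than any fixed power of 1/N (machine precision by N = 32).
    for N in [8, 16, 32]:
        print(N, abs(I_exact - ctr_periodic(f, N)))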
Chapter 2

Function Approximation

We saw in the introductory chapter that one key step in the construction of
a numerical method to approximate a definite integral is the approximation
of the integrand by a simpler function, which we can integrate exactly.
The problem of function approximation is central to many numerical
methods: given a continuous function f in an interval [a, b], we would like to
find a good approximation to it by simpler functions, such as polynomials,
trigonometric polynomials, wavelets, rational functions, etc. We are going
to measure the accuracy of an approximation using norms and ask whether
or not there is a best approximation out of functions from a given family of
simpler functions. These are the main topics of this introductory chapter to
Approximation Theory.

2.1 Norms
A norm on a vector space V over a field K = ℝ (or ℂ) is a mapping

    ‖·‖ : V → [0, ∞)

which satisfies the following properties:

(i) ‖x‖ ≥ 0 ∀x ∈ V and ‖x‖ = 0 iff x = 0.

(ii) ‖x + y‖ ≤ ‖x‖ + ‖y‖ ∀x, y ∈ V.

(iii) ‖λx‖ = |λ| ‖x‖ ∀x ∈ V, λ ∈ K.


If we relax (i) to just ‖x‖ ≥ 0, we obtain a semi-norm.
We recall first some of the most important examples of norms in the finite
dimensional case V = ℝⁿ (or V = ℂⁿ):

    ‖x‖1 = |x1| + . . . + |xn|,    (2.1)
    ‖x‖2 = ( |x1|² + . . . + |xn|² )^{1/2},    (2.2)
    ‖x‖∞ = max{ |x1|, . . . , |xn| }.    (2.3)

These are all special cases of the lp norm:

    ‖x‖p = ( |x1|^p + . . . + |xn|^p )^{1/p},  1 ≤ p ≤ ∞.    (2.4)

If we have weights wi > 0 for i = 1, . . . , n we can also define a weighted p
norm by

    ‖x‖w,p = ( w1 |x1|^p + . . . + wn |xn|^p )^{1/p},  1 ≤ p ≤ ∞.    (2.5)

All norms in a finite dimensional space V are equivalent, in the sense that
there are two constants c and C such that

    ‖x‖α ≤ C ‖x‖β,    (2.6)
    ‖x‖β ≤ c ‖x‖α,    (2.7)

for all x ∈ V and for any two norms ‖·‖α and ‖·‖β defined in V.
If V is a space of functions defined on an interval [a, b], for example C[a, b],
the norms corresponding to (2.1)-(2.4) are given by

    ‖u‖1 = ∫_a^b |u(x)| dx,    (2.8)
    ‖u‖2 = ( ∫_a^b |u(x)|² dx )^{1/2},    (2.9)
    ‖u‖∞ = sup_{x∈[a,b]} |u(x)|,    (2.10)
    ‖u‖p = ( ∫_a^b |u(x)|^p dx )^{1/p},  1 ≤ p ≤ ∞,    (2.11)

and are called the L¹, L², L∞, and Lᵖ norms, respectively. Similarly to (2.5)
we can define a weighted Lᵖ norm by

    ‖u‖p = ( ∫_a^b w(x) |u(x)|^p dx )^{1/p},  1 ≤ p ≤ ∞,    (2.12)

where w is a given positive weight function defined in [a, b]. If w(x) ≥ 0, we


get a semi-norm.

Lemma 1. Let ‖·‖ be a norm on a vector space V. Then

    | ‖x‖ − ‖y‖ | ≤ ‖x − y‖.    (2.13)

This lemma implies that a norm is a continuous function (from V to ℝ).

Proof. ‖x‖ = ‖x − y + y‖ ≤ ‖x − y‖ + ‖y‖, which gives

    ‖x‖ − ‖y‖ ≤ ‖x − y‖.    (2.14)

By reversing the roles of x and y we also get

    ‖y‖ − ‖x‖ ≤ ‖x − y‖,    (2.15)

and the two inequalities together give (2.13).

2.2 Uniform Polynomial Approximation


There is a fundamental result in approximation theory, which states that
any continuous function can be approximated uniformly, i.e. using the norm
‖·‖∞, with arbitrary accuracy by a polynomial. This is the celebrated Weier-
strass Approximation Theorem. We are going to present a constructive proof
due to Sergei Bernstein, which uses a class of polynomials that have found
widespread applications in computer graphics and animation. Historically,
the use of these so-called Bernstein polynomials in computer assisted design
(CAD) was introduced by two engineers working in the French car industry:
Pierre Bézier at Renault and Paul de Casteljau at Citroën.

2.2.1 Bernstein Polynomials and Bézier Curves


Given a function f on [0, 1], the Bernstein polynomial of degree n ≥ 1 is
defined by

    Bn f(x) = Σ_{k=0}^{n} f(k/n) (n choose k) x^k (1 − x)^{n−k},    (2.16)

where

    (n choose k) = n! / [(n − k)! k!],  k = 0, . . . , n,    (2.17)

are the binomial coefficients. Note that Bn f(0) = f(0) and Bn f(1) = f(1)
for all n. The terms

    bk,n(x) = (n choose k) x^k (1 − x)^{n−k},  k = 0, . . . , n,    (2.18)

which are all nonnegative, are called the Bernstein basis polynomials and can
be viewed as x-dependent weights that sum up to one:

    Σ_{k=0}^{n} bk,n(x) = Σ_{k=0}^{n} (n choose k) x^k (1 − x)^{n−k} = [x + (1 − x)]^n = 1.    (2.19)

Thus, for each x ∈ [0, 1], Bn f (x) represents a weighted average of the values
of f at 0, 1/n, 2/n, . . . , 1. Moreover, as n increases the weights bk,n (x) con-
centrate more and more around the points k/n close to x as Fig. 2.1 indicates
for bk,n (0.25) and bk,n (0.75).
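A short Python/NumPy sketch (not from the notes) that evaluates Bn f(x) directly from definition (2.16); scipy.special.comb is assumed available for the binomial coefficients:

    import numpy as np
    from scipy.special import comb

    def bernstein(f, n, x):
        """Evaluate the Bernstein polynomial B_n f at x (scalar or array), per (2.16)."""
        x = np.atleast_1d(np.asarray(x, dtype=float))
        k = np.arange(n + 1)
        # Bernstein basis b_{k,n}(x) for all k at once; rows index x, columns index k.
        basis = comb(n, k) * x[:, None] ** k * (1.0 - x[:, None]) ** (n - k)
        return basis @ f(k / n)

    # Example: for f(x) = x^2, B_n f should equal ((n-1)/n) x^2 + x/n, cf. (2.25) below.
    f = lambda t: t ** 2
    print(bernstein(f, 50, [0.25, 0.5, 0.75]))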
For n = 1, the Bernstein polynomial is just the straight line connecting
f (0) and f (1), B1 f (x) = (1 − x)f (0) + xf (1). Given two points P0 = (x0 , y0 )
and P1 = (x1 , y1 ), the segment of the straight line connecting them can be
written in parametric form as

B1 (t) = (1 − t)P0 + t P1 , t ∈ [0, 1]. (2.20)

With three points, P0, P1, P2, we can employ the quadratic Bernstein basis
polynomials to get a more useful parametric curve

    B2(t) = (1 − t)² P0 + 2t(1 − t) P1 + t² P2,  t ∈ [0, 1].    (2.21)

This curve again connects P0 and P2, but P1 can be used to control how
the curve bends. More precisely, the tangents at the end points are B2'(0) =
2(P1 − P0) and B2'(1) = 2(P2 − P1), which intersect at P1, as Fig. 2.2 illus-
trates. These parametric curves formed with the Bernstein basis polynomials
are called Bézier curves and have been widely employed in computer graph-
ics, especially in the design of vector fonts, and in computer animation. To
allow the representation of complex shapes, quadratic or cubic Bézier curves

Figure 2.1: The Bernstein weights bk,n(x) for x = 0.25 (◦) and x = 0.75 (•),
n = 50 and k = 1, . . . , n.


Figure 2.2: Quadratic Bézier curve.


are pieced together to form composite Bézier curves. To have some degree
of smoothness (C¹), the common point for two pieces of a composite Bézier
curve has to lie on the line connecting the two adjacent control points on ei-
ther side. For example, the TrueType font used in most computers today is
generated with composite, quadratic Bézier curves, while the Metafont used
in these pages, via LaTeX, employs composite, cubic Bézier curves. For each
character, many Bézier pieces are stitched together.
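As an illustrative sketch (Python/NumPy, not part of the notes), a quadratic Bézier curve (2.21) can be evaluated directly from its three control points:

    import numpy as np

    def quadratic_bezier(P0, P1, P2, t):
        """Evaluate B_2(t) = (1-t)^2 P0 + 2 t (1-t) P1 + t^2 P2 for t in [0, 1]."""
        t = np.asarray(t, dtype=float)[:, None]
        P0, P1, P2 = (np.asarray(P, dtype=float) for P in (P0, P1, P2))
        return (1.0 - t) ** 2 * P0 + 2.0 * t * (1.0 - t) * P1 + t ** 2 * P2

    # Control points roughly as in Fig. 2.2: the curve starts at P0, ends at P2,
    # and is pulled toward P1 without passing through it.
    t = np.linspace(0.0, 1.0, 5)
    print(quadratic_bezier([0.0, 0.0], [0.5, 1.0], [1.0, 0.0], t))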
Let us now do some algebra to prove some useful identities of the Bern-
stein polynomials. First, for f(x) = x we have

    Σ_{k=0}^{n} (k/n) (n choose k) x^k (1 − x)^{n−k} = Σ_{k=1}^{n} [ k n! / (n (n − k)! k!) ] x^k (1 − x)^{n−k}
      = x Σ_{k=1}^{n} (n − 1 choose k − 1) x^{k−1} (1 − x)^{n−k}
      = x Σ_{k=0}^{n−1} (n − 1 choose k) x^k (1 − x)^{n−1−k}    (2.22)
      = x [x + (1 − x)]^{n−1} = x.

Now for f(x) = x², we get

    Σ_{k=0}^{n} (k/n)² (n choose k) x^k (1 − x)^{n−k} = Σ_{k=1}^{n} (k/n) (n − 1 choose k − 1) x^k (1 − x)^{n−k},    (2.23)

and writing

    k/n = (k − 1)/n + 1/n = [(n − 1)/n] [(k − 1)/(n − 1)] + 1/n,    (2.24)

we have

    Σ_{k=0}^{n} (k/n)² (n choose k) x^k (1 − x)^{n−k}
      = [(n − 1)/n] Σ_{k=2}^{n} [(k − 1)/(n − 1)] (n − 1 choose k − 1) x^k (1 − x)^{n−k}
        + (1/n) Σ_{k=1}^{n} (n − 1 choose k − 1) x^k (1 − x)^{n−k}
      = [(n − 1)/n] Σ_{k=2}^{n} (n − 2 choose k − 2) x^k (1 − x)^{n−k} + x/n
      = [(n − 1)/n] x² Σ_{k=0}^{n−2} (n − 2 choose k) x^k (1 − x)^{n−2−k} + x/n.

Thus,

    Σ_{k=0}^{n} (k/n)² (n choose k) x^k (1 − x)^{n−k} = [(n − 1)/n] x² + x/n.    (2.25)

Now, expanding (k/n − x)² and using (2.19), (2.22), and (2.25), it follows that

    Σ_{k=0}^{n} (k/n − x)² (n choose k) x^k (1 − x)^{n−k} = (1/n) x(1 − x).    (2.26)

2.2.2 Weierstrass Approximation Theorem


Theorem 2.1. (Weierstrass Approximation Theorem) Let f be a continuous
function in [a, b]. Given ε > 0 there is a polynomial p such that

    max_{a≤x≤b} |f(x) − p(x)| < ε.

Proof. We are going to work on the interval [0, 1]. For a general interval
[a, b], we consider the simple change of variables x = a + (b − a)t for t ∈ [0, 1]
so that F(t) = f(a + (b − a)t) is continuous in [0, 1].
Using (2.19), we have

    f(x) − Bn f(x) = Σ_{k=0}^{n} [ f(x) − f(k/n) ] (n choose k) x^k (1 − x)^{n−k}.    (2.27)

Since f is continuous in [0, 1], it is also uniformly continuous. Thus, given
ε > 0 there is δ(ε) > 0, independent of x, such that

    |f(x) − f(k/n)| < ε/2  if |x − k/n| < δ.    (2.28)

Moreover,

    |f(x) − f(k/n)| ≤ 2‖f‖∞  for all x ∈ [0, 1], k = 0, 1, . . . , n.    (2.29)

We now split the sum in (2.27) in two sums, one over the points such that
|k/n − x| < δ and the other over the points such that |k/n − x| ≥ δ:

    f(x) − Bn f(x) = Σ_{|k/n−x|<δ} [ f(x) − f(k/n) ] (n choose k) x^k (1 − x)^{n−k}
                     + Σ_{|k/n−x|≥δ} [ f(x) − f(k/n) ] (n choose k) x^k (1 − x)^{n−k}.    (2.30)

Using (2.28) and (2.19) it follows immediately that the first sum is bounded
by ε/2. For the second sum we have

    | Σ_{|k/n−x|≥δ} [ f(x) − f(k/n) ] (n choose k) x^k (1 − x)^{n−k} |
      ≤ 2‖f‖∞ Σ_{|k/n−x|≥δ} (n choose k) x^k (1 − x)^{n−k}
      ≤ (2‖f‖∞ / δ²) Σ_{|k/n−x|≥δ} (k/n − x)² (n choose k) x^k (1 − x)^{n−k}    (2.31)
      ≤ (2‖f‖∞ / δ²) Σ_{k=0}^{n} (k/n − x)² (n choose k) x^k (1 − x)^{n−k}
      = (2‖f‖∞ / (n δ²)) x(1 − x) ≤ ‖f‖∞ / (2n δ²).

Therefore, there is N such that for all n ≥ N the second sum in (2.30) is
bounded by ε/2 and this completes the proof.

2.3 Best Approximation


We just saw that any continuous function f on a closed interval can be
approximated uniformly with arbitrary accuracy by a polynomial. Ideally
we would like to find the closest polynomial, say of degree at most n, to the
function f when the distance is measured in the supremum (infinity) norm,
or in any other norm we choose. There are three important elements in this
general problem: the space of functions we want to approximate, the norm,
and the family of approximating functions. The following definition makes
this more precise.

Definition 2.1. Given a normed linear space V and a subspace W of V,
p∗ ∈ W is called the best approximation of f ∈ V by elements in W if

    ‖f − p∗‖ ≤ ‖f − p‖,  for all p ∈ W.    (2.32)

For example, the normed linear space V could be C[a, b] with the supre-
mum norm (2.10) and W could be the set of all polynomials of degree at
most n, which henceforth we will denote by Pn.

Theorem 2.2. Let W be a finite-dimensional subspace of a normed linear
space V. Then, for every f ∈ V, there is at least one best approximation to
f by elements in W.

Proof. Since W is a subspace, 0 ∈ W and for any candidate p ∈ W for best
approximation to f we must have

    ‖f − p‖ ≤ ‖f − 0‖ = ‖f‖.    (2.33)

Therefore we can restrict our search to the set

    F = {p ∈ W : ‖f − p‖ ≤ ‖f‖}.    (2.34)

F is closed and bounded and because W is finite-dimensional it follows that
F is compact. Now, the function p ↦ ‖f − p‖ is continuous on this compact
set and hence it attains its minimum in F.

If we remove the finite-dimensionality of W then we cannot guarantee


that there is a best approximation as the following example shows.
Example 2.1. Let V = C[0, 1/2] and W be the space of all polynomials
(clearly a subspace of V). Take f(x) = 1/(1 − x). Then, given ε > 0 there
is N such that

    max_{x∈[0,1/2]} | 1/(1 − x) − (1 + x + x² + . . . + x^N) | < ε.    (2.35)

So if there is a best approximation p∗ in the max norm, necessarily ‖f −
p∗‖∞ = 0, which implies

    p∗(x) = 1/(1 − x),    (2.36)

which is impossible.

Theorem 2.2 does not guarantee uniqueness of the best approximation. Strict
convexity of the norm gives us a sufficient condition.

Definition 2.2. A norm ‖·‖ on a vector space V is strictly convex if for all
f ≠ g in V with ‖f‖ = ‖g‖ = 1,

    ‖θf + (1 − θ)g‖ < 1,  for all 0 < θ < 1.

In other words, a norm is strictly convex if its unit ball is strictly convex.
The p-norm is strictly convex for 1 < p < ∞ but not for p = 1 or p = ∞.

Theorem 2.3. Let V be a vector space with a strictly convex norm, W a
subspace of V, and f ∈ V. If p∗ and q∗ are best approximations of f in W,
then p∗ = q∗.

Proof. Let M = ‖f − p∗‖ = ‖f − q∗‖. If p∗ ≠ q∗, by the strict convexity of
the norm,

    ‖θ(f − p∗) + (1 − θ)(f − q∗)‖ < M,  for all 0 < θ < 1.    (2.37)

Taking θ = 1/2 we get

    ‖f − (p∗ + q∗)/2‖ < M,    (2.38)

which is impossible because (p∗ + q∗)/2 is in W and cannot be a better ap-
proximation.

2.3.1 Best Uniform Polynomial Approximation


Given a continuous function f on an interval [a, b] we know there is at least
one best approximation p∗n to f, in any given norm, by polynomials of degree
at most n because the dimension of Pn is finite. The norm ‖·‖∞ is not
strictly convex so Theorem 2.3 does not apply. However, due to a special
property (called the Haar property) of the linear space Pn, which is that the
only element of Pn that has more than n roots is the zero element, it is
possible to prove that the best approximation is unique.
The crux of the matter is that the error function

    en(x) = f(x) − p∗n(x),  x ∈ [a, b],    (2.39)

has to equioscillate at at least n + 2 points, between +‖en‖∞ and −‖en‖∞. That
is, there are k points, x1, x2, . . . , xk, with k ≥ n + 2, such that

    en(x1) = ±‖en‖∞,
    en(x2) = −en(x1),
    en(x3) = −en(x2),    (2.40)
    ...
    en(xk) = −en(xk−1),

for if not, it would be possible to find a polynomial of degree at most n,
with the same sign at the extrema of en (at most n sign changes), and use
this polynomial to decrease the value of ‖en‖∞. This would contradict the
fact that p∗n is a best approximation. This is easy to see for n = 0, as it is
impossible to find a polynomial of degree 0 (a constant) with one change of
sign. This is the content of the next result.

Theorem 2.4. The error en = f − p∗n has at least two extrema x1 and x2
in [a, b] such that |en(x1)| = |en(x2)| = ‖en‖∞ and en(x1) = −en(x2), for all
n ≥ 0.

Proof. The continuous function |en(x)| attains its maximum ‖en‖∞ in at least
one point x1 in [a, b]. Suppose ‖en‖∞ = en(x1) and that en(x) > −‖en‖∞ for
all x ∈ [a, b]. Then, m = min_{x∈[a,b]} en(x) > −‖en‖∞ and we have some room
to decrease ‖en‖∞ by shifting en down a suitable amount c. In particular, if we
take c as one half the gap between the minimum m of en and −‖en‖∞,

    c = (1/2) (m + ‖en‖∞) > 0,    (2.41)

Figure 2.3: If the error function en does not equioscillate at least twice, we
could lower ‖en‖∞ by an amount c > 0.

and subtract it from en, as shown in Fig. 2.3, we have

    −‖en‖∞ + c ≤ en(x) − c ≤ ‖en‖∞ − c.    (2.42)

Therefore, ‖en − c‖∞ = ‖f − (p∗n + c)‖∞ = ‖en‖∞ − c < ‖en‖∞, but p∗n + c ∈ Pn,
so this is impossible since p∗n is a best approximation. A similar argument
can be used when en(x1) = −‖en‖∞.
Before proceeding to the general case, let us look at the n = 1 situation.
Suppose there are only two alternating extrema x1 and x2 for e1 as described
in (2.40). We are going to construct a linear polynomial that has the same
sign as e1 at x1 and x2 and which can be used to decrease ‖e1‖∞. Suppose
e1(x1) = ‖e1‖∞ and e1(x2) = −‖e1‖∞. Since e1 is continuous, we can find
small closed intervals I1 and I2, containing x1 and x2, respectively, and such
that

    e1(x) > ‖e1‖∞ / 2  for all x ∈ I1,    (2.43)
    e1(x) < −‖e1‖∞ / 2  for all x ∈ I2.    (2.44)

Clearly I1 and I2 are disjoint sets so we can choose a point x0 between the
two intervals. Then, it is possible to find a linear polynomial q that passes
through x0 and that is positive in I1 and negative in I2. We are now going

to find a suitable constant α > 0 such that ‖f − p∗1 − αq‖∞ < ‖e1‖∞. Since
p∗1 + αq ∈ P1, this would be a contradiction to the fact that p∗1 is a best
approximation.
Let R = [a, b] \ (I1 ∪ I2) and d = max_{x∈R} |e1(x)|. Clearly d < ‖e1‖∞.
Choose α such that

    0 < α < (1/(2‖q‖∞)) (‖e1‖∞ − d).    (2.45)

On I1, we have

    0 < αq(x) < (1/(2‖q‖∞)) (‖e1‖∞ − d) q(x) ≤ (1/2)(‖e1‖∞ − d) < e1(x).    (2.46)

Therefore

    |e1(x) − αq(x)| = e1(x) − αq(x) < ‖e1‖∞,  for all x ∈ I1.    (2.47)

Similarly, on I2, we can show that |e1(x) − αq(x)| < ‖e1‖∞. Finally, on R we
have

    |e1(x) − αq(x)| ≤ |e1(x)| + |αq(x)| ≤ d + (1/2)(‖e1‖∞ − d) < ‖e1‖∞.    (2.48)

Therefore, ‖e1 − αq‖∞ = ‖f − (p∗1 + αq)‖∞ < ‖e1‖∞, which contradicts the
best approximation assumption on p∗1.

Theorem 2.5. (Chebyshev Equioscillation Theorem) Let f ∈ C[a, b]. Then,
p∗n in Pn is a best uniform approximation of f if and only if there are at least
n + 2 points in [a, b] where the error en = f − p∗n equioscillates between the
values ±‖en‖∞, as defined in (2.40).

Proof. We first prove that if the error en = f − p∗n, for some p∗n ∈ Pn,
equioscillates at least n + 2 times, then p∗n is a best approximation. Suppose
the contrary. Then, there is qn ∈ Pn such that

    ‖f − qn‖∞ < ‖f − p∗n‖∞.    (2.49)

Let x1, . . . , xk, with k ≥ n + 2, be the points where en equioscillates. Then

    |f(xj) − qn(xj)| < |f(xj) − p∗n(xj)|,  j = 1, . . . , k,    (2.50)

and since

    f(xj) − p∗n(xj) = −[f(xj+1) − p∗n(xj+1)],  j = 1, . . . , k − 1,    (2.51)

we have that

    qn(xj) − p∗n(xj) = f(xj) − p∗n(xj) − [f(xj) − qn(xj)]    (2.52)

changes sign k − 1 times, i.e. at least n + 1 times, and hence qn − p∗n has at
least n + 1 zeros. But qn − p∗n ∈ Pn, so qn = p∗n, which contradicts (2.49), and
consequently p∗n has to be a best uniform approximation of f.
For the other half of the proof the idea is the same as for n = 1 but we need
to do more bookkeeping. We are going to partition [a, b] into the union of
sufficiently small subintervals so that we can guarantee that |en(t) − en(s)| ≤
‖en‖∞/2 for any two points t and s in each of the subintervals. Let us label
by I1, . . . , Ik the subintervals on which |en(x)| achieves its maximum ‖en‖∞.
Then, on each of these subintervals either en(x) > ‖en‖∞/2 or en(x) <
−‖en‖∞/2. We need to prove that en changes sign at least n + 1 times.
Going from left to right, we can label the subintervals I1, . . . , Ik as a (+)
or (−) subinterval depending on the sign of en. For definiteness, suppose I1
is a (+) subinterval; then we have the groups

    {I1, . . . , Ik1},        (+)
    {Ik1+1, . . . , Ik2},     (−)
    ...
    {Ikm+1, . . . , Ik},      (−)^m.

We have m changes of sign, so let us assume that m ≤ n. We already know
m ≥ 1. Since the sets Ikj and Ikj+1 are disjoint for j = 1, . . . , m, we can
select points t1, . . . , tm such that tj > x for all x ∈ Ikj and tj < x for all
x ∈ Ikj+1. Then, the polynomial

    q(x) = (t1 − x)(t2 − x) · · · (tm − x)    (2.53)

has the same sign as en in each of the extremal intervals I1, . . . , Ik and q ∈ Pn.
The rest of the proof is as in the n = 1 case to show that p∗n + αq would be
a better approximation to f than p∗n.
Theorem 2.6. Let f ∈ C[a, b]. The best uniform approximation p∗n to f by
elements of Pn is unique.
Proof. Suppose q∗n is also a best approximation, i.e.

    ‖en‖∞ = ‖f − p∗n‖∞ = ‖f − q∗n‖∞.

Then, the midpoint r = (p∗n + q∗n)/2 is also a best approximation, for r ∈ Pn
and

    ‖f − r‖∞ = ‖ (1/2)(f − p∗n) + (1/2)(f − q∗n) ‖∞
             ≤ (1/2) ‖f − p∗n‖∞ + (1/2) ‖f − q∗n‖∞ = ‖en‖∞.    (2.54)

Let x1, . . . , xn+2 be extremal points of f − r with the alternating property
(2.40), i.e. f(xj) − r(xj) = (−1)^{m+j} ‖en‖∞ for some integer m and j =
1, . . . , n + 2. This implies that

    [f(xj) − p∗n(xj)] / 2 + [f(xj) − q∗n(xj)] / 2 = (−1)^{m+j} ‖en‖∞,  j = 1, . . . , n + 2.    (2.55)

But |f(xj) − p∗n(xj)| ≤ ‖en‖∞ and |f(xj) − q∗n(xj)| ≤ ‖en‖∞. As a conse-
quence,

    f(xj) − p∗n(xj) = f(xj) − q∗n(xj) = (−1)^{m+j} ‖en‖∞,  j = 1, . . . , n + 2,    (2.56)

and it follows that

    p∗n(xj) = q∗n(xj),  j = 1, . . . , n + 2.    (2.57)

Therefore, p∗n − q∗n ∈ Pn vanishes at n + 2 points and hence q∗n = p∗n.

2.4 Chebyshev Polynomials


The best uniform approximation of f (x) = xn+1 in [−1, 1] by polynomials of
degree at most n can be found explicitly and the solution introduces one of
the most useful and remarkable polynomials, the Chebyshev polynomials.
Let p∗n ∈ Pn be the best uniform approximation to xn+1 in the interval
[−1, 1] and as before define the error function as en (x) = xn+1 − p∗n (x). Note
that since en is a monic polynomial (its leading coefficient is 1) of degree
n + 1, the problem of finding p∗n is equivalent to finding, among all monic
polynomials of degree n + 1, the one with the smallest deviation (in absolute
value) from zero.
According to Theorem 2.5, there exist n + 2 distinct points,

    −1 ≤ x1 < x2 < · · · < xn+2 ≤ 1,    (2.58)

such that

    en²(xj) = ‖en‖∞²,  for j = 1, . . . , n + 2.    (2.59)

Now consider the polynomial

    q(x) = ‖en‖∞² − en²(x).    (2.60)

Then, q(xj) = 0 for j = 1, . . . , n + 2. Each of the points xj in the interior of
[−1, 1] is also a local minimum of q, so necessarily q'(xj) = 0 for j = 2, . . . , n + 1.
Thus, the n points x2, . . . , xn+1 are zeros of q of multiplicity at least two.
But q is a nonzero polynomial of degree exactly 2n + 2. Therefore, x1 and
xn+2 have to be simple zeros and so x1 = −1 and xn+2 = 1. Note that the
polynomial p(x) = (1 − x²)[en'(x)]² ∈ P2n+2 has the same zeros as q and so
p = cq, for some constant c. Comparing the coefficients of the leading order
terms of p and q it follows that c = (n + 1)². Therefore, en satisfies the
ordinary differential equation

    (1 − x²)[en'(x)]² = (n + 1)² [ ‖en‖∞² − en²(x) ].    (2.61)

We know en' ∈ Pn and its n zeros are the interior points x2, . . . , xn+1. There-
fore, en' cannot change sign in [−1, x2]. Suppose it is nonnegative for x ∈
[−1, x2] (we reach the same conclusion if we assume en'(x) ≤ 0); then, taking
square roots in (2.61) we get

    en'(x) / √(‖en‖∞² − en²(x)) = (n + 1) / √(1 − x²),  for x ∈ [−1, x2].    (2.62)

Using the trigonometric substitution x = cos θ, we can integrate to obtain

    en(x) = ‖en‖∞ cos[(n + 1)θ],    (2.63)

for x = cos θ ∈ [−1, x2] with 0 < θ ≤ π, where we have chosen the constant of
integration to be zero so that en(1) = ‖en‖∞. Recall that en is a polynomial
of degree n + 1; then so is cos[(n + 1) cos⁻¹ x]. Since these two polynomials
agree on [−1, x2], (2.63) must also hold for all x in [−1, 1].

Definition 2.3. The Chebyshev polynomial (of the first kind) of degree n,
Tn is defined by

Tn (x) = cos nθ, x = cos θ, 0 ≤ θ ≤ π. (2.64)

Note that (2.64) only defines Tn for x ∈ [−1, 1]. However, once the
coefficients of this polynomial are determined we can define it for any real
(or complex) x.
Using the trigonometric identity

    cos[(n + 1)θ] + cos[(n − 1)θ] = 2 cos nθ cos θ,    (2.65)

we immediately get

    Tn+1(cos θ) + Tn−1(cos θ) = 2 Tn(cos θ) cos θ,    (2.66)

and going back to the x variable we obtain the recursion formula

    T0(x) = 1,
    T1(x) = x,    (2.67)
    Tn+1(x) = 2x Tn(x) − Tn−1(x),  n ≥ 1,

which makes it more evident that the Tn for n = 0, 1, . . . are indeed polynomials
of degree exactly n. Let us generate a few of them.

    T0(x) = 1,
    T1(x) = x,
    T2(x) = 2x · x − 1 = 2x² − 1,
    T3(x) = 2x · (2x² − 1) − x = 4x³ − 3x,    (2.68)
    T4(x) = 2x(4x³ − 3x) − (2x² − 1) = 8x⁴ − 8x² + 1,
    T5(x) = 2x(8x⁴ − 8x² + 1) − (4x³ − 3x) = 16x⁵ − 20x³ + 5x.

From these few Chebyshev polynomials, and from (2.67), we see that

    Tn(x) = 2^{n−1} x^n + lower order terms    (2.69)

and that Tn is an even (odd) function of x if n is even (odd), i.e.

    Tn(−x) = (−1)^n Tn(x).    (2.70)
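A quick Python/NumPy sketch (not from the notes) of the three-term recursion (2.67), generating coefficient arrays for T0, . . . , T5 in ascending powers of x:

    import numpy as np

    def chebyshev_coeffs(nmax):
        """Coefficients (ascending powers of x) of T_0,...,T_nmax via T_{n+1} = 2x T_n - T_{n-1}."""
        T = [np.array([1.0]), np.array([0.0, 1.0])]        # T_0 = 1, T_1 = x
        for n in range(1, nmax):
            # multiply T_n by 2x (shift coefficients up one power) and subtract T_{n-1}
            new = np.zeros(n + 2)
            new[1:] = 2.0 * T[n]
            new[:n] -= T[n - 1]
            T.append(new)
        return T

    for n, c in enumerate(chebyshev_coeffs(5)):
        print(n, c)    # e.g. T_5 -> [0, 5, 0, -20, 0, 16], i.e. 16x^5 - 20x^3 + 5x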


Going back to (2.63), since the leading order coefficient of en is 1 and that
of Tn+1 is 2^n, it follows that ‖en‖∞ = 2^{−n}. Therefore

    p∗n(x) = x^{n+1} − (1/2^n) Tn+1(x)    (2.71)

is the best uniform approximation of x^{n+1} in [−1, 1] by polynomials of degree
at most n. Equivalently, as noted in the beginning of this section, the monic
polynomial of degree n with smallest infinity norm in [−1, 1] is

    T̃n(x) = (1/2^{n−1}) Tn(x).    (2.72)

Hence, for any other monic polynomial p of degree n,

    max_{x∈[−1,1]} |p(x)| > 1/2^{n−1}.    (2.73)

The zeros and extrema of Tn are easy to find. Because Tn(x) = cos nθ
and 0 ≤ θ ≤ π, the zeros occur when θ is an odd multiple of π/2. Therefore,

    x̄j = cos( (2j + 1)π / (2n) ),  j = 0, . . . , n − 1.    (2.74)

The extrema of Tn (the points x where Tn(x) = ±1) correspond to nθ = jπ
for j = 0, 1, . . . , n, that is,

    xj = cos( jπ / n ),  j = 0, 1, . . . , n.    (2.75)

These points are called Chebyshev or Gauss-Lobatto points and are ex-
tremely useful in applications. Note that xj for j = 1, . . . , n − 1 are local
extrema. Therefore

    Tn'(xj) = 0,  for j = 1, . . . , n − 1.    (2.76)

In other words, the Chebyshev points (2.75) are the n − 1 zeros of Tn' plus
the end points x0 = 1 and xn = −1.
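A small Python/NumPy sketch (not part of the notes) computing the Chebyshev zeros (2.74) and the Chebyshev (Gauss-Lobatto) points (2.75), and checking them against Tn evaluated via its defining relation:

    import numpy as np

    def chebyshev_zeros(n):
        """Zeros of T_n, cf. (2.74): cos((2j+1) pi / (2n)), j = 0,...,n-1."""
        j = np.arange(n)
        return np.cos((2 * j + 1) * np.pi / (2 * n))

    def chebyshev_gauss_lobatto(n):
        """Chebyshev (Gauss-Lobatto) points, cf. (2.75): cos(j pi / n), j = 0,...,n."""
        return np.cos(np.arange(n + 1) * np.pi / n)

    n = 8
    Tn = lambda x: np.cos(n * np.arccos(x))                  # T_n(cos(theta)) = cos(n theta)
    print(np.max(np.abs(Tn(chebyshev_zeros(n)))))            # ~ 0 at the zeros
    print(Tn(chebyshev_gauss_lobatto(n)))                    # alternates between +1 and -1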
Using the Chain Rule we can differentiate Tn with respect to x to get

    Tn'(x) = −n sin nθ (dθ/dx) = n sin nθ / sin θ,  (x = cos θ).    (2.77)

Therefore

    Tn+1'(x)/(n + 1) − Tn−1'(x)/(n − 1) = [ sin(n + 1)θ − sin(n − 1)θ ] / sin θ,    (2.78)

and since sin(n + 1)θ − sin(n − 1)θ = 2 sin θ cos nθ, we get that

    Tn+1'(x)/(n + 1) − Tn−1'(x)/(n − 1) = 2 Tn(x).    (2.79)

The polynomial

    Un(x) = Tn+1'(x)/(n + 1) = sin(n + 1)θ / sin θ,  (x = cos θ),    (2.80)

of degree n is called the Chebyshev polynomial of the second kind. Thus, the
Chebyshev nodes (2.75) are the zeros of the polynomial

    qn+1(x) = (1 − x²) Un−1(x).    (2.81)


Chapter 3

Interpolation

3.1 Polynomial Interpolation


One of the basic tools for approximating a function or a given data set is
interpolation. In this chapter we focus on polynomial and piece-wise poly-
nomial interpolation.
The polynomial interpolation problem can be stated as follows: Given n + 1
data points (x0, f0), (x1, f1), . . . , (xn, fn), where x0, x1, . . . , xn are distinct, find
a polynomial pn ∈ Pn which satisfies the interpolation property:

    pn(x0) = f0,
    pn(x1) = f1,
    ...
    pn(xn) = fn.

The points x0 , x1 , . . . , xn are called interpolation nodes and the values f0 , f1 , . . . , fn


are data supplied to us or can come from a function f we are trying to ap-
proximate, in which case fj = f (xj ) for j = 0, 1, . . . , n.
Let us represent such polynomial as pn (x) = a0 + a1 x + · · · + an xn . Then,
the interpolation property implies

a0 + a1 x0 + · · · + an xn0 = f0 ,

a0 + a1 x1 + · · · + an xn1 = f1 ,
..
.


a0 + a1 xn + · · · + an xnn = fn .

This is a linear system of n + 1 equations in n + 1 unknowns (the polynomial


coefficients a0 , a1 , . . . , an ). In matrix form:
    
[ 1  x0  x0²  ···  x0^n ] [ a0 ]   [ f0 ]
[ 1  x1  x1²  ···  x1^n ] [ a1 ]   [ f1 ]
[ ...                   ] [ ...] = [ ...]    (3.1)
[ 1  xn  xn²  ···  xn^n ] [ an ]   [ fn ]

Does this linear system have a solution? Is this solution unique? The answer
is yes to both. Here is a simple proof. Take fj = 0 for j = 0, 1, . . . , n. Then
pn(xj) = 0 for j = 0, 1, ..., n; but pn is a polynomial of degree at most n, so it cannot have n + 1 zeros unless pn ≡ 0, which implies a0 = a1 = · · · = an = 0. That is, the homogeneous problem associated with (3.1) has only the trivial solution. Therefore, (3.1) has a unique solution.

Example 3.1. As an illustration let us consider interpolation by a linear


polynomial, p1 . Suppose we are given (x0 , f0 ) and (x1 , f1 ). We have written
p1 explicitly in the Introduction, we write it now in a different form:

x − x1 x − x0
p1 (x) = f0 + f1 . (3.2)
x0 − x1 x1 − x0

Clearly, this polynomial has degree at most 1 and satisfies the interpolation
property:

p1 (x0 ) = f0 , (3.3)
p1 (x1 ) = f1 . (3.4)

Example 3.2. Given (x0 , f0 ), (x1 , f1 ), and (x2 , f2 ) let us construct p2 ∈


P2 that interpolates these points. The way we have written p1 in (3.2) is
suggestive of how to explicitly write p2 :

(x − x1 )(x − x2 ) (x − x0 )(x − x2 ) (x − x0 )(x − x1 )


p2 (x) = f0 + f1 + f2 .
(x0 − x1 )(x0 − x2 ) (x1 − x0 )(x1 − x2 ) (x2 − x0 )(x2 − x1 )

If we define

l0(x) = (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)],    (3.5)
l1(x) = (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)],    (3.6)
l2(x) = (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)],    (3.7)

then we simply have

p2(x) = l0(x) f0 + l1(x) f1 + l2(x) f2.    (3.8)

Note that each of the polynomials (3.5), (3.6), and (3.7) is exactly of degree
2 and they satisfy lj(xk) = δjk¹. Therefore, it follows that p2 given by (3.8)
satisfies the interpolation property

p2 (x0 ) = f0 ,
p2 (x1 ) = f1 , (3.9)
p2 (x2 ) = f2 .

We can now write down the polynomial of degree at most n that interpo-
lates n + 1 given values, (x0 , f0 ), . . . , (xn , fn ), where the interpolation nodes
x0 , . . . , xn are assumed distinct. Define
lj(x) = [(x − x0) ··· (x − x_{j−1})(x − x_{j+1}) ··· (x − xn)] / [(xj − x0) ··· (xj − x_{j−1})(xj − x_{j+1}) ··· (xj − xn)]
      = ∏_{k=0, k≠j}^{n} (x − xk)/(xj − xk),   for j = 0, 1, ..., n.    (3.10)

These are called the elementary Lagrange polynomials of degree n. For sim-
plicity, we are omitting in the notation their dependence on the n + 1 nodes.
Since lj (xk ) = δjk , we have that
pn(x) = l0(x) f0 + l1(x) f1 + · · · + ln(x) fn = Σ_{j=0}^{n} lj(x) fj    (3.11)

¹ δjk is the Kronecker delta, i.e. δjk = 0 if k ≠ j and 1 if k = j.

interpolates the given data, i.e., it satisfies the interpolation property pn (xj ) =
fj for j = 0, 1, 2, . . . , n. Relation (3.11) is called the Lagrange form of the
interpolating polynomial. The following result summarizes our discussion.

Theorem 3.1. Given n + 1 values (x0, f0), . . . , (xn, fn), with x0, x1, ..., xn distinct, there is a unique polynomial pn of degree at most n such that pn(xj) = fj for j = 0, 1, . . . , n.

Proof. pn in (3.11) is of degree at most n and interpolates the data. Unique-


ness follows from the Fundamental Theorem of Algebra, as noted earlier.
Suppose there is another polynomial qn of degree at most n such that qn (xj ) =
fj for j = 0, 1, . . . , n. Consider r = pn − qn . This is a polynomial of degree
at most n and r(xj ) = pn (xj ) − qn (xj ) = fj − fj = 0 for j = 0, 1, 2, . . . , n,
which is impossible unless r ≡ 0. This implies qn = pn .
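
As a small illustration (not part of the text's own algorithms), here is a minimal Python/NumPy sketch that evaluates the Lagrange form (3.11) directly; the function name lagrange_eval and the test data are illustrative choices only.

    import numpy as np

    def lagrange_eval(x_nodes, f_vals, x):
        # Evaluate the Lagrange form (3.11) at a point x by building each
        # elementary Lagrange polynomial l_j(x) of (3.10); O(n^2) work per point.
        x_nodes = np.asarray(x_nodes, dtype=float)
        p = 0.0
        for j, fj in enumerate(f_vals):
            lj = 1.0
            for k, xk in enumerate(x_nodes):
                if k != j:
                    lj *= (x - xk) / (x_nodes[j] - xk)
            p += lj * fj
        return p

    # quick check: interpolating f(x) = x^2 at three nodes reproduces it exactly
    print(lagrange_eval([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], 1.5))   # 2.25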

3.1.1 Equispaced and Chebyshev Nodes


There are two special sets of nodes that are particularly important in ap-
plications. For convenience we are going to take the interval [−1, 1]. For a
general interval [a, b], we can do the simple change of variables

x = (a + b)/2 + [(b − a)/2] t,   t ∈ [−1, 1].    (3.12)
The uniform or equispaced nodes are given by

xj = −1 + jh, j = 0, 1, . . . , n and h = 2/n. (3.13)

These nodes yield very accurate and efficient trigonometric polynomial inter-
polation but are generally not good for (algebraic) polynomial interpolation
as we will see later.
One of the preferred set of nodes for high order, accurate, and computa-
tionally efficient polynomial interpolation is the Chebyshev or Gauss-Lobatto
set
 

xj = cos(jπ/n),   j = 0, . . . , n,    (3.14)

which, as discussed in Section 2.4, are the extrema of the Chebyshev poly-
nomial (2.64) of degree n . Note that these nodes are obtained from the

equispaced points θj = j(π/n), j = 0, 1, . . . , n in [0, π] by the one-to-one re-


lation x = cos θ, for θ ∈ [0, π]. As defined in (3.14), the nodes go from 1 to -1
so often the alternative definition xj = − cos(jπ/n) is used. The Chebyshev
nodes are not equally spaced and tend to cluster toward the end points of
the interval.
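
The following short Python sketch (an illustration with hypothetical helper names) generates both sets of nodes on a general interval [a, b] via the change of variables (3.12).

    import numpy as np

    def equispaced_nodes(n, a=-1.0, b=1.0):
        # uniform nodes (3.13) mapped to [a, b] by (3.12)
        t = -1.0 + 2.0 * np.arange(n + 1) / n
        return 0.5 * (a + b) + 0.5 * (b - a) * t

    def chebyshev_nodes(n, a=-1.0, b=1.0):
        # Chebyshev (Gauss-Lobatto) nodes (3.14) mapped to [a, b]
        t = np.cos(np.arange(n + 1) * np.pi / n)
        return 0.5 * (a + b) + 0.5 * (b - a) * t

    print(chebyshev_nodes(4))   # [ 1., 0.7071..., 0., -0.7071..., -1.]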

3.2 Connection to Best Uniform Approxima-


tion
Given a continuous function f in [a, b], its best uniform approximation p∗n in
Pn is characterized by an error, en = f − p∗n , which equioscillates, as defined
in (2.40), at least n + 2 times. Therefore en has a minimum of n + 1 zeros
and consequently there exist x0, . . . , xn such that

p*_n(x0) = f(x0),
p*_n(x1) = f(x1),
   ...                                   (3.15)
p*_n(xn) = f(xn).

In other words, p∗n is the polynomial of degree at most n that interpolates


the function f at n + 1 zeros of en . Of course, we do not construct p∗n by
finding these particular n + 1 interpolation nodes. A more practical question
is: given (x0 , f (x0 )), (x1 , f (x1 )), . . . , (xn , f (xn )), where x0 , . . . , xn are distinct
interpolation nodes in [a, b], how close is pn , the interpolating polynomial of
degree at most n of f at the given nodes, to the best uniform approximation
p∗n of f in Pn ?
To obtain a bound for kpn − p∗n k∞ we note that pn − p∗n is a polynomial of
degree at most n which interpolates f − p∗n . Therefore, we can use Lagrange
formula to represent it
n
X
pn (x) − p∗n (x) = lj (x)(f (xj ) − p∗n (xj )). (3.16)
j=0

It then follows that

kpn − p∗n k∞ ≤ Λn kf − p∗n k∞ , (3.17)



where

Λn = max_{a≤x≤b} Σ_{j=0}^{n} |lj(x)|    (3.18)

is called the Lebesgue Constant and depends only on the interpolation nodes,
not on f . On the other hand, we have that

kf − pn k∞ = kf − p∗n − pn + p∗n k∞ ≤ kf − p∗n k∞ + kpn − p∗n k∞ . (3.19)

Using (3.17) we obtain

kf − pn k∞ ≤ (1 + Λn )kf − p∗n k∞ . (3.20)

This inequality connects the interpolation error kf − pn k∞ with the best


approximation error kf − p∗n k∞ . What happens to these errors as we increase
n? To make it more concrete, suppose we have a triangular array of nodes
as follows:
x_0^(0)
x_0^(1)  x_1^(1)
x_0^(2)  x_1^(2)  x_2^(2)
  ...                                    (3.21)
x_0^(n)  x_1^(n)  ...  x_n^(n)
  ...

where a ≤ x_0^(n) < x_1^(n) < · · · < x_n^(n) ≤ b for n = 0, 1, . . .. Let pn be the interpolating polynomial of degree at most n of f at the nodes corresponding to the (n + 1)-st row of (3.21).
By the Weierstrass Approximation Theorem ( p∗n is a better approxima-
tion or at least as good as that provided by the Bernstein polynomial),

kf − p∗n k∞ → 0 as n → ∞. (3.22)

However, it can be proved that


Λn > (2/π²) log n − 1    (3.23)

and hence the Lebesgue constant is not bounded in n. Therefore, we cannot conclude from (3.20) and (3.22) that ||f − pn||_∞ → 0 as n → ∞, i.e. that the interpolating polynomial, as we add more and more nodes, converges uniformly to f. That depends on the regularity of f and on the distribution of
the nodes. In fact, if we are given the triangular array of interpolation nodes
(3.21) in advance, it is possible to construct a continuous function f such
that pn will not converge uniformly to f as n → ∞.

3.3 Barycentric Formula


The Lagrange form of the interpolating polynomial is not convenient for com-
putations. If we want to increase the degree of the polynomial we cannot
reuse the work done in getting and evaluating a lower degree one. How-
ever, we can obtain a very efficient formula by rewriting the interpolating
polynomial in the following way. Let
ω(x) = (x − x0 )(x − x1 ) · · · (x − xn ). (3.24)
Then, differentiating this polynomial of degree n + 1 and evaluating at x = xj we get

ω'(xj) = ∏_{k=0, k≠j}^{n} (xj − xk),   for j = 0, 1, . . . , n.    (3.25)

Therefore, each of the elementary Lagrange polynomials may be written as

lj(x) = [ω(x)/(x − xj)] / ω'(xj) = ω(x) / [(x − xj) ω'(xj)],   for j = 0, 1, . . . , n,    (3.26)

for x ≠ xj, and lj(xj) = 1 follows from L'Hôpital's rule. Defining

λj = 1/ω'(xj),   for j = 0, 1, . . . , n,    (3.27)
we can write the Lagrange formula for the interpolating polynomial of f at x0, x1, . . . , xn as

pn(x) = ω(x) Σ_{j=0}^{n} [λj/(x − xj)] fj.    (3.28)

Now, note that from (3.11) with f(x) ≡ 1 it follows that

1 = Σ_{j=0}^{n} lj(x) = ω(x) Σ_{j=0}^{n} λj/(x − xj).    (3.29)

Dividing (3.28) by (3.29), we get the so-called Barycentric Formula for interpolation:

pn(x) = [ Σ_{j=0}^{n} λj fj/(x − xj) ] / [ Σ_{j=0}^{n} λj/(x − xj) ],   for x ≠ xj, j = 0, 1, . . . , n.    (3.30)

For x = xj, j = 0, 1, . . . , n, the interpolation property, pn(xj) = fj, should be used.
The numbers λj depend only on the nodes x0 , x1 , ..., xn and not on given
values f0 , f1 , ..., fn . We can obtain them explicitly for both the Chebyshev
nodes (3.14) and for the equally spaced nodes (3.13) and can be precomputed
efficiently for a general set of nodes.

3.3.1 Barycentric Weights for Chebyshev Nodes


The Chebyshev nodes are the zeros of q_{n+1}(x) = (1 − x²) U_{n−1}(x), where U_{n−1}(x) = sin nθ / sin θ, x = cos θ, is the Chebyshev polynomial of the second kind of degree n − 1, with leading order coefficient 2^{n−1} [see Section 2.4]. Since the λj's can be defined up to a multiplicative constant (which would cancel out in the barycentric formula) we can take λj to be proportional to 1/q'_{n+1}(xj). Since

q_{n+1}(x) = sin θ sin nθ,    (3.31)

differentiating we get

q'_{n+1}(x) = −n cos nθ − sin nθ cot θ.    (3.32)

Thus,

q'_{n+1}(xj) = { −2n,          for j = 0,
                 −(−1)^j n,    for j = 1, . . . , n − 1,    (3.33)
                 −2n (−1)^n,   for j = n.


We can factor out −n in (3.33) to obtain the barycentric weights for the Chebyshev points

λj = { 1/2,           for j = 0,
       (−1)^j,        for j = 1, . . . , n − 1,    (3.34)
       (1/2)(−1)^n,   for j = n.

Note that for a general interval [a, b], the term (a + b)/2 in the change of
variables (3.12) cancels out in (3.25) but we gain an extra factor of [(b−a)/2]n .
However, this factor can be omitted as it does not alter the barycentric
formula. Therefore, the same barycentric weights (3.34) can also be used for
the Chebyshev nodes in an interval [a, b].

3.3.2 Barycentric Weights for Equispaced Nodes


For equispaced points, xj = x0 + jh, j = 0, 1, . . . , n, we have

λj = 1 / [(xj − x0) ··· (xj − x_{j−1})(xj − x_{j+1}) ··· (xj − xn)]
   = 1 / [(jh)((j − 1)h) ··· (h)(−h)(−2h) ··· ((j − n)h)]
   = 1 / [(−1)^{n−j} h^n (j(j − 1) ··· 1)(1 · 2 ··· (n − j))]
   = [1 / ((−1)^{n−j} h^n n!)] · n!/(j!(n − j)!)
   = [1 / ((−1)^n h^n n!)] (−1)^j (n choose j).

We can omit the factor 1/((−1)^n h^n n!) because it cancels out in the barycentric formula. Thus, for equispaced nodes we can use

λj = (−1)^j (n choose j),   j = 0, 1, . . . , n.    (3.35)

3.3.3 Barycentric Weights for General Sets of Nodes


For general arrays of nodes we can precompute the barycentric weights effi-
ciently as follows.

λ_0^(0) = 1;
for m = 1 : n
    for j = 0 : m − 1
        λ_j^(m) = λ_j^(m−1) / (xj − xm);
    end
    λ_m^(m) = 1 / ∏_{k=0}^{m−1} (xm − xk);
end

If we want to add one more point (x_{n+1}, f_{n+1}) we just extend the m-loop to n + 1 to generate λ_0^(n+1), λ_1^(n+1), · · · , λ_{n+1}^(n+1).
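
Below is a small Python/NumPy sketch of the barycentric approach: it precomputes the weights λj = 1/ω'(xj) by direct products (an O(n²) variant of the loop above, adequate for modest n) and evaluates (3.30). Function names are illustrative only.

    import numpy as np

    def barycentric_weights(x):
        # lambda_j = 1 / prod_{k != j} (x_j - x_k), cf. (3.25) and (3.27)
        x = np.asarray(x, dtype=float)
        lam = np.ones(len(x))
        for j in range(len(x)):
            for k in range(len(x)):
                if k != j:
                    lam[j] /= (x[j] - x[k])
        return lam

    def barycentric_eval(x_nodes, f_vals, lam, t):
        # evaluate the barycentric formula (3.30) at the point t
        x_nodes = np.asarray(x_nodes, dtype=float)
        diff = t - x_nodes
        hit = np.isclose(diff, 0.0)
        if hit.any():                      # t coincides with a node
            return np.asarray(f_vals, dtype=float)[hit][0]
        w = lam / diff
        return np.dot(w, f_vals) / np.sum(w)

    nodes = np.array([0.0, 1.0, 2.0, 3.0])
    vals = 1.0 + nodes**2
    lam = barycentric_weights(nodes)
    print(barycentric_eval(nodes, vals, lam, 2.5))   # 7.25 = 1 + 2.5^2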

3.4 Newton’s Form and Divided Differences


There is another representation of the interpolating polynomial which is both
very efficient computationally and very convenient in the derivation of nu-
merical methods based on interpolation. The idea of this representation, due
to Newton, is to use successively lower order polynomials for constructing
pn .
Suppose we have gotten pn−1 ∈ Pn−1 , the interpolating polynomial of
(x0 , f0 ), (x1 , f1 ), . . . , (xn−1 , fn−1 ) and we would like to obtain pn ∈ Pn , the in-
terpolating polynomial of (x0 , f0 ), (x1 , f1 ), . . . , (xn , fn ) by reusing pn−1 . The
difference between these polynomials, r = pn −pn−1 , is a polynomial of degree
at most n. Moreover, for j = 0, . . . , n − 1

r(xj ) = pn (xj ) − pn−1 (xj ) = fj − fj = 0. (3.36)

Therefore, r can be factored as

r(x) = cn (x − x0 )(x − x1 ) · · · (x − xn−1 ). (3.37)

The constant cn is called the n-th divided difference of the values f0, f1, . . . , fn with respect to x0, x1, ..., xn, and is usually denoted by f[x0, . . . , xn]. Thus,
we have

pn (x) = pn−1 (x) + f [x0 , . . . , xn ](x − x0 )(x − x1 ) · · · (x − xn−1 ). (3.38)



By the same argument, we have

pn−1 (x) = pn−2 (x) + f [x0 , . . . , xn−1 ](x − x0 )(x − x1 ) · · · (x − xn−2 ), (3.39)

etc. So we arrive at Newton’s Form of pn :

pn(x) = f[x0] + f[x0, x1](x − x0) + · · · + f[x0, . . . , xn](x − x0) · · · (x − x_{n−1}).    (3.40)

Note that for n = 1

p1 (x) = f [x0 ] + f [x0 , x1 ](x − x0 ),


p1 (x0 ) = f [x0 ] = f0 ,
p1 (x1 ) = f [x0 ] + f [x0 , x1 ](x1 − x0 ) = f1 .

Therefore

f[x0] = f0,    (3.41)
f[x0, x1] = (f1 − f0)/(x1 − x0),    (3.42)

and

p1(x) = f0 + [(f1 − f0)/(x1 − x0)] (x − x0).    (3.43)
Define f [xj ] = fj for j = 0, 1, ...n. The following identity will allow us to
compute all the required divided differences.

Theorem 3.2.

f[x0, x1, ..., xk] = ( f[x1, x2, ..., xk] − f[x0, x1, ..., x_{k−1}] ) / (xk − x0).    (3.44)
Proof. Let pk−1 be the interpolating polynomial of degree at most k − 1 of
(x1 , f1 ), . . . , (xk , fk ) and qk−1 the interpolating polynomial of degree at most
k − 1 of (x0 , f0 ), . . . , (xk−1 , fk−1 ). Then
p(x) = p_{k−1}(x) + [(x − xk)/(xk − x0)] [p_{k−1}(x) − q_{k−1}(x)]    (3.45)

is a polynomial of degree at most k, and for j = 1, 2, . . . , k − 1

p(xj) = fj + [(xj − xk)/(xk − x0)] [fj − fj] = fj.
Moreover, p(x0) = q_{k−1}(x0) = f0 and p(xk) = p_{k−1}(xk) = fk. Therefore, p is the interpolating polynomial of degree at most k of the points (x0, f0), (x1, f1), . . . , (xk, fk). Its leading order coefficient is f[x0, ..., xk], and equating this with the leading order coefficient of the right hand side of (3.45),

( f[x1, ..., xk] − f[x0, x1, ..., x_{k−1}] ) / (xk − x0),

gives (3.44).
To obtain the divided differences we construct a table using (3.44), column by column, as illustrated below for n = 3.

xj    0th order   1st order      2nd order           3rd order
x0    f0
                  f[x0, x1]
x1    f1                         f[x0, x1, x2]
                  f[x1, x2]                          f[x0, x1, x2, x3]
x2    f2                         f[x1, x2, x3]
                  f[x2, x3]
x3    f3

Example 3.3. Let f(x) = 1 + x², xj = j, and fj = f(xj) for j = 0, . . . , 3. Then

xj    0th order   1st order             2nd order            3rd order
0     1
                  (2−1)/(1−0) = 1
1     2                                 (3−1)/(2−0) = 1
                  (5−2)/(2−1) = 3                            0
2     5                                 (5−3)/(3−1) = 1
                  (10−5)/(3−2) = 5
3     10
so

p3 (x) = 1 + 1(x − 0) + 1(x − 0)(x − 1) + 0(x − 0)(x − 1)(x − 2) = 1 + x2 .



After computing the divided differences, we need to evaluate pn at a given


point x. This can be done efficiently by suitably factoring it. For example,
for n = 3 we have

p3 (x) = c0 + c1 (x − x0 ) + c2 (x − x0 )(x − x1 ) + c3 (x − x0 )(x − x1 )(x − x2 )


= c0 + (x − x0 ) {c1 + (x − x1 )[c2 + (x − x2 )c3 ]}

For general n we can use the following Horner-like scheme (looping downward from k = n − 1 to 0) to get p = pn(x):

p = c_n;
for k = n − 1 : −1 : 0
    p = c_k + (x − x_k) ∗ p;
end
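
As a sketch (names and data are illustrative, not prescribed by the text), the divided-difference table and the Horner-like evaluation can be coded in Python as follows; the example reproduces the table of Example 3.3.

    import numpy as np

    def divided_differences(x, f):
        # Newton coefficients c_k = f[x_0,...,x_k] via the recursion (3.44),
        # building the table column by column in place.
        x = np.asarray(x, dtype=float)
        c = np.array(f, dtype=float)
        for k in range(1, len(x)):
            c[k:] = (c[k:] - c[k-1:-1]) / (x[k:] - x[:-k])
        return c

    def newton_eval(x, c, t):
        # Horner-like evaluation of the Newton form (3.40) at t
        p = c[-1]
        for k in range(len(c) - 2, -1, -1):
            p = c[k] + (t - x[k]) * p
        return p

    x = [0.0, 1.0, 2.0, 3.0]; f = [1.0, 2.0, 5.0, 10.0]   # Example 3.3
    c = divided_differences(x, f)
    print(c)                        # [1. 1. 1. 0.]
    print(newton_eval(x, c, 1.5))   # 3.25 = 1 + 1.5^2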

3.5 Cauchy Remainder


We now assume that the data fj = f (xj ), j = 0, 1, . . . , n come from a
sufficiently smooth function f , which we are trying to approximate with
an interpolating polynomial pn , and we focus on the error f − pn of such
approximation.
In the Introduction we proved that if x0 , x1 , and x are in [a, b] and
f ∈ C 2 [a, b] then

f(x) − p1(x) = (1/2) f''(ξ(x)) (x − x0)(x − x1),
where p1 is the polynomial of degree at most 1 that interpolates (x0 , f (x0 )),
(x1 , f (x1 )) and ξ(x) ∈ (a, b). The general result about the interpolation error
is the following theorem:

Theorem 3.3. Let f ∈ C^{n+1}[a, b], let x0, x1, ..., xn, x be contained in [a, b], and let pn be the interpolating polynomial of degree at most n of f at x0, ..., xn. Then

f(x) − pn(x) = [1/(n + 1)!] f^{(n+1)}(ξ(x)) (x − x0)(x − x1) · · · (x − xn),    (3.46)

where min{x0, . . . , xn, x} < ξ(x) < max{x0, . . . , xn, x}.

Proof. The right hand side of (3.46) is known as the Cauchy Remainder and
the following proof is due to Cauchy.

For x equal to one of the nodes xj the result is trivially true. Take x fixed, not equal to any of the nodes, and define

φ(t) = f(t) − pn(t) − [f(x) − pn(x)] · [(t − x0)(t − x1) · · · (t − xn)] / [(x − x0)(x − x1) · · · (x − xn)].    (3.47)

Clearly, φ ∈ C^{n+1}[a, b] and vanishes at t = x0, x1, ..., xn, x. That is, φ has at least n + 2 zeros. Applying Rolle's Theorem n + 1 times we conclude that there exists a point ξ(x) ∈ (a, b) such that φ^{(n+1)}(ξ(x)) = 0. Therefore,

0 = φ^{(n+1)}(ξ(x)) = f^{(n+1)}(ξ(x)) − [f(x) − pn(x)] · (n + 1)! / [(x − x0)(x − x1) · · · (x − xn)],

from which (3.46) follows. Note that the repeated application of Rolle's theorem implies that ξ(x) is between min{x0, x1, ..., xn, x} and max{x0, x1, ..., xn, x}.

We are now going to find a beautiful connection between Chebyshev


polynomials and the interpolation error as given by the Cauchy remainder
(3.46). Let us consider the interval [−1, 1]. We have no control on the term
f (n+1) (ξ(x)). However, we can choose the interpolation nodes x0 , . . . , xn so
that the factor

w(x) = (x − x0 )(x − x1 ) · · · (x − xn ) (3.48)

is smallest as possible in the infinity norm. The function w is a monic poly-


nomial of degree n + 1 and we have proved in Section 2.4 that the Chebyshev
polynomial T̃_{n+1}, defined in (2.72), is the monic polynomial of degree n + 1 with smallest infinity norm. Hence, if the interpolation nodes are taken to be the zeros of T̃_{n+1}, namely

xj = cos( (2j + 1)π / (2(n + 1)) ),   j = 0, 1, . . . , n,    (3.49)

then ||w||_∞ is minimized and ||w||_∞ = 2^{−n}. The following theorem summarizes this observation.

Theorem 3.4. Let p_n^T be the interpolating polynomial of degree at most n of f ∈ C^{n+1}[−1, 1] with respect to the nodes (3.49). Then

||f − p_n^T||_∞ ≤ [1/(2^n (n + 1)!)] ||f^{(n+1)}||_∞.    (3.50)

The Gauss-Lobatto Chebyshev points,

xj = cos(jπ/n),   j = 0, 1, . . . , n,    (3.51)

which are the extrema and not the zeros of the corresponding Chebyshev polynomial, do not minimize max_{x∈[−1,1]} |w(x)|. However, they are nearly optimal. More precisely, the Gauss-Lobatto nodes are the zeros of the (monic) polynomial [see (2.81) and (3.31)]

(1/2^{n−1}) (1 − x²) U_{n−1}(x) = (1/2^{n−1}) sin θ sin nθ,   x = cos θ,    (3.52)

so that

||w||_∞ = max_{x∈[−1,1]} | (1/2^{n−1}) (1 − x²) U_{n−1}(x) | ≤ 1/2^{n−1}.    (3.53)

So, the Gauss-Lobatto nodes yield a ||w||_∞ of no more than a factor of two from the optimal value.
We now relate divided differences to the derivatives of f using the Cauchy
remainder. Take an arbitrary point t distinct from x0 , . . . , xn . Let pn+1 be
the interpolating polynomial of f at x0 , . . . , xn , t and pn that at x0 , . . . , xn .
Then, Newton’s formula (3.38) implies

pn+1 (x) = pn (x) + f [x0 , . . . , xn , t](x − x0 )(x − x1 ) · · · (x − xn ). (3.54)

Noting that pn+1 (t) = f (t) we get

f (t) = pn (t) + f [x0 , . . . , xn , t](t − x0 )(t − x1 ) · · · (t − xn ). (3.55)

Since t was arbitrary we can set t = x and obtain

f (x) = pn (x) + f [x0 , . . . , xn , x](x − x0 )(x − x1 ) · · · (x − xn ), (3.56)

and upon comparing with the Cauchy remainder we get

f[x0, ..., xn, x] = f^{(n+1)}(ξ(x)) / (n + 1)!.    (3.57)

If we set x = x_{n+1} and relabel n + 1 by k we have

f[x0, ..., xk] = (1/k!) f^{(k)}(ξ),    (3.58)

where min{x0, . . . , xk} < ξ < max{x0, . . . , xk}. Suppose that we now let x1, ..., xk → x0. Then ξ → x0 and

lim_{x1,...,xk → x0} f[x0, ..., xk] = (1/k!) f^{(k)}(x0).    (3.59)
We can use this relation to define a divided difference with coincident nodes. For example, f[x0, x1] when x1 = x0 is defined by f[x0, x0] = f'(x0), etc. This is going to be very useful for the following interpolation problem.

3.6 Hermite Interpolation


The Hermite interpolation problem is: given values of f and some of its
derivatives at the nodes x0 , x1 , ..., xn , find the polynomial of smallest degree
interpolating those values. This polynomial is called the Hermite Interpo-
lation Polynomial and can be obtained with a minor modification to the
Newton’s form representation.
For example: Suppose we look for a polynomial p of lowest degree which
satisfies the interpolation conditions:

p(x0 ) = f (x0 ),
p0 (x0 ) = f 0 (x0 ),
p(x1 ) = f (x1 ),
p0 (x1 ) = f 0 (x1 ).

We can view this problem as a limiting case of polynomial interpolation of


f at two pairs of coincident nodes, x0 , x0 , x1 , x1 and we can use Newton’s
Interpolation form to obtain p. The table of divided differences, in view of
(3.59), is

x0   f(x0)
x0   f(x0)   f'(x0)
x1   f(x1)   f[x0, x1]   f[x0, x0, x1]                               (3.60)
x1   f(x1)   f'(x1)      f[x0, x1, x1]   f[x0, x0, x1, x1]

and
p(x) = f(x0) + f'(x0)(x − x0) + f[x0, x0, x1](x − x0)²
       + f[x0, x0, x1, x1](x − x0)²(x − x1).    (3.61)

Example 3.4. Let f(0) = 1, f'(0) = 0 and f(1) = √2. Find the Hermite interpolation polynomial.
We construct the table of divided differences as follows:

0   1
0   1    0                                   (3.62)
1   √2   √2 − 1   √2 − 1

and therefore

p(x) = 1 + 0(x − 0) + (√2 − 1)(x − 0)² = 1 + (√2 − 1)x².    (3.63)
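
For the case of table (3.60) (values and first derivatives prescribed at every node), a Python sketch of the repeated-node divided-difference construction is given below; the data in the test are hypothetical, chosen only to check the interpolation conditions.

    import numpy as np

    def hermite_newton_coeffs(x, f, fp):
        # doubled nodes z and divided differences using f[x_j, x_j] = f'(x_j), cf. (3.59)
        x = np.asarray(x, dtype=float)
        z = np.repeat(x, 2)
        c0 = np.repeat(np.asarray(f, dtype=float), 2)
        first = np.empty(len(z) - 1)
        for i in range(len(z) - 1):
            if z[i + 1] == z[i]:
                first[i] = fp[i // 2]                        # repeated node: use derivative
            else:
                first[i] = (c0[i + 1] - c0[i]) / (z[i + 1] - z[i])
        cols = [c0, first]
        for k in range(2, len(z)):
            prev = cols[-1]
            cols.append((prev[1:] - prev[:-1]) / (z[k:] - z[:-k]))
        return z, np.array([col[0] for col in cols])         # diagonal of the table

    def newton_eval(z, c, t):
        p = c[-1]
        for k in range(len(c) - 2, -1, -1):
            p = c[k] + (t - z[k]) * p
        return p

    # hypothetical data: p(0)=1, p'(0)=0, p(1)=2, p'(1)=3  =>  p(t) = 1 + t^3
    z, c = hermite_newton_coeffs([0.0, 1.0], [1.0, 2.0], [0.0, 3.0])
    print(newton_eval(z, c, 0.5))   # 1.125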

3.7 Convergence of Polynomial Interpolation


From the Cauchy remainder formula

f(x) − pn(x) = [1/(n + 1)!] f^{(n+1)}(ξ(x)) (x − x0)(x − x1) · · · (x − xn)    (3.64)

it is clear that the accuracy and convergence of the interpolation polynomial


pn of f depends on both the smoothness of f and the distribution of nodes
x0 , x1 , . . . , xn .
In the Runge example,

f(x) = 1/(1 + 25x²),   x ∈ [−1, 1],

f is very smooth. It has an infinite number of continuous derivatives, i.e. f ∈ C^∞[−1, 1] (in fact f is real analytic on the whole real line, i.e. its Taylor series converges to f(x) for every x ∈ R). Nevertheless, for the equispaced nodes (3.13) pn does not converge uniformly to f as n → ∞. In fact it diverges quite dramatically toward the end points of the interval.
On the other hand, there is fast and uniform convergence of pn to f when
the Chebyshev nodes (3.14) are employed.
It is then natural to ask: Given any f ∈ C[a, b], can we guarantee that ||f − pn||_∞ → 0 if we choose the Chebyshev nodes? The answer is no. Bernstein and Faber proved in 1914 that given any distribution of points, organized in

a triangular array (3.21), it is possible to construct a continuous function f


for which its interpolating polynomial pn (corresponding to the nodes on the
n-th row of (3.21)) will not converge uniformly to f as n → ∞. However,
if f is slightly smoother, for example f ∈ C¹[a, b], then for the Chebyshev array of nodes ||f − pn||_∞ → 0.
If f(x) = e^{−x²}, there is convergence of pn even with the equidistributed nodes. What is so special about this function? The function

f(z) = e^{−z²},    (3.65)

z = x + iy, is analytic in the entire complex plane. Using complex variables analysis it can be shown that if f is analytic in a sufficiently large region of the complex plane containing [a, b], then ||f − pn||_∞ → 0. Just how large does the region of analyticity need to be? It depends on the asymptotic distribution of the nodes as n → ∞.
In the limit as n → ∞, we can think of the nodes as a continuum with a density ρ so that for sufficiently large n,

(n + 1) ∫_a^x ρ(t) dt    (3.66)

is the total number of nodes in [a, x]. Take for example [−1, 1]. For equispaced nodes ρ(x) = 1/2 and for the Chebyshev nodes ρ(x) = 1/(π√(1 − x²)).
It turns out that the relevant domain of analyticity is given in terms of
the function
φ(z) = − ∫_a^b ρ(t) ln |z − t| dt.    (3.67)

Let Γc be the level curve consisting of all the points z ∈ C such that φ(z) = c
for c constant. For very large and negative c, Γc approximates a large circle.
As c is increased, Γc shrinks. We take the “smallest” level curve, Γc0 , which
contains [a, b]. The relevant domain of analyticity is

Rc0 = {z ∈ C : φ(z) ≥ c0 }. (3.68)

Then, if f is analytic in Rc0 , kf − pn k∞ → 0, not only in [a, b] but for every


point in Rc0 . Moreover,

|f (x) − pn (x)| ≤ Ce−N (φ(x)−c0 ) , (3.69)



for some constant C. That is pn converges exponentially fast to f . For


the Chebyshev nodes Rc0 approximates [a, b], so if f is analytic in any region
containing [a, b], however thin this region might be, pn will converge uniformly
to f . For equidistributed nodes, Rc0 looks like a football, with [a, b] as its
longest axis. In the Runge example, the function is singular at z = ±i/5, which happens to be inside this football-like domain, and this explains the observed lack of convergence for this particular function.
The moral of the story is that polynomial interpolation using Chebyshev
nodes converges very rapidly for smooth functions and thus yields very ac-
curate approximations.

3.8 Piece-wise Linear Interpolation


One way to reduce the error in linear interpolation is to divide [a, b] into small subintervals [x0, x1], ..., [x_{n−1}, xn]. In each of the subintervals [xj, x_{j+1}] we approximate f by

p(x) = f(xj) + [(f(x_{j+1}) − f(xj)) / (x_{j+1} − xj)] (x − xj),   x ∈ [xj, x_{j+1}].    (3.70)

We know that

f(x) − p(x) = (1/2) f''(ξ) (x − xj)(x − x_{j+1}),   x ∈ [xj, x_{j+1}],    (3.71)
where ξ is some point between xj and x_{j+1}. Suppose that |f''(x)| ≤ M2 for all x ∈ [a, b]. Then

|f(x) − p(x)| ≤ (1/2) M2 max_{xj ≤ x ≤ x_{j+1}} |(x − xj)(x − x_{j+1})|.    (3.72)

Now the max on the right hand side is attained at the midpoint (xj + x_{j+1})/2 and

max_{xj ≤ x ≤ x_{j+1}} |(x − xj)(x − x_{j+1})| = [(x_{j+1} − xj)/2]² = (1/4) hj²,    (3.73)

where hj = x_{j+1} − xj. Therefore

max_{xj ≤ x ≤ x_{j+1}} |f(x) − p(x)| ≤ (1/8) M2 hj².    (3.74)

If we want this error to be smaller than a prescribed tolerance δ we can take sufficiently small subintervals. Namely, we can pick hj such that (1/8) M2 hj² ≤ δ, which implies that

hj ≤ sqrt(8δ/M2).    (3.75)

3.9 Cubic Splines


Several applications require a smoother curve than that provided by a piecewise linear approximation. Continuity of the first and second derivatives provides the required smoothness.
One of the most frequently used such approximations is the cubic spline, which is a piecewise cubic function s that interpolates a set of points (x0, f0), (x1, f1), . . . , (xn, fn) and has two continuous derivatives. In each subinterval [xj, x_{j+1}], s is a cubic polynomial, which we may represent as

sj (x) = Aj (x − xj )3 + Bj (x − xj )2 + Cj (x − xj ) + Dj . (3.76)

Let

hj = xj+1 − xj . (3.77)

The spline s(x) interpolates the given data:

sj (xj ) = fj = Dj , (3.78)
sj (xj+1 ) = Aj h3j + Bj h2j + Cj hj + Dj = fj+1 . (3.79)

Now s0j (x) = 3Aj (x − xj )2 + 2Bj (x − xj ) + Cj and s00j (x) = 6Aj (x − xj ) + 2Bj .
Therefore

s0j (xj ) = Cj , (3.80)


s0j (xj+1 ) = 3Aj h2j + 2Bj hj + Cj , (3.81)
s00j (xj ) = 2Bj , (3.82)
s00j (xj+1 ) = 6Aj hj + 2Bj . (3.83)

We are going to write the spline coefficients Aj , Bj , Cj , and Dj in terms of


fj and fj+1 and the unknown values zj = s00j (xj ) and zj+1 = s00j (xj+1 ). We

have

Dj = fj,
Bj = (1/2) zj,
6 Aj hj + 2 Bj = z_{j+1}  ⟹  Aj = (z_{j+1} − zj)/(6 hj),

and substituting these values in (3.79) we get

Cj = (f_{j+1} − fj)/hj − (1/6) hj (z_{j+1} + 2 zj).

Let us collect all our formulas for the spline coefficients:

Aj = (z_{j+1} − zj)/(6 hj),    (3.84)
Bj = (1/2) zj,    (3.85)
Cj = (f_{j+1} − fj)/hj − (1/6) hj (z_{j+1} + 2 zj),    (3.86)
Dj = fj.    (3.87)

Note that the second derivative of s is continuous, s_j''(x_{j+1}) = z_{j+1} = s_{j+1}''(x_{j+1}), and by construction s interpolates the given data. We are now going to use the condition of continuity of the first derivative of s to determine equations for the unknown values zj, j = 1, 2, . . . , n − 1:

s_j'(x_{j+1}) = 3 Aj hj² + 2 Bj hj + Cj
             = 3 [(z_{j+1} − zj)/(6 hj)] hj² + 2 [(1/2) zj] hj + (f_{j+1} − fj)/hj − (1/6) hj (z_{j+1} + 2 zj)
             = (f_{j+1} − fj)/hj + (1/6) hj (2 z_{j+1} + zj).

Decreasing the index by 1 we get

s_{j−1}'(xj) = (fj − f_{j−1})/h_{j−1} + (1/6) h_{j−1} (2 zj + z_{j−1}).    (3.88)

Continuity of the first derivative at an interior node means s_{j−1}'(xj) = s_j'(xj) for j = 1, 2, ..., n − 1. Therefore

(fj − f_{j−1})/h_{j−1} + (1/6) h_{j−1} (2 zj + z_{j−1}) = Cj = (f_{j+1} − fj)/hj − (1/6) hj (z_{j+1} + 2 zj),

which can be written as

h_{j−1} z_{j−1} + 2(h_{j−1} + hj) zj + hj z_{j+1} = −(6/h_{j−1})(fj − f_{j−1}) + (6/hj)(f_{j+1} − fj),   j = 1, . . . , n − 1.    (3.89)
This is a linear system of n − 1 equations for the n − 1 unknowns z1, z2, . . . , z_{n−1}. In matrix form

[ 2(h0 + h1)   h1                                            ] [ z1      ]   [ d1      ]
[ h1           2(h1 + h2)   h2                               ] [ z2      ]   [ d2      ]
[              ...          ...         ...                  ] [ ...     ] = [ ...     ]    (3.90)
[                           h_{n−2}     2(h_{n−2} + h_{n−1}) ] [ z_{n−1} ]   [ d_{n−1} ]

where

d1      = −(6/h0)(f1 − f0) + (6/h1)(f2 − f1) − h0 z0,
d2      = −(6/h1)(f2 − f1) + (6/h2)(f3 − f2),
  ...                                                                           (3.91)
d_{n−2} = −(6/h_{n−3})(f_{n−2} − f_{n−3}) + (6/h_{n−2})(f_{n−1} − f_{n−2}),
d_{n−1} = −(6/h_{n−2})(f_{n−1} − f_{n−2}) + (6/h_{n−1})(fn − f_{n−1}) − h_{n−1} zn.
Note that z0 = f0'' and zn = fn'' are unspecified. The values z0 = zn = 0 define what is called a Natural Spline. The matrix of the linear system (3.90) is strictly diagonally dominant, a concept we make precise in the definition below. A consequence of this property, as we will see shortly, is that the matrix is nonsingular and therefore there is a unique solution for z1, z2, . . . , z_{n−1}, corresponding to the second derivative values of the spline at the interior nodes.
Definition 3.1. An n × n matrix A with entries a_ij, i, j = 1, . . . , n, is strictly diagonally dominant if

|a_ii| > Σ_{j=1, j≠i}^{n} |a_ij|,   for i = 1, . . . , n.    (3.92)

Theorem 3.5. Let A be a strictly diagonally dominant matrix. Then A is nonsingular.

Proof. Suppose the contrary, that is, there is x ≠ 0 such that Ax = 0. Let k be an index such that |x_k| = ||x||_∞. Then, the k-th equation in Ax = 0 gives

a_kk x_k + Σ_{j=1, j≠k}^{n} a_kj x_j = 0    (3.93)

and consequently

|a_kk| |x_k| ≤ Σ_{j=1, j≠k}^{n} |a_kj| |x_j|.    (3.94)

Dividing by |x_k|, which by assumption is nonzero, and using that |x_j|/|x_k| ≤ 1 for all j = 1, . . . , n, we get

|a_kk| ≤ Σ_{j=1, j≠k}^{n} |a_kj|,    (3.95)

which contradicts the fact that A is strictly diagonally dominant.


If the nodes are equidistributed, xj = x0 + jh for j = 0, 1, . . . , n, then the linear system (3.90), after dividing by h, simplifies to

[ 4  1              ] [ z1      ]   [ (6/h²)(f0 − 2f1 + f2) − z0             ]
[ 1  4  1           ] [ z2      ]   [ (6/h²)(f1 − 2f2 + f3)                  ]
[    ... ... ...    ] [ ...     ] = [ ...                                    ]    (3.96)
[        1  4  1    ] [ z_{n−2} ]   [ (6/h²)(f_{n−3} − 2f_{n−2} + f_{n−1})   ]
[           1  4    ] [ z_{n−1} ]   [ (6/h²)(f_{n−2} − 2f_{n−1} + fn) − zn   ]

Once the z1 , z2 , . . . , zn−1 are found the spline coefficients can be computed
from (3.84)-(3.87).
Example 3.5. Find the natural cubic spline that interpolates (0, 0), (1, 1), (2, 16).
We know z0 = 0 and z2 = 0. We only need to find z1 (only 1 interior node).
The system (3.89) degenerates to just one equation and h = 1, thus
z0 + 4z1 + z2 = 6[f0 − 2f1 + f2 ] ⇒ z1 = 21

In [0, 1] we have

A0 = (z1 − z0)/(6h) = (1/6) × 21 = 7/2,
B0 = (1/2) z0 = 0,
C0 = (f1 − f0)/h − (1/6) h (z1 + 2 z0) = 1 − (1/6) 21 = −5/2,
D0 = f0 = 0.

Thus, s0(x) = A0 (x − 0)³ + B0 (x − 0)² + C0 (x − 0) + D0 = (7/2) x³ − (5/2) x. Now in [1, 2]

A1 = (z2 − z1)/(6h) = (1/6)(−21) = −7/2,
B1 = (1/2) z1 = 21/2,
C1 = (f2 − f1)/h − (1/6) h (z2 + 2 z1) = 16 − 1 − (1/6)(2 · 21) = 8,
D1 = f1 = 1,

and s1(x) = −(7/2)(x − 1)³ + (21/2)(x − 1)² + 8(x − 1) + 1. Therefore the spline is given by

s(x) = { (7/2) x³ − (5/2) x,                                x ∈ [0, 1],
         −(7/2)(x − 1)³ + (21/2)(x − 1)² + 8(x − 1) + 1,    x ∈ [1, 2].
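
A short Python/NumPy sketch that assembles the system (3.89) for a natural spline and recovers the coefficients (3.84)-(3.87) is given below; it is only an illustration (a dense library solve stands in for the tridiagonal algorithm of Section 3.9.1) and it reproduces Example 3.5.

    import numpy as np

    def natural_spline_coeffs(x, f):
        # solve (3.89) with z_0 = z_n = 0, then apply (3.84)-(3.87)
        x = np.asarray(x, dtype=float); f = np.asarray(f, dtype=float)
        n = len(x) - 1
        h = np.diff(x)
        z = np.zeros(n + 1)
        if n > 1:
            A = np.zeros((n - 1, n - 1)); d = np.zeros(n - 1)
            for i in range(1, n):
                A[i-1, i-1] = 2.0 * (h[i-1] + h[i])
                if i > 1:     A[i-1, i-2] = h[i-1]
                if i < n - 1: A[i-1, i]   = h[i]
                d[i-1] = 6.0 * ((f[i+1] - f[i]) / h[i] - (f[i] - f[i-1]) / h[i-1])
            z[1:n] = np.linalg.solve(A, d)
        Aj = (z[1:] - z[:-1]) / (6.0 * h)
        Bj = z[:-1] / 2.0
        Cj = (f[1:] - f[:-1]) / h - h * (z[1:] + 2.0 * z[:-1]) / 6.0
        Dj = f[:-1]
        return Aj, Bj, Cj, Dj

    print(natural_spline_coeffs([0.0, 1.0, 2.0], [0.0, 1.0, 16.0]))
    # (array([ 3.5, -3.5]), array([ 0. , 10.5]), array([-2.5,  8. ]), array([0., 1.]))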

3.9.1 Solving the Tridiagonal System


The matrix of coefficients of the linear system (3.90) has the tridiagonal form

 
a1 b 1
 c 1 a2 b 2 
 
. .
c2 . . . .
 
 
.. .. ..
 
A=
 . . . 
 (3.97)
 ... .. .. 

 . . 

 . .. ... 
 bN −1 
cN −1 aN

where, for the natural spline (n − 1) × (n − 1) system (3.90), the non-zero tridiagonal entries are

aj = 2(h_{j−1} + hj),   j = 1, 2, . . . , n − 1,    (3.98)
bj = hj,   j = 1, 2, . . . , n − 2,    (3.99)
cj = hj,   j = 1, 2, . . . , n − 2.    (3.100)
We can solve the corresponding linear system of equations Ax = d by
factoring this matrix A into the product of a lower triangular matrix L and
an upper triangular matrix U . To illustrate the idea let us take N = 5. A
5 × 5 tridiagonal linear system has the form
    
[ a1  b1  0   0   0  ] [ x1 ]   [ d1 ]
[ c1  a2  b2  0   0  ] [ x2 ]   [ d2 ]
[ 0   c2  a3  b3  0  ] [ x3 ] = [ d3 ]    (3.101)
[ 0   0   c3  a4  b4 ] [ x4 ]   [ d4 ]
[ 0   0   0   c4  a5 ] [ x5 ]   [ d5 ]
and we seek a factorization of the form

[ a1  b1  0   0   0  ]   [ 1   0   0   0   0 ] [ m1  u1  0   0   0  ]
[ c1  a2  b2  0   0  ]   [ l1  1   0   0   0 ] [ 0   m2  u2  0   0  ]
[ 0   c2  a3  b3  0  ] = [ 0   l2  1   0   0 ] [ 0   0   m3  u3  0  ]
[ 0   0   c3  a4  b4 ]   [ 0   0   l3  1   0 ] [ 0   0   0   m4  u4 ]
[ 0   0   0   c4  a5 ]   [ 0   0   0   l4  1 ] [ 0   0   0   0   m5 ]
Note that the first matrix on the right hand side is lower triangular and the second one is upper triangular. Performing the product of the matrices and comparing with the corresponding entries of the left hand side matrix we have

1st row: a1 = m1, b1 = u1,
2nd row: c1 = m1 l1, a2 = l1 u1 + m2, b2 = u2,
3rd row: c2 = m2 l2, a3 = l2 u2 + m3, b3 = u3,
4th row: c3 = m3 l3, a4 = l3 u3 + m4, b4 = u4,
5th row: c4 = m4 l4, a5 = l4 u4 + m5.
So we can determine the unknowns in the following order

m1 ; u1 , l1 , m2 ; u2 , l2 , m3 ; . . . , u4 , l4 , m5 .

Since uj = bj for all j we can write down the algorithm for general N as

% Determine the factorization coefficients
m_1 = a_1;
for j = 1 : N − 1
    l_j = c_j / m_j;
    m_{j+1} = a_{j+1} − l_j ∗ b_j;
end

% Forward substitution on Ly = d
y_1 = d_1;
for j = 2 : N
    y_j = d_j − l_{j−1} ∗ y_{j−1};
end

% Backward substitution to solve Ux = y
x_N = y_N / m_N;
for j = N − 1 : −1 : 1
    x_j = (y_j − b_j ∗ x_{j+1}) / m_j;
end
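
A Python transcription of this tridiagonal solve (the Thomas algorithm) might look as follows; the function name and test data are illustrative, and since no pivoting is done the sketch assumes a diagonally dominant matrix as in (3.90).

    import numpy as np

    def tridiag_solve(a, b, c, d):
        # a: diagonal (length N), b: superdiagonal, c: subdiagonal (length N-1), d: rhs
        N = len(a)
        m = np.empty(N); l = np.empty(N - 1)
        m[0] = a[0]
        for j in range(N - 1):                  # factorization coefficients
            l[j] = c[j] / m[j]
            m[j + 1] = a[j + 1] - l[j] * b[j]
        y = np.empty(N); y[0] = d[0]            # forward substitution, Ly = d
        for j in range(1, N):
            y[j] = d[j] - l[j - 1] * y[j - 1]
        x = np.empty(N); x[-1] = y[-1] / m[-1]  # backward substitution, Ux = y
        for j in range(N - 2, -1, -1):
            x[j] = (y[j] - b[j] * x[j + 1]) / m[j]
        return x

    a = np.array([4.0, 4.0, 4.0]); b = np.array([1.0, 1.0]); c = np.array([1.0, 1.0])
    print(tridiag_solve(a, b, c, np.array([5.0, 6.0, 5.0])))   # [1. 1. 1.]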

3.9.2 Complete Splines


Sometimes it is more appropriate to specify the first derivative at the end
points instead of the second derivative. This is called a complete or clamped
spline. In this case z0 = f0'' and zn = fn'' become unknowns together with z1, z2, . . . , z_{n−1}.
We need to add two more equations to have a complete system for all the
n + 1 unknown values z0 , z1 , . . . , zn in a complete spline. Recall that

sj (x) = Aj (x − xj )3 + Bj (x − xj )2 + Cj (x − xj ) + Dj

and so

s0j (x) = 3Aj (x − xj )2 + 2Bj (x − xj ) + Cj .



Therefore

s_0'(x0) = C0 = f0',    (3.102)
s_{n−1}'(xn) = 3 A_{n−1} h_{n−1}² + 2 B_{n−1} h_{n−1} + C_{n−1} = fn'.    (3.103)

Substituting C0, A_{n−1}, B_{n−1}, and C_{n−1} from (3.84)-(3.86) we get

2 h0 z0 + h0 z1 = (6/h0)(f1 − f0) − 6 f0',    (3.104)
h_{n−1} z_{n−1} + 2 h_{n−1} zn = −(6/h_{n−1})(fn − f_{n−1}) + 6 fn'.    (3.105)

These two equations together with (3.89) uniquely determine the second derivative values at all the nodes. The resulting (n + 1) × (n + 1) system is also tridiagonal and diagonally dominant (hence nonsingular). Once the values z0, z1, . . . , zn are found the spline coefficients are obtained from (3.84)-(3.87).

3.9.3 Parametric Curves


In computer graphics and animation it is often required to construct smooth
curves that are not necessarily the graph of a function but that have a para-
metric representation x = x(t) and y = y(t) for t ∈ [a, b]. Hence one needs
to determine two splines interpolating (tj , xj ) and (tj , yj ) (j = 0, 1, . . . n),
respectively.
The arc length of the curve is a natural choice for the parameter t. How-
ever, this is not known a priori and instead the nodes tj ’s are usually chosen
as the distances of consecutive, judiciously chosen points:
t0 = 0,   tj = t_{j−1} + sqrt( (xj − x_{j−1})² + (yj − y_{j−1})² ),   j = 1, 2, . . . , n.    (3.106)
Chapter 4

Trigonometric Approximation

We will study approximations employing truncated Fourier series. This type


of approximation finds multiple applications in digital signal and image pro-
cessing, and in the construction of highly accurate approximations to the
solution of some partial differential equations. We will look at the problem
of best approximation in the convenient L2 norm and then consider the more
practical approximation using interpolation. The latter will put us in the
discrete framework to introduce the leading star of this chapter, the Discrete
Fourier Transform, and one of the top ten algorithms of all time, the Fast
Fourier Transform, to compute it.

4.1 Approximating a Periodic Function

We begin with the problem of approximating a periodic function f at the


continuum level. Without loss of generality, we can assume that f is of period
2π (if it is of period p, then the function F(y) = f(py/(2π)) has period 2π).
If f is a smooth periodic function we can approximate it with the first few terms of its Fourier series:

f(x) ≈ (1/2) a0 + Σ_{k=1}^{n} (ak cos kx + bk sin kx),    (4.1)


where

ak = (1/π) ∫_0^{2π} f(x) cos kx dx,   k = 0, 1, . . . , n,    (4.2)
bk = (1/π) ∫_0^{2π} f(x) sin kx dx,   k = 1, 2, . . . , n.    (4.3)
We will show that this is the best approximation to f , in the L2 norm, by
a trigonometric polynomial of degree n (the right hand side of (4.1)). For
convenience, we write a trigonometric polynomial Sn (of degree n) in complex
form (see (1.45)-(1.48) ) as
Sn(x) = Σ_{k=−n}^{n} ck e^{ikx}.    (4.4)

Consider the square of the error

Jn = ||f − Sn||_2² = ∫_0^{2π} [f(x) − Sn(x)]² dx.    (4.5)

Let us try to find the coefficients ck (k = 0, ±1, . . . , ±n) in (4.4) that minimize Jn. We have

Jn = ∫_0^{2π} [ f(x) − Σ_{k=−n}^{n} ck e^{ikx} ]² dx
   = ∫_0^{2π} [f(x)]² dx − 2 Σ_{k=−n}^{n} ck ∫_0^{2π} f(x) e^{ikx} dx    (4.6)
     + Σ_{k=−n}^{n} Σ_{l=−n}^{n} ck cl ∫_0^{2π} e^{ikx} e^{ilx} dx.

This problem simplifies if we use the orthogonality of the set {1, e^{ix}, e^{−ix}, . . . , e^{inx}, e^{−inx}}: for k ≠ −l

∫_0^{2π} e^{ikx} e^{ilx} dx = ∫_0^{2π} e^{i(k+l)x} dx = [ e^{i(k+l)x}/(i(k+l)) ]_0^{2π} = 0    (4.7)

and for k = −l

∫_0^{2π} e^{ikx} e^{ilx} dx = ∫_0^{2π} dx = 2π.    (4.8)

Thus, we get

Jn = ∫_0^{2π} [f(x)]² dx − 2 Σ_{k=−n}^{n} ck ∫_0^{2π} f(x) e^{ikx} dx + 2π Σ_{k=−n}^{n} ck c_{−k}.    (4.9)

Jn is a quadratic function of the coefficients ck and so, to find its minimum, we determine the critical point of Jn as a function of the ck's:

∂Jn/∂cm = −2 ∫_0^{2π} f(x) e^{imx} dx + 2(2π) c_{−m} = 0,   m = 0, ±1, . . . , ±n.    (4.10)

Therefore, relabeling the coefficients with k again, we get

ck = (1/2π) ∫_0^{2π} f(x) e^{−ikx} dx,   k = 0, ±1, . . . , ±n,    (4.11)

which are the complex Fourier coefficients of f. The real Fourier coefficients (4.2)-(4.3) follow from Euler's formula e^{ikx} = cos kx + i sin kx, which produces the relation

c0 = (1/2) a0,   ck = (1/2)(ak − i bk),   c_{−k} = (1/2)(ak + i bk),   k = 1, . . . , n.    (4.12)
This concludes the proof of the claim that the best L² approximation to f by trigonometric polynomials of degree n is furnished by the Fourier series of f truncated at wave number k = n.
Now, if we substitute the Fourier coefficients (4.11) in (4.9) we get

0 ≤ Jn = ∫_0^{2π} [f(x)]² dx − 2π Σ_{k=−n}^{n} |ck|²,

that is

Σ_{k=−n}^{n} |ck|² ≤ (1/2π) ∫_0^{2π} [f(x)]² dx.    (4.13)

This is known as Bessel's inequality. In terms of the real Fourier coefficients, Bessel's inequality becomes

(1/2) a0² + Σ_{k=1}^{n} (ak² + bk²) ≤ (1/π) ∫_0^{2π} [f(x)]² dx.    (4.14)
If ∫_0^{2π} [f(x)]² dx is finite, then the series

(1/2) a0² + Σ_{k=1}^{∞} (ak² + bk²)

converges and consequently lim_{k→∞} ak = lim_{k→∞} bk = 0.


The convergence of a Fourier series is a delicate question. For a continuous
function f , it does not follow that its Fourier series converges point-wise to
it, only that it does so in the mean:

lim_{n→∞} ∫_0^{2π} [f(x) − Sn(x)]² dx = 0.    (4.15)

This convergence in the mean for a continuous periodic function implies that Bessel's inequality becomes the equality

(1/2) a0² + Σ_{k=1}^{∞} (ak² + bk²) = (1/π) ∫_0^{2π} [f(x)]² dx,    (4.16)

which is known as Parseval’s identity. We state now a convergence result


without proof.

Theorem 4.1. Suppose that f is piecewise continuous and periodic in [0, 2π] and with a piecewise continuous first derivative. Then

(1/2) a0 + Σ_{k=1}^{∞} (ak cos kx + bk sin kx) = [f^+(x) + f^−(x)]/2

for each x ∈ [0, 2π], where ak and bk are the Fourier coefficients of f. Here

lim_{h→0^+} f(x + h) = f^+(x),    (4.17)
lim_{h→0^+} f(x − h) = f^−(x).    (4.18)

In particular, if x is a point of continuity of f,

(1/2) a0 + Σ_{k=1}^{∞} (ak cos kx + bk sin kx) = f(x).

We have been working on the interval [0, 2π] but we can choose any other interval of length 2π. This is so because if g is 2π-periodic then we have the following result.

Lemma 2.

∫_0^{2π} g(x) dx = ∫_t^{t+2π} g(x) dx    (4.19)

for any real t.

Proof. Define

G(t) = ∫_t^{t+2π} g(x) dx.    (4.20)

Then

G(t) = ∫_t^0 g(x) dx + ∫_0^{t+2π} g(x) dx = ∫_0^{t+2π} g(x) dx − ∫_0^t g(x) dx.    (4.21)

By the Fundamental Theorem of Calculus

G'(t) = g(t + 2π) − g(t) = 0,

since g is 2π-periodic. Thus, G is independent of t.

Example 4.1. f(x) = |x| on [−π, π]. We have

ak = (1/π) ∫_{−π}^{π} f(x) cos kx dx = (1/π) ∫_{−π}^{π} |x| cos kx dx = (2/π) ∫_0^{π} x cos kx dx.

Integrating by parts with u = x, dv = cos kx dx (so du = dx, v = (1/k) sin kx),

ak = (2/π) [ (x/k) sin kx ]_0^{π} − (2/π)(1/k) ∫_0^{π} sin kx dx = [2/(πk²)] [ cos kx ]_0^{π}
   = [2/(πk²)] [(−1)^k − 1],   k = 1, 2, . . . ,

and a0 = π, bk = 0 for all k, since the function is even. Therefore

S(x) = lim_{n→∞} Sn(x)
     = (1/2)π + Σ_{k=1}^{∞} [2/(πk²)] [(−1)^k − 1] cos kx
     = (1/2)π − (4/π) [ cos x/1² + cos 3x/3² + cos 5x/5² + · · · ].

How do we find accurate numerical approximations to the Fourier coefficients? We know that for periodic smooth integrands, integrated over one (or multiple) period(s), the Composite Trapezoidal Rule with equidistributed nodes gives spectral accuracy. Let xj = jh, j = 0, 1, . . . , N, h = 2π/N. Then we can approximate

ck ≈ (h/2π) [ (1/2) f(x0) e^{−ikx_0} + Σ_{j=1}^{N−1} f(xj) e^{−ikx_j} + (1/2) f(xN) e^{−ikx_N} ].

But due to periodicity f(x0) e^{−ikx_0} = f(xN) e^{−ikx_N}, so we have

ck ≈ (1/N) Σ_{j=1}^{N} f(xj) e^{−ikx_j} = (1/N) Σ_{j=0}^{N−1} f(xj) e^{−ikx_j},    (4.22)

and for the real Fourier coefficients we have

ak ≈ (2/N) Σ_{j=1}^{N} f(xj) cos kxj = (2/N) Σ_{j=0}^{N−1} f(xj) cos kxj,    (4.23)
bk ≈ (2/N) Σ_{j=1}^{N} f(xj) sin kxj = (2/N) Σ_{j=0}^{N−1} f(xj) sin kxj.    (4.24)
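
These sums are easy to code. The Python/NumPy sketch below (illustrative names; a trigonometric polynomial is used as the test function so the sums are exact up to roundoff) approximates the complex coefficients ck by (4.22).

    import numpy as np

    def fourier_coeffs_trap(f, N, n):
        # c_k, |k| <= n, of a 2*pi-periodic f from N equispaced samples, cf. (4.22)
        x = 2.0 * np.pi * np.arange(N) / N
        fx = f(x)
        ks = np.arange(-n, n + 1)
        return ks, np.array([np.sum(fx * np.exp(-1j * k * x)) / N for k in ks])

    f = lambda x: 1.0 + np.cos(x) + 0.5 * np.sin(2 * x)
    ks, c = fourier_coeffs_trap(f, N=32, n=3)
    print(np.round(c, 12))   # c_0 = 1, c_{+1} = c_{-1} = 0.5, c_2 = -0.25j, c_{-2} = 0.25j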

4.2 Interpolating Fourier Polynomial


Let f be a 2π-periodic function and xj = j(2π/N), j = 0, 1, . . . , N, equidistributed nodes in [0, 2π]. The interpolation problem is to find a trigonometric polynomial of lowest order Sn such that Sn(xj) = f(xj), for j = 0, 1, . . . , N. Because of periodicity f(x0) = f(xN), so we only have N independent values.

If we take N = 2n then Sn has 2n + 1 = N + 1 coefficients. So we need one more condition. At the equidistributed nodes, the sine term of highest frequency vanishes, as sin((N/2) xj) = sin(jπ) = 0, so the coefficient b_{N/2} is irrelevant for interpolation. We thus look for an interpolating trigonometric polynomial of the form

PN(x) = (1/2) a0 + Σ_{k=1}^{N/2−1} (ak cos kx + bk sin kx) + (1/2) a_{N/2} cos((N/2) x).    (4.25)

The convenience of the 1/2 factor in the last term will be seen in the formulas for the coefficients below. It is conceptually and computationally simpler to work with the corresponding polynomial in complex form

PN(x) = Σ''_{k=−N/2}^{N/2} ck e^{ikx},    (4.26)

where the double prime in the sum means that the first and last terms (for k = −N/2 and k = N/2) have a factor of 1/2. It is also understood that c_{−N/2} = c_{N/2}, which is equivalent to the b_{N/2} = 0 condition in (4.25).
Theorem 4.2.

PN(x) = Σ''_{k=−N/2}^{N/2} ck e^{ikx}

with

ck = (1/N) Σ_{j=0}^{N−1} f(xj) e^{−ikx_j},   k = −N/2, . . . , N/2,

interpolates f at the equidistributed points xj = j(2π/N), j = 0, 1, . . . , N.


Proof. We have
N/2 N −1 N/2
X 00
ikx
X 1 X00 ik(x−xj )
PN (x) = ck e = f (xj ) e .
j=0
N
k=−N/2 k=−N/2

Defining
N/2
1 X00 ik(x−xj )
lj (x) = e (4.27)
N
k=−N/2

we have

PN(x) = Σ_{j=0}^{N−1} f(xj) lj(x).    (4.28)

Thus, we only need to prove that for j and m in the range 0, . . . , N − 1

lj(xm) = { 1 for m = j,
           0 for m ≠ j.    (4.29)

We have

lj(xm) = (1/N) Σ''_{k=−N/2}^{N/2} e^{ik(m−j)2π/N}.

But e^{i(±N/2)(m−j)2π/N} = e^{±i(m−j)π} = (−1)^{m−j}, so we can combine the first and the last term and remove the prime from the sum:

lj(xm) = (1/N) Σ_{k=−N/2}^{N/2−1} e^{ik(m−j)2π/N}
       = (1/N) Σ_{k=−N/2}^{N/2−1} e^{i(k+N/2)(m−j)2π/N} e^{−i(N/2)(m−j)2π/N}
       = e^{−i(m−j)π} (1/N) Σ_{k=0}^{N−1} e^{ik(m−j)2π/N}

and, as we proved in the Introduction,

(1/N) Σ_{k=0}^{N−1} e^{−ik(j−m)2π/N} = { 0 if j − m is not divisible by N,
                                         1 otherwise.    (4.30)

Then (4.29) follows and

PN (xm ) = f (xm ), m = 0, 1, . . . N − 1.

Using the relations (4.12) between the ck and the ak and bk coefficients we find that

PN(x) = (1/2) a0 + Σ_{k=1}^{N/2−1} (ak cos kx + bk sin kx) + (1/2) a_{N/2} cos((N/2) x)

interpolates f at the equidistributed nodes xj = j(2π/N), j = 0, 1, . . . , N, if and only if

ak = (2/N) Σ_{j=1}^{N} f(xj) cos kxj,    (4.31)
bk = (2/N) Σ_{j=1}^{N} f(xj) sin kxj.    (4.32)

A smooth periodic function f can be approximated very accurately by its Fourier interpolant PN. Note that derivatives of PN can be easily computed:

PN^{(p)}(x) = Σ''_{k=−N/2}^{N/2} (ik)^p ck e^{ikx}.    (4.33)

The discrete Fourier coefficients of the p-th derivative of PN are (ik)^p ck. Thus, once these Fourier coefficients have been computed, a very accurate approximation of the derivatives of f is obtained, f^{(p)}(x) ≈ PN^{(p)}(x). Figure 4.1 shows the approximation of f(x) = sin(x) e^{cos x} on [0, 2π] by P8. The graph of f and of the Fourier interpolant are almost indistinguishable.
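
This idea of differentiating the interpolant is easy to try numerically. The sketch below (an illustration that assumes NumPy's FFT conventions, discussed further in the next sections, and an even N) multiplies the discrete coefficients by ik as in (4.33).

    import numpy as np

    def spectral_derivative(fx):
        # approximate f'(x_j) from samples at x_j = 2*pi*j/N via (4.33)
        N = len(fx)
        c = np.fft.fft(fx)                   # unnormalized DFT of the samples
        k = np.fft.fftfreq(N, d=1.0 / N)     # wavenumbers 0,...,N/2-1,-N/2,...,-1
        k[N // 2] = 0.0                      # drop the Nyquist mode for the odd derivative
        return np.real(np.fft.ifft(1j * k * c))

    N = 16
    x = 2.0 * np.pi * np.arange(N) / N
    f = np.sin(x) * np.exp(np.cos(x))
    exact = (np.cos(x) - np.sin(x)**2) * np.exp(np.cos(x))
    print(np.max(np.abs(spectral_derivative(f) - exact)))   # already small for N = 16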
Let us go back to the complex Fourier interpolant (4.26). Its coefficients ck are periodic of period N:

c_{k+N} = (1/N) Σ_{j=0}^{N−1} fj e^{−i(k+N)x_j} = (1/N) Σ_{j=0}^{N−1} fj e^{−ikx_j} e^{−ij2π} = ck    (4.34)

and in particular c_{−N/2} = c_{N/2}. Using the interpolation property and setting fj = f(xj), we have

fj = Σ''_{k=−N/2}^{N/2} ck e^{ikx_j} = Σ_{k=−N/2}^{N/2−1} ck e^{ikx_j}    (4.35)

Figure 4.1: S8(x) for f(x) = sin(x) e^{cos x} on [0, 2π].

and

Σ_{k=−N/2}^{N/2−1} ck e^{ikx_j} = Σ_{k=−N/2}^{−1} ck e^{ikx_j} + Σ_{k=0}^{N/2−1} ck e^{ikx_j}
                               = Σ_{k=N/2}^{N−1} ck e^{ikx_j} + Σ_{k=0}^{N/2−1} ck e^{ikx_j} = Σ_{k=0}^{N−1} ck e^{ikx_j},    (4.36)

where we have used that c_{k+N} = ck. Combining this with the formula for the ck's we get the Discrete Fourier Transform (DFT) pair

ck = (1/N) Σ_{j=0}^{N−1} fj e^{−ikx_j},   k = 0, . . . , N − 1,    (4.37)
fj = Σ_{k=0}^{N−1} ck e^{ikx_j},   j = 0, . . . , N − 1.    (4.38)

The set of discrete coefficients (4.37) is known as the DFT of the periodic array f0, f1, . . . , f_{N−1} and (4.38) is referred to as the Inverse DFT.
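
For concreteness, here is a direct O(N²) Python evaluation of (4.37); it is only a sketch, and the comparison assumes NumPy's convention that numpy.fft.fft computes the sums without the 1/N factor.

    import numpy as np

    def dft_direct(f):
        # the DFT (4.37) by direct summation: O(N^2) operations
        N = len(f)
        jv = np.arange(N)
        return np.array([np.sum(f * np.exp(-1j * 2 * np.pi * k * jv / N))
                         for k in range(N)]) / N

    f = np.random.rand(8)
    print(np.allclose(dft_direct(f), np.fft.fft(f) / len(f)))   # True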
The direct evaluation of the DFT is computationally expensive: it requires order N² operations. However, there is a remarkable algorithm which achieves this in merely order N log N operations. This is known as the Fast Fourier Transform.

4.3 The Fast Fourier Transform


The DFT was defined as

ck = (1/N) Σ_{j=0}^{N−1} fj e^{−ikx_j},   k = 0, . . . , N − 1,    (4.39)
fj = Σ_{k=0}^{N−1} ck e^{ikx_j},   j = 0, . . . , N − 1.    (4.40)

The direct computation of either (4.39) or (4.40) requires order N² operations. As N increases, the cost quickly becomes prohibitive. In many applications N could easily be on the order of thousands, millions, etc.
One of the top algorithms of all times is the Fast Fourier Transform (FFT). It is usually attributed to Cooley and Tukey (1965) but its origin can be tracked back to C. F. Gauss (1777-1855). We now look at the main ideas of this famous and widely used algorithm.
One of the top algorithms of all times is the Fast Fourier Transform
(FFT). It is usually attributed to Cooley and Tukey (1965) but its origin can
be tracked back to C. F. Gauss (1777-1855). We now look at the main ideas
of this famous and widely used algorithm.
Let us define dk = N ck for k = 0, 1, . . . , N − 1. Then we can rewrite (4.39) as

dk = Σ_{j=0}^{N−1} fj ωN^{kj},    (4.41)

where ωN = e^{−i2π/N}. In matrix form

[ d0      ]   [ ωN^0  ωN^0       ···  ωN^0          ] [ f0      ]
[ d1      ]   [ ωN^0  ωN^1       ···  ωN^{N−1}      ] [ f1      ]
[ ...     ] = [ ...   ...        ...  ...           ] [ ...     ]    (4.42)
[ d_{N−1} ]   [ ωN^0  ωN^{N−1}   ···  ωN^{(N−1)²}   ] [ f_{N−1} ]

Let us call the matrix on the right FN . Then FN is N times the matrix of
the DFT and the matrix of the Inverse DFT is simply the complex conjugate

of FN. This follows from the identities:

1 + ωN + ωN² + · · · + ωN^{N−1} = 0,
1 + ωN² + ωN⁴ + · · · + ωN^{2(N−1)} = 0,
  ...
1 + ωN^{N−1} + ωN^{2(N−1)} + · · · + ωN^{(N−1)²} = 0,
1 + ωN^N + ωN^{2N} + · · · + ωN^{N(N−1)} = N.

We already proved the first of these identities. This geometric sum is equal to (1 − ωN^N)/(1 − ωN), which in turn is equal to zero because ωN^N = 1. The others are proved similarly. We can summarize these identities as

Σ_{k=0}^{N−1} ωN^{jk} = { 0,  j ≠ 0 (mod N),
                          N,  j = 0 (mod N),    (4.43)

where j = k (mod N) means j − k is an integer multiple of N. Then

(1/N)(FN F̄N)_{jl} = (1/N) Σ_{k=0}^{N−1} ωN^{jk} ωN^{−lk} = (1/N) Σ_{k=0}^{N−1} ωN^{(j−l)k} = { 0,  j ≠ l (mod N),
                                                                                              1,  j = l (mod N),    (4.44)

which shows that F̄N is the inverse of (1/N) FN.


Let N = 2n. Going back to (4.41), if we split the even-numbered and the
odd-numbered points we have
n−1
X n−1
X
2jk (2j+1)k
dk = f2j ωN + f2j+1 ωN (4.45)
j=0 j=0

But
2jk 2π −ijk 2π 2π
= e−i2jk N = e = e−ijk n = ωnkj ,
N
ωN 2 (4.46)
(2j+1)k −i(2j+1)k 2π −ik 2π −i2jk 2π k kj
ωN =e N =e N e N = ωN ωn . (4.47)
So denoting fje = f2j and fj0 = f2j+1 , we get
n−1
X n−1
X
dk = fje ωnjk + k
ωN fj0 ωnjk (4.48)
j=0 j=0

We have reduced the problem to two DFTs of size n = N/2 plus N multiplications (and N sums). The numbers ωN^k, k = 0, 1, . . . , N − 1, depend only on N so they can be precomputed once and stored for other DFTs of the same size N.
If N = 2^p, for p a positive integer, we can repeat the process to reduce each of the DFTs of size n to a pair of DFTs of size n/2 plus n multiplications (and n additions), etc. We can do this p times so that we end up with 1-point DFTs, which require no multiplications!
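
A compact recursive implementation of this splitting is sketched below in Python (illustrative only; it assumes the input length is a power of 2 and follows the convention of (4.41), i.e. no 1/N factor).

    import numpy as np

    def fft_radix2(f):
        # radix-2 FFT of the sums (4.41); uses d_k = E_k +/- w_N^k O_k from (4.48)
        N = len(f)
        if N == 1:
            return np.asarray(f, dtype=complex)
        even = fft_radix2(f[0::2])            # DFT of the even-indexed samples
        odd = fft_radix2(f[1::2])             # DFT of the odd-indexed samples
        w = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors w_N^k
        return np.concatenate([even + w * odd, even - w * odd])

    f = np.random.rand(16)
    print(np.allclose(fft_radix2(f), np.fft.fft(f)))   # True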
Let us count the number of operations in the FFT algorithm. For simplicity, let us count only the number of multiplications (the number of additions is of the same order). Let m_N be the number of multiplications to compute the DFT for a periodic array of size N and assume that N = 2^p. Then

m_N = 2 m_{N/2} + N
    = 2 m_{2^{p−1}} + 2^p
    = 2 (2 m_{2^{p−2}} + 2^{p−1}) + 2^p
    = 2² m_{2^{p−2}} + 2 · 2^p
    = · · ·
    = 2^p m_{2^0} + p · 2^p = p · 2^p
    = N log₂ N,

where we have used that m_{2^0} = m_1 = 0 (no multiplication is needed for a DFT of one point). To illustrate the savings, if N = 2^{20}, with the FFT we can obtain the DFT (or the Inverse DFT) in order 20 × 2^{20} operations, whereas the direct method requires order 2^{40}, i.e. a factor of (1/20) 2^{20} ≈ 52429 more operations.
operations.
The FFT can also be implemented efficiently when N is the product of
small primes. A very efficient implementation of the FFT is the FFTW
(“the Fastest Fourier Transform in the West”), which employs a variety of
code generation and runtime optimization techniques and is a free software.
Chapter 5

Least Squares Approximation

5.1 Continuous Least Squares Approximation


Let f be a continuous function on [a, b]. We would like to find the best approximation to f by a polynomial of degree at most n in the L² norm. We have already studied this problem for the approximation of periodic functions by trigonometric (complex exponential) polynomials. The problem is to find a polynomial pn of degree ≤ n such that

∫_a^b [f(x) − pn(x)]² dx = min.    (5.1)

Such a polynomial is also called the Least Squares approximation to f. As an illustration let us consider n = 1. We look for p1(x) = a0 + a1 x, for x ∈ [a, b], which minimizes

J(a0, a1) = ∫_a^b [f(x) − p1(x)]² dx
          = ∫_a^b f²(x) dx − 2 ∫_a^b f(x)(a0 + a1 x) dx + ∫_a^b (a0 + a1 x)² dx.    (5.2)

J(a0, a1) is a quadratic function of a0 and a1 and thus a necessary condition for the minimum is that it is a critical point:

∂J(a0, a1)/∂a0 = −2 ∫_a^b f(x) dx + 2 ∫_a^b (a0 + a1 x) dx = 0,
∂J(a0, a1)/∂a1 = −2 ∫_a^b x f(x) dx + 2 ∫_a^b (a0 + a1 x) x dx = 0,


which yields the following linear 2 × 2 system for a0 and a1:

( ∫_a^b 1 dx ) a0 + ( ∫_a^b x dx ) a1 = ∫_a^b f(x) dx,    (5.3)
( ∫_a^b x dx ) a0 + ( ∫_a^b x² dx ) a1 = ∫_a^b x f(x) dx.    (5.4)

Example 5.1. Let f(x) = e^x for x ∈ [0, 1]. Then

∫_0^1 e^x dx = e − 1,    (5.5)
∫_0^1 x e^x dx = 1,    (5.6)

and the normal equations are

a0 + (1/2) a1 = e − 1,    (5.7)
(1/2) a0 + (1/3) a1 = 1,    (5.8)

whose solution is a0 = 4e − 10, a1 = −6e + 18. Therefore the least squares approximation to f(x) = e^x by a linear polynomial is

p1(x) = 4e − 10 + (18 − 6e)x.

p1 and f are plotted in Fig. 5.1
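
The normal equations of this example are small enough to check numerically; the following Python sketch (illustrative, using a library solve) reproduces the coefficients a0 = 4e − 10 and a1 = 18 − 6e.

    import numpy as np

    # normal equations (5.7)-(5.8) for f(x) = e^x on [0, 1]
    H = np.array([[1.0, 1.0 / 2.0],
                  [1.0 / 2.0, 1.0 / 3.0]])
    rhs = np.array([np.e - 1.0, 1.0])        # the integrals (5.5)-(5.6)
    a0, a1 = np.linalg.solve(H, rhs)
    print(a0, 4 * np.e - 10)                 # both ~ 0.8731
    print(a1, 18 - 6 * np.e)                 # both ~ 1.6903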

The Least Squares Approximation to a function f on an interval [a, b] by a polynomial of degree at most n is the best approximation of f in the L² norm by that class of polynomials. It is the polynomial

pn(x) = a0 + a1 x + · · · + an x^n

such that

∫_a^b [f(x) − pn(x)]² dx = min.    (5.9)

Figure 5.1: The function f(x) = e^x on [0, 1] and its Least Squares Approximation p1(x) = 4e − 10 + (18 − 6e)x.

Defining this squared L² error as J(a0, a1, ..., an) we have

J(a0, a1, ..., an) = ∫_a^b [f(x) − (a0 + a1 x + · · · + an x^n)]² dx
                   = ∫_a^b f²(x) dx − 2 Σ_{k=0}^{n} ak ∫_a^b x^k f(x) dx + Σ_{k=0}^{n} Σ_{l=0}^{n} ak al ∫_a^b x^{k+l} dx.

J(a0, a1, ..., an) is a quadratic function of the parameters a0, ..., an. A necessary condition for a minimum is that the minimizing set a0, ..., an is a critical point. That is,

0 = ∂J(a0, a1, . . . , an)/∂am
  = −2 ∫_a^b x^m f(x) dx + Σ_{k=0}^{n} ak ∫_a^b x^{k+m} dx + Σ_{l=0}^{n} al ∫_a^b x^{l+m} dx
  = −2 ∫_a^b x^m f(x) dx + 2 Σ_{k=0}^{n} ak ∫_a^b x^{k+m} dx,   m = 0, 1, . . . , n,

and we get the Normal Equations

Σ_{k=0}^{n} ak ∫_a^b x^{k+m} dx = ∫_a^b x^m f(x) dx,   m = 0, 1, ..., n.    (5.10)

We can write the Normal Equations in matrix form as

[ ∫_a^b 1 dx      ∫_a^b x dx         ···  ∫_a^b x^n dx      ] [ a0 ]   [ ∫_a^b f(x) dx     ]
[ ∫_a^b x dx      ∫_a^b x² dx        ···  ∫_a^b x^{n+1} dx  ] [ a1 ]   [ ∫_a^b x f(x) dx   ]
[    ...             ...                     ...            ] [ ...] = [    ...            ]    (5.11)
[ ∫_a^b x^n dx    ∫_a^b x^{n+1} dx   ···  ∫_a^b x^{2n} dx   ] [ an ]   [ ∫_a^b x^n f(x) dx ]

The matrix in this system is clearly symmetric. Denoting this matrix by H and letting a = [a0 a1 · · · an]^T be any vector with n + 1 components, we have

a^T H a = Σ_{i=0}^{n} Σ_{j=0}^{n} ai aj H_ij
        = Σ_{i=0}^{n} Σ_{j=0}^{n} ai aj ∫_a^b x^{i+j} dx
        = ∫_a^b Σ_{i=0}^{n} Σ_{j=0}^{n} (ai x^i)(aj x^j) dx
        = ∫_a^b ( Σ_{j=0}^{n} aj x^j )² dx ≥ 0.

Moreover, a^T H a = 0 if and only if a = 0, i.e. H is positive definite. This implies that H is nonsingular and hence there is a unique solution to (5.11). For if there were a ≠ 0 such that Ha = 0, then a^T H a = 0, contradicting the fact that H is positive definite. Furthermore, the Hessian ∂²J/(∂ai ∂aj) is equal to 2H, so it is also positive definite and therefore the critical point is indeed a minimum.

In the interval [0, 1],

H = [ 1         1/2       ···  1/(n+1)  ]
    [ 1/2       1/3       ···  1/(n+2)  ]
    [ ...       ...       ...  ...      ]    (5.12)
    [ 1/(n+1)   1/(n+2)   ···  1/(2n+1) ]

which is known as the [(n + 1) × (n + 1)] Hilbert matrix.


In principle, the direct process to obtain the Least Squares Approximation
pn to f is to solve the normal equations (5.10) for the coefficients a0 , a1 , . . . , an
and set pn (x) = a0 + a1 x + . . . an xn . There are however two problems with
this approach:

1. It is difficult to solve this linear system numerically for even moderate n


because the matrix H is very sensitive to small perturbations and this
sensitivity increases rapidly with n. For example, numerical solutions
in double precision (about 16 digits of accuracy) of a linear system with
the Hilbert matrix (5.12) will lose all accuracy for n ≥ 11.

2. If we want to increase the degree of the approximating polynomial we


need to start over again and solve a larger set of normal equations.
That is, we cannot use the a0 , a1 , . . . , an we already found.

It is more efficient and easier to solve the Least Squares Approximation


problem using orthogonality, as we did with approximation by trigonometric
polynomials. Suppose that we have a set of polynomials defined on an interval
[a, b],
{φ0 , φ1 , ..., φn },
such that φk is a polynomial of degree k. Then, we can write any polyno-
mial of degree at most n as a linear combination of these polynomials. In
particular, the Least Square Approximating polynomial pn can be written as
    p_n(x) = a_0 φ_0(x) + a_1 φ_1(x) + · · · + a_n φ_n(x) = \sum_{k=0}^{n} a_k φ_k(x),

for some coefficients a0 , . . . , an to be determined. Then


    J(a_0, ..., a_n) = \int_a^b \Big[ f(x) - \sum_{k=0}^{n} a_k φ_k(x) \Big]^2 dx
                     = \int_a^b f^2(x) dx - 2 \sum_{k=0}^{n} a_k \int_a^b φ_k(x) f(x) dx      (5.13)
                       + \sum_{k=0}^{n} \sum_{l=0}^{n} a_k a_l \int_a^b φ_k(x) φ_l(x) dx

and

    0 = \frac{\partial J}{\partial a_m} = -2 \int_a^b φ_m(x) f(x) dx + 2 \sum_{k=0}^{n} a_k \int_a^b φ_k(x) φ_m(x) dx,

for m = 0, 1, ..., n, which gives the normal equations

    \sum_{k=0}^{n} a_k \int_a^b φ_k(x) φ_m(x) dx = \int_a^b φ_m(x) f(x) dx,   m = 0, 1, ..., n.    (5.14)

Now, if the set of approximating functions {φ_0, ..., φ_n} is orthogonal, i.e.

    \int_a^b φ_k(x) φ_m(x) dx = 0   if k ≠ m,                                     (5.15)

then the coefficients of the least squares approximation are explicitly given by

    a_m = \frac{1}{α_m} \int_a^b φ_m(x) f(x) dx,   α_m = \int_a^b φ_m^2(x) dx,   m = 0, 1, ..., n,    (5.16)

and

    p_n(x) = a_0 φ_0(x) + a_1 φ_1(x) + · · · + a_n φ_n(x).
Note that if the set {φ_0, ..., φ_n} is orthogonal, (5.16) and (5.13) imply the Bessel inequality

    \sum_{k=0}^{n} α_k a_k^2 \le \int_a^b f^2(x) dx.                               (5.17)

This inequality shows that if f is square integrable, i.e. if

    \int_a^b f^2(x) dx < ∞,

then the series \sum_{k=0}^{∞} α_k a_k^2 converges.
We can consider the Least Squares approximation for a class of linear
combinations of orthogonal functions {φ0 , ..., φn } not necessarily polynomials.
We saw an example of this with Fourier approximations 1 . It is convenient
to define a weighted L2 norm associated with the Least Squares problem
    \|f\|_{w,2} = \Big( \int_a^b f^2(x) w(x) dx \Big)^{1/2},                       (5.18)

where w(x) ≥ 0 for all x ∈ (a, b) 2 . A corresponding inner product is defined


by

    ⟨f, g⟩ = \int_a^b f(x) g(x) w(x) dx.                                            (5.19)

Definition 5.1. A set of functions {φ_0, ..., φ_n} is orthogonal, with respect to the weighted inner product (5.19), if ⟨φ_k, φ_l⟩ = 0 for k ≠ l.

5.2 Linear Independence and Gram-Schmidt


Orthogonalization
Definition 5.2. A set of functions {φ_0(x), ..., φ_n(x)} defined on an interval [a, b] is said to be linearly independent if

    a_0 φ_0(x) + a_1 φ_1(x) + · · · + a_n φ_n(x) = 0   for all x ∈ [a, b]           (5.20)

implies a_0 = a_1 = · · · = a_n = 0. Otherwise, it is said to be linearly dependent.


1 For complex-valued functions orthogonality means \int_a^b φ_k(x) \bar{φ}_l(x) dx = 0 if k ≠ l, where the bar denotes the complex conjugate.
2 More precisely, we will assume w ≥ 0, \int_a^b w(x) dx > 0, and \int_a^b x^k w(x) dx < +∞ for k = 0, 1, .... We call such a w an admissible weight function.

Example 5.2. The set of functions {φ0 (x), ..., φn (x)}, where φk (x) is a poly-
nomial of degree k for k = 0, 1, . . . , n is linearly independent on any interval
[a, b]. For a0 φ0 (x) + a1 φ1 (x) + . . . an φn (x) is a polynomial of degree at most
n and hence a0 φ0 (x) + a1 φ1 (x) + . . . an φn (x) = 0 for all x in a given interval
[a, b] implies a0 = a1 = . . . = an = 0.

Given a set of linearly independent functions {φ0 (x), ..., φn (x)} we can
produce an orthogonal set {ψ0 (x), ..., ψn (x)} by doing the Gram-Schmidt
procedure:

    ψ_0(x) = φ_0(x),
    ψ_1(x) = φ_1(x) - c_0 ψ_0(x),        ⟨ψ_1, ψ_0⟩ = 0  ⇒  c_0 = ⟨φ_1, ψ_0⟩ / ⟨ψ_0, ψ_0⟩,
    ψ_2(x) = φ_2(x) - c_0 ψ_0(x) - c_1 ψ_1(x),
                                         ⟨ψ_2, ψ_0⟩ = 0  ⇒  c_0 = ⟨φ_2, ψ_0⟩ / ⟨ψ_0, ψ_0⟩,
                                         ⟨ψ_2, ψ_1⟩ = 0  ⇒  c_1 = ⟨φ_2, ψ_1⟩ / ⟨ψ_1, ψ_1⟩,
    ...

We can write this procedure recursively as

    ψ_0(x) = φ_0(x),
    ψ_k(x) = φ_k(x) - \sum_{j=0}^{k-1} c_j ψ_j(x),   c_j = \frac{⟨φ_k, ψ_j⟩}{⟨ψ_j, ψ_j⟩}.    (5.21)

5.3 Orthogonal Polynomials


Let us take the set {1, x, ..., x^n} on an interval [a, b]. We can use the Gram-
Schmidt process to obtain an orthogonal set {ψ0 (x), ..., ψn (x)} of polynomials
with respect to the inner product (5.19). Each of the ψk is a polynomial of
degree k, determined up to a multiplicative constant (orthogonality is not
changed). Suppose we select the ψk (x), k = 0, 1, . . . , n to be monic, i.e.
the coefficient of xk is 1. Then ψk+1 (x) − xψk (x) = rk (x), where rk (x) is a

polynomial of degree at most k. So we can write


    ψ_{k+1}(x) - x ψ_k(x) = -α_k ψ_k(x) - β_k ψ_{k-1}(x) + \sum_{j=0}^{k-2} c_j ψ_j(x).    (5.22)

Then taking the inner product of this expression with ψ_k and using orthogonality we get

    -⟨x ψ_k, ψ_k⟩ = -α_k ⟨ψ_k, ψ_k⟩

and

    α_k = ⟨x ψ_k, ψ_k⟩ / ⟨ψ_k, ψ_k⟩.

Similarly, taking the inner product with ψ_{k-1} we obtain

    -⟨x ψ_k, ψ_{k-1}⟩ = -β_k ⟨ψ_{k-1}, ψ_{k-1}⟩,

but ⟨x ψ_k, ψ_{k-1}⟩ = ⟨ψ_k, x ψ_{k-1}⟩ and x ψ_{k-1}(x) = ψ_k(x) + q_{k-1}(x), where q_{k-1}(x) is a polynomial of degree at most k - 1. Then

    ⟨ψ_k, x ψ_{k-1}⟩ = ⟨ψ_k, ψ_k⟩ + ⟨ψ_k, q_{k-1}⟩ = ⟨ψ_k, ψ_k⟩,

where we have used orthogonality in the last equation. Therefore

    β_k = ⟨ψ_k, ψ_k⟩ / ⟨ψ_{k-1}, ψ_{k-1}⟩.

Finally, taking the inner product of (5.22) with ψ_m for m = 0, ..., k - 2 we get

    -⟨ψ_k, x ψ_m⟩ = c_m ⟨ψ_m, ψ_m⟩,   m = 0, ..., k - 2,

but the left hand side is zero because x ψ_m(x) is a polynomial of degree at most k - 1 and hence it is orthogonal to ψ_k(x). Collecting the results we obtain a three-term recursion formula:

    ψ_0(x) = 1,                                                                  (5.23)
    ψ_1(x) = x - α_0,   α_0 = ⟨x ψ_0, ψ_0⟩ / ⟨ψ_0, ψ_0⟩,                          (5.24)

and for k = 1, ..., n

    ψ_{k+1}(x) = (x - α_k) ψ_k(x) - β_k ψ_{k-1}(x),                              (5.25)
    α_k = ⟨x ψ_k, ψ_k⟩ / ⟨ψ_k, ψ_k⟩,                                             (5.26)
    β_k = ⟨ψ_k, ψ_k⟩ / ⟨ψ_{k-1}, ψ_{k-1}⟩.                                       (5.27)

Example 5.3. Let [a, b] = [−1, 1] and w(x) ≡ 1. The corresponding orthog-
onal polynomials are known as the Legendre Polynomials and are widely
used in a variety of numerical methods. Because xψk2 (x)w(x) is an odd func-
tion it follows that αk = 0 for all k. We have ψ0 (x) = 1 and ψ1 (x) = x. We
can now use the three-term recursion (5.25) to obtain
    β_1 = \int_{-1}^{1} x^2 dx \Big/ \int_{-1}^{1} dx = 1/3

and ψ_2(x) = x^2 - 1/3. Now for k = 2 we get

    β_2 = \int_{-1}^{1} (x^2 - 1/3)^2 dx \Big/ \int_{-1}^{1} x^2 dx = 4/15

and ψ_3(x) = x(x^2 - 1/3) - (4/15) x = x^3 - (3/5) x. We now collect the Legendre polynomials we found:

    ψ_0(x) = 1,
    ψ_1(x) = x,
    ψ_2(x) = x^2 - 1/3,
    ψ_3(x) = x^3 - (3/5) x,
    ...
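A small sketch, not from the text, that builds these monic orthogonal polynomials on [-1, 1] with w ≡ 1 via the three-term recursion (5.23)-(5.27), assuming NumPy and SciPy; the inner products are evaluated by numerical quadrature and polynomial coefficients are stored in increasing powers of x:

import numpy as np
from numpy.polynomial import polynomial as P
from scipy.integrate import quad

x_poly = np.array([0.0, 1.0])                      # the polynomial "x"

def inner(p, q):
    # <p, q> = integral of p*q on [-1, 1] with weight 1
    return quad(lambda t: P.polyval(t, p) * P.polyval(t, q), -1.0, 1.0)[0]

def monic_orthogonal(n):
    psi = [np.array([1.0])]                        # psi_0 = 1
    a0 = inner(P.polymul(x_poly, psi[0]), psi[0]) / inner(psi[0], psi[0])
    psi.append(np.array([-a0, 1.0]))               # psi_1(x) = x - alpha_0
    for k in range(1, n):
        ak = inner(P.polymul(x_poly, psi[k]), psi[k]) / inner(psi[k], psi[k])
        bk = inner(psi[k], psi[k]) / inner(psi[k - 1], psi[k - 1])
        nxt = P.polysub(P.polymul(np.array([-ak, 1.0]), psi[k]), bk * psi[k - 1])
        psi.append(nxt)
    return psi

print(monic_orthogonal(3)[3])   # approximately [0, -0.6, 0, 1], i.e. x^3 - (3/5) x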

Theorem 5.1. The zeros of orthogonal polynomials are real, simple, and
they all lie in (a, b).
Proof. Indeed, ψ_k(x) is orthogonal to ψ_0(x) = 1 for each k ≥ 1, thus

    \int_a^b ψ_k(x) w(x) dx = 0,                                                  (5.28)

i.e. ψ_k has to change sign in [a, b], so it has a zero, say x_1 ∈ (a, b). Suppose x_1 is not a simple root. Then q(x) = ψ_k(x)/(x - x_1)^2 is a polynomial of degree k - 2 and so

    0 = ⟨ψ_k, q⟩ = \int_a^b \frac{ψ_k^2(x)}{(x - x_1)^2} w(x) dx > 0,

which is of course impossible. Assume now that ψ_k(x) has only l < k zeros in (a, b), x_1, ..., x_l. Then ψ_k(x)(x - x_1) · · · (x - x_l) = q_{k-l}(x)(x - x_1)^2 · · · (x - x_l)^2, where q_{k-l}(x) is a polynomial of degree k - l which does not change sign in [a, b]. Then

    ⟨ψ_k, (x - x_1) · · · (x - x_l)⟩ = \int_a^b q_{k-l}(x)(x - x_1)^2 · · · (x - x_l)^2 w(x) dx ≠ 0,

but ⟨ψ_k, (x - x_1) · · · (x - x_l)⟩ = 0 for l < k. Therefore l = k.

5.3.1 Chebyshev Polynomials


We introduced in Section 2.4 the Chebyshev polynomials, which as we have
seen possess remarkable properties. We now add one more important prop-
erty of this outstanding class of polynomials, namely orthogonality.
The Chebyshev polynomials are orthogonal with respect to the weight function

    w(x) = \frac{1}{\sqrt{1 - x^2}}.                                              (5.29)

Indeed, recall that T_n(x) = cos nθ, with x = cos θ. Then,

    \int_{-1}^{1} \frac{T_n(x) T_m(x)}{\sqrt{1 - x^2}} dx = \int_0^π cos nθ cos mθ dθ,    (5.30)

and since 2 cos nθ cos mθ = cos(m + n)θ + cos(m - n)θ, we get for m ≠ n

    \int_{-1}^{1} \frac{T_n(x) T_m(x)}{\sqrt{1 - x^2}} dx = \frac{1}{2} \Big[ \frac{\sin(n + m)θ}{n + m} + \frac{\sin(n - m)θ}{n - m} \Big]_0^π = 0.

Consequently,

    \int_{-1}^{1} \frac{T_n(x) T_m(x)}{\sqrt{1 - x^2}} dx =  0    if m ≠ n,
                                                          =  π/2  if m = n > 0,            (5.31)
                                                          =  π    if m = n = 0.

5.4 Discrete Least Squares Approximation


Suppose that we are given the data (x1 , f1 ), (x2 , f2 ), · · · , (xN , fN ) obtained
from an experiment. Can we find a simple function that appropriately fits
these data? Suppose that empirically we determine that there is an approx-
imate linear behavior between fj and xj , j = 1, . . . , N . What is the straight
line y = a_0 + a_1 x that best fits these data? The answer depends on how we measure the error, i.e. on which norm we use for the deviations f_j - (a_0 + a_1 x_j). The most convenient measure is the square of the 2-norm (Euclidean norm) because we will end up with a linear system of equations for a_0 and a_1. Other norms would yield a nonlinear system for the unknown parameters. So the problem is: find a_0, a_1 which minimize

    J(a_0, a_1) = \sum_{j=1}^{N} [f_j - (a_0 + a_1 x_j)]^2.                       (5.32)

We can repeat all the Least Squares Approximation theory that we have seen at the continuum level, except that integrals are replaced by sums. The conditions for the minimum,

    \frac{\partial J(a_0, a_1)}{\partial a_0} = 2 \sum_{j=1}^{N} [f_j - (a_0 + a_1 x_j)](-1) = 0,       (5.33)
    \frac{\partial J(a_0, a_1)}{\partial a_1} = 2 \sum_{j=1}^{N} [f_j - (a_0 + a_1 x_j)](-x_j) = 0,     (5.34)

produce the Normal Equations:

    a_0 \sum_{j=1}^{N} 1 + a_1 \sum_{j=1}^{N} x_j = \sum_{j=1}^{N} f_j,            (5.35)
    a_0 \sum_{j=1}^{N} x_j + a_1 \sum_{j=1}^{N} x_j^2 = \sum_{j=1}^{N} x_j f_j.    (5.36)

Approximation by a higher order polynomial

pn (x) = a0 + a1 x + · · · + an xn , n<N −1

is similarly done. In practice we would like n << N to avoid overfitting. We


find the a0 , a1 , ..., an which minimize
    J(a_0, a_1, ..., a_n) = \sum_{j=1}^{N} [f_j - (a_0 + a_1 x_j + · · · + a_n x_j^n)]^2.    (5.37)

Proceeding as in the continuum case, we get the Normal Equations

    \sum_{k=0}^{n} a_k \sum_{j=1}^{N} x_j^{k+m} = \sum_{j=1}^{N} x_j^m f_j,   m = 0, 1, ..., n.    (5.38)

That is,

    a_0 \sum_{j=1}^{N} x_j^0 + a_1 \sum_{j=1}^{N} x_j^1 + · · · + a_n \sum_{j=1}^{N} x_j^n     = \sum_{j=1}^{N} x_j^0 f_j,
    a_0 \sum_{j=1}^{N} x_j^1 + a_1 \sum_{j=1}^{N} x_j^2 + · · · + a_n \sum_{j=1}^{N} x_j^{n+1} = \sum_{j=1}^{N} x_j^1 f_j,
    ...
    a_0 \sum_{j=1}^{N} x_j^n + a_1 \sum_{j=1}^{N} x_j^{n+1} + · · · + a_n \sum_{j=1}^{N} x_j^{2n} = \sum_{j=1}^{N} x_j^n f_j.

The matrix of coefficients of this linear system is, as in the continuum case,
symmetric, positive definite, and highly sensitive to small perturbations in
the data. And again, if we want to increase the degree of the approximating
polynomial we cannot reuse the coefficients we already computed for the lower
order polynomial. But now we know how to go around these two problems:
we use orthogonality.
Let us define the (weighted) discrete inner product as

    ⟨f, g⟩_N = \sum_{j=1}^{N} f_j g_j ω_j,                                        (5.39)

where ωj > 0, for j = 1, . . . , N are given weights. Let {φ0 , ..., φn } be a set
of polynomials such that φk is of degree exactly k. Then the solution to the
discrete Least Squares Approximation problem by a polynomial of degree ≤ n can be written as

    p_n(x) = a_0 φ_0(x) + a_1 φ_1(x) + · · · + a_n φ_n(x)                          (5.40)

and the square of the error is given by

    J(a_0, · · · , a_n) = \sum_{j=1}^{N} [f_j - (a_0 φ_0(x_j) + · · · + a_n φ_n(x_j))]^2 ω_j.    (5.41)

Consequently, the normal equations are

    \sum_{k=0}^{n} a_k ⟨φ_k, φ_l⟩_N = ⟨φ_l, f⟩_N,   l = 0, 1, ..., n.              (5.42)

If {φ_0, · · · , φ_n} are orthogonal with respect to the inner product ⟨·, ·⟩_N, i.e. if ⟨φ_k, φ_l⟩_N = 0 for k ≠ l, then the coefficients of the Least Squares Approximation are given by

    a_k = \frac{⟨φ_k, f⟩_N}{⟨φ_k, φ_k⟩_N},   k = 0, 1, ..., n,                      (5.43)

and p_n(x) = a_0 φ_0(x) + a_1 φ_1(x) + · · · + a_n φ_n(x).
If the {φ_0, · · · , φ_n} are not orthogonal, we can produce an orthogonal set {ψ_0, · · · , ψ_n} using the three-term recursion formula adapted to the discrete inner product. We have

    ψ_0(x) ≡ 1,                                                                    (5.44)
    ψ_1(x) = x - α_0,   α_0 = \frac{⟨x ψ_0, ψ_0⟩_N}{⟨ψ_0, ψ_0⟩_N},                  (5.45)

and for k = 1, ..., n

    ψ_{k+1}(x) = (x - α_k) ψ_k(x) - β_k ψ_{k-1}(x),                                (5.46)
    α_k = \frac{⟨x ψ_k, ψ_k⟩_N}{⟨ψ_k, ψ_k⟩_N},   β_k = \frac{⟨ψ_k, ψ_k⟩_N}{⟨ψ_{k-1}, ψ_{k-1}⟩_N}.    (5.47)

Then,

    a_j = \frac{⟨ψ_j, f⟩_N}{⟨ψ_j, ψ_j⟩_N},   j = 0, ..., n,                         (5.48)

and the Least Squares Approximation is p_n(x) = a_0 ψ_0(x) + a_1 ψ_1(x) + · · · + a_n ψ_n(x).

Example 5.4. Suppose we are given the data x_j : 0, 1, 2, 3 and f_j : 1.1, 3.2, 5.1, 6.9 and we would like to fit a line to them. The normal equations are

    a_0 \sum_{j=1}^{4} 1 + a_1 \sum_{j=1}^{4} x_j = \sum_{j=1}^{4} f_j,            (5.49)
    a_0 \sum_{j=1}^{4} x_j + a_1 \sum_{j=1}^{4} x_j^2 = \sum_{j=1}^{4} x_j f_j,    (5.50)

and performing the sums we have

    4 a_0 + 6 a_1 = 16.3,                                                          (5.51)
    6 a_0 + 14 a_1 = 34.1.                                                         (5.52)

Solving this 2 × 2 linear system we get a_0 = 1.18 and a_1 = 1.93. Thus, the Least Squares Approximation is

    p_1(x) = 1.18 + 1.93 x

and the square of the error is

    J(1.18, 1.93) = \sum_{j=1}^{4} [f_j - (1.18 + 1.93 x_j)]^2 = 0.023.
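A quick sketch of this example with NumPy, assembling and solving the 2 × 2 normal equations (5.49)-(5.50) directly:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
f = np.array([1.1, 3.2, 5.1, 6.9])

A = np.array([[len(x), x.sum()], [x.sum(), (x**2).sum()]])
b = np.array([f.sum(), (x * f).sum()])
a0, a1 = np.linalg.solve(A, b)
print(a0, a1)                              # 1.18, 1.93
print(np.sum((f - (a0 + a1 * x))**2))      # about 0.023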

Example 5.5. Fitting to an exponential y = b e^{ax}. Defining

    J(a, b) = \sum_{j=1}^{N} [f_j - b e^{a x_j}]^2

we get the conditions for a and b:

    \frac{\partial J}{\partial a} = 2 \sum_{j=1}^{N} [f_j - b e^{a x_j}](-b x_j e^{a x_j}) = 0,    (5.53)
    \frac{\partial J}{\partial b} = 2 \sum_{j=1}^{N} [f_j - b e^{a x_j}](-e^{a x_j}) = 0,          (5.54)

which is a nonlinear system of equations. However, if we take the natural log of y = b e^{ax} we have ln y = ln b + a x. Defining B = ln b, the problem becomes linear in B and a. Tabulating (x_j, ln f_j) we can obtain the normal equations

    B \sum_{j=1}^{N} 1 + a \sum_{j=1}^{N} x_j = \sum_{j=1}^{N} ln f_j,             (5.55)
    B \sum_{j=1}^{N} x_j + a \sum_{j=1}^{N} x_j^2 = \sum_{j=1}^{N} x_j ln f_j,     (5.56)

and solve this linear system for B and a. Then b = e^B.
If a is given and we only need to determine b, then the problem is linear. From (5.54) we have

    b \sum_{j=1}^{N} e^{2 a x_j} = \sum_{j=1}^{N} f_j e^{a x_j}   ⇒   b = \frac{\sum_{j=1}^{N} f_j e^{a x_j}}{\sum_{j=1}^{N} e^{2 a x_j}}.
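A minimal sketch of the log-linearized fit, using made-up data (the data and the use of np.polyfit are assumptions for illustration, not from the text):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
f = 2.0 * np.exp(1.5 * x) * (1.0 + 0.01 * rng.standard_normal(20))

# Fit ln f = B + a x by linear least squares (slope is returned first), then b = exp(B).
a, B = np.polyfit(x, np.log(f), 1)
b = np.exp(B)
print(a, b)                                # close to 1.5 and 2.0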

Example 5.6. Discrete orthogonal polynomials. Let us construct the first few orthogonal polynomials with respect to the discrete inner product with ω ≡ 1 and x_j = j/N, j = 1, ..., N. Here N = 10 (the points are equidistributed in [0, 1]). We have ψ_0(x) = 1 and ψ_1(x) = x - α_0, where

    α_0 = \frac{⟨x ψ_0, ψ_0⟩_N}{⟨ψ_0, ψ_0⟩_N} = \frac{\sum_{j=1}^{N} x_j}{\sum_{j=1}^{N} 1} = 0.55,

and hence ψ_1(x) = x - 0.55. Now

    ψ_2(x) = (x - α_1) ψ_1(x) - β_1 ψ_0(x),                                        (5.57)
    α_1 = \frac{⟨x ψ_1, ψ_1⟩_N}{⟨ψ_1, ψ_1⟩_N} = \frac{\sum_{j=1}^{N} x_j (x_j - 0.55)^2}{\sum_{j=1}^{N} (x_j - 0.55)^2} = 0.55,    (5.58)
    β_1 = \frac{⟨ψ_1, ψ_1⟩_N}{⟨ψ_0, ψ_0⟩_N} = 0.0825.                               (5.59)

Therefore ψ_2(x) = (x - 0.55)^2 - 0.0825. We can now use these orthogonal polynomials to find the Least Squares Approximation, by a polynomial of degree at most two, of a given set of data. Let us take f_j = x_j^2 + 2 x_j + 3. Clearly, the Least Squares Approximation should be p_2(x) = x^2 + 2x + 3. Let us confirm this by using the orthogonal polynomials ψ_0, ψ_1 and ψ_2. The Least Squares Approximation coefficients are given by

    a_0 = \frac{⟨f, ψ_0⟩_N}{⟨ψ_0, ψ_0⟩_N} = 4.485,                                  (5.60)
    a_1 = \frac{⟨f, ψ_1⟩_N}{⟨ψ_1, ψ_1⟩_N} = 3.1,                                    (5.61)
    a_2 = \frac{⟨f, ψ_2⟩_N}{⟨ψ_2, ψ_2⟩_N} = 1,                                      (5.62)

which gives p_2(x) = (x - 0.55)^2 - 0.0825 + 3.1(x - 0.55) + 4.485 = x^2 + 2x + 3.

5.5 High-dimensional Data Fitting


In many applications each data point contains many variables. For example,
a value for each pixel in an image, or clinical measurements of a patient, etc.
We can put all these variables in a vector x ∈ Rd for d ≥ 1. Associated with
x there is a scalar quantity f that can be measured or computed so that
our data set consists of the points (xj , fj ), where xj ∈ Rd and fj ∈ R, for
j = 1, . . . , N .
A central problem in machine learning is that of predicting f from a
given large, high-dimensional dataset; this is called supervised learning. The
simplest and most commonly used approach is to postulate a linear relation
f (x) = a0 + aT x (5.63)
and determine the bias coefficient a0 and the vector a = [a1 , . . . , ad ]T as a
least squares solution, i.e. such that they minimize
    \sum_{j=1}^{N} [f_j - (a_0 + a^T x_j)]^2.

We have studied in detail the case d = 1. Here we are interested in the case
d >> 1.
If we add an extra component, equal to 1, to each data vector xj so that
now xj = [1, xj1 , . . . , xjd ]T , for j = 1, . . . , N , then we can write (5.63) as
f (x) = aT x (5.64)
and the dimension d is increased by one. Then, we are seeking a vector a ∈ R^d that minimizes

    J(a) = \sum_{j=1}^{N} [f_j - a^T x_j]^2.                                      (5.65)

Putting the data x_j as the rows of an N × d (N ≥ d) matrix X and the f_j as the components of a (column) vector f, i.e.

    X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}   and   f = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{pmatrix},    (5.66)

we can write (5.65) as

    J(a) = (f - Xa)^T (f - Xa) = \|f - Xa\|^2.                                    (5.67)

The normal equations are given by the condition ∇a J(a) = 0. Since ∇a J(a) =
−2X T f + 2X T Xa, we get the linear system of equations

X T Xa = X T f . (5.68)

Every solution of the least square problem is necessarily a solution of the


normal equations. We will prove that the converse is also true and that the
solutions have a geometric characterization.
Let W be the linear space spanned by the columns of X. Clearly, W ⊆ R^N. Then, the least squares problem is equivalent to minimizing \|f - w\|^2

among all vectors w in W . There is always at least one solution, which can
be obtained by projecting f onto W , as Fig. 5.2 illustrates. First, note that if
a ∈ Rd is a solution of the normal equations (5.68) then the residual f − Xa
is orthogonal to W because

X T (f − Xa) = X T f − X T Xa = 0 (5.69)

and a vector r ∈ RN is orthogonal to W if it is orthogonal to each column


of X, i.e. X^T r = 0. Let a* be a solution of the normal equations, let r = f - Xa*, and for arbitrary a ∈ R^d, let s = Xa - Xa*. Then, we have

    \|f - Xa\|^2 = \|f - Xa* - (Xa - Xa*)\|^2 = \|r - s\|^2.                       (5.70)


[Figure 5.2: Geometric interpretation of the solution Xa of the Least Squares problem as the orthogonal projection of f on the approximating linear subspace W.]

But r and s are orthogonal. Therefore,

    \|r - s\|^2 = \|r\|^2 + \|s\|^2 ≥ \|r\|^2                                      (5.71)

and so we have proved that

    \|f - Xa\|^2 ≥ \|f - Xa*\|^2                                                   (5.72)

for arbitrary a ∈ R^d, i.e. a* minimizes \|f - Xa\|^2.
If the columns of X are linearly independent, i.e. if for every a ≠ 0 we have that Xa ≠ 0, then the d × d matrix X^T X is positive definite and hence nonsingular. Therefore, in this case, there is a unique solution to the least squares problem min_a \|f - Xa\|^2, given by

    a* = (X^T X)^{-1} X^T f.                                                       (5.73)

The d × N matrix

    X^† = (X^T X)^{-1} X^T                                                         (5.74)

is called the pseudoinverse of the N × d matrix X. Note that if X were square and nonsingular, X^† would coincide with the inverse, X^{-1}.
As we have done in the other least squares problems we have seen so far,
rather than working with the normal equations, whose matrix X T X may be
very sensitive to perturbations in the data, we use an orthogonal basis for the
approximating subspace (W in this case) to find a solution. While in principle
this can be done by applying the Gram-Schmidt process to the columns of

X, this is a numerically unstable procedure; when two columns are nearly linearly dependent, errors introduced by the finite precision representation of computer numbers can be greatly amplified during the Gram-Schmidt process and produce vectors that are not orthogonal. A re-orthogonalization step can
and render vectors which are not orthogonal. A re-orthogonalization step can
be introduced in the Gram-Schmidt algorithm to remedy this problem at the
price of doubling the computational cost. A more efficient method using a
sequence of orthogonal transformations, known as Householder reflections, is
usually preferred. Once this orthonormalization process is completed we get
a QR factorization of X

X = QR, (5.75)

where Q is an N × N orthogonal matrix, i.e. Q^T Q = I, and R is an N × d upper triangular matrix

    R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}.                                    (5.76)

Here R_1 is a d × d upper triangular matrix and the zero stands for an (N - d) × d zero matrix.
Using X = QR we have

    \|f - Xa\|^2 = \|f - QRa\|^2 = \|Q^T f - Ra\|^2.                               (5.77)

Therefore, a least squares solution is obtained by solving the system Ra = Q^T f. Writing R in blocks we have

    \begin{pmatrix} R_1 \\ 0 \end{pmatrix} a = \begin{pmatrix} (Q^T f)_1 \\ (Q^T f)_2 \end{pmatrix},    (5.78)

so that the solution is found by solving the upper triangular system R_1 a = (Q^T f)_1 (R_1 is nonsingular if the columns of X are linearly independent). The last N - d equations, (Q^T f)_2 = 0, may be satisfied or not depending on f, but we have no control over them.
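A minimal sketch with random (assumed) data, using NumPy's reduced QR factorization (so Q is N × d and R plays the role of R_1 above), compared against the normal equations solution:

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
X = rng.standard_normal((N, d))
f = rng.standard_normal(N)

Q, R = np.linalg.qr(X)                     # reduced QR: Q is N x d, R is d x d
a_qr = np.linalg.solve(R, Q.T @ f)         # solve R a = Q^T f

a_ne = np.linalg.solve(X.T @ X, X.T @ f)   # normal equations, for comparison
print(np.max(np.abs(a_qr - a_ne)))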
Chapter 6

Computer Arithmetic

6.1 Floating Point Numbers


Floating point numbers are based on scientific notation in binary (base 2).
For example

    (1.0101)_2 × 2^2 = (1·2^0 + 0·2^{-1} + 1·2^{-2} + 0·2^{-3} + 1·2^{-4}) × 2^2 = (1 + 1/4 + 1/16) × 4 = 5.25_{10}.
We can write any non-zero real number x in normalized, binary, scientific
notation as

    x = ±S × 2^E,   1 ≤ S < 2,                                                     (6.1)

where S is called the significand (or mantissa) and E is the exponent. In general S is an infinite binary expansion of the form

    S = (1.b_1 b_2 · · ·)_2.                                                        (6.2)

In a computer, a real number is represented in scientific notation but


using a finite number of binary digits (bits). We call these floating point
numbers. In single precision (SP), floating point numbers are stored in 32-
bit words whereas in double precision (DP), used in most scientific computing
applications, a 64-bit word is employed: 1 bit is used for the sign, 52 bits for
S, and 11 bits for E. This memory limits produce a large but finite set of
floating point numbers which can be represented in a computer. Moreover,
the floating points numbers are not uniformly distributed!


The maximum exponent possible in DP would be 2^{11} = 2048, but this is shifted to allow representation of both small and large numbers, so that we actually have E_min = -1022 and E_max = 1023. Consequently, the smallest and largest floating point numbers (in absolute value) which can be represented in DP are

    N_min = \min_{x ∈ DP} |x| = 2^{-1022} ≈ 2.2 × 10^{-308},                        (6.3)
    N_max = \max_{x ∈ DP} |x| = (1.11...1)_2 · 2^{1023} = (2 - 2^{-52}) · 2^{1023} ≈ 1.8 × 10^{308}.    (6.4)

If in the course of a computation a number is produced which is bigger than


Nmax we get an overflow error and the computation would halt. If the number
is less than Nmin (in absolute value) then an underflow error occurs.

6.2 Rounding and Machine Precision


To represent a real number x as a floating point number, rounding has to be performed to retain only the number of binary bits allowed in the significand. Let x ∈ R and let its binary expansion be x = ±(1.b_1 b_2 · · ·)_2 × 2^E.
One way to approximate x to a floating number with d bits in the signif-
icant is to truncate or chop discarding all the bits after bd , i.e.

x∗ = chop(x) = ±(1.b1 b2 · · · bd )2 × 2E . (6.5)

In double precision d = 52.


A better way to approximate to a floating point number is to do rounding
up or down (to the nearest floating point number), just as we do when we
round in base 10. In binary, rounding is simpler because bd+1 can only be 0
(we round down) or 1 (we round up). We can write this type of rounding in
terms of the chopping described above as

x∗ = round(x) = chop(x + 2−(d+1) × 2E ). (6.6)

Definition 6.1. Given an approximation x* to x, the absolute error is defined by |x - x*| and the relative error by |x - x*|/|x|, x ≠ 0.

The relative error is generally more meaningful than the absolute error
to measure a given approximation.
The relative error in chopping and in rounding (called a round-off error) is

    \frac{|x - chop(x)|}{|x|} ≤ \frac{2^{-d}\, 2^E}{(1.b_1 b_2 · · ·)_2\, 2^E} ≤ 2^{-d},    (6.7)
    \frac{|x - round(x)|}{|x|} ≤ \frac{1}{2} 2^{-d}.                                        (6.8)

The number 2^{-d} is called the machine precision or epsilon (eps). In double precision eps = 2^{-52} ≈ 2.22 × 10^{-16}. The smallest double precision number greater than 1 is 1 + eps. As we will see below, it is more convenient to write (6.8) as

    round(x) = x(1 + δ),   |δ| ≤ eps.                                               (6.9)
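A quick check of the double precision parameters quoted above, assuming NumPy:

import numpy as np

fi = np.finfo(np.float64)
print(fi.eps)          # 2**-52, about 2.22e-16
print(fi.tiny)         # smallest positive normalized number, about 2.2e-308
print(fi.max)          # about 1.8e308
print(1.0 + fi.eps > 1.0, 1.0 + fi.eps / 4 > 1.0)   # True, False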

6.3 Correctly Rounded Arithmetic


Computers today follow the IEEE standard for floating point representation
and arithmetic. This standard requires a consistent floating point represen-
tation of numbers across computers and correctly rounded arithmetic.
In correctly rounded arithmetic, the computer operations of addition, sub-
traction, multiplication, and division are the correctly rounded value of the
exact result. For example, if x and y are floating point numbers and ⊕ is the
machine addition, then
x ⊕ y = round(x + y) = (x + y)(1 + δ+ ), |δ+ | ≤ eps, (6.10)
and similarly for ⊖, ⊗, and ⊘.
One important interpretation of (6.10) is the following. Assuming x + y ≠ 0, write

    δ_+ = \frac{1}{x + y} [δ_x + δ_y].

Then

    x ⊕ y = (x + y) \Big( 1 + \frac{1}{x + y}(δ_x + δ_y) \Big) = (x + δ_x) + (y + δ_y).    (6.11)

The computer ⊕ is giving the exact result but for slightly perturbed data.
This interpretation is the basis for Backward Error Analysis, which is
used to study how round-off errors propagate in a numerical algorithm.
6.4 Propagation of Errors and Cancellation of Digits
Let f l(x) and f l(y) denote the floating point approximation of x and y,
respectively, and assume that their product is computed exactly, i.e

f l(x) · f l(y) = x(1 + δx ) · y(1 + δy ) = x · y(1 + δx + δy + δx δy ) ≈ x · y(1 + δx + δy ),

where |δ_x|, |δ_y| ≤ eps. Therefore, for the relative error we get

    \frac{|x·y - fl(x)·fl(y)|}{|x·y|} ≈ |δ_x + δ_y|,                                (6.12)

which is acceptable.
Let us now consider addition (or subtraction):

    fl(x) + fl(y) = x(1 + δ_x) + y(1 + δ_y) = x + y + x δ_x + y δ_y
                  = (x + y) \Big( 1 + \frac{x}{x + y} δ_x + \frac{y}{x + y} δ_y \Big).

The relative error is

    \frac{x + y - (fl(x) + fl(y))}{x + y} = \frac{x}{x + y} δ_x + \frac{y}{x + y} δ_y.    (6.13)

If x and y have the same sign then x/(x + y) and y/(x + y) are both positive and bounded by 1. Therefore the relative error is less than |δ_x + δ_y|, which is fine. But if x and y have different signs and are close in magnitude, the error can be greatly amplified because |x/(x + y)| and |y/(x + y)| can be very large.
Example 6.1. Suppose we have 10 bits of precision and

x = (1.01011100 ∗ ∗)2 × 2E ,
y = (1.01011000 ∗ ∗)2 × 2E ,

where the ∗ stands for inaccurate bits (i.e. garbage) that say were generated in
previous floating point computations. Then, in this 10 bit precision arithmetic

z = x − y = (1.00 ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗)2 × 2E−6 . (6.14)

We end up with only 2 bits of accuracy in z. Any further computations using


z will result in an accuracy of 2 bits or lower!
Example 6.2. Sometimes we can rewrite the difference of two very close numbers to avoid digit cancellation. For example, suppose we would like to compute

    y = \sqrt{1 + x} - 1

for x > 0 and very small. Clearly, we will have loss of digits if we proceed directly. However, if we rewrite y as

    y = (\sqrt{1 + x} - 1) \frac{\sqrt{1 + x} + 1}{\sqrt{1 + x} + 1} = \frac{x}{\sqrt{1 + x} + 1},

then the computation can be performed at nearly machine precision level.
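A short numerical sketch of this example (assuming NumPy):

import numpy as np

x = 1e-12
naive = np.sqrt(1.0 + x) - 1.0           # suffers catastrophic cancellation
stable = x / (np.sqrt(1.0 + x) + 1.0)    # algebraically equivalent, stable
print(naive, stable)                     # stable is about 5e-13 to full precision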
Chapter 7

Numerical Differentiation

7.1 Finite Differences


Suppose f is a differentiable function and we’d like to approximate f 0 (x0 )
given the value of f at x0 and at neighboring points x1 , x2 , ..., xn . We could
approximate f by its interpolating polynomial pn at those points and use
f 0 (x0 ) ≈ p0n (x0 ). There are several other possibilities. For example, we can
approximate f 0 (x0 ) by the derivative of the cubic spline of f evaluated at x0 ,
or by the derivative of the Least Squares Chebyshev expansion of f :
    f'(x_0) ≈ \sum_{j=1}^{n} a_j T_j'(x_0),

etc. We are going to focus here on simple, finite difference formulas obtained
by differentiating low order interpolating polynomials.
Assuming x, x_0, ..., x_n ∈ [a, b] and f ∈ C^{n+1}[a, b], we have

    f(x) = p_n(x) + \frac{1}{(n+1)!} f^{(n+1)}(ξ(x))\, ω_n(x),                      (7.1)

for some ξ(x) ∈ (a, b) and

    ω_n(x) = (x - x_0)(x - x_1) · · · (x - x_n).                                     (7.2)

Thus,

    f'(x_0) = p_n'(x_0) + \frac{1}{(n+1)!} \Big[ \frac{d}{dx} f^{(n+1)}(ξ(x))\, ω_n(x) + f^{(n+1)}(ξ(x))\, ω_n'(x) \Big]_{x = x_0}.

But ω_n(x_0) = 0 and ω_n'(x_0) = (x_0 - x_1) · · · (x_0 - x_n), thus

    f'(x_0) = p_n'(x_0) + \frac{1}{(n+1)!} f^{(n+1)}(ξ(x))(x_0 - x_1) · · · (x_0 - x_n).    (7.3)
Example 7.1. Take n = 1 and x_1 = x_0 + h (h > 0). In Newton's form

    p_1(x) = f(x_0) + \frac{f(x_0 + h) - f(x_0)}{h} (x - x_0),                      (7.4)

and p_1'(x_0) = [f(x_0 + h) - f(x_0)]/h. We obtain the so-called Forward Difference Formula for approximating f'(x_0):

    D_h^+ f(x_0) := \frac{f(x_0 + h) - f(x_0)}{h}.                                  (7.5)

From (7.3) the error in this approximation is

    f'(x_0) - D_h^+ f(x_0) = \frac{1}{2!} f''(ξ)(x_0 - x_1) = -\frac{1}{2} f''(ξ) h.    (7.6)
Example 7.2. Take again n = 1 but now x_1 = x_0 - h. Then p_1'(x_0) = [f(x_0) - f(x_0 - h)]/h and we get the so-called Backward Difference Formula for approximating f'(x_0):

    D_h^- f(x_0) := \frac{f(x_0) - f(x_0 - h)}{h}.                                  (7.7)

Its error is

    f'(x_0) - D_h^- f(x_0) = \frac{1}{2} f''(ξ) h.                                   (7.8)
Example 7.3. Let n = 2 and x_1 = x_0 - h, x_2 = x_0 + h. Then, p_2 in Newton's form is

    p_2(x) = f[x_1] + f[x_1, x_0](x - x_1) + f[x_1, x_0, x_2](x - x_1)(x - x_0).

Let us build the divided difference table:

    x_0 - h   f(x_0 - h)
                           [f(x_0) - f(x_0 - h)]/h
    x_0       f(x_0)                                  [f(x_0 + h) - 2f(x_0) + f(x_0 - h)]/(2h^2)
                           [f(x_0 + h) - f(x_0)]/h
    x_0 + h   f(x_0 + h)

Therefore,

    p_2'(x_0) = \frac{f(x_0) - f(x_0 - h)}{h} + \frac{f(x_0 + h) - 2f(x_0) + f(x_0 - h)}{2h^2} h

and thus

    p_2'(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h}.                                 (7.9)

This defines the Centered Difference Formula to approximate f'(x_0):

    D_h^0 f(x_0) := \frac{f(x_0 + h) - f(x_0 - h)}{2h}.                             (7.10)

Its error is

    f'(x_0) - D_h^0 f(x_0) = \frac{1}{3!} f'''(ξ)(x_0 - x_1)(x_0 - x_2) = -\frac{1}{6} f'''(ξ) h^2.    (7.11)
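A small sketch comparing the forward and centered formulas for f(x) = e^x at x_0 = 0, where f'(0) = 1 (assuming NumPy); the centered error decreases like h^2 and the forward error like h:

import numpy as np

f, x0 = np.exp, 0.0
for h in [1e-1, 1e-2, 1e-3]:
    fwd = (f(x0 + h) - f(x0)) / h
    ctr = (f(x0 + h) - f(x0 - h)) / (2 * h)
    print(h, abs(fwd - 1.0), abs(ctr - 1.0))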
Example 7.4. Let n = 2 and x_1 = x_0 + h, x_2 = x_0 + 2h. The table of divided differences is

    x_0        f(x_0)
                            [f(x_0 + h) - f(x_0)]/h
    x_0 + h    f(x_0 + h)                               [f(x_0 + 2h) - 2f(x_0 + h) + f(x_0)]/(2h^2)
                            [f(x_0 + 2h) - f(x_0 + h)]/h
    x_0 + 2h   f(x_0 + 2h)

and

    p_2'(x_0) = \frac{f(x_0 + h) - f(x_0)}{h} + \frac{f(x_0 + 2h) - 2f(x_0 + h) + f(x_0)}{2h^2} (-h),

thus

    p_2'(x_0) = \frac{-f(x_0 + 2h) + 4f(x_0 + h) - 3f(x_0)}{2h}.                    (7.12)

If we use this sided difference to approximate f'(x_0), the error is

    f'(x_0) - p_2'(x_0) = \frac{1}{3!} f'''(ξ)(x_0 - x_1)(x_0 - x_2) = \frac{1}{3} h^2 f'''(ξ),    (7.13)

which is twice as large (in absolute value) as that of the Centered Difference Formula.

7.2 The Effect of Round-off Errors


In numerical differentiation we take differences of values, which for small h,
could be very close to each other. As we know, this leads to loss of accuracy
because of finite precision floating point arithmetic. Consider for example
the centered difference formula. For simplicity let us suppose that h has an
exact floating point representation and that we make no rounding error when
doing the division by h. That is, suppose that the the only source of round-
off error is in the computation of the difference f (x0 + h) − f (x0 − h). Then
f (x0 +h) and f (x0 −h) are replaced by f (x0 +h)(1+δ+ ) and f (x0 −h)(1+δ− ),
respectively with |δ+ | ≤ eps and |δ− | ≤ eps. Then
    \frac{f(x_0 + h)(1 + δ_+) - f(x_0 - h)(1 + δ_-)}{2h} = \frac{f(x_0 + h) - f(x_0 - h)}{2h} + r_h,

where

    r_h = \frac{f(x_0 + h) δ_+ - f(x_0 - h) δ_-}{2h}.

Clearly, |r_h| ≤ (|f(x_0 + h)| + |f(x_0 - h)|) eps/(2h) ≈ |f(x_0)| eps/h. The approximation error or truncation error for the centered finite difference approximation is -(1/6) f'''(ξ) h^2. Thus, the total error E(h) can be approximately bounded by (1/6) h^2 M_3 + |f(x_0)| eps/h, where M_3 is a bound for |f'''| near x_0. The minimum of this bound occurs at h_0 such that E'(h_0) = 0, i.e.

    h_0 = \Big( \frac{3\, eps\, |f(x_0)|}{M_3} \Big)^{1/3} ≈ c\, eps^{1/3},          (7.14)

and E(h_0) = O(eps^{2/3}). We do not get machine precision!
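A small numerical sketch of this trade-off for f = exp at x_0 = 1 (assuming NumPy): the error decreases like h^2 until round-off, roughly eps/h, takes over near h of order eps^{1/3}, and then it grows again.

import numpy as np

f, x0, exact = np.exp, 1.0, np.exp(1.0)
for h in 10.0 ** -np.arange(1, 12):
    err = abs((f(x0 + h) - f(x0 - h)) / (2 * h) - exact)
    print(f"{h:.0e}  {err:.2e}")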
Higher order finite differences exacerbate the problem of digit cancella-
tion. When f can be extended to an analytic function in the complex plane,
Cauchy's integral formula can be used to evaluate the derivative:

    f'(z_0) = \frac{1}{2πi} \int_C \frac{f(z)}{(z - z_0)^2} dz,                     (7.15)

where C is a simple closed contour around z_0 and f is analytic on and inside C. Parametrizing C as a circle of radius r centered at z_0 we get

    f'(z_0) = \frac{1}{2πr} \int_0^{2π} f(z_0 + r e^{it}) e^{-it} dt.                (7.16)
The integrand is periodic and smooth, so the integral can be approximated with spectral accuracy using the composite trapezoidal rule.
Another approach to obtain finite difference formulas to approximate
derivatives is through Taylor expansions. For example,
    f(x_0 + h) = f(x_0) + f'(x_0) h + \frac{1}{2} f''(x_0) h^2 + \frac{1}{3!} f^{(3)}(x_0) h^3 + \frac{1}{4!} f^{(4)}(ξ_+) h^4,    (7.17)
    f(x_0 - h) = f(x_0) - f'(x_0) h + \frac{1}{2} f''(x_0) h^2 - \frac{1}{3!} f^{(3)}(x_0) h^3 + \frac{1}{4!} f^{(4)}(ξ_-) h^4,    (7.18)

where x_0 < ξ_+ < x_0 + h and x_0 - h < ξ_- < x_0. Subtracting (7.18) from (7.17) we have f(x_0 + h) - f(x_0 - h) = 2 f'(x_0) h + \frac{2}{3!} f'''(x_0) h^3 + · · · and therefore

    \frac{f(x_0 + h) - f(x_0 - h)}{2h} = f'(x_0) + c_2 h^2 + c_4 h^4 + · · ·         (7.19)

Similarly, if we add (7.17) and (7.18) we obtain f(x_0 + h) + f(x_0 - h) = 2 f(x_0) + f''(x_0) h^2 + c h^4 + · · · and consequently

    f''(x_0) = \frac{f(x_0 + h) - 2 f(x_0) + f(x_0 - h)}{h^2} + \tilde{c} h^2 + · · ·    (7.20)

The finite difference

    D_h^2 f(x_0) = \frac{f(x_0 + h) - 2 f(x_0) + f(x_0 - h)}{h^2}                    (7.21)

7.3 Richardson’s Extrapolation


From (7.19) we know that, asymptotically,

    D_h^0 f(x_0) = f'(x_0) + c_2 h^2 + c_4 h^4 + · · ·                               (7.22)

We can apply Richardson extrapolation once to obtain a fourth order approximation. Evaluating (7.22) at h/2 we get

    D_{h/2}^0 f(x_0) = f'(x_0) + \frac{1}{4} c_2 h^2 + \frac{1}{16} c_4 h^4 + · · ·   (7.23)

and multiplying this equation by 4, subtracting (7.22) from the result, and dividing by 3 we get

    D_h^{ext} f(x_0) := \frac{4 D_{h/2}^0 f(x_0) - D_h^0 f(x_0)}{3} = f'(x_0) + \tilde{c}_4 h^4 + · · ·    (7.24)

The method D_h^{ext} f(x_0) has order of convergence 4 for about twice the amount of work of D_h^0 f(x_0). Round-off errors are still O(eps/h) and the minimum total error occurs when O(h^4) is O(eps/h), i.e. when h = O(eps^{1/5}). The minimum error is thus O(eps^{4/5}) for D_h^{ext} f(x_0), about 10^{-14} in double precision with h = O(10^{-3}).
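A minimal sketch of the extrapolated formula (7.24) applied to f = sin at x_0 = 0.5 (assuming NumPy); the centered error behaves like h^2 and the extrapolated one like h^4 before round-off sets in:

import numpy as np

def D0(f, x0, h):
    return (f(x0 + h) - f(x0 - h)) / (2 * h)

def Dext(f, x0, h):
    return (4 * D0(f, x0, h / 2) - D0(f, x0, h)) / 3

f, x0, exact = np.sin, 0.5, np.cos(0.5)
for h in [1e-1, 1e-2, 1e-3]:
    print(h, abs(D0(f, x0, h) - exact), abs(Dext(f, x0, h) - exact))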
Chapter 8

Numerical Integration

We now revisit the problem of numerical integration that we used to introduce some principles of numerical analysis in Chapter 1.
The problem in question is to find accurate and efficient approximations to

    \int_a^b f(x) dx.
Numerical formulas to approximate a definite integral are called quadra-
tures and, as we saw in Chapter 1, they can be elementary (simple) or com-
posite.
We shall assume henceforth, unless otherwise noted, that the integrand
is sufficiently smooth.

8.1 Elementary Simpson Quadrature


The elementary trapezoidal rule quadrature was derived by replacing the integrand f by its linear interpolating polynomial p_1 at a and b, that is

    f(x) = p_1(x) + \frac{1}{2} f''(ξ)(x - a)(x - b),                                (8.1)

for some ξ between a and b, and thus

    \int_a^b f(x) dx = \int_a^b p_1(x) dx + \frac{1}{2} \int_a^b f''(ξ(x))(x - a)(x - b) dx
                     = \frac{1}{2}(b - a)[f(a) + f(b)] - \frac{1}{12} f''(η)(b - a)^3.    (8.2)

Thus, the approximation

    \int_a^b f(x) dx ≈ \frac{1}{2}(b - a)[f(a) + f(b)]                               (8.3)

has an error given by -\frac{1}{12} f''(η)(b - a)^3.
We can add an intermediate point, say x_m = (a + b)/2, and replace f by its quadratic interpolating polynomial p_2 with respect to the nodes a, x_m and b. For simplicity let us take [a, b] = [-1, 1]. With the simple change of variables

    x = \frac{1}{2}(a + b) + \frac{1}{2}(b - a) t,   t ∈ [-1, 1],                    (8.4)

we can obtain a quadrature formula for a general interval [a, b].
Let p2 be the interpolating polynomial of f at −1, 0, 1. The corresponding
divided difference table is:
    -1   f(-1)
                   f(0) - f(-1)
     0   f(0)                      [f(1) - 2f(0) + f(-1)]/2
                   f(1) - f(0)
     1   f(1)

Thus

    p_2(x) = f(-1) + [f(0) - f(-1)](x + 1) + \frac{f(1) - 2f(0) + f(-1)}{2}(x + 1)x.    (8.5)

Now, using the interpolation formula with the remainder expressed in terms of a divided difference (3.56), we have

    f(x) = p_2(x) + f[-1, 0, 1, x](x + 1)x(x - 1) = p_2(x) + f[-1, 0, 1, x]\, x(x^2 - 1).    (8.6)

Therefore

    \int_{-1}^{1} f(x) dx = \int_{-1}^{1} p_2(x) dx + \int_{-1}^{1} f[-1, 0, 1, x]\, x(x^2 - 1) dx
                          = 2f(-1) + 2[f(0) - f(-1)] + \frac{1}{3}[f(1) - 2f(0) + f(-1)] + E[f]
                          = \frac{1}{3}[f(-1) + 4f(0) + f(1)] + E[f],

where

    E[f] = \int_{-1}^{1} f[-1, 0, 1, x]\, x(x^2 - 1) dx                              (8.7)

is the error. Note that x(x2 − 1) changes sign in [−1, 1] so we cannot use the
Mean Value Theorem for integrals. However, if we add another node, x4 , we
can relate f [−1, 0, 1, x] to the fourth order divided difference f [−1, 0, 1, x4 , x],
which will make the integral in (8.7) easier to evaluate:

f [−1, 0, 1, x] = f [−1, 0, 1, x4 ] + f [−1, 0, 1, x4 , x](x − x4 ). (8.8)

This identity is just an application of Theorem 3.2. Using (8.8),

    E[f] = f[-1, 0, 1, x_4] \int_{-1}^{1} x(x^2 - 1) dx + \int_{-1}^{1} f[-1, 0, 1, x_4, x]\, x(x^2 - 1)(x - x_4) dx.

The first integral is zero because the integrand is odd. Now we choose x_4 symmetrically, x_4 = 0, so that x(x^2 - 1)(x - x_4) does not change sign in [-1, 1] and

    E[f] = \int_{-1}^{1} f[-1, 0, 1, 0, x]\, x^2 (x^2 - 1) dx = \int_{-1}^{1} f[-1, 0, 0, 1, x]\, x^2 (x^2 - 1) dx.    (8.9)
(8.9)

Now, using (3.58), there is ξ(x) ∈ (-1, 1) such that

    f[-1, 0, 0, 1, x] = \frac{f^{(4)}(ξ(x))}{4!},                                    (8.10)

and, assuming f ∈ C^4[-1, 1], by the Mean Value Theorem for integrals there is η ∈ (-1, 1) such that

    E[f] = \frac{f^{(4)}(η)}{4!} \int_{-1}^{1} x^2 (x^2 - 1) dx = -\frac{4}{15} \frac{f^{(4)}(η)}{4!} = -\frac{1}{90} f^{(4)}(η).    (8.11)

Summarizing, Simpson's elementary quadrature for the interval [-1, 1] is

    \int_{-1}^{1} f(x) dx = \frac{1}{3}[f(-1) + 4f(0) + f(1)] - \frac{1}{90} f^{(4)}(η).    (8.12)

Note that Simpson’s elementary quadrature gives the exact value of the
integral when f is polynomial of degree 3 or less (the error is proportional
to the fourth derivative), even though we used a second order polynomial to
approximate the integrand. We gain extra precision because of the symme-
try of the quadrature around 0. In fact, we could have derived Simpson’s
quadrature by using the Hermite (third order) interpolating polynomial of f
at −1, 0, 0, 1.
To obtain the corresponding formula for a general interval [a, b] we use the change of variables (8.4):

    \int_a^b f(x) dx = \frac{1}{2}(b - a) \int_{-1}^{1} F(t) dt,

where

    F(t) = f\Big( \frac{1}{2}(a + b) + \frac{1}{2}(b - a) t \Big),                   (8.13)

and noting that F^{(k)}(t) = (\frac{b - a}{2})^k f^{(k)}(x) we obtain Simpson's elementary rule on the interval [a, b]:

    \int_a^b f(x) dx = \frac{1}{6}(b - a) \Big[ f(a) + 4 f\Big(\frac{a + b}{2}\Big) + f(b) \Big] - \frac{1}{90} f^{(4)}(η) \Big( \frac{b - a}{2} \Big)^5.    (8.14)
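A minimal sketch of the elementary Simpson rule (8.14), applied to f(x) = e^x on [0, 1] (assuming NumPy):

import numpy as np

def simpson(f, a, b):
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))

exact = np.exp(1.0) - 1.0
print(simpson(np.exp, 0.0, 1.0), exact)    # error is about 6e-4, consistent with (8.14)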

8.2 Interpolatory Quadratures


The elementary trapezoidal and Simpson rules are examples of interpolatory
quadratures. This class of quadratures is obtained by selecting a set of nodes
x0 , x1 , . . . , xn in the interval of integration and by approximating the integral
by that of the interpolating polynomial pn of the integrand at these nodes.
By construction, such interpolatory quadrature is exact for polynomials of
degree up to n, at least. We just saw that Simpson rule is exact for polynomial
up to degree 3 and we used p2 in its construction. The “degree gain” was
due to the symmetric choice of the interpolation nodes. This leads us to two
important questions:

1. For a given n, how do we choose the nodes x0 , x1 , . . . , xn so that the


corresponding interpolation quadrature is exact for polynomials of the
highest degree k possible?

2. What is that k?

Because orthogonal polynomials (Section 5.3) play a central role in the answer to these questions, we will consider the more general problem of approximating

    I[f] = \int_a^b f(x) w(x) dx,                                                    (8.15)

where w is an admissible weight function (w ≥ 0, \int_a^b w(x) dx > 0, and \int_a^b x^k w(x) dx < +∞ for k = 0, 1, ...), w ≡ 1 being a particular case. The interval of integration [a, b] can be either finite or infinite (e.g. [0, +∞), (-∞, +∞)).

Definition 8.1. We say that a quadrature Q[f ] to approximate I[f ] has


degree of precision k if it is exact, i.e. I[P ] = Q[P ], for all polynomials P
of degree up to k but not exact for polynomials of degree k + 1. Equivalently,
a quadrature Q[f ] has degree of precision k if I[xm ] = Q[xm ], for m =
0, 1, . . . , k but I[xk+1 ] 6= Q[xk+1 ].

Example 8.1. The trapezoidal rule quadrature has degree of precision 1 while
the Simpson quadrature has degree of precision 3.

For a given set of nodes x0 , x1 , . . . , xn in [a, b], let pn be the interpolating


polynomial of f at these nodes. In Lagrange form we can write pn as (see
Section 3.1)

    p_n(x) = \sum_{j=0}^{n} f(x_j) l_j(x),                                           (8.16)

where

    l_j(x) = \prod_{k=0,\, k≠j}^{n} \frac{x - x_k}{x_j - x_k},   for j = 0, 1, ..., n,    (8.17)

are the elementary Lagrange polynomials. The corresponding interpolatory quadrature Q_n[f] to approximate I[f] is then given by

    Q_n[f] = \sum_{j=0}^{n} A_j f(x_j),   A_j = \int_a^b l_j(x) w(x) dx,   for j = 0, 1, ..., n.    (8.18)

Theorem 8.1. The degree of precision of the interpolatory quadrature (8.18) is less than 2n + 2.

Proof. Suppose the degree of precision k of (8.18) is greater than or equal to 2n + 2. Take f(x) = (x - x_0)^2 (x - x_1)^2 · · · (x - x_n)^2. This is a polynomial of degree exactly 2n + 2. Then

    \int_a^b f(x) w(x) dx = \sum_{j=0}^{n} A_j f(x_j) = 0,                           (8.19)

and on the other hand

    \int_a^b f(x) w(x) dx = \int_a^b (x - x_0)^2 · · · (x - x_n)^2 w(x) dx > 0,       (8.20)

which is a contradiction. Therefore k < 2n + 2.

8.3 Gaussian Quadratures


We will now show that there is a choice of nodes x0 , x1 , ..., xn which yields
the optimal degree of precision 2n + 1 for an interpolatory quadrature. The
corresponding quadratures are called Gaussian quadratures. To define them
we recall that the k-th orthogonal polynomial ψ_k with respect to the inner product

    ⟨f, g⟩ = \int_a^b f(x) g(x) w(x) dx                                              (8.21)

satisfies ⟨ψ_k, q⟩ = 0 for all polynomials q of degree less than k. Recall also that
the zeros of the orthogonal polynomials are real, simple, and contained in
[a, b] (see Theorem 5.1).
Definition 8.2. Let ψn+1 be the (n + 1)st orthogonal polynomial and let
x0 , x1 , ..., xn be its n + 1 zeros. Then the interpolatory quadrature (8.18) with
the nodes so chosen is called a Gaussian quadrature.
Theorem 8.2. The interpolatory quadrature (8.18) has degree of precision k = 2n + 1 if and only if it is a Gaussian quadrature.

Proof. Let f be a polynomial of degree ≤ 2n + 1. Then, we can write

    f(x) = q(x) ψ_{n+1}(x) + r(x),                                                   (8.22)

where q and r are polynomials of degree ≤ n. Now

    \int_a^b f(x) w(x) dx = \int_a^b q(x) ψ_{n+1}(x) w(x) dx + \int_a^b r(x) w(x) dx.    (8.23)

The first integral on the right hand side is zero because of orthogonality. For the second integral the quadrature is exact (it is interpolatory). Therefore

    \int_a^b f(x) w(x) dx = \sum_{j=0}^{n} A_j r(x_j).                                (8.24)

Moreover, r(x_j) = f(x_j) - q(x_j) ψ_{n+1}(x_j) = f(x_j) for all j = 0, 1, ..., n, because the nodes x_j are the zeros of ψ_{n+1}. Thus,

    \int_a^b f(x) w(x) dx = \sum_{j=0}^{n} A_j f(x_j).                                (8.25)

This proves that the Gaussian quadrature has degree of precision k = 2n + 1.
Now suppose that the interpolatory quadrature (8.18) has maximal degree of precision 2n + 1. Take f(x) = p(x)(x - x_0)(x - x_1) · · · (x - x_n), where p is a polynomial of degree ≤ n. Then, f is a polynomial of degree ≤ 2n + 1 and

    \int_a^b f(x) w(x) dx = \int_a^b p(x)(x - x_0) · · · (x - x_n) w(x) dx = \sum_{j=0}^{n} A_j f(x_j) = 0.

Therefore, the polynomial (x - x_0)(x - x_1) · · · (x - x_n) of degree n + 1 is orthogonal to all polynomials of degree ≤ n. Thus, it is a multiple of ψ_{n+1}.

Example 8.2. Consider the interval [-1, 1] and the weight function w ≡ 1. The corresponding orthogonal polynomials are the Legendre polynomials 1, x, x^2 - 1/3, x^3 - (3/5)x, · · ·. Take n = 1. The roots of ψ_2 are x_0 = -\sqrt{1/3} and x_1 = \sqrt{1/3}. Therefore, the corresponding Gaussian quadrature is

    \int_{-1}^{1} f(x) dx ≈ A_0 f(-\sqrt{1/3}) + A_1 f(\sqrt{1/3}),                   (8.26)

where

    A_0 = \int_{-1}^{1} l_0(x) dx,                                                    (8.27)
    A_1 = \int_{-1}^{1} l_1(x) dx.                                                    (8.28)

We can evaluate these integrals directly or use the method of undetermined coefficients to find A_0 and A_1. The latter is generally easier and we illustrate it now. Using that the quadrature is exact for 1 and x we have

    2 = \int_{-1}^{1} 1 dx = A_0 + A_1,                                               (8.29)
    0 = \int_{-1}^{1} x dx = -A_0 \sqrt{1/3} + A_1 \sqrt{1/3}.                        (8.30)

Solving this 2 × 2 linear system we get A_0 = A_1 = 1. So the Gaussian quadrature for n = 1 in [-1, 1] is

    Q_1[f] = f(-\sqrt{1/3}) + f(\sqrt{1/3}).                                          (8.31)

Let us compare this quadrature to the elementary trapezoidal rule. Take f(x) = x^2. The trapezoidal rule, T[f], gives

    T[x^2] = \frac{2}{2}[f(-1) + f(1)] = 2,                                           (8.32)

whereas the Gaussian quadrature Q_1[f] yields the exact result:

    Q_1[x^2] = (-\sqrt{1/3})^2 + (\sqrt{1/3})^2 = \frac{2}{3}.                        (8.33)
Example 8.3. Let us take again the interval [-1, 1] but now w(x) = 1/\sqrt{1 - x^2}. As we know (see Section 2.4), ψ_{n+1} = T_{n+1}, i.e. the Chebyshev polynomial of degree n + 1. Its zeros are x_j = cos\big( \frac{2j + 1}{2(n + 1)} π \big) for j = 0, ..., n. For n = 1 we have

    cos\Big(\frac{π}{4}\Big) = \sqrt{1/2},    cos\Big(\frac{3π}{4}\Big) = -\sqrt{1/2}.    (8.34)

We can use again the method of undetermined coefficients to find A_0 and A_1:

    π = \int_{-1}^{1} \frac{1}{\sqrt{1 - x^2}} dx = A_0 + A_1,                        (8.35)
    0 = \int_{-1}^{1} \frac{x}{\sqrt{1 - x^2}} dx = -A_0 \sqrt{1/2} + A_1 \sqrt{1/2},    (8.36)

which give A_0 = A_1 = π/2. Thus, the corresponding Gaussian quadrature to approximate \int_{-1}^{1} f(x) \frac{1}{\sqrt{1 - x^2}} dx is

    Q_1[f] = \frac{π}{2} \Big[ f(-\sqrt{1/2}) + f(\sqrt{1/2}) \Big].                   (8.37)

8.3.1 Convergence of Gaussian Quadratures


Let f ∈ C[a, b] and consider the interpolatory quadrature (8.18). Can we guarantee that the error converges to zero as n → ∞, i.e.,

    \int_a^b f(x) w(x) dx - \sum_{j=0}^{n} A_j f(x_j) → 0,   as n → ∞ ?

The answer is no. As we know, convergence of the interpolating polynomial to f depends on the smoothness of f and on the distribution of the interpolation nodes. However, if the interpolatory quadrature is Gaussian, the answer is yes. This follows from the following special properties of the quadrature weights A_0, A_1, ..., A_n of the Gaussian quadrature.

Theorem 8.3. For a Gaussian quadrature all the quadrature weights are positive and they sum up to \|w\|_1, i.e.,

(a) A_j > 0 for all j = 0, 1, ..., n.

(b) \sum_{j=0}^{n} A_j = \int_a^b w(x) dx.

Proof. (a) Let p_k = l_k^2 for k = 0, 1, ..., n. These are polynomials of degree exactly equal to 2n and p_k(x_j) = δ_{kj}. Thus,

    0 < \int_a^b l_k^2(x) w(x) dx = \sum_{j=0}^{n} A_j l_k^2(x_j) = A_k               (8.38)

for k = 0, 1, ..., n.
(b) Take f(x) ≡ 1. Then

    \int_a^b w(x) dx = \sum_{j=0}^{n} A_j,                                            (8.39)

as the quadrature is exact for polynomials of degree zero.


We can now use these special properties of the Gaussian quadrature to
prove its convergence for all f ∈ C[a, b]:

Theorem 8.4. Let

    Q_n[f] = \sum_{j=0}^{n} A_j f(x_j)                                                (8.40)

be the Gaussian quadrature. Then

    E_n[f] := \int_a^b f(x) w(x) dx - Q_n[f] → 0,   as n → ∞.                         (8.41)

Proof. Let p*_{2n+1} be the best uniform approximation to f (in the max norm, \|f\|_∞ = max_{x∈[a,b]} |f(x)|) by polynomials of degree ≤ 2n + 1. Then, since E_n is linear and E_n[p*_{2n+1}] = 0 (the quadrature is exact for polynomials of degree ≤ 2n + 1),

    E_n[f - p*_{2n+1}] = E_n[f] - E_n[p*_{2n+1}] = E_n[f]                             (8.42)

and therefore

    E_n[f] = E_n[f - p*_{2n+1}] = \int_a^b [f(x) - p*_{2n+1}(x)] w(x) dx - \sum_{j=0}^{n} A_j [f(x_j) - p*_{2n+1}(x_j)].

Taking the absolute value, using the triangle inequality, and the fact that the quadrature weights are positive, we obtain

    |E_n[f]| ≤ \int_a^b |f(x) - p*_{2n+1}(x)| w(x) dx + \sum_{j=0}^{n} A_j |f(x_j) - p*_{2n+1}(x_j)|
             ≤ \|f - p*_{2n+1}\|_∞ \int_a^b w(x) dx + \|f - p*_{2n+1}\|_∞ \sum_{j=0}^{n} A_j
             = 2 \|w\|_1 \|f - p*_{2n+1}\|_∞.

From the Weierstrass approximation theorem it follows that E_n[f] → 0 as n → ∞.
Moreover, one can prove (using one of the Jackson theorems) that if f ∈ C^m[a, b], then

    |E_n[f]| ≤ C (2n)^{-m} \|f^{(m)}\|_∞.                                             (8.43)

That is, the rate of convergence is not fixed; it depends on the number of derivatives the integrand has. We say in this case that the approximation is spectral. In particular, if f ∈ C^∞[a, b], then the error decreases to zero faster than any power of 1/(2n).

8.4 Computing the Gaussian Nodes and Weights


Orthogonal polynomials satisfy a three-term relation:

    ψ_{k+1}(x) = (x - α_k) ψ_k(x) - β_k ψ_{k-1}(x),   for k = 0, 1, ..., n,           (8.44)

where β_0 is defined as \int_a^b w(x) dx, ψ_0(x) = 1, and ψ_{-1}(x) ≡ 0. Equivalently,

    x ψ_k(x) = β_k ψ_{k-1}(x) + α_k ψ_k(x) + ψ_{k+1}(x),   for k = 0, 1, ..., n.       (8.45)

If we use the normalized orthogonal polynomials

    \tilde{ψ}_k(x) = \frac{ψ_k(x)}{\sqrt{⟨ψ_k, ψ_k⟩}}                                  (8.46)

and recall that

    β_k = \frac{⟨ψ_k, ψ_k⟩}{⟨ψ_{k-1}, ψ_{k-1}⟩},

then (8.45) can be written as

    x \tilde{ψ}_k(x) = \sqrt{β_k}\, \tilde{ψ}_{k-1}(x) + α_k \tilde{ψ}_k(x) + \sqrt{β_{k+1}}\, \tilde{ψ}_{k+1}(x),   for k = 0, 1, ..., n.    (8.47)

Now, evaluating this at a root x_j of ψ_{n+1} we get the eigenvalue problem

    x_j v_j = J_n v_j,                                                                 (8.48)

where

    J_n = \begin{pmatrix}
    α_0        & \sqrt{β_1} &            &            &        \\
    \sqrt{β_1} & α_1        & \sqrt{β_2} &            &        \\
               & \ddots     & \ddots     & \ddots     &        \\
               &            &            & \sqrt{β_n} & α_n
    \end{pmatrix},
    v_j = \begin{pmatrix} \tilde{ψ}_0(x_j) \\ \tilde{ψ}_1(x_j) \\ \vdots \\ \tilde{ψ}_n(x_j) \end{pmatrix}.    (8.49)

That is, the Gaussian nodes x_j, j = 0, 1, ..., n, are the eigenvalues of the Jacobi matrix J_n. One can show that the Gaussian weights A_j are given in terms of the first component v_{j,0} of the normalized eigenvector v_j (v_j^T v_j = 1) by

    A_j = β_0 v_{j,0}^2.                                                               (8.50)

There are efficient numerical methods (e.g. the QR method) to solve the eigenvalue problem for a symmetric tridiagonal matrix, and this is one of the most popular approaches to compute the Gaussian nodes.
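A minimal sketch of this eigenvalue approach for Gauss-Legendre on [-1, 1] (so w ≡ 1 and β_0 = 2), assuming NumPy and SciPy. For the monic Legendre polynomials, α_k = 0 and β_k = k^2/(4k^2 - 1); this β_k formula is a standard known result (it agrees with β_1 = 1/3 and β_2 = 4/15 computed in Example 5.3) rather than something derived here.

import numpy as np
from scipy.linalg import eigh_tridiagonal

n = 4                                        # n+1 = 5 nodes
k = np.arange(1, n + 1)
alpha = np.zeros(n + 1)
beta = k**2 / (4.0 * k**2 - 1.0)

nodes, V = eigh_tridiagonal(alpha, np.sqrt(beta))   # eigenvalues/eigenvectors of J_n
weights = 2.0 * V[0, :]**2                          # A_j = beta_0 * v_{j,0}^2

# Compare with NumPy's built-in Gauss-Legendre rule.
x_ref, w_ref = np.polynomial.legendre.leggauss(n + 1)
print(np.max(np.abs(nodes - x_ref)), np.max(np.abs(weights - w_ref)))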

8.5 Clenshaw-Curtis Quadrature


Gaussian quadratures are optimal in terms of the degree of precision and
offer superalgebraic convergence for smooth integrands. However, the com-
putation of the Gaussian weights and nodes carries a significant cost for large n. There is an ingenious interpolatory quadrature that is a close competitor to the Gaussian quadrature due to its efficiency and fast rate of convergence.
This is the Clenshaw-Curtis quadrature.
Suppose f is a smooth function in [-1, 1] and we are interested in an accurate approximation of the integral

    \int_{-1}^{1} f(x) dx.

The idea is to use the extrema of the n-th Chebyshev polynomial T_n, x_j = cos(jπ/n), j = 0, 1, ..., n, as the nodes of the corresponding interpolatory quadrature. The degree of precision is only n (not 2n + 1!). However, as we know, for smooth functions the approximation by polynomial interpolation using the Chebyshev nodes converges very rapidly. Hence, for smooth integrands this particular interpolatory quadrature can be expected to converge fast to the exact value of the integral.
Let p_n be the interpolating polynomial of f at x_j = cos(jπ/n), j = 0, 1, ..., n. We can write p_n as
    p_n(x) = \frac{a_0}{2} + \sum_{k=1}^{n-1} a_k T_k(x) + \frac{a_n}{2} T_n(x).      (8.51)

Under the change of variable x = cos θ, for θ ∈ [0, π], we get

    p_n(cos θ) = \frac{a_0}{2} + \sum_{k=1}^{n-1} a_k cos kθ + \frac{1}{2} a_n cos nθ.    (8.52)

Let Π_n(θ) = p_n(cos θ) and F(θ) = f(cos θ). By extending F evenly over [-π, 0] (or over [π, 2π]) and using Theorem 4.2, we conclude that Π_n(θ) interpolates F(θ) = f(cos θ) at the equally spaced points θ_j = jπ/n, j = 0, 1, ..., n, if and only if

    a_k = \frac{2}{n} \sum_{j=0}^{n}{}'' F(θ_j) cos kθ_j,   k = 0, 1, ..., n,          (8.53)

where the double prime means that the first and last terms of the sum carry a 1/2 factor.

These are the (Type I) Discrete Cosine Transform (DCT) coefficients of F and we can compute them efficiently in O(n log_2 n) operations with the FFT.
Now, using the change of variable x = cos θ we have

    \int_{-1}^{1} f(x) dx = \int_0^π f(cos θ) sin θ dθ = \int_0^π F(θ) sin θ dθ,       (8.54)

and approximating F(θ) by its interpolant Π_n(θ) = p_n(cos θ), we obtain the corresponding quadrature

    \int_{-1}^{1} f(x) dx ≈ \int_0^π Π_n(θ) sin θ dθ.                                  (8.55)

Substituting (8.52) we have

    \int_0^π Π_n(θ) sin θ dθ = \frac{a_0}{2} \int_0^π sin θ dθ + \sum_{k=1}^{n-1} a_k \int_0^π cos kθ sin θ dθ + \frac{a_n}{2} \int_0^π cos nθ sin θ dθ.    (8.56)

Assuming n is even and using cos kθ sin θ = \frac{1}{2}[sin(1 + k)θ + sin(1 - k)θ], we get the Clenshaw-Curtis quadrature:

    \int_{-1}^{1} f(x) dx ≈ a_0 + \sum_{k=2,\, k\ even}^{n-2} \frac{2 a_k}{1 - k^2} + \frac{a_n}{1 - n^2}.    (8.57)

For a general interval [a, b] we simply use the change of variables

    x = \frac{a + b}{2} + \frac{b - a}{2} cos θ                                        (8.58)

for θ ∈ [0, π], and thus

    \int_a^b f(x) dx = \frac{b - a}{2} \int_0^π F(θ) sin θ dθ,                         (8.59)

where F(θ) = f\big( \frac{a + b}{2} + \frac{b - a}{2} cos θ \big), so the formula (8.57) gets an extra factor of (b - a)/2.
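A minimal sketch of Clenshaw-Curtis on [-1, 1] following (8.53) and (8.57), assuming NumPy; for clarity the DCT coefficients are computed with the direct O(n^2) sum rather than the FFT mentioned above:

import numpy as np

def clenshaw_curtis(f, n):                   # n must be even
    theta = np.pi * np.arange(n + 1) / n
    F = f(np.cos(theta))
    w = np.ones(n + 1); w[0] = w[-1] = 0.5   # the double-prime sum weights
    a = np.array([2.0 / n * np.sum(w * F * np.cos(k * theta))
                  for k in range(n + 1)])
    k_even = np.arange(2, n, 2)
    return a[0] + np.sum(2.0 * a[k_even] / (1.0 - k_even**2)) + a[n] / (1.0 - n**2)

print(clenshaw_curtis(np.exp, 8), np.exp(1) - np.exp(-1))   # rapid convergence for smooth f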

8.6 Composite Quadratures


We saw in Section 1.2.2 that one strategy to improve the accuracy of a
quadrature formula is to divide the interval of integration [a, b] into small
subintervals, use the elementary quadrature in each of them, and sum up all
the contributions.
For simplicity, let us divide uniformly [a, b] into N subintervals of equal
length h = (b − a)/N , [xj , xj+1 ], where xj = a + jh for j = 0, 1, . . . , N − 1.
If we use the elementary trapezoidal rule in each subinterval (as done in
Section 1.2.2) we arrive at the composite trapezoidal rule:
    \int_a^b f(x) dx = h \Big[ \frac{1}{2} f(a) + \sum_{j=1}^{N-1} f(x_j) + \frac{1}{2} f(b) \Big] - \frac{1}{12}(b - a) h^2 f''(η),    (8.60)

where η is some point in (a, b).


To derive a corresponding composite Simpson quadrature we take N even and apply the elementary Simpson quadrature in each of the N/2 intervals [x_j, x_{j+2}], j = 0, 2, ..., N - 2. That is,

    \int_a^b f(x) dx = \int_{x_0}^{x_2} f(x) dx + \int_{x_2}^{x_4} f(x) dx + · · · + \int_{x_{N-2}}^{x_N} f(x) dx,    (8.61)

and since the elementary Simpson quadrature applied to [x_j, x_{j+2}] reads

    \int_{x_j}^{x_{j+2}} f(x) dx = \frac{h}{3} [f(x_j) + 4 f(x_{j+1}) + f(x_{j+2})] - \frac{1}{90} f^{(4)}(η_j) h^5    (8.62)

for some η_j ∈ (x_j, x_{j+2}), summing up all the N/2 contributions we get the composite Simpson quadrature:

    \int_a^b f(x) dx = \frac{h}{3} \Big[ f(a) + 2 \sum_{j=1}^{N/2-1} f(x_{2j}) + 4 \sum_{j=1}^{N/2} f(x_{2j-1}) + f(b) \Big] - \frac{1}{180}(b - a) h^4 f^{(4)}(η),

for some η ∈ (a, b).
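A minimal sketch of the composite Simpson rule above (N even, equally spaced nodes), assuming NumPy:

import numpy as np

def composite_simpson(f, a, b, N):
    x = np.linspace(a, b, N + 1)
    h = (b - a) / N
    return h / 3.0 * (f(x[0]) + 4.0 * np.sum(f(x[1:-1:2]))
                      + 2.0 * np.sum(f(x[2:-1:2])) + f(x[-1]))

exact = np.exp(1.0) - 1.0
for N in [4, 8, 16]:
    print(N, abs(composite_simpson(np.exp, 0.0, 1.0, N) - exact))   # errors decrease like h^4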

8.7 Modified Trapezoidal Rule


We are going to consider here a modification to the trapezoidal rule that will
yield a quadrature with an error of the same order as Simpson’s rule. More-
over, this modified quadrature will give us some insight to the the asymptotic
form of the trapezoidal rule error.
To simplify the derivation let us consider the interval [0, 1] and let p3 be
the polynomial interpolating f (0), f 0 (0), f (1), f 0 (1). Newton’s divided differ-
ences representation of p3 is

p3 (x) = f (0) + f [0, 0]x + f [0, 0, 1]x2 + f [0, 0, 1, 1]x2 (x − 1), (8.63)

and thus

    \int_0^1 p_3(x) dx = f(0) + \frac{1}{2} f'(0) + \frac{1}{3} f[0, 0, 1] - \frac{1}{12} f[0, 0, 1, 1].    (8.64)

The divided differences are obtained in the tableau:

    0   f(0)
                 f'(0)
    0   f(0)                 f(1) - f(0) - f'(0)
                 f(1) - f(0)                         f'(1) + f'(0) + 2(f(0) - f(1))
    1   f(1)                 f'(1) - f(1) + f(0)
                 f'(1)
    1   f(1)

Thus,

    \int_0^1 p_3(x) dx = f(0) + \frac{1}{2} f'(0) + \frac{1}{3}[f(1) - f(0) - f'(0)] - \frac{1}{12}[f'(0) + f'(1) + 2(f(0) - f(1))],

and simplifying the right hand side we get

    \int_0^1 p_3(x) dx = \frac{1}{2}[f(0) + f(1)] + \frac{1}{12}[f'(0) - f'(1)],       (8.65)

which is the simple trapezoidal rule plus a correction involving the derivative of the integrand at the end points.
We can obtain an expression for the error of this quadrature formula by recalling that the Cauchy remainder in the interpolation is

    f(x) - p_3(x) = \frac{1}{4!} f^{(4)}(ξ(x)) x^2 (x - 1)^2,                          (8.66)

and since x^2 (x - 1)^2 does not change sign in [0, 1] we can use the Mean Value Theorem for integrals to get

    E[f] = \int_0^1 [f(x) - p_3(x)] dx = \frac{1}{4!} f^{(4)}(η) \int_0^1 x^2 (x - 1)^2 dx = \frac{1}{720} f^{(4)}(η)    (8.67)

for some η ∈ (0, 1).

To obtain the quadrature on a general finite interval [a, b] we use the change of variables x = a + (b - a)t, t ∈ [0, 1]:

    \int_a^b f(x) dx = (b - a) \int_0^1 F(t) dt,                                       (8.68)

where F(t) = f(a + (b - a)t). Thus,

    \int_a^b f(x) dx = \frac{b - a}{2}[f(a) + f(b)] + \frac{(b - a)^2}{12}[f'(a) - f'(b)] + \frac{1}{720} f^{(4)}(η)(b - a)^5,    (8.69)

for some η ∈ (a, b).


We can get a composite modified trapezoidal rule by subdividing [a, b]
in N subintervals of equal length h = b−a N
, applying the simple rule in each
subinterval and adding up all the contributions:
Z b " N −1
#
1 X 1 h2
f (x)dx = h f (x0 ) + f (xj ) + f (xN ) − [f 0 (b) − f 0 (a)]
a 2 2 12
j=1 (8.70)
1 (4)
+ f (η)h4 .
720
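A minimal sketch of (8.70), assuming NumPy; fp denotes f':

import numpy as np

def modified_trapezoid(f, fp, a, b, N):
    x = np.linspace(a, b, N + 1)
    h = (b - a) / N
    T = h * (0.5 * f(x[0]) + np.sum(f(x[1:-1])) + 0.5 * f(x[-1]))
    return T - h**2 / 12.0 * (fp(b) - fp(a))

exact = np.exp(1.0) - 1.0
for N in [4, 8, 16]:
    print(N, abs(modified_trapezoid(np.exp, np.exp, 0.0, 1.0, N) - exact))   # O(h^4)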

8.8 The Euler-Maclaurin Formula


We are now going to obtain a more general formula for the asymptotic form of the error in the trapezoidal rule quadrature. The idea is to use integration by parts with the aid of suitable polynomials. Let us consider again the interval [0, 1] and define B_0(x) = 1, B_1(x) = x - 1/2. Then

    \int_0^1 f(x) dx = \int_0^1 f(x) B_0(x) dx = \int_0^1 f(x) B_1'(x) dx
                     = f(x) B_1(x) \big|_0^1 - \int_0^1 f'(x) B_1(x) dx                (8.71)
                     = \frac{1}{2}[f(0) + f(1)] - \int_0^1 f'(x) B_1(x) dx.

We can continue the integration by parts using the Bernoulli polynomials, which satisfy

    B_{k+1}'(x) = (k + 1) B_k(x),   k = 1, 2, ...                                       (8.72)

Since we start with B_1(x) = x - 1/2, it is clear that B_k(x) is a polynomial of degree exactly k with leading order coefficient 1, i.e. monic. These polynomials are determined by the recurrence relation (8.72) up to a constant. The

constant is fixed by requiring that

    B_k(0) = B_k(1) = 0,   k = 3, 5, 7, ...                                             (8.73)

Indeed,

    B_{k+1}''(x) = (k + 1) B_k'(x) = (k + 1) k B_{k-1}(x)                                (8.74)

and B_{k-1}(x) has the form

    B_{k-1}(x) = x^{k-1} + a_{k-2} x^{k-2} + · · · + a_1 x + a_0.                        (8.75)

Integrating (8.74) twice we get

    B_{k+1}(x) = k(k + 1) \Big[ \frac{x^{k+1}}{k(k+1)} + \frac{a_{k-2} x^k}{(k-1)k} + · · · + \frac{1}{2} a_0 x^2 + b x + c \Big].    (8.76)

For k + 1 odd, the two constants of integration b and c are determined by the condition (8.73). The B_k(x) for k even are then given by B_k(x) = B_{k+1}'(x)/(k + 1).
We are going to need a few properties of the Bernoulli polynomials. By construction, B_k(x) is an even (odd) polynomial in x - 1/2 if k is even (odd). Equivalently, they satisfy the identity

    (-1)^k B_k(1 - x) = B_k(x).                                                          (8.77)

This follows because the polynomials A_k(x) = (-1)^k B_k(1 - x) satisfy the same conditions that define the Bernoulli polynomials, i.e. A_{k+1}'(x) = (k + 1) A_k(x) and A_k(0) = A_k(1) = 0 for k = 3, 5, 7, ..., and since A_1(x) = B_1(x) they are the same. From (8.77) and (8.73) we get that

    B_k(0) = B_k(1),   k = 2, 3, ...                                                     (8.78)

We define the Bernoulli numbers as B_k = B_k(0) = B_k(1). This, together with the recurrence relation (8.72), implies that

    \int_0^1 B_k(x) dx = \frac{1}{k+1} \int_0^1 B_{k+1}'(x) dx = \frac{1}{k+1}[B_{k+1}(1) - B_{k+1}(0)] = 0    (8.79)

for k = 1, 2, ...
Lemma 3. The polynomials C_{2m}(x) = B_{2m}(x) - B_{2m}, m = 1, 2, ..., do not change sign in [0, 1].

Proof. We will prove it by contradiction. Let us suppose that C_{2m}(x) changes sign. Then it has at least 3 zeros and, by Rolle's theorem, C_{2m}'(x) = B_{2m}'(x) has at least 2 zeros in (0, 1). This implies that B_{2m-1}(x) has 2 zeros in (0, 1). Since B_{2m-1}(0) = B_{2m-1}(1) = 0, again by Rolle's theorem, B_{2m-1}'(x) has 3 zeros in (0, 1), which implies that B_{2m-2}(x) has 3 zeros, etc. We then conclude that B_{2l-1}(x) has 2 zeros in (0, 1) plus the two at the end points, B_{2l-1}(0) = B_{2l-1}(1) = 0, for all l = 1, 2, ..., which is a contradiction (for l = 1, 2).

Here are the first few Bernoulli polynomials

B0 (x) = 1 (8.80)
1
B1 (x) = x − (8.81)
2
 2
1 1 1
B2 (x) = x − − = x2 − x + (8.82)
2 12 6
 3  
1 1 1 3 1
B3 (x) = x − − x− = x3 − x2 + x (8.83)
2 4 2 2 2
 4  2
1 1 1 7 1
B4 (x) = x − − x− + = x4 − 2x3 + x2 − . (8.84)
2 2 2 5 · 48 30

Let us retake the idea of integration by parts that we started in (8.71)

1
1 1 0
Z Z
0
− f (x)B1 (x)dx = − f (x)B20 (x)dx
0 2 0
(8.85)
1 1 00
Z
1 0 0
= B2 [f (0) − f (1)] + f (x)B2 (x)dx
2 2 0
130 CHAPTER 8. NUMERICAL INTEGRATION

and
Z 1 Z 1
1 00 1
f (x)B2 (x)dx = f 00 (x)B30 (x)dx
2 0 2·3 0
 Z 1 
1 00
1
000
= f (x)B3 (x) − f (x)B3 (x)dx
2·3 0 0
Z 1 Z 1
1 000 1
=− f (x)B3 (x)dx = − f 000 (x)B40 (x)dx
2·3 0 2·3·4 0
1 1 (4)
Z
B4 000 000
= [f (0) − f (1)] + f (x)B4 (x)dx.
4! 4! 0
(8.86)

Continuing this way we arrive at the Euler-Maclaurin formula for the simple
trapezoidal rule in [0, 1]:

Theorem 8.5.
Z 1 m
1 X B2k (2k−1)
f (x)dx = [f (0) + f (1)] + [f (0) − f (2k−1) (1)] + Rm
0 2 k=1
(2k)!
(8.87)

where
Z 1
1
Rm = f (2m+2) (x)[B2m+2 (x) − B2m+2 ]dx (8.88)
(2m + 2)! 0

and using (8.79), the Mean Value theorem for integrals, and Lemma 3
Z 1
1 B2m+2 (2m+2)
Rm = f (2m+2) (η)[B2m+2 (x) − B2m+2 ]dx = − f (η)
(2m + 2)! 0 (2m + 2)!
(8.89)

for some η ∈ (0, 1).

It is now straight forward to obtain the Euler Maclaurin formula for the
composite trapezoidal rule with equally spaced points:
8.9. ROMBERG INTEGRATION 131

Theorem 8.6. (The Euler-Maclaurin Summation Formula)


Let m be a positive integer and f ∈ C (2m+2) [a, b], h = b−a
N
then
" N −1
#
Z b
1 1 X
f (x)dx = h f (a) + f (b) + f (a + jh)
a 2 2 j=1
m
X B2k 2k 2k−1 (8.90)
+ h [f (a) − f 2k−1 (b)]
k=1
(2k)!
B2m+2
− (b − a)h2m+2 f (2m+2) (η). η ∈ (0, 1)
(2m + 2)!

Remarks: The error is in even powers of h. The formula gives m corrections


to the composite trapezoidal rule. For a smooth periodic function and if b−a
is a multiple of its period, then the error of the composite trapezoidal rule,
with equally spaced points, decreases faster than any power of h as h → 0.

8.9 Romberg Integration


We are now going to apply successively Richardson’s Extrapolation to the
trapezoidal rule. Again, we consider equally spaced nodes, xj = a + jh,
j = 0, 1, . . . , N , h = (b − a)/N , and assume N is even
N
" n−1
#
1 1 X X 00
Th [f ] = h f (a) + f (b) + f (a + jh) := h f (a + jh) (8.91)
2 2 j=1 j=0

where 00 means that first and last terms have a 21 factor.


P
We know from the Euler-Maclaurin formula that for a smooth integrand
Z b
f (x)dx = Th [f ] + c2 h2 + c4 h4 + · · · (8.92)
a

for some constants c2 , c4 , etc. We can do Richardson extrapolation to obtain


a quadrature with a leading order error O(h4 ). If we have computed T2h [f ]
we can combine it with Th [f ] to achieve this by noting that
Z b
f (x)dx = T2h [f ] + c2 (2h)2 + c4 (2h)4 + · · · (8.93)
a
132 CHAPTER 8. NUMERICAL INTEGRATION

we have
b
4Th [f ] − T2h [f ]
Z
f (x)dx = + c˜4 h4 + c˜6 h6 + · · · (8.94)
a 3
We can continue the Richardson extrapolation process but we can do this
more efficiently if we reuse the work we have done to compute T2h [f ] to
evaluate Th [f ]. To this end, we note that
N N
N 2 2
1 X 00 X 00 X
Th [f ] − T2h [f ] = h f (a + jh) − h f (a + 2jh) = h f (a + (2j − 1)h)
2 j=0 j=0 j=1

b−a
If we let hl = 2l
then
2l−1
1 X
Thl = Thl−1 [f ] + hl f (a + (2j − 1)hl ). (8.95)
2 j=1

Beginning with the simple trapezoidal rule (two points) we can successively
double the number of points in the quadrature by using (8.95) and immedi-
ately do extrapolation.
Let
b−a
R(0, 0) = Th0 [f ] = [f (a) + f (b)] (8.96)
2
and for l = 1, 2, ..., M define
2l−1
1 X
R(l, 0) = R(l − 1, 0) + hl f (a + 2j − 1)hl ). (8.97)
2 j=1

From R(0, 0) and R(1, 0) we can extrapolate to obtain


1
R(1, 1) = R(1, 0) + [R(1, 0) − R(0, 0)] (8.98)
4−1
We can generate a tableau of approximations like the following, for M = 4
R(0, 0)
R(1, 0) R(1, 1)
R(2, 0) R(2, 1) R(2, 2)
R(3, 0) R(3, 1) R(3, 2) R(3, 3)
R(4, 0) R(4, 1) R(4, 2) R(4, 3) R(4, 4)
8.9. ROMBERG INTEGRATION 133

Each of the R(l, m) is obtained by extrapolation


1
R(l, m) = R(l, m − 1) + [R(l, m − 1) − R(l − 1, m − 1)]. (8.99)
4m − 1
and R(4,4) would be the most accurate approximation (neglecting round off
errors). This is the Romberg algorithm and can be written as:

h = b − a;
R(0, 0) = 12 (b − a)[f (a) + f (b)];
for l = 1 : M
h = h/2;
P l−1
R(1, 0) = 21 R(l − 1, 0) + h 2j=1 f (a + (2j − 1)h);
for m = 1 : M
R(l, m) = R(l, m − 1) + 4m1−1 [R(l, m − l) − R(l − 1, m − 1)];
end
end
134 CHAPTER 8. NUMERICAL INTEGRATION
Chapter 9

Linear Algebra

9.1 The Three Main Problems


There are three main problems in Numerical Linear Algebra:
1. Solving large linear systems of equations.

2. Finding eigenvalues and eigenvectors.

3. Computing the Singular Value Decomposition (SVD) of a large matrix.


The first problem appears in a wide variety of applications and is an
indispensable tool in Scientific Computing.
Given a nonsingular n×n matrix A and a vector b ∈ Rn , where n could be
on the order of millions or billions, we would like to find the unique solution
x, satisfying

Ax = b (9.1)

or an accurate approximation x̃ to x. Henceforth we will assume, unless


otherwise stated, that the matrix A is real.
We will study Direct Methods (for example Gaussian Elimination), which
compute the solution (up to roundoff errors) in a finite number of steps
and Iterative Methods, which starting from an initial approximation of the
solution x(0) produce subsequent approximations x(1) , x(2) , . . . from a given
recipe

x(k+1) = G(x(k) , A, b), k = 0, 1, . . . (9.2)

135
136 CHAPTER 9. LINEAR ALGEBRA

where G is a continuous function of the first variable. Consequently, if the


iterations converge, x(k) → x as k → ∞, to the solution x of the linear system
Ax = b, then

x = G(x, A, b). (9.3)

That is, x is a fixed point of G.


One of the main strategies in the design of efficient numerical methods
for linear systems is to transform the problem to one which is much easier
to solve. Both direct and iterative methods use this strategy.
The eigenvalue problem for an n × n matrix A consists of finding each or
some of the scalars (the eigenvalues) λ and the corresponding eigenvectors
v 6= 0 such that

Av = λv. (9.4)

Equivalently, (A − λI)v = 0 and so the eigenvalues are the roots of the


characteristic polynomial of A

p(λ) = det(A − λI). (9.5)

Clearly, we cannot solve this problem with a finite number of elementary


operations (for n ≥ 5 it would be a contradiction to Abel’s theorem) so
iterative methods have to be employed. Also, λ and v could be complex
even if A is real. The maximum of the absolute value of the eigenvalues of a
matrix is useful concept in numerical linear algebra.

Definition 9.1. Let A be an n × n matrix. The spectral radius ρ of A is


defined as

ρ(A) = max{|λ1 |, . . . , |λn |}, (9.6)

where λi , i = 1, . . . , n are the eigenvalues (not necessarily distinct) of A.

Large eigenvalue (or more appropriately eigenvector) problems arise in


the study of the steady state behavior of time-discrete Markov processes
which are often used in a wide range of applications, such as finance, popu-
lation dynamics, and data mining. The original Google’s PageRank search
algorithm is a prominent example of the latter. The problem is to find an
eigenvector v associated with the eigenvalue 1, i.e. v = Av. Such v is a
9.2. NOTATION 137

probability vector so all its entries are positive, add up to 1, and represent
the probabilities of the system described by the Markov process to be in a
given state in the limit as time goes to infinity. This eigenvector v is in effect
a fixed point of the linear transformation represented by the Markov matrix
A.
The third problem is related to the second one and finds applications
in image compression, model reduction techniques, data analysis, and many
other fields. Given an m × n matrix A, the idea is to consider the eigenvalues
and eigenvectors of the square, n × n matrix AT A (or A∗ A, where A∗ is the
conjugate transpose of A as defined below, if A is complex). As we will see,
the eigenvalues are all real and nonnegative and AT A has a complete set of
orthogonal eigenvectors. The singular values of a matrix A are the positive
square roots of the eigenvalues of AT A. Using this, it follows that any real
m × n matrix A has the singular value decomposition (SVD)
U T AV = Σ, (9.7)
where U is an orthogonal m × m matrix, V is an orthogonal n × n matrix,
and Σ is a “diagonal” matrix of the form
 
D 0
Σ= , D = diag(σ1 , σ2 , . . . , σr ), (9.8)
0 0
where σ1 ≥ σ2 ≥ . . . σr > 0 are the nonzero singular values of A.

9.2 Notation
A matrix A with elements aij will be denoted A = (aij ), this could be a
square n × n matrix or an m × n matrix. AT denotes the transpose of A, i.e.
AT = (aji ).
A vector in x ∈ Rn will be represented as the n-tuple
 
x1
 x2 
x =  ..  . (9.9)
 
.
xn
The canonical vectors, corresponding to the standard basis in Rn , will be
denoted by e1 , e2 , . . . , en , where ek is the n-vector with all entries equal to
zero except the j-th one, which is equal to one.
138 CHAPTER 9. LINEAR ALGEBRA

The inner product of two real vectors x and y in Rn is


n
X
hx, yi = xi yi = xT y. (9.10)
i=1

If the vectors are complex, i.e. x and y in Cn we define their inner product
as
n
X
hx, yi = x̄i yi , (9.11)
i=1

where x̄i denotes the complex conjugate of xi .


With the inner product (9.10) in the real case or (9.11) in the complex
case, we can define the Euclidean norm
p
kxk2 = hx, xi. (9.12)
Note that if A is an n × n real matrix and x, y ∈ Rn then
n n
! n X n
X X X
hx, Ayi = xi aik yk = aik xi yk
i=1 k=1 i=1 k=1
n n
! n n
! (9.13)
X X X X
= aik xi yk = aTki xi yk ,
k=1 i=1 k=1 i=1

that is
hx, Ayi = hAT x, yi. (9.14)
Similarly in the complex case we have
hx, Ayi = hA∗ x, yi, (9.15)
where A∗ is the conjugate transpose of A, i.e. A∗ = (aji ).

9.3 Some Important Types of Matrices


One useful type of linear transformations consists of those that preserve the
Euclidean norm. That is, if y = Ax, then kyk2 = kxk2 but this implies
hAx, Axi = hAT Ax, xi = hx, xi (9.16)
and consequently AT A = I.
9.3. SOME IMPORTANT TYPES OF MATRICES 139

Definition 9.2. An n × n real (complex) matrix A is called orthogonal (uni-


tary) if AT A = I (A∗ A = I).
Two of the most important types of matrices in applications are symmet-
ric (Hermitian) and positive definite matrices.
Definition 9.3. An n × n real matrix A is called symmetric if AT = A. If
the matrix A is complex it is called Hermitian if A∗ = A.
Symmetric (Hermitian) matrices have real eigenvalues, for if v is an eigen-
vector associated to an eigenvalue λ of A, we can assumed it has been nor-
malized so that hv, vi = 1, and

hv, Avi = hv, λvi = λhv, vi = λ. (9.17)

But if AT = A then

λ = hv, Avi = hAv, vi = hλv, vi = λhv, vi = λ, (9.18)

and λ = λ if and only if λ ∈ R.


Definition 9.4. An n×n matrix A is called positive definite if it is symmetric
(Hermitian) and hx, Axi > 0 for all x ∈ Rn , x 6= 0.
By the preceding argument the eigenvalues of a positive definite matrix
A are real because AT = A. Moreover, if Av = λv with kvk2 = 1 then
0 < hv, Avi = λ. Therefore, positive definite matrices have real, positive
eigenvalues. Conversely, if all the eigenvalues of a symmetric matrix A are
positive then A is positive definite. This follows from the fact that symmetric
matrices are diagonalizable by an orthogonal matrix S, i.e. A = SDS T ,
where D is a diagonal matrix with the eigenvalues λ1 , . . . , λn (not necessarily
distinct) of A. Then
n
X
hx, Axi = λi yi2 , (9.19)
i=1

where y = S T x. Thus a symmetric (Hermitian) matrix A is positive definite


if and only if all its eigenvalues are positive. Moreover, since the determinant
is the product of the eigenvalues, positive definite matrices have a positive
determinant.
We now review another useful consequence of positive definiteness.
140 CHAPTER 9. LINEAR ALGEBRA

Definition 9.5. Let A = (aij ) be an n × n matrix. Its leading principal


submatrices are the square matrices

 
a11 · · · a1k
 ..
Ak =  . , k = 1, . . . , n. (9.20)

ak1 · · · akk

Theorem 9.1. All the leading principal submatrices of a positive definite


matrix are positive definite.

Proof. Suppose A is an n × n positive definite matrix. Then, all its leading


principal submatrices are symmetric (Hermitian). Moreover, if we take a
vector x ∈ Rn of the form

 
y1
 .. 
.
y 
 
x =  k , (9.21)
0
.
 .. 
0

where y = [y1 , . . . , yk ]T ∈ Rk is an arbitrary nonzero vector then

0 < hx, Axi = hy, Ak yi

which shows that Ak for k = 1, . . . , n is positive definite.

The converse of Theorem 9.1 is also true but the proof is much more
technical: A is positive definite if and only if det(Ak ) > 0 for k = 1, . . . , n.
Note also that if A is positive definite then all its diagonal elements are
positive because 0 < hej , Aej i = ajj , for j = 1, . . . , n.
9.4. SCHUR THEOREM 141

9.4 Schur Theorem


Theorem 9.2. (Schur) Let A be an n × n matrix, then there exists a unitary
matrix T ( T ∗ T = I ) such that
 
λ1 b12 b13 · · · b1n

 λ2 b23 · · · b2n  
T ∗ AT = 
 .. ..  , (9.22)
. . 
 
 bn−1,n 
λn

where λ1 , . . . , λn are the eigenvalues of A and all the elements below the
diagonal are zero.

Proof. We will do a proof by induction. Let A be a 2 × 2 matrix with


eigenvalues λ1 and λ2 . Let u be a normalized, eigenvector u (u∗ u = 1)
corresponding to λ1 . Then we can take T as the matrix whose first column
is u and its second column is a unit vector v orthogonal to u (u∗ v = 0). We
have
 ∗
λ1 u∗ Av
 
∗ u  
T AT = ∗ λ1 u Av = . (9.23)
v 0 v ∗ Av

The scalar v ∗ Av has to be equal to λ2 , as similar matrices have the same


eigenvalues. We now assume the result is true for all k × k (k ≥ 2) matrices
and will show that it is also true for all (k + 1) × (k + 1) matrices. Let
A be a (k + 1) × (k + 1) matrix and let u1 be a normalized eigenvector
associated with eigenvalue λ1 . Choose k unit vectors t1 , . . . , tk so that the
matrix T1 = [u1 t1 . . . tk ] is unitary. Then,
 
λ1 c12 c13 · · · c1n
0 
 
T1∗ AT1 =  ... , (9.24)
 
 Ak 
 
0

where Ak is a k × k matrix. Now, the eigenvalues of the matrix on the


right hand side of (9.24) are the roots of (λ1 − λ) det(Ak − λI) and since
this matrix is similar to A, it follows that the eigenvalues of Ak are the
142 CHAPTER 9. LINEAR ALGEBRA

remaining eigenvalues of A, λ2 , . . . , λk+1 . By the induction hypothesis there is


a unitary matrix Tk such that Tk∗ Ak Tk is upper triangular with the eigenvalues
λ2 , . . . , λk+1 sitting on the diagonal. We can now use Tk to construct the
(k + 1) × (k + 1) unitary matrix as
 
1 0 0 ··· 0
0 
 
 ..
Tk+1 =  . (9.25)

 Tk 

 
0

and define T = T1 Tk+1 . Then

T ∗ AT = Tk+1

T1∗ AT1 Tk+1 = Tk+1

(T1∗ AT1 )Tk+1 (9.26)

and using (9.24) and (9.25) we get


   
1 0 0 · · · 0 λ1 c12 c13 · · · c1n 1 0 0 ··· 0
0  0  0 
   
T ∗ AT =  ...
 ∗
  ..   .. 
 T k
 .
 Ak  .
 Tk 

   
0 0 0
 
λ1 b12 b13 · · · b1n

 λ2 b23 · · · b2n 
= . . .
. .
 
. .
 
 bn−1,n 
λn

9.5 Norms
A norm on a vector space V (for example Rn or Cn ) over K = R (or C) is a
mapping k · k : V → [0, ∞), which satisfy the following properties:
(i) kxk ≥ 0 ∀x ∈ V and kxk = 0 iff x = 0.

(ii) kx + yk ≤ kxk + kyk ∀x, y ∈ V .


9.5. NORMS 143

(iii) kλxk = |λ| kxk ∀x ∈ V, λ ∈ K.

Example 9.1.

kxk1 = |x1 | + . . . + |xn |, (9.27)


p p
kxk2 = hx, xi > = |x1 |2 + . . . + |xn |2 , (9.28)
kxk∞ = max{|x1 |, . . . , |xn |}. (9.29)

Lemma 4. Let k · k be a norm on a vector space V then

| kxk − kyk | ≤ kx − yk. (9.30)

This lemma implies that a norm is a continuous function (on V to R).


Proof. kxk = kx − y + yk ≤ kx − yk + kyk which gives that

kxk − kyk ≤ kx − yk. (9.31)

By reversing the roles of x and y we also get

kyk − kxk ≤ kx − yk. (9.32)

We will also need norms defined on matrices. Let A be an n × n matrix.


We can view A as a vector in Rn×n and define its corresponding Euclidean
norm
v
u n X n
uX
kAk = t |aij |2 . (9.33)
i=1 j=1

This is called the Frobenius norm for matrices. A different matrix norm can
be obtained by using a given vector norm and matrix-vector multiplication.
Given a vector norm k · k in Rn (or in Cn ), it is easy to show that

kAxk
kAk = max , (9.34)
x6=0 kxk

satisfies the properties (i), (ii), (iii) of a norm for all n × n matrices A . That
is, the vector norm induces a matrix norm.
144 CHAPTER 9. LINEAR ALGEBRA

Definition 9.6. The matrix norm defined by (11.1) is called the subordinate
or natural norm induced by the vector norm k · k.
Example 9.2.
kAxk1
kAk1 = max , (9.35)
x6=0 kxk1

kAxk∞
kAk∞ = max , (9.36)
x6=0 kxk∞

kAxk2
kAk2 = max . (9.37)
x6=0 kxk2

Theorem 9.3. Let k · k be an induced matrix norm then


(a) kAxk ≤ kAkkxk,
(b) kABk ≤ kAkkBk.
Proof. (a) if x = 0 the result holds trivially. Take x 6= 0, then the definition
(11.1) implies
kAxk
≤ kAk (9.38)
kxk
that is kAxk ≤ kAkkxk.
(b) Take x 6= 0. By (a) kABxk ≤ kAkkBxk ≤ kAkkBkkxk and thus
kABxk
≤ kAkkBk. (9.39)
kxk
Taking the max it we get that kABk ≤ kAkkBk.
The following theorem offers a more concrete way to compute the matrix
norms (9.35)-(9.37).
Theorem 9.4. Let A = (aij ) be an n × n matrix then
n
X
(a) kAk1 = max |aij |.
j
i=1

n
X
(b) kAk∞ = max |aij |.
i
j=1
9.5. NORMS 145
p
(c) kAk2 = ρ(AT A),

where ρ(AT A) is the spectral radius of AT A, as defined in (9.6).

Proof. (a)
n n n n
! n
!
X X X X X
kAxk1 = aij xj ≤ |xj | |aij | ≤ max |aij | kxk1 .
j
i=1 j=1 j=1 i=1 i=1

n
X
Thus, kAk1 ≤ max |aij |. We just need to show there is a vector x for
j
i=1
which the equality holds. Let j ∗ be the index such that
n
X n
X
|aij ∗ | = max |aij | (9.40)
j
i=1 i=1

and take x to be given by xi = 0 for i 6= j ∗ and xj ∗ = 1. Then, kxk1 = 1 and


n
X n
X n
X n
X
kAxk1 = aij xj = |aij ∗ | = max |aij |. (9.41)
j
i=1 j=1 i=1 i=1

(b) Analogously to (a) we have


n n
!
X X
kAxk∞ = max aij xj ≤ max |aij | kxk∞ . (9.42)
i i
j=1 j=1

Let i∗ be the index such that


n
X n
X
|ai∗ j | = max |aij | (9.43)
i
j=1 j=1

and take x given by


( a∗
i j
|ai∗ j |
if ai∗ j 6= 0,
xj = (9.44)
1 if ai∗ j = 0.
146 CHAPTER 9. LINEAR ALGEBRA

Then, |xj | = 1 for all j and kxk∞ = 1. Hence


n
X n
X n
X
kAxk∞ = max aij xj = |ai∗ j | = max |aij |. (9.45)
i i
j=1 j=1 i=1

(c) By definition
kAxk22 xT AT Ax
kAk22 = max = max (9.46)
x6=0 kxk22 x6=0 xT x
Note that the matrix AT A is symmetric and all its eigenvalues are nonnega-
tive. Let us label them in increasing order, 0 ≤ λ1 ≤ λ1 ≤ · · · ≤ λn . Then,
λn = ρ(AT A). Now, since AT A is symmetric, there is an orthogonal matrix Q
such that QT AT AQ = D = diag(λ1 , . . . , λn ). Therefore, changing variables,
x = Qy, we have
xT AT Ax y T Dy λ1 y12 + · · · + λn yn2
= = ≤ λn . (9.47)
xT x yT y y12 + · · · + yn2
Now take the vector y such that yj = 0 for j 6= n and yn = 1 and the equality
holds. Thus,
s
kAxk22 p p
kAk2 = max = λ n = ρ(AT A). (9.48)
x6=0 kxk22

Note that if AT = A then


p p
kAk2 = ρ(AT A) = ρ(A2 ) = ρ(A). (9.49)
Let λ be an eigenvalue of the matrix A with eigenvector x, normalized so
that kxk = 1. Then,
|λ| = |λ|kxk = kλxk = kAxk ≤ kAkkxk = kAk (9.50)
for any matrix norm with the property kAxk ≤ kAkkxk. Thus,
ρ(A) ≤ kAk (9.51)
for any induced norm. However, given an n × n matrix A and  > 0 there is
at least one induced matrix norm such that kAk is within  of the spectral
radius of A.
9.5. NORMS 147

Theorem 9.5. Let A be an n × n matrix. Given  > 0 there is at least one


induced matrix norm k · k such that

ρ(A) ≤ kAk ≤ ρ(A) + . (9.52)

Proof. By Schur’s Theorem, there is a unitary matrix T such that


 
λ1 b12 b13 · · · b1n

 λ2 b23 · · · b2n 


T AT =  .. ..
 = U, (9.53)
 
. .
 
 bn−1,n 
λn

where λj , j = 1, . . . , n are the eigenvalues of A. Take 0 < δ < 1 and define


the diagonal matrix Dδ = diag(δ, δ 2 , . . . , δ n ). Then
 
λ1 δb12 δ 2 b13 · · · δ n−1 b1n

 λ2 δb23 · · · δ n−2 b2n 

−1
Dδ U D δ = 
 .. ..  . (9.54)
. . 
 
 δbn−1,n 
λn

Given  > 0, we can find δ sufficiently small so that Dδ−1 U Dδ is “within ”


of a diagonal matrix, in the sense that the sum of the absolute values of the
off diagonal entries is less than  for each row:
n
X
δ j−i bij ≤  for i = 1, . . . , n. (9.55)
j=i+1

Now,

Dδ−1 U Dδ = Dδ−1 T ∗ AT Dδ = (T Dδ )−1 A(T Dδ ) (9.56)

Given a nonsingular matrix S and a matrix norm k · k then

kAk0 = kS −1 ASk (9.57)


148 CHAPTER 9. LINEAR ALGEBRA

is also a norm. Taking S = T Dδ and using the infinity norm we get


kAk0 = k(T Dδ )−1 A(T Dδ )k∞

   
λ1 0 δb12 δ 2 b13 · · · δ n−1 b1n

 λ2 


 0 δb23 · · · δ n−2 b2n 

≤ 
 .. 
+ 
 .. .. 
.  . . 
   
   δbn−1,n 
λn ∞
0 ∞

≤ ρ(A) + .

9.6 Condition Number of a Matrix


Consider the 5 × 5 Hilbert matrix
 
1 1 1 1 1
 1 2 3 4 5 
 
 
 1 1 1 1 1 
 
 2 3 4 5 6 
 
 
 1 1 1 1 1
 
H5 =  (9.58)


 3 4 5 6 7 
 
 
 1 1 1 1 1 
 
 4 5 6 7 8 
 
 
 1 1 1 1 1 
5 6 7 8 9
and the linear system H5 x = b where
 
137/60
 87/60 
 
b= 153/140 .
 (9.59)
 743/840 
1879/2520
9.6. CONDITION NUMBER OF A MATRIX 149

The exact solution of this linear system is x = [1, 1, 1, 1, 1]T . Note that
b ≈ [2.28, 1.45, 1.09, 0.88, 0.74]T . Let us perturb b slightly (about % 1)
 
2.28
1.46
 
b + δb = 1.10
 (9.60)
0.89
0.75

The solution of the perturbed system (up to rounding at 12 digits of accuracy)


is
 
0.5
 7.2 
 
x + δx = −21.0 .
 (9.61)
 30.8 
−12.6

A relative perturbation of kδbk2 /kbk2 = 0.0046 in the data produces a change


in the solution equal to kδxk2 ≈ 40. The perturbations gets amplified nearly
four orders of magnitude!
This high sensitivity of the solution to small perturbations is inherent to
the matrix of the linear system, H5 in this example.
Consider the linear system Ax = b and the perturbed one A(x + δx) =
b + δb. Then, Ax + Aδx = b + δb implies δx = A−1 δb and so

kδxk ≤ kA−1 kkδbk (9.62)

for any induced norm. But also kbk = kAxk ≤ kAkkxk or

1 1
≤ kAk . (9.63)
kxk kbk

Combining (9.62) and (9.63) we obtain

kδxk kδbk
≤ kAkkA−1 k . (9.64)
kxk kbk

The right hand side of this inequality is actually a least upper bound, there
are b and δb for which the equality holds.
150 CHAPTER 9. LINEAR ALGEBRA

Definition 9.7. Given a matrix norm k · k, the condition number of a


matrix A, denoted by κ(A) is defined by

κ(A) = kAkkA−1 k. (9.65)

Example 9.3. The condition number of the 5 × 5 Hilbert matrix H5 , (9.58),


in the 2 norm is approximately 4.7661 × 105 . For the particular b and δb we
chose we actually got a variation in the solution of O(104 ) times the relative
perturbation but now we know that the amplification factor could be as bad
as κ(A).

Similarly, if we perturbed the entries of a matrix A for a linear system


Ax = b so that we have (A + δA)(x + δx) = b we get

Ax + Aδx + δA(x + δx) = b (9.66)

that is, Aδx = −δA(x + δx), which implies that

kδxk ≤ kA−1 kkδAkkx + δxk (9.67)

for any induced matrix norm and consequently

kδxk kδAk kδAk


≤ kA−1 kkAk = κ(A) . (9.68)
kx + δxk kAk kAk

Because, for any induced norm, 1 = kIk = kA−1 kkAk ≤ kA−1 kkAk, we get
that κ(A) ≥ 1. We say that A is ill-conditioned if κ(A) is very large.

Example 9.4. The Hilbert matrix is ill-conditioned. We already saw that


in the 2 norm κ(H5 ) = 4.7661 × 105 . The condition number increases very
rapidly as the size of the Hilbert matrix increases, for example κ(H6 ) =
1.4951 × 107 , κ(H10 ) = 1.6025 × 1013 .

9.6.1 What to Do When A is Ill-conditioned?


There are two ways to deal with a linear system with an ill-conditioned matrix
A. One approach is to work with extended precision (using as many digits
as required to to obtain the solution up to a given accuracy). Unfortunately,
computations using extended precision can be computationally expensive,
several times the cost of regular double precision operations.
9.6. CONDITION NUMBER OF A MATRIX 151

A more practical approach is often to replace the ill-conditioned linear


system Ax = b by an equivalent linear system with a much smaller condition
number. This can be done by for example by premultiplying by a matrix
P −1 such that we have P −1 Ax = P −1 b. Obviously, taking P = A gives us
the smallest possible condition number but this choice is not practical so
a compromise is made between P approximating A and the cost of solving
linear systems with the matrix P to be low. This very useful technique, also
employed to accelerate the convergence of some iterative methods, is called
preconditioning.
152 CHAPTER 9. LINEAR ALGEBRA
Chapter 10

Linear Systems of Equations I

In this chapter we focus on a problem which is central to many applications:


find the solution to a large linear system of n linear equations in n unknowns
x1 , x2 , . . . , xn
a11 x1 + a12 x2 + . . . + a1n xn = b1 ,
a21 x1 + a22 x2 + . . . + a2n xn = b2 ,
.. (10.1)
.
an1 x1 + an2 x2 + . . . + ann xn = bn ,
or written in matrix form
Ax = b (10.2)
where A is the n × n matrix of coefficients
 
a11 a12 · · · a1n
 a21 a22 · · · a2n 
A =  .. ..  , (10.3)
 
.. ..
 . . . . 
an1 an2 · · · ann
x is a column vector whose components are the unknowns, and b is the given
right hand side of the linear system
   
x1 b1
 x2   b2 
x =  ..  , b =  ..  . (10.4)
   
. .
xn bn

153
154 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

We will assume, unless stated otherwise, that A is a nonsingular, real matrix.


That is, the linear system (10.2) has a unique solution for each b. Equiva-
lently, the determinant of A, det(A), is non-zero and A has an inverse.
While mathematically we can write the solution as x = A−1 b, this is not
computationally efficient. Finding A−1 is several (about four) times more
costly than solving Ax = b for a given b.
In many applications n can be on the order of millions or much larger.

10.1 Easy to Solve Systems


When A is diagonal, i.e.
 
a11
 a22 
A= (10.5)
 
.. 
 . 
ann
(all the entries outside the diagonal are zero and since A is assumed non-
singular aii 6= 0 for all i), then each equation can be solved with just one
division:
xi = bi /aii , for i = 1, 2, . . . , n. (10.6)
If A is lower triangular and nonsingular,
 
a11
 a21 a22 
A =  .. (10.7)
 
.. . . 
 . . . 
an1 an2 · · · ann
the solution can also be obtained easily by the process of forward substitution:
b1
x1 =
a11
b2 − a21 x1
x2 =
a22
(10.8)
b3 − [a31 x1 + a32 x2 ] ..
x3 = .
a33
bn − [an1 x1 + an2 x2 + . . . + an,n−1 xn−1 ]
xn = ,
ann
10.1. EASY TO SOLVE SYSTEMS 155

or in pseudo-code:

Algorithm 10.1 Forward Subsitution


1: for i = 1, . . . , n do !
i−1
X
2: x i ← bi − aij xj /aii
j=1
3: end for

Note that the assumption that A is nonsingular implies that aii 6= 0 for all
i = 1, 2, . . . , n since det(A) = a11 a22 · · · ann . Also observe that (10.8) shows
that xi is a linear combination of bi , bi−1 , . . . , b1 and since x = A−1 b it follows
that A−1 is also lower triangular.
To compute xi we perform i−1 multiplications, i−1 additions/subtractions,
and one division, so the total amount of computational work W (n) to do for-
ward substitution is

n
X
W (n) = 2 (i − 1) + n = n2 − 2n, (10.9)
i=1

where we have used that

n
X n(n − 1)
i= . (10.10)
i=1
2

That is, W (n) = O(n2 ) to solve a lower triangular linear system.


If A is nonsingular and upper triangular
 
a11 a12 · · · a1n
 0 a22 · · · a2n 
A =  .. (10.11)
 
.. . . .. 
 . . . . 
0 0 · · · ann

we solve the linear system Ax = b starting from xn , then we solve for xn−1 ,
156 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

etc. This is called backward substitution


bn
xn = ,
ann
bn−1 − an−1,n xn
xn−1 = ,
an−1,n−1
bn−2 − [an−2,n−1 xn−1 + an−2,n xn ] (10.12)
xn−2 = ,
an−2,n−2
..
.
b1 − [a12 x2 + a13 x3 + · · · a1n xn ]
x1 = .
a11
From this we deduce that xi is a linear combination of bi .., bi+1 , ...bn and so
A−1 is an upper triangular matrix. In pseudo-code, we have

Algorithm 10.2 Backward Subsitution


1: for i = n, n − 1, . . . , 1 do!
X n
2: x i ← bi − aij xj /aii
j=i+1
3: end for

The operation count is the same as for forward substitution, W (n) =


O(n2 ).

10.2 Gaussian Elimination


The central idea of Gaussian elimination is to reduce the linear system Ax =
b to an equivalent upper triangular system, which has the same solution
and can readily be solved with backward substitution. Such reduction is
done with an elimination process employing linear combinations of rows. We
illustrate first the method with a concrete example:

x1 + 2x2 − x3 + x4 = 0,
2x1 + 4x2 − x4 = −3,
(10.13)
3x1 + x2 − x3 + x4 = 3,
x1 − x2 + 2x3 + x4 = 3.
10.2. GAUSSIAN ELIMINATION 157

To do the elimination we form an augmented matrix Ab by appending one


more column to the matrix of coefficients A, consisting of the right hand side
b:
 
1 2 −1 1 0
2 4 0 −1 −3
Ab =  . (10.14)
3 1 −1 1 3
1 −1 2 1 3
The first step is to eliminate the first unknown in the second to last equations,
i.e. to produce a zero in the first column of Ab for rows 2, 3, and 4:
   
1 2 −1 1 0 1 2 −1 1 0
2 4 0 −1 −3  −−−−−−→ 0
 0 2 −3 −3 ,
 (10.15)
3 1 −1 1 3 R2 ←R2 −2R1 0 −5 2 −2 3
R ←R −3R1
1 −1 2 1 3 R34 ←R34 −1R1 0 −3 3 0 3
where R2 ← R2 − 2R1 means that the second row has been replaced by
the second row minus two times the first row, etc. Since the coefficient of
x1 in the first equation is 1 it is easy to figure out the number we need to
multiply rows 2, 3, and 4 to achieve the elimination of the first variable for
each row, namely 2, 3, and 1. These numbers are called multipliers. In
general, to obtain the multipliers we divide the coefficient of x1 in the rows
below the first one by the nonzero coefficient a11 (2/1=2, 3/1=3, 1/1=1).
The coefficient we need to divide by to obtain the multipliers is called a pivot
(1 in this case).
Note that the (2, 2) element of the last matrix in (10.15) is 0 so we cannot
use it as a pivot for the second round of elimination. Instead, we proceed by
exchanging the second and the third rows
   
1 2 −1 1 0 1 2 −1 1 0
0 0 2 −3 −3  −−−−−−→ 0 −5
 2 −2 3.

0 −5 (10.16)
2 −2 3 R2 ↔R3 0 0 2 −3 −3
0 −3 3 0 3 0 −3 3 0 3
We can now use -5 as a pivot and do the second round of elimination:
   
1 2 −1 1 0 1 2 −1 1 0
0 −5 2 −2 3 −−−−−−→ 0 −5 2 −2 3

 . (10.17)
0 0 2 −3 −3 R3 ←R3 −0R2 0 0 2 −3 −3
3
0 −3 3 0 3 R4 ←R4 − 5 R2 0 0 9
5
6
5
6
5
158 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

Clearly, the elimination step R3 ← R3 − 0R2 is unnecessary as the coefficient


to be eliminated is already zero but we include it to illustrate the general
procedure. The last round of the elimination is
   
1 2 −1 1 0 1 2 −1 1 0
0 −5 2 −2 3 −−−−−−→ 0 −5 2 −2 3

 , (10.18)
0 0 2 −3 −3 R4 ←R4 − 109 R3 0
  0 2 −3 −3
9 6 6
0 0 5 5 5
0 0 0 3910
39
10

The last matrix, let us call it Ub , corresponds to the upper triangular system
    
1 2 −1 1 x1 0
0 −5 2 −2  x2   3
   =  , (10.19)
0 0 2 −3 x3  −3
0 0 0 39 10
x4 39
10

which we can solve with backward substitution to obtain the solution


   
x1 1
x2  −1
x=x3  =  0 .
   (10.20)
x4 1

Each of the steps in the Gaussian elimination process are linear trans-
formations and hence we can represent these transformations with matrices.
Note, however, that these matrices are not constructed in practice, we only
implement their effect (row exchange or elimination). The first round of elim-
ination (10.15) is equivalent to multiplying (from the left) Ab by the lower
triangular matrix
 
1 0 0 0
−2 1 0 0
E1 = 
−3 0 1 0 ,
 (10.21)
−1 0 0 1

that is
 
1 2 −1 1 0
0 0 2 −3 −3
E1 Ab = 
0 −5
. (10.22)
2 −2 3
0 −3 3 0 3
10.2. GAUSSIAN ELIMINATION 159

The matrix E1 is formed by taking the 4 × 4 identity matrix and replac-


ing the elements in the first column below 1 by negative the multiplier, i.e.
−2, −3, −1. We can exchange rows 2 and 3 with a permutation matrix
 
1 0 0 0
0 0 1 0
P = 0 1 0 0 ,
 (10.23)
0 0 0 1

which is obtained by exchanging the second and third rows in the 4 × 4


identity matrix,
 
1 2 −1 1 0
0 −5 2 −2 3
P E1 Ab =  . (10.24)
0 0 2 −3 −3
0 −3 3 0 3

To construct the matrix associated with the second round of elimination we


have to take 4 × 4 identity matrix and replace the elements in the second
column below the diagonal by negative the multipliers we got with the pivot
equal to -5:
 
1 0 0 0
0 1 0 0
E2 = 0
, (10.25)
0 1 0
0 − 35 0 1

and we get
 
1 2 −1 1 0
0 −5 2 −2 3
E2 P E1 Ab =  . (10.26)
0 0 2 −3 −3
9 6 6
0 0 5 5 5

Finally, for the last elimination we have


 
1 0 0 0
0 1 0 0
E3 = 0 0
, (10.27)
1 0
9
0 0 − 10 1
160 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

and E3 E2 P E1 Ab = Ub .
Observe that P E1 Ab = E10 P Ab , where
 
1 0 0 0
−3 1 0 0
E10 = 

, (10.28)
−2 0 1 0
−1 0 0 1
i.e., we exchange rows in advance and then reorder the multipliers accord-
ingly. If we focus on the matrix A, the first four columns of Ab , we have the
matrix factorization
E3 E2 E10 P A = U, (10.29)
where U is the upper triangular matrix
 
1 2 −1 1
0 −5 2 −2
U = . (10.30)
0 0 2 −3
39
0 0 0 10
Moreover, the product of upper (lower) triangular matrices is also an upper
(lower) triangular matrix and so is the inverse. Hence, we obtain the so-called
LU factorization
P A = LU, (10.31)
where L = (E3 E2 E10 )−1 = E10−1 E2−1 E3−1 is a lower triangular matrix. Now
recall that the matrices E10 , E2 , E3 perform the transformation of subtracting
the row of the pivot times the multiplier to the rows below. Therefore, the
inverse operation is to add the subtracted row back, i.e. we simply remove
the negative sign in front of the multipliers,
     
1 0 0 0 1 0 0 0 1 0 0 0
3 1 0 0 0 1 0 0 0 1 0 0
E10−1 =   −1
2 0 1 0 , E2 = 0 0 1 0 , E3 = 0 0 1 0 .
  −1  

1 0 0 1 0 53 0 1 0 0 10 9
1
It then follows that
 
1 0 0 0
3 1 0 0
L=
2 0
. (10.32)
1 0
1 53 9
10
1
10.2. GAUSSIAN ELIMINATION 161

Note that L has all the multipliers below the diagonal and U has all the
pivots on the diagonal. We will see that a factorization P A = LU is always
possible for any nonsingular n × n matrix A and can be very useful.
We now consider the general linear system (10.1). The matrix of coeffi-
cients and the right hand size are
   
a11 a12 · · · a1n b1
 a21 a22 · · · a2n   b2 
A =  .. ..  , b =  ..  , (10.33)
   
.. ..
 . . . .  .
an1 an2 · · · ann bn

respectively. We form the augmented matrix Ab by appending b to A as the


last column:
 
a11 a12 ··· a1n b1
 a21 a22 ··· a2n b2 
Ab =  .. . (10.34)
 
.. .. ..
 . . . . 
an1 an2 ··· ann bn

In principle if a11 6= 0 we can start the elimination. However, if |a11 | is


too small, dividing by it to compute the multipliers might lead to inaccurate
results in the computer, i.e. using finite precision arithmetic. It is generally
better to look for the coefficient of largest absolute value in the first column,
to exchange rows, and then do the elimination. This is called partial pivoting.
It is possible to then search for the element of largest absolute value in the
first row and switch columns accordingly. This is called complete pivoting
and works well provided the matrix is properly scaled. Henceforth, we will
consider Gaussian elimination only with partial pivoting, which is less costly
to apply.
To perform the first round of Gaussian elimination we do three steps:

1. Find the max |ai1 |, let us say this corresponds to the m-th row, i.e.
i
|am1 | = max |ai1 |. If |am1 | = 0, the matrix is singular. Stop.
i

2. Exchange rows 1 and m.

3. Compute the multipliers and perform the elimination.


162 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

After these three steps, we have transformed Ab into

 (1) (1) (1)


a11 a12 · · · a1n b01

 0 a(1) · · · a(1) b(1) 
(1) 22 2n 2 
Ab = . . (10.35)

 .. .. .. ..
. . . 
(1) (1) (1)
0 an2 · · · ann bn

(1)
This corresponds to Ab = E1 P1 Ab , where P1 is the permutation matrix
that exchanges rows 1 and m (P1 = I if no exchange is made) and E1 is the
matrix to obtain the elimination of the entries below the first element in the
first column. The same three steps above can now be applied to the smaller
(n − 1) × n matrix

 (1) (1) (1)



a22 · · · a2n b2
(1)  .. .. .. ..  ,
Ãb = . . . .  (10.36)
(1) (1) (1)
an2 · · · ann bn

and so on. Doing this process (n − 1) times, we obtain the reduced, upper
triangular system, which can be solved with backward substitution.
In matrix terms, the linear transformations in the Gaussian elimination
(k) (k−1) (0)
process correspond to Ab = Ek Pk Ab , for k = 1, 2, . . . , n − 1 (Ab = Ab ),
where the Pk and Ek are permutation and elimination matrices, respectively.
Pk = I if no row exchange is made prior to the k-th elimination round (but
recall that we do not construct the matrices Ek and Pk in practice). Hence,
the Gaussian elimination process for a nonsingular linear system produces
the matrix factorization

(n−1)
Ub ≡ Ab = En−1 Pn−1 En−2 Pn−2 · · · E1 P1 Ab . (10.37)

Arguing as in the introductory example we can rearrange the rows of Ab , with


the permutation matrix P = Pn−1 · · · P1 and the corresponding multipliers,
as if we knew in advance the row exchanges that would be needed to get

(n−1) 0 0
Ub ≡ Ab = En−1 En−2 · · · E10 P Ab . (10.38)
10.2. GAUSSIAN ELIMINATION 163

0 0
Since the inverse of En−1 En−2 · · · E10 is the lower triangular matrix
 
1 0 ··· ··· 0
 l21 1
 0 ··· 0 
L=
 l31 l32 1 ··· 0 , (10.39)
 .. .. . .. ... .. 
 . . .
ln1 ln2 · · · ln,n−1 1
where the lij , j = 1, . . . , n − 1, i = j + 1, . . . , n are the multipliers (com-
puted after all the rows have been rearranged), we arrive at the anticipated
factorization P A = LU . Incidentally, up to sign, Gaussian elimination also
produces the determinant of A because
(1) (2)
det(P A) = ± det(A) = det(LU ) = det(U ) = a11 a22 · · · a(n)
nn (10.40)
and so det(A) is plus or minus the product of all the pivots in the elimination
process.
In the implementation of Gaussian elimination the array storing the aug-
mented matrix Ab is overwritten to save memory. The pseudo code with
partial pivoting (assuming ai,n+1 = bi , i = 1, . . . , n) is presented in Algo-
rithm 10.3.

10.2.1 The Cost of Gaussian Elimination


We now do an operation count of Gaussian elimination to solve an n × n
linear system Ax = b.
We focus on the elimination as we already know that the work for the
step of backward substitution is O(n2 ). For each round of elimination, j =
1, . . . , n − 1, we need one division to compute each of the n − j multipliers
and (n − j)(n − j + 1) multiplications and (n − j)(n − j + 1) sums (subtracts)
to perform the eliminations. Thus, the total number number of operations is
n−1
X n−1
X
2(n − j)2 + 3(n − j)
 
W (n) = [2(n − j)(n − j + 1) + (n − j)] =
j=1 j=1
(10.41)
and using (10.10) and
m
X m(m + 1)(2m + 1)
i2 = , (10.42)
i=1
6
164 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

Algorithm 10.3 Gaussian Elimination with Partial Pivoting


1: for j = 1, . . . , n − 1 do
2: Find m such that |amj | = max |aij |
j≤i≤n
3: if |amj | = 0 then
4: stop . Matrix is singular
5: end if
6: ajk ↔ amk , k = j, . . . , n + 1 . Exchange rows
7: for i = j + 1, . . . , n do
8: m ← aij /ajj . Compute multiplier
9: aik ← aik − m ∗ ajk , k = j + 1, . . . , n + 1 . Elimination
10: aij ← m . Store multiplier
11: end for
12: end for
13: for i = n, n − 1, . . . , 1 do ! . Backward Substitution
X n
14: xi ← ai,n+1 − aij xj /aii
j=i+1
15: end for

we get
2
W (n) = n3 + O(n2 ). (10.43)
3
Thus, Gaussian elimination is computationally rather expensive for large
systems of equations.

10.3 LU and Choleski Factorizations


If Gaussian elimination can be performed without row interchanges, then we
obtain an LU factorization of A, i.e. A = LU . This factorization can be
advantageous when solving many linear systems with the same n × n matrix
A but different right hand sides because we can turn the problem Ax = b into
two triangular linear systems, which can be solved much more economically
in O(n2 ) operations. Indeed, from LU x = b and setting y = U x we have

Ly = b, (10.44)
U x = y. (10.45)
10.3. LU AND CHOLESKI FACTORIZATIONS 165

Given b, we can solve the first system for y with forward substitution and
then we solve the second system for x with backward substitution. Thus,
while the LU factorization of A has an O(n3 ) cost, subsequent solutions to
the linear system with the same matrix A but different right hand sides can
be done in O(n2 ) operations.
When can we obtain the factorization A = LU ? the following result
provides a useful sufficient condition.

Theorem 10.1. Let A be an n×n matrix whose leading principal submatrices


A1 , . . . , An are all nonsingular. Then, there exists an n × n lower triangular
matrix L, with ones on its diagonal, and an n × n upper triangular matrix
U such that A = LU and this factorization is unique.

Proof. Since A1 is nonsingular then a11 6= 0 and P1 = I. Suppose now


that we do not need to exchange rows in steps 2, . . . , k − 1 so that A(k−1) =
Ek−1 · · · E2 E1 A, that is
   
a11 · · · a1k · · · a1n 1 a11 · · · a1k · · · a1n

.. .   .. . . . 
. ···   −m21 . . · · · .. 
  

  .
 . .
···
 
   ..  ···

(k−1)  =  .

(k−1)
   
akk · · · akn   −mk1 ak1 akk · · · akn 

 1 

.. ..   . . .. .. 

  ..

.   ..

 . . . 
(k−1) (k−1)
ank · · · ann −mn1 · · · 1 an1 · · · ank · · · ann

The determinant of the boxed k × k leading principal submatrix on the left is


(2) (k−1)
a11 a22 · · · akk and this is equal to the determinant of the product of boxed
blocks on the right hand side. Since the determinant of the first such block
is one (it is a lower triangular matrix with ones on the diagonal), it follows
that
(2) (k−1)
a11 a22 · · · akk = det(Ak ) 6= 0, (10.46)
(k−1)
which implies that akk 6= 0 and so Pk = I and we conclude that U =
En−1 · · · E1 A and therefore A = LU .
Let us now show that this decomposition is unique. Suppose A = L1 U1 =
L2 U2 then

L−1 −1
2 L1 = U2 U1 . (10.47)
166 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

But the matrix on the left hand side is lower triangular (with ones in its diag-
onal) whereas the one on the right hand side is upper triangular. Therefore
L−1 −1
2 L1 = I = U2 U1 , which implies that L2 = L1 and U2 = U1 .

An immediate consequence of this result is that Gaussian elimination


can be performed without row interchange for a SDD matrix, as each of
its leading principal submatrices is itself SDD, and for a positive definite
matrix, as each of its leading principal submatrices is itself positive definite,
and hence non-singular in both cases.
Corollary 1. Let A be an n × n matrix. Then A = LU , where L is an n × n
lower triangular matrix , with ones on its diagonal, and U is an n × n upper
triangular matrix if either
(a) A is SDD or

(b) A is symmetric positive definite.


In the case of a positive definite matrix the number number of operations
can be cut down in approximately half by exploiting symmetry to obtain
a symmetric factorization A = BB T , where B is a lower triangular matrix
with positive entries in its diagonal. This representation is called Choleski
factorization of the symmetric positive definite matrix A.
Theorem 10.2. Let A be a symmetric positive definite matrix. Then, there
is a unique lower triangular matrix B with positive entries in its diagonal
such that A = BB T .
Proof. By Corollary 1 A has an LU factorization. Moreover, from (10.46) it
follows that all the pivots are positive and thus uii > 0 for all i = 1, . . . , n. We
√ √
can split the pivots evenly in L and U by letting D = diag( u11 , . . . , unn )
and writing A = LDD−1 U = (LD)(D−1 U ). Let B = LD and C = D−1 U .
√ √
Both matrices have diagonal elements u11 , . . . , unn but B is lower trian-
gular while C is upper triangular. Moreover, A = BC and because AT = A
we have that C T B T = BC, which implies

B −1 C T = C(B T )−1 . (10.48)

The matrix on the left hand side is lower triangular with ones in its diagonal
while the matrix on the right hand side is upper triangular also with ones
in its diagonal. Therefore, B −1 C T = I = C(B T )−1 and thus, C = B T
10.3. LU AND CHOLESKI FACTORIZATIONS 167

and A = BB T . To prove that this Choleski factorization is unique we go


back to the LU factorization, which we now is unique (if we choose L to
have ones in its diagonal). Given A = BB T , where B is lower triangular
−1
with positive diagonal elements b11 , . . . , bnn , we can write A = BDB DB B T ,
−1
where DB = diag(b11 , . . . , bnn ). Then L = BDB and U = DB B T yield
the unique LU factorization of A. Now suppose there is another Choleski
factorization A = CC T . Then by the uniqueness of the LU factorization, we
have

−1
L = BDB = CDC−1 , (10.49)
T T
U = DB B = DC C , (10.50)

where DC = diag(c11 , . . . , cnn ). Equation (10.50) implies that b2ii = c2ii for
i = 1, . . . , n and since bii > 0 and cii > 0 for all i, then DC = DB and
consequently C = B.

The Choleski factorization is usually written as A = LLT and is obtained


by exploiting the lower triangular structure of L and symmetry as follows.
First, L = (lij ) is lower triangular then lij = 0 for 1 ≤ i < j ≤ n and thus

n min(i,j)
X X
aij = lik ljk = lik ljk . (10.51)
k=1 k=1

Now, because AT = A we only need aij for i ≤ j, that is

i
X
aij = lik ljk 1 ≤ i ≤ j ≤ n. (10.52)
k=1

We can solve equations (10.52) to determine L, one column at a time. If we


set i = 1 we get

2 √
a11 = l11 , → l11 = a11 ,
a12 = l11 l21 ,
..
.
a1n = l11 ln1
168 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

and this allows us to get the first column of L. The second column is now
found by using (10.52) for i = 2
q
2 2 2
a22 = l21 + l22 , → l22 = a22 − l21 ,
a23 = l21 l31 + l22 l32 ,
..
.
a2n = l21 ln1 + l22 ln2 ,

etc. Algorithm 10.4 gives the pseudo code for the Choleski factorization.

Algorithm 10.4 Choleski factorization


1: for i = 1,r
. . . , n do . Compute column i of L for i = 1, . . . , n
 Pi−1 2 
2: lii ← aii − k=1 lik
3: for j = i + 1, . .P
. , n do
i−1
4: lji ← (aij − k=1 lik ljk )/lii
5: end for
6: end for

10.4 Tridiagonal Linear Systems


If the matrix of coefficients A has a triadiagonal structure
 
a1 b 1
 c 1 a2 b 2 
 
A=
 ... ... ... 
 (10.53)
 
 bn−1 
cn−1 an

its LU factorization can be computed at an O(n) cost and the corresponding


linear system can thus be solved efficiently.

Theorem 10.3. If A is triadiagonal and all of its leading principal subma-


10.4. TRIDIAGONAL LINEAR SYSTEMS 169

trices are nonsingular then


    
a1 b 1 1 m1 b1
 c 1 a2 b2  l1 1  m2 b2 
    
 ... ... .. =
  . . . .  . . . . ,

 . . .  . .
    
 bn−1     bn−1 
cn−1 an ln−1 1 mn
(10.54)

where

m1 = a, (10.55)
lj = cj /mj , mj+1 = aj+1 − lj bj , for j = 1, . . . , n − 1, (10.56)

and this factorization is unique.


Proof. By Theorem 10.1 we know that A has a unique LU factorization,
where L is unit lower triangular and U is upper triangular. We will show
that we can solve uniquely for l1 , . . . , ln−1 and m1 , . . . , mn so that (10.54)
holds. Equating the matrix product on the right hand side of (10.54), row
by row, we get

1st row: a1 = m1 , b1 = b1 ,
2nd row: c1 = m1 l1 , a2 = l1 b1 + m2 , b2 = b2 ,
..
.
(n − 1)-st row: cn−2 = mn−2 ln−2 , an−1 = ln−2 bn−2 + mn−1 , bn−1 = bn−1 ,
n-th row: cn−1 = mn−1 ln−1 , an = ln−1 bn−1 + mn

from which (10.55)-(10.56) follows. Of course, we need the mj ’s to be nonzero


to use (10.56). We now prove this is the case.
c
Note that mj+1 = aj+1 − lj bj = aj+1 − mjj bj . Therefore

mj+1 mj = aj+1 mj − bj cj , for j = 1, . . . , n − 1. (10.57)

Thus,

det(A1 ) = a1 = m1 , (10.58)
det(A2 ) = a2 a1 − c1 b1 = a2 m1 − b1 c1 = m1 m2 . (10.59)
170 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

We now do induction to show that det(Ak ) = m1 m2 · · · mk . Suppose det(Aj ) =


m1 m2 · · · mj for j = 1, . . . , k − 1. Expanding by the last row we get

det(Ak ) = ak det(Ak−1 ) − bk−1 ck−1 det(Ak−2 ) (10.60)

and using the induction hypothesis and (10.57) it follows that

det(Ak ) = m1 m2 · · · mk−2 [ak mk−1 − bk−1 ck−1 ] = m1 · · · mk , (10.61)

for k = 1, . . . , n. Since det(Ak ) 6= 0 for k = 1, . . . , n then m1 , m2 , . . . , mn are


all nonzero.

Algorithm 10.5 Tridiagonal solver


1: m1 ← a1
2: for j = 1, . . . , n − 1 do . Compute column L and U
3: lj ← cj /mj
4: mj+1 ← aj+1 − li ∗ bj
5: end for
6: y1 ← d1 . Forward substitution on Ly = d
7: for j = 2, . . . , n do
8: yj ← dj − lj−1 ∗ yj−1
9: end for
10: xn ← yn /mn . Backward substitution on U x = y
11: for j = n − 1, n − 2 . . . , 1 do
12: xj ← (yj − bj ∗ xj+1 )/mj
13: end for

10.5 A 1D BVP: Deformation of an Elastic


Beam
We saw in Section 5.5 an example of a very large system of equations in
connection with the least squares problem for fitting high dimensional data.
We now consider another example which leads to a large linear system of
equations.
Suppose we have a thin beam of unit length, stretched horizontally and
occupying the interval [0, 1]. The beam is subjected to a load density f (x)
10.5. A 1D BVP: DEFORMATION OF AN ELASTIC BEAM 171

at each point x ∈ [0, 1], and pinned at end points. Let u(x) be the beam
deformation from the horizontal position. Assuming that the deformations
are small (linear elasticity regime), u satisfies

−u00 (x) + c(x)u(x) = f (x), 0 < x < 1, (10.62)

where c(x) ≥ 0 is related to the elastic, material properties of the beam. Be-
cause the beam is pinned at the end points we have the boundary conditions

u(0) = u(1) = 0. (10.63)

The system (10.62)-(10.63) is called a boundary value problem (BVP). That


is, we need to find a function u that satisfies the ordinary differential equation
(10.62) and the boundary conditions (10.63) for any given, continuous f and
c. The condition c(x) ≥ 0 guarantees existence and uniqueness of solution
to this problem.
We will construct a discrete model whose solution gives an accurate ap-
proximation to the exact solution at a finite collection of selected points
(called nodes) in [0, 1]. We take the nodes to be equally spaced and to in-
clude the interval end points (boundary). So we choose a positive integer N
and define the nodes or grid points

x0 = 0, x1 = h, x2 = 2h, . . . , xN = N h, xN +1 = 1, (10.64)

where h = 1/(N + 1) is the grid size or node spacing. The nodes x1 , . . . , xN


are called interior nodes, because they lie inside the interval [0, 1], and the
nodes x0 and xN +1 are called boundary nodes.
We now construct a discrete approximation to the ordinary differential
equation by replacing the second derivative with a second order finite differ-
ence approximation. As we know,

u(xj+1 ) − 2u(xj ) + u(xj−1 )


u00 (xj ) = + O(h2 ). (10.65)
h2
Neglecting the O(h2 ) error and denoting the approximation of u(xj ) by vj
(i.e. vj ≈ u(xj )) fj = f (xj ) and cj = c(xj ), for j = 1, . . . , N , then at each
interior node
vj−1 + 2vj + vj+1
− + cj vj = fj , j = 1, 2, . . . , N (10.66)
h2
172 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

and at the boundary nodes, applying (10.63), we have

v0 = vN +1 = 0. (10.67)

Thus, (10.66) is a linear system of N equations in N unknowns v1 , . . . , vN ,


which we can write in matrix form as

2 + c1 h2 −1 0 ··· ··· 0
    
v1 f1
... ..
2 v
  2   f2 
 −1 2 + c2 h −1 .     
  ..   .. 


 0 ... ... ... ...  .   . 
1  .. .. .. .. .. ..
 .   . 
 .   . 
 .  =  . .

2 . . . . . .
h 

 .   . 
.. .. .. ..   ..   .. 

 . . . . 0    
 .. .. ..   ..   .. 
 . . . −1   .   . 
0 ··· 0 −1 2 + cN h2 vN fN
(10.68)

The matrix, let us call it A, of this system is tridiagonal and symmetric.


A direct computation shows that for an arbitrary, nonzero, column vector
v = [v1 , . . . , vN ]T

N
" 2 #
X vj + cj vj
vT Av = + cj vj2 > 0, ∀ v 6= 0 (10.69)
j=1
h

and therefore, since cj ≥ 0 for all j, A is positive definite. Thus, there is


a unique solution to (10.68) and can be efficiently found with our tridiago-
nal solver, Algorithm 10.5. Since the expected numerical error is O(h2 ) =
O(1/(N + 1)2 ), even a modest accuracy of O(10−4 ) requires N ≈ 100.

10.6 A 2D BVP: Dirichlet Problem for the


Poisson’s Equation
We now look at a simple 2D BVP for an equation that is central to many
applications, namely Poisson’s equation. For concreteness here, we can think
of the equation as a model for small deformations u of a stretched, square
membrane fixed to a wire at its boundary and subject to a force density
10.6. A 2D BVP: DIRICHLET PROBLEM FOR THE POISSON’S EQUATION173

f . Denoting by Ω, and ∂Ω, the unit square [0, 1] × [0, 1] and its boundary,
respectively, the BVP is to find u such that

−∆u(x, y) = f (x, y), for (x, y) ∈ Ω (10.70)

and

u(x, y) = 0. for (x, y) ∈ ∂Ω. (10.71)

In (10.70), ∆u is the Laplacian of u, also denoted as ∇2 u, and is given by

∂ 2u ∂ 2u
∆u = ∇2 u = uxx + uyy = + . (10.72)
∂x2 ∂y 2

Equation (10.70) is Poisson’s equation (in 2D) and together with (10.71)
specify a (homogeneous) Dirichlet problem because the value of u is given at
the boundary.
To construct a numerical approximation to (10.70)-(10.71), we proceed as
in the previous 1D BVP example by discretizing the domain. For simplicity,
we will use uniformly spaced grid points. We choose a positive integer N and
define the grid points of our domain Ω = [0, 1] × [0, 1] as

(xi , xj ) = (ih, jh), for i, j = 0, . . . , N + 1, (10.73)

where h = 1/(N + 1). The interior nodes correspond to 1 ≤ i, j ≤ N and the


boundary nodes are those corresponding to the remaining values of indices i
and j (i or j equal 0 and i or j equal N + 1).
At each of the interior nodes we replace the Laplacian by its second order
order finite difference approximation, called the five-point discrete Laplacian

u(xi−1 , xj ) + u(xi+1 , yj ) + u(xi , yj−1 ) + u(xi , yj+1 ) − 4u(xi , yj )


∇2 u(xi , yj ) =
h2
+ O(h2 ).
(10.74)

Neglecting the O(h2 ) discretization error and denoting by vij the approxima-
tion to u(xi , yj ) we get:

vi−1,j + vi+1,j + vi,j−1 + vi,j+1 − 4vij


− = fij , for 1 ≤ i, j ≤ N . (10.75)
h2
174 CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS I

This is a linear system of N 2 equations for the N 2 unknowns, vij , 1 ≤ i, j ≤


N . We have freedom to order or label the unknowns any way we wish and
that will affect the structure of the matrix of coefficients of the linear system
but remarkably the matrix will be symmetric positive definite regardless of
ordering of the unknowns!.
The most common labeling is the so-called lexicographical order, which
proceeds from the bottom row to top one, left to right, v11 , v12 , . . . , v1N , v21 , . . .,
etc. Denoting by v1 = [v11 , v12 , . . . , v1N ]T , v2 = [v21 , v22 , . . . , v2N ]T , etc., and
similarly for the right hand side f , the linear system (10.75) can be written
in matrix form as

T −I 0 0
    
v1 f1
.
−I T −I . .   v2   f2 
  ..   .. 
    

0 ... ... ... ...  .  .
 . 
2  .. 
  
 .. .. .. .. ..  . 
 .  = h  . . (10.76)
 . . . . .
 .  .
.. .. .. ..  .   .. 

. . . . 0  . 

  
 .. .. ..   ..   .. 
 . . . −I   .  .
0 0 −I T vN fN

Here, I is the N × N identity matrix and T is the N × N tridiagonal matrix

4 −1 0 0
 
−1 4 −1 . . . 
 

0 . . .
.. .. .. .. . 

 
T =
 . . . . . . . . . . .

(10.77)
. . . . .
.. .. .. ..
 

 . . . . 0


 . . .
. . . . . . −1

0 0 −1 4

Thus, the matrix of coefficients in (10.76), is sparse, i.e. the vast majority of
10.7. LINEAR ITERATIVE METHODS FOR AX = B 175

its entries are zeros. For example, for N = 3 this matrix is


 
4 −1 0 −1 0 0 0 0 0
−1 4 −1 0 −1 0 0 0 0
 
0
 −1 4 0 0 −1 0 0 0 
−1 0 0 4 −1 0 −1 0 0
 
0
 −1 0 −1 4 −1 0 −1 0  .
0
 0 −1 0 −1 4 0 0 −1 
0
 0 0 −1 0 0 4 −1 0  
0 0 0 0 −1 0 −1 4 −1
0 0 0 0 0 −1 0 −1 4
Gaussian elimination is hugely inefficient for a large system (n > 100) with a
sparse matrix, as in this example. This is because the intermediate matrices
in the elimination would be generally dense due to fill-in introduced by the
elimination process. To illustrate the high cost of Gaussian elimination, if
we merely use N = 100 (this corresponds to a modest discretization error of
O(10−4 )), we end up with n = N 2 = 104 unknowns and the cost of Gaussian
elimination would be O(1012 ) operations.

10.7 Linear Iterative Methods for Ax = b


As we have seen, Gaussian elimination is an expensive procedure for large
linear systems of equations. An alternative is to seek not an exact (up to
roundoff error) solution in a finite number of steps but an approximation to
the solution that can be obtained from an iterative procedure applied to an
initial guess x(0) .
We are going to consider first a class of iterative methods where the
central idea is to write the matrix A as the sum of a non-singular matrix M ,
whose corresponding system is easy to solve, and a remainder −N = A − M ,
so that the system Ax = b is transformed into the equivalent system
M x = N x + b. (10.78)

Starting with an initial guess x(0) , (10.78) defines a sequence of approxima-


tions generated by
M x(k+1) = N x(k) + b, k = 0, 1, . . . (10.79)
The main questions are

1. When does this iteration converge?

2. What determines its rate of convergence?

3. What is the computational cost?

But first we look at three concrete iterative methods of the form (10.79).
Unless otherwise stated, A is assumed to be a non-singular n × n matrix and
b a given n-column vector.

10.8 Jacobi, Gauss-Seidel, and S.O.R.


If all the diagonal elements of A are nonzero we can take M = diag(A)
and then at each iteration (i.e. for each k) the linear system (10.79) can be
easily solved to obtain the next iterate x(k+1) . Note that we do not need to
compute M −1 nor do we need to do the matrix product M −1 N (and due to
its cost it should be avoided). We just need to solve the linear system with
the matrix M , which in this case is trivial to do. We just solve the first
equation for the first unknown, the second equation for the second unknown,
etc., and we obtain the so-called Jacobi iterative method:
x_i^{(k+1)} = ( b_i − Σ_{j≠i} a_{ij} x_j^{(k)} ) / a_{ii},   i = 1, 2, ..., n,  and  k = 0, 1, ...    (10.80)

The iteration could be stopped when

‖x^{(k+1)} − x^{(k)}‖_∞ / ‖x^{(k+1)}‖_∞ ≤ Tolerance.    (10.81)

Example 10.1. Consider the 4 × 4 linear system

10x1 − x2 + 2x3 = 6,
−x1 + 11x2 − x3 + 3x4 = 25,
(10.82)
2x1 − x2 + 10x3 − x4 = −11,
3x2 − x3 + 8x4 = 15.

It has the unique solution (1,2,-1,1). Jacobi’s iteration for this system is

x_1^{(k+1)} = (1/10) x_2^{(k)} − (1/5) x_3^{(k)} + 3/5,
x_2^{(k+1)} = (1/11) x_1^{(k)} + (1/11) x_3^{(k)} − (3/11) x_4^{(k)} + 25/11,
x_3^{(k+1)} = −(1/5) x_1^{(k)} + (1/10) x_2^{(k)} + (1/10) x_4^{(k)} − 11/10,        (10.83)
x_4^{(k+1)} = −(3/8) x_2^{(k)} + (1/8) x_3^{(k)} + 15/8.
Starting with x^{(0)} = [0, 0, 0, 0]^T we obtain

x^{(1)} = [0.60000000, 2.27272727, −1.10000000, 1.87500000]^T,
x^{(2)} = [1.04727273, 1.71590909, −0.80522727, 0.88522727]^T,        (10.84)
x^{(3)} = [0.93263636, 2.05330579, −1.04934091, 1.13088068]^T.
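A short Python sketch of the Jacobi iteration (10.80), with the stopping criterion (10.81), applied to the system (10.82); the function name and tolerance are illustrative choices, not prescribed by the text.

import numpy as np

def jacobi(A, b, x0, tol=1e-8, max_iter=100):
    """Jacobi iteration (10.80) with the stopping criterion (10.81)."""
    x = x0.copy()
    D = np.diag(A)                     # diagonal of A
    R = A - np.diag(D)                 # off-diagonal part
    for k in range(max_iter):
        x_new = (b - R @ x) / D        # solve each equation for its own unknown
        if np.linalg.norm(x_new - x, np.inf) <= tol * np.linalg.norm(x_new, np.inf):
            return x_new, k + 1
        x = x_new
    return x, max_iter

A = np.array([[10., -1.,  2.,  0.],
              [-1., 11., -1.,  3.],
              [ 2., -1., 10., -1.],
              [ 0.,  3., -1.,  8.]])
b = np.array([6., 25., -11., 15.])
x, its = jacobi(A, b, np.zeros(4))
print(x, its)   # approaches the exact solution (1, 2, -1, 1)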
In the Jacobi iteration, when we evaluate x_2^{(k+1)} we already have x_1^{(k+1)} available. When we evaluate x_3^{(k+1)} we already have x_1^{(k+1)} and x_2^{(k+1)} available, and so on. If we update the Jacobi iteration with the already computed components of x^{(k+1)} we obtain the Gauss-Seidel iteration:
x_i^{(k+1)} = ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^{(k+1)} − Σ_{j=i+1}^{n} a_{ij} x_j^{(k)} ) / a_{ii},   i = 1, 2, ..., n,  k = 0, 1, ...    (10.85)

The Gauss-Seidel iteration is equivalent to the iteration obtained by taking


M as the lower triangular part of the matrix A, including its diagonal.
Example 10.2. For the system (10.82), starting again with the initial guess
[0, 0, 0, 0]T , Gauss-Seidel produces the following approximations
     
x^{(1)} = [0.60000000, 2.32727273, −0.98727273, 0.87886364]^T,
x^{(2)} = [1.03018182, 2.03693802, −1.0144562, 0.98434122]^T,        (10.86)
x^{(3)} = [1.00658504, 2.00355502, −1.00252738, 0.99835095]^T.

In an attempt to accelerate the convergence of the Gauss-Seidel iteration, one could also put a weight on the diagonal part of A and split it between the matrices M and N of the iterative method (10.79). Specifically, we can write

diag(A) = (1/ω) diag(A) − ((1 − ω)/ω) diag(A),    (10.87)
where the first term of the right hand side goes into M and the last into N .
The Gauss-Seidel method then becomes

x_i^{(k+1)} = [ a_{ii} x_i^{(k)} − ω ( Σ_{j=1}^{i−1} a_{ij} x_j^{(k+1)} + Σ_{j=i}^{n} a_{ij} x_j^{(k)} − b_i ) ] / a_{ii},   i = 1, 2, ..., n,  k = 0, 1, . . .    (10.88)

Note that ω = 1 corresponds to Gauss-Seidel. This iteration is generically called S.O.R. (successive over-relaxation), even though we refer to over-relaxation only when ω > 1 and under-relaxation when ω < 1. It can be proved that a necessary condition for convergence is that 0 < ω < 2.
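A minimal Python sketch of one S.O.R. sweep of (10.88); setting omega = 1 recovers Gauss-Seidel, so the same function covers both methods. The function name and tolerance are illustrative.

import numpy as np

def sor(A, b, x0, omega=1.0, tol=1e-8, max_iter=200):
    """S.O.R. iteration (10.88); omega = 1 gives Gauss-Seidel."""
    n = len(b)
    x = x0.copy()
    for k in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            # uses the already updated components x[:i] and the old components x[i+1:]
            sigma = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) <= tol * np.linalg.norm(x, np.inf):
            return x, k + 1
    return x, max_iter

# For the system (10.82): sor(A, b, np.zeros(4), omega=1.0) reproduces the Gauss-Seidel iterates of Example 10.2.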

10.9 Convergence of Linear Iterative Meth-


ods
To study the convergence of iterative methods of the form M x(k+1) = N x(k) +
b, for k = 0, 1, . . . we use the equivalent iteration
x(k+1) = T x(k) + c, k = 0, 1, . . . (10.89)
where
T = M −1 N = I − M −1 A (10.90)
is called the iteration matrix and c = M −1 b.
The issue of convergence is that of existence of a fixed point for the map
F (x) = T x + c defined for all x ∈ Rn . That is, whether or not there is an
x ∈ Rn such that F (x) = x. For if the sequence defined in (10.89) converges
to a vector x then, by continuity of F , we would have x = T x + c = F (x).
For any x, y ∈ Rn and for any induced matrix norm we have
kF (x) − F (y)k = kT x − T yk ≤ kT k kx − yk. (10.91)

If for some induced norm kT k < 1, F is a contracting map or contraction


and we will show that this guarantees the existence of a unique fixed point.
We will also show that the rate of convergence of the sequence generated by
iterative methods of the form (10.89) is given by the spectral radius ρ(T )
of the iteration matrix T . These conclusions will follow from the following
result.

Theorem 10.4. Let T be an n × n matrix. Then the following statements


are equivalent:

(a) lim T k = 0.
k→∞

(b) lim T k x = 0 for all x ∈ Rn .


k→∞

(c) ρ(T ) < 1.

(d) kT k < 1 for at least one induced norm.

Proof. (a) ⇒ (b): For any induced norm we have that

‖T^k x‖ ≤ ‖T^k‖ ‖x‖    (10.92)

and so if T^k → 0 as k → ∞ then ‖T^k x‖ → 0, that is, T^k x → 0 for all x ∈ Rn.

(b) ⇒ (c): Let us suppose that lim T k x = 0 for all x ∈ Rn but that
k→∞
ρ(T) ≥ 1. Then, there is an eigenvector v such that T v = λv with |λ| ≥ 1 and the sequence T^k v = λ^k v does not converge, which is a contradiction.

(c) ⇒ (d): By Theorem 9.5, for each ε > 0 there is at least one induced norm ‖ · ‖ such that ‖T‖ ≤ ρ(T) + ε, from which the statement follows.

(d) ⇒ (a): This follows immediately from kT k k ≤ kT kk .

Theorem 10.5. The iterative method (10.89) is convergent for any initial
guess x(0) if and only if ρ(T ) < 1 or equivalently if and only if kT k < 1 for
at least one induced norm.

Proof. Let x be the exact solution of Ax = b. Then

x − x(1) = T x + c − (T x(0) + c) = T (x − x(0) ), (10.93)



from which it follows that the error of the k-th iterate, e_k = x^{(k)} − x, satisfies

e_k = T^k e_0,    (10.94)

for k = 1, 2, . . ., where e_0 = x^{(0)} − x is the error of the initial guess. The
conclusion now follows immediately from Theorem 10.4.
The spectral radius ρ(T ) of the iteration matrix T measures the rate of
convergence of the method. For if T is normal, then kT k2 = ρ(T ) and from
(10.94) we get

kek k2 ≤ ρ(T )k ke0 k2 . (10.95)

But for each k we can find a vector e_0 for which equality holds, so ρ(T)^k ‖e_0‖_2 is a least upper bound for the error ‖e_k‖_2. If T is not normal, the following result shows that, asymptotically, ‖T^k‖ ≈ ρ(T)^k for any matrix norm.

Theorem 10.6. Let T be any n × n matrix. Then, for any matrix norm k · k

lim_{k→∞} ‖T^k‖^{1/k} = ρ(T).    (10.96)

Proof. We know that ρ(T^k) = ρ(T)^k and that ρ(T) ≤ ‖T‖. Therefore

ρ(T) ≤ ‖T^k‖^{1/k}.    (10.97)

Now, for any given ε > 0 construct the matrix T_ε = T/(ρ(T) + ε). Then lim_{k→∞} T_ε^k = 0, as ρ(T_ε) < 1. Therefore, there is an integer K_ε such that

‖T_ε^k‖ = ‖T^k‖ / (ρ(T) + ε)^k ≤ 1,   for all k ≥ K_ε.    (10.98)

Thus, for all k ≥ K_ε we have

ρ(T) ≤ ‖T^k‖^{1/k} ≤ ρ(T) + ε,    (10.99)

from which the result follows.

Theorem 10.7. Let A be an n × n strictly diagonally dominant matrix. Then,


for any initial guess x(0) ∈ Rn

(a) The Jacobi iteration converges to the exact solution of Ax = b.



(b) The Gauss-Seidel iteration converges to the exact solution of Ax = b.

Proof. (a) The Jacobi iteration matrix T has entries T_{ii} = 0 and T_{ij} = −a_{ij}/a_{ii} for i ≠ j. Therefore,

‖T‖_∞ = max_{1≤i≤n} Σ_{j≠i} |a_{ij}| / |a_{ii}| = max_{1≤i≤n} (1/|a_{ii}|) Σ_{j=1, j≠i}^{n} |a_{ij}| < 1.    (10.100)

(b) We will prove that ρ(T) < 1 for the Gauss-Seidel iteration. Let x be an eigenvector of T with eigenvalue λ, normalized to have ‖x‖_∞ = 1. Recall that T = I − M^{−1}A. Then, T x = λx implies M x − Ax = λM x, from which we get

− Σ_{j=i+1}^{n} a_{ij} x_j = λ Σ_{j=1}^{i} a_{ij} x_j = λ a_{ii} x_i + λ Σ_{j=1}^{i−1} a_{ij} x_j.    (10.101)

Now choose i such that kxk∞ = |xi | = 1 then


|λ| |a_{ii}| ≤ |λ| Σ_{j=1}^{i−1} |a_{ij}| + Σ_{j=i+1}^{n} |a_{ij}|,

|λ| ≤ Σ_{j=i+1}^{n} |a_{ij}| / ( |a_{ii}| − Σ_{j=1}^{i−1} |a_{ij}| ) < Σ_{j=i+1}^{n} |a_{ij}| / Σ_{j=i+1}^{n} |a_{ij}| = 1,

where the last inequality was obtained by using that A is SDD. Thus, |λ| < 1
and so ρ(T ) < 1.

Theorem 10.8. A necessary condition for the convergence of the S.O.R. iteration is 0 < ω < 2.

Proof. We will show that det(T) = (1 − ω)^n. Since |det(T)| equals the product of the moduli of the eigenvalues of T, we have |det(T)| ≤ ρ^n(T), and this implies that

ρ(T) ≥ |1 − ω|.    (10.102)



Since ρ(T ) < 1 is required for convergence, the conclusion follows. Now,
T = M −1 [M − A] and det(T ) = det(M −1 ) det(M − A). From the definition
of the S.O.R. iteration (10.88) we get that
a_{ii} x_i^{(k+1)} + ω Σ_{j=1}^{i−1} a_{ij} x_j^{(k+1)} = a_{ii} x_i^{(k)} − ω Σ_{j=i}^{n} a_{ij} x_j^{(k)} + ω b_i.    (10.103)

Dividing (10.103) by ω shows that M = (1/ω) diag(A) + L, where L is the strictly lower triangular part of A. Thus M is lower triangular with diagonal (1/ω) diag(A), and consequently det(M^{−1}) = det(ω diag(A)^{−1}). Similarly, M − A = ((1 − ω)/ω) diag(A) − U, with U the strictly upper triangular part of A, is upper triangular, so det(M − A) = det(((1 − ω)/ω) diag(A)). Thus,

det(T) = det(M^{−1}) det(M − A) = det(ω diag(A)^{−1}) det(((1 − ω)/ω) diag(A)) = det((1 − ω)I) = (1 − ω)^n.    (10.104)

If A is positive definite S.O.R. converges for any initial guess. However,


as we will see, there are more efficient iterative methods for positive definite
linear systems.
Chapter 11

Linear Systems of Equations II

In this chapter we focus on some numerical methods for the solution of large
linear systems Ax = b where A is a sparse, symmetric positive definite matrix.
We also look briefly at the non-symmetric case.

11.1 Positive Definite Linear Systems as an


Optimization Problem
Suppose that A is an n × n symmetric, positive definite matrix and we are
interested in solving Ax = b. Let x̄ be the unique, exact solution of Ax = b.
Since A is positive definite, we can define the norm

‖x‖_A = √(x^T A x).    (11.1)

Henceforth we are going to denote the inner product of two vectors x, y in Rn by ⟨x, y⟩, i.e.

⟨x, y⟩ = x^T y = Σ_{i=1}^{n} x_i y_i.    (11.2)

Consider now the quadratic function of x ∈ Rn defined by

J(x) = (1/2) ‖x − x̄‖²_A.    (11.3)
Note that J(x) ≥ 0 and J(x) = 0 if and only if x = x̄ because A is positive
definite. Therefore, x minimizes J if and only if x = x̄. In optimization, the


function to be minimized (maximized), J in our case, is called the objective


function.
For several optimization methods it is useful to consider the one-dimensional
problem of minimizing J along a fixed direction. For given x, v ∈ Rn we consider the so-called line minimization problem, which consists of minimizing J along the line that passes through x in the direction of v, i.e.

min_{t∈R} J(x + tv).    (11.4)

Denoting g(t) = J(x + tv) and using the definition (11.3) of J we get

g(t) = (1/2) ⟨x − x̄ + tv, A(x − x̄ + tv)⟩
     = J(x) + ⟨x − x̄, Av⟩ t + (1/2) ⟨v, Av⟩ t²        (11.5)
     = J(x) + ⟨Ax − b, v⟩ t + (1/2) ⟨v, Av⟩ t².
This is a parabola opening upward because hv, Avi > 0 for all v 6= 0. Thus,
its minimum is given by the critical point

0 = g′(t*) = −⟨v, b − Ax⟩ + t* ⟨v, Av⟩,    (11.6)

that is

t* = ⟨v, b − Ax⟩ / ⟨v, Av⟩    (11.7)

and the minimum of J along the line x + tv, t ∈ R, is

g(t*) = J(x) − (1/2) ⟨v, b − Ax⟩² / ⟨v, Av⟩.    (11.8)

Finally, using the definition of ‖ · ‖_A and Ax̄ = b, we have

(1/2) ‖x − x̄‖²_A = (1/2) ‖x‖²_A − ⟨b, x⟩ + (1/2) ‖x̄‖²_A    (11.9)

and so it follows that

∇J(x) = Ax − b.    (11.10)

11.2 Line Search Methods


We just saw in the previous section that the problem of solving Ax = b when A is a symmetric positive definite matrix is equivalent to the convex minimization problem for the quadratic objective function J(x) = (1/2)‖x − x̄‖²_A. An important class of methods for this type of optimization problems is called line search methods.
Line search methods produce a sequence of approximations to the mini-
mizer, in the form
x(k+1) = x(k) + tk v (k) , k = 0, 1, . . . , (11.11)
where the vector v (k) and the scalar tk are called the search direction and the
step length at the k-th iteration, respectively. The question then is how to
select the search directions and the step lengths to converge to the minimizer.
Most line search methods are of descent type because they require that the value of J decrease with each iteration. Going back to (11.5), this means that descent line search methods satisfy the condition ⟨v^{(k)}, ∇J(x^{(k)})⟩ < 0.
Starting with an initial guess x(0) , line search methods generate
x^{(1)} = x^{(0)} + t_0 v^{(0)},    (11.12)
x^{(2)} = x^{(1)} + t_1 v^{(1)} = x^{(0)} + t_0 v^{(0)} + t_1 v^{(1)},    (11.13)

etc., so that the k-th element of the sequence is x^{(0)} plus a linear combination of v^{(0)}, v^{(1)}, . . . , v^{(k−1)}:

x^{(k)} = x^{(0)} + t_0 v^{(0)} + t_1 v^{(1)} + · · · + t_{k−1} v^{(k−1)}.    (11.14)

That is,

x^{(k)} − x^{(0)} ∈ span{v^{(0)}, v^{(1)}, . . . , v^{(k−1)}}.    (11.15)
Unless otherwise noted, we will take the step length tk to be given by the
one-dimensional minimizer (11.7) evaluated at the k-step, i.e.
t_k = ⟨v^{(k)}, r^{(k)}⟩ / ⟨v^{(k)}, Av^{(k)}⟩,    (11.16)
where
r(k) = b − Ax(k) (11.17)
is the residual of the linear equation Ax = b associated with the approxima-
tion x(k) .

11.2.1 Steepest Descent


One way to guarantee a decrease of J(x) = (1/2)‖x − x̄‖²_A at every step of a line search method is to choose v^{(k)} = −∇J(x^{(k)}), which is locally the direction of fastest decrease of J. Recalling that ∇J(x^{(k)}) = −r^{(k)}, we take v^{(k)} = r^{(k)}. The optimal step length is selected according to (11.16), so that we choose the line minimizer (in the direction of −∇J(x^{(k)})) of J. The resulting method is called steepest descent and, starting from an initial guess x^{(0)}, is given by

t_k = ⟨r^{(k)}, r^{(k)}⟩ / ⟨r^{(k)}, Ar^{(k)}⟩,    (11.18)
x^{(k+1)} = x^{(k)} + t_k r^{(k)},    (11.19)
r^{(k+1)} = r^{(k)} − t_k Ar^{(k)},    (11.20)

for k = 0, 1, . . .. Formula (11.20), which comes from subtracting A times (11.19) from b, is preferable to using the definition of the residual, i.e. r^{(k+1)} = b − Ax^{(k+1)}, due to round-off errors.
If A is an n × n diagonal, positive definite matrix, the steepest descent
method finds the minimum in at most n steps. This is easy to visualize for
n = 2 as the level sets of J are ellipses with their principal axes aligned with
the coordinate axes. For a general, non-diagonal positive definite matrix A,
convergence of the steepest descent sequence to the minimizer of J and hence
to the solution of Ax = b is guaranteed but it may not be reached in a finite
number of steps.
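A minimal Python sketch of the steepest descent iteration (11.18)-(11.20), assuming A is symmetric positive definite; the function name and stopping test are illustrative choices.

import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, max_iter=1000):
    """Steepest descent (11.18)-(11.20) for SPD A."""
    x = x0.copy()
    r = b - A @ x                      # initial residual
    for k in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Ar = A @ r
        t = (r @ r) / (r @ Ar)         # optimal step length (11.18)
        x = x + t * r                  # update (11.19)
        r = r - t * Ar                 # residual update (11.20)
    return x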

11.3 The Conjugate Gradient Method


The steepest descent method uses an optimal search direction locally but not globally and as a result it generally converges very slowly to the minimizer. A key strategy to accelerate convergence in line search methods is to widen our search space by considering the previous search directions, not just the current one. Obviously, we would like the v^{(k)}'s to be linearly independent.
Recall that x(k) − x(0) ∈ span{v (0) , v (1) , . . . , v (k−1) }. We are going to
denote

Vk = span{v (0) , v (1) , . . . , v (k−1) } (11.21)

and write x ∈ x(0) + Vk to mean that x = x(0) + v with v ∈ Vk .



The idea is to select v^{(0)}, v^{(1)}, . . . , v^{(k−1)} such that

x^{(k)} = argmin_{x ∈ x^{(0)}+V_k} ‖x − x̄‖²_A.    (11.22)

If the search directions are linearly independent, as k increases our search


space grows so the minimizer would be found in at most n steps, when
Vn = Rn .
Let us derive a condition for the minimizer of J(x) = (1/2)‖x − x̄‖²_A in x^{(0)} + V_k. Suppose x^{(k)} ∈ x^{(0)} + V_k. Then, there are scalars c_0, c_1, . . . , c_{k−1} such that

x^{(k)} = x^{(0)} + c_0 v^{(0)} + c_1 v^{(1)} + · · · + c_{k−1} v^{(k−1)}.    (11.23)

For fixed v (0) , v (1) , . . . , v (k−1) , define the following function of c0 , c1 , . . . , ck−1

G(c_0, c_1, ..., c_{k−1}) := J( x^{(0)} + c_0 v^{(0)} + c_1 v^{(1)} + · · · + c_{k−1} v^{(k−1)} ).    (11.24)

Because J is a quadratic function, the minimizer of G is the critical point c*_0, c*_1, ..., c*_{k−1} given by

∂G/∂c_j (c*_0, c*_1, ..., c*_{k−1}) = 0,   j = 0, . . . , k − 1.    (11.25)

But by the Chain Rule

0 = ∂G/∂c_j = ∇J · v^{(j)} = −⟨r^{(k)}, v^{(j)}⟩,   j = 0, 1, . . . , k − 1.    (11.26)

We have proved the following theorem.


Theorem 11.1. The vector x^{(k)} ∈ x^{(0)} + V_k minimizes ‖x − x̄‖²_A over x^{(0)} + V_k,
for k = 0, 1, . . . if and only if

hr(k) , v (j) i = 0, j = 0, 1, . . . , k − 1. (11.27)

That is, the residual r(k) = b − Ax(k) is orthogonal to all the search directions
v (0) , . . . , v (k−1) .
Let us go back to one step of a line search method, x(k+1) = x(k) + tk v (k) ,
where tk is given by the one-dimensional minimizer (11.16). As we have
done in the Steepest Descent method, we find that the corresponding residual

satisfies r(k+1) = r(k) −tk Av (k) . Starting with an initial guess x(0) , we compute
r(0) = b − Ax(0) and take v (0) = r(0) . Then,

x^{(1)} = x^{(0)} + t_0 v^{(0)},    (11.28)
r^{(1)} = r^{(0)} − t_0 Av^{(0)},    (11.29)

and

hr(1) , v (0) i = hr(0) , v (0) i − t0 hv (0) , Av (0) i = 0 (11.30)

where the last equality follows from the definition (11.16) of t0 . Now,

r(2) = r(1) − t1 Av (1) (11.31)

and consequently

hr(2) , v (0) i = hr(1) , v (0) i − t1 hv (0) , Av (1) i = −t1 hv (0) , Av (1) i. (11.32)

Thus if

hv (0) , Av (1) i = 0 (11.33)

then hr(2) , v (0) i = 0. Moreover, r(2) = r(1) − t1 Av (1) from which it follows that

hr(2) , v (1) i = hr(1) , v (1) i − t1 hv (1) , Av (1) i = 0, (11.34)

where in the last equality we have used the definition of t1 , (11.16). Thus, if
condition (11.33) holds we can guarantee that hr(1) , v (0) i = 0 and hr(2) , v (j) i =
0, j = 0, 1, i.e. we satisfy the conditions of Theorem 11.1 for k = 1, 2.

Definition 11.1. Let A be an n×n matrix. We say that two vectors x, y ∈ Rn


are conjugate with respect to A if

hx, Ayi = 0. (11.35)

We can now proceed by induction to prove the following theorem.

Theorem 11.2. Suppose v (0) , ..., v (k−1) are conjugate with respect to A, then
for k = 1, 2, . . .
hr(k) , v (j) i = 0, j = 0, 1, . . . , k − 1.

Proof. Let us do induction. We know the statement is true for k = 1.


Suppose
hr(k−1) , v (j) i = 0, j = 0, 1, ...., k − 2. (11.36)

Recall that r(k) = r(k−1) − tk−1 Av (k−1) and so


hr(k) , v (k−1) i = hr(k−1) , v (k−1) i − tk−1 hv (k−1) , Av (k−1) i = 0 (11.37)
because of the choice (11.16) of tk−1 . Now, for j = 0, 1, . . . , k − 2
hr(k) , v (j) i = hr(k−1) , v (j) i − tk−1 hv (j) , Av (k−1) i = 0, (11.38)
where the first term is zero because of the induction hypothesis and the
second term is zero because the search directions are conjugate.
Combining Theorems 11.1 and 11.2 we get the following important con-
clusion.
Theorem 11.3. If the search directions, v (0) , v (1) , . . . , v (k−1) are conjugate
(with respect to A) then x(k) = x(k−1) +tk−1 v (k−1) is the minimizer of kx− x̄k2A
over x(0) + Vk .

11.3.1 Generating the Conjugate Search Directions


The conjugate gradient method, due to Hestenes and Stiefel, is an ingenious
approach to generating efficiently the set of conjugate search directions. The
idea is to modify the negative gradient direction, r(k) , by adding information
about the previous search direction, v (k−1) . Specifically, we start with
v (k) = r(k) + sk v (k−1) , (11.39)

where the scalar s_k is chosen so that v^{(k)} is conjugate to v^{(k−1)} with respect to A, i.e.

0 = ⟨v^{(k)}, Av^{(k−1)}⟩ = ⟨r^{(k)}, Av^{(k−1)}⟩ + s_k ⟨v^{(k−1)}, Av^{(k−1)}⟩,    (11.40)

which gives

s_k = − ⟨r^{(k)}, Av^{(k−1)}⟩ / ⟨v^{(k−1)}, Av^{(k−1)}⟩.    (11.41)
Magically this simple construction renders all the search directions conjugate
and the residuals orthogonal!

Theorem 11.4.

(a) ⟨r^{(i)}, r^{(j)}⟩ = 0,  i ≠ j.

(b) ⟨v^{(i)}, Av^{(j)}⟩ = 0,  i ≠ j.

Proof. By the choice of t_k and s_k it follows that

⟨r^{(k+1)}, r^{(k)}⟩ = 0,    (11.42)
⟨v^{(k+1)}, Av^{(k)}⟩ = 0,    (11.43)

for k = 0, 1, . . . Let us now proceed by induction. We know ⟨r^{(1)}, r^{(0)}⟩ = 0 and ⟨v^{(1)}, Av^{(0)}⟩ = 0. Suppose ⟨r^{(i)}, r^{(j)}⟩ = 0 and ⟨v^{(i)}, Av^{(j)}⟩ = 0 hold for 0 ≤ j < i ≤ k. We need to prove that this also holds for 0 ≤ j < i ≤ k + 1. In view of (11.42) and (11.43) we can assume j < k. Now,

⟨r^{(k+1)}, r^{(j)}⟩ = ⟨r^{(k)} − t_k Av^{(k)}, r^{(j)}⟩
                   = ⟨r^{(k)}, r^{(j)}⟩ − t_k ⟨r^{(j)}, Av^{(k)}⟩        (11.44)
                   = −t_k ⟨r^{(j)}, Av^{(k)}⟩,

where we have used the induction hypothesis on the orthogonality of the residuals for the last equality. But v^{(j)} = r^{(j)} + s_j v^{(j−1)} and so r^{(j)} = v^{(j)} − s_j v^{(j−1)}. Thus,

⟨r^{(k+1)}, r^{(j)}⟩ = −t_k ⟨v^{(j)} − s_j v^{(j−1)}, Av^{(k)}⟩
                   = −t_k ⟨v^{(j)}, Av^{(k)}⟩ + t_k s_j ⟨v^{(j−1)}, Av^{(k)}⟩ = 0.        (11.45)

Also, for j < k,

⟨v^{(k+1)}, Av^{(j)}⟩ = ⟨r^{(k+1)} + s_{k+1} v^{(k)}, Av^{(j)}⟩
                    = ⟨r^{(k+1)}, Av^{(j)}⟩ + s_{k+1} ⟨v^{(k)}, Av^{(j)}⟩
                    = ⟨r^{(k+1)}, (1/t_j)(r^{(j)} − r^{(j+1)})⟩        (11.46)
                    = (1/t_j) ⟨r^{(k+1)}, r^{(j)}⟩ − (1/t_j) ⟨r^{(k+1)}, r^{(j+1)}⟩ = 0,

where ⟨v^{(k)}, Av^{(j)}⟩ = 0 by the induction hypothesis and t_j Av^{(j)} = r^{(j)} − r^{(j+1)}.

The conjugate gradient method is completely specified by (11.11), (11.16),


(11.39), (11.41). We are now going to do some algebra to get computationally
better formulas for tk and sk .
Recall that

t_k = ⟨v^{(k)}, r^{(k)}⟩ / ⟨v^{(k)}, Av^{(k)}⟩.

Now,

⟨v^{(k)}, r^{(k)}⟩ = ⟨r^{(k)} + s_k v^{(k−1)}, r^{(k)}⟩ = ⟨r^{(k)}, r^{(k)}⟩ + s_k ⟨v^{(k−1)}, r^{(k)}⟩ = ⟨r^{(k)}, r^{(k)}⟩.    (11.47)

Therefore

t_k = ⟨r^{(k)}, r^{(k)}⟩ / ⟨v^{(k)}, Av^{(k)}⟩.    (11.48)
Let us now work with the numerator of s_{k+1}, the inner product ⟨r^{(k+1)}, Av^{(k)}⟩. First recall that r^{(k+1)} = r^{(k)} − t_k Av^{(k)} and so t_k Av^{(k)} = r^{(k)} − r^{(k+1)}. Therefore,

−⟨r^{(k+1)}, Av^{(k)}⟩ = (1/t_k) ⟨r^{(k+1)}, r^{(k+1)} − r^{(k)}⟩ = (1/t_k) ⟨r^{(k+1)}, r^{(k+1)}⟩.    (11.49)
And for the denominator, we have

⟨v^{(k)}, Av^{(k)}⟩ = (1/t_k) ⟨v^{(k)}, r^{(k)} − r^{(k+1)}⟩
                  = (1/t_k) ⟨v^{(k)}, r^{(k)}⟩ − (1/t_k) ⟨v^{(k)}, r^{(k+1)}⟩
                  = (1/t_k) ⟨r^{(k)} + s_k v^{(k−1)}, r^{(k)}⟩        (11.50)
                  = (1/t_k) ⟨r^{(k)}, r^{(k)}⟩ + (s_k/t_k) ⟨v^{(k−1)}, r^{(k)}⟩ = (1/t_k) ⟨r^{(k)}, r^{(k)}⟩.
Thus, we can write

s_{k+1} = ⟨r^{(k+1)}, r^{(k+1)}⟩ / ⟨r^{(k)}, r^{(k)}⟩.    (11.51)
A pseudo-code for the conjugate gradient method is given in Algorithm 11.1. The main cost per iteration of the conjugate gradient method is the evaluation of Av^{(k)}. If A is sparse, then this product of a matrix and a vector can be done cheaply by avoiding operations with the zeros of A. For example, for the matrix in the solution of Poisson's equation in 2D (10.76), the cost of computing Av^{(k)} is just O(n), where n = N² is the total number of unknowns.

Algorithm 11.1 The Conjugate Gradient Method
1: Given x^{(0)} and TOL, set r^{(0)} = b − Ax^{(0)}, v^{(0)} = r^{(0)}, and k = 0.
2: while ‖r^{(k)}‖_2 > TOL do
3:     t_k ← ⟨r^{(k)}, r^{(k)}⟩ / ⟨v^{(k)}, Av^{(k)}⟩
4:     x^{(k+1)} ← x^{(k)} + t_k v^{(k)}
5:     r^{(k+1)} ← r^{(k)} − t_k Av^{(k)}
6:     s_{k+1} ← ⟨r^{(k+1)}, r^{(k+1)}⟩ / ⟨r^{(k)}, r^{(k)}⟩
7:     v^{(k+1)} ← r^{(k+1)} + s_{k+1} v^{(k)}
8:     k ← k + 1
9: end while
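A direct Python transcription of Algorithm 11.1; the matrix only enters through the product Av^{(k)}, which is where a sparse representation of A would be exploited. The function name and tolerance are illustrative.

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-8, max_iter=None):
    """Conjugate gradient method (Algorithm 11.1) for SPD A."""
    n = len(b)
    max_iter = max_iter or n
    x = x0.copy()
    r = b - A @ x
    v = r.copy()
    rr = r @ r
    for k in range(max_iter):
        if np.sqrt(rr) <= tol:
            break
        Av = A @ v
        t = rr / (v @ Av)              # step length, line 3
        x = x + t * v                  # line 4
        r = r - t * Av                 # line 5
        rr_new = r @ r
        s = rr_new / rr                # line 6
        v = r + s * v                  # line 7
        rr = rr_new
    return x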

Theorem 11.5. Let A be an n × n symmetric positive definite matrix, then


the conjugate gradient method converges to the exact solution (assuming no
round-off errors) of Ax = b in at most n steps .

Proof. By Theorem 11.4, the nonzero residuals are orthogonal and hence linearly independent. If the method has not terminated earlier, after n steps r^{(n)} is orthogonal to r^{(0)}, r^{(1)}, . . . , r^{(n−1)}, which span Rn. Since the dimension of the space is n, r^{(n)} has to be the zero vector.

11.4 Krylov Subspaces


In the conjugate gradient method we start with an initial guess x(0) , compute
the residual r(0) = b−Ax(0) and set v (0) = r(0) . We then get x(1) = x(0) +t0 r(0)

and evaluate the residual r^{(1)}, etc. If we use the definition of the residual we have

r^{(1)} = b − Ax^{(1)} = b − Ax^{(0)} − t_0 Ar^{(0)} = r^{(0)} − t_0 Ar^{(0)},    (11.52)

so that r^{(1)} is a linear combination of r^{(0)} and Ar^{(0)}. Similarly,

x^{(2)} = x^{(1)} + t_1 v^{(1)}
       = x^{(0)} + t_0 r^{(0)} + t_1 r^{(1)} + t_1 s_1 r^{(0)}        (11.53)
       = x^{(0)} + (t_0 + t_1 s_1) r^{(0)} + t_1 r^{(1)},

so that r(2) = b − Ax(2) is a linear combination of r(0) , Ar(0) , and A2 r(0) and
so on.
Definition 11.2. The set Kk (r(0) , A) = span{r(0) , Ar(0) , ..., Ak−1 r(0) } is called
the Krylov subspace of degree k for r(0) .
Krylov subspaces are central to an important class of numerical methods
that rely on getting approximations through matrix-vector multiplication like
the conjugate gradient method.
The following theorem provides a reinterpretation of the conjugate gradient method: the approximation x^{(k)} is the minimizer of ‖x − x̄‖²_A over x^{(0)} + K_k(r^{(0)}, A).
Theorem 11.6. Kk (r(0) , A) = span{r(0) , ..., r(k−1) } = span{v (0) , ..., v (k−1) }.
Proof. We will prove it by induction. The case k = 1 holds by construction. Let us now assume that the statement holds for k and prove that it also holds for k + 1. By the induction hypothesis r^{(k−1)}, v^{(k−1)} ∈ K_k(r^{(0)}, A), then

Av (k−1) ∈ span{Ar(0) , ..., Ak r(0) }

but r(k) = r(k−1) − tk−1 Av (k−1) and so

r(k) ∈ Kk+1 (r(0) , A).

Consequently,
span{r(0) , ..., r(k) } ⊆ Kk+1 (r(0) , A).
We now prove the reverse inclusion,

span{r(0) , ..., r(k) } ⊇ Kk+1 (r(0) , A).



Note that Ak r(0) = A(Ak−1 r(0) ). But by the induction hypothesis

span{r(0) , Ar(0) , ..., Ak−1 r(0) } = span{v (0) , ..., v (k−1) }.

Given that

Ak r(0) = A(Ak−1 r(0) ) ∈ span{Av (0) , ..., Av (k−1) }

and since

Av^{(j)} = (1/t_j)(r^{(j)} − r^{(j+1)}),
it follows that
Ak r(0) ∈ span{r(0) , r(1) , ..., r(k) }.
Thus,
span{r(0) , ..., r(k) } = Kk+1 (r(0) , A).
For the last equality we observe that span{v^{(0)}, ..., v^{(k)}} = span{v^{(0)}, ..., v^{(k−1)}, r^{(k)}} because v^{(k)} = r^{(k)} + s_k v^{(k−1)}, and by the induction hypothesis

span{v^{(0)}, ..., v^{(k−1)}, r^{(k)}} = span{r^{(0)}, Ar^{(0)}, ..., A^{k−1} r^{(0)}, r^{(k)}}
                                    = span{r^{(0)}, r^{(1)}, ..., r^{(k−1)}, r^{(k)}}        (11.54)
                                    = K_{k+1}(r^{(0)}, A).

11.5 Convergence of the Conjugate Gradient


Method
Let us define the initial error as e(0) = x(0) − x̄. Then Ae(0) = Ax(0) − Ax̄
implies that

r(0) = −Ae(0) . (11.55)

For the conjugate gradient method x(k) ∈ x(0) + Kk (r(0) , A) and in view of
(11.55) we have that

x(k) − x̄ = e(0) + c1 Ae(0) + c2 A2 e(0) + · · · + ck Ak e(0) , (11.56)



for some real constants c_1, . . . , c_k. In fact,

‖x^{(k)} − x̄‖_A = min_{p∈P̃_k} ‖p(A)e^{(0)}‖_A,    (11.57)

where P̃_k is the set of all polynomials of degree ≤ k that are equal to one at 0. Since A is symmetric positive definite, all its eigenvalues are real and positive. Let's order them as 0 < λ_1 ≤ λ_2 ≤ . . . ≤ λ_n, with associated orthonormal eigenvectors v_1, v_2, . . . , v_n. Then, we can write e^{(0)} = α_1 v_1 + · · · + α_n v_n for some scalars α_1, . . . , α_n and

p(A)e^{(0)} = Σ_{j=1}^{n} p(λ_j) α_j v_j.    (11.58)

Therefore,

‖p(A)e^{(0)}‖²_A = ⟨p(A)e^{(0)}, Ap(A)e^{(0)}⟩ = Σ_{j=1}^{n} p²(λ_j) λ_j α_j² ≤ ( max_j p²(λ_j) ) Σ_{j=1}^{n} λ_j α_j²    (11.59)

and since

‖e^{(0)}‖²_A = Σ_{j=1}^{n} λ_j α_j²    (11.60)

we get

‖e^{(k)}‖_A ≤ min_{p∈P̃_k} max_j |p(λ_j)| ‖e^{(0)}‖_A.    (11.61)

The min-max term can be estimated using the Chebyshev polynomial T_k with the change of variables

f(λ) = (2λ − λ_1 − λ_n) / (λ_n − λ_1)    (11.62)

to map [λ_1, λ_n] to [−1, 1]. The polynomial

p(λ) = T_k(f(λ)) / T_k(f(0))    (11.63)

is in P̃_k and, since |T_k(f(λ))| ≤ 1 on [λ_1, λ_n],

max_j |p(λ_j)| ≤ 1 / |T_k(f(0))|.    (11.64)

Now

|T_k(f(0))| = T_k( (λ_1 + λ_n)/(λ_n − λ_1) ) = T_k( (λ_n/λ_1 + 1)/(λ_n/λ_1 − 1) ) = T_k( (κ_2(A) + 1)/(κ_2(A) − 1) ),    (11.65)

where κ_2(A) = λ_n/λ_1 is the condition number of A in the 2-norm. Now we use an identity of Chebyshev polynomials, namely if x = (z + 1/z)/2 then T_k(x) = (z^k + 1/z^k)/2. Noting that

(κ_2(A) + 1)/(κ_2(A) − 1) = (1/2)(z + 1/z)    (11.66)

for

z = (√κ_2(A) + 1)/(√κ_2(A) − 1),    (11.67)

we obtain

[ T_k( (κ_2(A) + 1)/(κ_2(A) − 1) ) ]^{−1} ≤ 2 ( (√κ_2(A) − 1)/(√κ_2(A) + 1) )^k.    (11.68)

Thus we get the following upper bound for the error in the conjugate gradient method:

‖e^{(k)}‖_A ≤ 2 ( (√κ_2(A) − 1)/(√κ_2(A) + 1) )^k ‖e^{(0)}‖_A.    (11.69)
Chapter 12

Eigenvalue Problems

In this chapter we take a brief look at some numerical methods for the stan-
dard eigenvalue problem, i.e. of finding eigenvalues λ and eigenvectors v of
an n × n matrix A.

12.1 The Power Method


Suppose that A has a dominant eigenvalue:

|λ_1| > |λ_2| ≥ · · · ≥ |λ_n|    (12.1)

and a complete set of eigenvectors v_1, . . . , v_n associated to λ_1, . . . , λ_n, respectively. Then, each vector v ∈ Rn can be written in terms of the eigenvectors

tively. Then, each vector v ∈ Rn can be written in terms of the eigenvectors
as

v = c1 v1 + · · · + cn vn (12.2)

and

A^k v = c_1 λ_1^k v_1 + · · · + c_n λ_n^k v_n = c_1 λ_1^k [ v_1 + Σ_{j=2}^{n} (c_j/c_1)(λ_j/λ_1)^k v_j ].    (12.3)

Therefore, if c_1 ≠ 0, A^k v / (c_1 λ_1^k) → v_1 and we get a method to determine v_1 and λ_1. To avoid overflow we normalize the approximating vector at each iteration, as Algorithm 12.1 shows. The rate of convergence of the power method is determined by the ratio λ_2/λ_1. From (12.3), |λ^{(k)} − λ_1| = O(|λ_2/λ_1|^k), where λ^{(k)} is the approximation to λ_1 at the k-th iteration.

197

Algorithm 12.1 The Power Method


1: Set k = 0, krk2 >> 1.
2: while krk2 > T OL do
3: v ← Av
4: v ← v/kvk2
5: λ ← v T Av
6: r = Av − λv
7: k ←k+1
8: end while

The power method is useful and efficient for computing the dominant
eigenpair λ1 , v1 when A is sparse, so that the evaluation of Av is economical,
and when |λ2 /λ1 | << 1.
One can use shifts in the matrix A to decrease |λ_2/λ_1| and improve convergence. We apply the power method with the shifted matrix A − sI, where the shift s is chosen to accelerate convergence. For example, suppose A is symmetric and has eigenvalues 100, 90, 50, 40, 30, 30. The matrix A − 60I has eigenvalues 40, 30, −10, −20, −30, −30 and the power method would converge at a rate of 30/40 = 0.75 instead of a rate of 90/100 = 0.9.
A variant of the shifted power method is the inverse power method, which applies the iteration to the matrix (A − sI)^{−1}. The inverse is not actually computed; instead, the linear system (A − sI)v^{(k)} = v^{(k−1)} is solved at every iteration. The method converges to the eigenvalue λ_j for which |λ_j − s| is the smallest, and so with an appropriate choice of s it is possible to converge to each of the eigenpairs of A.
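A Python sketch of Algorithm 12.1 and of the inverse power iteration with a shift s. The names are illustrative, and the shifted solve is done with a dense factorization only for brevity; a sparse solver would be used in practice.

import numpy as np

def power_method(A, v0, tol=1e-10, max_iter=1000):
    """Power method (Algorithm 12.1) for the dominant eigenpair."""
    v = v0 / np.linalg.norm(v0)
    lam = v @ (A @ v)
    for k in range(max_iter):
        v = A @ v
        v = v / np.linalg.norm(v)
        lam = v @ (A @ v)                            # Rayleigh quotient estimate
        if np.linalg.norm(A @ v - lam * v) <= tol:
            break
    return lam, v

def inverse_power_method(A, v0, s=0.0, tol=1e-10, max_iter=1000):
    """Inverse power method: converges to the eigenvalue of A closest to the shift s."""
    n = A.shape[0]
    v = v0 / np.linalg.norm(v0)
    lam = s
    for k in range(max_iter):
        v = np.linalg.solve(A - s * np.eye(n), v)    # solve instead of forming the inverse
        v = v / np.linalg.norm(v)
        lam = v @ (A @ v)
        if np.linalg.norm(A @ v - lam * v) <= tol:
            break
    return lam, v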

12.2 Methods Based on Similarity Transfor-


mations
For a general n × n matrix, numerical methods for eigenvalues are typically
based on a sequence of similarity transformations

A_k = P_k^{−1} A P_k,   k = 1, 2, . . .    (12.4)

to attempt to converge to either a diagonal matrix or to an upper triangular


matrix.

12.2.1 The QR method


The most successful numerical method for the eigenvalue problem of a general
square matrix A is the QR method. It is based on the QR factorization of
a matrix. Here Q is a unitary (orthogonal in the real case) matrix and R is
upper triangular.
Given an n × n matrix A, we set A1 = A, obtain its QR factorization

A1 = Q1 R1 (12.5)

and define A2 = R1 Q1 so that

A2 = R1 Q1 = Q∗1 AQ1 , (12.6)

etc. The k + 1-st similar matrix is generated by

Ak+1 = Rk Qk = Q∗k Ak Qk = (Q1 · · · Qk )∗ A(Q1 · · · Qk ). (12.7)

It can be proved that if A is diagonalizable with eigenvalues that are distinct in modulus, then the sequence of matrices A_k, k = 1, 2, . . . produced by the QR method converges to an upper triangular matrix (diagonal when A is symmetric) with the eigenvalues of A on the diagonal. There is no convergence proof for a general matrix A but the
the diagonal. There is no convergence proof for a general matrix A but the
method is remarkably robust and fast to converge.
The QR factorization is expensive, O(n3 ) operations, so the QR method
is usually applied only after the original matrix A has been reduced to a
tridiagonal matrix if A is symmetric or to an upper Hessenberg form if A
is not symmetric. This reduction is done with a sequence of orthogonal
transformations known as Householder reflections.

Example 12.1. Consider the 5 × 5 matrix

    [ 12 13 10  7  7 ]
    [ 13 18  9  8 15 ]
A = [ 10  9 10  4 12 ]      (12.8)
    [  7  8  4  4  6 ]
    [  7 15 12  6 18 ]

The matrix A_20 = R_19 Q_19 produced by the QR method gives the eigenvalues of A within 4 digits of accuracy:

       [ 51.7281  0.0000  0.0000   0.0000  0.0000 ]
       [  0.0000  8.2771  0.0000   0.0000  0.0000 ]
A_20 = [  0.0000  0.0000  4.6405   0.0000  0.0000 ]      (12.9)
       [  0.0000  0.0000  0.0000  −2.8486  0.0000 ]
       [  0.0000  0.0000  0.0000   0.0000  0.2028 ]
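A bare-bones Python sketch of the unshifted QR iteration using NumPy's QR factorization; practical implementations first reduce A to tridiagonal or Hessenberg form and use shifts, which is not done here. The function name is illustrative.

import numpy as np

def qr_iteration(A, num_iter=19):
    """Unshifted QR iteration: A_{k+1} = R_k Q_k, similar to A."""
    Ak = A.copy().astype(float)
    for k in range(num_iter):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                      # similarity transform Q^T A_k Q
    return Ak                           # approximate eigenvalues on the diagonal

A = np.array([[12., 13., 10.,  7.,  7.],
              [13., 18.,  9.,  8., 15.],
              [10.,  9., 10.,  4., 12.],
              [ 7.,  8.,  4.,  4.,  6.],
              [ 7., 15., 12.,  6., 18.]])
print(np.diag(qr_iteration(A)))         # compare with the diagonal of (12.9)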
Chapter 13

Non-Linear Equations

13.1 Introduction
In this chapter we consider the problem of finding zeros of a continuous function f, i.e. solving f(x) = 0 (for example, e^{−x} − x = 0), or of solving a system of nonlinear equations:
f1 (x1 , x2 , · · · , xn ) = 0,
f2 (x1 , x2 , · · · , xn ) = 0,
.. (13.1)
.
fn (x1 , x2 , · · · , xn ) = 0.
We are going to write this generic system in vector form as
f (x) = 0, (13.2)
where f : U ⊆ Rn → Rn . Unless otherwise noted the function f is assumed
to be smooth in its domain U .
We are going to start with the scalar case, n = 1, and look at a very simple but robust method that relies only on the continuity of the function and the existence of a zero.

13.2 Bisection
Suppose we are interested in solving a nonlinear equation in one unknown
f (x) = 0, (13.3)


where f is a continuous function on an interval [a, b] and has at least one


zero there.
Suppose that f has values of different sign at the end points of the interval,
i.e.
f (a)f (b) < 0. (13.4)
By the Intermediate Value Theorem, f has at least one zero in (a, b). To
locate a zero we bisect the interval and check on which subinterval f changes
sign. We repeat the process until we bracket a zero within a desired accuracy.
The Bisection algorithm to find a zero x∗ is shown below.
Algorithm 13.1 The Bisection Method
1: Given f , a and b (a < b), T OL, and Nmax , set k = 1 and do:
2: while (b − a) > T OL and k ≤ Nmax do
3: c = (a + b)/2
4: if f (c) == 0 then
5: x∗ = c . This is the solution
6: stop
7: end if
8: if sign(f (c)) == sign(f (a)) then
9: a←c
10: else
11: b←c
12: end if
13: k ←k+1
14: end while
15: x∗ ← (a + b)/2
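A Python version of Algorithm 13.1; the function name and tolerance are illustrative, and the example at the end solves x − e^{−x} = 0, the equation used again in Example 13.2.

import numpy as np

def bisection(f, a, b, tol=1e-10, max_iter=100):
    """Bisection method (Algorithm 13.1); assumes f(a)*f(b) < 0."""
    fa = f(a)
    for k in range(max_iter):
        if (b - a) <= tol:
            break
        c = 0.5 * (a + b)
        fc = f(c)
        if fc == 0.0:
            return c
        if np.sign(fc) == np.sign(fa):
            a, fa = c, fc              # the zero lies in [c, b]
        else:
            b = c                      # the zero lies in [a, c]
    return 0.5 * (a + b)

print(bisection(lambda x: x - np.exp(-x), 0.0, 1.0))   # approx 0.567143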

13.2.1 Convergence of the Bisection Method


With the bisection method we generate a sequence
c_k = (a_k + b_k)/2,   k = 1, 2, . . .    (13.5)

where a_k and b_k are the endpoints of the subinterval we select at each bisection step (because f changes sign there). Since

b_k − a_k = (b − a)/2^{k−1},   k = 1, 2, . . .    (13.6)

and c_k = (a_k + b_k)/2 is the midpoint of the interval, then

|c_k − x*| ≤ (1/2)(b_k − a_k) = (b − a)/2^k    (13.7)
and consequently ck → x∗ , a zero of f in [a, b].

13.3 Rate of Convergence


We now define in precise terms the rate of convergence of a sequence of
approximations to a value x∗ .

Definition 13.1. Suppose a sequence {x_n}_{n=1}^{∞} converges to x* as n → ∞. We say that x_n → x* with order p (p ≥ 1) if there is a positive integer N and a constant C such that

|xn+1 − x∗ | ≤ C |xn − x∗ |p , for all n ≥ N . (13.8)

or equivalently

lim_{n→∞} |x_{n+1} − x*| / |x_n − x*|^p = C,    (13.9)

where C < 1 is required for p = 1.

Example 13.1. The sequence generated by the bisection method converges linearly to x* because

|c_{n+1} − x*| / |c_n − x*| ≤ ( (b − a)/2^{n+1} ) / ( (b − a)/2^n ) = 1/2.

Let’s examine the significance of the rate of convergence. Consider first,


p = 1, linear convergence. Suppose

|xn+1 − x∗ | ≈ C|xn − x∗ |, n ≥ N. (13.10)

Then

|xN +1 − x∗ | ≈ C|xN − x∗ |,
|xN +2 − x∗ | ≈ C|xN +1 − x∗ | ≈ C(C|xN − x∗ |) = C 2 |xN − x∗ |.

Continuing this way we get

|xN +k − x∗ | ≈ C k |xN − x∗ |, k = 0, 1, . . . (13.11)

and this is the reason of the requirement C < 1 for p = 1. If the error at the
N step, |xN − x∗ |, is small enough it will be reduced by a factor of C k after
k more steps. Setting C^k = 10^{−d_k}, the error |x_N − x*| will be reduced by approximately

d_k = log_10(1/C^k) = k log_10(1/C)    (13.12)

digits.
Let us now do a similar analysis for p = 2, quadratic convergence. We
have

|xN +1 − x∗ | ≈ C|xN − x∗ |2 ,
|xN +2 − x∗ | ≈ C|xN +1 − x∗ |2 ≈ C(C|xN − x∗ |2 )2 = C 3 |xN − x∗ |4 ,
|xN +3 − x∗ | ≈ C|xN +2 − x∗ |2 ≈ C(C 3 |xN − x∗ |4 )2 = C 7 |xN − x∗ |8 .

It is easy to prove by induction that

|x_{N+k} − x*| ≈ C^{2^k − 1} |x_N − x*|^{2^k},   k = 0, 1, . . .    (13.13)

To see how many digits of accuracy we gain in k steps beginning from x_N, we write C^{2^k − 1} |x_N − x*|^{2^k} = 10^{−d_k} |x_N − x*|, and solving for d_k we get

d_k = ( log_10(1/C) + log_10(1/|x_N − x*|) ) (2^k − 1).    (13.14)

It is not difficult to prove that for general p > 1 and as k → ∞ we get d_k ≈ α_p p^k, where α_p = (1/(p − 1)) log_10(1/C) + log_10(1/|x_N − x*|).

13.4 Interpolation-Based Methods


Assuming again that f is a continuous function on [a, b] with f(a)f(b) < 0, we can proceed as in the bisection method but, instead of using the midpoint c = (a + b)/2 to subdivide the interval in question, we could use the root of the linear polynomial interpolating (a, f(a)) and (b, f(b)). This is called the method
13.5. NEWTON’S METHOD 205

of false position. Unfortunately, this method only converges linearly and


under stronger assumptions than the Bisection Method.
An alternative approach to use interpolation to obtain numerical methods
for f (x) = 0 is to proceed as follows: Given m + 1 approximations to the zero
of f , x0 , . . . , xm , construct the interpolating polynomial of f , pm , at those
points, and set the root of pm closest to xm as the new approximation to the
zero of f . In practice, only m = 1, 2 are used. The method for m = 1 is
called the Secant method and we will look at it in some detail later. The
method for m = 2 is called Muller’s Method.

13.5 Newton’s Method


If the function f is smooth, say at least C 2 [a, b], and we have already a
good approximation x0 to a zero x∗ of f then the tangent line of f at x0 ,
y = f (x0 ) + f 0 (x0 )(x − x0 ) provides a good approximation to f in a small
neighborhood of x0 , i.e.

f (x) ≈ f (x0 ) + f 0 (x0 )(x − x0 ). (13.15)

Then we can define the next approximation as the zero of that tangent line,
i.e.

x_1 = x_0 − f(x_0)/f′(x_0),    (13.16)

etc. At the k-th iteration we get the new approximation x_{k+1} according to:

x_{k+1} = x_k − f(x_k)/f′(x_k),   k = 0, 1, . . .    (13.17)

This iteration is called Newton’s method or Newton-Raphson’s method. There


are some conditions for this method to work and converge, but when it does converge it does so at least quadratically. Indeed, a Taylor expansion of f around x_k gives

f(x) = f(x_k) + f′(x_k)(x − x_k) + (1/2) f″(ξ_k)(x − x_k)²,    (13.18)

where ξ_k is a point between x and x_k. Evaluating at x = x* and using that f(x*) = 0 we get

0 = f(x_k) + f′(x_k)(x* − x_k) + (1/2) f″(ξ_k)(x* − x_k)²,    (13.19)

which we can recast as

x* = x_k − f(x_k)/f′(x_k) − ( f″(ξ_k)/(2f′(x_k)) )(x* − x_k)² = x_{k+1} − ( f″(ξ_k)/(2f′(x_k)) )(x* − x_k)².    (13.20)

Thus,

|x_{k+1} − x*| = ( |f″(ξ_k)| / (2|f′(x_k)|) ) |x_k − x*|².    (13.21)

So if the sequence {x_k}_{k=0}^{∞} generated by Newton's method converges then it does so at least quadratically.
Theorem 13.1. Let x* be a simple zero of f (i.e. f(x*) = 0 and f′(x*) ≠ 0) and suppose f ∈ C². Then there is a neighborhood I_ε of x* such that Newton's method converges to x* for any initial guess in I_ε.
Proof. Since f′ is continuous and f′(x*) ≠ 0, we can choose ε > 0 sufficiently small so that f′(x) ≠ 0 for all x such that |x − x*| ≤ ε (this is I_ε) and that εM(ε) < 1, where

M(ε) = ( (1/2) max_{x∈I_ε} |f″(x)| ) / ( min_{x∈I_ε} |f′(x)| ).

This is possible because lim_{ε→0} M(ε) = |f″(x*)| / (2|f′(x*)|) < +∞.
The condition εM(ε) < 1 allows us to guarantee that x* is the only zero of f in I_ε, as we show now. A Taylor expansion of f around x* gives

f(x) = f(x*) + f′(x*)(x − x*) + (1/2) f″(ξ)(x − x*)²
     = f′(x*)(x − x*) [ 1 + (x − x*) f″(ξ)/(2f′(x*)) ],        (13.22)

and since

| (x − x*) f″(ξ)/(2f′(x*)) | = |x − x*| |f″(ξ)| / (2|f′(x*)|) ≤ εM(ε) < 1,    (13.23)

f(x) ≠ 0 for all x ∈ I_ε unless x = x*. We will now show that Newton's iteration is well defined starting from any initial guess x_0 ∈ I_ε. We prove this by induction. From (13.21) with k = 0 it follows that x_1 ∈ I_ε, as

|x_1 − x*| = |x_0 − x*|² |f″(ξ_0)| / (2|f′(x_0)|) ≤ ε² M(ε) ≤ ε.    (13.24)

Now assume that x_k ∈ I_ε; then again from (13.21)

|x_{k+1} − x*| = |x_k − x*|² |f″(ξ_k)| / (2|f′(x_k)|) ≤ ε² M(ε) < ε,    (13.25)

so x_{k+1} ∈ I_ε. Now,

|x_{k+1} − x*| ≤ |x_k − x*|² M(ε) ≤ |x_k − x*| εM(ε)
             ≤ |x_{k−1} − x*| (εM(ε))²
             ⋮
             ≤ |x_0 − x*| (εM(ε))^{k+1},

and since εM(ε) < 1 it follows that x_k → x* as k → ∞.

The need for a good initial guess x0 for Newton’s method should be
emphasized. In practice, this is obtained with another method, like bisection.
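A sketch of iteration (13.17) in Python, with the derivative supplied by the user; the names and the example equation x − e^{−x} = 0 (as in Example 13.2) are illustrative.

import numpy as np

def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method (13.17) for a scalar equation f(x) = 0."""
    x = x0
    for k in range(max_iter):
        fx = f(x)
        if abs(fx) <= tol:
            break
        x = x - fx / fprime(x)         # Newton update
    return x

# Solve x - e^{-x} = 0, as in Example 13.2
print(newton(lambda x: x - np.exp(-x), lambda x: 1 + np.exp(-x), 0.5))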

13.6 The Secant Method


Sometimes it could be computationally expensive or not possible to evaluate
the derivative of f . The following method, known as the secant method,
replaces the derivative by the secant:

x_{k+1} = x_k − f(x_k) / ( (f(x_k) − f(x_{k−1})) / (x_k − x_{k−1}) ),   k = 1, 2, . . .    (13.26)

Note that, since f(x*) = 0,

x_{k+1} − x* = x_k − x* − (f(x_k) − f(x*)) / ( (f(x_k) − f(x_{k−1})) / (x_k − x_{k−1}) )
             = x_k − x* − (f(x_k) − f(x*)) / f[x_k, x_{k−1}]
             = (x_k − x*) ( 1 − ( (f(x_k) − f(x*)) / (x_k − x*) ) / f[x_k, x_{k−1}] )
             = (x_k − x*) ( 1 − f[x_k, x*] / f[x_k, x_{k−1}] )
             = (x_k − x*) ( f[x_k, x_{k−1}] − f[x_k, x*] ) / f[x_k, x_{k−1}]
             = (x_k − x*)(x_{k−1} − x*) ( ( f[x_k, x_{k−1}] − f[x_k, x*] ) / (x_{k−1} − x*) ) / f[x_k, x_{k−1}]
             = (x_k − x*)(x_{k−1} − x*) f[x_{k−1}, x_k, x*] / f[x_k, x_{k−1}].

If x_k → x*, then f[x_{k−1}, x_k, x*] / f[x_k, x_{k−1}] → f″(x*) / (2f′(x*)) and lim_{k→∞} (x_{k+1} − x*)/(x_k − x*) = 0, i.e. the sequence generated by the secant method converges faster than linearly.
Defining ek = |xk − x∗ |, the calculation above suggests

ek+1 ≈ cek ek−1 . (13.27)

Let’s try to determine the rate of convergence of the secant method. Starting
1/p
with the ansatz ek ≈ Aepk−1 or equivalently ek−1 = A1 ek we have

  p1
1
ek+1 ≈ cek ek−1 ≈ cek ek ,
A

which implies
1
A1+ p 1−p+ p1
≈ ek . (13.28)
c

Since the left hand side is a constant, we must have 1 − p + 1/p = 0, which gives p = (1 ± √5)/2. Thus,

p = (1 + √5)/2 ≈ 1.61803    (13.29)
gives the rate of convergence of the secant method. It is better than linear,
but worse than quadratic. Sufficient conditions for local convergence are as
in Newton’s method.

13.7 Fixed Point Iteration


Newton’s method is a particular example of a functional iteration of the form

xk+1 = g(xk ), k = 0, 1, . . .

with the particular choice of g(x) = x − f(x)/f′(x). Clearly, if x* is a zero of f then
x∗ is a fixed point of g, i.e. g(x∗ ) = x∗ . We will look at fixed point iterations
as a tool for solving f (x) = 0.

Example 13.2. Suppose we want to solve x − e−x = 0 in [0, 1]. Then if we


take g(x) = e−x , a fixed point of g corresponds to a zero of f .

Definition 13.2. Let g be defined on an interval [a, b]. We say that g is a


contraction or a contractive map if there is a constant L with 0 ≤ L < 1
such that

|g(x) − g(y)| ≤ L|x − y|, for all x, y ∈ [a, b]. (13.30)

If x∗ is a fixed point of g in [a, b] then

|xk − x∗ |
= |g(xk−1 ) − g(x∗ )|
≤ L|xk−1 − x∗ |
≤ L2 |xk−2 − x∗ |
≤ ···
≤ Lk |x0 − x∗ | → 0, as k → ∞.

Theorem 13.2. If g is contraction on [a, b] and maps [a, b] into [a, b] then
g has a unique fixed point x∗ in [a, b] and the fixed point iteration converges
to it for any [a, b]. Moreover
(a)
L∗
|xk − x∗ | ≤ |x1 − x0 |
1−L
(b)
|xk − x∗ | ≤ Lk |x0 − x∗ |

Proof. We proved (b) already. Since g : [a, b] → [a, b], the fixed point
iteration xk+1 = g(xk ), k = 0, 1, ... is well-defined and

|xk+1 − xk | = |g(xk ) − g(xk−1 )|


≤ L|xk − xk−1 |
≤ ···
≤ Lk |x1 − x0 |.

Now, for n ≥ m

xn − xm = xn − xn−1 + xn−1 − xn−2 + . . . + xm+1 − xm (13.31)

and so

|x_n − x_m| ≤ |x_n − x_{n−1}| + |x_{n−1} − x_{n−2}| + . . . + |x_{m+1} − x_m|
           ≤ L^{n−1} |x_1 − x_0| + L^{n−2} |x_1 − x_0| + . . . + L^m |x_1 − x_0|
           ≤ L^m |x_1 − x_0| ( 1 + L + L² + . . . + L^{n−1−m} )
           ≤ L^m |x_1 − x_0| Σ_{j=0}^{∞} L^j = ( L^m / (1 − L) ) |x_1 − x_0|.

Thus, given ε > 0 there is N such that

( L^N / (1 − L) ) |x_1 − x_0| ≤ ε    (13.32)

and thus for n ≥ m ≥ N, |x_n − x_m| ≤ ε, i.e. {x_n}_{n=0}^{∞} is a Cauchy sequence in [a, b], so it converges to a point x* ∈ [a, b]. But

|xk − g(x∗ )| = |g(xk−1 ) − g(x∗ )| ≤ L|xk−1 − x∗ |, (13.33)



and so x_k → g(x*) as k → ∞, i.e. x* is a fixed point of g.
Suppose that there are two fixed points, x_1, x_2 ∈ [a, b]. Then |x_1 − x_2| = |g(x_1) − g(x_2)| ≤ L|x_1 − x_2|, so (1 − L)|x_1 − x_2| ≤ 0; but 0 ≤ L < 1, hence |x_1 − x_2| = 0 and x_1 = x_2, i.e. the fixed point is unique.

If g is differentiable in (a, b), then by the mean value theorem

g(x) − g(y) = g 0 (ξ)(x − y), for some ξ ∈ [a, b]

and if the derivative is bounded by a constant L less than 1, i.e. |g 0 (x)| ≤ L


for all x ∈ (a, b), then |g(x) − g(y)| ≤ L|x − y| with 0 ≤ L < 1, i.e. g is
contractive in [a, b].

Example 13.3. Let g(x) = (1/4)(x² + 3) for x ∈ [0, 1]. Then 0 ≤ g(x) ≤ 1 and |g′(x)| ≤ 1/2 for all x ∈ [0, 1]. So g is contractive in [0, 1] and the fixed point iteration will converge to the unique fixed point of g in [0, 1].
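A short Python sketch of the fixed point iteration applied to the map of Example 13.3; the function name and stopping test are illustrative.

def fixed_point(g, x0, tol=1e-12, max_iter=100):
    """Fixed point iteration x_{k+1} = g(x_k)."""
    x = x0
    for k in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) <= tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

x, its = fixed_point(lambda x: 0.25 * (x**2 + 3.0), 0.0)
print(x, its)   # converges (linearly) to the fixed point x* = 1 in [0, 1]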

Note that

xk+1 − x∗ = g(xk ) − g(x∗ )


= g 0 (ξk )(xk − x∗ ), for some ξk ∈ [xk , x∗ ].

Thus,

xk+1 − x∗
= g 0 (ξk ) (13.34)
xk − x∗

and unless g 0 (x∗ ) = 0, the fixed point iteration converges linearly, when it
does converge.

13.8 Systems of Nonlinear Equations


We now look at the problem of finding numerical approximation to the solu-
tion(s) of a nonlinear system of equations f (x) = 0, where f : U ⊆ Rn → Rn .
The main approach to solve a nonlinear system is fixed point iteration

xk+1 = G(xk ), k = 0, 1, . . . (13.35)

where we assume that G is defined on a closed set B ⊆ Rn and G : B → B.



The map G is a contraction (with respect to some norm,k · k) if there is


a constant L with 0 ≤ L < 1 and

kG(x) − G(y)k ≤ Lkx − yk, for all x, y ∈ B. (13.36)

Then, as we know, by the contraction map principle, G has a unique fixed


point and the sequence generated by the fixed point iteration (13.35) con-
verges to it.
Suppose that G is C 1 on some convex set B ⊆ Rn , for example a ball.
Consider the linear segment x + t(y − x) for t ∈ [0, 1] with x, y fixed in B.
Define the one-variable function

h(t) = G(x + t(y − x)). (13.37)

Then, by the Chain Rule, h0 (t) = DG(x + t(y − x))(y − x), where DG
stands for the derivative matrix of G. Then, using the definition of h and
the Fundamental Theorem of Calculus we have

G(y) − G(x) = h(1) − h(0) = ∫_0^1 h′(t) dt = ( ∫_0^1 DG(x + t(y − x)) dt ) (y − x).    (13.38)

Thus, if there is 0 ≤ L < 1 such that

‖DG(x)‖ ≤ L,   for all x ∈ B,    (13.39)

for some subordinate norm ‖ · ‖, then

‖G(y) − G(x)‖ ≤ L‖y − x‖    (13.40)

and G is a contraction (in that norm). The spectral radius of DG, ρ(DG), will determine the rate of convergence of the corresponding fixed point iteration.

13.8.1 Newton’s Method


By Taylor theorem

f (x) ≈ f (x0 ) + Df (x0 )(x − x0 ) (13.41)



so if we take x_1 as the zero of the right hand side of (13.41) we get

x1 = x0 − [Df (x0 )]−1 f (x0 ). (13.42)

Continuing this way, Newton’s method for the system of equations f (x) = 0
can be written as

xk+1 = xk − [Df (xk )]−1 f (xk ). (13.43)

In the implementation of Newton’s method for a system of equations we


solve the linear system Df (xk )w = −f (xk ) at each iteration and update
xk+1 = xk + w.
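A Python sketch of (13.43) in which the linear system Df(x_k) w = −f(x_k) is solved at each step; the illustrative test system (x² + y² = 1, y = x³) and the names are not from the text.

import numpy as np

def newton_system(f, Df, x0, tol=1e-12, max_iter=50):
    """Newton's method (13.43) for f(x) = 0, f: R^n -> R^n."""
    x = x0.copy()
    for k in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) <= tol:
            break
        w = np.linalg.solve(Df(x), -fx)   # solve Df(x_k) w = -f(x_k)
        x = x + w                          # update x_{k+1} = x_k + w
    return x

# Illustrative system: x^2 + y^2 = 1, y = x^3
f = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[1] - v[0]**3])
Df = lambda v: np.array([[2*v[0], 2*v[1]], [-3*v[0]**2, 1.0]])
print(newton_system(f, Df, np.array([1.0, 1.0])))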
Chapter 14

Numerical Methods for ODEs

14.1 Introduction
In this chapter we will be concerned with numerical methods for the initial
value problem:

dy(t)/dt = f(t, y(t)),   t_0 < t ≤ T,    (14.1)
y(t_0) = α.    (14.2)

The independent variable t often represents time but does not have to. With-
out loss of generality we will take t0 = 0. The time derivative is also fre-
quently denoted with a dot (especially in physics) or an apostrophe

dy/dt = ẏ = y′.    (14.3)

In (14.1)-(14.2), y and f may be vector-valued, in which case we have an initial value problem for a system of ordinary differential equations (ODEs). The constant α in (14.2) is the given initial condition, which is a constant vector in the case of a system.

Example 14.1.

y 0 (t) = t sin y(t), 0 < t < 2π (14.4)


y(0) = α (14.5)


Example 14.2.
y_1′(t) = y_1(t) y_2(t) − y_1²(t),   0 < t ≤ T,
y_2′(t) = −y_2(t) + t² cos y_1(t),   0 < t ≤ T,    (14.6)

y_1(0) = α_1,
y_2(0) = α_2.    (14.7)
These two are examples of first order ODEs, which is the type of initial
value problems we will focus on. Higher order ODEs can be written as first
order systems by introducing new variables for the derivatives from the first
up to one order less.
Example 14.3. The Harmonic Oscillator.

y″(t) + k² y(t) = 0.    (14.8)

If we define y_1 = y and y_2 = y′ we get

y_1′(t) = y_2(t),
y_2′(t) = −k² y_1(t).    (14.9)
Example 14.4.

y‴(t) + 2y(t) y″(t) + cos y′(t) + e^t = 0.    (14.10)

Introducing the variables y_1 = y, y_2 = y′, and y_3 = y″ we obtain the first order system:

y_1′(t) = y_2(t),
y_2′(t) = y_3(t),    (14.11)
y_3′(t) = −2y_1(t) y_3(t) − cos y_2(t) − e^t.
If f does not depend explicitly on t we call the ODE (or the system
of ODEs) autonomous. We can turn a non-autonomous system into an
autonomous one by introducing t as a new variable.
Example 14.5. Consider the ODE

y′(t) = sin t − y²(t).    (14.12)

If we define y_1 = y and y_2 = t we can write this ODE as the autonomous system

y_1′(t) = sin y_2(t) − y_1²(t),
y_2′(t) = 1.    (14.13)

We now state a fundamental theorem of existence and uniqueness of so-


lutions to the initial value problem (14.1)-(14.2).
Theorem 14.1. Existence and Uniqueness.
Let

D = {(t, y) : 0 ≤ t ≤ T, y ∈ Rn } . (14.14)

If f is continuous in D and uniformly Lipschitz in y, i.e. there is a constant


L ≥ 0 such that

kf (t, y2 ) − f (t, y1 )k ≤ Lky2 − y1 k (14.15)

for all t ∈ [0, T ] and all y1 , y2 ∈ Rn , then the initial value problem (14.1)-
(14.2) has a unique solution for each α ∈ Rn .
Note that if there is L_0 ≥ 0 such that

| ∂f_i/∂y_j (t, y) | ≤ L_0    (14.16)

for all y, all t ∈ [0, T], and i, j = 1, . . . , n, then f is uniformly Lipschitz in y (equivalently, if a given norm of the derivative matrix of f with respect to y is bounded by L, see Section 13.8).
Example 14.6.

y 0 = y 2/3 , 0<t
(14.17)
y(0) = 0.
The partial derivative

∂f/∂y = (2/3) y^{−1/3}    (14.18)

is not continuous around 0. Clearly, y ≡ 0 is a solution of this initial value problem, but so is y(t) = t³/27. There is no uniqueness of solution for this initial value problem.
Example 14.7.
y0 = y2, 1<t≤3
(14.19)
y(1) = 3.

We can integrate to obtain

y(t) = 1/(2 − t),    (14.20)

which becomes unbounded as t → 2. Consequently, there is no solution on [1, 3]. Note that

∂f/∂y = 2y    (14.21)

is unbounded (because y is) on [1, 3]. The function f is not uniformly Lipschitz in y for t ∈ [1, 3].
The initial value problem (14.1)-(14.2) can be reformulated as

y(t) = y(0) + ∫_0^t f(s, y(s)) ds.    (14.22)

This is an integral equation for the unknown function y. In particular, if f does not depend on y, then the initial value problem (14.1)-(14.2) is reduced to the approximation of the definite integral

∫_0^T f(s) ds,    (14.23)
for which a numerical quadrature can be applied.
The numerical methods that we will study next deal with the more general
and important case when f depends on the unknown y. They will produce an
approximation to the exact solution of the initial value problem (assuming
uniqueness) at a set of discrete points 0 = t0 < t1 < . . . < tN ≤ T . For
simplicity in the presentation of the ideas we will assume that these points
are equispaced
tn = n∆t, n = 0, 1, . . . , N and ∆t = T /N . (14.24)
∆t is called the time step or simply the step size. A numerical method for an
initial value problem will be written as an algorithm to go from one discrete
time tn to the next tn+1 . With that in mind, it is useful to transform the
ODE (or ODE system) into a integral equation by integrating from tn to
tn+1 :
Z tn+1
y(tn+1 ) = y(tn ) + f (t, y(t))dt. (14.25)
tn
This equation provides a useful framework for the construction of numerical
methods with the aid of quadratures.

14.2 A First Look at Numerical Methods


Let us denote by yn the approximation to the exact solution at tn , i.e.
yn ≈ y(tn ). (14.26)
Starting from the integral formulation of the problem, eq. (14.25), if we
approximate the integral using f evaluated at the lower integration limit
∫_{t_n}^{t_{n+1}} f(t, y(t)) dt ≈ f(t_n, y(t_n)) ∆t    (14.27)

and replace the exact derivative of y at tn by f (tn , yn ), we obtain the so called


forward Euler’s method:
y0 = α (14.28)
yn+1 = yn + ∆tf (tn , yn ), n = 0, 1, . . . N − 1. (14.29)
This provides an explicit formula to advance from one time step to the next. The approximation at the future step, y_{n+1}, only depends on the approximation at the current step, y_n. The forward Euler method is an example of an explicit one-step method.
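A compact Python sketch of the forward Euler method (14.28)-(14.29) for a scalar ODE; the function name is illustrative, and the example uses the equation of Example 14.1 with the initial value taken as 1 for concreteness.

import numpy as np

def forward_euler(f, alpha, T, N):
    """Forward Euler (14.28)-(14.29) on [0, T] with N steps."""
    dt = T / N
    t = np.linspace(0.0, T, N + 1)
    y = np.zeros(N + 1)
    y[0] = alpha
    for n in range(N):
        y[n + 1] = y[n] + dt * f(t[n], y[n])   # explicit update
    return t, y

# Example: y' = t*sin(y), y(0) = 1, as in Example 14.1
t, y = forward_euler(lambda t, y: t * np.sin(y), 1.0, 2 * np.pi, 200)
print(y[-1])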
If on the other hand we approximate the integral in (14.25) using only
the upper limit of integration and replace f (tn+1 , y(tn+1 )) by f (tn+1 , yn+1 )
we get the backward Euler method:
y0 = α (14.30)
yn+1 = yn + ∆tf (tn+1 , yn+1 ), n = 0, 1, . . . N − 1. (14.31)
Note that now, yn+1 is defined implicitly in (14.31). Thus, to update the
approximation, i.e. to obtain yn+1 for each n = 0, 1, . . . , N − 1, we need to
solve a nonlinear equation (or a system of nonlinear equations if the initial
value problem is for a system of ODEs). Since we expect the approximate
solution not to change much in one time step (for sufficiently small ∆t), yn
can be a good initial guess for a Newton iteration to find yn+1 . The backward
Euler method is an implicit one-step method.
We can start with more accurate quadratures as the basis for our numer-
ical methods. For example, if we use the trapezoidal rule

∫_{t_n}^{t_{n+1}} f(t, y(t)) dt ≈ (∆t/2) [ f(t_n, y(t_n)) + f(t_{n+1}, y(t_{n+1})) ]    (14.32)

and proceed as before we get the trapezoidal rule method:

y_0 = α,    (14.33)
y_{n+1} = y_n + (∆t/2) [ f(t_n, y_n) + f(t_{n+1}, y_{n+1}) ],   n = 0, 1, . . . , N − 1.    (14.34)
Like the backward Euler method, this is an implicit one-step method. We will
see later an important class of one-step methods, known as Runge-Kutta
(RK) methods, that use intermediate approximations to the derivative (i.e.
approximations to f ) and a corresponding quadrature. For example, we can
use the midpoint rule quadrature and the approximation

f( t_{n+1/2}, y(t_{n+1/2}) ) ≈ f( t_{n+1/2}, y_n + (∆t/2) f(t_n, y_n) )    (14.35)

to obtain the explicit midpoint Runge-Kutta method

y_{n+1} = y_n + ∆t f( t_{n+1/2}, y_n + (∆t/2) f(t_n, y_n) ).    (14.36)

Another possibility is to approximate the integrand f in (14.25) by an


interpolating polynomial using f evaluated at previous approximations yn ,
yn−1 , . . ., yn−m . To simplify the notation, let us write

fn = f (tn , yn ). (14.37)

For example, if we replace f in [tn , tn+1 ] by the linear polynomial p1 interpo-


lating (t_n, f_n) and (t_{n−1}, f_{n−1}), i.e.

p_1(t) = f_n (t − t_{n−1})/∆t − f_{n−1} (t − t_n)/∆t,    (14.38)
∆t ∆t
we get
Z tn+1 Z tn+1
∆t
f (t, y(t))dt ≈ p1 (t)dt = [3fn − fn−1 ] (14.39)
tn tn 2

and the corresponding numerical method is

y_{n+1} = y_n + (∆t/2) [ 3f_n − f_{n−1} ],   n = 1, 2, . . . , N − 1.    (14.40)

This is an example of a multistep method. To obtain the approximation


at the future step we need the approximations of more than one step; in this
particular case we need yn and yn−1 so this is a two-step method. Note that
to start using (14.40), i.e. n = 1, we need y0 and y1 . For y0 we can use the
initial condition, y0 = α, and we can get y1 by using a one-step method. All
multistep methods require this initialization process when approximations to
y1 , . . . , ym−1 have to be generated with one-step methods to begin using the
multistep formula.
Numerical methods can also be constructed by approximating the deriva-
tive y 0 using finite differences or interpolation. For example, the central
difference approximation

y′(t_n) ≈ ( y(t_n + ∆t) − y(t_n − ∆t) ) / (2∆t) ≈ ( y_{n+1} − y_{n−1} ) / (2∆t)    (14.41)
produces the two-step method

yn+1 = yn−1 + 2∆tfn . (14.42)

If we approximate y′(t_{n+1}) by the derivative of the polynomial interpolating y_{n+1} and some previous approximations, we obtain a class of methods known as backward differentiation formula (BDF) methods. For example, let p_2 be the polynomial of degree at most 2 that interpolates (t_{n−1}, y_{n−1}), (t_n, y_n), and (t_{n+1}, y_{n+1}). Then

y′(t_{n+1}) ≈ p_2′(t_{n+1}) = ( 3y_{n+1} − 4y_n + y_{n−1} ) / (2∆t),    (14.43)

which gives the BDF method

( 3y_{n+1} − 4y_n + y_{n−1} ) / (2∆t) = f_{n+1},   n = 1, 2, . . . , N − 1.    (14.44)
Note that this is an implicit, multistep method.

14.3 One-Step and Multistep Methods


As we have seen, there are two broad classes of methods for the initial value
problem (14.1)-(14.2): one-step methods and multistep methods.
Explicit one-step methods can be written in the general form

yn+1 = yn + ∆t Φ(tn, yn, ∆t)    (14.45)

for some continuous function Φ. For example, Φ(t, y, ∆t) = f(t, y) for the
forward Euler method and Φ(t, y, ∆t) = f(t + ∆t/2, y + (∆t/2) f(t, y)) for the
midpoint RK method.
A general m-step multistep method can be cast as

am yn+1 + am−1 yn + . . . + a0 yn−m+1 = ∆t [bm fn+1 + bm−1 fn + . . . + b0 fn−m+1],    (14.46)

for some coefficients a0, a1, . . . , am, with am ≠ 0, and b0, b1, . . . , bm. If bm ≠ 0
the multistep method is implicit; otherwise it is explicit. Without loss of generality
we will assume am = 1. Shifting the index by m − 1, we can write an m-step
(m ≥ 2) method as

Σ_{j=0}^{m} aj yn+j = ∆t Σ_{j=0}^{m} bj fn+j.    (14.47)
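The general form (14.45) suggests a single time-stepping driver parametrized by the increment function; in the sketch below (an illustrative convention: Φ is also handed f explicitly, which is not part of the notation in the text), forward Euler and the midpoint RK method are recovered by swapping Φ.

def one_step_solve(Phi, f, alpha, T, N):
    """March y_{n+1} = y_n + dt * Phi(t_n, y_n, dt), as in (14.45)."""
    dt = T / N
    t, y = 0.0, alpha
    history = [(t, y)]
    for _ in range(N):
        y = y + dt * Phi(t, y, dt, f)
        t = t + dt
        history.append((t, y))
    return history

# Increment functions for two methods seen above.
euler_phi = lambda t, y, dt, f: f(t, y)
midpoint_phi = lambda t, y, dt, f: f(t + 0.5 * dt, y + 0.5 * dt * f(t, y))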
14.4 Local and Global Error
When computing a numerical approximation of an initial value problem, at
each time step there is an error associated to evolving the solution from
tn to tn+1 with the numerical method instead of using the ODE (or the
integral equation) and there is also an error due to the use of yn as starting
point instead of the exact value y(tn ). After several time steps, these local
errors accumulate in the global error of the approximation to the initial value
problem. Let us make the definition of these errors more precise.
Definition 14.1. The global error en at a given discrete time tn is given by
en = y(tn ) − yn , (14.48)
where y(tn ) and yn are the exact solution of the initial value problem and the
numerical approximation at tn , respectively.
Definition 14.2. The local truncation error τn, also called the local discretization
error, is given by
τn = y(tn ) − yn , (14.49)
where yn is computed with the numerical method using as starting value
y(tn−1) for a one-step method and y(tn−1), y(tn−2), . . . , y(tn−m) for an m-step
method.
For an explicit one-step method the local truncation error is simply
τn+1 = y(tn+1 ) − [y(tn ) + ∆t Φ(tn , y(tn ), ∆t)] (14.50)
and for an explicit multistep method (bm = 0, am = 1) it follows immediately that

τn+m = Σ_{j=0}^{m} aj y(tn+j) − ∆t Σ_{j=0}^{m} bj f(tn+j, y(tn+j))    (14.51)
or equivalently

τn+m = Σ_{j=0}^{m} aj y(tn+j) − ∆t Σ_{j=0}^{m} bj y′(tn+j),    (14.52)

where we have used y′ = f(t, y).
For implicit methods we can also use (14.51) because it gives the local
error up to a multiplicative factor. Indeed, let

Tn+m = Σ_{j=0}^{m} aj y(tn+j) − ∆t Σ_{j=0}^{m} bj f(tn+j, y(tn+j)).    (14.53)
Then

Σ_{j=0}^{m} aj y(tn+j) = ∆t Σ_{j=0}^{m} bj f(tn+j, y(tn+j)) + Tn+m.    (14.54)
On the other hand, yn+m in the definition of the local error is computed using

am yn+m + Σ_{j=0}^{m−1} aj y(tn+j) = ∆t [bm f(tn+m, yn+m) + Σ_{j=0}^{m−1} bj f(tn+j, y(tn+j))].    (14.55)

Subtracting (14.55) from (14.54) and using am = 1 we get

y(tn+m) − yn+m = ∆t bm [f(tn+m, y(tn+m)) − f(tn+m, yn+m)] + Tn+m.    (14.56)
Assuming f is a scalar C¹ function, from the mean value theorem we have

f(tn+m, y(tn+m)) − f(tn+m, yn+m) = ∂f/∂y (tn+m, ηn+m) [y(tn+m) − yn+m],    (14.57)

for some ηn+m between y(tn+m) and yn+m. Substituting this into (14.56) and
solving for τn+m = y(tn+m) − yn+m we get

τn+m = [1 − ∆t bm ∂f/∂y (tn+m, ηn+m)]^{−1} Tn+m.    (14.58)
If f is a vector-valued function (a system of ODEs) then the partial derivative
in (14.58) is a derivative matrix. A similar argument can be made for an
implicit one-step method if the increment function Φ is Lipschitz in y and we
use absolute values in the errors. Thus, we can also view the local truncation
error as a measure of how well the exact solution of the initial value problem
satisfies the numerical method formula.
Example 14.8. The local truncation error for the forward Euler method is

τn+1 = y(tn+1) − [y(tn) + ∆t f(tn, y(tn))].    (14.59)

Taylor expanding the exact solution around tn we have

y(tn+1) = y(tn) + ∆t y′(tn) + (1/2)(∆t)² y′′(ηn)    (14.60)

for some ηn between tn and tn+1. Using y′ = f and substituting (14.60) into
(14.59) we get

τn+1 = (1/2)(∆t)² y′′(ηn).    (14.61)
Thus, assuming the exact solution is C², the local truncation error of the
forward Euler method is O(∆t)². To simplify notation, we write O(∆t)^k to
mean O((∆t)^k), and similarly for other powers of ∆t.
Example 14.9. For the explicit midpoint Runge-Kutta method we have

τn+1 = y(tn+1) − y(tn) − ∆t f(tn+1/2, y(tn) + (∆t/2) f(tn, y(tn))).    (14.62)

Taylor expanding f around (tn, y(tn)) we obtain

f(tn+1/2, y(tn) + (∆t/2) f(tn, y(tn))) = f(tn, y(tn)) + (∆t/2) ∂f/∂t (tn, y(tn)) + (∆t/2) f(tn, y(tn)) ∂f/∂y (tn, y(tn)) + O(∆t)².    (14.63)

But y′ = f, y′′ = f′ and

f′ = ∂f/∂t + ∂f/∂y y′ = ∂f/∂t + ∂f/∂y f.    (14.64)

Therefore

f(tn+1/2, y(tn) + (∆t/2) f(tn, y(tn))) = y′(tn) + (∆t/2) y′′(tn) + O(∆t)².    (14.65)

On the other hand,

y(tn+1) = y(tn) + ∆t y′(tn) + (1/2)(∆t)² y′′(tn) + O(∆t)³.    (14.66)

Substituting (14.65) and (14.66) into (14.62) we get

τn+1 = O(∆t)³.    (14.67)
In the previous two examples the methods are one-step. We now obtain
the local truncation error for a multistep method.
Example 14.10. Let us consider the 2-step Adams-Bashforth method (14.40).
We have

τn+2 = y(tn+2) − y(tn+1) − (∆t/2) [3f(tn+1, y(tn+1)) − f(tn, y(tn))]    (14.68)

and using y′ = f,

τn+2 = y(tn+2) − y(tn+1) − (∆t/2) [3y′(tn+1) − y′(tn)].    (14.69)

Taylor expanding y(tn+2) and y′(tn) around tn+1 we have

y(tn+2) = y(tn+1) + ∆t y′(tn+1) + (1/2)(∆t)² y′′(tn+1) + O(∆t)³,    (14.70)
y′(tn) = y′(tn+1) − ∆t y′′(tn+1) + O(∆t)².    (14.71)

Substituting these expressions into (14.69) we get

τn+2 = ∆t y′(tn+1) + (1/2)(∆t)² y′′(tn+1) − (∆t/2) [2y′(tn+1) + ∆t y′′(tn+1)] + O(∆t)³ = O(∆t)³.    (14.72)
14.5 Order of a Method and Consistency
We saw in the previous section that, assuming sufficient smoothness of the
exact solution of the initial value problem, the local truncation error can
be expressed as O(∆t)^k for some positive integer k. If the local truncation
error accumulates no worse than linearly as we march up N steps to get
the approximate solution at T = N∆t, the global error would be of order
N(∆t)^k = (T/∆t)(∆t)^k, i.e. O(∆t)^{k−1}. So we need k ≥ 2 for all methods, and
to prevent uncontrolled accumulation of the local errors as n → ∞ we need
the methods to be stable in a sense we will make more precise later. This
motivates the following definitions.
Definition 14.3. A numerical method for the initial value problem (14.1)-
(14.2) is said to be of order p if its local truncation error is O(∆t)^{p+1}.
Euler's method is order 1, or first order. The midpoint Runge-Kutta
method and the 2-step Adams-Bashforth method are order 2, or second order.
Definition 14.4. We say that a numerical method is consistent (with the
ODE of the initial value problem) if the method is at least of order 1.
For the case of one-step methods we have

τn+1 = y(tn+1) − [y(tn) + ∆t Φ(tn, y(tn), ∆t)]
     = ∆t y′(tn) − ∆t Φ(tn, y(tn), ∆t) + O(∆t)²    (14.73)
     = ∆t [f(tn, y(tn)) − Φ(tn, y(tn), ∆t)] + O(∆t)².

Thus, assuming the increment function Φ is continuous, a one-step method
is consistent with the ODE y′ = f(t, y) if and only if

Φ(t, y, 0) = f(t, y).    (14.74)
To find a consistency condition for a multistep method, we expand y(tn+j)
and y′(tn+j) around tn,

y(tn+j) = y(tn) + (j∆t) y′(tn) + (1/2!)(j∆t)² y′′(tn) + . . .    (14.75)
y′(tn+j) = y′(tn) + (j∆t) y′′(tn) + (1/2!)(j∆t)² y′′′(tn) + . . .    (14.76)

and substituting in the definition of the local error (14.51) we get that a
multistep method is consistent if and only if

a0 + a1 + . . . + am = 0,    (14.77)
a1 + 2a2 + . . . + m am = b0 + b1 + . . . + bm.    (14.78)

All the methods that we have seen so far are consistent (with y′ = f(t, y)).
14.6 Convergence
A basic requirement of the approximations generated by a numerical method
is that they get better and better as we take smaller step sizes. That is, we
want the approximations to approach the exact solution as ∆t → 0.
Definition 14.5. A numerical method for the initial value problem (14.1)-(14.2) is
convergent if the global error at a given time t converges to zero as ∆t → 0
with t = n∆t fixed, i.e.

lim_{∆t→0, n∆t=t} [y(n∆t) − yn] = 0.    (14.79)

Note that for a multistep method the initialization values y1, . . . , ym−1 must
converge to y(0) = α as ∆t → 0.
If we consider a one-step method and the definition (14.50) of the local
truncation error τ, the exact solution satisfies
y(tn+1 ) = y(tn ) + ∆t Φ(tn , y(tn ), ∆t) + τn+1 (14.80)
while the approximation is given by
yn+1 = yn + ∆t Φ(tn , yn , ∆t). (14.81)
Subtracting (14.81) from (14.80) we get a difference equation for the global
error
en+1 = en + ∆t [Φ(tn , y(tn ), ∆t) − Φ(tn , yn , ∆t)] + τn+1 . (14.82)
Thus, the growth of the global error as we take more and more time steps is
linked not only to the local error but also to the increment function Φ. Let
us suppose that Φ is Lipschitz in y, i.e. there is L ≥ 0 such that

|Φ(t, y1, ∆t) − Φ(t, y2, ∆t)| ≤ L|y1 − y2|    (14.83)

for all t ∈ [0, T] and y1 and y2 in the relevant domain of existence of the
solution. Recall that for the Euler method Φ(t, y, ∆t) = f(t, y), and we
assume f(t, y) is Lipschitz in y to guarantee existence and uniqueness for the
initial value problem, so the Lipschitz assumption on Φ is somewhat natural.
Then, taking absolute values (or norms in the vector case), using the triangle
inequality and (14.83) we obtain
|en+1 | ≤ (1 + ∆tL)|en | + |τn+1 |. (14.84)
For a method of order p, |τn+1| ≤ C(∆t)^{p+1} for sufficiently small ∆t. Therefore,

|en+1| ≤ (1 + ∆tL)|en| + C(∆t)^{p+1}
       ≤ (1 + ∆tL)[(1 + ∆tL)|en−1| + C(∆t)^{p+1}] + C(∆t)^{p+1}
       ≤ . . .    (14.85)
       ≤ (1 + ∆tL)^{n+1}|e0| + C(∆t)^{p+1} Σ_{j=0}^{n} (1 + ∆tL)^j

and summing the geometric sum we get

|en+1| ≤ (1 + ∆tL)^{n+1}|e0| + C(∆t)^{p+1} [(1 + ∆tL)^{n+1} − 1]/(∆tL).    (14.86)
Now 1 + t ≤ e^t for all real t and consequently (1 + ∆tL)^n ≤ e^{n∆tL}. Thus,
since e0 = 0,

|en| ≤ C(∆t)^{p+1} (e^{n∆tL} − 1)/(∆tL) < (C/L) e^{n∆tL} (∆t)^p.    (14.87)

Thus, the global error goes to zero as ∆t → 0, keeping t = n∆t fixed, and we
obtain the following important result.
Theorem 14.2. A consistent (p ≥ 1) one-step method whose increment function
Φ(t, y, ∆t) is Lipschitz in y is convergent.
The Lipschitz condition on Φ gives stability to the method in the sense
that this condition allows us to control the growth of the global error in the
limit as ∆t → 0 and n → ∞.
Example 14.11. As we have seen, the forward Euler method is order 1 and
hence consistent. Since Φ = f and we are assuming that f is Lipschitz in y,
by the previous theorem the forward Euler method is convergent.
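Theorem 14.2 can also be checked numerically: for a problem with known solution, halving ∆t should roughly halve the global error of the (first order) forward Euler method. A small sketch of such a test, with an illustrative model problem y′ = −y (not taken from the text):

import numpy as np

def euler_global_error(lam, T, N):
    """Global error at time T of forward Euler applied to y' = lam*y, y(0) = 1."""
    dt = T / N
    y = 1.0
    for _ in range(N):
        y += dt * lam * y
    return abs(np.exp(lam * T) - y)

for N in (20, 40, 80, 160):
    eN = euler_global_error(-1.0, 1.0, N)
    e2N = euler_global_error(-1.0, 1.0, 2 * N)
    print(N, eN, eN / e2N)   # the ratio should approach 2 for a first order method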
Example 14.12. Prove that the midpoint Runge-Kutta method is convergent
(assuming f is Lipschitz in y).
The increment function in this case is

Φ(t, y, ∆t) = f(t + ∆t/2, y + (∆t/2) f(t, y)).    (14.88)

Therefore

|Φ(t, y1, ∆t) − Φ(t, y2, ∆t)| = |f(t + ∆t/2, y1 + (∆t/2) f(t, y1)) − f(t + ∆t/2, y2 + (∆t/2) f(t, y2))|
    ≤ L |y1 + (∆t/2) f(t, y1) − y2 − (∆t/2) f(t, y2)|    (14.89)
    ≤ L |y1 − y2| + (∆t/2) L |f(t, y1) − f(t, y2)|
    ≤ (1 + (∆t/2) L) L |y1 − y2| ≤ L̃ |y1 − y2|,

where L̃ = (1 + (∆t0/2) L) L and ∆t ≤ ∆t0, i.e. for sufficiently small ∆t. This
proves that Φ is Lipschitz in y and, since the midpoint Runge-Kutta method
is of order 2 and hence consistent, it is convergent.
The exact solution of the initial value problem at tn+1 is determined
uniquely from its value at tn. In contrast, multistep methods use not only yn
but also yn−1, . . . , yn−m+1 to produce yn+1. The use of more than one step
introduces some peculiarities to the theory of stability and convergence of
multistep methods. We will cover these topics separately after we look at
the most important class of one-step methods: the Runge-Kutta methods.
14.7 Runge-Kutta Methods
As seen earlier, Runge-Kutta (RK) methods are based on replacing the integral in

y(tn+1) = y(tn) + ∫[tn, tn+1] f(t, y(t)) dt    (14.90)
with a quadrature formula and using accurate enough intermediate approximations
for the integrand f (the derivative of y). For example, if we use the
trapezoidal rule quadrature

∫[tn, tn+1] f(t, y(t)) dt ≈ (∆t/2) [f(tn, y(tn)) + f(tn+1, y(tn+1))]    (14.91)

and the approximations y(tn) ≈ yn and y(tn+1) ≈ yn + ∆t f(tn, yn), we obtain a
two-stage RK method known as the improved Euler method:

K1 = f(tn, yn),    (14.92)
K2 = f(tn + ∆t, yn + ∆t K1),    (14.93)
yn+1 = yn + ∆t [(1/2) K1 + (1/2) K2].    (14.94)
Note that K1 and K2 are approximations to the derivative of y.
Example 14.13. The midpoint RK method (14.36) is also a two-stage RK
method and can be written as

K1 = f(tn, yn),    (14.95)
K2 = f(tn + ∆t/2, yn + (∆t/2) K1),    (14.96)
yn+1 = yn + ∆t K2.    (14.97)
We know from Example 14.9 that the midpoint RK method is order 2. The
improved Euler method is also order 2. Obtaining the order of an RK method
using Taylor expansions becomes a long, tedious process because the number
of terms in the derivatives of f grows rapidly (y′ = f, y′′ = ft + fy f, y′′′ =
ftt + 2fty f + fyy f² + fy ft + fy² f, etc.).
The most popular RK method is the following 4-stage (and fourth order)
explicit RK, known as the classical fourth order RK:

K1 = f(tn, yn),
K2 = f(tn + ∆t/2, yn + (∆t/2) K1),
K3 = f(tn + ∆t/2, yn + (∆t/2) K2),    (14.98)
K4 = f(tn + ∆t, yn + ∆t K3),
yn+1 = yn + (∆t/6) [K1 + 2K2 + 2K3 + K4].
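A direct transcription of (14.98) into code (a minimal sketch; for a system of ODEs the same lines work if y and the values returned by f are arrays):

def rk4_step(f, t, y, dt):
    """One step of the classical fourth order RK method (14.98)."""
    K1 = f(t, y)
    K2 = f(t + 0.5 * dt, y + 0.5 * dt * K1)
    K3 = f(t + 0.5 * dt, y + 0.5 * dt * K2)
    K4 = f(t + dt, y + dt * K3)
    return y + (dt / 6.0) * (K1 + 2.0 * K2 + 2.0 * K3 + K4)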
A general s-stage RK method can be written as

K1 = f(tn + c1 ∆t, yn + ∆t Σ_{j=1}^{s} a1j Kj),
K2 = f(tn + c2 ∆t, yn + ∆t Σ_{j=1}^{s} a2j Kj),
. . .    (14.99)
Ks = f(tn + cs ∆t, yn + ∆t Σ_{j=1}^{s} asj Kj),
yn+1 = yn + ∆t Σ_{j=1}^{s} bj Kj.
RK methods are determined by the constants c1, . . . , cs that specify the
quadrature points, the coefficients aij, i, j = 1, . . . , s, used to obtain
approximations of the solution at the intermediate quadrature points,
and the quadrature coefficients b1, . . . , bs. Consistent RK methods need to
satisfy the conditions

ci = Σ_{j=1}^{s} aij,    (14.100)
Σ_{j=1}^{s} bj = 1.    (14.101)
To define an RK method it is enough to specify the coefficients cj, aij,
and bj for i, j = 1, . . . , s. These coefficients are often displayed in a table,
called the Butcher tableau (after J.C. Butcher) of the RK method, as shown
in Table 14.1. For an explicit RK method, the matrix of coefficients A = (aij)

Table 14.1: Butcher tableau for a general RK method.

c1   a11  . . .  a1s
 .     .    .      .
 .     .     .     .
cs   as1  . . .  ass
      b1  . . .  bs

is lower triangular with zeros on the diagonal, i.e. aij = 0 for i ≤ j. The
zeros of A are usually not displayed in the tableau.
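For explicit methods the tableau translates directly into code. The sketch below (illustrative, for a scalar ODE, and using the convention of (14.99) with a strictly lower triangular A) performs one step given the arrays A, b, and c.

import numpy as np

def explicit_rk_step(f, t, y, dt, A, b, c):
    """One step of an explicit s-stage RK method given its Butcher tableau.

    A must be strictly lower triangular (a_ij = 0 for i <= j), so each stage
    K_i depends only on the stages already computed.
    """
    s = len(b)
    K = np.zeros(s)
    for i in range(s):
        yi = y + dt * sum(A[i][j] * K[j] for j in range(i))
        K[i] = f(t + c[i] * dt, yi)
    return y + dt * sum(b[j] * K[j] for j in range(s))

# Tableau of the classical fourth order RK method (Table 14.4).
A = [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1.0, 0]]
b = [1/6, 1/3, 1/3, 1/6]
c = [0.0, 0.5, 0.5, 1.0]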
Example 14.14. Tables 14.2-14.4 show the Butcher tableaux of some
explicit RK methods.

Table 14.2: Improved Euler.

0
1      1
      1/2   1/2
Implicit RK methods are useful for some initial value problems with disparate
time scales, as we will see later. To reduce the computational work
needed to solve for the unknowns K1, . . . , Ks (each K is vector-valued for a
system of ODEs) in an implicit RK method, two particular types of implicit
RK methods are usually employed. The first type is the diagonally implicit

Table 14.3: Midpoint RK.

0
1/2   1/2
       0     1
Table 14.4: Classical fourth order RK.

0
1/2   1/2
1/2    0    1/2
1      0     0     1
      1/6   1/3   1/3   1/6
RK method, or DIRK, which has aij = 0 for i < j and at least one aii
nonzero. The second type also has aij = 0 for i < j but with the additional
condition that aii = γ for all i = 1, . . . , s, where γ is a constant. The
corresponding methods are called singly diagonally implicit RK methods, or
SDIRK.
Example 14.15. Tables 14.5-14.8 show some examples of DIRK and SDIRK
methods.
Table 14.5: Backward Euler.

1    1
     1

Table 14.6: Implicit midpoint rule RK.

1/2   1/2
       1
Table 14.7: Hammer and Hollingsworth DIRK.

0       0     0
2/3    1/3   1/3
       1/4   3/4

Table 14.8: Two-stage order 3 SDIRK (γ = (3 ± √3)/6).

γ         γ        0
1 − γ   1 − 2γ     γ
          1/2     1/2
14.8 Adaptive Stepping
So far we have considered a fixed ∆t throughout the entire computation of
an approximation to the initial value problem of an ODE or a system of
ODEs. We can vary ∆t as we march up in t to maintain the approximation
within a given error bound. The idea is to obtain an estimate of the error
using two different methods, one of order p and one of order p + 1, and use
this estimate to decide whether the size of ∆t is appropriate or not at the
given time step.
Let yn+1 and wn+1 be the numerical approximations updated from yn
using the methods of order p and p + 1, respectively. Then, we estimate the
error at tn+1 by

en+1 ≈ wn+1 − yn+1.    (14.102)

If |wn+1 − yn+1| ≤ δ, where δ is a prescribed tolerance, then we maintain
the same ∆t and use wn+1 as initial condition for the next time step. If
|yn+1 − wn+1| > δ, we decrease ∆t (e.g. we set it to ∆t/2), recompute yn+1,
obtain the new estimate of the error (14.102), etc.
One-step methods allow for straightforward use of variable ∆t. Variable-step
multistep methods can also be derived but are not used much in practice
due to their limited stability properties.
14.9 Embedded Methods
For computational efficiency, adaptive stepping as described above is implemented
by reusing as many evaluations of f, the right hand side of
y′ = f(t, y), as possible, because evaluating f is the most expensive part of
Runge-Kutta methods. So the idea is to embed, with minimal additional f
evaluations, a Runge-Kutta method inside another. The following example
illustrates this idea.
Consider the explicit trapezoidal method (second order) and the Euler
method (first order). We can embed them as follows:

K1 = f(tn, yn),    (14.103)
K2 = f(tn + ∆t, yn + ∆t K1),    (14.104)
wn+1 = yn + ∆t [(1/2) K1 + (1/2) K2],    (14.105)
yn+1 = yn + ∆t K1.    (14.106)

Note that the approximation of the derivative K1 is used by both methods.
The computation of the higher order approximation (14.105) only costs one
additional evaluation of f.
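A sketch of the embedded pair (14.103)-(14.106) combined with the simple accept/halve strategy of Section 14.8 (the tolerance handling, the minimal step safeguard, and the names are illustrative choices, not part of the text):

def adaptive_euler_pair(f, alpha, T, dt0, delta, dt_min=1e-10):
    """Adaptive stepping with the embedded pair (14.103)-(14.106)."""
    t, y, dt = 0.0, alpha, dt0
    history = [(t, y)]
    while t < T:
        dt = min(dt, T - t)
        K1 = f(t, y)                       # shared by both methods
        K2 = f(t + dt, y + dt * K1)
        w = y + 0.5 * dt * (K1 + K2)       # second order value (14.105)
        v = y + dt * K1                    # first order value  (14.106)
        if abs(w - v) <= delta or dt <= dt_min:
            t, y = t + dt, w               # accept the step, keep dt
            history.append((t, y))
        else:
            dt *= 0.5                      # reject and retry with a smaller step
    return history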
14.10 Multistep Methods
As we have seen, multistep methods use approximations from more than one
step to update the new approximation. They can be written in the general
form

Σ_{j=0}^{m} aj yn+j = ∆t Σ_{j=0}^{m} bj fn+j,    (14.107)

where m ≥ 2 is the number of previous steps the method employs.
Multistep methods require only one evaluation of f per step because the
other, previously computed, values of f are stored. Thus, multistep methods
generally have lower computational cost than one-step methods of the same
order. The trade-off is reduced stability, as we will see later.
We saw in Section 14.2 the approaches of interpolation and finite differences
to construct multistep methods. It is also possible to build multistep
methods by choosing the coefficients a0, . . . , am and b0, . . . , bm so as to achieve
a desired maximal order for a given m ≥ 2 and/or to have certain stability
properties.
Two classes of multistep methods, both derived from interpolation, are
the most commonly used multistep methods. These are the (explicit) Adams-
Bashforth methods and the (implicit) Adams-Moulton methods.
14.10.1 Adams Methods
We constructed in Section 14.2 the two-step Adams-Bashforth method

yn+1 = yn + (∆t/2) [3fn − fn−1],   n = 1, 2, . . . , N − 1.

An m-step explicit Adams method, also called Adams-Bashforth, can be
derived by starting with the integral formulation of the initial value problem

y(tn+1) = y(tn) + ∫[tn, tn+1] f(t, y(t)) dt    (14.108)
and replacing the integral with that of the interpolating polynomial p of
(tj, fj) for j = n − m + 1, . . . , n. Recall that fj = f(tj, yj). If we represent p
in Lagrange form we have

p(t) = Σ_{j=n−m+1}^{n} lj(t) fj,    (14.109)

where

lj(t) = ∏_{k=n−m+1, k≠j}^{n} (t − tk)/(tj − tk),   for j = n − m + 1, . . . , n.    (14.110)

Thus, the m-step explicit Adams method has the form

yn+1 = yn + ∆t [bm−1 fn + bm−2 fn−1 + . . . + b0 fn−m+1],    (14.111)

where

bj−(n−m+1) = (1/∆t) ∫[tn, tn+1] lj(t) dt,   for j = n − m + 1, . . . , n.    (14.112)
Here are the first three explicit Adams methods, 2-step, 3-step, and 4-step,
respectively:

yn+1 = yn + (∆t/2) [3fn − fn−1],    (14.113)
yn+1 = yn + (∆t/12) [23fn − 16fn−1 + 5fn−2],    (14.114)
yn+1 = yn + (∆t/24) [55fn − 59fn−1 + 37fn−2 − 9fn−3].    (14.115)
The implicit Adams methods, also called Adams-Moulton methods, are
derived by also including (tn+1, fn+1) in the interpolation. That is, p is now
the polynomial interpolating (tj, fj) for j = n − m + 1, . . . , n + 1. Here are
the first three implicit Adams methods:

yn+1 = yn + (∆t/12) [5fn+1 + 8fn − fn−1],    (14.116)
yn+1 = yn + (∆t/24) [9fn+1 + 19fn − 5fn−1 + fn−2],    (14.117)
yn+1 = yn + (∆t/720) [251fn+1 + 646fn − 264fn−1 + 106fn−2 − 19fn−3].    (14.118)
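In practice an implicit Adams formula is often paired with an explicit one in predictor-corrector fashion: the Adams-Bashforth value predicts yn+1, and this prediction is used to evaluate fn+1 in the Adams-Moulton formula. The following sketch pairs the 2-step Adams-Bashforth predictor with the corrector (14.116); the pairing and the names are illustrative choices, not a scheme prescribed by the text.

def ab2_am_predictor_corrector(f, alpha, T, N):
    """Predict with AB2 (14.113), correct with the Adams-Moulton formula (14.116)."""
    dt = T / N
    t = [i * dt for i in range(N + 1)]
    y = [0.0] * (N + 1)
    y[0] = alpha
    fprev = f(t[0], y[0])
    y[1] = y[0] + dt * fprev                           # start-up: one Euler step
    for n in range(1, N):
        fn = f(t[n], y[n])
        y_pred = y[n] + 0.5 * dt * (3.0 * fn - fprev)            # predict
        f_pred = f(t[n + 1], y_pred)                             # evaluate
        y[n + 1] = y[n] + dt / 12.0 * (5.0 * f_pred + 8.0 * fn - fprev)  # correct
        fprev = fn
    return t, y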
14.10.2 Zero-Stability and Dahlquist Theorem
Recall that we can write a general multistep method as

Σ_{j=0}^{m} aj yn+j = ∆t Σ_{j=0}^{m} bj fn+j.

Stability is related to the notion of boundedness of the numerical approximation
in the limit as ∆t → 0. Thus, it is natural to consider the equation

am yn+m + am−1 yn+m−1 + . . . + a0 yn = 0.    (14.119)

This is a linear difference equation. Let us look for solutions of the form
yn = ξ^n. Plugging into (14.119) we get

ξ^n [am ξ^m + am−1 ξ^{m−1} + . . . + a0] = 0,    (14.120)
which implies that ξ is a root of the polynomial

ρ(z) = am z^m + am−1 z^{m−1} + . . . + a0.    (14.121)

If ρ has m distinct roots ξ1, ξ2, . . . , ξm then the general solution of (14.119)
is

yn = c1 ξ1^n + c2 ξ2^n + . . . + cm ξm^n,    (14.122)

where c1, c2, . . . , cm are determined uniquely from the initialization values
y0, y1, . . . , ym−1.
If the roots are not all distinct, the solution of (14.119) changes as follows.
If, for example, ξ1 = ξ2 is a double root, i.e. a root of multiplicity 2 (ρ(ξ1) = 0
and ρ′(ξ1) = 0 but ρ′′(ξ1) ≠ 0), then yn = n ξ1^n is also a solution of (14.119).
Let us check this is indeed the case. Substituting yn = n ξ1^n in (14.119) we
get

am (n + m) ξ1^{n+m} + am−1 (n + m − 1) ξ1^{n+m−1} + . . . + a0 n ξ1^n
    = ξ1^n [am (n + m) ξ1^m + am−1 (n + m − 1) ξ1^{m−1} + . . . + a0 n]    (14.123)
    = ξ1^n [n ρ(ξ1) + ξ1 ρ′(ξ1)] = 0.

Thus, in this case of a double root, the general solution of (14.119) is

yn = c1 ξ1^n + c2 n ξ1^n + c3 ξ3^n + . . . + cm ξm^n.    (14.124)
If there is a triple root, say ξ1 = ξ2 = ξ3, then the general solution of (14.119)
is given by

yn = c1 ξ1^n + c2 n ξ1^n + c3 n(n − 1) ξ1^n + . . . + cm ξm^n.    (14.125)
Clearly, for the numerical approximation yn to remain bounded as n → ∞ we
need that all the roots ξ1, ξ2, . . . , ξm of ρ satisfy:
(a) |ξi| ≤ 1, for all i = 1, 2, . . . , m.
(b) If ξi is a root of multiplicity greater than one, then |ξi| < 1.
Conditions (a) and (b) above are known as the root condition.
Definition 14.6. A multistep method is zero-stable (or D-stable) if the zeros
of ρ satisfy the root condition.
Clearly, zero-stability is a necessary condition for convergence of a multistep
method. It is remarkable that it is also a sufficient condition for consistent
multistep methods. This is the content of the following fundamental
theorem due to Dahlquist.
Theorem 14.3. (Dahlquist Theorem) A consistent multistep method is con-
vergent if and only if it is zero-stable.
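The root condition is easy to check numerically from the coefficients of ρ, for instance with numpy.roots; in the minimal sketch below the tolerance and the two example coefficient sets are illustrative.

import numpy as np

def satisfies_root_condition(a, tol=1e-10):
    """Check the root condition for rho(z) = a_m z^m + ... + a_0.

    `a` lists the coefficients [a_m, ..., a_0], highest degree first.
    """
    roots = np.roots(a)
    for r in roots:
        if abs(r) > 1.0 + tol:
            return False
        # a root of modulus one must be simple
        if abs(abs(r) - 1.0) <= tol and np.sum(np.abs(roots - r) <= tol) > 1:
            return False
    return True

print(satisfies_root_condition([1.0, -1.0, 0.0]))  # rho = z^2 - z (2-step Adams-Bashforth): True
print(satisfies_root_condition([1.0, 4.0, -5.0]))  # rho = z^2 + 4z - 5: False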
14.11 A-Stability
So far we have talked about numerical stability in the sense of boundedness
of the numerical approximation in the limit as ∆t → 0. There is another
type of numerical stability which gives us some guidance on the actual size
of ∆t one can take for a stable computation using a given numerical method for
an ODE initial value problem. This type of stability is called linear stability,
absolute stability, or A-stability. It is based on the behavior of a numerical
method for the simple linear problem:

y′ = λy,    (14.126)
y(0) = 1,    (14.127)
where λ is a complex number. The exact solution is y(t) = e^{λt}. Let us look
at the forward Euler method applied to this model problem. We have

yn+1 = yn + ∆tλ yn = (1 + ∆tλ) yn    (14.128)
     = (1 + ∆tλ)(1 + ∆tλ) yn−1 = (1 + ∆tλ)² yn−1    (14.129)
     = . . . = (1 + ∆tλ)^{n+1} y0 = (1 + ∆tλ)^{n+1}.    (14.130)

Thus, the forward Euler solution is yn = (1 + ∆tλ)^n. Clearly, in order for this
numerical approximation to remain bounded as n → ∞ (long time behavior)
we need

|1 + ∆tλ| ≤ 1.    (14.131)

Denoting z = ∆tλ, the set

S = {z ∈ C : |1 + z| ≤ 1},    (14.132)

i.e. the unit disk centered at −1, is the region of linear stability or A-stability
of the forward Euler method.
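Regions of the form |R(z)| ≤ 1 can be visualized by sampling the stability function on a grid in the complex plane. The following sketch (plotting choices are illustrative, and the second stability function anticipates Example 14.17) draws the boundary |R(z)| = 1 for the forward Euler and improved Euler methods.

import numpy as np
import matplotlib.pyplot as plt

# Stability functions: forward Euler and (anticipating Example 14.17) improved Euler.
R_euler = lambda z: 1.0 + z
R_improved = lambda z: 1.0 + z + 0.5 * z**2

x, yv = np.meshgrid(np.linspace(-3, 1, 400), np.linspace(-2, 2, 400))
z = x + 1j * yv

for R in (R_euler, R_improved):
    plt.contour(x, yv, np.abs(R(z)), levels=[1.0])   # boundary |R(z)| = 1
plt.xlabel("Re z"); plt.ylabel("Im z")
plt.title("Boundaries of the linear stability regions")
plt.show()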
Runge-Kutta methods applied to the linear problem (14.126) produce a
solution of the form

yn+1 = R(∆tλ) yn,    (14.133)

where R is a rational function, R(z) = P(z)/Q(z), with P and Q polynomials.
In particular, when the Runge-Kutta method is explicit, R is just a
polynomial. Therefore, for a Runge-Kutta method the region of A-stability
is given by the set

S = {z ∈ C : |R(z)| ≤ 1}.    (14.134)

R is called the stability function of the Runge-Kutta method.
Example 14.16. The implicit trapezoidal rule method. We have

yn+1 = yn + (∆t/2)(λyn + λyn+1)    (14.135)

and solving for yn+1 we get

yn+1 = [(1 + (∆t/2)λ) / (1 − (∆t/2)λ)] yn,    (14.136)

so the region of A-stability of the (implicit) trapezoidal rule method is the set of
complex numbers z such that

|(1 + z/2) / (1 − z/2)| ≤ 1    (14.137)

and this is the entire left half complex plane, Re{z} ≤ 0.
Example 14.17. The improved Euler method. We have

yn+1 = yn + (∆t/2) [λyn + λ(yn + ∆tλyn)],    (14.138)

that is,

yn+1 = [1 + ∆tλ + (1/2)(∆tλ)²] yn.    (14.139)

The stability function is therefore R(z) = 1 + z + z²/2 and the set of linear
stability consists of all the complex numbers such that |R(z)| ≤ 1. Note that

R(z) = e^z + O(z³).    (14.140)

That is, R approximates e^{∆tλ} to third order in ∆t, as it should because the
method is second order (the local truncation error is O(∆t)³).
Example 14.18. The backward Euler method. In this case we have

yn+1 = yn + ∆tλ yn+1,    (14.141)

and solving for yn+1 we obtain

yn+1 = [1/(1 − ∆tλ)] yn,    (14.142)

so its stability function is R(z) = 1/(1 − z) and its A-stability region is
therefore the set of complex numbers z such that |1 − z| ≥ 1, i.e. the exterior
of the unit disk centered at 1.
Definition 14.7. A method is called A-stable if its linear stability region
contains the left half complex plane.
The trapezoidal rule method and the backward Euler method are both
A-stable.
Let us now consider A-stability for linear multistep methods. When we
apply an m-step (m > 1) method to the linear ODE (14.126) we get

Σ_{j=0}^{m} aj yn+j − ∆tλ Σ_{j=0}^{m} bj yn+j = 0.    (14.143)

This is a constant coefficient, linear difference equation. We look for solutions
of this equation in the form yn = ξ^n. Substituting into (14.143) we have

ξ^n Σ_{j=0}^{m} (aj − ∆tλ bj) ξ^j = 0,    (14.144)

which implies that ξ is a root of the polynomial

Π(ξ; z) = (am − z bm) ξ^m + (am−1 − z bm−1) ξ^{m−1} + . . . + (a0 − z b0),    (14.145)
where z = ∆tλ. We are going to write this polynomial in terms of the
polynomials defined by the coefficients of the multistep method:

ρ(ξ) = am ξ^m + am−1 ξ^{m−1} + . . . + a0,    (14.146)
σ(ξ) = bm ξ^m + bm−1 ξ^{m−1} + . . . + b0.    (14.147)

We have

Π(ξ; z) = ρ(ξ) − z σ(ξ).    (14.148)
Hence, in order for the numerical approximation yn to remain bounded we
need that all the roots of Π satisfy the root condition.
Definition 14.8. The region of A-stability of a linear multistep method is
the set

S = {z ∈ C : all the roots of Π(ξ; z) satisfy the root condition}.    (14.149)

Recall that consistency for a multistep method translates into the following
conditions:

a0 + a1 + . . . + am = 0,
a1 + 2a2 + . . . + m am = b0 + b1 + . . . + bm,

or in terms of the multistep method polynomials:

ρ(1) = 0,    (14.150)
ρ′(1) = σ(1).    (14.151)

The first condition implies that Π(1; 0) = 0. Therefore, by the implicit
function theorem, Π has a root near 1 for z in a neighborhood of zero. Such a root
is called the principal root of the multistep method. Multistep (m > 1)
methods have one or more additional roots and these are called parasitic
roots.
Example 14.19. Consider the 2-step method

yn+1 + 4yn − 5yn−1 = ∆t [4fn + 2fn−1].    (14.152)

Then

ρ(ξ) = ξ² + 4ξ − 5 = (ξ − 1)(ξ + 5),    (14.153)
σ(ξ) = 4ξ + 2.    (14.154)

Thus, ρ(1) = 0 and ρ′(1) = σ(1), and the method is consistent. However, the
roots of ρ are 1 and −5 and hence the method is not zero-stable. Therefore,
by Dahlquist's theorem, it is not convergent. For the polynomial Π we have

Π(ξ; z) = ξ² + 4(1 − z)ξ − (5 + 2z),    (14.155)

which has roots

ξ± = −2(1 − z) ± √(9 − 6z + 4z²)    (14.156)

and expanding for small |z| we have

ξ+ = 1 + z + O(z²)   (principal root),    (14.157)
ξ− = −5 + O(z)   (parasitic root).    (14.158)
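The behavior in (14.157)-(14.158) can be confirmed numerically by computing the roots of Π(ξ; z) = ρ(ξ) − zσ(ξ) for a few small values of z; the short sketch below does this with numpy.roots (an illustrative check, not part of the text).

import numpy as np

# rho and sigma for the 2-step method (14.152), highest degree first.
rho = np.array([1.0, 4.0, -5.0])       # xi^2 + 4*xi - 5
sigma = np.array([0.0, 4.0, 2.0])      # 4*xi + 2

for z in (0.0, 0.01, 0.1):
    roots = np.roots(rho - z * sigma)  # roots of Pi(xi; z) = rho(xi) - z*sigma(xi)
    print(z, roots)
# One root behaves like 1 + z (principal); the other stays near -5 (parasitic),
# so the root condition is violated for every z.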
14.12 Stiff ODEs