Introduction to Numerical Analysis

Hector D. Ceniceros

© Draft date July 6, 2020
Contents

Preface
1 Introduction
  1.1 What is Numerical Analysis?
  1.2 An Illustrative Example
      1.2.1 An Approximation Principle
      1.2.2 Divide and Conquer
      1.2.3 Convergence and Rate of Convergence
      1.2.4 Error Correction
      1.2.5 Richardson Extrapolation
  1.3 Super-algebraic Convergence
2 Function Approximation
  2.1 Norms
  2.2 Uniform Polynomial Approximation
      2.2.1 Bernstein Polynomials and Bézier Curves
      2.2.2 Weierstrass Approximation Theorem
  2.3 Best Approximation
      2.3.1 Best Uniform Polynomial Approximation
  2.4 Chebyshev Polynomials
3 Interpolation
  3.1 Polynomial Interpolation
      3.1.1 Equispaced and Chebyshev Nodes
  3.2 Connection to Best Uniform Approximation
  3.3 Barycentric Formula
4 Trigonometric Approximation
  4.1 Approximating a Periodic Function
  4.2 Interpolating Fourier Polynomial
  4.3 The Fast Fourier Transform
6 Computer Arithmetic
  6.1 Floating Point Numbers
  6.2 Rounding and Machine Precision
  6.3 Correctly Rounded Arithmetic
  6.4 Propagation of Errors and Cancellation of Digits

List of Figures

2.1 The Bernstein weights b_{k,n}(x) for x = 0.25 (◦) and x = 0.75 (•), n = 50 and k = 1, ..., n.
2.2 Quadratic Bézier curve.
2.3 If the error function e_n does not equioscillate at least twice we could lower ‖e_n‖_∞ by an amount c > 0.
5.1 The function f(x) = e^x on [0, 1] and its Least Squares Approximation p_1(x) = 4e − 10 + (18 − 6e)x.
5.2 Geometric interpretation of the solution Xa of the Least Squares problem as the orthogonal projection of f on the approximating linear subspace W.
Preface
These notes were prepared by the author for use in the upper division under-
graduate course of Numerical Analysis (Math 104 ABC) at the University of
California at Santa Barbara. They were written with the intent to emphasize
the foundations of Numerical Analysis rather than to present a long list of
numerical methods for different mathematical problems.
We begin with an introduction to Approximation Theory and then use
the different ideas of function approximation in the derivation and analysis
of many numerical methods.
These notes are intended for undergraduate students with a strong math-
ematics background. The prerequisites are Advanced Calculus, Linear Alge-
bra, and introductory courses in Analysis, Differential Equations, and Com-
plex Variables. The ability to write computer code to implement the nu-
merical methods is also a necessary and essential part of learning Numerical
Analysis.
These notes are not in finalized form and may contain errors, misprints,
and other inaccuracies. They cannot be used or distributed without written
consent from the author.
Chapter 1
Introduction
In most cases we cannot find an exact value of I[f] and very often we only know the integrand f at a finite number of points in [a, b]. The problem is then to produce an approximation to I[f] that is as accurate as we need and at a reasonable computational cost.
\[
f(x) \approx p_1(x) = f(a) + \frac{f(b)-f(a)}{b-a}(x-a) \tag{1.2}
\]
and
\[
\int_a^b f(x)\,dx \approx \int_a^b p_1(x)\,dx = f(a)(b-a) + \frac{1}{2}[f(b)-f(a)](b-a) = \frac{1}{2}[f(a)+f(b)](b-a). \tag{1.3}
\]
That is,
\[
\int_a^b f(x)\,dx \approx \frac{b-a}{2}[f(a)+f(b)]. \tag{1.4}
\]
The right hand side is known as the simple Trapezoidal Rule Quadrature. A
quadrature is a method to approximate an integral. How accurate is this
approximation? Clearly, if f is a linear polynomial or a constant then the
Trapezoidal Rule would give us the exact value of the integral, i.e. it would
be exact. The underlying question is: how well does a linear polynomial p1 ,
satisfying
approximate f on the interval [a, b]? We can almost guess the answer. The
approximation is exact at x = a and x = b because of (1.5)-(1.6) and it is
1.2. AN ILLUSTRATIVE EXAMPLE 5
exact for all polynomials of degree ≤ 1. This suggests that f(x) − p_1(x) = C f''(ξ)(x − a)(x − b), where C is a constant. But where is f'' evaluated? It cannot be at x, for then f would be the solution of a second order ODE, while f is an arbitrary (but sufficiently smooth, C²[a, b]) function; so it has to be at some undetermined point ξ(x) in (a, b). Now, if we take the particular case f(x) = x² on [0, 1], then p_1(x) = x, f(x) − p_1(x) = x(x − 1), and f''(x) = 2, which implies that C would have to be 1/2. So our conjecture is
\[
f(x) - p_1(x) = \frac{1}{2} f''(\xi(x))(x-a)(x-b). \tag{1.7}
\]
There is a beautiful 19th century proof of this result by A. Cauchy. It goes as follows. If x = a or x = b, (1.7) holds trivially. So let us take x in (a, b) and define the following function of a new variable t:
\[
\phi(t) = f(t) - p_1(t) - [f(x) - p_1(x)]\,\frac{(t-a)(t-b)}{(x-a)(x-b)}. \tag{1.8}
\]
The last integral can be easily evaluated if we shift to the midpoint, i.e., changing variables to x = y + (a + b)/2; then
\[
\int_a^b (x-a)(x-b)\,dx = \int_{-\frac{b-a}{2}}^{\frac{b-a}{2}} \left[ y^2 - \left(\frac{b-a}{2}\right)^2 \right] dy = -\frac{1}{6}(b-a)^3. \tag{1.13}
\]
But we know
\[
\int_{x_j}^{x_{j+1}} f(x)\,dx = \frac{1}{2}[f(x_j) + f(x_{j+1})]h - \frac{1}{12}f''(\xi_j)h^3. \tag{1.17}
\]
The first term on the right hand side is called the Composite Trapezoidal Rule Quadrature (CTR):
\[
T_h[f] := h\left[ \tfrac{1}{2}f(x_0) + f(x_1) + \cdots + f(x_{N-1}) + \tfrac{1}{2}f(x_N) \right]. \tag{1.18}
\]
\[
I[f] = T_h[f] + E_h[f]. \tag{1.21}
\]
Definition 1.1. We say that g(h) is order hα , and write g(h) = O(hα ), if
there is a constant C and h0 such that |g(h)| ≤ Chα for 0 ≤ h ≤ h0 , i.e. for
sufficiently small h.
Example 1.1. Let's check the Trapezoidal Rule approximation for an integral we can compute exactly. Take f(x) = e^x in [0, 1]. The exact value of the integral is e − 1. Observe how the error |I[e^x] − T_{1/N}[e^x]| decreases by a factor of about 4 when N is doubled¹, consistent with the bound
\[
|E_h[f]| \le \frac{1}{12}(b-a)h^2 M_2. \tag{1.22}
\]
¹Neglecting round-off errors introduced by finite precision number representation and computer arithmetic.
However, this bound does not in general provide an accurate estimate of the error. It could grossly overestimate it. This can be seen from (1.19). As N → ∞ the term in brackets converges to a mean value of f'', i.e.
\[
\frac{1}{N}\sum_{j=0}^{N-1} f''(\xi_j) \;\longrightarrow\; \frac{1}{b-a}\int_a^b f''(x)\,dx = \frac{1}{b-a}[f'(b) - f'(a)], \tag{1.23}
\]
so that
\[
E_h[f] = C_2 h^2 + R(h), \tag{1.24}
\]
where
\[
C_2 = -\frac{1}{12}[f'(b) - f'(a)] \tag{1.25}
\]
and R(h) goes to zero faster than h² as h → 0, i.e.
\[
\lim_{h\to 0}\frac{R(h)}{h^2} = 0. \tag{1.26}
\]
More generally, we write g(h) = o(h^α) when
\[
\lim_{h\to 0}\frac{g(h)}{h^\alpha} = 0.
\]
If we know the derivative of f at the end points of the interval, then we can compute directly C₂h² and use this leading order approximation of the error to obtain the improved approximation
\[
\widetilde{T}_h[f] = T_h[f] - \frac{1}{12}[f'(b) - f'(a)]\,h^2. \tag{1.28}
\]
This is called the (composite) Modified Trapezoidal Rule. It then follows from (1.27) that the error of this "corrected approximation" is R(h), which goes to zero faster than h². In fact, we will prove later that the error of the Modified Trapezoidal Rule is O(h⁴).
Often, we only have access to values of f and/or it is difficult to evaluate f'(a) and f'(b). Fortunately, we can compute a sufficiently good approximation of the leading order term of the error, C₂h², so that we can use the same error correction idea that we used for the Modified Trapezoidal Rule. Roughly speaking, the error can be estimated by comparing two approximations obtained with different h.
Consider (1.27). If we halve h we get
\[
I[f] = T_{h/2}[f] + \frac{1}{4}C_2 h^2 + R(h/2). \tag{1.29}
\]
Subtracting (1.29) from (1.27) we get
\[
C_2 h^2 = \frac{4}{3}\left( T_{h/2}[f] - T_h[f] \right) + \frac{4}{3}\left( R(h/2) - R(h) \right). \tag{1.30}
\]
The last term on the right hand side is o(h²). Hence, for h sufficiently small, we have
\[
C_2 h^2 \approx \frac{4}{3}\left( T_{h/2}[f] - T_h[f] \right) \tag{1.31}
\]
and this could provide a good, computable estimate for the error, i.e.
\[
E_h[f] \approx \frac{4}{3}\left( T_{h/2}[f] - T_h[f] \right). \tag{1.32}
\]
The key here is that h has to be sufficiently small to make the asymptotic approximation (1.31) valid. We can check this by working backwards. If h is sufficiently small, then evaluating (1.31) at h/2 we get
\[
C_2 \left(\frac{h}{2}\right)^2 \approx \frac{4}{3}\left( T_{h/4}[f] - T_{h/2}[f] \right). \tag{1.33}
\]
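The following short Python script (a sketch added here for illustration, not part of the original notes) implements the composite trapezoidal rule and checks the computable error estimate (1.32) against the true error for f(x) = e^x on [0, 1]:

    import numpy as np

    def trapezoid(f, a, b, N):
        """Composite trapezoidal rule T_h[f] with N subintervals, h = (b-a)/N."""
        x = np.linspace(a, b, N + 1)
        y = f(x)
        h = (b - a) / N
        return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])

    f = np.exp
    a, b = 0.0, 1.0
    exact = np.e - 1.0
    for N in [8, 16, 32]:
        Th  = trapezoid(f, a, b, N)        # step h
        Th2 = trapezoid(f, a, b, 2 * N)    # step h/2
        est = 4.0 / 3.0 * (Th2 - Th)       # computable estimate (1.32) of E_h[f]
        print(N, abs(exact - Th), abs(est))

For this smooth integrand the estimate agrees with the true error to leading order, and both decrease by roughly a factor of four as N is doubled.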
Therefore
\[
S_h[f] = \frac{h}{6}\left[ f(a) + 2\sum_{k=1}^{N-1} f(a+kh) + 4\sum_{k=0}^{N-1} f\!\left(a + \left(k+\tfrac{1}{2}\right)h\right) + f(b) \right]. \tag{1.39}
\]
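As an illustration (added here, not from the original text), the following minimal Python sketch implements the composite Simpson rule in the midpoint form of (1.39); the index convention in the midpoint sum follows the reconstruction above.

    import numpy as np

    def simpson(f, a, b, N):
        """Composite Simpson rule S_h[f], Eq. (1.39): N subintervals of width h,
        each sampled at its two endpoints and its midpoint."""
        h = (b - a) / N
        xk   = a + h * np.arange(1, N)            # interior nodes a + k h
        xmid = a + h * (np.arange(N) + 0.5)       # midpoints a + (k + 1/2) h
        return h / 6.0 * (f(a) + 2.0 * f(xk).sum() + 4.0 * f(xmid).sum() + f(b))

    # quick check: the error decreases like h^4
    for N in [4, 8, 16]:
        print(N, abs((np.e - 1.0) - simpson(np.exp, 0.0, 1.0, N)))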
Table 1.2: Composite Trapezoidal Rule for f (x) = 1/(2 + sin x) in [0, 2π].
N T2π/N [1/(2 + sin x)] |I[1/(2 + sin x)] − T2π/N [1/(2 + sin x)]|
8 3.627791516645356 1.927881769203665 × 10−4
16 3.627598733591013 5.122577029226250 × 10−9
32 3.627598728468435 4.440892098500626 × 10−16
where the integrand f is periodic in [0, 2π] and has m > 1 continuous deriva-
tives, i.e. f ∈ C m [0, 2π] and f (x + 2π) = f (x) for all x. Due to periodicity
we can work in any interval of length 2π and if the function has a different
period, with a simple change of variables, we can reduce the problem to one
in [0, 2π].
Consider the equally spaced points in [0, 2π], xj = jh for j = 0, 1, . . . , N
and h = 2π/N . Because f is periodic f (x0 = 0) = f (xN = 2π) and the CTR
becomes
\[
T_h[f] = h\left[ \frac{f(x_0)}{2} + f(x_1) + \cdots + f(x_{N-1}) + \frac{f(x_N)}{2} \right] = h\sum_{j=0}^{N-1} f(x_j). \tag{1.41}
\]
Since f is smooth and periodic in [0, 2π], it has a uniformly convergent Fourier series:
\[
f(x) = \frac{a_0}{2} + \sum_{k=1}^{\infty} (a_k\cos kx + b_k\sin kx), \tag{1.42}
\]
where
\[
a_k = \frac{1}{\pi}\int_0^{2\pi} f(x)\cos kx\,dx, \quad k = 0, 1, \ldots \tag{1.43}
\]
\[
b_k = \frac{1}{\pi}\int_0^{2\pi} f(x)\sin kx\,dx, \quad k = 1, 2, \ldots \tag{1.44}
\]
Recalling Euler's formula³, we can write
\[
\cos x = \frac{e^{ix} + e^{-ix}}{2}, \tag{1.46}
\]
\[
\sin x = \frac{e^{ix} - e^{-ix}}{2i}, \tag{1.47}
\]
and the Fourier series can be conveniently expressed in complex form in terms of the functions e^{ikx} for k = 0, ±1, ±2, ..., so that (1.42) becomes
\[
f(x) = \sum_{k=-\infty}^{\infty} c_k e^{ikx}, \tag{1.48}
\]
where
\[
c_k = \frac{1}{2\pi}\int_0^{2\pi} f(x)\, e^{-ikx}\,dx. \tag{1.49}
\]
Substituting the series into the CTR (1.41),
\[
T_h[f] = h\sum_{j=0}^{N-1}\left( \sum_{k=-\infty}^{\infty} c_k e^{ikx_j} \right). \tag{1.50}
\]
Justified by the uniform convergence of the series, we can exchange the finite and the infinite sums to get
\[
T_h[f] = \frac{2\pi}{N}\sum_{k=-\infty}^{\infty} c_k \sum_{j=0}^{N-1} e^{ik\frac{2\pi}{N}j}. \tag{1.51}
\]
³Here i² = −1 and if c = a + ib, with a, b ∈ ℝ, then its complex conjugate is c̄ = a − ib.
But
\[
\sum_{j=0}^{N-1} e^{ik\frac{2\pi}{N}j} = \sum_{j=0}^{N-1} \left( e^{ik\frac{2\pi}{N}} \right)^j. \tag{1.52}
\]
Note that e^{ik2π/N} = 1 precisely when k is an integer multiple of N, i.e. k = lN, l ∈ ℤ, and if so
\[
\sum_{j=0}^{N-1} \left( e^{ik\frac{2\pi}{N}} \right)^j = N \quad \text{for } k = lN. \tag{1.53}
\]
Otherwise, if k ≠ lN, then
\[
\sum_{j=0}^{N-1} \left( e^{ik\frac{2\pi}{N}} \right)^j = \frac{1 - e^{ik2\pi}}{1 - e^{ik\frac{2\pi}{N}}} = 0 \quad \text{for } k \neq lN, \tag{1.54}
\]
that is,
\[
T_h[f] = 2\pi \sum_{l=-\infty}^{\infty} c_{lN}.
\]
So now, the relevant question is how fast the Fourier coefficients c_{lN} of f decay with N. The answer is tied to the smoothness of f. Doing integration by parts in the formula (4.11) for the Fourier coefficients of f we have
\[
c_k = \frac{1}{2\pi}\left[ \frac{1}{ik}\int_0^{2\pi} f'(x)\, e^{-ikx}\,dx - \frac{1}{ik}\, f(x)e^{-ikx}\Big|_0^{2\pi} \right], \quad k \neq 0, \tag{1.59}
\]
and the last term vanishes due to the periodicity of f(x)e^{-ikx}. Hence,
\[
c_k = \frac{1}{2\pi}\,\frac{1}{ik}\int_0^{2\pi} f'(x)\, e^{-ikx}\,dx, \quad k \neq 0. \tag{1.60}
\]
Integrating by parts m times we obtain
\[
c_k = \frac{1}{2\pi}\left(\frac{1}{ik}\right)^m \int_0^{2\pi} f^{(m)}(x)\, e^{-ikx}\,dx, \quad k \neq 0, \tag{1.61}
\]
where f^{(m)} is the m-th derivative of f. Therefore, for f ∈ C^m[0, 2π] and periodic,
\[
|c_k| \le \frac{A_m}{|k|^m}, \tag{1.62}
\]
for some constant A_m independent of k.
Chapter 2

Function Approximation
We saw in the introductory chapter that one key step in the construction of
a numerical method to approximate a definite integral is the approximation
of the integrand by a simpler function, which we can integrate exactly.
The problem of function approximation is central to many numerical
methods: given a continuous function f in an interval [a, b], we would like to
find a good approximation to it by simpler functions, such as polynomials,
trigonometric polynomials, wavelets, rational functions, etc. We are going
to measure the accuracy of an approximation using norms and ask whether
or not there is a best approximation out of functions from a given family of
simpler functions. These are the main topics of this introductory chapter to
Approximation Theory.
2.1 Norms

A norm on a vector space V over a field K = ℝ (or ℂ) is a mapping
\[
\| \cdot \| : V \to [0, \infty),
\]
where
\[
\binom{n}{k} = \frac{n!}{(n-k)!\,k!}, \quad k = 0, \ldots, n, \tag{2.17}
\]
are the binomial coefficients. Note that B_n f(0) = f(0) and B_n f(1) = f(1) for all n. The terms
\[
b_{k,n}(x) = \binom{n}{k} x^k (1-x)^{n-k}, \quad k = 0, \ldots, n, \tag{2.18}
\]
which are all nonnegative, are called the Bernstein basis polynomials and can be viewed as x-dependent weights that sum up to one:
\[
\sum_{k=0}^{n} b_{k,n}(x) = \sum_{k=0}^{n} \binom{n}{k} x^k (1-x)^{n-k} = [x + (1-x)]^n = 1. \tag{2.19}
\]
Thus, for each x ∈ [0, 1], Bn f (x) represents a weighted average of the values
of f at 0, 1/n, 2/n, . . . , 1. Moreover, as n increases the weights bk,n (x) con-
centrate more and more around the points k/n close to x as Fig. 2.1 indicates
for bk,n (0.25) and bk,n (0.75).
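To make the weighted-average interpretation concrete, the following Python sketch (added here, not part of the notes) evaluates B_n f(x) = Σ_k f(k/n) b_{k,n}(x) directly from (2.18) and shows the slow but steady uniform convergence for a test function:

    import numpy as np
    from scipy.stats import binom   # b_{k,n}(x) equals a binomial pmf in k

    def bernstein(f, n, x):
        """Evaluate the Bernstein polynomial B_n f(x) = sum_k f(k/n) b_{k,n}(x)."""
        k = np.arange(n + 1)
        weights = binom.pmf(k, n, x)        # C(n,k) x^k (1-x)^(n-k)
        return np.dot(f(k / n), weights)

    f = lambda t: np.cos(2 * np.pi * t)
    for n in [10, 100, 1000]:
        xs = np.linspace(0, 1, 201)
        err = max(abs(f(x) - bernstein(f, n, x)) for x in xs)
        print(n, err)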
For n = 1, the Bernstein polynomial is just the straight line connecting f(0) and f(1): B_1f(x) = (1 − x)f(0) + xf(1). Given two points P_0 = (x_0, y_0) and P_1 = (x_1, y_1), the segment of the straight line connecting them can be written in parametric form as
\[
B_1(t) = (1 - t)P_0 + tP_1, \quad t \in [0, 1]. \tag{2.20}
\]
Similarly, given three control points P_0, P_1, P_2, one defines the quadratic curve
\[
B_2(t) = (1 - t)^2 P_0 + 2t(1 - t)P_1 + t^2 P_2, \quad t \in [0, 1]. \tag{2.21}
\]
This curve connects again P_0 and P_2 but P_1 can be used to control how the curve bends. More precisely, the tangents at the end points are B'_2(0) = 2(P_1 − P_0) and B'_2(1) = 2(P_2 − P_1), which intersect at P_1, as Fig. 2.2 illus-
trates. These parametric curves formed with the Bernstein basis polynomials
are called Bézier curves and have been widely employed in computer graph-
ics, especially in the design of vector fonts, and in computer animation. To
allow the representation of complex shapes, quadratic or cubic Bézier curves
[Figure 2.1: The Bernstein weights b_{k,n}(x) for x = 0.25 (◦) and x = 0.75 (•), n = 50 and k = 1, ..., n.]

[Figure 2.2: Quadratic Bézier curve with control points P_0, P_1, P_2.]
are pieced together to form composite Bézier curves. To have some degree
of smoothness (C 1 ), the common point for two pieces of a composite Bézier
curve has to lie on the line connecting the two adjacent control points on ei-
ther side. For example, the TrueType font used in most computers today is
generated with composite, quadratic Bézier curves while the Metafont used
in these pages, via LaTeX, employs composite, cubic Bézier curves. For each character, many Bézier pieces are stitched together.
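A minimal Python sketch (added for illustration) of evaluating a quadratic Bézier curve from its three control points, using the degree-2 Bernstein basis:

    import numpy as np

    def quadratic_bezier(P0, P1, P2, t):
        """Evaluate B_2(t) = (1-t)^2 P0 + 2 t (1-t) P1 + t^2 P2 for t in [0, 1]."""
        t = np.asarray(t)[:, None]
        return (1 - t) ** 2 * P0 + 2 * t * (1 - t) * P1 + t ** 2 * P2

    P0 = np.array([0.0, 0.0])
    P1 = np.array([0.5, 1.0])
    P2 = np.array([1.0, 0.0])
    curve = quadratic_bezier(P0, P1, P2, np.linspace(0, 1, 5))
    print(curve)   # starts at P0, ends at P2, bends toward P1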
Let us now do some algebra to prove some useful identities of the Bern-
stein polynomials. First, for f (x) = x we have,
\[
\begin{aligned}
\sum_{k=0}^{n} \frac{k}{n}\binom{n}{k} x^k (1-x)^{n-k}
&= \sum_{k=1}^{n} \frac{k\,n!}{n\,(n-k)!\,k!}\, x^k (1-x)^{n-k} \\
&= x\sum_{k=1}^{n} \binom{n-1}{k-1} x^{k-1}(1-x)^{n-k}
 = x\sum_{k=0}^{n-1} \binom{n-1}{k} x^{k}(1-x)^{n-1-k} \\
&= x\,[x + (1-x)]^{n-1} = x. 
\end{aligned} \tag{2.22}
\]
Similarly, for f(x) = x² we have
\[
\sum_{k=0}^{n} \left(\frac{k}{n}\right)^2 \binom{n}{k} x^k (1-x)^{n-k}
= \sum_{k=1}^{n} \frac{k}{n}\binom{n-1}{k-1} x^k (1-x)^{n-k}, \tag{2.23}
\]
and writing
\[
\frac{k}{n} = \frac{k-1}{n} + \frac{1}{n} = \frac{n-1}{n}\,\frac{k-1}{n-1} + \frac{1}{n}, \tag{2.24}
\]
we have
\[
\begin{aligned}
\sum_{k=0}^{n} \left(\frac{k}{n}\right)^2 \binom{n}{k} x^k (1-x)^{n-k}
&= \frac{n-1}{n}\sum_{k=2}^{n} \frac{k-1}{n-1}\binom{n-1}{k-1} x^k (1-x)^{n-k}
 + \frac{1}{n}\sum_{k=1}^{n} \binom{n-1}{k-1} x^k (1-x)^{n-k} \\
&= \frac{n-1}{n}\sum_{k=2}^{n} \binom{n-2}{k-2} x^k (1-x)^{n-k} + \frac{x}{n} \\
&= \frac{n-1}{n}\, x^2 \sum_{k=0}^{n-2} \binom{n-2}{k} x^k (1-x)^{n-2-k} + \frac{x}{n}.
\end{aligned}
\]
Thus,
\[
\sum_{k=0}^{n} \left(\frac{k}{n}\right)^2 \binom{n}{k} x^k (1-x)^{n-k} = \frac{n-1}{n}\,x^2 + \frac{x}{n}. \tag{2.25}
\]
Now, expanding (k/n − x)² and using (2.19), (2.22), and (2.25) it follows that
\[
\sum_{k=0}^{n} \left(\frac{k}{n} - x\right)^2 \binom{n}{k} x^k (1-x)^{n-k} = \frac{1}{n}\,x(1-x). \tag{2.26}
\]
Proof. We are going to work on the interval [0, 1]. For a general interval
[a, b], we consider the simple change of variables x = a + (b − a)t for t ∈ [0, 1]
so that F (t) = f (a + (b − a)t) is continuous in [0, 1].
Using (2.19), we have
\[
f(x) - B_n f(x) = \sum_{k=0}^{n}\left[ f(x) - f\!\left(\frac{k}{n}\right) \right]\binom{n}{k} x^k (1-x)^{n-k}. \tag{2.27}
\]
We now split the sum in (2.27) in two sums, one over the points such that
|k/n − x| < δ and the other over the points such that |k/n − x| ≥ δ:
\[
\begin{aligned}
f(x) - B_n f(x) &= \sum_{|k/n - x| < \delta} \left[ f(x) - f\!\left(\frac{k}{n}\right) \right]\binom{n}{k} x^k (1-x)^{n-k} \\
&\quad + \sum_{|k/n - x| \ge \delta} \left[ f(x) - f\!\left(\frac{k}{n}\right) \right]\binom{n}{k} x^k (1-x)^{n-k}.
\end{aligned} \tag{2.30}
\]
Using (2.28) and (2.19) it follows immediately that the first sum is bounded by ε/2. For the second sum we have
\[
\begin{aligned}
\left| \sum_{|k/n - x| \ge \delta} \left[ f(x) - f\!\left(\frac{k}{n}\right) \right]\binom{n}{k} x^k (1-x)^{n-k} \right|
&\le 2\|f\|_\infty \sum_{|k/n - x| \ge \delta} \binom{n}{k} x^k (1-x)^{n-k} \\
&\le \frac{2\|f\|_\infty}{\delta^2} \sum_{|k/n - x| \ge \delta} \left(\frac{k}{n} - x\right)^2 \binom{n}{k} x^k (1-x)^{n-k} \\
&\le \frac{2\|f\|_\infty}{\delta^2} \sum_{k=0}^{n} \left(\frac{k}{n} - x\right)^2 \binom{n}{k} x^k (1-x)^{n-k} \\
&= \frac{2\|f\|_\infty}{n\delta^2}\, x(1-x) \le \frac{\|f\|_\infty}{2 n\delta^2}.
\end{aligned} \tag{2.31}
\]
Therefore, there is N such that for all n ≥ N the second sum in (2.30) is bounded by ε/2, and this completes the proof.
For example, the normed linear space V could be C[a, b] with the supre-
mum norm (2.10) and W could be the set of all polynomials of degree at
most n, which henceforth we will denote by Pn .
\[
\|f - p\| \le \|f - 0\| = \|f\|, \tag{2.33}
\]
so the search for a best approximation can be restricted to the set
\[
F = \{\, p \in W : \|f - p\| \le \|f\| \,\}. \tag{2.34}
\]
Example 2.1. Let V = C[0, 1/2] and W be the space of all polynomials (clearly a subspace of V). Take f(x) = 1/(1 − x). Then, given ε > 0 there is N such that
\[
\max_{x\in[0,1/2]} \left| \frac{1}{1-x} - (1 + x + x^2 + \cdots + x^N) \right| < \varepsilon. \tag{2.35}
\]
In other words, a norm is strictly convex if its unit ball is strictly convex.
The p-norm is strictly convex for 1 < p < ∞ but not for p = 1 or p = ∞.
Theorem 2.3. Let V be a vector space with a strictly convex norm, W a subspace of V, and f ∈ V. If p* and q* are best approximations of f in W, then p* = q*.
Proof. Let M = ‖f − p*‖ = ‖f − q*‖. If p* ≠ q*, by the strict convexity of the norm
[Figure 2.3: If the error function e_n does not equioscillate at least twice we could lower ‖e_n‖_∞ by an amount c > 0.]
Therefore, ‖e_n − c‖_∞ = ‖f − (p*_n + c)‖_∞ = ‖e_n‖_∞ − c < ‖e_n‖_∞, but p*_n + c ∈ P_n, so this is impossible since p*_n is a best approximation. A similar argument can be used when e_n(x_1) = −‖e_n‖_∞.
Before proceeding to the general case, let us look at the n = 1 situation. Suppose there are only two alternating extrema x_1 and x_2 for e_1 as described in (2.40). We are going to construct a linear polynomial that has the same sign as e_1 at x_1 and x_2 and which can be used to decrease ‖e_1‖_∞. Suppose e_1(x_1) = ‖e_1‖_∞ and e_1(x_2) = −‖e_1‖_∞. Since e_1 is continuous, we can find small closed intervals I_1 and I_2, containing x_1 and x_2, respectively, and such that
\[
e_1(x) > \frac{\|e_1\|_\infty}{2} \quad \text{for all } x \in I_1, \tag{2.43}
\]
\[
e_1(x) < -\frac{\|e_1\|_\infty}{2} \quad \text{for all } x \in I_2. \tag{2.44}
\]
Clearly I_1 and I_2 are disjoint sets so we can choose a point x_0 between the two intervals. Then, it is possible to find a linear polynomial q that passes through x_0 and that is positive in I_1 and negative in I_2. We are now going to find a suitable constant α > 0 such that ‖f − p*_1 − αq‖_∞ < ‖e_1‖_∞. Since p*_1 + αq ∈ P_1 this would be a contradiction to the fact that p*_1 is a best approximation.
Let R = [a, b] \ (I_1 ∪ I_2) and d = max_{x∈R} |e_1(x)|. Clearly d < ‖e_1‖_∞. Choose α such that
\[
0 < \alpha < \frac{1}{2\|q\|_\infty}\left( \|e_1\|_\infty - d \right). \tag{2.45}
\]
On I_1, we have
\[
0 < \alpha q(x) < \frac{1}{2\|q\|_\infty}\left( \|e_1\|_\infty - d \right) q(x) \le \frac{1}{2}\left( \|e_1\|_\infty - d \right) < e_1(x). \tag{2.46}
\]
Therefore
\[
|e_1(x) - \alpha q(x)| = e_1(x) - \alpha q(x) < \|e_1\|_\infty \quad \text{for all } x \in I_1. \tag{2.47}
\]
Similarly, on I_2, we can show that |e_1(x) − αq(x)| < ‖e_1‖_∞. Finally, on R we have
\[
|e_1(x) - \alpha q(x)| \le |e_1(x)| + |\alpha q(x)| \le d + \frac{1}{2}\left( \|e_1\|_\infty - d \right) < \|e_1\|_\infty. \tag{2.48}
\]
Therefore, ‖e_1 − αq‖_∞ = ‖f − (p*_1 + αq)‖_∞ < ‖e_1‖_∞, which contradicts the best approximation assumption on p*_1.
Proof. We first prove that if the error e_n = f − p*_n, for some p*_n ∈ P_n, equioscillates at least n + 2 times, then p*_n is a best approximation. Suppose the contrary. Then, there is q_n ∈ P_n such that
and since
we have that
But |f(x_j) − p*_n(x_j)| ≤ ‖e_n‖_∞ and |f(x_j) − q*_n(x_j)| ≤ ‖e_n‖_∞. As a consequence,
Definition 2.3. The Chebyshev polynomial (of the first kind) of degree n, T_n, is defined by
\[
T_n(x) = \cos(n \arccos x), \quad x \in [-1, 1]. \tag{2.64}
\]
Note that (2.64) only defines T_n for x ∈ [−1, 1]. However, once the coefficients of this polynomial are determined we can define it for any real (or complex) x.
Using the trigonometric identity cos(n+1)θ + cos(n−1)θ = 2 cos θ cos nθ, with x = cos θ, we immediately get
\[
\begin{aligned}
T_0(x) &= 1, \\
T_1(x) &= x, \\
T_{n+1}(x) &= 2x\,T_n(x) - T_{n-1}(x), \quad n \ge 1,
\end{aligned} \tag{2.67}
\]
and thus
\[
\begin{aligned}
T_0(x) &= 1, \\
T_1(x) &= x, \\
T_2(x) &= 2x\cdot x - 1 = 2x^2 - 1, \\
T_3(x) &= 2x(2x^2 - 1) - x = 4x^3 - 3x, \\
T_4(x) &= 2x(4x^3 - 3x) - (2x^2 - 1) = 8x^4 - 8x^2 + 1, \\
T_5(x) &= 2x(8x^4 - 8x^2 + 1) - (4x^3 - 3x) = 16x^5 - 20x^3 + 5x.
\end{aligned} \tag{2.68}
\]
From these few Chebyshev polynomials, and from (2.67), we see that the leading coefficient of T_n (n ≥ 1) is 2^{n−1}.
Going back to (2.63), since the leading order coefficient of e_n is 1 and that of T_{n+1} is 2^n, it follows that ‖e_n‖_∞ = 2^{−n}. Therefore
\[
p^*_n(x) = x^{n+1} - \frac{1}{2^n} T_{n+1}(x) \tag{2.71}
\]
is the best uniform approximation of x^{n+1} in [−1, 1] by polynomials of degree at most n. Equivalently, as noted in the beginning of this section, the monic polynomial of degree n with smallest infinity norm in [−1, 1] is
\[
\widetilde{T}_n(x) = \frac{1}{2^{n-1}} T_n(x). \tag{2.72}
\]
Hence, for any other monic polynomial p of degree n,
\[
\max_{x\in[-1,1]} |p(x)| > \frac{1}{2^{n-1}}. \tag{2.73}
\]
The zeros and extrema of T_n are easy to find. Because T_n(x) = cos nθ and 0 ≤ θ ≤ π, the zeros occur when θ is an odd multiple of π/2. Therefore,
\[
\bar{x}_j = \cos\!\left( \frac{(2j+1)}{n}\,\frac{\pi}{2} \right), \quad j = 0, \ldots, n-1. \tag{2.74}
\]
In other words, the Chebyshev points (2.75) are the n − 1 zeros of T'_n plus the end points x_0 = 1 and x_n = −1.
Using the Chain Rule to differentiate T_n with respect to x, we get
\[
T'_n(x) = -n\sin n\theta\,\frac{d\theta}{dx} = n\,\frac{\sin n\theta}{\sin\theta}, \quad (x = \cos\theta). \tag{2.77}
\]
Therefore
\[
\frac{T'_{n+1}(x)}{n+1} - \frac{T'_{n-1}(x)}{n-1} = \frac{1}{\sin\theta}\left[ \sin(n+1)\theta - \sin(n-1)\theta \right] \tag{2.78}
\]
and since sin(n+1)θ − sin(n−1)θ = 2 sin θ cos nθ, we get that
\[
\frac{T'_{n+1}(x)}{n+1} - \frac{T'_{n-1}(x)}{n-1} = 2T_n(x). \tag{2.79}
\]
The polynomial
\[
U_n(x) = \frac{T'_{n+1}(x)}{n+1} = \frac{\sin(n+1)\theta}{\sin\theta}, \quad (x = \cos\theta), \tag{2.80}
\]
of degree n is called the Chebyshev polynomial of the second kind. Thus, the Chebyshev nodes (2.75) are the zeros of the polynomial
Chapter 3

Interpolation
\[
\begin{aligned}
p_n(x_0) &= f_0, \\
p_n(x_1) &= f_1, \\
&\;\;\vdots \\
p_n(x_n) &= f_n.
\end{aligned}
\]
Writing p_n(x) = a_0 + a_1 x + ··· + a_n x^n, these conditions are the linear system
\[
\begin{aligned}
a_0 + a_1 x_0 + \cdots + a_n x_0^n &= f_0, \\
a_0 + a_1 x_1 + \cdots + a_n x_1^n &= f_1, \\
&\;\;\vdots \\
a_0 + a_1 x_n + \cdots + a_n x_n^n &= f_n.
\end{aligned} \tag{3.1}
\]
Does this linear system have a solution? Is this solution unique? The answer is yes to both. Here is a simple proof. Take f_j = 0 for j = 0, 1, ..., n. Then p_n(x_j) = 0 for j = 0, 1, ..., n, but p_n is a polynomial of degree at most n; it cannot have n + 1 zeros unless p_n ≡ 0, which implies a_0 = a_1 = ··· = a_n = 0. That is, the homogeneous problem associated with (3.1) has only the trivial solution. Therefore, (3.1) has a unique solution.
\[
p_1(x) = \frac{x - x_1}{x_0 - x_1}\, f_0 + \frac{x - x_0}{x_1 - x_0}\, f_1. \tag{3.2}
\]
Clearly, this polynomial has degree at most 1 and satisfies the interpolation
property:
p1 (x0 ) = f0 , (3.3)
p1 (x1 ) = f1 . (3.4)
If we define
\[
l_0(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}, \tag{3.5}
\]
\[
l_1(x) = \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}, \tag{3.6}
\]
\[
l_2(x) = \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}, \tag{3.7}
\]
then we simply have
\[
p_2(x) = l_0(x) f_0 + l_1(x) f_1 + l_2(x) f_2. \tag{3.8}
\]
Note that each of the polynomials (3.5), (3.6), and (3.7) is exactly of degree 2 and they satisfy l_j(x_k) = δ_{jk}¹. Therefore, it follows that p_2 given by (3.8)
satisfies the interpolation property
p2 (x0 ) = f0 ,
p2 (x1 ) = f1 , (3.9)
p2 (x2 ) = f2 .
We can now write down the polynomial of degree at most n that interpo-
lates n + 1 given values, (x0 , f0 ), . . . , (xn , fn ), where the interpolation nodes
x0 , . . . , xn are assumed distinct. Define
\[
l_j(x) = \frac{(x - x_0)\cdots(x - x_{j-1})(x - x_{j+1})\cdots(x - x_n)}{(x_j - x_0)\cdots(x_j - x_{j-1})(x_j - x_{j+1})\cdots(x_j - x_n)}
= \prod_{\substack{k=0\\ k\neq j}}^{n} \frac{x - x_k}{x_j - x_k}, \quad j = 0, 1, \ldots, n. \tag{3.10}
\]
These are called the elementary Lagrange polynomials of degree n. For simplicity, we are omitting in the notation their dependence on the n + 1 nodes. Since l_j(x_k) = δ_{jk}, we have that
\[
p_n(x) = l_0(x) f_0 + l_1(x) f_1 + \cdots + l_n(x) f_n = \sum_{j=0}^{n} l_j(x) f_j \tag{3.11}
\]
¹δ_{jk} is the Kronecker delta, i.e. δ_{jk} = 0 if k ≠ j and 1 if k = j.
interpolates the given data, i.e., it satisfies the interpolation property pn (xj ) =
fj for j = 0, 1, 2, . . . , n. Relation (3.11) is called the Lagrange form of the
interpolating polynomial. The following result summarizes our discussion.
\[
x = \frac{1}{2}(a + b) + \frac{1}{2}(b - a)t, \quad t \in [-1, 1]. \tag{3.12}
\]
The uniform or equispaced nodes are given by
\[
x_j = a + jh, \quad h = \frac{b-a}{n}, \quad j = 0, \ldots, n. \tag{3.13}
\]
These nodes yield very accurate and efficient trigonometric polynomial interpolation but are generally not good for (algebraic) polynomial interpolation, as we will see later.
One of the preferred sets of nodes for high order, accurate, and computationally efficient polynomial interpolation is the Chebyshev or Gauss-Lobatto set
\[
x_j = \cos\!\left(\frac{j\pi}{n}\right), \quad j = 0, \ldots, n, \tag{3.14}
\]
which, as discussed in Section 2.4, are the extrema of the Chebyshev polynomial (2.64) of degree n. Note that these nodes are obtained from the
where
\[
\Lambda_n = \max_{a \le x \le b} \sum_{j=0}^{n} |l_j(x)| \tag{3.18}
\]
is called the Lebesgue constant and depends only on the interpolation nodes, not on f. On the other hand, we have that
\[
\|f - p^*_n\|_\infty \to 0 \quad \text{as } n \to \infty. \tag{3.22}
\]
Dividing (3.28) by (3.29), we get the so-called Barycentric Formula for interpolation:
\[
p_n(x) = \frac{\displaystyle\sum_{j=0}^{n} \frac{\lambda_j}{x - x_j}\, f_j}{\displaystyle\sum_{j=0}^{n} \frac{\lambda_j}{x - x_j}}, \quad \text{for } x \neq x_j,\; j = 0, 1, \ldots, n. \tag{3.30}
\]
Factoring out the common constant in (3.33), we obtain the barycentric weights for the Chebyshev points
\[
\lambda_j = \begin{cases} \tfrac{1}{2}, & j = 0, \\ (-1)^j, & j = 1, \ldots, n-1, \\ \tfrac{1}{2}(-1)^n, & j = n. \end{cases} \tag{3.34}
\]
Note that for a general interval [a, b], the term (a + b)/2 in the change of variables (3.12) cancels out in (3.25), but we gain an extra factor of [(b − a)/2]^n. However, this factor can be omitted as it does not alter the barycentric formula. Therefore, the same barycentric weights (3.34) can also be used for the Chebyshev nodes in an interval [a, b].
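The barycentric formula (3.30) with the Chebyshev weights (3.34) is straightforward to implement. The following Python sketch (added here, not part of the notes) interpolates Runge's function at the Chebyshev points (3.14); for simplicity it assumes the evaluation points do not coincide with a node:

    import numpy as np

    def cheb_barycentric(f, n, x):
        """Barycentric interpolation (3.30) at the Chebyshev points (3.14),
        with weights (3.34). Assumes x does not coincide with a node."""
        j = np.arange(n + 1)
        xj = np.cos(j * np.pi / n)          # Chebyshev (Gauss-Lobatto) nodes
        lam = (-1.0) ** j
        lam[0] *= 0.5
        lam[-1] *= 0.5                      # weights (3.34)
        w = lam / (x[:, None] - xj[None, :])
        return (w @ f(xj)) / w.sum(axis=1)

    f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)   # Runge's function
    x = np.linspace(-0.95, 0.95, 9)
    print(np.max(np.abs(f(x) - cheb_barycentric(f, 40, x))))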
For the equispaced nodes x_j = a + jh, h = (b − a)/n, we have
\[
\begin{aligned}
\lambda_j &= \frac{1}{(x_j - x_0)\cdots(x_j - x_{j-1})(x_j - x_{j+1})\cdots(x_j - x_n)} \\
&= \frac{1}{(jh)[(j-1)h]\cdots(h)\,(-h)(-2h)\cdots[(j-n)h]} \\
&= \frac{1}{(-1)^{n-j}\, h^n\,[j(j-1)\cdots 1]\,[1\cdot 2\cdots(n-j)]} \\
&= \frac{1}{(-1)^{n-j}\, h^n\, n!}\,\frac{n!}{j!\,(n-j)!}
 = \frac{1}{(-1)^{n} h^n\, n!}\,(-1)^j \binom{n}{j}.
\end{aligned}
\]
We can omit the factor 1/((−1)^n h^n n!) because it cancels out in the barycentric formula. Thus, for equispaced nodes we can use
\[
\lambda_j = (-1)^j \binom{n}{j}, \quad j = 0, 1, \ldots, n. \tag{3.35}
\]
The barycentric weights can be computed incrementally. Denoting by λ_j^{(m)} the weights corresponding to the first m + 1 nodes x_0, ..., x_m, we have:

    λ_0^{(0)} = 1;
    for m = 1 : n
        for j = 0 : m − 1
            λ_j^{(m)} = λ_j^{(m−1)} / (x_j − x_m);
        end
        λ_m^{(m)} = 1 / ∏_{k=0}^{m−1} (x_m − x_k);
    end

If we want to add one more point (x_{n+1}, f_{n+1}) we just extend the m-loop to n + 1 to generate λ_0^{(n+1)}, λ_1^{(n+1)}, ..., λ_{n+1}^{(n+1)}.
\[
p_{n-1}(x) = p_{n-2}(x) + f[x_0, \ldots, x_{n-1}](x - x_0)(x - x_1)\cdots(x - x_{n-2}). \tag{3.39}
\]
Therefore
\[
f[x_0] = f_0, \tag{3.41}
\]
\[
f[x_0, x_1] = \frac{f_1 - f_0}{x_1 - x_0}, \tag{3.42}
\]
and
\[
p_1(x) = f_0 + \frac{f_1 - f_0}{x_1 - x_0}(x - x_0). \tag{3.43}
\]
Define f[x_j] = f_j for j = 0, 1, ..., n. The following identity will allow us to compute all the required divided differences.
Theorem 3.2.
\[
f[x_0, x_1, \ldots, x_k] = \frac{f[x_1, x_2, \ldots, x_k] - f[x_0, x_1, \ldots, x_{k-1}]}{x_k - x_0}. \tag{3.44}
\]
Proof. Let p_{k-1} be the interpolating polynomial of degree at most k − 1 of (x_1, f_1), ..., (x_k, f_k) and q_{k-1} the interpolating polynomial of degree at most k − 1 of (x_0, f_0), ..., (x_{k-1}, f_{k-1}). Then
\[
p(x) = p_{k-1}(x) + \frac{x - x_k}{x_k - x_0}\left[ p_{k-1}(x) - q_{k-1}(x) \right]. \tag{3.45}
\]
For general n we can use the following Horner-like scheme to evaluate p = p_n(x) from the Newton coefficients c_k = f[x_0, ..., x_k]:

    p = c_n;
    for k = n − 1 : −1 : 0
        p = c_k + (x − x_k) * p;
    end
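A compact Python sketch (added for illustration) of the two pieces just described: building the Newton coefficients with the divided-difference recurrence (3.44) and evaluating p_n(x) with the Horner-like scheme.

    import numpy as np

    def newton_coefficients(x, fx):
        """Return c_k = f[x_0, ..., x_k] via the recurrence (3.44)."""
        c = np.array(fx, dtype=float)
        n = len(x)
        for k in range(1, n):
            # after this step, c[j] holds f[x_{j-k}, ..., x_j] for j >= k
            c[k:] = (c[k:] - c[k-1:-1]) / (x[k:] - x[:-k])
        return c

    def newton_eval(x_nodes, c, x):
        """Horner-like evaluation of the Newton form p_n(x)."""
        p = np.full_like(np.asarray(x, dtype=float), c[-1])
        for k in range(len(c) - 2, -1, -1):
            p = c[k] + (x - x_nodes[k]) * p
        return p

    xs = np.array([0.0, 1.0, 2.0, 3.0])
    fs = xs ** 3 - 2 * xs + 1
    c = newton_coefficients(xs, fs)
    print(newton_eval(xs, c, np.array([0.5, 1.5, 2.5])))  # reproduces the cubic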
\[
f(x) - p_1(x) = \frac{1}{2} f''(\xi(x))(x - x_0)(x - x_1),
\]
where p_1 is the polynomial of degree at most 1 that interpolates (x_0, f(x_0)), (x_1, f(x_1)) and ξ(x) ∈ (a, b). The general result about the interpolation error is the following theorem:
Theorem 3.3. Let f ∈ C^{n+1}[a, b], let x_0, x_1, ..., x_n, x be contained in [a, b], and let p_n be the interpolating polynomial of degree at most n of f at x_0, ..., x_n. Then
\[
f(x) - p_n(x) = \frac{1}{(n+1)!} f^{(n+1)}(\xi(x))(x - x_0)(x - x_1)\cdots(x - x_n), \tag{3.46}
\]
where ξ(x) ∈ (a, b).
Proof. The right hand side of (3.46) is known as the Cauchy remainder and the following proof is due to Cauchy.
For x equal to one of the nodes x_j the result is trivially true. Take x fixed, not equal to any of the nodes, and define
\[
\phi(t) = f(t) - p_n(t) - [f(x) - p_n(x)]\,\frac{(t - x_0)(t - x_1)\cdots(t - x_n)}{(x - x_0)(x - x_1)\cdots(x - x_n)}. \tag{3.47}
\]
Clearly, φ ∈ C^{n+1}[a, b] and vanishes at t = x_0, x_1, ..., x_n, x. That is, φ has at least n + 2 zeros. Applying Rolle's Theorem n + 1 times we conclude that there exists a point ξ(x) ∈ (a, b) such that φ^{(n+1)}(ξ(x)) = 0. Therefore,
\[
0 = \phi^{(n+1)}(\xi(x)) = f^{(n+1)}(\xi(x)) - [f(x) - p_n(x)]\,\frac{(n+1)!}{(x - x_0)(x - x_1)\cdots(x - x_n)},
\]
from which (3.46) follows. Note that the repeated application of Rolle's theorem implies that ξ(x) is between min{x_0, x_1, ..., x_n, x} and max{x_0, x_1, ..., x_n, x}.
\[
\begin{aligned}
p(x_0) &= f(x_0), \\
p'(x_0) &= f'(x_0), \\
p(x_1) &= f(x_1), \\
p'(x_1) &= f'(x_1).
\end{aligned}
\]
The divided difference table with repeated nodes is
\[
\begin{array}{llll}
x_0 \;\; f(x_0) & & & \\
x_0 \;\; f(x_0) & f'(x_0) & & \\
x_1 \;\; f(x_1) & f[x_0, x_1] & f[x_0, x_0, x_1] & \\
x_1 \;\; f(x_1) & f'(x_1) & f[x_0, x_1, x_1] & f[x_0, x_0, x_1, x_1]
\end{array} \tag{3.60}
\]
and
\[
p(x) = f(x_0) + f'(x_0)(x - x_0) + f[x_0, x_0, x_1](x - x_0)^2 + f[x_0, x_0, x_1, x_1](x - x_0)^2 (x - x_1). \tag{3.61}
\]
Example 3.4. Let f(0) = 1, f'(0) = 0 and f(1) = √2. Find the Hermite interpolation polynomial.
We construct the table of divided differences as follows:
\[
\begin{array}{llll}
0 \;\; 1 & & \\
0 \;\; 1 & 0 & \\
1 \;\; \sqrt{2} & \sqrt{2} - 1 & \sqrt{2} - 1
\end{array} \tag{3.62}
\]
and therefore
\[
p(x) = 1 + 0\,(x - 0) + (\sqrt{2} - 1)(x - 0)^2 = 1 + (\sqrt{2} - 1)x^2. \tag{3.63}
\]
is the total number of nodes in [a, x]. Take, for example, [−1, 1]. For equispaced nodes ρ(x) = 1/2 and for the Chebyshev nodes ρ(x) = 1/(π√(1 − x²)).
It turns out that the relevant domain of analyticity is given in terms of the function
\[
\phi(z) = -\int_a^b \rho(t)\,\ln|z - t|\,dt. \tag{3.67}
\]
Let Γ_c be the level curve consisting of all the points z ∈ ℂ such that φ(z) = c for c constant. For very large and negative c, Γ_c approximates a large circle. As c is increased, Γ_c shrinks. We take the "smallest" level curve, Γ_{c_0}, which contains [a, b]. The relevant domain of analyticity is
Now the max on the right hand side is attained at the midpoint (x_j + x_{j+1})/2 and
\[
\max_{x_j \le x \le x_{j+1}} |(x - x_j)(x - x_{j+1})| = \left( \frac{x_{j+1} - x_j}{2} \right)^2 = \frac{1}{4} h_j^2, \tag{3.73}
\]
where h_j = x_{j+1} − x_j. Therefore
\[
\max_{x_j \le x \le x_{j+1}} |f(x) - p(x)| \le \frac{1}{8} M_2 h_j^2. \tag{3.74}
\]
On each interval [x_j, x_{j+1}] the spline has the form
\[
s_j(x) = A_j(x - x_j)^3 + B_j(x - x_j)^2 + C_j(x - x_j) + D_j. \tag{3.76}
\]
Let
\[
h_j = x_{j+1} - x_j. \tag{3.77}
\]
The interpolation conditions give
\[
s_j(x_j) = f_j = D_j, \tag{3.78}
\]
\[
s_j(x_{j+1}) = A_j h_j^3 + B_j h_j^2 + C_j h_j + D_j = f_{j+1}. \tag{3.79}
\]
Now s'_j(x) = 3A_j(x − x_j)² + 2B_j(x − x_j) + C_j and s''_j(x) = 6A_j(x − x_j) + 2B_j. Therefore, denoting by z_j the value of the second derivative of the spline at x_j, we have
\[
\begin{aligned}
D_j &= f_j, \\
B_j &= \tfrac{1}{2} z_j, \\
6A_j h_j + 2B_j &= z_{j+1} \;\Rightarrow\; A_j = \frac{1}{6h_j}(z_{j+1} - z_j), \\
C_j &= \frac{1}{h_j}(f_{j+1} - f_j) - \frac{1}{6}h_j(z_{j+1} + 2z_j).
\end{aligned}
\]
That is,
\[
A_j = \frac{1}{6h_j}(z_{j+1} - z_j), \tag{3.84}
\]
\[
B_j = \frac{1}{2} z_j, \tag{3.85}
\]
\[
C_j = \frac{1}{h_j}(f_{j+1} - f_j) - \frac{1}{6} h_j(z_{j+1} + 2z_j), \tag{3.86}
\]
\[
D_j = f_j. \tag{3.87}
\]
Note that the second derivative of s is continuous, s''_j(x_{j+1}) = z_{j+1} = s''_{j+1}(x_{j+1}), and by construction s interpolates the given data. We are now going to use the condition of continuity of the first derivative of s to determine equations for the unknown values z_j, j = 1, 2, ..., n − 1. We have
\[
s'_{j-1}(x_j) = \frac{1}{h_{j-1}}(f_j - f_{j-1}) + \frac{1}{6}h_{j-1}(2z_j + z_{j-1}). \tag{3.88}
\]
Continuity of the first derivative at an interior node means s'_{j-1}(x_j) = s'_j(x_j) for j = 1, 2, ..., n − 1. Therefore
\[
\frac{1}{h_{j-1}}(f_j - f_{j-1}) + \frac{1}{6}h_{j-1}(2z_j + z_{j-1}) = C_j = \frac{1}{h_j}(f_{j+1} - f_j) - \frac{1}{6}h_j(z_{j+1} + 2z_j),
\]
which can be written as
\[
h_{j-1} z_{j-1} + 2(h_{j-1} + h_j) z_j + h_j z_{j+1} = -\frac{6}{h_{j-1}}(f_j - f_{j-1}) + \frac{6}{h_j}(f_{j+1} - f_j), \quad j = 1, \ldots, n-1. \tag{3.89}
\]
This is a linear system of n − 1 equations for the n − 1 unknowns z_1, z_2, ..., z_{n-1}. In matrix form,
\[
\begin{bmatrix}
2(h_0 + h_1) & h_1 & & \\
h_1 & 2(h_1 + h_2) & h_2 & \\
& \ddots & \ddots & \ddots \\
& & h_{n-2} & 2(h_{n-2} + h_{n-1})
\end{bmatrix}
\begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_{n-1} \end{bmatrix}
=
\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_{n-1} \end{bmatrix}, \tag{3.90}
\]
where
\[
\begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_{n-2} \\ d_{n-1} \end{bmatrix}
=
\begin{bmatrix}
-\frac{6}{h_0}(f_1 - f_0) + \frac{6}{h_1}(f_2 - f_1) - h_0 z_0 \\
-\frac{6}{h_1}(f_2 - f_1) + \frac{6}{h_2}(f_3 - f_2) \\
\vdots \\
-\frac{6}{h_{n-3}}(f_{n-2} - f_{n-3}) + \frac{6}{h_{n-2}}(f_{n-1} - f_{n-2}) \\
-\frac{6}{h_{n-2}}(f_{n-1} - f_{n-2}) + \frac{6}{h_{n-1}}(f_n - f_{n-1}) - h_{n-1} z_n
\end{bmatrix}. \tag{3.91}
\]
Note that z_0 = f''_0 and z_n = f''_n are unspecified. The values z_0 = z_n = 0 define what is called a natural spline. The matrix of the linear system (3.90) is strictly diagonally dominant, a concept we make precise in the definition below. A consequence of this property, as we will see shortly, is that the matrix is nonsingular and therefore there is a unique solution for z_1, z_2, ..., z_{n-1}, corresponding to the second derivative values of the spline at the interior nodes.
Definition 3.1. An n × n matrix A with entries a_{ij}, i, j = 1, ..., n, is strictly diagonally dominant if
\[
|a_{ii}| > \sum_{\substack{j=1\\ j\neq i}}^{n} |a_{ij}|, \quad \text{for } i = 1, \ldots, n. \tag{3.92}
\]
and consequently
\[
|a_{kk}||x_k| \le \sum_{\substack{j=1\\ j\neq k}}^{n} |a_{kj}||x_j|. \tag{3.94}
\]
Dividing by |x_k|, which by assumption is nonzero, and using that |x_j|/|x_k| ≤ 1 for all j = 1, ..., n, we get
\[
|a_{kk}| \le \sum_{\substack{j=1\\ j\neq k}}^{n} |a_{kj}|, \tag{3.95}
\]
which contradicts the strict diagonal dominance (3.92).
Once z_1, z_2, ..., z_{n-1} are found, the spline coefficients can be computed from (3.84)-(3.87).
Example 3.5. Find the natural cubic spline that interpolates (0, 0), (1, 1), (2, 16).
We know z_0 = 0 and z_2 = 0. We only need to find z_1 (there is only one interior node). The system (3.89) degenerates to just one equation and h = 1; thus
\[
z_0 + 4z_1 + z_2 = 6[f_0 - 2f_1 + f_2] = 84 \;\Rightarrow\; z_1 = 21.
\]
In [0, 1] we have
\[
\begin{aligned}
A_0 &= \frac{1}{6h}(z_1 - z_0) = \frac{1}{6}\times 21 = \frac{7}{2}, \\
B_0 &= \frac{1}{2}z_0 = 0, \\
C_0 &= \frac{1}{h}(f_1 - f_0) - \frac{1}{6}h(z_1 + 2z_0) = 1 - \frac{1}{6}\times 21 = -\frac{5}{2}, \\
D_0 &= f_0 = 0.
\end{aligned}
\]
A tridiagonal matrix has the form
\[
A = \begin{bmatrix}
a_1 & b_1 & & & \\
c_1 & a_2 & b_2 & & \\
& c_2 & \ddots & \ddots & \\
& & \ddots & \ddots & b_{N-1} \\
& & & c_{N-1} & a_N
\end{bmatrix}. \tag{3.97}
\]
Its LU factorization can be computed entrywise in the order
m_1; u_1, l_1, m_2; u_2, l_2, m_3; ...; u_4, l_4, m_5 (shown here for N = 5).
Since u_j = b_j for all j, we can write down the algorithm for general N as

    % Forward substitution on Ly = d
    y_1 = d_1
    for j = 2 : N
        y_j = d_j − l_{j−1} * y_{j−1}
    end
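For completeness, here is a short Python sketch (not in the original notes) of the full tridiagonal solve: the LU factorization with unit lower factor (subdiagonal l, diagonal m of U, superdiagonal u = b), followed by the forward and backward substitutions. The array names a, b, c follow (3.97).

    import numpy as np

    def tridiag_solve(a, b, c, d):
        """Solve the tridiagonal system (3.97): a = diagonal (length N),
        b = superdiagonal, c = subdiagonal (both length N-1), d = rhs."""
        N = len(a)
        m = np.empty(N); l = np.empty(max(N - 1, 0))
        y = np.empty(N); z = np.empty(N)
        m[0] = a[0]
        for j in range(1, N):               # LU factorization (u_j = b_j)
            l[j-1] = c[j-1] / m[j-1]
            m[j] = a[j] - l[j-1] * b[j-1]
        y[0] = d[0]
        for j in range(1, N):               # forward substitution Ly = d
            y[j] = d[j] - l[j-1] * y[j-1]
        z[-1] = y[-1] / m[-1]
        for j in range(N - 2, -1, -1):      # backward substitution Uz = y
            z[j] = (y[j] - b[j] * z[j+1]) / m[j]
        return z

    # quick check against a dense solve
    a = np.array([4.0, 4.0, 4.0]); b = np.array([1.0, 1.0]); c = np.array([1.0, 1.0])
    d = np.array([5.0, 6.0, 5.0])
    A = np.diag(a) + np.diag(b, 1) + np.diag(c, -1)
    print(tridiag_solve(a, b, c, d), np.linalg.solve(A, d))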
\[
s_j(x) = A_j(x - x_j)^3 + B_j(x - x_j)^2 + C_j(x - x_j) + D_j,
\]
and so
Therefore
These two equations together with (3.89) uniquely determine the second derivative values at all the nodes. The resulting (n + 1) × (n + 1) linear system is also tridiagonal and diagonally dominant (hence nonsingular). Once the values z_0, z_1, ..., z_n are found, the spline coefficients are obtained from (3.84)-(3.87).
Chapter 4

Trigonometric Approximation

4.1 Approximating a Periodic Function

We seek to approximate a 2π-periodic function f by a trigonometric polynomial,
\[
f(x) \approx \frac{1}{2}a_0 + \sum_{k=1}^{n} (a_k\cos kx + b_k\sin kx), \tag{4.1}
\]
where
\[
a_k = \frac{1}{\pi}\int_0^{2\pi} f(x)\cos kx\,dx, \quad k = 0, 1, \ldots, n, \tag{4.2}
\]
\[
b_k = \frac{1}{\pi}\int_0^{2\pi} f(x)\sin kx\,dx, \quad k = 1, 2, \ldots, n. \tag{4.3}
\]
We will show that this is the best approximation to f, in the L² norm, by a trigonometric polynomial of degree n (the right hand side of (4.1)). For convenience, we write a trigonometric polynomial S_n (of degree n) in complex form (see (1.45)-(1.48)) as
\[
S_n(x) = \sum_{k=-n}^{n} c_k e^{ikx}. \tag{4.4}
\]
This problem simplifies if we use the orthogonality of the set {1, e^{ix}, e^{-ix}, ..., e^{inx}, e^{-inx}}: for k ≠ −l,
\[
\int_0^{2\pi} e^{ikx} e^{ilx}\,dx = \int_0^{2\pi} e^{i(k+l)x}\,dx = \left.\frac{1}{i(k+l)}\, e^{i(k+l)x}\right|_0^{2\pi} = 0, \tag{4.7}
\]
and for k = −l,
\[
\int_0^{2\pi} e^{ikx} e^{ilx}\,dx = \int_0^{2\pi} dx = 2\pi. \tag{4.8}
\]
Thus, we get
\[
J_n = \int_0^{2\pi} [f(x)]^2\,dx - 2\sum_{k=-n}^{n} c_k \int_0^{2\pi} f(x)\, e^{ikx}\,dx + 2\pi\sum_{k=-n}^{n} c_k c_{-k}. \tag{4.9}
\]
that is,
\[
\sum_{k=-n}^{n} |c_k|^2 \le \frac{1}{2\pi}\int_0^{2\pi} [f(x)]^2\,dx. \tag{4.13}
\]
In terms of the real coefficients this bounds the quantity
\[
\frac{1}{2}a_0^2 + \sum_{k=1}^{\infty} (a_k^2 + b_k^2).
\]
This convergence in the mean for a continuous periodic function implies that Bessel's inequality becomes the equality
\[
\frac{1}{2}a_0^2 + \sum_{k=1}^{\infty} (a_k^2 + b_k^2) = \frac{1}{\pi}\int_0^{2\pi} [f(x)]^2\,dx, \tag{4.16}
\]
Theorem 4.1. Suppose that f is piecewise continuous and periodic in [0, 2π] and with a piecewise continuous first derivative. Then
\[
\frac{1}{2}a_0 + \sum_{k=1}^{\infty} (a_k\cos kx + b_k\sin kx) = \frac{f^+(x) + f^-(x)}{2}
\]
for each x ∈ [0, 2π], where a_k and b_k are the Fourier coefficients of f. Here f^+(x) and f^-(x) denote the one-sided limits of f at x (from the right and from the left, respectively).
We have been working on the interval [0, 2π] but we can choose any other interval of length 2π. This is so because if g is 2π-periodic then we have the following result.
Lemma 2.
\[
\int_0^{2\pi} g(x)\,dx = \int_t^{t+2\pi} g(x)\,dx. \tag{4.19}
\]
Proof. Define
\[
G(t) = \int_t^{t+2\pi} g(x)\,dx. \tag{4.20}
\]
Then
\[
G(t) = \int_t^0 g(x)\,dx + \int_0^{t+2\pi} g(x)\,dx = \int_0^{t+2\pi} g(x)\,dx - \int_0^t g(x)\,dx. \tag{4.21}
\]
\[
\begin{aligned}
a_k &= \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos kx\,dx = \frac{1}{\pi}\int_{-\pi}^{\pi} |x|\cos kx\,dx
= \frac{2}{\pi}\int_0^{\pi} x\cos kx\,dx \\
&\qquad (u = x,\; dv = \cos kx\,dx,\quad du = dx,\; v = \tfrac{1}{k}\sin kx) \\
&= \frac{2}{\pi}\left[ \frac{x}{k}\sin kx\Big|_0^{\pi} - \frac{1}{k}\int_0^{\pi} \sin kx\,dx \right]
= \frac{2}{\pi k^2}\cos kx\Big|_0^{\pi} \\
&= \frac{2}{\pi k^2}\left[ (-1)^k - 1 \right], \quad k = 1, 2, \ldots
\end{aligned}
\]
The convenience of the 1/2 factor in the last term will be seen in the formulas for the coefficients below. It is conceptually and computationally simpler to work with the corresponding polynomial in complex form
\[
P_N(x) = \sum_{k=-N/2}^{N/2}{}^{''}\; c_k e^{ikx}, \tag{4.26}
\]
where the double prime in the sum means that the first and last terms (for k = −N/2 and k = N/2) carry a factor of 1/2. It is also understood that c_{−N/2} = c_{N/2}, which is equivalent to the b_{N/2} = 0 condition in (4.25).
Theorem 4.2. The trigonometric polynomial
\[
P_N(x) = \sum_{k=-N/2}^{N/2}{}^{''}\; c_k e^{ikx}
\]
with
\[
c_k = \frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\, e^{-ikx_j}, \quad k = -\frac{N}{2}, \ldots, \frac{N}{2},
\]
interpolates f at the equispaced nodes x_j = j\,2π/N, j = 0, 1, ..., N − 1.
Defining
\[
l_j(x) = \frac{1}{N}\sum_{k=-N/2}^{N/2}{}^{''}\; e^{ik(x - x_j)}, \tag{4.27}
\]
we have
\[
P_N(x) = \sum_{j=0}^{N-1} f(x_j)\, l_j(x). \tag{4.28}
\]
Now,
\[
l_j(x_m) = \frac{1}{N}\sum_{k=-N/2}^{N/2}{}^{''}\; e^{ik(m-j)2\pi/N}.
\]
At the nodes the two half-weighted terms k = ±N/2 are equal, so together they count as one full term and we can write
\[
\begin{aligned}
l_j(x_m) &= \frac{1}{N}\sum_{k=-N/2}^{N/2-1} e^{ik(m-j)2\pi/N} \\
&= \frac{1}{N}\sum_{k=-N/2}^{N/2-1} e^{i(k+N/2)(m-j)2\pi/N}\, e^{-i(N/2)(m-j)2\pi/N} \\
&= e^{-i(m-j)\pi}\,\frac{1}{N}\sum_{k=0}^{N-1} e^{ik(m-j)2\pi/N}.
\end{aligned}
\]
But
\[
\frac{1}{N}\sum_{k=0}^{N-1} e^{-ik(j-m)2\pi/N} =
\begin{cases}
0 & \text{if } j - m \text{ is not divisible by } N, \\
1 & \text{otherwise},
\end{cases} \tag{4.30}
\]
and consequently l_j(x_m) = δ_{jm} for j, m = 0, 1, ..., N − 1, so that
\[
P_N(x_m) = f(x_m), \quad m = 0, 1, \ldots, N-1.
\]
Using the relations (4.12) between the c_k and the a_k and b_k coefficients we find that
\[
P_N(x) = \frac{1}{2}a_0 + \sum_{k=1}^{N/2-1} (a_k\cos kx + b_k\sin kx) + \frac{1}{2}a_{N/2}\cos\frac{N}{2}x,
\]
and in particular c_{−N/2} = c_{N/2}. Using the interpolation property and setting f_j = f(x_j), we have
\[
f_j = \sum_{k=-N/2}^{N/2}{}^{''}\; c_k e^{ikx_j} = \sum_{k=-N/2}^{N/2-1} c_k e^{ikx_j} \tag{4.35}
\]
and
\[
\sum_{k=-N/2}^{N/2-1} c_k e^{ikx_j} = \sum_{k=-N/2}^{-1} c_k e^{ikx_j} + \sum_{k=0}^{N/2-1} c_k e^{ikx_j}
= \sum_{k=N/2}^{N-1} c_k e^{ikx_j} + \sum_{k=0}^{N/2-1} c_k e^{ikx_j}
= \sum_{k=0}^{N-1} c_k e^{ikx_j}, \tag{4.36}
\]
where we have used that c_{k+N} = c_k. Combining this with the formula for the c_k's we get the Discrete Fourier Transform (DFT) pair
\[
c_k = \frac{1}{N}\sum_{j=0}^{N-1} f_j\, e^{-ikx_j}, \quad k = 0, \ldots, N-1, \tag{4.37}
\]
\[
f_j = \sum_{k=0}^{N-1} c_k\, e^{ikx_j}, \quad j = 0, \ldots, N-1. \tag{4.38}
\]
The set of discrete coefficients (4.39) is known as the DFT of the periodic array f_0, f_1, ..., f_{N-1}, and (4.40) is referred to as the Inverse DFT.
The direct evaluation of the DFT is computationally expensive; it requires order N² operations. However, there is a remarkable algorithm which achieves this in merely order N log N operations. This is known as the Fast Fourier Transform.
Consider
\[
d_k = \sum_{j=0}^{N-1} f_j\,\omega_N^{kj}, \qquad \omega_N = e^{-i2\pi/N}. \tag{4.41}
\]
Let us call the matrix of this transformation F_N. Then F_N is N times the matrix of the DFT, and the matrix of the Inverse DFT is simply the complex conjugate
But, with n = N/2,
\[
\omega_N^{2jk} = e^{-i2jk\frac{2\pi}{N}} = e^{-ijk\frac{2\pi}{N/2}} = e^{-ijk\frac{2\pi}{n}} = \omega_n^{kj}, \tag{4.46}
\]
\[
\omega_N^{(2j+1)k} = e^{-i(2j+1)k\frac{2\pi}{N}} = e^{-ik\frac{2\pi}{N}}\, e^{-i2jk\frac{2\pi}{N}} = \omega_N^k\,\omega_n^{kj}. \tag{4.47}
\]
So denoting f_j^e = f_{2j} and f_j^o = f_{2j+1}, we get
\[
d_k = \sum_{j=0}^{n-1} f_j^e\,\omega_n^{jk} + \omega_N^k \sum_{j=0}^{n-1} f_j^o\,\omega_n^{jk}. \tag{4.48}
\]
\[
\begin{aligned}
m_N &= 2m_{N/2} + N \\
&= 2m_{2^{p-1}} + 2^p \\
&= 2(2m_{2^{p-2}} + 2^{p-1}) + 2^p = 2^2 m_{2^{p-2}} + 2\cdot 2^p \\
&= \cdots \\
&= 2^p m_{2^0} + p\cdot 2^p = p\cdot 2^p \\
&= N\log_2 N,
\end{aligned}
\]
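The splitting (4.48) translates directly into a recursive implementation. The following Python sketch (added here; in practice one would call NumPy's built-in FFT) assumes N is a power of two and computes d_k = Σ f_j ω_N^{kj} as in (4.41):

    import numpy as np

    def fft_recursive(f):
        """Radix-2 FFT of (4.41): d_k = sum_j f_j exp(-2*pi*i*j*k/N), N a power of 2."""
        N = len(f)
        if N == 1:
            return np.array(f, dtype=complex)
        d_even = fft_recursive(f[0::2])          # DFT of f^e, size n = N/2
        d_odd  = fft_recursive(f[1::2])          # DFT of f^o, size n = N/2
        k = np.arange(N // 2)
        twiddle = np.exp(-2j * np.pi * k / N)    # omega_N^k in (4.48)
        # for k >= N/2 one uses omega_N^(k + N/2) = -omega_N^k
        return np.concatenate([d_even + twiddle * d_odd,
                               d_even - twiddle * d_odd])

    f = np.random.rand(16)
    print(np.allclose(fft_recursive(f), np.fft.fft(f)))   # True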
Chapter 5

Least Squares Approximation
We look for a polynomial
\[
p_n(x) = a_0 + a_1 x + \cdots + a_n x^n
\]
such that
\[
\int_a^b [f(x) - p_n(x)]^2\,dx = \min. \tag{5.9}
\]
[Figure 5.1: The function f(x) = e^x on [0, 1] and its Least Squares Approximation p_1(x) = 4e − 10 + (18 − 6e)x.]
\[
= \sum_{i=0}^{n}\sum_{j=0}^{n} \int_a^b a_i x^i\, a_j x^j\,dx
= \int_a^b \sum_{i=0}^{n}\sum_{j=0}^{n} a_i x^i\, a_j x^j\,dx
= \int_a^b \Big( \sum_{j=0}^{n} a_j x^j \Big)^{2} dx \ge 0.
\]
\[
H = \begin{bmatrix}
1 & \frac{1}{2} & \cdots & \frac{1}{n+1} \\[2pt]
\frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{n+2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{1}{n+1} & \frac{1}{n+2} & \cdots & \frac{1}{2n+1}
\end{bmatrix}, \tag{5.12}
\]
and
\[
0 = \frac{\partial J}{\partial a_m} = -2\int_a^b \phi_m(x) f(x)\,dx + 2\sum_{k=0}^{n} a_k \int_a^b \phi_k(x)\phi_m(x)\,dx.
\]
If the functions φ_0, ..., φ_n are orthogonal, then the coefficients of the least squares approximation are explicitly given by
\[
a_m = \frac{1}{\alpha_m}\int_a^b \phi_m(x) f(x)\,dx, \qquad \alpha_m = \int_a^b \phi_m^2(x)\,dx, \quad m = 0, 1, \ldots, n, \tag{5.16}
\]
and
\[
p_n(x) = a_0\phi_0(x) + a_1\phi_1(x) + \cdots + a_n\phi_n(x).
\]
Note that if the set {φ_0, ..., φ_n} is orthogonal, (5.16) and (5.13) imply the Bessel inequality
\[
\sum_{k=0}^{n} \alpha_k a_k^2 \le \int_a^b f^2(x)\,dx. \tag{5.17}
\]
then the series \(\sum_{k=0}^{\infty} \alpha_k a_k^2\) converges.
We can consider the Least Squares approximation for a class of linear combinations of orthogonal functions {φ_0, ..., φ_n}, not necessarily polynomials. We saw an example of this with Fourier approximations¹. It is convenient to define a weighted L² norm associated with the Least Squares problem,
\[
\|f\|_{w,2} = \left( \int_a^b f^2(x)\, w(x)\,dx \right)^{1/2}, \tag{5.18}
\]
Example 5.2. The set of functions {φ_0(x), ..., φ_n(x)}, where φ_k(x) is a polynomial of degree k for k = 0, 1, ..., n, is linearly independent on any interval [a, b]. For a_0φ_0(x) + a_1φ_1(x) + ··· + a_nφ_n(x) is a polynomial of degree at most n, and hence a_0φ_0(x) + a_1φ_1(x) + ··· + a_nφ_n(x) = 0 for all x in a given interval [a, b] implies a_0 = a_1 = ··· = a_n = 0.
Given a set of linearly independent functions {φ_0(x), ..., φ_n(x)} we can produce an orthogonal set {ψ_0(x), ..., ψ_n(x)} by the Gram-Schmidt procedure:
\[
\begin{aligned}
\psi_0(x) &= \phi_0(x), \\
\psi_1(x) &= \phi_1(x) - c_0\psi_0(x), &\quad \langle\psi_1, \psi_0\rangle = 0 \;&\Rightarrow\; c_0 = \frac{\langle\phi_1, \psi_0\rangle}{\langle\psi_0, \psi_0\rangle}, \\
\psi_2(x) &= \phi_2(x) - c_0\psi_0(x) - c_1\psi_1(x),
&\quad \langle\psi_2, \psi_0\rangle = 0 \;&\Rightarrow\; c_0 = \frac{\langle\phi_2, \psi_0\rangle}{\langle\psi_0, \psi_0\rangle}, \quad
\langle\psi_2, \psi_1\rangle = 0 \;\Rightarrow\; c_1 = \frac{\langle\phi_2, \psi_1\rangle}{\langle\psi_1, \psi_1\rangle}, \\
&\;\;\vdots
\end{aligned}
\]
In general,
\[
\psi_0(x) = \phi_0(x), \qquad
\psi_k(x) = \phi_k(x) - \sum_{j=0}^{k-1} c_j\psi_j(x), \quad c_j = \frac{\langle\phi_k, \psi_j\rangle}{\langle\psi_j, \psi_j\rangle}. \tag{5.21}
\]
Then taking the inner product of this expression with ψ_k and using orthogonality we get
\[
-\langle x\psi_k, \psi_k\rangle = -\alpha_k\langle\psi_k, \psi_k\rangle
\quad\text{and}\quad
\alpha_k = \frac{\langle x\psi_k, \psi_k\rangle}{\langle\psi_k, \psi_k\rangle}.
\]
Similarly, taking the inner product with ψ_{k-1} we obtain
\[
-\langle x\psi_k, \psi_{k-1}\rangle = -\beta_k\langle\psi_{k-1}, \psi_{k-1}\rangle,
\]
but ⟨xψ_k, ψ_{k-1}⟩ = ⟨ψ_k, xψ_{k-1}⟩ and xψ_{k-1}(x) = ψ_k(x) + q_{k-1}(x), where q_{k-1}(x) is a polynomial of degree at most k − 1; then
\[
\langle\psi_k, x\psi_{k-1}\rangle = \langle\psi_k, \psi_k\rangle + \langle\psi_k, q_{k-1}\rangle = \langle\psi_k, \psi_k\rangle,
\]
where we have used orthogonality in the last equation. Therefore
\[
\beta_k = \frac{\langle\psi_k, \psi_k\rangle}{\langle\psi_{k-1}, \psi_{k-1}\rangle}.
\]
Finally, taking the inner product of (5.22) with ψ_m for m = 0, ..., k − 2 we get
\[
-\langle\psi_k, x\psi_m\rangle = c_m\langle\psi_m, \psi_m\rangle, \quad m = 0, \ldots, k-2,
\]
but the left hand side is zero because xψ_m(x) is a polynomial of degree at most k − 1 and hence it is orthogonal to ψ_k(x). Collecting the results we obtain a three-term recursion formula:
\[
\psi_0(x) = 1, \tag{5.23}
\]
\[
\psi_1(x) = x - \alpha_0, \qquad \alpha_0 = \frac{\langle x\psi_0, \psi_0\rangle}{\langle\psi_0, \psi_0\rangle}, \tag{5.24}
\]
and for k = 1, ..., n,
\[
\psi_{k+1}(x) = (x - \alpha_k)\psi_k(x) - \beta_k\psi_{k-1}(x), \tag{5.25}
\]
\[
\alpha_k = \frac{\langle x\psi_k, \psi_k\rangle}{\langle\psi_k, \psi_k\rangle}, \tag{5.26}
\]
\[
\beta_k = \frac{\langle\psi_k, \psi_k\rangle}{\langle\psi_{k-1}, \psi_{k-1}\rangle}. \tag{5.27}
\]
Example 5.3. Let [a, b] = [−1, 1] and w(x) ≡ 1. The corresponding orthogonal polynomials are known as the Legendre polynomials and are widely used in a variety of numerical methods. Because xψ_k²(x)w(x) is an odd function, it follows that α_k = 0 for all k. We have ψ_0(x) = 1 and ψ_1(x) = x. We can now use the three-term recursion (5.25) to obtain
\[
\beta_1 = \frac{\displaystyle\int_{-1}^{1} x^2\,dx}{\displaystyle\int_{-1}^{1} dx} = \frac{1}{3},
\]
so ψ_2(x) = x² − 1/3. Continuing this way, we get
\[
\begin{aligned}
\psi_0(x) &= 1, \\
\psi_1(x) &= x, \\
\psi_2(x) &= x^2 - \frac{1}{3}, \\
\psi_3(x) &= x^3 - \frac{3}{5}x, \\
&\;\;\vdots
\end{aligned}
\]
Theorem 5.1. The zeros of orthogonal polynomials are real, simple, and they all lie in (a, b).
Proof. Indeed, ψ_k(x) is orthogonal to ψ_0(x) = 1 for each k ≥ 1, thus
\[
\int_a^b \psi_k(x) w(x)\,dx = 0, \tag{5.28}
\]
i.e. ψ_k has to change sign in [a, b], so it has a zero, say x_1 ∈ (a, b). Suppose x_1 is not a simple root; then q(x) = ψ_k(x)/(x − x_1)² is a polynomial of degree k − 2 and so
\[
0 = \langle\psi_k, q\rangle = \int_a^b \frac{\psi_k^2(x)}{(x - x_1)^2}\, w(x)\,dx > 0,
\]
which is of course impossible. Assume that ψ_k(x) has only l zeros in (a, b), x_1, ..., x_l. Then ψ_k(x)(x − x_1)···(x − x_l) = q_{k-l}(x)(x − x_1)²···(x − x_l)², where q_{k-l}(x) is a polynomial of degree k − l which does not change sign in [a, b]. Then
\[
\langle\psi_k, (x - x_1)\cdots(x - x_l)\rangle = \int_a^b q_{k-l}(x)(x - x_1)^2\cdots(x - x_l)^2\, w(x)\,dx \neq 0,
\]
and since 2 cos nθ cos mθ = cos(m + n)θ + cos(m − n)θ, we get for m ≠ n
\[
\int_{-1}^{1} \frac{T_n(x)T_m(x)}{\sqrt{1-x^2}}\,dx
= \frac{1}{2}\left[ \frac{1}{n+m}\sin(n+m)\theta + \frac{1}{n-m}\sin(n-m)\theta \right]_0^{\pi} = 0.
\]
Consequently,
\[
\int_{-1}^{1} \frac{T_n(x)T_m(x)}{\sqrt{1-x^2}}\,dx =
\begin{cases}
0, & m \neq n, \\
\frac{\pi}{2}, & m = n > 0, \\
\pi, & m = n = 0.
\end{cases} \tag{5.31}
\]
Given data (x_j, f_j), j = 1, ..., N, to fit a line a_0 + a_1 x we minimize
\[
J(a_0, a_1) = \sum_{j=1}^{N} \left[ f_j - (a_0 + a_1 x_j) \right]^2. \tag{5.32}
\]
We can repeat all the Least Squares Approximation theory that we have seen at the continuum level, except that integrals are replaced by sums. The conditions for the minimum are
\[
\frac{\partial J(a_0, a_1)}{\partial a_0} = 2\sum_{j=1}^{N} [f_j - (a_0 + a_1 x_j)](-1) = 0, \tag{5.33}
\]
\[
\frac{\partial J(a_0, a_1)}{\partial a_1} = 2\sum_{j=1}^{N} [f_j - (a_0 + a_1 x_j)](-x_j) = 0, \tag{5.34}
\]
which give the normal equations
\[
a_0\sum_{j=1}^{N} 1 + a_1\sum_{j=1}^{N} x_j = \sum_{j=1}^{N} f_j, \tag{5.35}
\]
\[
a_0\sum_{j=1}^{N} x_j + a_1\sum_{j=1}^{N} x_j^2 = \sum_{j=1}^{N} x_j f_j. \tag{5.36}
\]
More generally, to fit a polynomial
\[
p_n(x) = a_0 + a_1 x + \cdots + a_n x^n, \quad n < N - 1,
\]
we obtain the analogous normal equations.
That is,
\[
\begin{aligned}
a_0\sum_{j=1}^{N} x_j^0 + a_1\sum_{j=1}^{N} x_j^1 + \cdots + a_n\sum_{j=1}^{N} x_j^n &= \sum_{j=1}^{N} x_j^0 f_j, \\
a_0\sum_{j=1}^{N} x_j^1 + a_1\sum_{j=1}^{N} x_j^2 + \cdots + a_n\sum_{j=1}^{N} x_j^{n+1} &= \sum_{j=1}^{N} x_j^1 f_j, \\
&\;\;\vdots \\
a_0\sum_{j=1}^{N} x_j^n + a_1\sum_{j=1}^{N} x_j^{n+1} + \cdots + a_n\sum_{j=1}^{N} x_j^{2n} &= \sum_{j=1}^{N} x_j^n f_j.
\end{aligned}
\]
The matrix of coefficients of this linear system is, as in the continuum case, symmetric, positive definite, and highly sensitive to small perturbations in the data. And again, if we want to increase the degree of the approximating polynomial we cannot reuse the coefficients we already computed for the lower order polynomial. But now we know how to get around these two problems: we use orthogonality.
Let us define the (weighted) discrete inner product as
\[
\langle f, g\rangle_N = \sum_{j=1}^{N} f_j\, g_j\, \omega_j, \tag{5.39}
\]
where ω_j > 0, for j = 1, ..., N, are given weights. Let {φ_0, ..., φ_n} be a set of polynomials such that φ_k is of degree exactly k. Then the solution to the
If {φ_0, ..., φ_n} are orthogonal with respect to the inner product ⟨·, ·⟩_N, i.e. if ⟨φ_k, φ_l⟩_N = 0 for k ≠ l, then the coefficients of the Least Squares Approximation are given by
\[
a_k = \frac{\langle\phi_k, f\rangle_N}{\langle\phi_k, \phi_k\rangle_N}, \quad k = 0, 1, \ldots, n, \tag{5.43}
\]
and p_n(x) = a_0φ_0(x) + a_1φ_1(x) + ··· + a_nφ_n(x).
If the {φ_0, ..., φ_n} are not orthogonal we can produce an orthogonal set {ψ_0, ..., ψ_n} using the 3-term recursion formula adapted to the discrete inner product. We have
\[
\psi_0(x) \equiv 1, \tag{5.44}
\]
\[
\psi_1(x) = x - \alpha_0, \qquad \alpha_0 = \frac{\langle x\psi_0, \psi_0\rangle_N}{\langle\psi_0, \psi_0\rangle_N}, \tag{5.45}
\]
and for k = 1, ..., n,
\[
\psi_{k+1}(x) = (x - \alpha_k)\psi_k(x) - \beta_k\psi_{k-1}(x), \tag{5.46}
\]
\[
\alpha_k = \frac{\langle x\psi_k, \psi_k\rangle_N}{\langle\psi_k, \psi_k\rangle_N}, \qquad
\beta_k = \frac{\langle\psi_k, \psi_k\rangle_N}{\langle\psi_{k-1}, \psi_{k-1}\rangle_N}. \tag{5.47}
\]
Then,
\[
a_j = \frac{\langle\psi_j, f\rangle_N}{\langle\psi_j, \psi_j\rangle_N}, \quad j = 0, \ldots, n, \tag{5.48}
\]
and the Least Squares Approximation is p_n(x) = a_0ψ_0(x) + a_1ψ_1(x) + ··· + a_nψ_n(x).
Example 5.4. Suppose we are given the data x_j: 0, 1, 2, 3 and f_j: 1.1, 3.2, 5.1, 6.9, and we would like to fit a line. The normal equations are
\[
a_0\sum_{j=1}^{4} 1 + a_1\sum_{j=1}^{4} x_j = \sum_{j=1}^{4} f_j, \tag{5.49}
\]
\[
a_0\sum_{j=1}^{4} x_j + a_1\sum_{j=1}^{4} x_j^2 = \sum_{j=1}^{4} x_j f_j, \tag{5.50}
\]
that is, 4a_0 + 6a_1 = 16.3 and 6a_0 + 14a_1 = 34.1. Solving this 2 × 2 linear system we get a_0 = 1.18 and a_1 = 1.93. Thus, the Least Squares Approximation is p_1(x) = 1.18 + 1.93x.
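As a quick check of Example 5.4 (a Python sketch added here), one can assemble and solve the 2 × 2 normal equations directly, or simply call numpy.polyfit:

    import numpy as np

    xj = np.array([0.0, 1.0, 2.0, 3.0])
    fj = np.array([1.1, 3.2, 5.1, 6.9])

    # normal equations (5.35)-(5.36)
    A = np.array([[len(xj), xj.sum()],
                  [xj.sum(), (xj**2).sum()]])
    rhs = np.array([fj.sum(), (xj * fj).sum()])
    a0, a1 = np.linalg.solve(A, rhs)
    print(a0, a1)                      # approximately 1.18, 1.93
    print(np.polyfit(xj, fj, 1))       # same line, highest degree coefficient first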
\[
\beta_1 = \frac{\langle\psi_1, \psi_1\rangle_N}{\langle\psi_0, \psi_0\rangle_N} = 0.0825. \tag{5.59}
\]
Therefore ψ_2(x) = (x − 0.55)² − 0.0825. We can now use these orthogonal polynomials to find the Least Squares Approximation by a polynomial of degree at most two of a given set of data. Let us take f_j = x_j² + 2x_j + 3. Clearly, the Least Squares Approximation should be p_2(x) = x² + 2x + 3. Let us confirm this by using the orthogonal polynomials ψ_0, ψ_1 and ψ_2. The Least Squares Approximation coefficients are given by
\[
a_0 = \frac{\langle f, \psi_0\rangle_N}{\langle\psi_0, \psi_0\rangle_N} = 4.485, \tag{5.60}
\]
\[
a_1 = \frac{\langle f, \psi_1\rangle_N}{\langle\psi_1, \psi_1\rangle_N} = 3.1, \tag{5.61}
\]
\[
a_2 = \frac{\langle f, \psi_2\rangle_N}{\langle\psi_2, \psi_2\rangle_N} = 1, \tag{5.62}
\]
which gives p_2(x) = (x − 0.55)² − 0.0825 + 3.1(x − 0.55) + 4.485 = x² + 2x + 3.
We have studied in detail the case d = 1. Here we are interested in the case d ≫ 1.
If we add an extra component, equal to 1, to each data vector x_j so that now x_j = [1, x_{j1}, ..., x_{jd}]^T, for j = 1, ..., N, then we can write (5.63) as
\[
f(\mathbf{x}) = \mathbf{a}^T\mathbf{x}. \tag{5.64}
\]
The normal equations are given by the condition ∇_a J(a) = 0. Since ∇_a J(a) = −2Xᵀf + 2XᵀXa, we get the linear system of equations
\[
X^T X\,\mathbf{a} = X^T\mathbf{f}. \tag{5.68}
\]
among all vectors w in W. There is always at least one solution, which can be obtained by projecting f onto W, as Fig. 5.2 illustrates. First, note that if a ∈ ℝ^d is a solution of the normal equations (5.68) then the residual f − Xa is orthogonal to W because
\[
X^T(\mathbf{f} - X\mathbf{a}) = X^T\mathbf{f} - X^T X\mathbf{a} = 0. \tag{5.69}
\]

[Figure 5.2: Geometric interpretation of the solution Xa of the Least Squares problem as the orthogonal projection of f on the approximating linear subspace W.]
If X has full column rank, XᵀX is nonsingular and the solution is
\[
\mathbf{a}^* = (X^T X)^{-1} X^T\mathbf{f}. \tag{5.73}
\]
The d × N matrix
\[
X^{\dagger} = (X^T X)^{-1} X^T \tag{5.74}
\]
is called the pseudoinverse of X. In practice, the least squares problem is solved through the QR factorization
\[
X = QR, \qquad R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}. \tag{5.75}
\]
Here R1 is a d×d upper triangular matrix and the zero stands for a (N −d)×d
zero matrix.
Using X = QR we have
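The QR-based solution can be sketched in Python as follows (an illustration added here, not part of the notes): with the reduced factorization X = QR, the normal equations reduce to the triangular system R a = Qᵀf, which is well conditioned compared with forming XᵀX.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 200, 4
    X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d - 1))])  # column of ones
    f = rng.standard_normal(N)

    Q, R = np.linalg.qr(X, mode='reduced')    # X = QR, Q: N x d, R: d x d
    a_qr = np.linalg.solve(R, Q.T @ f)        # solve R a = Q^T f

    # compare with the normal equations (5.68)
    a_ne = np.linalg.solve(X.T @ X, X.T @ f)
    print(np.allclose(a_qr, a_ne))            # True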
Chapter 6

Computer Arithmetic

6.1 Floating Point Numbers

In a binary floating point system, a nonzero real number x is represented as
\[
x = \pm S \times 2^E, \quad 1 \le S < 2, \tag{6.1}
\]
where the significand S has the binary expansion
\[
S = (1.b_1 b_2 \cdots)_2. \tag{6.2}
\]
The relative error is generally more meaningful than the absolute error
to measure a given approximation.
where |δ_x|, |δ_y| ≤ eps. Therefore, for the relative error we get
\[
\left| \frac{x\cdot y - fl(x)\cdot fl(y)}{x\cdot y} \right| \approx |\delta_x + \delta_y|, \tag{6.12}
\]
which is acceptable.
Let us now consider addition (or subtraction):
\[
x = (1.01011100{*}{*})_2 \times 2^E,
\]
\[
y = (1.01011000{*}{*})_2 \times 2^E,
\]
where the ∗ stands for inaccurate bits (i.e. garbage) that, say, were generated in previous floating point computations. Then, in this 10 bit precision arithmetic,
Example 6.2. Sometimes we can rewrite the difference of two very close numbers to avoid digit cancellation. For example, suppose we would like to compute
\[
y = \sqrt{1 + x} - 1
\]
for x > 0 and very small. Clearly, we will have loss of digits if we proceed directly. However, if we rewrite y as
\[
y = \left(\sqrt{1+x} - 1\right)\frac{\sqrt{1+x} + 1}{\sqrt{1+x} + 1} = \frac{x}{\sqrt{1+x} + 1},
\]
then the computation can be performed at nearly machine precision level.
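The effect is easy to observe numerically. In the Python snippet below (added for illustration, double precision), the naive formula loses most of its significant digits for small x, while the rewritten form keeps nearly full accuracy; an accurate reference value is obtained from numpy's log1p/expm1.

    import numpy as np

    for x in [1e-8, 1e-12, 1e-15]:
        naive = np.sqrt(1.0 + x) - 1.0          # catastrophic cancellation
        safe  = x / (np.sqrt(1.0 + x) + 1.0)    # rewritten form
        exact = np.expm1(0.5 * np.log1p(x))     # accurate value of sqrt(1+x) - 1
        print(x, abs(naive - exact) / exact, abs(safe - exact) / exact)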
Chapter 7
Numerical Differentiation
etc. We are going to focus here on simple finite difference formulas obtained by differentiating low order interpolating polynomials.
Assuming x, x_0, ..., x_n ∈ [a, b] and f ∈ C^{n+1}[a, b], we have
\[
f(x) = p_n(x) + \frac{1}{(n+1)!} f^{(n+1)}(\xi(x))\,\omega_n(x), \tag{7.1}
\]
for some ξ(x) ∈ (a, b), where ω_n(x) = (x − x_0)(x − x_1)···(x − x_n). Thus,
\[
f'(x_0) = p'_n(x_0) + \frac{1}{(n+1)!}\left[ \frac{d}{dx} f^{(n+1)}(\xi(x))\,\omega_n(x) + f^{(n+1)}(\xi(x))\,\omega_n'(x) \right]_{x=x_0}.
\]
Therefore,
\[
p'_2(x_0) = \frac{f(x_0) - f(x_0 - h)}{h} + \frac{f(x_0 + h) - 2f(x_0) + f(x_0 - h)}{2h^2}\, h,
\]
and thus
\[
p'_2(x_0) = \frac{f(x_0 + h) - f(x_0 - h)}{2h}. \tag{7.9}
\]
This defines the Centered Difference Formula to approximate f'(x_0):
\[
D_h^0 f(x_0) := \frac{f(x_0 + h) - f(x_0 - h)}{2h}. \tag{7.10}
\]
Its error is
\[
f'(x_0) - D_h^0 f(x_0) = \frac{1}{3!} f'''(\xi)(x_0 - x_1)(x_0 - x_2) = -\frac{1}{6} f'''(\xi)\, h^2. \tag{7.11}
\]
Example 7.4. Let n = 2 and x_1 = x_0 + h, x_2 = x_0 + 2h. The table of divided differences is
\[
\begin{array}{lll}
x_0 \;\; f(x_0) & & \\
& \dfrac{f(x_0+h) - f(x_0)}{h} & \\
x_0 + h \;\; f(x_0+h) & & \dfrac{f(x_0+2h) - 2f(x_0+h) + f(x_0)}{2h^2} \\
& \dfrac{f(x_0+2h) - f(x_0+h)}{h} & \\
x_0 + 2h \;\; f(x_0+2h) & &
\end{array}
\]
and
\[
p'_2(x_0) = \frac{f(x_0+h) - f(x_0)}{h} + \frac{f(x_0+2h) - 2f(x_0+h) + f(x_0)}{2h^2}\,(-h),
\]
thus
\[
p'_2(x_0) = \frac{-f(x_0+2h) + 4f(x_0+h) - 3f(x_0)}{2h}. \tag{7.12}
\]
If we use this sided difference to approximate f'(x_0), the error is
\[
f'(x_0) - p'_2(x_0) = \frac{1}{3!} f'''(\xi)(x_0 - x_1)(x_0 - x_2) = \frac{1}{3}\, h^2 f'''(\xi), \tag{7.13}
\]
which is twice as large as that of the Centered Difference Formula.
Chapter 8

Numerical Integration
where
\[
E[f] = \int_{-1}^{1} f[-1, 0, 1, x]\, x(x^2 - 1)\,dx \tag{8.7}
\]
is the error. Note that x(x² − 1) changes sign in [−1, 1] so we cannot use the Mean Value Theorem for integrals. However, if we add another node, x_4, we can relate f[−1, 0, 1, x] to the fourth order divided difference f[−1, 0, 1, x_4, x], which will make the integral in (8.7) easier to evaluate:
The first integral is zero, because the integrand is odd. Now we choose x_4 symmetrically, x_4 = 0, so that x(x² − 1)(x − x_4) does not change sign in [−1, 1] and
\[
E[f] = \int_{-1}^{1} f[-1, 0, 1, 0, x]\, x^2(x^2 - 1)\,dx = \int_{-1}^{1} f[-1, 0, 0, 1, x]\, x^2(x^2 - 1)\,dx. \tag{8.9}
\]
Since
\[
f[-1, 0, 0, 1, x] = \frac{f^{(4)}(\xi(x))}{4!}, \tag{8.10}
\]
and assuming f ∈ C⁴[−1, 1], by the Mean Value Theorem for integrals there is η ∈ (−1, 1) such that
\[
E[f] = \frac{f^{(4)}(\eta)}{4!}\int_{-1}^{1} x^2(x^2 - 1)\,dx = -\frac{4}{15}\,\frac{f^{(4)}(\eta)}{4!} = -\frac{1}{90}\, f^{(4)}(\eta). \tag{8.11}
\]
Note that Simpson's elementary quadrature gives the exact value of the integral when f is a polynomial of degree 3 or less (the error is proportional to the fourth derivative), even though we used a second order polynomial to approximate the integrand. We gain extra precision because of the symmetry of the quadrature around 0. In fact, we could have derived Simpson's quadrature by using the Hermite (third order) interpolating polynomial of f at −1, 0, 0, 1.
To obtain the corresponding formula for a general interval [a, b] we use the change of variables (8.4),
\[
\int_a^b f(x)\,dx = \frac{1}{2}(b - a)\int_{-1}^{1} F(t)\,dt,
\]
where
\[
F(t) = f\!\left( \frac{1}{2}(a + b) + \frac{1}{2}(b - a)t \right). \tag{8.13}
\]
2. What is that k?
Consider quadratures for integrals of the form ∫_a^b f(x)w(x)dx, where w is an admissible weight function (w ≥ 0, ∫_a^b w(x)dx > 0, and ∫_a^b x^k w(x)dx < +∞ for k = 0, 1, ...), w ≡ 1 being a particular case. The interval of integration [a, b] can be either finite or infinite (e.g. [0, +∞), (−∞, +∞)).
Example 8.1. The trapezoidal rule quadrature has degree of precision 1 while
the Simpson quadrature has degree of precision 3.
where
\[
l_j(x) = \prod_{\substack{k=0\\ k\neq j}}^{n} \frac{x - x_k}{x_j - x_k}, \quad j = 0, 1, \ldots, n. \tag{8.17}
\]
if ⟨ψ_k, q⟩ = 0 for all polynomials q of degree less than k. Recall also that the zeros of the orthogonal polynomials are real, simple, and contained in [a, b] (see Theorem 5.1).
Definition 8.2. Let ψ_{n+1} be the (n + 1)st orthogonal polynomial and let x_0, x_1, ..., x_n be its n + 1 zeros. Then the interpolatory quadrature (8.18) with the nodes so chosen is called a Gaussian quadrature.
The first integral on the right hand side is zero because of orthogonality. For the second integral the quadrature is exact (it is interpolatory). Therefore
\[
\int_a^b f(x)w(x)\,dx = \sum_{j=0}^{n} A_j\, r(x_j). \tag{8.24}
\]
Example 8.2. Consider the interval [−1, 1] and the weight function w ≡ 1. The corresponding orthogonal polynomials are the Legendre polynomials 1, x, x² − 1/3, x³ − (3/5)x, .... Take n = 1. The roots of ψ_2 are x_0 = −√(1/3) and x_1 = √(1/3). Therefore, the corresponding Gaussian quadrature is
\[
\int_{-1}^{1} f(x)\,dx \approx A_0\, f\!\left( -\sqrt{\tfrac{1}{3}} \right) + A_1\, f\!\left( \sqrt{\tfrac{1}{3}} \right), \tag{8.26}
\]
where
\[
A_0 = \int_{-1}^{1} l_0(x)\,dx, \tag{8.27}
\]
\[
A_1 = \int_{-1}^{1} l_1(x)\,dx. \tag{8.28}
\]
Example 8.3. Let us take again the interval [−1, 1] but now with the weight function w(x) = 1/√(1 − x²). The corresponding orthogonal polynomials are the Chebyshev polynomials, and for n = 1 the nodes are the zeros of T_2, i.e.
\[
\cos\frac{\pi}{4} = \sqrt{\tfrac{1}{2}}, \qquad \cos\frac{3\pi}{4} = -\sqrt{\tfrac{1}{2}}. \tag{8.34}
\]
The corresponding (Gauss-Chebyshev) quadrature is
\[
Q_1[f] = \frac{\pi}{2}\left[ f\!\left( -\sqrt{\tfrac{1}{2}} \right) + f\!\left( \sqrt{\tfrac{1}{2}} \right) \right]. \tag{8.37}
\]
Theorem 8.3. For a Gaussian quadrature all the quadrature weights are positive and they sum up to ‖w‖₁, i.e.,
(a) A_j > 0 for j = 0, 1, ..., n;
(b)
\[
\sum_{j=0}^{n} A_j = \int_a^b w(x)\,dx.
\]
Proof. (a)
for k = 0, 1, ..., n.
(b) Take f(x) ≡ 1; then
\[
\int_a^b w(x)\,dx = \sum_{j=0}^{n} A_j. \tag{8.39}
\]
Proof. Let p*_{2n+1} be the best uniform approximation to f (in the max norm, ‖f‖_∞ = max_{x∈[a,b]} |f(x)|) by polynomials of degree ≤ 2n + 1. Then E_n[p*_{2n+1}] = 0, and therefore
\[
E_n[f] = E_n[f - p^*_{2n+1}] = \int_a^b [f(x) - p^*_{2n+1}(x)]\,w(x)\,dx - \sum_{j=0}^{n} A_j\left[ f(x_j) - p^*_{2n+1}(x_j) \right].
\]
Taking the absolute value, using the triangle inequality, and the fact that the quadrature weights are positive, we obtain
\[
\begin{aligned}
|E_n[f]| &\le \int_a^b |f(x) - p^*_{2n+1}(x)|\,w(x)\,dx + \sum_{j=0}^{n} A_j |f(x_j) - p^*_{2n+1}(x_j)| \\
&\le \|f - p^*_{2n+1}\|_\infty \int_a^b w(x)\,dx + \|f - p^*_{2n+1}\|_\infty \sum_{j=0}^{n} A_j \\
&= 2\|w\|_1\, \|f - p^*_{2n+1}\|_\infty.
\end{aligned}
\]
That is, the rate of convergence is not fixed; it depends on the number of derivatives the integrand has. We say in this case that the approximation is spectral. In particular, if f ∈ C^∞[a, b] then the error decreases to zero faster than any power of 1/(2n).
Let Π_n(θ) = p_n(cos θ) and F(θ) = f(cos θ). By extending F evenly over [−π, 0] (or over [π, 2π]) and using Theorem 4.2, we conclude that Π_n(θ) interpolates F(θ) = f(cos θ) at the equally spaced points θ_j = jπ/n, j = 0, 1, ..., n, if and only if
\[
a_k = \frac{2}{n}\sum_{j=0}^{n}{}^{''}\; F(\theta_j)\cos k\theta_j, \quad k = 0, 1, \ldots, n. \tag{8.53}
\]
Changing variables to x = cos θ and approximating F(θ) by its interpolant Π_n(θ) = P_n(cos θ), we obtain the corresponding quadrature
\[
\int_{-1}^{1} f(x)\,dx \approx \int_0^{\pi} \Pi_n(\theta)\sin\theta\,d\theta. \tag{8.55}
\]
Assuming n even and using cos kθ sin θ = ½[sin(1 + k)θ + sin(1 − k)θ], we get the Clenshaw-Curtis quadrature:
\[
\int_{-1}^{1} f(x)\,dx \approx a_0 + \sum_{\substack{k=2\\ k \text{ even}}}^{n-2} \frac{2a_k}{1 - k^2} + \frac{a_n}{1 - n^2}. \tag{8.57}
\]
and since the elementary Simpson quadrature applied to [x_j, x_{j+2}] reads
\[ \int_{x_j}^{x_{j+2}} f(x)\,dx = \frac{h}{3}\left[ f(x_j) + 4 f(x_{j+1}) + f(x_{j+2}) \right] - \frac{1}{90} f^{(4)}(\eta_j)\, h^5 \]  (8.62)
for some η_j ∈ (x_j, x_{j+2}), summing up all the N/2 contributions we get the composite Simpson quadrature:
\[ \int_a^b f(x)\,dx = \frac{h}{3}\left[ f(a) + 2\sum_{j=1}^{N/2-1} f(x_{2j}) + 4\sum_{j=1}^{N/2} f(x_{2j-1}) + f(b) \right] - \frac{1}{180}(b-a)\, h^4 f^{(4)}(\eta), \]
for some η ∈ (a, b).
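A short Python sketch of the composite Simpson quadrature (the names are my own; N must be even):

import numpy as np

def composite_simpson(f, a, b, N):
    # Composite Simpson rule with N (even) subintervals of width h = (b - a)/N.
    if N % 2 != 0:
        raise ValueError("N must be even")
    h = (b - a) / N
    fx = f(a + h * np.arange(N + 1))
    # interior even-indexed nodes get weight 2, odd-indexed nodes get weight 4
    return h / 3 * (fx[0] + 2 * np.sum(fx[2:N:2]) + 4 * np.sum(fx[1:N:2]) + fx[N])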
\[ p_3(x) = f(0) + f[0,0]\,x + f[0,0,1]\,x^2 + f[0,0,1,1]\,x^2(x-1), \]  (8.63)
and thus
\[ \int_0^1 p_3(x)\,dx = f(0) + \frac12 f'(0) + \frac13 f[0,0,1] - \frac{1}{12} f[0,0,1,1]. \]  (8.64)
Thus,
\[ \int_0^1 p_3(x)\,dx = f(0) + \frac12 f'(0) + \frac13\left[ f(1) - f(0) - f'(0) \right] - \frac{1}{12}\left[ f'(0) + f'(1) + 2(f(0) - f(1)) \right] \]
and simplifying the right hand side we get
\[ \int_0^1 p_3(x)\,dx = \frac12\left[ f(0) + f(1) \right] + \frac{1}{12}\left[ f'(0) - f'(1) \right], \]  (8.65)
which is the simple trapezoidal rule plus a correction involving the derivative
of the integrand at the end points.
We can obtain an expression for the error of this quadrature formula by recalling that the Cauchy remainder in the interpolation is
\[ f(x) - p_3(x) = \frac{1}{4!} f^{(4)}(\xi(x))\, x^2 (x-1)^2 \]  (8.66)
and since x²(x − 1)² does not change sign in [0, 1] we can use the Mean Value Theorem for integrals to get
\[ E[f] = \int_0^1 \left[ f(x) - p_3(x) \right] dx = \frac{1}{4!} f^{(4)}(\eta) \int_0^1 x^2(x-1)^2\,dx = \frac{1}{720} f^{(4)}(\eta). \]  (8.67)
Indeed,
\[ B_{k+1}''(x) = (k+1) B_k'(x) = (k+1)k\, B_{k-1}(x) \]  (8.74)
for k = 1, 2, . . ..
Proof. We will prove it by contradiction. Let us suppose that C_{2m}(x) changes sign. Then it has at least 3 zeros and, by Rolle's theorem, C'_{2m}(x) = B'_{2m}(x) has at least 2 zeros in (0, 1). This implies that B_{2m−1}(x) has 2 zeros in (0, 1). Since B_{2m−1}(0) = B_{2m−1}(1) = 0, again by Rolle's theorem, B'_{2m−1}(x) has 3 zeros in (0, 1), which implies that B_{2m−2}(x) has 3 zeros, etc. We then conclude that B_{2l−1}(x) has 2 zeros in (0, 1) plus the two at the end points, B_{2l−1}(0) = B_{2l−1}(1), for all l = 1, 2, . . ., which is a contradiction (for l = 1, 2).
\[ B_0(x) = 1 \]  (8.80)
\[ B_1(x) = x - \tfrac12 \]  (8.81)
\[ B_2(x) = \left(x - \tfrac12\right)^2 - \tfrac{1}{12} = x^2 - x + \tfrac16 \]  (8.82)
\[ B_3(x) = \left(x - \tfrac12\right)^3 - \tfrac14\left(x - \tfrac12\right) = x^3 - \tfrac32 x^2 + \tfrac12 x \]  (8.83)
\[ B_4(x) = \left(x - \tfrac12\right)^4 - \tfrac12\left(x - \tfrac12\right)^2 + \tfrac{7}{5\cdot 48} = x^4 - 2x^3 + x^2 - \tfrac{1}{30}. \]  (8.84)
\[ -\int_0^1 f'(x) B_1(x)\,dx = -\frac12 \int_0^1 f'(x) B_2'(x)\,dx = \frac{B_2}{2}\left[ f'(0) - f'(1) \right] + \frac12 \int_0^1 f''(x) B_2(x)\,dx \]  (8.85)
and
\[
\begin{aligned}
\frac12 \int_0^1 f''(x) B_2(x)\,dx &= \frac{1}{2\cdot 3}\int_0^1 f''(x) B_3'(x)\,dx \\
&= \frac{1}{2\cdot 3}\left[ f''(x) B_3(x)\Big|_0^1 - \int_0^1 f'''(x) B_3(x)\,dx \right] \\
&= -\frac{1}{2\cdot 3}\int_0^1 f'''(x) B_3(x)\,dx = -\frac{1}{2\cdot 3\cdot 4}\int_0^1 f'''(x) B_4'(x)\,dx \\
&= \frac{B_4}{4!}\left[ f'''(0) - f'''(1) \right] + \frac{1}{4!}\int_0^1 f^{(4)}(x) B_4(x)\,dx.
\end{aligned}
\]  (8.86)
Continuing this way we arrive at the Euler-Maclaurin formula for the simple
trapezoidal rule in [0, 1]:
Theorem 8.5.
\[ \int_0^1 f(x)\,dx = \frac12\left[ f(0) + f(1) \right] + \sum_{k=1}^{m} \frac{B_{2k}}{(2k)!}\left[ f^{(2k-1)}(0) - f^{(2k-1)}(1) \right] + R_m \]  (8.87)
where
\[ R_m = \frac{1}{(2m+2)!}\int_0^1 f^{(2m+2)}(x)\left[ B_{2m+2}(x) - B_{2m+2} \right] dx \]  (8.88)
and, using (8.79), the Mean Value Theorem for integrals, and Lemma 3,
\[ R_m = \frac{1}{(2m+2)!} f^{(2m+2)}(\eta)\int_0^1 \left[ B_{2m+2}(x) - B_{2m+2} \right] dx = -\frac{B_{2m+2}}{(2m+2)!} f^{(2m+2)}(\eta). \]  (8.89)
It is now straightforward to obtain the Euler-Maclaurin formula for the composite trapezoidal rule with equally spaced points:
we have
\[ \int_a^b f(x)\,dx = \frac{4 T_h[f] - T_{2h}[f]}{3} + \tilde c_4 h^4 + \tilde c_6 h^6 + \cdots \]  (8.94)
We can continue the Richardson extrapolation process but we can do this
more efficiently if we reuse the work we have done to compute T2h [f ] to
evaluate Th [f ]. To this end, we note that
\[ T_h[f] - \frac12 T_{2h}[f] = h\sum_{j=0}^{N}{}'' f(a + jh) - h\sum_{j=0}^{N/2}{}'' f(a + 2jh) = h\sum_{j=1}^{N/2} f(a + (2j-1)h). \]
If we let h_l = (b − a)/2^l, then
\[ T_{h_l}[f] = \frac12 T_{h_{l-1}}[f] + h_l \sum_{j=1}^{2^{l-1}} f(a + (2j-1) h_l). \]  (8.95)
Beginning with the simple trapezoidal rule (two points) we can successively
double the number of points in the quadrature by using (8.95) and immedi-
ately do extrapolation.
Let
\[ R(0,0) = T_{h_0}[f] = \frac{b-a}{2}\left[ f(a) + f(b) \right] \]  (8.96)
and for l = 1, 2, . . . , M define
\[ R(l, 0) = \frac12 R(l-1, 0) + h_l \sum_{j=1}^{2^{l-1}} f(a + (2j-1) h_l). \]  (8.97)
h = b − a;
R(0, 0) = (1/2)(b − a)[f(a) + f(b)];
for l = 1 : M
    h = h/2;
    R(l, 0) = (1/2) R(l − 1, 0) + h * sum_{j=1}^{2^(l−1)} f(a + (2j − 1)h);
    for m = 1 : l
        R(l, m) = R(l, m − 1) + [R(l, m − 1) − R(l − 1, m − 1)]/(4^m − 1);
    end
end
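A Python sketch of the same Romberg procedure (array-based; the names are mine, not the author's):

import numpy as np

def romberg(f, a, b, M):
    # Build the triangular array R(l, m) of (8.96)-(8.97) with Richardson extrapolation.
    R = np.zeros((M + 1, M + 1))
    h = b - a
    R[0, 0] = 0.5 * h * (f(a) + f(b))
    for l in range(1, M + 1):
        h /= 2.0
        # reuse the previous level: only the new (odd) nodes are evaluated, as in (8.95)
        R[l, 0] = 0.5 * R[l - 1, 0] + h * sum(f(a + (2 * j - 1) * h) for j in range(1, 2**(l - 1) + 1))
        for m in range(1, l + 1):
            R[l, m] = R[l, m - 1] + (R[l, m - 1] - R[l - 1, m - 1]) / (4**m - 1)
    return R[M, M]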
Chapter 9
Linear Algebra
Ax = b (9.1)
Av = λv. (9.4)
probability vector so all its entries are positive, add up to 1, and represent
the probabilities of the system described by the Markov process to be in a
given state in the limit as time goes to infinity. This eigenvector v is in effect
a fixed point of the linear transformation represented by the Markov matrix
A.
The third problem is related to the second one and finds applications
in image compression, model reduction techniques, data analysis, and many
other fields. Given an m × n matrix A, the idea is to consider the eigenvalues
and eigenvectors of the square, n × n matrix AT A (or A∗ A, where A∗ is the
conjugate transpose of A as defined below, if A is complex). As we will see,
the eigenvalues are all real and nonnegative and AT A has a complete set of
orthogonal eigenvectors. The singular values of a matrix A are the positive
square roots of the eigenvalues of AT A. Using this, it follows that any real
m × n matrix A has the singular value decomposition (SVD)
U T AV = Σ, (9.7)
where U is an orthogonal m × m matrix, V is an orthogonal n × n matrix,
and Σ is a “diagonal” matrix of the form
\[ \Sigma = \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}, \qquad D = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r), \]  (9.8)
where σ₁ ≥ σ₂ ≥ · · · ≥ σ_r > 0 are the nonzero singular values of A.
9.2 Notation
A matrix A with elements aij will be denoted A = (aij ), this could be a
square n × n matrix or an m × n matrix. AT denotes the transpose of A, i.e.
AT = (aji ).
A vector x ∈ R^n will be represented as the n-tuple
\[ x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}. \]  (9.9)
The canonical vectors, corresponding to the standard basis in R^n, will be denoted by e_1, e_2, . . . , e_n, where e_k is the n-vector with all entries equal to zero except the k-th one, which is equal to one.
If the vectors are complex, i.e. x and y in C^n, we define their inner product as
\[ \langle x, y\rangle = \sum_{i=1}^{n} \bar{x}_i\, y_i, \]  (9.11)
that is
\[ \langle x, Ay\rangle = \langle A^T x, y\rangle. \]  (9.14)
Similarly, in the complex case we have
\[ \langle x, Ay\rangle = \langle A^* x, y\rangle, \]  (9.15)
where A^* is the conjugate transpose of A, i.e. A^* = (\bar{a}_{ji}).
But if A^T = A then
\[ A_k = \begin{bmatrix} a_{11} & \cdots & a_{1k} \\ \vdots & & \vdots \\ a_{k1} & \cdots & a_{kk} \end{bmatrix}, \qquad k = 1, \ldots, n. \]  (9.20)
\[ x = \begin{bmatrix} y_1 \\ \vdots \\ y_k \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \]  (9.21)
The converse of Theorem 9.1 is also true but the proof is much more
technical: A is positive definite if and only if det(Ak ) > 0 for k = 1, . . . , n.
Note also that if A is positive definite then all its diagonal elements are
positive because 0 < hej , Aej i = ajj , for j = 1, . . . , n.
where λ1 , . . . , λn are the eigenvalues of A and all the elements below the
diagonal are zero.
\[ T^* A T = T_{k+1}^*\, T_1^* A T_1\, T_{k+1} = T_{k+1}^*\, (T_1^* A T_1)\, T_{k+1} \]  (9.26)
9.5 Norms
A norm on a vector space V (for example R^n or C^n) over K = R (or C) is a mapping ‖·‖ : V → [0, ∞) which satisfies the following properties:
(i) ‖x‖ ≥ 0 for all x ∈ V, and ‖x‖ = 0 iff x = 0.
Example 9.1.
This is called the Frobenius norm for matrices. A different matrix norm can
be obtained by using a given vector norm and matrix-vector multiplication.
Given a vector norm ‖·‖ in R^n (or in C^n), it is easy to show that
\[ \|A\| = \max_{x\neq 0} \frac{\|Ax\|}{\|x\|} \]  (9.34)
satisfies the properties (i), (ii), (iii) of a norm for all n × n matrices A. That is, the vector norm induces a matrix norm.
Definition 9.6. The matrix norm defined by (9.34) is called the subordinate or natural norm induced by the vector norm ‖·‖.
Example 9.2.
\[ \|A\|_1 = \max_{x\neq 0} \frac{\|Ax\|_1}{\|x\|_1}, \]  (9.35)
\[ \|A\|_\infty = \max_{x\neq 0} \frac{\|Ax\|_\infty}{\|x\|_\infty}, \]  (9.36)
\[ \|A\|_2 = \max_{x\neq 0} \frac{\|Ax\|_2}{\|x\|_2}. \]  (9.37)
(b) \[ \|A\|_\infty = \max_{i} \sum_{j=1}^{n} |a_{ij}|, \]
(c) \[ \|A\|_2 = \sqrt{\rho(A^T A)}, \]
Proof. (a)
\[ \|Ax\|_1 = \sum_{i=1}^{n}\Bigl|\sum_{j=1}^{n} a_{ij} x_j\Bigr| \le \sum_{j=1}^{n} |x_j| \sum_{i=1}^{n} |a_{ij}| \le \Bigl(\max_j \sum_{i=1}^{n} |a_{ij}|\Bigr)\|x\|_1. \]
Thus, ‖A‖₁ ≤ max_j Σ_{i=1}^{n} |a_{ij}|. We just need to show there is a vector x for which the equality holds. Let j* be the index such that
\[ \sum_{i=1}^{n} |a_{ij^*}| = \max_j \sum_{i=1}^{n} |a_{ij}|. \]  (9.40)
(c) By definition
\[ \|A\|_2^2 = \max_{x\neq 0} \frac{\|Ax\|_2^2}{\|x\|_2^2} = \max_{x\neq 0} \frac{x^T A^T A x}{x^T x}. \]  (9.46)
Note that the matrix A^T A is symmetric and all its eigenvalues are nonnegative. Let us label them in increasing order, 0 ≤ λ₁ ≤ λ₂ ≤ · · · ≤ λ_n. Then λ_n = ρ(A^T A). Now, since A^T A is symmetric, there is an orthogonal matrix Q such that Q^T A^T A Q = D = diag(λ₁, . . . , λ_n). Therefore, changing variables, x = Qy, we have
\[ \frac{x^T A^T A x}{x^T x} = \frac{y^T D y}{y^T y} = \frac{\lambda_1 y_1^2 + \cdots + \lambda_n y_n^2}{y_1^2 + \cdots + y_n^2} \le \lambda_n. \]  (9.47)
Now take the vector y such that y_j = 0 for j ≠ n and y_n = 1 and the equality holds. Thus,
\[ \|A\|_2 = \sqrt{\max_{x\neq 0} \frac{\|Ax\|_2^2}{\|x\|_2^2}} = \sqrt{\lambda_n} = \sqrt{\rho(A^T A)}. \]  (9.48)
Now,
\[ \le \left\| \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \right\|_\infty + \left\| \begin{bmatrix} 0 & \delta b_{12} & \delta^2 b_{13} & \cdots & \delta^{n-1} b_{1n} \\ & 0 & \delta b_{23} & \cdots & \delta^{n-2} b_{2n} \\ & & \ddots & \ddots & \vdots \\ & & & 0 & \delta b_{n-1,n} \\ & & & & 0 \end{bmatrix} \right\|_\infty \le \rho(A) + \epsilon. \]
The exact solution of this linear system is x = [1, 1, 1, 1, 1]^T. Note that b ≈ [2.28, 1.45, 1.09, 0.88, 0.74]^T. Let us perturb b slightly (by about 1%),
\[ b + \delta b = \begin{bmatrix} 2.28 \\ 1.46 \\ 1.10 \\ 0.89 \\ 0.75 \end{bmatrix}. \]  (9.60)
\[ \frac{1}{\|x\|} \le \|A\| \frac{1}{\|b\|}. \]  (9.63)
\[ \frac{\|\delta x\|}{\|x\|} \le \|A\|\,\|A^{-1}\| \frac{\|\delta b\|}{\|b\|}. \]  (9.64)
The right hand side of this inequality is actually a least upper bound; there are b and δb for which the equality holds.
Because, for any induced norm, 1 = ‖I‖ = ‖A^{-1}A‖ ≤ ‖A^{-1}‖‖A‖, we get that κ(A) ≥ 1. We say that A is ill-conditioned if κ(A) is very large.
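To make the effect of a large condition number concrete, here is a hedged Python illustration; the values of b above suggest that the example matrix is the 5 × 5 Hilbert matrix, but that identification is my assumption, as is the exact perturbation used:

import numpy as np

n = 5
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])  # assumed Hilbert matrix
x = np.ones(n)
b = A @ x                                        # b ~ [2.28, 1.45, 1.09, 0.88, 0.74]
db = np.array([0.0, 0.01, 0.01, 0.01, 0.01])     # roughly a 1% perturbation of b
dx = np.linalg.solve(A, b + db) - x
print(np.linalg.cond(A))                          # kappa_2(A) is of order 10^5
print(np.linalg.norm(dx) / np.linalg.norm(x))     # relative change in the solution is much larger than 1%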
or in pseudo-code:
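(The pseudo-code listing itself did not survive extraction; the following Python sketch of forward substitution, consistent with (10.8), is a stand-in and not the author's original listing.)

import numpy as np

def forward_substitution(L, b):
    # Solve L x = b for a lower triangular, nonsingular L (so L[i, i] != 0).
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        # x_i = (b_i - sum_{j<i} L_ij x_j) / L_ii: i - 1 multiplications,
        # i - 1 additions/subtractions, and one division.
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x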
Note that the assumption that A is nonsingular implies that aii 6= 0 for all
i = 1, 2, . . . , n since det(A) = a11 a22 · · · ann . Also observe that (10.8) shows
that xi is a linear combination of bi , bi−1 , . . . , b1 and since x = A−1 b it follows
that A−1 is also lower triangular.
To compute xi we perform i−1 multiplications, i−1 additions/subtractions,
and one division, so the total amount of computational work W (n) to do for-
ward substitution is
\[ W(n) = 2\sum_{i=1}^{n} (i-1) + n = n^2, \]  (10.9)
where we have used
\[ \sum_{i=1}^{n} (i-1) = \frac{n(n-1)}{2}. \]  (10.10)
we solve the linear system Ax = b starting from xn , then we solve for xn−1 ,
\[
\begin{aligned}
x_1 + 2x_2 - x_3 + x_4 &= 0, \\
2x_1 + 4x_2 - x_4 &= -3, \\
3x_1 + x_2 - x_3 + x_4 &= 3, \\
x_1 - x_2 + 2x_3 + x_4 &= 3.
\end{aligned}
\]  (10.13)
The last matrix, let us call it U_b, corresponds to the upper triangular system
\[ \begin{bmatrix} 1 & 2 & -1 & 1 \\ 0 & -5 & 2 & -2 \\ 0 & 0 & 2 & -3 \\ 0 & 0 & 0 & \tfrac{39}{10} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ -3 \\ \tfrac{39}{10} \end{bmatrix}, \]  (10.19)
Each of the steps in the Gaussian elimination process is a linear transformation and hence we can represent these transformations with matrices. Note, however, that these matrices are not constructed in practice; we only implement their effect (row exchange or elimination). The first round of elimination (10.15) is equivalent to multiplying (from the left) A_b by the lower triangular matrix
\[ E_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -2 & 1 & 0 & 0 \\ -3 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{bmatrix}, \]  (10.21)
that is
\[ E_1 A_b = \begin{bmatrix} 1 & 2 & -1 & 1 & 0 \\ 0 & 0 & 2 & -3 & -3 \\ 0 & -5 & 2 & -2 & 3 \\ 0 & -3 & 3 & 0 & 3 \end{bmatrix}. \]  (10.22)
and we get
\[ E_2 P E_1 A_b = \begin{bmatrix} 1 & 2 & -1 & 1 & 0 \\ 0 & -5 & 2 & -2 & 3 \\ 0 & 0 & 2 & -3 & -3 \\ 0 & 0 & \tfrac95 & \tfrac65 & \tfrac65 \end{bmatrix}. \]  (10.26)
and E3 E2 P E1 Ab = Ub .
Observe that P E_1 A_b = E_1' P A_b, where
\[ E_1' = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -3 & 1 & 0 & 0 \\ -2 & 0 & 1 & 0 \\ -1 & 0 & 0 & 1 \end{bmatrix}, \]  (10.28)
i.e., we exchange rows in advance and then reorder the multipliers accordingly. If we focus on the matrix A, the first four columns of A_b, we have the matrix factorization
\[ E_3 E_2 E_1' P A = U, \]  (10.29)
where U is the upper triangular matrix
\[ U = \begin{bmatrix} 1 & 2 & -1 & 1 \\ 0 & -5 & 2 & -2 \\ 0 & 0 & 2 & -3 \\ 0 & 0 & 0 & \tfrac{39}{10} \end{bmatrix}. \]  (10.30)
Moreover, the product of upper (lower) triangular matrices is also an upper (lower) triangular matrix and so is the inverse. Hence, we obtain the so-called LU factorization
\[ PA = LU, \]  (10.31)
where L = (E_3 E_2 E_1')^{-1} = E_1'^{-1} E_2^{-1} E_3^{-1} is a lower triangular matrix. Now recall that the matrices E_1', E_2, E_3 perform the transformation of subtracting the row of the pivot times the multiplier from the rows below. Therefore, the inverse operation is to add the subtracted row back, i.e. we simply remove the negative sign in front of the multipliers,
\[ E_1'^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}, \quad E_2^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & \tfrac35 & 0 & 1 \end{bmatrix}, \quad E_3^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & \tfrac{9}{10} & 1 \end{bmatrix}. \]
It then follows that
\[ L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 1 & \tfrac35 & \tfrac{9}{10} & 1 \end{bmatrix}. \]  (10.32)
Note that L has all the multipliers below the diagonal and U has all the
pivots on the diagonal. We will see that a factorization P A = LU is always
possible for any nonsingular n × n matrix A and can be very useful.
We now consider the general linear system (10.1). The matrix of coefficients and the right hand side are
\[ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}, \]  (10.33)
1. Find max_i |a_{i1}|; let us say this corresponds to the m-th row, i.e. |a_{m1}| = max_i |a_{i1}|. If |a_{m1}| = 0, the matrix is singular. Stop.
This corresponds to A_b^{(1)} = E_1 P_1 A_b, where P_1 is the permutation matrix
that exchanges rows 1 and m (P1 = I if no exchange is made) and E1 is the
matrix to obtain the elimination of the entries below the first element in the
first column. The same three steps above can now be applied to the smaller
(n − 1) × n matrix
and so on. Doing this process (n − 1) times, we obtain the reduced, upper
triangular system, which can be solved with backward substitution.
In matrix terms, the linear transformations in the Gaussian elimination process correspond to A_b^{(k)} = E_k P_k A_b^{(k-1)}, for k = 1, 2, . . . , n − 1 (with A_b^{(0)} = A_b), where the P_k and E_k are permutation and elimination matrices, respectively. P_k = I if no row exchange is made prior to the k-th elimination round (but recall that we do not construct the matrices E_k and P_k in practice). Hence, the Gaussian elimination process for a nonsingular linear system produces the matrix factorization
\[ U_b \equiv A_b^{(n-1)} = E_{n-1} P_{n-1} E_{n-2} P_{n-2} \cdots E_1 P_1 A_b. \]  (10.37)
\[ U_b \equiv A_b^{(n-1)} = E_{n-1}' E_{n-2}' \cdots E_1' P A_b. \]  (10.38)
Since the inverse of E'_{n-1} E'_{n-2} \cdots E'_1 is the lower triangular matrix
\[ L = \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0 \\ l_{21} & 1 & 0 & \cdots & 0 \\ l_{31} & l_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{n,n-1} & 1 \end{bmatrix}, \]  (10.39)
where the lij , j = 1, . . . , n − 1, i = j + 1, . . . , n are the multipliers (com-
puted after all the rows have been rearranged), we arrive at the anticipated
factorization P A = LU . Incidentally, up to sign, Gaussian elimination also
produces the determinant of A because
\[ \det(PA) = \pm\det(A) = \det(LU) = \det(U) = a_{11}^{(1)} a_{22}^{(2)} \cdots a_{nn}^{(n)} \]  (10.40)
and so det(A) is plus or minus the product of all the pivots in the elimination
process.
In the implementation of Gaussian elimination the array storing the aug-
mented matrix Ab is overwritten to save memory. The pseudo code with
partial pivoting (assuming ai,n+1 = bi , i = 1, . . . , n) is presented in Algo-
rithm 10.3.
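A Python sketch in the spirit of Algorithm 10.3 (not the author's exact listing): Gaussian elimination with partial pivoting on the augmented matrix, followed by backward substitution.

import numpy as np

def gaussian_elimination(A, b):
    # Overwrite an augmented array [A | b], pivoting on the largest entry in each column.
    n = len(b)
    Ab = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])
    for k in range(n - 1):
        m = k + np.argmax(np.abs(Ab[k:, k]))      # partial pivoting
        if Ab[m, k] == 0.0:
            raise ValueError("matrix is singular")
        Ab[[k, m]] = Ab[[m, k]]                    # row exchange
        for i in range(k + 1, n):
            l = Ab[i, k] / Ab[k, k]                # multiplier l_ik
            Ab[i, k:] -= l * Ab[k, k:]
    # backward substitution on the reduced upper triangular system
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (Ab[i, n] - Ab[i, i + 1:n] @ x[i + 1:n]) / Ab[i, i]
    return x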
we get
\[ W(n) = \frac{2}{3} n^3 + O(n^2). \]  (10.43)
Thus, Gaussian elimination is computationally rather expensive for large
systems of equations.
Ly = b, (10.44)
U x = y. (10.45)
Given b, we can solve the first system for y with forward substitution and
then we solve the second system for x with backward substitution. Thus,
while the LU factorization of A has an O(n3 ) cost, subsequent solutions to
the linear system with the same matrix A but different right hand sides can
be done in O(n2 ) operations.
When can we obtain the factorization A = LU? The following result provides a useful sufficient condition.
\[ L_2^{-1} L_1 = U_2 U_1^{-1}. \]  (10.47)
But the matrix on the left hand side is lower triangular (with ones in its diag-
onal) whereas the one on the right hand side is upper triangular. Therefore
L_2^{-1} L_1 = I = U_2 U_1^{-1}, which implies that L_2 = L_1 and U_2 = U_1.
The matrix on the left hand side is lower triangular with ones in its diagonal
while the matrix on the right hand side is upper triangular also with ones
in its diagonal. Therefore, B −1 C T = I = C(B T )−1 and thus, C = B T
\[ L = B D_B^{-1} = C D_C^{-1}, \]  (10.49)
\[ U = D_B B^T = D_C C^T, \]  (10.50)
where DC = diag(c11 , . . . , cnn ). Equation (10.50) implies that b2ii = c2ii for
i = 1, . . . , n and since bii > 0 and cii > 0 for all i, then DC = DB and
consequently C = B.
\[ a_{ij} = \sum_{k=1}^{n} l_{ik} l_{jk} = \sum_{k=1}^{\min(i,j)} l_{ik} l_{jk}. \]  (10.51)
\[ a_{ij} = \sum_{k=1}^{i} l_{ik} l_{jk}, \qquad 1 \le i \le j \le n. \]  (10.52)
\[ a_{11} = l_{11}^2 \;\Rightarrow\; l_{11} = \sqrt{a_{11}}, \qquad a_{12} = l_{11} l_{21}, \;\ldots,\; a_{1n} = l_{11} l_{n1}, \]
and this allows us to get the first column of L. The second column is now found by using (10.52) for i = 2:
\[ a_{22} = l_{21}^2 + l_{22}^2 \;\Rightarrow\; l_{22} = \sqrt{a_{22} - l_{21}^2}, \qquad a_{23} = l_{21} l_{31} + l_{22} l_{32}, \;\ldots,\; a_{2n} = l_{21} l_{n1} + l_{22} l_{n2}, \]
etc. Algorithm 10.4 gives the pseudo code for the Choleski factorization.
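A Python sketch in the spirit of Algorithm 10.4 (the listing itself is not reproduced here), building L column by column from (10.52):

import numpy as np

def cholesky(A):
    # Choleski factorization A = L L^T for a symmetric positive definite A.
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])        # diagonal entry l_jj
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]  # entries below the diagonal
    return L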
where
\[ m_1 = a_1, \]  (10.55)
\[ l_j = c_j / m_j, \qquad m_{j+1} = a_{j+1} - l_j b_j, \qquad \text{for } j = 1, \ldots, n-1, \]  (10.56)
1st row: a1 = m1 , b1 = b1 ,
2nd row: c1 = m1 l1 , a2 = l1 b1 + m2 , b2 = b2 ,
..
.
(n − 1)-st row: cn−2 = mn−2 ln−2 , an−1 = ln−2 bn−2 + mn−1 , bn−1 = bn−1 ,
n-th row: cn−1 = mn−1 ln−1 , an = ln−1 bn−1 + mn
Thus,
det(A1 ) = a1 = m1 , (10.58)
det(A2 ) = a2 a1 − c1 b1 = a2 m1 − b1 c1 = m1 m2 . (10.59)
at each point x ∈ [0, 1], and pinned at the end points. Let u(x) be the beam deformation from the horizontal position. Assuming that the deformations are small (linear elasticity regime), u satisfies
\[ -u''(x) + c(x)\,u(x) = f(x), \qquad 0 < x < 1, \]
where c(x) ≥ 0 is related to the elastic material properties of the beam. Because the beam is pinned at the end points we have the boundary conditions u(0) = u(1) = 0. We discretize with the grid points
\[ x_0 = 0,\; x_1 = h,\; x_2 = 2h,\; \ldots,\; x_N = Nh,\; x_{N+1} = 1, \]  (10.64)
\[ v_0 = v_{N+1} = 0. \]  (10.67)
\[
\frac{1}{h^2}
\begin{bmatrix}
2 + c_1 h^2 & -1 & 0 & \cdots & \cdots & 0 \\
-1 & 2 + c_2 h^2 & -1 & \ddots & & \vdots \\
0 & \ddots & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & 0 \\
\vdots & & \ddots & \ddots & \ddots & -1 \\
0 & \cdots & \cdots & 0 & -1 & 2 + c_N h^2
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ \vdots \\ \vdots \\ v_N \end{bmatrix}
=
\begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ \vdots \\ \vdots \\ f_N \end{bmatrix}.
\]  (10.68)
\[ v^T A v = \sum_{j=0}^{N} \left( \frac{v_{j+1} - v_j}{h} \right)^2 + \sum_{j=1}^{N} c_j v_j^2 > 0, \qquad \forall\, v \neq 0 \]  (10.69)
(with the convention v_0 = v_{N+1} = 0),
f. Denoting by Ω and ∂Ω the unit square [0, 1] × [0, 1] and its boundary, respectively, the BVP is to find u such that
\[ -\Delta u = f \quad \text{in } \Omega, \]  (10.70)
\[ u = 0 \quad \text{on } \partial\Omega, \]  (10.71)
and
\[ \Delta u = \nabla^2 u = u_{xx} + u_{yy} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}. \]  (10.72)
Equation (10.70) is Poisson’s equation (in 2D) and together with (10.71)
specify a (homogeneous) Dirichlet problem because the value of u is given at
the boundary.
To construct a numerical approximation to (10.70)-(10.71), we proceed as
in the previous 1D BVP example by discretizing the domain. For simplicity,
we will use uniformly spaced grid points. We choose a positive integer N and
define the grid points of our domain Ω = [0, 1] × [0, 1] as
Neglecting the O(h2 ) discretization error and denoting by vij the approxima-
tion to u(xi , yj ) we get:
\[
\begin{bmatrix}
T & -I & 0 & \cdots & 0 \\
-I & T & -I & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -I & T & -I \\
0 & \cdots & 0 & -I & T
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ \vdots \\ v_N \end{bmatrix}
= h^2
\begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ \vdots \\ f_N \end{bmatrix},
\]  (10.76)
\[
T =
\begin{bmatrix}
4 & -1 & 0 & \cdots & 0 \\
-1 & 4 & -1 & \ddots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \ddots & -1 & 4 & -1 \\
0 & \cdots & 0 & -1 & 4
\end{bmatrix}.
\]  (10.77)
Thus, the matrix of coefficients in (10.76) is sparse, i.e. the vast majority of its entries are zero.
But first we look at three concrete iterative methods of the form (10.79).
Unless otherwise stated, A is assumed to be a non-singular n × n matrix and
b a given n-column vector.
\[ \frac{\|x^{(k+1)} - x^{(k)}\|_\infty}{\|x^{(k+1)}\|_\infty} \le \text{Tolerance}. \]  (10.81)
\[
\begin{aligned}
10x_1 - x_2 + 2x_3 &= 6, \\
-x_1 + 11x_2 - x_3 + 3x_4 &= 25, \\
2x_1 - x_2 + 10x_3 - x_4 &= -11, \\
3x_2 - x_3 + 8x_4 &= 15.
\end{aligned}
\]  (10.82)
It has the unique solution (1,2,-1,1). Jacobi’s iteration for this system is
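(The component-wise formulas did not survive extraction; as a stand-in, here is a hedged Python sketch of the Jacobi iteration for this system, written in matrix form rather than component by component.)

import numpy as np

A = np.array([[10., -1.,  2.,  0.],
              [-1., 11., -1.,  3.],
              [ 2., -1., 10., -1.],
              [ 0.,  3., -1.,  8.]])
b = np.array([6., 25., -11., 15.])

def jacobi(A, b, x0, num_iters):
    # Each new component uses only values from the previous iterate.
    D = np.diag(A)
    R = A - np.diagflat(D)        # off-diagonal part
    x = x0.copy()
    for _ in range(num_iters):
        x = (b - R @ x) / D
    return x

x = jacobi(A, b, np.zeros(4), 25)   # converges toward (1, 2, -1, 1)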
(a) lim_{k→∞} T^k = 0.
\[ \|T^k x\| \le \|T^k\|\, \|x\| \]  (10.92)
(b) ⇒ (c): Let us suppose that lim_{k→∞} T^k x = 0 for all x ∈ R^n but that ρ(T) ≥ 1. Then, there is an eigenvector v such that Tv = λv with |λ| ≥ 1 and the sequence T^k v = λ^k v does not converge, which is a contradiction.
(c) ⇒ (d): By Theorem 9.5, for each ε > 0 there is at least one induced norm ‖·‖ such that ‖T‖ ≤ ρ(T) + ε, from which the statement follows.
Theorem 10.5. The iterative method (10.89) is convergent for any initial
guess x(0) if and only if ρ(T ) < 1 or equivalently if and only if kT k < 1 for
at least one induced norm.
from which it follows that the error of the k-th iterate, e_k = x^{(k)} − x, satisfies
\[ e_k = T^k e_0, \]  (10.94)
for k = 1, 2, . . ., where e_0 = x^{(0)} − x is the error of the initial guess. The conclusion now follows immediately from Theorem 10.4.
The spectral radius ρ(T) of the iteration matrix T measures the rate of convergence of the method. For if T is normal, then ‖T‖₂ = ρ(T) and from (10.94) we get
\[ \|e_k\|_2 \le \|T^k\|_2\,\|e_0\|_2 \le \rho(T)^k \|e_0\|_2. \]
But for each k we can find a vector e_0 for which the equality holds, so ρ(T)^k ‖e_0‖₂ is a least upper bound for the error ‖e_k‖₂. If T is not normal, the following result shows that, asymptotically, ‖T^k‖ ≈ ρ(T)^k for any matrix norm.
Theorem 10.6. Let T be any n × n matrix. Then, for any matrix norm ‖·‖,
\[ \lim_{k\to\infty} \|T^k\|^{1/k} = \rho(T). \]
Now, for any given ε > 0 construct the matrix T_ε = T/(ρ(T) + ε). Then lim_{k→∞} T_ε^k = 0, as ρ(T_ε) < 1. Therefore, there is an integer K_ε such that
\[ \|T_\varepsilon^k\| = \frac{\|T^k\|}{(\rho(T) + \varepsilon)^k} \le 1, \qquad \text{for all } k \ge K_\varepsilon. \]  (10.98)
Proof. (a) The Jacobi iteration matrix T has entries Tii = 0 and Tij =
−aij /aii for i 6= j. Therefore,
\[ \|T\|_\infty = \max_{1\le i\le n} \sum_{\substack{j=1 \\ j\neq i}}^{n} \left| \frac{a_{ij}}{a_{ii}} \right| = \max_{1\le i\le n} \frac{1}{|a_{ii}|} \sum_{\substack{j=1 \\ j\neq i}}^{n} |a_{ij}| < 1. \]  (10.100)
(b) We will prove that ρ(T) < 1 for the Gauss-Seidel iteration. Let x be an eigenvector of T with eigenvalue λ, normalized to have ‖x‖_∞ = 1. Recall that T = I − M^{-1}A. Then, Tx = λx implies Mx − Ax = λMx, from which we get
\[ -\sum_{j=i+1}^{n} a_{ij} x_j = \lambda \sum_{j=1}^{i} a_{ij} x_j = \lambda a_{ii} x_i + \lambda \sum_{j=1}^{i-1} a_{ij} x_j, \]  (10.101)
where the last inequality was obtained by using that A is SDD. Thus, |λ| < 1
and so ρ(T ) < 1.
Theorem 10.8. A necessary condition for the convergence of the S.O.R. iteration is 0 < ω < 2.
Proof. We will show that det(T) = (1 − ω)^n, and because det(T) is equal, up to a sign, to the product of the eigenvalues of T, we have that |det(T)| ≤ ρ^n(T) and this implies that
\[ |1 - \omega| \le \rho(T). \]
Since ρ(T) < 1 is required for convergence, the conclusion follows. Now, T = M^{-1}[M − A] and det(T) = det(M^{-1}) det(M − A). From the definition of the S.O.R. iteration (10.88) we get that
\[ a_{ii} x_i^{(k+1)} + \omega \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} = a_{ii} x_i^{(k)} - \omega \sum_{j=i}^{n} a_{ij} x_j^{(k)} + \omega b_i. \]  (10.103)
In this chapter we focus on some numerical methods for the solution of large
linear systems Ax = b where A is a sparse, symmetric positive definite matrix.
We also look briefly at the non-symmetric case.
Denoting g(t) = J(x + tv) and using the definition (11.3) of J we get
\[
\begin{aligned}
g(t) &= \frac12 \langle x - \bar x + tv,\, A(x - \bar x + tv)\rangle \\
&= J(x) + \langle x - \bar x, Av\rangle\, t + \frac12 \langle v, Av\rangle\, t^2 \\
&= J(x) + \langle Ax - b, v\rangle\, t + \frac12 \langle v, Av\rangle\, t^2.
\end{aligned}
\]  (11.5)
This is a parabola opening upward because ⟨v, Av⟩ > 0 for all v ≠ 0. Thus, its minimum is attained at the critical point t* where g'(t*) = 0, that is
\[ t^* = \frac{\langle v, b - Ax\rangle}{\langle v, Av\rangle}, \]  (11.7)
\[ g(t^*) = J(x) - \frac12 \frac{\langle v, b - Ax\rangle^2}{\langle v, Av\rangle}. \]  (11.8)
\[ \frac12 \|x - \bar x\|_A^2 = \frac12 \|x\|_A^2 - \langle b, x\rangle + \frac12 \|\bar x\|_A^2 \]  (11.9)
and so it follows that
\[ \nabla J(x) = Ax - b. \]  (11.10)
\[ t_k = \frac{\langle r^{(k)}, r^{(k)}\rangle}{\langle r^{(k)}, A r^{(k)}\rangle}, \]  (11.18)
\[ x^{(k+1)} = x^{(k)} + t_k r^{(k)}, \]  (11.19)
\[ r^{(k+1)} = r^{(k)} - t_k A r^{(k)}. \]  (11.20)
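A minimal Python sketch of the steepest descent iteration (11.18)-(11.20); the stopping criterion and names are my own:

import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, max_iters=1000):
    # Steepest descent for a symmetric positive definite A; one product A r per iteration.
    x = x0.copy()
    r = b - A @ x
    for _ in range(max_iters):
        Ar = A @ r
        t = (r @ r) / (r @ Ar)      # (11.18)
        x = x + t * r               # (11.19)
        r = r - t * Ar              # (11.20)
        if np.linalg.norm(r) < tol:
            break
    return x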
For fixed v (0) , v (1) , . . . , v (k−1) , define the following function of c0 , c1 , . . . , ck−1
\[ \frac{\partial G}{\partial c_j}(c_0^*, c_1^*, \ldots, c_{k-1}^*) = 0, \qquad j = 0, \ldots, k-1. \]  (11.25)
That is, the residual r(k) = b − Ax(k) is orthogonal to all the search directions
v (0) , . . . , v (k−1) .
Let us go back to one step of a line search method, x(k+1) = x(k) + tk v (k) ,
where tk is given by the one-dimensional minimizer (11.16). As we have
done in the Steepest Descent method, we find that the corresponding residual
satisfies r(k+1) = r(k) −tk Av (k) . Starting with an initial guess x(0) , we compute
r(0) = b − Ax(0) and take v (0) = r(0) . Then,
and
where the last equality follows from the definition (11.16) of t0 . Now,
and consequently
\[ \langle r^{(2)}, v^{(0)}\rangle = \langle r^{(1)}, v^{(0)}\rangle - t_1 \langle v^{(0)}, A v^{(1)}\rangle = -t_1 \langle v^{(0)}, A v^{(1)}\rangle. \]  (11.32)
Thus, if
\[ \langle v^{(0)}, A v^{(1)}\rangle = 0, \]  (11.33)
then ⟨r^{(2)}, v^{(0)}⟩ = 0. Moreover, r^{(2)} = r^{(1)} − t_1 A v^{(1)}, from which it follows that
where in the last equality we have used the definition of t1 , (11.16). Thus, if
condition (11.33) holds we can guarantee that hr(1) , v (0) i = 0 and hr(2) , v (j) i =
0, j = 0, 1, i.e. we satisfy the conditions of Theorem 11.1 for k = 1, 2.
Theorem 11.2. Suppose v (0) , ..., v (k−1) are conjugate with respect to A, then
for k = 1, 2, . . .
hr(k) , v (j) i = 0, j = 0, 1, . . . , k − 1.
where the scalar s_k is chosen so that v^{(k)} is conjugate to v^{(k−1)} with respect to A, i.e.
\[ 0 = \langle v^{(k)}, A v^{(k-1)}\rangle = \langle r^{(k)}, A v^{(k-1)}\rangle + s_k \langle v^{(k-1)}, A v^{(k-1)}\rangle, \]  (11.40)
which gives
\[ s_k = -\frac{\langle r^{(k)}, A v^{(k-1)}\rangle}{\langle v^{(k-1)}, A v^{(k-1)}\rangle}. \]  (11.41)
Magically this simple construction renders all the search directions conjugate
and the residuals orthogonal!
Theorem 11.4. The residuals generated by the conjugate gradient method are orthogonal, i.e. ⟨r^{(k)}, r^{(j)}⟩ = 0 for j = 0, 1, . . . , k − 1.
cheaply by avoiding operating with the zeros of A. For example, for the matrix in the solution of Poisson's equation in 2D (10.76), the cost of computing Av^{(k)} is just O(n), where n = N² is the total number of unknowns.
Proof. By Theorem 11.4, the residuals are orthogonal hence linearly inde-
pendent. After n steps, r(n) is orthogonal to r(0) , r(1) , . . . , r(n−1) . Since the
dimension of the space is n, r(n) has to be the zero vector.
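For reference, a minimal Python sketch of the conjugate gradient iteration; the update coefficients are written in their standard equivalent forms (using the residual orthogonality), and the names are mine:

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iters=None):
    # CG sketch for symmetric positive definite A; directions are kept A-conjugate.
    n = len(b)
    max_iters = max_iters or n
    x = x0.copy()
    r = b - A @ x
    v = r.copy()
    for _ in range(max_iters):
        Av = A @ v
        t = (r @ r) / (v @ Av)           # one-dimensional minimizer along v
        x = x + t * v
        r_new = r - t * Av
        if np.linalg.norm(r_new) < tol:
            break
        s = (r_new @ r_new) / (r @ r)    # equivalent form of (11.41)
        v = r_new + s * v
        r = r_new
    return x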
and evaluate the residual r(1) , etc. If we use the definition of the residual we
have
so that r(2) = b − Ax(2) is a linear combination of r(0) , Ar(0) , and A2 r(0) and
so on.
Definition 11.2. The set Kk (r(0) , A) = span{r(0) , Ar(0) , ..., Ak−1 r(0) } is called
the Krylov subspace of degree k for r(0) .
Krylov subspaces are central to an important class of numerical methods
that rely on getting approximations through matrix-vector multiplication like
the conjugate gradient method.
The following theorem provides a reinterpretation of the conjugate gra-
dient method. The approximation x(k) is the minimizer of kx − x̄k2A over
Kk (r(0) , A).
Theorem 11.6. Kk (r(0) , A) = span{r(0) , ..., r(k−1) } = span{v (0) , ..., v (k−1) }.
Proof. We will prove it by induction. The case k = 1 holds by construction. Let us now assume that it holds for k and we will prove that it also holds for k + 1. By the induction hypothesis r^{(k)}, v^{(k−1)} ∈ K_k(r^{(0)}, A); then
Consequently,
span{r(0) , ..., r(k) } ⊆ Kk+1 (r(0) , A).
We now prove the reverse inclusion,
Given that
and since
\[ A v^{(j)} = \frac{1}{t_j}\left( r^{(j)} - r^{(j+1)} \right) \]
it follows that
Ak r(0) ∈ span{r(0) , r(1) , ..., r(k) }.
Thus,
span{r(0) , ..., r(k) } = Kk+1 (r(0) , A).
For the last equality we observe that span{v^{(0)}, . . . , v^{(k)}} = span{v^{(0)}, . . . , v^{(k−1)}, r^{(k)}} because v^{(k)} = r^{(k)} + s_k v^{(k−1)}, and by the induction hypothesis
\[
\begin{aligned}
\mathrm{span}\{v^{(0)}, \ldots, v^{(k-1)}, r^{(k)}\} &= \mathrm{span}\{r^{(0)}, A r^{(0)}, \ldots, A^{k-1} r^{(0)}, r^{(k)}\} \\
&= \mathrm{span}\{r^{(0)}, r^{(1)}, \ldots, r^{(k)}\} \\
&= K_{k+1}(r^{(0)}, A).
\end{aligned}
\]  (11.54)
For the conjugate gradient method x(k) ∈ x(0) + Kk (r(0) , A) and in view of
(11.55) we have that
where P̃_k is the set of all polynomials of degree ≤ k that are equal to one at 0. Since A is symmetric positive definite all its eigenvalues are real and positive. Let us order them as 0 < λ₁ ≤ λ₂ ≤ . . . ≤ λ_n, with associated orthonormal eigenvectors v₁, v₂, . . . , v_n. Then, we can write e^{(0)} = α₁v₁ + · · · + α_n v_n for some scalars α₁, . . . , α_n and
\[ p(A)\, e^{(0)} = \sum_{j=1}^{n} p(\lambda_j)\, \alpha_j v_j. \]  (11.58)
Therefore,
\[ \|p(A)e^{(0)}\|_A^2 = \langle p(A)e^{(0)}, A\,p(A)e^{(0)}\rangle = \sum_{j=1}^{n} p^2(\lambda_j)\,\lambda_j\,\alpha_j^2 \le \left(\max_j p^2(\lambda_j)\right) \sum_{j=1}^{n} \lambda_j \alpha_j^2 \]  (11.59)
and since
\[ \|e^{(0)}\|_A^2 = \sum_{j=1}^{n} \lambda_j \alpha_j^2 \]  (11.60)
we get
\[ \|e^{(k)}\|_A \le \min_{p\in \tilde P_k}\, \max_{j} |p(\lambda_j)| \; \|e^{(0)}\|_A. \]
The min max term can be estimated using the Chebyshev polynomial Tk
with the change of variables
\[ f(\lambda) = \frac{2\lambda - \lambda_1 - \lambda_n}{\lambda_n - \lambda_1} \]  (11.62)
to map [λ₁, λ_n] to [−1, 1]. The polynomial
\[ p(\lambda) = \frac{1}{T_k(f(0))}\, T_k(f(\lambda)) \]  (11.63)
Now
\[ |T_k(f(0))| = T_k\!\left( \frac{\lambda_1 + \lambda_n}{\lambda_n - \lambda_1} \right) = T_k\!\left( \frac{\lambda_n/\lambda_1 + 1}{\lambda_n/\lambda_1 - 1} \right) = T_k\!\left( \frac{\kappa_2(A) + 1}{\kappa_2(A) - 1} \right). \]  (11.65)
Since
\[ \frac{\kappa_2(A) + 1}{\kappa_2(A) - 1} = \frac12\left( z + \frac1z \right) \]  (11.66)
for
\[ z = \frac{\sqrt{\kappa_2(A)} + 1}{\sqrt{\kappa_2(A)} - 1}, \]  (11.67)
we obtain
\[ \left[ T_k\!\left( \frac{\kappa_2(A) + 1}{\kappa_2(A) - 1} \right) \right]^{-1} \le 2\left( \frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1} \right)^{k}. \]  (11.68)
Thus we get the upper bound estimate for the error in the conjugate gradient method:
\[ \|e^{(k)}\|_A \le 2\left( \frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1} \right)^{k} \|e^{(0)}\|_A. \]  (11.69)
Chapter 12
Eigenvalue Problems
In this chapter we take a brief look at some numerical methods for the stan-
dard eigenvalue problem, i.e. of finding eigenvalues λ and eigenvectors v of
an n × n matrix A.
\[ v = c_1 v_1 + \cdots + c_n v_n \]  (12.2)
and
\[ A^k v = c_1 \lambda_1^k v_1 + \cdots + c_n \lambda_n^k v_n = c_1 \lambda_1^k \left[ v_1 + \sum_{j=2}^{n} \frac{c_j}{c_1} \left( \frac{\lambda_j}{\lambda_1} \right)^{k} v_j \right]. \]  (12.3)
The power method is useful and efficient for computing the dominant
eigenpair λ1 , v1 when A is sparse, so that the evaluation of Av is economical,
and when |λ2 /λ1 | << 1.
One can use shifts in the matrix A to decrease |λ₂/λ₁| and improve convergence. We apply the power method with the shifted matrix A − sI, where the shift s is chosen to accelerate convergence. For example, suppose A is symmetric and has eigenvalues 100, 90, 50, 40, 30, 30. Then the matrix A − 60I has eigenvalues 40, 30, −10, −20, −30, −30 and the power method would converge at a rate of 30/40 = 0.75 instead of a rate of 90/100 = 0.9.
A variant of the shifted power method is the inverse power method, which applies the iteration to the matrix (A − sI)^{-1}. The inverse is not actually computed; instead the linear system (A − sI)v^{(k)} = v^{(k−1)} is solved at every iteration. The method will converge to the eigenvalue λ_j for which |λ_j − s| is the smallest, and so with an appropriate choice for s it is possible to converge to each of the eigenpairs of A.
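A Python sketch of the (normalized) power method; the Rayleigh quotient estimate and the names are my own choices:

import numpy as np

def power_method(A, x0, num_iters=100):
    # Approximates the dominant eigenpair; convergence rate is |lambda_2/lambda_1|.
    v = x0 / np.linalg.norm(x0)
    lam = 0.0
    for _ in range(num_iters):
        w = A @ v
        v = w / np.linalg.norm(w)   # normalize to avoid overflow/underflow
        lam = v @ (A @ v)           # Rayleigh quotient estimate of lambda_1
    return lam, v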
A1 = Q1 R1 (12.5)
the A20 = R19 Q19 produced by the QR method gives the eigenvalues of A
Chapter 13
Non-Linear Equations
13.1 Introduction
In this chapter we consider the problem of finding zeros of a continuous function f, i.e. solving f(x) = 0 (for example, e^x − x = 0), or a system of nonlinear equations:
\[
\begin{aligned}
f_1(x_1, x_2, \cdots, x_n) &= 0, \\
f_2(x_1, x_2, \cdots, x_n) &= 0, \\
&\;\;\vdots \\
f_n(x_1, x_2, \cdots, x_n) &= 0.
\end{aligned}
\]  (13.1)
We are going to write this generic system in vector form as
f (x) = 0, (13.2)
where f : U ⊆ Rn → Rn . Unless otherwise noted the function f is assumed
to be smooth in its domain U .
We are going to start with the scalar case, n = 1 and look a very simple
but robust method that relies only on the continuity of the function and the
existence of a zero.
13.2 Bisection
Suppose we are interested in solving a nonlinear equation in one unknown
f (x) = 0, (13.3)
and c_k = (a_k + b_k)/2 is the midpoint of the interval, then
\[ |c_k - x^*| \le \frac12 (b_k - a_k) = \frac{b - a}{2^{k}} \]  (13.7)
and consequently c_k → x^*, a zero of f in [a, b].
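A Python sketch of the bisection method (the names and tolerance are my own):

def bisection(f, a, b, tol=1e-10, max_iters=100):
    # Assumes f is continuous and f(a) f(b) < 0, so [a, b] brackets a zero.
    if f(a) * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iters):
        c = 0.5 * (a + b)
        if f(c) == 0.0 or 0.5 * (b - a) < tol:
            return c
        if f(a) * f(c) < 0:
            b = c            # the zero is in [a, c]
        else:
            a = c            # the zero is in [c, b]
    return 0.5 * (a + b)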
or equivalently
\[ \lim_{n\to\infty} \frac{|x_{n+1} - x^*|}{|x_n - x^*|^p} = C, \]  (13.9)
with C < 1 for p = 1.
Then
|xN +1 − x∗ | ≈ C|xN − x∗ |,
|xN +2 − x∗ | ≈ C|xN +1 − x∗ | ≈ C(C|xN − x∗ |) = C 2 |xN − x∗ |.
and this is the reason for the requirement C < 1 when p = 1. If the error at the N-th step, |x_N − x^*|, is small enough, it will be reduced by a factor of C^k after k more steps. Setting C^k = 10^{-d_k}, the error |x_N − x^*| will be reduced by approximately
\[ d_k = \log_{10} \frac{1}{C^k} \]  (13.12)
digits.
Let us now do a similar analysis for p = 2, quadratic convergence. We
have
|xN +1 − x∗ | ≈ C|xN − x∗ |2 ,
|xN +2 − x∗ | ≈ C|xN +1 − x∗ |2 ≈ C(C|xN − x∗ |2 )2 = C 3 |xN − x∗ |4 ,
|xN +3 − x∗ | ≈ C|xN +2 − x∗ |2 ≈ C(C 3 |xN − x∗ |4 )2 = C 7 |xN − x∗ |8 .
It is not difficult to prove that for general p > 1 and as k → ∞ we get
\[ d_k = \alpha_p\, p^k, \qquad \text{where } \alpha_p = \frac{1}{p-1}\log_{10}\frac{1}{C} + \log_{10}\frac{1}{|x_N - x^*|}. \]
Then we can define the next approximation as the zero of that tangent line, i.e.
\[ x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}, \]  (13.16)
etc. At the k-th step or iteration we get the new approximation x_{k+1} according to
\[ x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}, \qquad k = 0, 1, \ldots \]  (13.17)
\[ f(x) = f(x_k) + f'(x_k)(x - x_k) + \frac12 f''(\xi_k)(x - x_k)^2, \]  (13.18)
\[ |x_1 - x^*| = |x_0 - x^*|^2 \left| \frac{f''(\xi_0)}{2 f'(x_0)} \right| \le \epsilon^2 M(\epsilon) \le \epsilon. \]  (13.24)
\[ |x_{k+1} - x^*| = |x_k - x^*|^2 \left| \frac{f''(\xi_k)}{2 f'(x_k)} \right| \le \epsilon^2 M(\epsilon) < \epsilon, \]  (13.25)
so x_{k+1} ∈ I_ε. Now,
The need for a good initial guess x0 for Newton’s method should be
emphasized. In practice, this is obtained with another method, like bisection.
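A Python sketch of Newton's method (13.17), with a simple stopping criterion of my own:

def newton(f, fprime, x0, tol=1e-12, max_iters=50):
    # Requires a good initial guess x0 and f'(x_k) != 0 along the iteration.
    x = x0
    for _ in range(max_iters):
        x_new = x - f(x) / fprime(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

root = newton(lambda x: x**2 - 2.0, lambda x: 2.0 * x, x0=1.0)   # converges to sqrt(2)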
\[ x_{k+1} = x_k - \frac{f(x_k)}{\dfrac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}}, \qquad k = 1, 2, \ldots \]  (13.26)
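A Python sketch of the secant iteration (13.26); it replaces f'(x_k) in Newton's method by the divided difference f[x_k, x_{k−1}]:

def secant(f, x0, x1, tol=1e-12, max_iters=50):
    for _ in range(max_iters):
        slope = (f(x1) - f(x0)) / (x1 - x0)   # f[x_k, x_{k-1}]
        x2 = x1 - f(x1) / slope
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1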
\[
\begin{aligned}
x_{k+1} - x^* &= x_k - x^* - \frac{f(x_k) - f(x^*)}{\dfrac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}}
 = x_k - x^* - \frac{f(x_k) - f(x^*)}{f[x_k, x_{k-1}]} \\
&= (x_k - x^*)\left( 1 - \frac{\dfrac{f(x_k) - f(x^*)}{x_k - x^*}}{f[x_k, x_{k-1}]} \right)
 = (x_k - x^*)\left( 1 - \frac{f[x_k, x^*]}{f[x_k, x_{k-1}]} \right) \\
&= (x_k - x^*)\, \frac{f[x_k, x_{k-1}] - f[x_k, x^*]}{f[x_k, x_{k-1}]}
 = (x_k - x^*)(x_{k-1} - x^*)\, \frac{\dfrac{f[x_k, x_{k-1}] - f[x_k, x^*]}{x_{k-1} - x^*}}{f[x_k, x_{k-1}]} \\
&= (x_k - x^*)(x_{k-1} - x^*)\, \frac{f[x_{k-1}, x_k, x^*]}{f[x_k, x_{k-1}]}.
\end{aligned}
\]
If x_k → x^*, then f[x_{k−1}, x_k, x^*]/f[x_k, x_{k−1}] → f''(x^*)/(2f'(x^*)) and lim_{k→∞} (x_{k+1} − x^*)/(x_k − x^*) = 0, i.e. the sequence generated by the secant method converges faster than linearly.
Defining e_k = |x_k − x^*|, the calculation above suggests e_{k+1} ≈ c e_k e_{k−1} for some constant c > 0. Let us try to determine the rate of convergence of the secant method. Starting with the ansatz e_k ≈ A e_{k−1}^p, or equivalently e_{k−1} = (e_k/A)^{1/p}, we have
\[ e_{k+1} \approx c\, e_k\, e_{k-1} \approx c\, e_k \left( \frac{e_k}{A} \right)^{1/p}, \]
which implies
\[ \frac{A^{1 + 1/p}}{c} \approx e_k^{\,1 - p + 1/p}. \]  (13.28)
Since the left hand side is a constant we must have 1 − p + 1/p = 0, which gives p = (1 ± √5)/2; thus
\[ p = \frac{1 + \sqrt5}{2} \approx 1.61803 \]  (13.29)
gives the rate of convergence of the secant method. It is better than linear but worse than quadratic. Sufficient conditions for local convergence are as in Newton's method.
xk+1 = g(xk ), k = 0, 1, . . .
\[ |x_k - x^*| = |g(x_{k-1}) - g(x^*)| \le L|x_{k-1} - x^*| \le L^2|x_{k-2} - x^*| \le \cdots \le L^k|x_0 - x^*| \to 0, \quad \text{as } k \to \infty. \]
Theorem 13.2. If g is a contraction on [a, b] with contraction constant L and maps [a, b] into [a, b], then g has a unique fixed point x* in [a, b] and the fixed point iteration converges to it for any x_0 ∈ [a, b]. Moreover,
(a) \[ |x_k - x^*| \le \frac{L^k}{1 - L}\, |x_1 - x_0|, \]
(b) \[ |x_k - x^*| \le L^k |x_0 - x^*|. \]
Proof. We proved (b) already. Since g : [a, b] → [a, b], the fixed point iteration x_{k+1} = g(x_k), k = 0, 1, . . ., is well-defined and
Now, for n ≥ m
and so
\[ \frac{L^N}{1 - L}\, |x_1 - x_0| \le \epsilon \]  (13.32)
and thus for n ≥ m ≥ N, |x_n − x_m| ≤ ε, i.e. {x_n}_{n=0}^∞ is a Cauchy sequence in [a, b], so it converges to a point x* ∈ [a, b]. But
Example 13.3. Let g(x) = ¼(x² + 3) for x ∈ [0, 1]. Then 0 ≤ g(x) ≤ 1 and |g'(x)| ≤ ½ for all x ∈ [0, 1]. So g is a contraction on [0, 1] and the fixed point iteration will converge to the unique fixed point of g in [0, 1].
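A Python sketch of the fixed point iteration, applied to the g of Example 13.3:

def fixed_point(g, x0, num_iters=25):
    # Fixed point iteration x_{k+1} = g(x_k).
    x = x0
    for _ in range(num_iters):
        x = g(x)
    return x

# g(x) = (x^2 + 3)/4 is a contraction on [0, 1]; the iteration converges
# (linearly) to the fixed point x* = 1.
x_star = fixed_point(lambda x: (x * x + 3.0) / 4.0, x0=0.0)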
Note that x_{k+1} − x* = g(x_k) − g(x*) = g'(ξ_k)(x_k − x*) for some ξ_k between x_k and x*. Thus,
\[ \frac{x_{k+1} - x^*}{x_k - x^*} = g'(\xi_k) \]  (13.34)
and unless g'(x*) = 0, the fixed point iteration converges linearly, when it does converge.
Then, by the Chain Rule, h'(t) = DG(x + t(y − x))(y − x), where DG stands for the derivative matrix of G. Using the definition of h and the Fundamental Theorem of Calculus we have
\[ G(y) - G(x) = h(1) - h(0) = \int_0^1 h'(t)\,dt = \left[ \int_0^1 DG(x + t(y - x))\,dt \right](y - x), \]  (13.38)
and G is a contraction (in that norm). The spectral radius of DG, ρ(DG), will determine the rate of convergence of the corresponding fixed point iteration.
Continuing this way, Newton's method for the system of equations f(x) = 0 can be written as
\[ x^{(k+1)} = x^{(k)} - \bigl[ Df(x^{(k)}) \bigr]^{-1} f(x^{(k)}), \qquad k = 0, 1, \ldots \]
Chapter 14
Numerical Methods for ODEs
14.1 Introduction
In this chapter we will be concerned with numerical methods for the initial
value problem:
\[ \frac{dy(t)}{dt} = f(t, y(t)), \qquad t_0 < t \le T, \]  (14.1)
\[ y(t_0) = \alpha. \]  (14.2)
The independent variable t often represents time but does not have to. With-
out loss of generality we will take t0 = 0. The time derivative is also fre-
quently denoted with a dot (especially in physics) or an apostrophe
\[ \frac{dy}{dt} = \dot y = y'. \]  (14.3)
Example 14.1.
Example 14.2.
\[
\begin{aligned}
y_1'(t) &= y_1(t)\,y_2(t) - y_1^2(t), \qquad 0 < t \le T, \\
y_2'(t) &= -y_2(t) + t^2 \cos y_1(t), \qquad 0 < t \le T,
\end{aligned}
\]  (14.6)
\[ y_1(0) = \alpha_1, \qquad y_2(0) = \alpha_2. \]  (14.7)
These two are examples of first order ODEs, which is the type of initial
value problems we will focus on. Higher order ODEs can be written as first
order systems by introducing new variables for the derivatives from the first
up to one order less.
Example 14.3. The Harmonic Oscillator.
\[ y''(t) + k^2 y(t) = 0. \]  (14.8)
If we define y_1 = y and y_2 = y' we get
\[ y_1'(t) = y_2(t), \qquad y_2'(t) = -k^2 y_1(t). \]  (14.9)
Example 14.4.
\[ y'''(t) + 2 y(t)\,y''(t) + \cos y'(t) + e^t = 0. \]  (14.10)
Introducing the variables y_1 = y, y_2 = y', and y_3 = y'' we obtain the first order system:
\[ y_1'(t) = y_2(t), \qquad y_2'(t) = y_3(t), \qquad y_3'(t) = -2 y_1(t)\,y_3(t) - \cos y_2(t) - e^t. \]  (14.11)
If f does not depend explicitly on t we call the ODE (or the system
of ODEs) autonomous. We can turn a non-autonomous system into an
autonomous one by introducing t as a new variable.
Example 14.5. Consider the ODE
\[ y'(t) = \sin t - y^2(t). \]  (14.12)
If we define y_1 = y and y_2 = t we can write this ODE as the autonomous system
\[ y_1'(t) = \sin y_2(t) - y_1^2(t), \qquad y_2'(t) = 1. \]  (14.13)
\[ D = \{ (t, y) : 0 \le t \le T,\; y \in \mathbb{R}^n \}. \]  (14.14)
If f is continuous on D and uniformly Lipschitz in y, i.e. there is L ≥ 0 such that
\[ \|f(t, y_1) - f(t, y_2)\| \le L\,\|y_1 - y_2\| \]  (14.15)
for all t ∈ [0, T] and all y_1, y_2 ∈ R^n, then the initial value problem (14.1)-(14.2) has a unique solution for each α ∈ R^n.
Note that if there is L_0 ≥ 0 such that
\[ \left| \frac{\partial f_i}{\partial y_j}(t, y) \right| \le L_0 \]  (14.16)
for all y, t ∈ [0, T], and i, j = 1, . . . , n, then f is uniformly Lipschitz in y (equivalently, if a given norm of the derivative matrix of f is bounded, see Section 13.8).
Example 14.6.
\[ y' = y^{2/3}, \quad 0 < t, \qquad y(0) = 0. \]  (14.17)
The partial derivative
\[ \frac{\partial f}{\partial y} = \frac{2}{3}\, y^{-1/3} \]  (14.18)
is not continuous around 0. Clearly, y ≡ 0 is a solution of this initial value problem but so is y(t) = t³/27. There is no uniqueness of solution for this initial value problem.
Example 14.7.
y0 = y2, 1<t≤3
(14.19)
y(1) = 3.
\[ y_0 = \alpha, \]  (14.33)
\[ y_{n+1} = y_n + \frac{\Delta t}{2}\left[ f(t_n, y_n) + f(t_{n+1}, y_{n+1}) \right], \qquad n = 0, 1, \ldots, N-1. \]  (14.34)
Like the backward Euler method, this is an implicit one-step method. We will
see later an important class of one-step methods, known as Runge-Kutta
(RK) methods, that use intermediate approximations to the derivative (i.e.
approximations to f ) and a corresponding quadrature. For example, we can
use the midpoint rule quadrature and the approximation
\[ f\bigl(t_{n+1/2},\, y(t_{n+1/2})\bigr) \approx f\Bigl(t_{n+1/2},\; y_n + \frac{\Delta t}{2} f(t_n, y_n)\Bigr) \]  (14.35)
fn = f (tn , yn ). (14.37)
\[ p_1(t) = f_n\, \frac{t - t_{n-1}}{\Delta t} - f_{n-1}\, \frac{t - t_n}{\Delta t}, \]  (14.38)
we get
\[ \int_{t_n}^{t_{n+1}} f(t, y(t))\,dt \approx \int_{t_n}^{t_{n+1}} p_1(t)\,dt = \frac{\Delta t}{2}\left[ 3 f_n - f_{n-1} \right]. \]  (14.39)
or equivalently
\[ \tau_{n+m} = \sum_{j=0}^{m} a_j\, y(t_{n+j}) - \Delta t \sum_{j=0}^{m} b_j\, y'(t_{n+j}). \]  (14.52)
Then
\[ \sum_{j=0}^{m} a_j\, y(t_{n+j}) = \Delta t \sum_{j=0}^{m} b_j\, f(t_{n+j}, y(t_{n+j})) + T_{n+m}. \]  (14.54)
On the other hand, y_{n+m} in the definition of the local error is computed using
\[ a_m y_{n+m} + \sum_{j=0}^{m-1} a_j\, y(t_{n+j}) = \Delta t\left[ b_m f(t_{n+m}, y_{n+m}) + \sum_{j=0}^{m-1} b_j f(t_{n+j}, y(t_{n+j})) \right]. \]  (14.55)
Subtracting (14.55) from (14.54) and using a_m = 1 we get
\[ y(t_{n+m}) - y_{n+m} = \Delta t\, b_m\left[ f(t_{n+m}, y(t_{n+m})) - f(t_{n+m}, y_{n+m}) \right] + T_{n+m}. \]  (14.56)
\[ f(t_{n+m}, y(t_{n+m})) - f(t_{n+m}, y_{n+m}) = \frac{\partial f}{\partial y}(t_{n+m}, \eta_{n+m})\left[ y(t_{n+m}) - y_{n+m} \right], \]  (14.57)
for some η_{n+m} between y(t_{n+m}) and y_{n+m}. Substituting this into (14.56) and solving for τ_{n+m} = y(t_{n+m}) − y_{n+m} we get
\[ \tau_{n+m} = \left[ 1 - \Delta t\, b_m \frac{\partial f}{\partial y}(t_{n+m}, \eta_{n+m}) \right]^{-1} T_{n+m}. \]  (14.58)
Example 14.8. The local truncation error for the forward Euler method is
\[ \tau_{n+1} = \frac12 (\Delta t)^2\, y''(\eta_n). \]  (14.61)
Thus, assuming the exact solution is C², the local truncation error of the forward Euler method is O(Δt)².
Example 14.9. For the explicit midpoint Runge-Kutta method we have
\[ \tau_{n+1} = y(t_{n+1}) - y(t_n) - \Delta t\, f\Bigl(t_{n+1/2},\; y(t_n) + \frac{\Delta t}{2} f(t_n, y(t_n))\Bigr). \]  (14.62)
But y' = f, y'' = f' and
\[ f' = \frac{\partial f}{\partial t} + \frac{\partial f}{\partial y}\, y' = \frac{\partial f}{\partial t} + \frac{\partial f}{\partial y}\, f. \]  (14.64)
Therefore
\[ f\Bigl(t_{n+1/2},\; y(t_n) + \frac{\Delta t}{2} f(t_n, y(t_n))\Bigr) = y'(t_n) + \frac12 \Delta t\, y''(t_n) + O(\Delta t)^2. \]  (14.65)
In the previous two examples the methods are one-step. We now obtain the local truncation error for a multistep method.
and using y' = f,
\[ \tau_{n+2} = y(t_{n+2}) - y(t_{n+1}) - \frac{\Delta t}{2}\left[ 3 y'(t_{n+1}) - y'(t_n) \right]. \]  (14.69)
Taylor expanding y(t_{n+2}) and y'(t_n) around t_{n+1} we have
\[ y(t_{n+2}) = y(t_{n+1}) + \Delta t\, y'(t_{n+1}) + \frac12 (\Delta t)^2 y''(t_{n+1}) + O(\Delta t)^3, \]  (14.70)
\[ y'(t_n) = y'(t_{n+1}) - \Delta t\, y''(t_{n+1}) + O(\Delta t)^2. \]  (14.71)
Thus
\[ \tau_{n+2} = \Delta t\, y'(t_{n+1}) + \frac12 (\Delta t)^2 y''(t_{n+1}) - \frac{\Delta t}{2}\left[ 2 y'(t_{n+1}) + \Delta t\, y''(t_{n+1}) \right] + O(\Delta t)^3 = O(\Delta t)^3. \]  (14.72)
Definition 14.3. A numerical method for the initial value problem (14.1)-
(14.2) is said to be of order p if its local truncation error is O(∆t)p+1 .
14.6 Convergence
A basic requirement of the approximations generated by a numerical method
is that they get better and better as we take smaller step sizes. That is, we
want the approximations to approach the exact solution as ∆t → 0.
Definition 14.5. A numerical method for the initial value problem (14.1)-(14.2) is convergent if the global error at a given t converges to zero as ∆t → 0 with n∆t = t fixed, i.e.
\[ \lim_{\substack{\Delta t \to 0 \\ n\Delta t = t}} \left[ y(n\Delta t) - y_n \right] = 0. \]  (14.79)
Note that for a multistep method the initialization values y1 , . . . , ym−1 must
converge to y(0) = α as ∆t → 0.
\[ \le \cdots \le (1 + \Delta t L)^{n+1} |e_0| + C(\Delta t)^{p+1} \sum_{j=0}^{n} (1 + \Delta t L)^j \]  (14.85)
\[ K_1 = f(t_n, y_n), \]  (14.92)
\[ K_2 = f(t_n + \Delta t,\; y_n + \Delta t K_1), \]  (14.93)
\[ y_{n+1} = y_n + \Delta t\left[ \tfrac12 K_1 + \tfrac12 K_2 \right]. \]  (14.94)
Note that K1 and K2 are approximations to the derivative of y.
Example 14.13. The midpoint RK method (14.36) is also a two-stage RK
method and can be written as
\[ K_1 = f(t_n, y_n), \]  (14.95)
\[ K_2 = f\Bigl(t_n + \frac{\Delta t}{2},\; y_n + \frac{\Delta t}{2} K_1\Bigr), \]  (14.96)
\[ y_{n+1} = y_n + \Delta t K_2. \]  (14.97)
\[
\begin{aligned}
K_1 &= f(t_n, y_n), \\
K_2 &= f\bigl(t_n + \tfrac12\Delta t,\; y_n + \tfrac12\Delta t K_1\bigr), \\
K_3 &= f\bigl(t_n + \tfrac12\Delta t,\; y_n + \tfrac12\Delta t K_2\bigr), \\
K_4 &= f(t_n + \Delta t,\; y_n + \Delta t K_3), \\
y_{n+1} &= y_n + \frac{\Delta t}{6}\left[ K_1 + 2K_2 + 2K_3 + K_4 \right].
\end{aligned}
\]  (14.98)
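A Python sketch of one step of the classical fourth order Runge-Kutta method (14.98) and a simple driver (the names are mine):

import numpy as np

def rk4_step(f, t, y, dt):
    K1 = f(t, y)
    K2 = f(t + 0.5 * dt, y + 0.5 * dt * K1)
    K3 = f(t + 0.5 * dt, y + 0.5 * dt * K2)
    K4 = f(t + dt, y + dt * K3)
    return y + dt / 6.0 * (K1 + 2 * K2 + 2 * K3 + K4)

def solve(f, y0, T, N):
    # Integrate y' = f(t, y), y(0) = y0 on [0, T] with N uniform steps.
    dt = T / N
    t, y = 0.0, np.asarray(y0, dtype=float)
    for _ in range(N):
        y = rk4_step(f, t, y, dt)
        t += dt
    return y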
\[
\begin{aligned}
K_1 &= f\Bigl(t_n + c_1\Delta t,\; y_n + \Delta t\sum_{j=1}^{s} a_{1j} K_j\Bigr), \\
K_2 &= f\Bigl(t_n + c_2\Delta t,\; y_n + \Delta t\sum_{j=1}^{s} a_{2j} K_j\Bigr), \\
&\;\;\vdots \\
K_s &= f\Bigl(t_n + c_s\Delta t,\; y_n + \Delta t\sum_{j=1}^{s} a_{sj} K_j\Bigr), \\
y_{n+1} &= y_n + \Delta t\sum_{j=1}^{s} b_j K_j.
\end{aligned}
\]  (14.99)
is lower triangular with zeros on the diagonal, i.e. aij = 0 for i ≤ j. The
zeros of A are usually not displayed in the tableau.
Example 14.14. The tables 14.2-14.4 show the Butcher tableau of some
explicit RK methods.
Implicit RK methods are useful for some initial value problems with disparate time scales, as we will see later. To reduce the computational work needed to solve for the unknowns K_1, . . . , K_s (each K_i is vector-valued for a system of ODEs) in an implicit RK method, two particular types of implicit RK methods are usually employed. The first type is the diagonally implicit RK method or DIRK, which has a_{ij} = 0 for i < j and at least one a_{ii} is
nonzero. The second type has also aij = 0 for i < j but with the addi-
tional condition that aii = γ for all i = 1, . . . , s and γ is a constant. The
corresponding methods are called singly diagonally implicit RK method or
SDIRK.
Example 14.15. Tables 14.5-14.8 show some examples of DIRK and SDIRK
methods.
Table 14.8: Two-stage order 3 SDIRK (γ = (3 ± √3)/6).

    γ   |   γ       0
  1 − γ | 1 − 2γ    γ
  ------+---------------
        |  1/2     1/2
\[ K_1 = f(t_n, y_n), \]  (14.103)
\[ K_2 = f(t_n + \Delta t,\; y_n + \Delta t K_1), \]  (14.104)
\[ w_{n+1} = y_n + \Delta t\left[ \tfrac12 K_1 + \tfrac12 K_2 \right], \]  (14.105)
\[ y_{n+1} = y_n + \Delta t K_1. \]  (14.106)
Note that the approximation of the derivative K1 is used for both methods.
The computation of the higher order method (14.105) only costs an additional
evaluation of f .
where
\[ l_j(t) = \prod_{\substack{k=n-m+1 \\ k\neq j}}^{n} \frac{t - t_k}{t_j - t_k}, \qquad \text{for } j = n-m+1, \ldots, n, \]  (14.110)
where
\[ b_{j-(n-m+1)} = \frac{1}{\Delta t}\int_{t_n}^{t_{n+1}} l_j(t)\,dt, \qquad \text{for } j = n-m+1, \ldots, n. \]  (14.112)
Here are the first three explicit Adams methods, 2-step, 3-step, and 4-step, respectively:
\[ y_{n+1} = y_n + \frac{\Delta t}{2}\left[ 3 f_n - f_{n-1} \right], \]  (14.113)
\[ y_{n+1} = y_n + \frac{\Delta t}{12}\left[ 23 f_n - 16 f_{n-1} + 5 f_{n-2} \right], \]  (14.114)
\[ y_{n+1} = y_n + \frac{\Delta t}{24}\left[ 55 f_n - 59 f_{n-1} + 37 f_{n-2} - 9 f_{n-3} \right]. \]  (14.115)
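A Python sketch of the two-step Adams-Bashforth method (14.113); the choice of a forward Euler step to generate the extra starting value y_1 is my own:

def adams_bashforth2(f, y0, T, N):
    dt = T / N
    t = [n * dt for n in range(N + 1)]
    y = [y0, y0 + dt * f(0.0, y0)]          # y_0 and a starter value y_1
    for n in range(1, N):
        y.append(y[n] + dt / 2.0 * (3.0 * f(t[n], y[n]) - f(t[n - 1], y[n - 1])))
    return y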
The implicit Adams methods, also called Adams-Moulton methods, are
derived by including also (tn+1 , fn+1 ) in the interpolation. That is, p is now
the polynomial interpolating (tj , fj ) for j = n − m + 1, . . . , n + 1. Here are
the first three implicit Adams methods:
\[ y_{n+1} = y_n + \frac{\Delta t}{12}\left[ 5 f_{n+1} + 8 f_n - f_{n-1} \right], \]  (14.116)
\[ y_{n+1} = y_n + \frac{\Delta t}{24}\left[ 9 f_{n+1} + 19 f_n - 5 f_{n-1} + f_{n-2} \right], \]  (14.117)
\[ y_{n+1} = y_n + \frac{\Delta t}{720}\left[ 251 f_{n+1} + 646 f_n - 264 f_{n-1} + 106 f_{n-2} - 19 f_{n-3} \right]. \]  (14.118)
This is a linear difference equation. Let us look for solutions of the form y_n = ξ^n. Plugging into (14.119) we get
\[ \xi^n\left( a_m\xi^m + a_{m-1}\xi^{m-1} + \ldots + a_0 \right) = 0, \]  (14.120)
\[ y_n = c_1\xi_1^n + c_2\xi_2^n + \ldots + c_m\xi_m^n, \]  (14.122)
Conditions (a) and (b) above are known as the root condition.
14.11 A-Stability
So far we have talked about numerical stability in the sense of boundedness
of the numerical approximation in the limit as ∆t → 0. There is another
type of numerical stability which gives us some guidance on the actual size
∆t one can take for a stable computation using a given numerical method for
an ODE initial value problem. This type of stability is called linear stability,
absolute stability, or A-stability. It is based on the behavior of a numerical
method for the simple linear problem:
y 0 = λy, (14.126)
y(0) = 1, (14.127)
where λ is a complex number. The exact solution is y(t) = e^{λt}. Let us look at the forward Euler method applied to this model problem. We have y_{n+1} = y_n + Δtλ y_n = (1 + Δtλ) y_n. Thus, the forward Euler solution is y_n = (1 + Δtλ)^n. Clearly, in order for this numerical approximation to remain bounded as n → ∞ (long time behavior) we need
\[ |1 + \Delta t\,\lambda| \le 1. \]  (14.131)
The set
\[ S = \{ z \in \mathbb{C} : |1 + z| \le 1 \}, \]  (14.132)
i.e. the unit disk centered at −1, is the region of linear stability or A-stability of the forward Euler method.
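A small Python experiment (my own, for illustration) showing the stability restriction |1 + Δtλ| ≤ 1 in practice:

# Forward Euler on y' = lam*y, y(0) = 1: y_n = (1 + dt*lam)^n, so the iterates
# stay bounded exactly when dt*lam lies in the disk (14.132).
lam, dt, N = -10.0, 0.19, 200        # dt*lam = -1.9 is inside the disk: stable, decaying
y = 1.0
for _ in range(N):
    y = y + dt * lam * y
print(abs(y))                        # decays; try dt = 0.21 (dt*lam = -2.1) to see blow-up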
\[ S = \{ z \in \mathbb{C} : |R(z)| \le 1 \}. \]  (14.134)
so the region of A-stability of the (implicit) trapezoidal rule method is the set of complex numbers z such that
\[ \left| \frac{1 + \tfrac{z}{2}}{1 - \tfrac{z}{2}} \right| \le 1. \]  (14.137)
The stability function is therefore R(z) = 1 + z + z²/2 and the set of linear stability consists of all the complex numbers z such that |R(z)| ≤ 1. Note that
Thus, ρ(1) = 0 and ρ'(1) = σ(1), and the method is consistent. However, the roots of ρ are 1 and −5 and hence the method is not zero-stable. Therefore, by Dahlquist's theorem, it is not convergent. For the polynomial Π we have