Numerical Analysis 1
Simon L. Cotter
MATH20602

Contents
1 Introduction
1.1 Computational Complexity
1.2 Accuracy
2 Interpolation
2.1 Lagrange Interpolation
2.2 Interpolation Error
2.3 Newton's divided differences
2.4 Convergence
2.5 An alternative form
5 Non-linear Equations
5.1 The bisection method
5.2 Newton's method
5.3 Fixed point iterations
5.4 Rates of convergence
5.5 Newton's method in the complex plane
1 Introduction
Video 1.1
“Since none of the numbers which we take out from logarithmic and trigonometric tables admit of absolute precision, but are all to a certain extent approximate only, the results of all calculations performed by the aid of these numbers can only be approximately true.” — C.F. Gauss, Theoria motus corporum coelestium in sectionibus conicis solem ambientium, 1809
Classical mathematical analysis owes its existence to the need to model the natural world.
The study of functions and their properties, of differentiation and integration, has its origins
in the attempt to describe how things move and behave. With the rise of technology it became
increasingly important to get actual numbers out of formulae and equations. This is where
numerical analysis comes into the scene: to develop methods to make mathematical models
based on continuous mathematics effective.
In practice, one often cannot simply plug numbers into formulae and get all the exact
results. Most problems require an infinite number of steps to solve, but one only has a finite
amount of time available; most numerical data also require an infinite amount of storage
(just try to store π on a computer!), but a piece of paper or a computer only has so much
space. These are some of the reasons that lead us to work with approximations.1
An algorithm is a sequence of instructions to be carried out by a computer (machine or
human), in order to solve a problem. There are two guiding principles to keep in mind when
designing and analysing numerical algorithms.
These are computational complexity (the number of operations an algorithm requires) and accuracy (how close the computed result is to the exact one). The first aspect is due to limited time; the second due to limited space. In what follows, we discuss these two aspects in some more detail.
1.1 Computational Complexity
As a first example, consider the problem of evaluating a polynomial
p_n(x) = a_0 + a_1 x + a_2 x^2 + · · · + a_n x^n
at a given point x. A naive approach is to:
1. Compute the powers x^k for k = 2, . . . , n,
2. Multiply a_k by x^k for k = 1, . . . , n,
3. Add up the resulting n + 1 terms.
If each of the x^k is computed individually from scratch, the overall number of multiplications is n(n + 1)/2. This can be improved to 2n − 1 multiplications by computing the powers x^k, 1 ≤ k ≤ n, iteratively. An even smarter way, that also uses less intermediate storage, can be derived by observing that the polynomial can be written in the following form:
p_n(x) = a_0 + x(a_1 + a_2 x + a_3 x^2 + · · · + a_n x^{n−1}) = a_0 + x p_{n−1}(x).
The polynomial in brackets has degree n −1, and once we have evaluated it, we only need one
additional multiplication to have the value of p(x). In the same way, p n−1 (x) can be written as
p n−1 (x) = a 1 + xp n−2 (x) for a polynomial p n−2 (x) of degree n −2, and so on. This suggests the
possibility of recursion, leading to Horner’s Algorithm. This algorithm computes a sequence
of numbers
b_n = a_n,
b_{n−1} = a_{n−1} + x · b_n,
⋮
b_0 = a_0 + x · b_1,
where b 0 turns out to be the value of the polynomial p n evaluated at x. In practice, one
would not compute a sequence but overwrite the value of a single variable at each step. The
algorithm can be implemented in a few lines of MATLAB or Python. Note that MATLAB encodes the coefficients a_0, . . . , a_n as a vector with entries a(1), . . . , a(n + 1).
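A minimal Python sketch of Horner's algorithm (the function name horner and the coefficient ordering a[0], . . . , a[n] are our own choices):

def horner(a, x):
    """Evaluate a[0] + a[1]*x + ... + a[n]*x**n using Horner's method."""
    b = a[-1]                      # b_n = a_n
    for coeff in reversed(a[:-1]):
        b = coeff + x * b          # b_k = a_k + x * b_{k+1}
    return b

# Example: p(x) = 1 + 2x + 3x^2 at x = 2 gives 1 + 4 + 12 = 17
print(horner([1, 2, 3], 2.0))      # 17.0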
This algorithm only requires n multiplications. Horner’s Method is the standard way of
evaluating polynomials on computers.
Here we are less concerned with the precise numbers, but with the order of magnitude.
Thus we will not care so much whether a computation procedure uses 1.5n 2 operations (n
is the input size) or 20n 2 , but we will care whether the algorithm needs n 3 as opposed to
n log(n) arithmetic operations to solve a problem. Video 1.2
To study conveniently the performance of algorithms, we use the big-O notation. Given
two functions f and g taking integer arguments n, we say that
f (n) ∈ O(g (n)) or f (n) = O(g (n)),
if there exists a constant C > 0 and an integer n 0 > 0, such that | f (n)| < C · |g (n)| for all n > n 0 .
For example, n log(n) = O(n^2) and n^3 + 10^{308} n ∈ O(n^3).
As another example, consider the computation of a matrix-vector product
Ax = b,
where A is an n × n matrix, and x and b are n-vectors. Normally, the number of multiplications needed is n^2, and the number of additions is n(n − 1) (verify this!). However, there are some matrices, for example the one with the n-th roots of unity a_{ij} = e^{2πij/n} as entries, for
which there are algorithms (in this case, the Fast Fourier Transform) that can compute the
product Ax in O(n log n) operations. This example is of great practical importance, but will
not be discussed further at the moment.
An interesting and challenging field is algebraic complexity theory, which deals with lower
bounds on the number of arithmetic operations needed to perform certain computational
tasks. It also asks questions such as whether Horner’s method and other algorithms are opti-
mal, that is, can’t be improved upon.
1.2 Accuracy
In the early 19th century, C.F. Gauss, one of the most influential mathematicians of all time
and a pioneer of numerical analysis, developed the method of least squares in order to pre-
dict the reappearance of the recently discovered asteroid Ceres. He was well aware of the
limitations of numerical computing, as the quote at the beginning of this lecture indicates.
To measure the quality of approximations, we use the concept of relative error. Given a quan-
tity x and a computed approximation x̂, the absolute error is given by |x̂ − x| and the relative error by |x̂ − x|/|x| (for x ≠ 0). Computers typically store real numbers in double-precision floating-point format, using 64 bits² per number, in the form
x = ±(1 + f) × 2^e,
² A bit is a binary digit, that is, either 0 or 1.
where f is a fraction in [0, 1), represented using 52 bits, and e is the exponent, using 11 bits
(what is the remaining 64th bit used for?). Two things are worth noticing about this represen-
tation: there are largest possible numbers, and there are gaps between representable num-
bers. The largest and smallest numbers representable in this form are of the order of ±10^{308}, enough for most practical purposes. A bigger concern is the gaps, which mean that the
results of many computations almost always have to be rounded to the closest floating-point
number.
Throughout this course, when going through calculations without using a computer, we
will usually use the terminology of significant figures (s.f.) and work with 4 significant figures in base 10. For example, in base 10, √3 equals 1.732 to 4 significant figures. To count the
number of significant figures in a given number, start with the first non-zero digit from the
left and, moving to the right, count all the digits thereafter, counting final zeros if they are to
the right of the decimal point. For example, 1.2048, 12.040, 0.012048, 0.0012040 and 1204.0
all have 5 significant figures (s.f.). In rounding or truncation of a number to n s.f., the original
is replaced by the closest number with n s.f. An approximation x̂ of a number x is said to be
correct to n significant figures if both x̂ and x round to the same n s.f. number.3
Remark 1.1. Note that final zeros to the left of the decimal point may or may not be signifi-
cant: the number 1204000 has at least 4 significant figures, but without any more information
there is no way of knowing whether or not any more figures are significant. When 1203970
is rounded to 5 significant figures to give 1204000, an explanation that this has 5 significant
figures is required. This could be made clear by writing it in scientific notation: 1.2040 × 10^6.
In some cases we also have to agree whether to round up or round down: for example, 1.25
could equal 1.2 or 1.3 to two significant figures. If we agree on rounding up, then to say that
a = 1.2048 to 5 s.f. means that the exact value of a satisfies 1.20475 ≤ a < 1.20485.
³ This definition is not without problems; see for example the discussion in Section 1.2 of Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM 2002.
Example 3. Suppose we want to find the solutions of the quadratic equation [Video 1.6]
ax^2 + bx + c = 0
using the quadratic formula
x_{1,2} = (−b ± √(b^2 − 4ac)) / (2a).   (1.1)
Take a = 1, b = 39.7 and c = 0.13. Working to 4 significant figures, we obtain the computed solutions
x_1 = −0.005000, x_2 = −39.70,
whereas the exact solutions are
x_1 = −0.0032748 . . . , x_2 = −39.6967 . . .
The computed solution x_1 is completely wrong, at least if we look at the relative error:
|x̂_1 − x_1| / |x_1| = 0.5268.
While the accuracy can be improved by increasing the number of significant figures during
the calculation, such effects happen all the time in scientific computing and the possibility
of such effects has to be taken into account when designing numerical algorithms.
Note that it makes sense, as in the above example, to look at errors in a relative sense.
An error of one mile is negligible when dealing with astronomical distances, but not so when
measuring the length of a race track.
By analysing what causes the error it is sometimes possible to modify the method of calculation in order to improve the result. In the present example, the problems are caused by the fact that b ≈ √(b^2 − 4ac), and therefore
(−b + √(b^2 − 4ac)) / (2a) = (−39.7 + 39.69) / 2
exhibits what is called “catastrophic cancellation.” A way out is provided by the observation
that the two solutions are related by
x_1 · x_2 = c/a.   (1.2)
When b > 0, the computation of x 2 according to (1.1) shouldn’t cause any problems, and in
our case we get −39.70 which is accurate to four significant figures. We can then use (1.2) to
derive x 1 = c/(ax 2 ) = −0.003275, also accurate to 4 s.f.
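A short Python sketch of this comparison (the coefficients are those of the example above; rounding the square root to 4 significant figures is our crude way of mimicking low-precision arithmetic):

import math

a, b, c = 1.0, 39.7, 0.13
disc = round(math.sqrt(b * b - 4 * a * c), 2)   # 39.69 instead of 39.6934... (4 s.f.)
x1_naive = (-b + disc) / (2 * a)    # about -0.005   : catastrophic cancellation
x2 = (-b - disc) / (2 * a)          # about -39.70   : no cancellation, accurate
x1_stable = c / (a * x2)            # about -0.003275: uses the relation (1.2)
print(x1_naive, x2, x1_stable)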
As we have seen, one can sometimes get around numerical catastrophes by choosing a clever
method for solving a problem, rather than increasing precision. So far we have considered
errors introduced due to rounding operations. There are other sources of errors:
1. Overflow;
The first is rarely an issue, as we can represent numbers of order 10308 on a computer.
The second and third are important factors, but fall outside the scope of this lecture. The
fourth has to do with the fact that many computations are done approximately rather than
exactly. For computing the exponential, for example, we might use a method that gives the
approximation
e^x ≈ 1 + x + x^2/2.
As it turns out, many practical methods give approximations to the “true” solution. End week 1
2 Interpolation
Video 2.1
How do we represent a function on a computer? If f is a polynomial of degree n,
f (x) = p n (x) = a 0 + a 1 x + · · · + a n x n ,
then we only need to store the n + 1 coefficients a 0 , . . . , a n . In fact, one can approximate an
arbitrary continuous function on a bounded interval by a polynomial. Recall that C k ([a, b])
is the set of functions that are k-times continuously differentiable on [a, b].
Theorem 2.1 (Weierstrass). For any f ∈ C ([0, 1]) and any ε > 0 there exists a polynomial p(x)
such that
max_{0≤x≤1} |f(x) − p(x)| ≤ ε.
Given pairs (x_j, y_j) ∈ ℝ², 0 ≤ j ≤ n, with distinct x_j, the interpolation problem consists of
finding a polynomial p of lowest possible degree such that
p(x j ) = y j , 0 ≤ j ≤ n. (2.1)
Example 4. Let h = 1/n, x 0 = 0, and x i = i h for 1 ≤ i ≤ n. The x i subdivide the interval [0, 1]
into segments of equal length h. Now let y i = i h/2 for 0 ≤ i ≤ n. Then the points (x i , y i )
all lie on the line p 1 (x) = x/2, as is easily verified. It is also easy to see that p 1 is the unique
polynomial of degree at most 1 that goes through these points. In fact, we will see that it is
the unique polynomial of degree at most n that passes through these points!
We will first describe the method of Lagrange interpolation, which also helps to establish
the existence and uniqueness of an interpolation polynomial satisfying (2.1). We then discuss
the quality of approximating polynomials by interpolation, the question of convergence, as
well as other methods such as Newton interpolation.
Note that we assumed the x j to be distinct, otherwise we would have to divide by zero and
cause a disaster. We therefore obtain the representation
L_k(x) = ∏_{j≠k} (x − x_j) / ∏_{j≠k} (x_k − x_j).
Proof. The case n = 0 is clear, so let us assume n ≥ 1. In Lemma 2.1 we constructed a poly-
nomial p n (x) of degree at most n satisfying the conditions (2.2), proving the existence part.
For the uniqueness, assume that we have two such polynomials p n (x) and q n (x) of degree at
most n satisfying the interpolating property (2.2). The goal is to show that they are the same.
By assumption, the difference p n (x) − q n (x) is a polynomial of degree at most n that takes on
the value p n (x j ) − q n (x j ) = y j − y j = 0 at the n + 1 distinct x j , 0 ≤ j ≤ n. By the Fundamental
Theorem of Algebra, a non-zero polynomial of degree n can have no more than n distinct real
roots, from which it follows that p n (x) − q n (x) ≡ 0, or p n (x) = q n (x).
is called the Lagrange interpolation polynomial of degree n corresponding to the data points
(x j , y j ), 0 ≤ j ≤ n. If the y k are the values of a function f , that is, if f (x k ) = y k , 0 ≤ k ≤ n, then
p n (x) is called the Lagrange interpolation polynomial associated with f and x 0 , . . . , x n .
Remark 2.1. Note that the interpolation polynomial is uniquely determined, but that the
polynomial can be written in different ways. The term Lagrange interpolation polynomial
thus refers to the particular form (2.3) of this polynomial. For example, the two expressions
q_2(x) = x^2,    p_2(x) = x(x − 1)/2 + x(x + 1)/2,
define the same polynomial (as can be verified by multiplying out the terms on the right),
and thus both represent the unique polynomial interpolating the points (x 0 , y 0 ) = (−1, 1),
(x 1 , y 1 ) = (0, 0), (x 2 , y 2 ) = (1, 1), but only p 2 (x) is in the Lagrange form.
(*) A different take on the uniqueness problem can be arrived at by translating the prob-
lem into a linear algebra one. For this, note that if p n (x) = a 0 + a 1 x + · · · + a n x n , then the
polynomial evaluation problem at the x j , 0 ≤ j ≤ n, can be written as a matrix vector prod-
uct:
\[
\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_0 & \cdots & x_0^n \\
1 & x_1 & \cdots & x_1^n \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_n & \cdots & x_n^n
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{pmatrix},
\]
or y = X a. If the matrix X is invertible, then the interpolating polynomial is uniquely deter-
mined by the coefficient vector a = X −1 y. The matrix X is invertible if and only if det(X ) ̸= 0.
The determinant of X is the well-known Vandermonde determinant:
\[
\det(X) = \det
\begin{pmatrix}
1 & x_0 & \cdots & x_0^n \\
1 & x_1 & \cdots & x_1^n \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_n & \cdots & x_n^n
\end{pmatrix}
= \prod_{j>i} (x_j - x_i).
\]
Clearly, this determinant is different from zero if and only if the x j are all distinct, which
shows the importance of this assumption.
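As a small numerical check of this characterisation, one can build the Vandermonde matrix and solve for the coefficients directly (a sketch using NumPy; numpy.vander with increasing=True produces exactly the matrix X above):

import numpy as np

x = np.array([-1.0, 0.0, 1.0])      # interpolation points
y = np.array([1.0, 0.0, 1.0])       # data values
X = np.vander(x, increasing=True)   # rows [1, x_j, x_j^2]
a = np.linalg.solve(X, y)           # coefficients a_0, a_1, a_2
print(a)                            # approximately [0, 0, 1], i.e. p(x) = x^2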
Example 5. Consider the function f (x) = e x on the interval [−1, 1], with interpolation points
x 0 = −1, x 1 = 0, x 2 = 1. The Lagrange basis functions are Video 2.2
L_0(x) = (x − x_1)(x − x_2) / ((x_0 − x_1)(x_0 − x_2)) = (1/2) x(x − 1),
L_1(x) = 1 − x^2,
L_2(x) = (1/2) x(x + 1).
The Lagrange interpolation polynomial is therefore given by
p_2(x) = (1/2) x(x − 1) e^{−1} + (1 − x^2) e^0 + (1/2) x(x + 1) e^1
       = 1 + x sinh(1) + x^2 (cosh(1) − 1).
[Figure: the function e^x on [−1, 1] together with its quadratic interpolation polynomial p_2.]
Now that we have established the existence and uniqueness of a function’s interpolation
polynomial, we would like to know how well it approximates that function.
Theorem 2.3. Let n ≥ 0 and assume f ∈ C n+1 ([a, b]). Let p n (x) ∈ P n be the Lagrange interpo-
lation polynomial associated with f and distinct x j , 0 ≤ j ≤ n. Then for every x ∈ [a, b] there
exists ξ = ξ(x) ∈ (a, b) such that
f(x) − p_n(x) = (f^{(n+1)}(ξ) / (n + 1)!) π_{n+1}(x),   (2.4)
where
πn+1 (x) = (x − x 0 ) · · · (x − x n ).
For the proof of Theorem 2.3 we need the following consequence of Rolle’s Theorem. Video 2.5
Lemma 2.2. Let g ∈ C^m([a, b]), and suppose g vanishes at m + 1 distinct points x_0, . . . , x_m. Then there exists ξ ∈ (a, b) such that the m-th derivative g^{(m)} satisfies g^{(m)}(ξ) = 0.
Proof. By Rolle's Theorem, between any two consecutive points x_i, x_{i+1} there exists a point where g′ vanishes, therefore g′ vanishes at (at least) m distinct points. Repeating this argument, it follows that
g (m) vanishes at some point ξ ∈ (a, b).
Proof of Theorem 2.3. Assume x ̸= x j for 0 ≤ j ≤ n (otherwise the theorem is clearly true).
Define the function Video 2.6
φ(t) = f(t) − p_n(t) − ((f(x) − p_n(x)) / π_{n+1}(x)) π_{n+1}(t).
This function vanishes at n + 2 distinct points, namely t = x j , 0 ≤ j ≤ n, and x. Assume
n > 0 (the case n = 0 is left as an exercise). By Lemma 2.2, the function ϕ(n+1) has a zero
ξ ∈ (a, b), while the (n + 1)-th derivative of p n vanishes (since p n is a polynomial of degree n).
We therefore have
0 = φ^{(n+1)}(ξ) = f^{(n+1)}(ξ) − ((f(x) − p_n(x)) / π_{n+1}(x)) (n + 1)!,
from which we get
f(x) − p_n(x) = (f^{(n+1)}(ξ) / (n + 1)!) π_{n+1}(x).
This completes the proof.
Theorem 2.3 contains an unspecified number ξ. Even though we can’t find this location
in practice, the situation is not too bad as we can sometimes bound the (n + 1)-th derivative
of f on the interval [a, b].
Corollary 2.1. Under the conditions as in Theorem 2.3,
|f(x) − p_n(x)| ≤ (M_{n+1} / (n + 1)!) |π_{n+1}(x)|,
where
M_{n+1} = max_{a≤x≤b} |f^{(n+1)}(x)|.
x = x 0 + θh, x 1 = x 0 + h.
While the interpolation polynomial of degree at most n for a function f and n + 1 points
x 0 , . . . , x n is unique, it can appear in different forms. The one we have seen so far is the La-
grange form, where the polynomial is given as a linear combination of the Lagrange basis
functions:
p(x) = ∑_{k=0}^{n} L_k(x) f(x_k),
or some modifications of this form, such as the barycentric form (see Section 2.5). A different
approach to constructing the interpolation polynomial is based on Newton’s divided differ-
ences.
Provided we have the coefficients a 0 , . . . , a n , evaluating the polynomial only requires n mul-
tiplications using Horner’s Method. Moreover, it is easy to add new points: if x n+1 is added,
the coefficients a 0 , . . . , a n don’t need to be changed.
Example 8. Let x_0 = −1, x_1 = 0, x_2 = 1 and x_3 = 2. Then the polynomial p_3(x) = x^3 can be written in the form (2.5) as
p_3(x) = −1 + (x + 1) + 0 · (x + 1)x + (x + 1)x(x − 1).
A pleasant feature of the form (2.5) is that the coefficients a 0 , . . . , a n can be computed
easily using divided differences. The divided differences associated with the function f and distinct x_0, . . . , x_n ∈ ℝ are defined recursively as
f[x_i] := f(x_i),
f[x_i, x_{i+1}] := (f[x_{i+1}] − f[x_i]) / (x_{i+1} − x_i),
f[x_i, x_{i+1}, . . . , x_{i+k}] := (f[x_{i+1}, x_{i+2}, . . . , x_{i+k}] − f[x_i, x_{i+1}, . . . , x_{i+k−1}]) / (x_{i+k} − x_i).
The divided differences can be computed from a divided difference table, where we move
from one column to the next by applying the rules above (here we use the shorthand f i :=
f (x i )):
0  x_0  f_0
               f[x_0, x_1]
1  x_1  f_1                  f[x_0, x_1, x_2]
               f[x_1, x_2]                     f[x_0, x_1, x_2, x_3]
2  x_2  f_2                  f[x_1, x_2, x_3]
               f[x_2, x_3]
3  x_3  f_3
From this table we also see that adding a new pair (x n+1 , f n+1 ) would require an update of the
table that takes O(n) operations.
Theorem 2.4. Let x_0, . . . , x_n be distinct points. Then the interpolation polynomial for f at the points x_i, . . . , x_{i+k} is given by
p_{i,k}(x) = f[x_i] + f[x_i, x_{i+1}](x − x_i) + · · · + f[x_i, . . . , x_{i+k}](x − x_i) · · · (x − x_{i+k−1}).   (2.6)
In particular, the coefficients in Equation (2.5) are given by the divided differences
a_k = f[x_0, . . . , x_k], 0 ≤ k ≤ n.
Before going into the proof, observe that the divided difference f[x_0, . . . , x_n] is the highest order coefficient, that is, the coefficient of x^n, of the interpolation polynomial p_n(x). This observation is crucial in the proof.
The only thing that needs to be shown is that a k+1 = f [x i , . . . , x i +k+1 ]. For this, we define a
new polynomial q(x), show that it coincides with p i ,k+1 (x) by being the unique interpolation
polynomial of f at x i , . . . , x i +k+1 , and then show that the highest order coefficient of q(x) is
precisely f[x_i, . . . , x_{i+k+1}]. Define
q(x) = ((x − x_i) p_{i+1,k}(x) − (x − x_{i+k+1}) p_{i,k}(x)) / (x_{i+k+1} − x_i).
Then
q(x_i) = p_{i,k}(x_i) = f(x_i),
q(x_{i+k+1}) = p_{i+1,k}(x_{i+k+1}) = f(x_{i+k+1}),
q(x_j) = ((x_j − x_i) f(x_j) − (x_j − x_{i+k+1}) f(x_j)) / (x_{i+k+1} − x_i) = f(x_j),  i + 1 ≤ j ≤ i + k.
This means that q(x) also interpolates f at x i , . . . , x i +k+1 , and by the uniqueness of the in-
terpolation polynomial, must equal p i ,k+1 (x). Let’s now compare the coefficients of x k+1 in
both polynomials. The coefficient of x k+1 in p i ,k+1 is a k+1 , as can be seen from (2.6). By the
induction hypothesis, the polynomials p_{i+1,k}(x) and p_{i,k}(x) have leading coefficients f[x_{i+1}, . . . , x_{i+k+1}] and f[x_i, . . . , x_{i+k}], respectively, so from the definition of q(x) the coefficient of x^{k+1} in q(x) = p_{i,k+1}(x) is
(f[x_{i+1}, . . . , x_{i+k+1}] − f[x_i, . . . , x_{i+k}]) / (x_{i+k+1} − x_i) = f[x_i, . . . , x_{i+k+1}].
Example 9. Let’s find the divided difference form of a cubic interpolation polynomial for the
points Video 3.2
(−1, 1), (0, 1), (3, 181), (−2, −39).
The divided difference table (2.8) looks as follows (the computations are shown in place):

j  x_j  f_j   f[x_j, x_{j+1}]              f[x_j, x_{j+1}, x_{j+2}]     f[x_0, x_1, x_2, x_3]
0  −1   1
              (1 − 1)/(0 − (−1)) = 0
1   0   1                                  (60 − 0)/(3 − (−1)) = 15
              (181 − 1)/(3 − 0) = 60                                    (8 − 15)/(−2 − (−1)) = 7
2   3   181                                (44 − 60)/(−2 − 0) = 8
              (−39 − 181)/(−2 − 3) = 44
3  −2   −39
The coefficients a_j = f[x_0, . . . , x_j] are given by the upper diagonal, and the interpolation polynomial is thus
p_3(x) = 1 + 0 · (x + 1) + 15(x + 1)x + 7(x + 1)x(x − 3).
Now suppose we add another data point (4, 801). This amounts to adding only one new term to the polynomial. The new coefficient a_4 = f[x_0, . . . , x_4] is calculated by adding a new line at the bottom of Table (2.8): f[x_3, x_4] = (801 − (−39))/(4 − (−2)) = 140, f[x_2, x_3, x_4] = (140 − 44)/(4 − 3) = 96, f[x_1, . . . , x_4] = (96 − 8)/(4 − 0) = 22 and finally a_4 = f[x_0, . . . , x_4] = (22 − 7)/(4 − (−1)) = 3, so the new term is 3(x + 1)x(x − 3)(x + 2).
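The table update is easy to automate. A small Python sketch (the function name newton_coefficients is our own) that returns the coefficients a_j = f[x_0, . . . , x_j]; applied to the data above it reproduces 1, 0, 15, 7, and 3 once the extra point is appended:

def newton_coefficients(xs, ys):
    """Return the divided differences f[x_0], f[x_0,x_1], ..., f[x_0,...,x_n]."""
    coeffs = list(ys)
    n = len(xs)
    for k in range(1, n):                  # k-th column of the table
        for i in range(n - 1, k - 1, -1):  # update in place, bottom up
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - k])
    return coeffs

print(newton_coefficients([-1, 0, 3, -2], [1, 1, 181, -39]))          # 1, 0, 15, 7
print(newton_coefficients([-1, 0, 3, -2, 4], [1, 1, 181, -39, 801]))  # ..., 3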
∗ Another thing to notice is that the order of the x i plays a role in assembling the Newton
interpolation polynomial, while the order did not play a role in Lagrange interpolation. Recall
the characterisation of the interpolation polynomial in terms of the Vandermonde matrix
from Week 2. The coefficients a i of the Newton divided difference form can also be derived
as the solution of a system of linear equations, this time in convenient triangular form:
\[
\begin{pmatrix} f_0 \\ f_1 \\ f_2 \\ \vdots \\ f_n \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
1 & x_1 - x_0 & 0 & \cdots & 0 \\
1 & x_2 - x_0 & (x_2 - x_0)(x_2 - x_1) & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_n - x_0 & (x_n - x_0)(x_n - x_1) & \cdots & \prod_{j<n}(x_n - x_j)
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}.
\]
2.4 Convergence
Video 3.3
For a given set of points x 0 , . . . , x n and function f , we have a bound on the interpolation error.
Is it possible to make the error smaller by adding more interpolation points, or by modifying
the distribution of these points? The answer to this question can depend on two things: the
class of functions considered, and the spacing of the points. Let p n (x) denote the Lagrange
interpolation polynomial of degree n for f at the points x 0 , . . . , x n . The question we ask is
whether
lim_{n→∞} max_{a≤x≤b} |p_n(x) − f(x)| = 0.
Perhaps surprisingly, the answer is negative, as the following famous example, known as the
Runge Phenomenon, shows.
Example 10. Consider the interval [a, b] and the equally spaced points
x_j = a + (j/n)(b − a), 0 ≤ j ≤ n,
and interpolate the function f(x) = 1/(1 + 25x^2) on [a, b] = [−1, 1] at these points.
[Figure: interpolation of 1/(1 + 25x^2) on [−1, 1] at equally spaced points for increasing n; the interpolants oscillate more and more wildly near the endpoints.]
The problem is not due to the interpolation method, but has to do with the spacing of the
points.
Example 11. Let us revisit the function 1/(1 + 25x 2 ) and try to interpolate it at Chebyshev
points:
x j = cos( j π/n), 0 ≤ j ≤ n.
Calculating the interpolation error for this example shows a completely different result to the
previous example. In fact, plotting the error and comparing it with the case of equispaced
points shows that choosing the interpolation points in a clever way can be of huge benefit.
[Figure: left, interpolation of 1/(1 + 25x^2) at Chebyshev points; right, the maximum interpolation error against n for equal spacing and Chebyshev spacing.]
It can be shown (see Part A of problem sheet 3) that the interpolation error at Chebyshev
points in the interval [−1, 1] can be bounded as
|f(x) − p_n(x)| ≤ M_{n+1} / (2^n (n + 1)!).
This is entirely due to the behaviour of the polynomial πn+1 (x) at these points.
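The comparison is easy to reproduce numerically. A sketch (using scipy's BarycentricInterpolator purely as a convenient, stable way to evaluate the interpolation polynomial; this choice of tool is ours):

import numpy as np
from scipy.interpolate import BarycentricInterpolator

f = lambda x: 1.0 / (1.0 + 25.0 * x**2)
xx = np.linspace(-1, 1, 2001)                 # fine grid for measuring the error

for n in (8, 16):
    x_eq = np.linspace(-1, 1, n + 1)          # equally spaced points
    x_ch = np.cos(np.arange(n + 1) * np.pi / n)   # Chebyshev points cos(j*pi/n)
    err_eq = np.max(np.abs(f(xx) - BarycentricInterpolator(x_eq, f(x_eq))(xx)))
    err_ch = np.max(np.abs(f(xx) - BarycentricInterpolator(x_ch, f(x_ch))(xx)))
    print(n, err_eq, err_ch)   # the equispaced error grows, the Chebyshev error shrinks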
To summarize, we have the following two observations:
1. To estimate the difference |f(x) − p_n(x)| we need assumptions on the function f, for example, that it is sufficiently smooth.
2. The placement of the interpolation points matters: a clever choice, such as the Chebyshev points, can give a much smaller error than equally spaced points.

2.5 An alternative form
The Lagrange form of the interpolation polynomial has some drawbacks. On the one hand, it requires O(n^2) operations to evaluate. Besides this,
adding new interpolation points requires the recalculation of the Lagrange basis polynomi-
als L k (x). Both of these problems can be remedied by rewriting the Lagrange interpolation
formula.
Provided x ≠ x_j for 0 ≤ j ≤ n, the Lagrange interpolation polynomial can be written as
p(x) = ( ∑_{k=0}^{n} (w_k/(x − x_k)) f(x_k) ) / ( ∑_{k=0}^{n} w_k/(x − x_k) ),   (2.9)
where w_k = 1/∏_{j≠k}(x_k − x_j) are called the barycentric weights. Once the weights have been computed, the evaluation only takes O(n) operations, and updating the weights to account for a new interpolation point also takes only O(n) operations. To derive this formula, define L(x) = ∏_{k=0}^{n}(x − x_k) and note that p(x) = L(x) ∑_{k=0}^{n} (w_k/(x − x_k)) f(x_k). Noting also that 1 = ∑_{k=0}^{n} L_k(x) = L(x) ∑_{k=0}^{n} w_k/(x − x_k) and dividing by this “intelligent one”, Equation (2.9) follows. Finally, it can be shown that the problem of computing the barycentric Lagrange interpolation is numerically stable at points such as Chebyshev points.
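A direct Python sketch of formula (2.9) (the function names are our own; no special handling is included for the case where x coincides with a node):

import numpy as np

def barycentric_weights(xs):
    """w_k = 1 / prod_{j != k} (x_k - x_j), computed in O(n^2)."""
    xs = np.asarray(xs, dtype=float)
    w = np.ones_like(xs)
    for k in range(len(xs)):
        w[k] = 1.0 / np.prod(xs[k] - np.delete(xs, k))
    return w

def barycentric_eval(x, xs, ys, w):
    """Evaluate the interpolant at x (assumed not equal to any node) via (2.9)."""
    t = w / (x - np.asarray(xs))
    return np.dot(t, ys) / np.sum(t)

xs, ys = [-1.0, 0.0, 1.0], np.exp([-1.0, 0.0, 1.0])   # Example 5 revisited
w = barycentric_weights(xs)
print(barycentric_eval(0.5, xs, ys, w))   # close to 1 + 0.5*sinh(1) + 0.25*(cosh(1) - 1)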
If possible, one can compute the antiderivative F (x) (the function such that F ′ (x) = f (x))
and obtain the integral as F (b) − F (a). However, it is not always possible to compute the
antiderivative, as in the cases
∫_0^1 e^{x^2} dx,   ∫_0^π cos(x^2) dx.
More prominently, the standard normal (or Gaussian) probability distribution amounts to
evaluating the integral
(1/√(2π)) ∫_{−∞}^{z} e^{−x^2/2} dx,
which is not possible in closed form. Even if it is possible in principle, evaluating the an-
tiderivative may not be numerically the best thing to do. The problem then is to approxi-
mate such integrals numerically as well as possible.
The Trapezium Rule seeks to approximate the integral, interpreted as the area under the
curve, by the area of a trapezium defined by the graph of the function.
[Figure: the integral of f interpreted as the area under its graph, approximated by the area of a trapezium.]
Suppose we want to approximate the integral between a and b, and let h = b − a. Then
the trapezium approximation is given by
∫_a^b f(x) dx ≈ I(f) = (h/2)(f(a) + f(b)),
as can be verified easily. The Trapezium Rule may be interpreted as integrating the linear
interpolant of f at the points x 0 = a and x 1 = b. The linear interpolant is given as
p_1(x) = ((x − b)/(a − b)) f(a) + ((x − a)/(b − a)) f(b).
Integrating this function gives rise to the representation as the area of a trapezium:
∫_a^b p_1(x) dx = (h/2)(f(a) + f(b)).
Using the interpolation error, we can derive the integration error for the Trapezium Rule.
Theorem 3.1. Given f ∈ C 2 ([a, b]), we claim that Video 3.5
∫_a^b f(x) dx = ∫_a^b p_1(x) dx − (1/12) h^3 f″(ξ), for some ξ ∈ (a, b).
Proof. To derive this, recall (from Theorem 2.3) that the interpolation error is given by
f(x) = p_1(x) + ((x − a)(x − b)/2!) f″(η(x)),
for some η(x) ∈ (a, b). We can therefore write the integral as
∫_a^b f(x) dx = ∫_a^b p_1(x) dx + (1/2) ∫_a^b (x − a)(x − b) f″(η(x)) dx.
By the Integral Mean Value Theorem, there exists a ξ ∈ (a, b) such that
∫_a^b (x − a)(x − b) f″(η(x)) dx = f″(ξ) ∫_a^b (x − a)(x − b) dx.
Since ∫_a^b (x − a)(x − b) dx = −(b − a)^3/6 = −h^3/6, the result follows as claimed.
In general, a quadrature rule approximates the integral by a weighted sum of function values, I(f) = ∑_{k=0}^{n} w_k f(x_k), where the x_k are the quadrature nodes and the w_k are the quadrature weights.
Simpson’s rule uses three points, x 0 = a, x 2 = b, and x 1 = (a+b)/2 to approximate the integral.
Let h = (b − a)/2. Then Simpson’s Rule gives
∫_a^b f(x) dx ≈ I_2(f) = (h/3)(f(x_0) + 4f(x_1) + f(x_2)).
Example 13. For the integrand f(x) = 1/(1 + x) from 1 to 2, Simpson's rule gives the approximation
I_2(f) = (1/6)(1/2 + 8/5 + 1/3) ≈ 0.4056.
This is much closer to the true value 0.4055 than the trapezium rule approximation.
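A quick Python check of both rules against the exact value ln(3/2) (a sketch; the variable names are ours):

import math

f = lambda x: 1.0 / (1.0 + x)
a, b = 1.0, 2.0
trap = (b - a) / 2.0 * (f(a) + f(b))                        # trapezium rule, h = b - a
h = (b - a) / 2.0
simp = h / 3.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))     # Simpson's rule
exact = math.log(3.0 / 2.0)
print(trap, simp, exact)   # approximately 0.4167, 0.40556, 0.40546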
Example 14. For the function f (x) = 3x 2 − x + 1 and interval [0, 1] (that is, h = 0.5), Simpson’s Video 4.3
rule gives the approximation
I_2(f) = (1/6)(1 + 4(3/4 − 1/2 + 1) + 3) = 3/2.
The antiderivative of this polynomial is x 3 − x 2 /2 + x, so the true integral is 1 − 1/2 + 1 = 3/2.
In this case, Simpson’s rule gives the exact value of the integral! As we will see, this is the case
for any quadratic polynomial.
Then
∫_a^b f(x) dx ≈ I_n(f) := ∫_a^b p_n(x) dx = ∑_{k=0}^{n} w_k f(x_k),
where w_k = ∫_a^b L_k(x) dx.
We now show that Simpson’s rule is indeed a Newton-Cotes rule of order 2. Let x 0 =
a, x 2 = b and x 1 = (a + b)/2. Define h := x 1 − x 0 = (b − a)/2. The quadratic interpolation
polynomial is given by
p_2(x) = ((x − x_1)(x − x_2))/((x_0 − x_1)(x_0 − x_2)) f(x_0) + ((x − x_0)(x − x_2))/((x_1 − x_0)(x_1 − x_2)) f(x_1) + ((x − x_0)(x − x_1))/((x_2 − x_0)(x_2 − x_1)) f(x_2).
We claim that
I_2(f) = ∫_a^b p_2(x) dx = (h/3)(f(x_0) + 4f(x_1) + f(x_2)).
To show this, we make use of the identities x 1 = x 0 + h, x 2 = x 0 + 2h, to get the representation
(we use f i := f (x i ) for brevity)
p_2(x) = (f_0/(2h^2))(x − x_1)(x − x_2) − (f_1/h^2)(x − x_0)(x − x_2) + (f_2/(2h^2))(x − x_0)(x − x_1).
Using integration by parts or otherwise, we can evaluate the integrals of each term
∫_{x_0}^{x_2} (x − x_1)(x − x_2) dx = (2/3)h^3,  ∫_{x_0}^{x_2} (x − x_0)(x − x_2) dx = −(4/3)h^3,  ∫_{x_0}^{x_2} (x − x_0)(x − x_1) dx = (2/3)h^3.
Altogether, we get
∫_{x_0}^{x_2} p_2(x) dx = (h/3)(f_0 + 4f_1 + f_2).
This shows the claim. As with the Trapezium rule, we can bound the error for Simpson’s rule.
Theorem 3.2. Let f ∈ C 4 ([a, b]), h = (b − a)/2 and x 0 = a, x 1 = x 0 + h, x 2 = b. Then there exists Video 4.3
ξ ∈ (a, b) such that the integration error is
E(f) := ∫_a^b f(x) dx − (h/3)(f(x_0) + 4f(x_1) + f(x_2)) = −(h^5/90) f^{(4)}(ξ).
Note that, in some places in the literature, the bound is written in terms of (b − a) as
E(f) = −((b − a)^5/2880) f^{(4)}(ξ).
The two versions are equivalent, noting that h = (b − a)/2.
Proof. (*) The proof is based on Chapter 7 of Süli and Mayers, An Introduction to Numerical
Analysis. Consider the change of variable x(t) = x_1 + ht and set F(t) := f(x_1 + ht). Now define
G(t) = ∫_{−t}^{t} F(τ) dτ − (t/3)(F(−t) + 4F(0) + F(t))
for t ∈ [−1, 1]. In particular, h G(1) is the integration error we are trying to estimate. Consider
the function
H(t) = G(t) − t^5 G(1).
Since H (0) = H (1) = 0, by Rolle’s Theorem there exists ξ1 ∈ (0, 1) such that H ′ (ξ1 ) = 0. Since
also H ′ (0) = 0 (exercise), there exists ξ2 ∈ (0, 1) such that H ′′ (ξ2 ) = 0. Since also H ′′ (0) = 0
(exercise), we apply Rolle’s Theorem to find that there exists µ ∈ (0, 1) such that
H ′′′ (µ) = 0.
Note that the third derivative of G is given by G‴(t) = −(t/3)(F‴(t) − F‴(−t)), from which it follows that
H‴(µ) = −(µ/3)(F‴(µ) − F‴(−µ)) − 60µ^2 G(1) = 0.
We can rewrite this equation as
(µ/3)(F‴(µ) − F‴(−µ)) = −60µ^2 G(1).
Since µ ≠ 0, we can divide both sides by 2µ^2/3 to obtain
(F‴(µ) − F‴(−µ)) / (2µ) = −90 G(1).
Then by the Mean Value Theorem, there exists ξ ∈ (−µ, µ) such that
90 G(1) = −F^{(4)}(ξ),
from which we get for the error (after multiplying by h),
h G(1) = −(h/90) F^{(4)}(ξ).
Now note that, using again the substitution x(t ) = x 1 + ht as we made at the beginning,
F^{(4)}(t) = (d^4/dt^4) f(x(t)) = (d^4/dt^4) f(x_1 + ht) = h^4 f^{(4)}(x).
This completes the proof.
In particular, the absolute error satisfies |E(f)| ≤ (h^5/90) M_4, where M_4 is an upper bound on the absolute value of the fourth derivative of f on the interval [a, b].
E_1(f) ≤ (h^3/12) M_2,   E_2(f) ≤ (h^5/90) M_4,
where E 1 ( f ) is the absolute error for the trapezium rule, E 2 ( f ) the absolute error for Simpson’s
rule, h is the distance between two nodes, and M k the maximum absolute value of the k-th
derivative of f . In particular, it follows that the trapezium rule has error 0 for polynomials of
degree at most one (since p 1′′ = 0 so M 2 = 0), and Simpson’s rule for polynomials of degree at
most three (since p 3′′′′ = 0 so M 4 = 0). One may wonder whether increasing the degree of the
interpolating polynomial necessarily decreases the integration error.
Example 15. Consider the infamous function
1
f (x) =
1 + x2
on the interval [−5, 5]. The integral of this function is given by
∫_{−5}^{5} 1/(1 + x^2) dx = arctan(x)|_{−5}^{5} ≈ 2.7468.
Now let's compute the Newton-Cotes quadrature
I_n(f) = ∫_{−5}^{5} p_n(x) dx,
where p_n is the interpolation polynomial of f at n + 1 equally spaced points in [−5, 5].
As this example shows, increasing the degree may not always be an effective choice, and
we have to think of other ways to increase the precision of numerical integration.
[Figure: left, the Newton-Cotes approximations I_n(f) compared with the true value as n increases; right, the absolute error, which grows with n.]
The trapezium rule uses only two points to approximate an integral, certainly not enough for
most applications. There are different ways to make use of more points and function values
in order to increase precision. One way, as we have just seen with the Newton-Cotes scheme
and Simpson’s rule, is to use higher-order interpolants. A different approach is to subdivide
the interval into smaller intervals and use lower-order schemes, like the trapezium rule, on
these smaller intervals. For this, we subdivide the integral as
∫_a^b f(x) dx = ∑_{j=0}^{n−1} ∫_{x_j}^{x_{j+1}} f(x) dx,
where x_j = a + jh for 0 ≤ j ≤ n and h = (b − a)/n, and apply the trapezium rule on each subinterval. The error of the resulting composite trapezium rule is quantified in the following theorem.
Theorem 3.3. Let f ∈ C^2([a, b]), h = (b − a)/n and x_j = a + jh for 0 ≤ j ≤ n. Then there exists µ ∈ (a, b) such that
∫_a^b f(x) dx = h ( (1/2) f(x_0) + f(x_1) + · · · + f(x_{n−1}) + (1/2) f(x_n) ) − (1/12) h^2 (b − a) f″(µ).
Proof. Recall from the error analysis of the trapezium rule (Theorem 3.1) that, for some
points ξ j ∈ (x j , x j +1 ) we have
∫_a^b f(x) dx = ∑_{j=0}^{n−1} [ (h/2)(f(x_j) + f(x_{j+1})) − (1/12) h^3 f″(ξ_j) ]
= h ( (1/2) f(x_0) + f(x_1) + · · · + f(x_{n−1}) + (1/2) f(x_n) ) − (1/12) h^3 ∑_{j=0}^{n−1} f″(ξ_j).
Clearly the values f ′′ (ξ j ) lie between the minimum and maximum of f ′′ on the interval (a, b),
and so their average is also bounded by
min_{x∈[a,b]} f″(x) ≤ (1/n) ∑_{j=0}^{n−1} f″(ξ_j) ≤ max_{x∈[a,b]} f″(x).
Since the function f ′′ is continuous on [a, b], it assumes every value between the minimum
and the maximum, and in particular also the value given by the average above (this is the
statement of the Intermediate Value Theorem). In other words, there exists µ ∈ (a, b) such
that the average above is attained:
(1/n) ∑_{j=0}^{n−1} f″(ξ_j) = f″(µ).
The error term above therefore equals −(1/12) h^3 · n f″(µ) = −(1/12) h^2 (b − a) f″(µ), where we used that h = (b − a)/n. This is the claimed expression for the error.
Example 17. Consider the function f (x) = e −x /x and the integral Video 4.7
∫_1^2 (e^{−x}/x) dx.
What choice of parameter h will ensure that the approximation error of the composite trapez-
ium rule will be below 10−5 ? Let M 2 denote an upper bound on the second derivative of f (x).
The approximation error for the composite trapezium rule with step length h is bounded by
E(f) ≤ (h^2/12)(b − a) · M_2.
We can find M 2 by calculating the derivatives of f :
f′(x) = −e^{−x}(1/x + 1/x^2),
f″(x) = e^{−x}(1/x + 2/x^2 + 2/x^3).
The second derivative f ′′ (x) is decreasing with x, so its maximum on [1, 2] is attained at x = 1,
i.e. M 2 = f ′′ (1) ≈ 1.8394. In the interval [1, 2] we therefore have the bound
E(f) ≤ (h^2/12) × 1.8394 × (2 − 1) = 0.1533 × h^2.
We find that taking 124 steps or more will guarantee an error below 10−5 . For example, with
h = 0.005 (this corresponds to taking n = 200 steps), the error is bounded by 3.83 × 10−6 . End week 4
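The bound can be checked numerically; a sketch (scipy's quad is used only to obtain a reference value, which is an implementation choice of ours):

import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-x) / x
a, b, n = 1.0, 2.0, 200                  # n = 200 steps, i.e. h = 0.005
x = np.linspace(a, b, n + 1)
h = (b - a) / n
composite_trap = h * (0.5 * f(x[0]) + f(x[1:-1]).sum() + 0.5 * f(x[-1]))
reference, _ = quad(f, a, b)
print(abs(composite_trap - reference))   # well below the bound 3.83e-6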
Applying Simpson’s rule to each of the integrals in the sum, we arrive at the expression
(h/3)( f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + · · · + 4f(x_{2m−3}) + 2f(x_{2m−2}) + 4f(x_{2m−1}) + f(x_{2m}) ),
where the coefficients of the f (x i ) alternate between 4 and 2 for 1 ≤ i ≤ 2m − 1.
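A minimal Python sketch of the composite Simpson rule with 2m subintervals (function name and interface are our own):

import numpy as np

def composite_simpson(f, a, b, m):
    """Composite Simpson's rule with 2m subintervals of width h = (b - a)/(2m)."""
    n = 2 * m
    h = (b - a) / n
    y = f(a + h * np.arange(n + 1))
    # weights 1, 4, 2, 4, ..., 2, 4, 1
    return h / 3.0 * (y[0] + y[-1] + 4.0 * y[1:-1:2].sum() + 2.0 * y[2:-1:2].sum())

print(composite_simpson(lambda x: 1.0 / (1.0 + x), 1.0, 2.0, 10))  # close to ln(3/2)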
Theorem 3.4. If f ∈ C 4 (a, b) and a = x 0 < · · · < x n = b, then there exists µ ∈ (a, b) such that Video 5.2
∫_a^b f(x) dx = (h/3)( f(x_0) + 4f(x_1) + 2f(x_2) + · · · + 4f(x_{n−1}) + f(x_n) ) − (1/180) h^4 (b − a) f^{(4)}(µ).
In particular, the absolute error is bounded by
(1/180) h^4 (b − a) M_4.
Proof. The proof is very similar to the proof of Theorem 3.3, using the previous bounds for
Simpson’s rule.
Example 18. Having an error of order h 2 means that, every time we halve the stepsize (or, Video 5.3
equivalently, double the number of points n), the error decreases by a factor of 4. This can
be written as E ( f ) = O(n −2 ). Looking at the example function f (x) = 1/(1 + x) and apply-
ing the composite Trapezium rule, we get the following relationship between the logarithm
of the number of points log n and the logarithm of the error. The fitted line has a slope of
−1.9935 ≈ −2, as expected from the theory.
Summarizing, we have seen the following integration schemes with their corresponding
error bounds:
Trapezium: (1/12) (b − a)^3 M_2
Composite Trapezium: (1/12) h^2 (b − a) M_2
Simpson: (1/2880) (b − a)^5 M_4
Composite Simpson: (1/180) h^4 (b − a) M_4.
[Figure: log of the error of the composite trapezium rule against the log of the number of steps; the fitted line has slope ≈ −2.]
Note that we expressed the error bound for Simpson’s rule in terms of (b − a) rather than
h = (b − a)/2. The h in the bounds for the composite rules corresponds to the distance be-
tween any two consecutive nodes, x j +1 − x j .
We conclude the section with a definition of the order of precision of a quadrature rule.
Definition 3.3. A quadrature rule I ( f ) has degree of precision k, if it evaluates polynomials of
degree at most k exactly. That is,
I(x^j) = ∫_a^b x^j dx = (b^{j+1} − a^{j+1})/(j + 1), for 0 ≤ j ≤ k.
For example, it is easy to show that the Trapezium rule has degree of precision 1 (it eval-
uates 1 and x exactly), while Simpson’s rule has degree of precision 3 (rather than 2 as ex-
pected!). In general, Newton-Cotes quadrature of degree n has degree of precision n, if n is
odd, and n + 1 if n is even.
In this section we consider the solution of systems of linear equations
Ax = b,   (4.1)
where A is an m × n matrix and x and b are vectors. We will often deal with the case m = n (a square matrix A).
There are two main classes of methods for solving such systems.
1. Direct methods attempt to solve (4.1) using a finite number of operations. An example
is the well-known Gaussian elimination algorithm.
2. Iterative methods generate a sequence of approximations x^k, k = 0, 1, 2, . . . , that approaches a solution; the Jacobi and Gauss-Seidel methods discussed below are of this type.
Direct methods generally work well for dense matrices and moderately large n. Iterative
methods work well with sparse matrices, that is, matrices with few non-zero entries a i j , and
large n.
Example 19. Consider the ordinary differential equation
−u xx = f (x),
with boundary conditions u(0) = u(1) = 0, where u is a twice differentiable function on [0, 1],
and u xx = ∂2 u/∂x 2 denotes the second derivative in x. We can discretize the interval [0, 1] by
setting ∆x = 1/(n + 1), x j = j ∆x, and denoting
u j := u(x j ), f j := f (x j ), for j = 0, . . . , n.
The second derivative can be approximated by the finite difference
u_xx(x_i) ≈ (u_{i−1} − 2u_i + u_{i+1}) / (∆x)^2,
so that at each interior node the differential equation becomes
−(u_{i−1} − 2u_i + u_{i+1}) / (∆x)^2 = f_i.
Making use of the boundary conditions u(0) = u(1) = 0, we get the system of equations
\[
-\frac{1}{(\Delta x)^2}
\begin{pmatrix}
-2 & 1 & 0 & \cdots & 0 & 0 \\
1 & -2 & 1 & \cdots & 0 & 0 \\
0 & 1 & -2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -2 & 1 \\
0 & 0 & 0 & \cdots & 1 & -2
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_{n-1} \\ u_n \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_{n-1} \\ f_n \end{pmatrix}.
\]
The matrix is very sparse: it has only 3n − 2 non-zero entries out of n^2 possible! This form is
typical for matrices arising from partial differential equations, and is well-suited for iterative
methods that exploit the specific structure of the matrix.
A general strategy for constructing iterative methods is to split the matrix as A = A_1 + A_2, where A_1 is chosen so that systems involving A_1 are easy to solve. The system Ax = b is then equivalent to
A_1 x = −A_2 x + b.
This motivates the following approach: start with a vector x 0 and successively compute x k+1
from x k by solving the system
A 1 x k+1 = −A 2 x k + b. (4.2)
Note that after the k-th step, the right-hand side is known, while the unknown to be found is
the vector x k+1 on the left-hand side.
In the Jacobi method we write A = L + D + U, where D is the diagonal part of A and L and U are the strictly lower and upper triangular parts, and choose A_1 = D, A_2 = L + U. The iteration (4.2) then reads
x^{k+1} = D^{−1}(b − (L + U)x^k).   (4.3)
Note that since D is diagonal, it is easy to invert: just take reciprocals of the individual entries.
Example 20. For a concrete example, take the following matrix with its decomposition into
diagonal and off-diagonal parts:
\[
A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}
  = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}
  + \begin{pmatrix} 0 & -1 \\ -1 & 0 \end{pmatrix}.
\]
Since
\[
\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}^{-1}
= \begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}
= \frac{1}{2} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},
\]
we get the iteration scheme
\[
x^{k+1} = \begin{pmatrix} 0 & 1/2 \\ 1/2 & 0 \end{pmatrix} x^k + \frac{1}{2} b.
\]
We can also write the iteration (4.3) in terms of individual entries. If we denote [Video 6.1]
x^k := (x_1^{(k)}, . . . , x_n^{(k)})^⊤,
i.e., we write x_i^{(k)} for the i-th entry of the k-th iterate, then the iteration (4.3) becomes
x_i^{(k+1)} = (1/a_{ii}) ( b_i − ∑_{j≠i} a_{ij} x_j^{(k)} ),   (4.4)
for 1 ≤ i ≤ n.
Let’s try this out with b = (1, 1)⊤ , to see if we get a solution. Let x 0 = 0 to start with. Then Video 5.5
x^1 = T x^0 + (1/2) b = (1/2, 1/2)^⊤,
x^2 = T x^1 + (1/2) b = (1/4, 1/4)^⊤ + (1/2, 1/2)^⊤ = (3/4, 3/4)^⊤,
x^3 = T x^2 + (1/2) b = (3/8, 3/8)^⊤ + (1/2, 1/2)^⊤ = (7/8, 7/8)^⊤,
where T denotes the iteration matrix above.
We see a pattern emerging: in fact, one can show (exercise, induction) that in this example
x^k = (1 − 2^{−k}, 1 − 2^{−k})^⊤.
In particular, as k → ∞ the vectors x k approach (1, 1)⊤ , which is easily verified to be a solution
of Ax = b. End week 5
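The iteration above is easy to reproduce in Python (a sketch, using the matrix and right-hand side of Example 20):

import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])
b = np.array([1.0, 1.0])
D = np.diag(np.diag(A))              # diagonal part
R = A - D                            # L + U, the off-diagonal part
x = np.zeros(2)                      # starting vector x^0 = 0
for k in range(5):
    x = np.linalg.solve(D, b - R @ x)    # x^{k+1} = D^{-1}(b - (L+U) x^k)
    print(k + 1, x)                  # 1/2, 3/4, 7/8, ... approaching (1, 1)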
We saw that a general approach to finding an approximate solution to a system of linear
equations
Ax = b
is to generate a sequence of vectors x (k) for k ≥ 0 by some procedure
x (k+1) = T x (k) + c,
in the hope that the sequence approaches a solution. In the case of the Jacobi method, we
had the iteration
x (k+1) = D −1 (b − (L +U )x (k) ),
with L, D, and U the lower, diagonal, and upper triangular part of A. That is,
T = −D −1 (L +U ), c = D −1 b.
Next, we discuss a refinement of this method, and will also address the issue of convergence.
4.1.2 Gauss-Seidel
Video 6.2
In the Gauss-Seidel method, we use a different decomposition, leading to the following sys-
tem
(D + L)x k+1 = −U x k + b. (4.5)
Though the right-hand side is not diagonal (as in the Jacobi method), the system is still easily
solved for x k+1 when x k is given. To derive the entry-wise formula for this method, we take a
closer look at (4.5):
\[
\begin{pmatrix}
a_{11} & 0 & \cdots & 0 \\
a_{21} & a_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
\begin{pmatrix} x_1^{(k+1)} \\ x_2^{(k+1)} \\ \vdots \\ x_n^{(k+1)} \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}
-
\begin{pmatrix}
0 & a_{12} & \cdots & a_{1n} \\
0 & 0 & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0
\end{pmatrix}
\begin{pmatrix} x_1^{(k)} \\ x_2^{(k)} \\ \vdots \\ x_n^{(k)} \end{pmatrix}.
\]
Solving the i-th equation for x_i^{(k+1)} gives the formula
x_i^{(k+1)} = (1/a_{ii}) ( b_i − ∑_{j<i} a_{ij} x_j^{(k+1)} − ∑_{j>i} a_{ij} x_j^{(k)} )
for the (k + 1)-th iterate of x i . Note that in order to compute the (k + 1)-th iterate of x i , we
already use values of the (k + 1)-th iterate of x j for j < i . This differs from the Jacobi form,
where we only resort to the k-th iterate. Both methods have their advantages and disadvan-
tages. While Gauss-Seidel may require less storage (we can overwrite each x i(k) by x i(k+1) as we
don’t need the old value subsequently), Jacobi’s method can be used more easily in parallel
processing (that is, all the x i(k+1) can be computed by different processors for each i ).
Example 21. Consider a simple system of the form
\[
\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix} x
= \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.
\]
Note that this is the kind of system that arises in the discretization of a differential equation.
Although for matrices of this size we can easily solve the system directly, we will illustrate the
use of the Gauss-Seidel method. The Gauss-Seidel iteration has the form
\[
\begin{pmatrix} 2 & 0 & 0 \\ -1 & 2 & 0 \\ 0 & -1 & 2 \end{pmatrix} x^{k+1}
= \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} x^k
+ \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.
\]
Starting from x^0 = (0, 0, 0)^⊤, the system is easily solved to find x^1 = (1/2, 3/4, 7/8)^⊤. Continuing this process we get
x 2 , x 3 , ... until we are satisfied with the accuracy of the solution.
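The corresponding Gauss-Seidel sweep, written entrywise for the 3 × 3 system of Example 21 (a sketch):

import numpy as np

A = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
b = np.array([1.0, 1.0, 1.0])
x = np.zeros(3)                                      # x^0 = 0
for k in range(10):
    for i in range(3):                               # use updated entries immediately
        x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
print(x)   # the first sweep gives (1/2, 3/4, 7/8); iterates approach (3/2, 2, 3/2)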
1. The 2-norm
∥x∥_2 = ( ∑_{i=1}^{n} x_i^2 )^{1/2} = √(x^⊤x).
This is just the usual notion of Euclidean length.
2. The 1-norm
∥x∥_1 = ∑_{i=1}^{n} |x_i|.
3. The ∞-norm
∥x∥_∞ = max_{1≤i≤n} |x_i|.
A convenient way to visualize these norms is via their “unit circles.” If we look at the sets
{x ∈ R2 : ∥x∥p = 1}
for p = 2, 1, ∞, then we see the following shapes: a circle, a diamond, and a square, respectively.
Now that we have defined a way of measuring distances between vectors, we can talk
about convergence.
R R
Definition 4.2. A sequence of vectors x k ∈ n , k = 0, 1, 2, . . . , converges to x ∈ n with respect Video 6.4
to a norm ∥·∥ if for all ε > 0 there exists an N > 0 such that, for all k ≥ N , we have
∥x^k − x∥ < ε.
We sometimes write
lim_{k→∞} x^k = x or x^k → x,
and write x^k →_1 x, x^k →_∞ x, x^k →_2 x
to indicate convergence with respect to the 1-, ∞-, and 2-norms, respectively.
The following lemma implies that, for the purpose of convergence, it doesn’t matter
whether we take the ∞- or the 2-norm.
Lemma 4.1. For x ∈ ℝ^n, [Video 6.5]
∥x∥_∞ ≤ ∥x∥_2 ≤ √n ∥x∥_∞.
Proof. Set M := ∥x∥_∞ = max_{1≤i≤n} |x_i|. Then ∥x∥_2 = M ( ∑_{i=1}^{n} x_i^2/M^2 )^{1/2} ≤ M√n = √n ∥x∥_∞, because x_i^2/M^2 ≤ 1 for all i. This shows the second inequality. For the first one, note that
there is at least one value of i such that M = |x i |. It follows that
∥x∥_2 = M · ( ∑_{i=1}^{n} x_i^2/M^2 )^{1/2} ≥ M = ∥x∥_∞.
A similar relationship can be shown between the 1-norm and the 2-norm, and also be-
tween the 1-norm and the ∞-norm.
x^k →_2 x ⟺ x^k →_∞ x.
In words: if x k → x with respect to the ∞-norm, then x k → x with respect to the 2-norm, and
vice versa.
Proof. Suppose that x k → x with respect to the 2-norm, and let ε > 0. Since x k converges
with respect to the 2-norm, there exists N > 0 such that for all k > N , ∥x k − x∥2 < ε. Since
∥x^k − x∥_∞ ≤ ∥x^k − x∥_2, we also get convergence with respect to the ∞-norm. Now suppose conversely that x^k converges with respect to the ∞-norm. Then given ε > 0, for ε′ = ε/√n there exists N > 0 such that ∥x^k − x∥_∞ < ε′ for k > N. But since ∥x^k − x∥_2 ≤ √n ∥x^k − x∥_∞ < √n ε′ = ε, it follows that x^k also converges with respect to the 2-norm.
The benefit of this type of result is that some norms are easier to compute than others.
Even if we are interested in measuring convergence with respect to the 2-norm, it may be
quicker to show that a sequence converges with respect to the ∞-norm, and once this is
shown, convergence in the 2-norm follows automatically by the above corollary.
x k+1 = T x k + c,
for some matrix T . The hope is that this sequence will converge to a solution vector x such
that x = T x + c. Given such an x, we can subtract x from both sides of the iteration to obtain
x k+1 − x = T x k + c − x = T (x k − x).
That is, the difference x^{k+1} − x arises from the previous difference x^k − x by multiplication by T. For convergence, we want ∥x^k − x∥ to become smaller as k increases, or in other words,
we want multiplication by T to reduce the norm of a vector. In order to quantify the effect of a
linear transformation T on the norm of a vector, we introduce the concept of a matrix norm.
Definition 4.3. A matrix norm is a non-negative function ∥·∥ on the set of real n × n matrices
such that, for every pair of n × n matrices A, B and every scalar λ ∈ ℝ:
1. ∥A∥ ≥ 0, with ∥A∥ = 0 if and only if A = 0;
2. ∥λA∥ = |λ| ∥A∥;
3. ∥A + B∥ ≤ ∥A∥ + ∥B∥;
4. ∥AB∥ ≤ ∥A∥ ∥B∥.
Note that properties 1–3 just state that a matrix norm is also a vector norm, if we think of
the matrix as a vector. Property 4 of the definition is about the “matrix-ness” of a matrix. The
most useful class of matrix norms are the operator norms induced by a vector norm.
Example 24. If we treat a matrix as a column vector of n 2 entries, then the 2-norm is called
the Frobenius norm of the matrix,
∥A∥_F = ( ∑_{i,j=1}^{n} a_{ij}^2 )^{1/2}.
The properties 1–3 are clearly satisfied, since this is just the 2-norm of the matrix considered
as a vector. Property 4 can be verified using the Cauchy-Schwarz inequality, and is left as an
exercise. End week 6
The most important matrix norms are the operator norms associated with certain vector
norms, which measure the extent to which a vector x is “stretched” by the matrix A with
respect to a given norm.
Definition 4.4. Given a vector norm ∥·∥, the corresponding operator norm of an n × n matrix Video 7.1
A is defined as
∥A∥ = max_{x≠0} ∥Ax∥/∥x∥ = max_{∥x∥=1} ∥Ax∥.
Remark 4.1. To see the second equality, note that for x ̸= 0 we can write
∥Ax∥/∥x∥ = ∥A(x/∥x∥)∥ = ∥Ay∥,
with y = x/∥x∥, where we used Property 2 of the definition of a vector norm. The vector y = x/∥x∥ is a vector with norm ∥y∥ = ∥x/∥x∥∥ = 1, so that for every x ≠ 0 there exists a vector y with ∥y∥ = 1 such that
∥Ax∥/∥x∥ = ∥Ay∥.
In particular, maximizing the left-hand side over x ≠ 0 gives the same result as maximizing the right-hand side over y with ∥y∥ = 1.
First, we have to verify that the operator norm is indeed a matrix norm.
Theorem 4.1. The operator norm corresponding to a vector norm ∥·∥ is a matrix norm. Video 7.2
Proof. Properties 1–3 are easy to verify from the corresponding properties of the vector
norms. For example, ∥A∥ ≥ 0 because by the definition, there is no way it could be negative.
To show property 4, namely,
∥AB ∥ ≤ ∥A∥ ∥B ∥
for n × n matrices A and B, we first note that for any nonzero y ∈ ℝ^n,
∥Ay∥/∥y∥ ≤ max_{x≠0} ∥Ax∥/∥x∥ = ∥A∥,
and therefore
∥Ay∥ ≤ ∥A∥ ∥y∥ (which also holds trivially for y = 0).
Now let y = B x for some x with ∥x∥ = 1. Then
∥AB x∥ ≤ ∥A∥ ∥B x∥ ≤ ∥A∥ ∥B ∥ .
As this inequality holds for all unit-norm x, it also holds for the vector that maximises ∥AB x∥,
and therefore we get
∥AB ∥ = max ∥AB x∥ ≤ ∥A∥ ∥B ∥ .
∥x∥=1
This completes the proof.
End week 7
Even though the operator norms with respect to the various vector norms are of immense
importance in the analysis of numerical methods, they are hard to compute or even estimate
from their definition alone. It is therefore useful to have alternative characterizations of these
norms. The first of these characterizations is concerned with the norms ∥·∥1 and ∥·∥∞ , and
provides an easy criterion to compute these.
Lemma 4.2. For an n × n matrix A, the operator norms with respect to the 1-norm and the Video 8.2
∞-norm are given by
∥A∥_1 = max_{1≤j≤n} ∑_{i=1}^{n} |a_{ij}|  (maximum absolute column sum),
∥A∥_∞ = max_{1≤i≤n} ∑_{j=1}^{n} |a_{ij}|  (maximum absolute row sum).
Proof. (*) We will prove this for the ∞-norm. We first show the inequality ∥A∥_∞ ≤ max_{1≤i≤n} ∑_{j=1}^{n} |a_{ij}|. [Video 8.3] Let x be a vector such that ∥x∥_∞ = 1. That means that all the entries x_j of x satisfy |x_j| ≤ 1. Then
∥Ax∥_∞ = max_{1≤i≤n} | ∑_{j=1}^{n} a_{ij} x_j | ≤ max_{1≤i≤n} ∑_{j=1}^{n} |a_{ij}| |x_j| ≤ max_{1≤i≤n} ∑_{j=1}^{n} |a_{ij}|,
where the inequality follows from writing out the matrix vector product, interpreting the ∞-
norm, using the triangle inequality for the absolute value, and the fact that |x j | ≤ 1 for 1 ≤ j ≤
n. Since this holds for arbitrary x with ∥x∥∞ = 1, it also holds for the vector that maximizes
max_{∥x∥_∞=1} ∥Ax∥_∞ = ∥A∥_∞, which shows that ∥A∥_∞ ≤ max_{1≤i≤n} ∑_{j=1}^{n} |a_{ij}|.
In order to show the other direction, ∥A∥_∞ ≥ max_{1≤i≤n} ∑_{j=1}^{n} |a_{ij}|, let i′ be the value of the index at which the maximum absolute row sum is attained. Now choose y to be the vector with entries y_j = 1 if a_{i′j} ≥ 0 and y_j = −1 if a_{i′j} < 0. This vector satisfies ∥y∥_∞ = 1 and, moreover,
∑_{j=1}^{n} a_{i′j} y_j = ∑_{j=1}^{n} |a_{i′j}|,
so that ∥A∥_∞ ≥ ∥Ay∥_∞ ≥ ∑_{j=1}^{n} |a_{i′j}| = max_{1≤i≤n} ∑_{j=1}^{n} |a_{ij}|, which completes the proof.
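These formulas are easy to check numerically (a sketch; numpy.linalg.norm with ord=1 and ord=inf computes exactly these operator norms):

import numpy as np

A = np.array([[1.0, -2.0, 3.0], [4.0, 0.0, -1.0], [2.0, 5.0, 1.0]])
col_sums = np.abs(A).sum(axis=0)   # column sums of absolute values
row_sums = np.abs(A).sum(axis=1)   # row sums of absolute values
print(col_sums.max(), np.linalg.norm(A, 1))       # both give the 1-norm
print(row_sums.max(), np.linalg.norm(A, np.inf))  # both give the infinity-norm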
How do we characterize the matrix 2-norm ∥A∥2 of a matrix? The answer is in terms of Video 8.4
the eigenvalues of A. Recall that a (possibly complex) number λ is an eigenvalue of A, with
associated eigenvector u, if
Au = λu.
Definition 4.5. The spectral radius of A is defined as
ρ(A) = max{ |λ| : λ is an eigenvalue of A }.
Theorem 4.2. For an n × n matrix A, ∥A∥_2 = √(ρ(A^⊤A)).
Proof. (*) Note that for a vector x, ∥Ax∥_2^2 = x^⊤A^⊤Ax. We can therefore express the squared 2-norm of A as
∥A∥_2^2 = max_{∥x∥_2=1} ∥Ax∥_2^2 = max_{∥x∥_2=1} x^⊤A^⊤Ax.
As a continuous function over a compact set, f (x) = x ⊤ A ⊤ Ax attains its maximum over the
unit sphere {x : ∥x∥2 = 1} at some x = u, say. Using a Lagrange multiplier, there exists a pa-
rameter λ such that
∇ f (u) = 2λu. (4.6)
To compute the gradient ∇f(x), set B = A^⊤A, so that
f(x) = x^⊤Bx = ∑_{i,j=1}^{n} b_{ij} x_i x_j = ∑_{i=1}^{n} b_{ii} x_i^2 + 2 ∑_{i<j} b_{ij} x_i x_j,
where the last equality follows from the symmetry of B (that is, b_{ij} = b_{ji}). Then
∂f/∂x_k = 2 b_{kk} x_k + 2 ∑_{i≠k} b_{ki} x_i = 2 ∑_{i=1}^{n} b_{ki} x_i.
In other words, ∇f(x) = 2Bx = 2A^⊤Ax. Combined with (4.6), this shows that the maximizer u satisfies
A^⊤Au = λu,
so λ is an eigenvalue of A^⊤A with eigenvector u. Multiplying both sides on the left with u^⊤ and using u^⊤u = ∥u∥_2^2 = 1, we get
u^⊤A^⊤Au = λ u^⊤u = λ,
and since u was a maximizer of the left-hand function, it follows that λ is the maximal eigen-
value of A ⊤ A, i.e. λ = ρ(A ⊤ A). Summarizing, we have ∥A∥22 = ρ(A ⊤ A).
p(λ) = det(A − λ1) = \det\begin{pmatrix} 1-\lambda & 0 & 2 \\ 0 & 1-\lambda & -1 \\ -1 & 1 & 1-\lambda \end{pmatrix}.
Setting p(λ) = 0 leads to
(1 − λ)(λ^2 − 2λ + 4) = 0.
The solutions are given by λ_1 = 1 and λ_{2,3} = 1 ± i√3. The spectral radius of A is therefore
ρ(A) = max{1, 2, 2} = 2.
We introduced the spectral radius of a matrix, ρ(A), as the maximum absolute value of
an eigenvalue of A, and characterized the 2-norm of A as
∥A∥_2 = √(ρ(A^⊤A)).
Note that the matrix A ⊤ A is symmetric, and therefore has real eigenvalues.
For symmetric matrices A, i.e. matrices such that A ⊤ = A, the situation is simpler: the Video 8.5
2-norm is just the spectral radius.
Lemma 4.3. If A is symmetric, then ∥A∥_2 = ρ(A).
Proof. Let λ be an eigenvalue of A with corresponding eigenvector u, so that
Au = λu.
Then
A ⊤ Au = A ⊤ λu = λA ⊤ u = λAu = λ2 u.
It follows that λ^2 is an eigenvalue of A^⊤A with corresponding eigenvector u. In particular, ρ(A^⊤A) = ρ(A)^2, and therefore ∥A∥_2 = √(ρ(A^⊤A)) = ρ(A).
Example 28. We compute the eigenvalues, and thus the spectral radius and the 2-norm, of
the finite difference matrix
\[
A = \begin{pmatrix}
2 & -1 & 0 & 0 & \cdots & 0 & 0 \\
-1 & 2 & -1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 2 & -1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 2 & -1 \\
0 & 0 & 0 & 0 & \cdots & -1 & 2
\end{pmatrix}.
\]
Let h = 1/(n + 1). We first claim that the vectors u_k, 1 ≤ k ≤ n, defined by
u_k = ( sin(kπh), sin(2kπh), . . . , sin(nkπh) )^⊤
are the eigenvectors of A, with corresponding eigenvalues
λ_k = 2(1 − cos(kπh)).
This can be verified by checking that
Au k = λk u k .
In fact, for 2 ≤ j ≤ n − 1, the j-th entry of the left-hand side of the above product is given by
2 sin( j kπh) − sin(( j − 1)kπh) − sin(( j + 1)kπh).
Using the trigonometric identity sin(x + y) = sin(x) cos(y) + cos(x) sin(y), we can write this as
2 sin( j kπh) − (cos(kπh) sin( j kπh) − cos( j kπh) sin(kπh))
− (cos(kπh) sin( j kπh) + cos( j kπh) sin(kπh))
= 2(1 − cos(kπh)) · sin( j kπh).
Now sin( j kπh) is just the j -th entry of u k as defined above, so the coefficient in front must
equal the corresponding eigenvalue. The argument for j = 1 and j = n is similar.
The spectral radius is the maximum modulus of such an eigenvalue,
ρ(A) = max_{1≤k≤n} |λ_k| = 2( 1 − cos( nπ/(n + 1) ) ).
As the matrix A is symmetric, this is also equal to the matrix 2-norm of A:
∥A∥_2 = 2( 1 − cos( nπ/(n + 1) ) ).
Example 29. The Jacobi and Gauss-Seidel methods fall into this framework. Recall the de-
composition
A = L + D +U ,
where L is the lower triangular, D the diagonal, and U the upper triangular part. Then the
Jacobi method corresponds to the choice
T = TJ = −D −1 (L +U ), c = D −1 b,
while the Gauss-Seidel method corresponds to
T = TGS = −(L + D)−1U , c = (L + D)−1 b.
Lemma 4.4. Let T and c be the matrices in the iteration scheme (4.8) corresponding to either
the Jacobi method or the Gauss-Seidel method, and assume that D and L + D are invertible.
Then x is a solution of the system of equations (4.7) if and only if x is a fixed point of the
iteration (4.8), that is,
x = T x + c.
Proof. We write down the proof for the case of Jacobi’s method, the Gauss-Seidel case being
similar. We have
Ax = b ⇔ (L + D +U )x = b
⇔ D x = −(L +U )x + b
⇔ x = −D −1 (L +U )x + D −1 b
⇔ x = T x + c.
This shows the claim.
The problem of solving Ax = b is thus reduced to the problem of finding a fixed point to
an iteration scheme. The following important result shows how to bound the distance of an
iterate x k from the solution x in terms of the operator norm of T and an initial distance of x 0 .
Theorem 4.3. Let x be a solution of Ax = b, and let x k , for k ≥ 0, be a sequence of vectors such
that
x k+1 = T x k + c
R
for an n×n matrix T and a vector c ∈ n . Then, for any vector norm ∥·∥ and associated matrix
norm, we have
∥x^{k+1} − x∥ ≤ ∥T∥^{k+1} ∥x^0 − x∥
for all k ≥ 0.
Proof. We prove this by induction on k. Recall that for every vector y, we have
∥Ty∥ ≤ ∥T∥ ∥y∥.
Since x^{k+1} − x = T(x^k − x), this gives
∥x^{k+1} − x∥ = ∥T(x^k − x)∥ ≤ ∥T∥ ∥x^k − x∥.   (4.9)
Setting k = 0 gives the claim of the theorem for this case. If we assume that the claim holds for k − 1, k ≥ 1, then
∥x^k − x∥ ≤ ∥T∥^k ∥x^0 − x∥
by this assumption, and plugging this into (4.9) finishes the proof.
Corollary 4.2. Assume that in addition to the assumptions of Theorem 4.3, we have ∥T ∥ < 1.
Then the sequence x k , k ≥ 0, converges to a fixed point x with x = T x + c, with respect to the
chosen norm ∥·∥.
Proof. Assume x 0 ̸= x (otherwise there is nothing to prove) and let ε > 0. Since ∥T ∥ < 1,
∥T ∥k → 0 as k → ∞. In particular, there exists an integer N > 1 such that for all k > N ,
∥T∥^k < ε / ∥x^0 − x∥.
It follows that for k > N we have ∥x^k − x∥ < ε, which completes the convergence proof.
Recall that for the Gauss-Seidel and Jacobi methods, a fixed point of x = T x + c was the
same as a solution of Ax = b. It follows that the Gauss-Seidel and Jacobi methods converge
to a solution (with respect to some norm) provided that ∥T ∥ < 1. Note also that either one
of ∥T ∥∞ < 1 or ∥T ∥2 < 1 will imply convergence with respect to both the 2-norm and the
∞-norm. The reason is the equivalence of norms (Lemma 4.1)
∥x∥_∞ ≤ ∥x∥_2 ≤ √n ∥x∥_∞,
which implies that if the sequence x k , k ≥ 0, converges to x with respect to one of these
norms, it also converges with respect to the other one. Such an equivalence can also be shown
between the 2- and the 1-norm.
So far we have seen that the condition ∥T ∥ < 1 ensures that an iterative scheme of the Video 8.7
form (4.8) converges to a vector x such that x = Tx + c as k → ∞. The converse is not true: there are examples for which ∥T∥ ≥ 1 but the iteration (4.8) converges nevertheless.
Example 30. Recall the finite difference matrix
A = ( 2  −1   0  ···  0
     −1   2  −1  ···  0
      0  −1   2  ···  0
      ⋮    ⋮    ⋮   ⋱   ⋮
      0   0   0  ···  2 )
and apply the Jacobi method to compute a solution of Ax = b. The Jacobi method computes the sequence x^{k+1} = T x^k + c, and for this particular A we have c = (1/2) b and
T = T_J = −D^{−1}(L + U) = (1/2) ( 0  1  0  ···  0
                                   1  0  1  ···  0
                                   0  1  0  ···  0
                                   ⋮   ⋮   ⋮   ⋱   ⋮
                                   0  0  0  ···  0 ).
We have ∥T ∥∞ = 1, so the convergence criterion doesn’t apply for this norm. However, one
can show that all the eigenvalues satisfy |λ| < 1. Since the matrix T is symmetric, we have
∥T ∥2 = ρ(T ) < 1,
where ρ(T ) denotes the spectral radius. It follows that the iteration (4.8) does converge with
respect to the 2-norm, and therefore also with respect to the ∞-norm, despite having ∥T ∥∞ =
1.
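The gap between the ∞-norm and the spectral radius in this example can be checked numerically. Here is a small sketch (an illustration, not part of the notes), with n = 10 as an arbitrary choice.

    import numpy as np

    n = 10                                                   # illustrative size
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)     # finite difference matrix
    D = np.diag(np.diag(A))
    T = -np.linalg.solve(D, A - D)                           # T_J = -D^{-1}(L+U)

    print(np.linalg.norm(T, np.inf))                         # 1.0: the criterion does not apply
    print(max(abs(np.linalg.eigvals(T))))                    # spectral radius, about 0.96 < 1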
It turns out that the spectral radius gives rise to a necessary and sufficient condition for
convergence.
Theorem 4.4. The iterates x k of (4.8) converge to a solution x of x = T x + c for all starting
points x 0 if and only if ρ(T ) < 1.
Proof. (*) Let x 0 be any starting point, and define, for all k ≥ 0,
z k = x k − x.
Then z k+1 = T z k , as is easily verified. The convergence of the sequence x k to x is then equiv-
alent to the convergence of z k to 0.
Assume T has n eigenvalues λ_k (possibly 0), 1 ≤ k ≤ n. We will only prove the claim for the case where the corresponding eigenvectors u_k form a basis of ℝ^n (equivalently, that T is diagonalizable), and mention below how the general case can be deduced. We can write
z^0 = Σ_{j=1}^n α_j u_j,        (4.10)
and then
z^{k+1} = T z^k = T^{k+1} z^0 = T^{k+1} ( Σ_{j=1}^n α_j u_j ) = Σ_{j=1}^n α_j T^{k+1} u_j = Σ_{j=1}^n α_j λ_j^{k+1} u_j.
Now suppose ρ(T) < 1. Then |λ_j| < 1 for all eigenvalues λ_j, and therefore λ_j^{k+1} → 0 as k → ∞. Therefore, z^{k+1} → 0 as k → ∞ and x^{k+1} → x. If, on the other hand, ρ(T) ≥ 1, then there exists an index j such that |λ_j| ≥ 1. If we choose a starting point x^0 such that the coefficient α_j in (4.10) is not zero, then |α_j λ_j^{k+1}| ≥ |α_j| for all k and we deduce that z^{k+1} does not converge to zero.
If T is not diagonalizable, then we still have the Jordan normal form J = P −1 T P , where P
is an invertible matrix and J consists of Jordan blocks
( λ_i   1  ···   0
   0  λ_i  ···   0
   ⋮    ⋮    ⋱    ⋮
   0    0  ···  λ_i )
on the diagonal for each eigenvalue λ_i. Rather than considering a basis of eigenvectors, we take one consisting of generalized eigenvectors, that is, solutions to the equation
(T − λ_i 1)^k u = 0
for some k ≥ 1, and the argument goes through along similar lines.
Gershgorin's Theorem states that every eigenvalue λ of an n × n matrix A = (a_ij) lies in at least one of the discs |λ − a_ii| ≤ r_i, where r_i = Σ_{j≠i} |a_ij|. To see this, let u be an eigenvector of A with eigenvalue λ, so that
Au = λu.
Looking at the i-th row of this equation, we have (λ − a_ii) u_i = Σ_{j≠i} a_ij u_j. If the index i is such that u_i is the component of u with largest absolute value, then the right-hand side is bounded by r_i |u_i|, and we get
|λ − a_ii| ≤ r_i,
which proves the claim.
Gershgorin’s Theorem has implications on the convergence of Jacobi’s method. To state Video 9.2
these implications, we need a definition.
Definition 4.6. A matrix A is called diagonally dominant, if for all indices i we have
|a_ii| > r_i = Σ_{j≠i} |a_ij|.
Corollary 4.3. Let A be diagonally dominant. Then the Jacobi method converges to a solution
of the system Ax = b for any starting point x 0 .
Proof. We need to show that if A is diagonally dominant, then ρ(T J ) < 1, where T J = −D −1 (L+
U) is the iteration matrix of Jacobi's method. The i-th row of T_J is
−(1/a_ii) ( a_i1  ···  a_{i,i−1}  0  a_{i,i+1}  ···  a_in ).
By Gershgorin's Theorem, all the eigenvalues of T_J lie in a disc around 0 of radius
r_i / |a_ii| = (1/|a_ii|) Σ_{j≠i} |a_ij|.
It follows that if A is diagonally dominant, then r_i / |a_ii| < 1, and therefore |λ| < 1 for all eigenvalues λ of T_J. Thus, ρ(T_J) < 1 and so Jacobi's method converges for any x^0.
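As a quick illustration of the corollary, here is a sketch (not from the notes; the matrix below is an arbitrary diagonally dominant example) that verifies diagonal dominance and then checks that the spectral radius of T_J is indeed below 1.

    import numpy as np

    def is_diagonally_dominant(A):
        # |a_ii| > sum_{j != i} |a_ij| for every row i
        d = np.abs(np.diag(A))
        r = np.sum(np.abs(A), axis=1) - d
        return bool(np.all(d > r))

    A = np.array([[5.0, 1.0, 2.0],
                  [1.0, 6.0, 2.0],
                  [0.0, 2.0, 4.0]])

    D = np.diag(np.diag(A))
    TJ = -np.linalg.solve(D, A - D)
    print(is_diagonally_dominant(A))         # True
    print(max(abs(np.linalg.eigvals(TJ))))   # spectral radius, guaranteed < 1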
Consider, for a small parameter ε > 0, the system Ax = b with
A = ( ε  1
      0  1 ),      b = ( 1 + δ
                         1 ).
We can think of δ as representing the effect of rounding error. Thus δ = 0 would give us an
exact solution, while if δ is small and ε ≪ δ, then the change in x due to δ ̸= 0 can be large!
The following definition is deliberately vague, and will be made more precise in light of
the condition number.
Definition 4.7. A system of equations Ax = b is called ill-conditioned, if small changes in the
system cause large changes in the solution.
For an invertible matrix A, the condition number with respect to a matrix norm ∥·∥ is defined as cond(A) = ∥A∥ ∥A^{−1}∥. We write cond_1(A), cond_2(A), cond_∞(A) for the condition number with respect to the 1, 2, and ∞ norms.
Let x be the true solution of a system of equations Ax = b, and let x c = x + ∆x be the Video 9.4
solution of a perturbed system
A(x + ∆x) = b + ∆b, (4.11)
where ∆b is a perturbation to b. We are interested in bounding the relative error in the solu-
tion
∥∆x∥ / ∥x∥
in terms of the relative error in b, which is ∥∆b∥ / ∥b∥. Subtracting Ax = b from (4.11) gives A ∆x = ∆b, so ∆x = A^{−1} ∆b and ∥∆x∥ ≤ ∥A^{−1}∥ ∥∆b∥. Moreover, ∥b∥ = ∥Ax∥ ≤ ∥A∥ ∥x∥, so that 1/∥x∥ ≤ ∥A∥/∥b∥. Combining the two bounds, we obtain
∥∆x∥ / ∥x∥ ≤ ∥A∥ ∥A^{−1}∥ · ∥∆b∥ / ∥b∥ = cond(A) · ∥∆b∥ / ∥b∥.        (4.12)
The condition number therefore bounds the relative error in the solution in terms of the relative error in b. We can also derive a similar bound for perturbations ∆A in the matrix A. Note that a small condition number is a good thing, as it guarantees that a small relative error in b leads to a small relative error in the solution.
The above analysis can also be rephrased in terms of the residual of a computed solution.
Suppose we have A and b exactly, but solving the system Ax = b by a computational method
gives a computed solution x c = x + ∆x that has an error. We don’t necessarily know the error,
but we have access to the residual
r = Ax c − b.
We can rewrite this equation as in (4.11), with r instead of ∆b, so that we can interpret the
residual as a perturbation of b. The condition number bound (4.12) therefore implies
∥∆x∥ / ∥x∥ ≤ cond(A) · ∥r∥ / ∥b∥.
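The residual bound is easy to check experimentally. Below is a minimal sketch (not part of the notes) in which an artificial error is added to the exact solution; all norms are 2-norms, and numpy.linalg.cond uses the 2-norm by default.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))          # an arbitrary test matrix
    x = np.ones(5)
    b = A @ x

    x_c = x + 1e-6 * rng.standard_normal(5)  # "computed" solution with an artificial error
    r = A @ x_c - b                          # residual

    lhs = np.linalg.norm(x_c - x) / np.linalg.norm(x)
    rhs = np.linalg.cond(A) * np.linalg.norm(r) / np.linalg.norm(b)
    print(lhs, rhs)                          # lhs should not exceed rhs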
For the 2 × 2 matrix A = (ε 1; 0 1) from the example above, one computes
cond_1(A) = 2(1 + ε)/ε,      cond_∞(A) = 2(1 + ε)/ε.
Since ε is small, the condition numbers are large and therefore we cannot guarantee small
errors.
Example 34. A well-known example is the Hilbert matrix. Let Hn by the n × n matrix with
entries
h_ij = 1/(i + j − 1)
for 1 ≤ i , j ≤ n. This matrix is symmetric (Hn⊤ = Hn ) and positive definite (x ⊤ Hn x > 0 for all
x ̸= 0). For example, for n = 3 the matrix looks as follows
H_3 = ( 1    1/2  1/3
        1/2  1/3  1/4
        1/3  1/4  1/5 ).
Examples such as the Hilbert matrix are not common in applications, but they serve as a
reminder that one should keep an eye on the conditioning of a matrix.
The condition number cond_2(H_n) grows extremely fast with n:

n             5           10           15           20
cond_2(H_n)   4.8 × 10^5  1.6 × 10^13  6.1 × 10^20  2.5 × 10^28

[Figure: log of cond_2(H_n) plotted against n.]
It can be shown that the condition number of the Hilbert matrix is asymptotically
cond_2(H_n) ∼ (√2 + 1)^{4n+4} / (2^{15/4} √(πn))
for n → ∞. To see the effect that this conditioning has on solving systems of equations, let’s
look at a system
Hn x = b,
with entries b_i = Σ_{j=1}^n 1/(i + j − 1). The system is constructed such that the solution is x = (1, . . . , 1)^⊤. For n = 20 we get, solving the system using Matlab, a computed solution x + ∆x which differs from the true solution by a relative error of
∥∆x∥_2 / ∥x∥_2 ≈ 44.9844.
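The same experiment can be reproduced in Python (the notes used Matlab; the exact numbers will differ between systems). A minimal sketch:

    import numpy as np

    n = 20
    i = np.arange(1, n + 1)
    H = 1.0 / (i[:, None] + i[None, :] - 1)      # Hilbert matrix h_ij = 1/(i+j-1)
    x = np.ones(n)
    b = H @ x                                     # right-hand side with exact solution (1,...,1)

    x_c = np.linalg.solve(H, b)                   # computed solution
    print(np.linalg.norm(x_c - x) / np.linalg.norm(x))   # large relative error (the notes report about 45)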
For the finite difference matrix A of Example 30 one can show that
cond_2(A) ∼ 4 / (π² h²),
where h = 1/(n + 1). It follows that the condition number increases with the number of discretisation steps n.
Example 36. What is the condition number of a random matrix? If we generate random 100×
100 matrices with normally distributed entries and look at the frequency of the logarithm of
the condition number, then we get the following:
[Histogram: frequency of log(cond_2(A)) for random 100 × 100 matrices.]
It should be noted that a random matrix is not the same as “any old matrix”, and equally
not the same as a typical matrix arising in applications, so one should be careful in interpret-
ing statements about random matrices!
Computing the condition number can be difficult, as it involves computing the inverse
of a matrix. In many cases one can find good bounds on the condition number, which can,
for example, be used to tell whether a problem is ill-conditioned.
Example 37. Consider the matrix
A = ( 1  1
      1  1.0001 ),      A^{−1} = 10^4 ( 1.0001  −1
                                        −1       1 ).
The condition number with respect to the ∞-norm is given by cond∞ (A) ≈ 4×104 . We would
like to find an estimate for this condition number without having to invert the matrix A. To Video 9.6
do this, note that for any x and b = Ax we have
Ax = b ⇒ x = A^{−1} b ⇒ ∥x∥ ≤ ∥A^{−1}∥ ∥b∥,
so that ∥A^{−1}∥ ≥ ∥x∥ / ∥b∥. Choosing, for instance, x = (1, −1)^⊤ gives b = Ax = (0, −10^{−4})^⊤, and hence
cond_∞(A) = ∥A∥_∞ ∥A^{−1}∥_∞ ≥ ∥A∥_∞ ∥x∥_∞ / ∥b∥_∞ ≈ 2 × 10^4.
This estimate is of the right order of magnitude (in particular, it shows that the condition
number is large), and no inversion was necessary.
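This estimation trick is straightforward to carry out numerically; here is a brief sketch (not part of the notes) using the trial vector x = (1, −1)^⊤ from above.

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])

    x = np.array([1.0, -1.0])        # a trial vector whose image under A is small
    b = A @ x                        # equals (0, -1e-4)

    estimate = np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf) / np.linalg.norm(b, np.inf)
    exact = np.linalg.cond(A, np.inf)
    print(estimate, exact)           # about 2e4 versus about 4e4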
To summarize:
• A small condition number is a good thing, as small changes in the data lead to small
changes in the solution.
• Condition numbers may depend on the problem from which the matrix arises, and can
be very large.
• A large condition number indicates that the matrix is “close” to being singular.
Condition numbers also play a role in the convergence analysis of iterative matrix algo-
rithms. We will not discuss this aspect here and refer to more advanced lectures on numerical
linear algebra and matrix analysis. End week 9
5 Non-linear Equations
Given a function f : R → R, we would like to find a solution to the equation Video 10.1
f (x) = 0. (5.1)
For example, if f is a polynomial of degree 2, we can write down the solutions in closed form
(though, as seen in Section 1, this by no means solves the problem from a numerical point of
view!). In general, we will encounter functions for which a closed form does not exist, or it is
not convenient to write down or evaluate. The best way to deal with (5.1) is then to find an
approximate solution using an iterative method. Here we will discuss two methods:
• The bisection method.
• Newton's method.
The bisection method only requires that f be continuous, while Newton’s method requires
differentiability too, but is faster.
Example 38. Let’s look at the polynomial x 6 − x − 1 on the interval [1, 2] with tolerance TOL =
0.2 (that is, we stop when we have located a sub-interval of length ≤ 0.2 containing the root
x). Note that there is no general closed-form solution in radicals for polynomial equations of degree ≥ 5. The bisection method is best carried out in the form of a table. At each step the midpoint p_n of [a_n, b_n] is computed; if f(a_n) f(p_n) < 0, then p_n becomes the new right endpoint, and otherwise it becomes the new left endpoint.

n   a_n       f(a_n)       b_n     f(b_n)    p_n      f(p_n)
1   1         −1           2       61        1.5      8.8906
2   1         −1           1.5     8.8906    1.25     1.5647
3   1         −1           1.25    1.5647    1.125    −0.097713
4   1.125     −0.097713    1.25    1.5647    1.1875

We see that |b_4 − a_4| = 0.125 < TOL, so we stop there and declare p_4 = 1.1875 to be our solution.
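A minimal Python sketch of the bisection method (not part of the notes) that reproduces the result of this example:

    def bisection(f, a, b, tol):
        # Assumes f is continuous on [a, b] with f(a) f(b) < 0.
        if f(a) * f(b) >= 0:
            raise ValueError("f(a) and f(b) must have opposite signs")
        while b - a > tol:
            p = (a + b) / 2
            if f(a) * f(p) < 0:
                b = p                # the root lies in [a, p]
            else:
                a = p                # the root lies in [p, b]
        return (a + b) / 2           # final midpoint

    f = lambda x: x**6 - x - 1
    print(bisection(f, 1.0, 2.0, 0.2))   # 1.1875, as in the table above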
The following result shows that the bisection method indeed approximates a zero of f to
arbitrary precision.
Lemma 5.1. Let f : ℝ → ℝ be a continuous function on an interval [a, b] and let p_n, n ≥ 1, Video 10.3
be the sequence of midpoints generated by the bisection method on f . Let x be such that
f (x) = 0. Then
|p_n − x| ≤ (1/2^n) |b − a|.
In particular, p n → x as n → ∞.
Proof. Let x ∈ [a, b] be such that f (x) = 0. Since p n is the midpoint of [a n , b n ] and x ∈ [a n , b n ],
we have
|p_n − x| ≤ (1/2) |b_n − a_n|.
By bisection, each interval has half the length of the preceding one:
|b_n − a_n| = (1/2) |b_{n−1} − a_{n−1}|.
Therefore,
|p_n − x| ≤ (1/2) |b_n − a_n| = (1/2²) |b_{n−1} − a_{n−1}| = ··· = (1/2^n) |b_1 − a_1| = (1/2^n) |b − a|.
This completes the proof.
Newton's method is based on the following idea. Given a point x_n with function value f(x_n), we need to find the root of the tangent line to the function at (x_n, f(x_n)):
y = f ′ (x n )(x − x n ) + f (x n ) = 0.
Solving this for x, we get
x = x_n − f(x_n) / f′(x_n),
which is well-defined provided f ′ (x n ) ̸= 0. Formally, Newton’s method is as follows:
x_{n+1} = x_n − f(x_n) / f′(x_n),    starting from an initial guess x_1.
Example 39. Consider again the function f (x) = x 6 − x − 1. The derivative is f ′ (x) = 6x 5 − 1.
We apply Newton’s method using a tolerance TOL = 0.001. We get the sequence:
x_1 = 1
x_2 = x_1 − f(x_1)/f′(x_1) = 1.2
x_3 = x_2 − f(x_2)/f′(x_2) = 1.1436
x_4 = 1.1349
x_5 = 1.1347.
The difference |x 5 − x 4 | is below the given tolerance, so we stop and declare x 5 to be our
solution. We can already see that in just four iterations we get a far better approximation
than by using the bisection method.
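The corresponding Python sketch for Newton's method (again an illustration, not part of the notes), with the same stopping rule |x_{n+1} − x_n| < TOL:

    def newton(f, fprime, x, tol, max_iter=50):
        # Newton's method: x <- x - f(x)/f'(x) until successive iterates are within tol.
        for _ in range(max_iter):
            x_new = x - f(x) / fprime(x)
            if abs(x_new - x) < tol:
                return x_new
            x = x_new
        raise RuntimeError("no convergence within max_iter iterations")

    f = lambda x: x**6 - x - 1
    fprime = lambda x: 6 * x**5 - 1
    print(newton(f, fprime, 1.0, 1e-3))   # approximately 1.1347, as in Example 39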
One can show that, provided we start “sufficiently close” to x, the error satisfies |x_{n+1} − x| ≤ k |x_n − x|^2 for a constant k; in other words, Newton's method converges quadratically. This will be shown using the theory of fixed point iterations, discussed in the next section.
Newton’s method is not without difficulties. One can easily come up with starting points
where the method does not converge. One example is when f ′ (x 1 ) ≈ 0, in which case the tan-
gent line at (x 1 , f (x 1 )) is almost horizontal and takes us far away from the solution. Another
issue arises when the iteration oscillates between two values, as in the following example.
[Figure: the function y = x³ − 2x + 2, for which Newton's method can oscillate between two values.]
Given a function g, the associated fixed point iteration generates a sequence by
x_{n+1} = g(x_n).
We construct fixed point iterations in the hope that they will converge to a fixed point of g, that is, a point x with g(x) = x. We will study conditions under which this happens.
Example 40. Let f (x) = x 3 +4x 2 −10. There are several ways to rephrase the problem f (x) = 0 Video 10.6
as a fixed point problem g (x) = x.
1. Let g_1(x) = x − x³ − 4x² + 10. Then g_1(x) = x ⇔ x³ + 4x² − 10 = 0 ⇔ f(x) = 0.
2. Let g_2(x) = (1/2) (10 − x³)^{1/2}. Suppose x ≥ 0. Then g_2(x) = x ⇔ x² = (1/4)(10 − x³) ⇔ f(x) = 0.
3. Let g_3(x) = (10/(4 + x))^{1/2}. Then (if x ≥ 0) it is also not difficult to verify that g_3(x) = x is equivalent to f(x) = 0.
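A short sketch (not from the notes) that runs a few steps of each reformulation from the arbitrary starting value x_1 = 1.5, so that the different behaviours can be seen:

    import math

    g1 = lambda x: x - x**3 - 4 * x**2 + 10
    g2 = lambda x: 0.5 * math.sqrt(10 - x**3)
    g3 = lambda x: math.sqrt(10 / (4 + x))

    for name, g in [("g1", g1), ("g2", g2), ("g3", g3)]:
        x = 1.5                              # arbitrary starting value in [1, 2]
        xs = [x]
        for _ in range(5):
            x = g(x)
            xs.append(round(x, 6))
        print(name, xs)
    # g1 blows up quickly, while g2 and g3 approach the root x ≈ 1.3652 of f.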
Example 41. We briefly discuss a more intriguing example, the logistic map
g (x) = r x(1 − x)
with a parameter r ∈ [0, 4]. Whether the iteration x n+1 = g (x n ) converges to a fixed point, and
how it converges, depends on the value of r . Three examples are shown below.
[Figure: iterates x_n of the logistic map for three different values of r.]
If we record the movement of x n for r ranging between 0 and 4, the following bifurcation
diagram emerges:
[Figure: bifurcation diagram of the logistic map, plotting the long-run values of x against r ∈ [0, 4].]
It turns out that for small values of r we have convergence (which, incidentally, does not
depend on the starting value). For values slightly above 3, the iterates oscillate between two, and then four, values, while for larger r we have “chaotic” behaviour. In the latter region, the trajectory
of x n is also highly sensitive to perturbations of the initial value x 1 . The precise behaviour of
such iterations is studied in dynamical systems.
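The behaviour described above is easy to reproduce; the sketch below (not part of the notes) prints the last few iterates for three illustrative parameter values.

    def logistic_orbit(r, x1=0.3, n=50):
        # Iterate x_{k+1} = r x_k (1 - x_k) and return the first n values.
        xs = [x1]
        for _ in range(n - 1):
            xs.append(r * xs[-1] * (1 - xs[-1]))
        return xs

    for r in (2.8, 3.2, 3.9):                         # illustrative parameter values
        tail = [round(v, 4) for v in logistic_orbit(r)[-4:]]
        print(r, tail)
    # r = 2.8 settles to a fixed point, r = 3.2 oscillates between two values,
    # r = 3.9 shows irregular ("chaotic") behaviour.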
Given a fixed point problem, the all important question is: when does the iteration x n+1 = Video 10.7
g (x n ) converge? The following theorem gives an answer to this question.
Theorem 5.1. (fixed point theorem) Let g be a smooth function on [a, b]. Suppose
(1) g(x) ∈ [a, b] for all x ∈ [a, b], and
(2) there exists λ < 1 such that |g′(x)| ≤ λ for all x ∈ (a, b).
Then there exists a unique fixed point x = g(x) in [a, b] and the sequence {x_n} defined by x_{n+1} = g(x_n), with x_1 ∈ [a, b], converges to x. Moreover,
|x_{n+1} − x| ≤ λ^n |x_1 − x|
for all n ≥ 1.
Proof. Let f (x) = g (x) − x. Then by (1), f (a) = g (a) − a ≥ 0, and f (b) = g (b) − b ≤ 0. By the Video 11.1
intermediate value theorem, there exists an x ∈ [a, b] such that f (x) = 0. Hence, there exists
x ∈ [a, b] such that g (x) = x, showing the existence of a fixed point.
Next, consider x n+1 = g (x n ) for n ≥ 1, and let x = g (x) be a fixed point. Then
x n+1 − x = g (x n ) − g (x).
Assume without loss of generality that x_n > x. By the mean value theorem there exists ξ ∈ (x, x_n)
such that
g′(ξ) = (g(x_n) − g(x)) / (x_n − x),
and hence
x n+1 − x = g ′ (ξ)(x n − x).
Since ξ ∈ (a, b), assumption (2) gives |g′(ξ)| ≤ λ with λ < 1. Hence,
|x_{n+1} − x| ≤ λ |x_n − x| ≤ ··· ≤ λ^n |x_1 − x|.
This proves the convergence. To show uniqueness, suppose x, y are two distinct fixed points
of g with x < y. By the mean value theorem and assumption (2), there exists ξ ∈ (x, y) such
that
|(g(x) − g(y)) / (x − y)| = |g′(ξ)| < 1.
But since both x and y are fixed points, we have
(g(x) − g(y)) / (x − y) = (x − y) / (x − y) = 1,
so we have a contradiction, and so the fixed point is unique.
Example 42. Let’s look at the functions from Example 40 to see for which one we have con- Video 10.7
vergence.
1. g_1(x) = x − x³ − 4x² + 10 on [a, b] = [1, 2]. Note that g_1(1) = 6 ∉ [1, 2], therefore assumption (1) is violated.
2. g_2(x) = (1/2)(10 − x³)^{1/2} on [a, b] = [1, 2]. Here
g_2′(x) = −3x² / (4 (10 − x³)^{1/2}),
and therefore g_2′(2) ≈ −2.12. Condition (2) fails.
3. g_3(x) = (10/(4 + x))^{1/2} on [a, b] = [1, 2]. One can check that g_3 maps [1, 2] into [1, 2] and that |g_3′(x)| < 1 on (1, 2), so both conditions hold and the iteration x_{n+1} = g_3(x_n) converges.
Newton's method can itself be viewed as a fixed point iteration, with g(x) = x − f(x)/f′(x). If α is a root of f with f′(α) ≠ 0, then
g′(x) = f(x) f″(x) / f′(x)²,
and therefore
g′(α) = 0.
Hence, |g′(α)| < 1 at the fixed point. Now let ε > 0 be sufficiently small and a = α − ε, b = α + ε.
Then, by continuity,
|g′(x)| < 1/2
for x ∈ [a, b], and (2) holds. Furthermore,
|g(x) − α| = |g′(ξ)| |x − α| ≤ (1/2) |x − α| < ε
for x ∈ [a, b]. Hence, g (x) ∈ [a, b] and (1) holds. It follows that in a small enough neighbour-
hood of a root of f (x), Newton’s method converges to that root (provided f ′ (x) ̸= 0 at that
root).
Note that the argument with ε illustrates a key aspect of Newton's method: convergence is only guaranteed when the initial guess x_1 is close enough to a root of f. What “close enough” means is
often not so clear. In the next section we derive a stronger result for Newton’s method, namely
that it converges quadratically if the starting point is close enough.
A sequence x_n converging to a limit α is said to converge with order r ≥ 1 if
|x_{n+1} − α| ≤ k |x_n − α|^r
with k > 0. If the sequence converges with order r = 2, it is said to converge quadratically.
Example 44. Consider the sequence x_n = 1/2^{r^n} for r > 1. Then x_n → 0 as n → ∞. Note that
x_{n+1} = 1/2^{r^{n+1}} = 1/(2^{r^n})^r = x_n^r,
so the sequence converges to 0 with order r.
For example, if a sequence converges quadratically, then if |x n − α| ≤ 0.1, in the next step
we have |x n+1 −α| ≤ k ·0.01. We would like to show that Newton’s method converges quadrat-
ically to a root of a function f if we start the iteration sufficiently close to that root.
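Before proving this, a quick numerical check (a sketch, not part of the notes): for Newton's method applied to f(x) = x^6 − x − 1 the ratios |x_{n+1} − α| / |x_n − α|^2 should remain roughly constant; here α is approximated by a highly converged iterate.

    f = lambda x: x**6 - x - 1
    fp = lambda x: 6 * x**5 - 1

    xs = [1.0]                                   # Newton iterates starting from x_1 = 1
    for _ in range(8):
        xs.append(xs[-1] - f(xs[-1]) / fp(xs[-1]))
    alpha = xs[-1]                               # proxy for the exact root

    for n in range(4):
        e_n = abs(xs[n] - alpha)
        e_next = abs(xs[n + 1] - alpha)
        print(e_next / e_n**2)                   # roughly constant, consistent with order 2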
Theorem 5.2. Let g be twice continuously differentiable in a neighbourhood of a fixed point α. Video 11.4
The fixed point iteration x_{n+1} = g(x_n) converges quadratically to α if g′(α) = 0 and the start-
ing point x 1 is sufficiently close to α.
Again, “sufficiently close” means that there exists an interval [a, b] containing α for which the conclusion holds whenever x_1 ∈ [a, b].
Proof. By Taylor expansion around the fixed point α, for x close to α we have
g(x) = g(α) + g′(α)(x − α) + (1/2) g″(α)(x − α)² + R,
where the remainder satisfies R = O((x − α)³). Since g′(α) = 0, this gives
g(x) − g(α) = (1/2) g″(α)(x − α)² + R = ((1/2) g″(α) + R_1)(x − α)²,
where R_1 = R/(x − α)² = O(x − α), so that |R_1| ≤ Cε for a constant C when |x − α| ≤ ε. Taking absolute values, we get
|g(x) − g(α)| ≤ k |x − α|²
for a constant k. Set x = x_n, x_{n+1} = g(x_n), α = g(α). Then g(x_n) − g(α) = x_{n+1} − α and
|x_{n+1} − α| ≤ k |x_n − α|²,
which is quadratic convergence.
Corollary 5.1. Newton’s method converges quadratically if we start sufficiently close to a root.
In summary, we have the following points worth noting about the bisection method and
Newton’s method.
• The bisection method requires that f is continuous on [a, b], and that f (a) f (b) < 0.
• Newton’s method requires that f is continuous and differentiable, and moreover re-
quires a good starting point x 1 .
• The bisection method converges linearly, while Newton’s method converges quadrati-
cally.
Newton's method can equally be applied to a complex function, for example f(z) = z^k − 1, whose roots are the k-th roots of unity. The iteration
z_{n+1} = z_n − f(z_n) / f′(z_n)
converges to one of these roots of unity if we start close enough. But what happens at the
boundaries? The following picture illustrates the behaviour of Newton’s method for this func-
tion in the complex plane, where each colour indicates the root to which a given starting value
converges.
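One way to produce such a picture is sketched below (not part of the notes; the cubic f(z) = z³ − 1 is an illustrative choice). Each grid point is labelled by the index of the root that Newton's method reaches from it; plotting these labels in colour gives the familiar fractal basin boundaries.

    import numpy as np

    roots = np.array([np.exp(2j * np.pi * k / 3) for k in range(3)])   # cube roots of unity

    def newton_basin(z, iters=40):
        # Index of the root of z^3 - 1 that Newton's method reaches from z.
        for _ in range(iters):
            if z == 0:                     # derivative vanishes; skip this starting point
                return -1
            z = z - (z**3 - 1) / (3 * z**2)
        return int(np.argmin(np.abs(roots - z)))

    xs = np.linspace(-1.5, 1.5, 40)        # coarse grid over a square in the complex plane
    for y in xs:
        row = [newton_basin(complex(x, y)) for x in xs]
        print("".join("." if k < 0 else str(k) for k in row))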
End week 11