Applied Scientific Computing With Python
Peter R. Turner
Thomas Arildsen
Kathleen Kavanagh
Texts in Computer Science
Series editors
David Gries, Department of Computer Science, Cornell University, Ithaca, NY,
USA
Orit Hazzan, Faculty of Education in Technology and Science, Technion—Israel
Institute of Technology, Haifa, Israel
Fred B. Schneider, Department of Computer Science, Cornell University, Ithaca,
NY, USA
More information about this series at http://www.springer.com/series/3191
Peter R. Turner, Clarkson University, Potsdam, NY, USA
Kathleen Kavanagh, Clarkson University, Potsdam, NY, USA
Thomas Arildsen, Aalborg University, Aalborg, Denmark
Preface
Our choice of Python for the computer language is based on the desire to
minimize the overhead of learning a high-level language or of the intricacies (and
cost) of a specific application package. The coding examples here are intended to be
relatively easily readable; they are not intended to be professional-level software.
The Python code is there to facilitate the learning and understanding of the basic
material, rather than being an objective in itself.
Turning briefly to the content of the book, we gave considerable thought to the
ordering of material and believe that the order we have is one good way to organize
a course, though of course every instructor will have his/her own ideas on that. The
chapters are largely independent and so using them in a different order will not be
problematic.
We begin with a brief background chapter that simply introduces the main topics:
applications and modeling, Python programming, and sources of error. The latter is
exemplified at this point with simple series expansions which should be familiar
from calculus but need nothing more. These series expansions also demonstrate that
the “most obvious” approach that a beginning student might adopt will not always
be practical. As such it serves to motivate the need for more careful analysis of
problems and their solutions. Chapter 2 is still somewhat foundational with its focus
on number representation and errors. The impact of how numbers are represented in
a computer, and the effects of rounding and truncation errors recur in discussing
almost any computational solution to any real-life problem.
From Chap. 3 onwards, we are more focused on modeling, applications, and the
numerical methods needed to solve them. In Chap. 3 itself the focus is on numerical
calculus. We put this before interpolation to reflect the students’ familiarity with the
basic concepts and likely exposure to at least some simple approaches. These are
treated without any explicit reference to finite difference schemes.
Chapters 4 and 5 are devoted to linear and then nonlinear equations. Systems of
linear equations are another topic with which students are somewhat familiar, at
least in low dimension, and we build on that knowledge to develop more robust
methods. The nonlinear equation chapter also builds on prior knowledge and is
motivated by practical applications. It concludes with a brief treatment of the
multivariable Newton’s method.
The final two chapters are on interpolation and the numerical solution of dif-
ferential equations. Polynomial interpolation is based mostly on a divided differ-
ence approach which then leads naturally to splines. Differential equations start
from Euler’s method, and then Runge-Kutta methods, for initial value problems. Shooting
methods for two-point boundary value problems are also covered, tying in many
ideas from the previous chapters.
All of this material is presented in a gentle way that relies very heavily on
applications and includes working Python code for many of the methods. The
intention is to enable students to meet the combined demands of the mathematics,
the computing, the applications, and the motivation, and so gain a good working
insight into the fundamental issues of scientific computing. The impact of the
inevitable reliance on algebraic manipulation is largely minimized through careful
explanation of those details to help the student appreciate the essential material.
All of us have benefited from many helpful discussions both on the philosophy
and the details of teaching this content over many years–more for some than others!
Those influencers are too many to list, and so we simply thank them en masse for
all the helpful conversations we have had over the years.
The need for a workforce with interdisciplinary problem solving skills is critical and
at the heart of that lies applied scientific computing. The integration of computer
programming, mathematics, numerical methods, and modeling can be combined
to address global issues in health, finances, natural resources, and a wide range
of complex systems across scientific disciplines. Those types of issues all share
a unique property–they are open-ended questions. There is not necessarily one right
answer to the question, “How should we manage natural resources?”
As a matter of fact, that question in and of itself needs to be better defined, but we
know it is an issue. Certainly “a solution” requires the use of mathematics, most
likely through the creation, application and refinement of innovative mathematical
models that represent the physical situation. Solving those models requires computer
programming, efficient and accurate simulation tools, and likely the analysis and
incorporation of large data sets.
In this chapter, we motivate the need for applied scientific computing and provide
some of the foundational ideas used throughout the rest of the text. We proceed
by describing mathematical modeling which is our primary tool for developing the
mathematical problems that we need scientific computing to address. Then, we point
to how computational science is at the heart of solving real-world problems, and
finally we review some important mathematical ideas needed throughout.
To get an idea of what math modeling is all about, consider the following questions:
(1) A new strain of the flu has surfaced. How significant is the outbreak? and (2)
A sick person infects two people per day. If your dorm consists of 100 people, and two
people are initially sick, how long before everyone is infected? When you read the
first question, you might not even think of it as a math question at all. Arguably, it
is not–but insight could be gained by using mathematics (and ultimately scientific
computing). When you read the second question, you probably feel more confident
and could get started immediately doing some computations. You might begin by
making a table of values and calculating how many people are sick each day until you
reach 100. The answer would then be the day that happened. You could conceivably,
in a few lines of computer code, write a program to solve this problem even for a
more general case where the population size and the rate of spreading are inputs to
your code.
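For instance, a few lines along the following sketch would do. The function name, the fixed infections-per-day rule, and the default values here are illustrative assumptions only, not part of the question as stated.

def days_until_all_infected(population=100, initially_sick=2, rate=2):
    """Count the days until everyone is infected, assuming each sick
    person infects a fixed number (rate) of new people per day."""
    sick = initially_sick
    days = 0
    while sick < population:
        sick = min(population, sick + rate * sick)
        days += 1
    return days

print(days_until_all_infected())  # 4 days under these illustrative assumptions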
Both questions have the same context, understanding how a disease spreads
through a population. What is the main difference between these questions? The
second one provided you with everything you need to get to the answer. If your
entire class approached this problem, you would all get the same answer.
With the first question though, you initially need to decide what is even meant
by “significant” and then think about how to quantify that. Modeling allows us to
use mathematics to analyze an outbreak and decide if there is cause for alarm or
to propose an intervention strategy. Math modeling allows for interpretation and
innovation and there is often not an obvious single solution. If your entire class split
up and worked on the first question, you would likely see a wide range of approaches.
You might make an assumption along the lines of: the rate at which a disease spreads
is proportional to the product of the number of people who have it and the number of people who do not. In
this case, you may ultimately wind up with a system of differential equations that
model how the infected and uninfected populations change over time. Depending on
the level of complexity you chose to include, numerical techniques would be needed
to approximate a solution to the differential equation model.
Specifically, a mathematical model is often defined as a representation of a
system or scenario that is used to gain qualitative and/or quantitative understanding
of some real world problem, and to predict future behavior. Models are used across all
disciplines and help people make informed decisions. Throughout this text, we will
use real-world applications like the second question to motivate and demonstrate the
usefulness of numerical methods. These will often be posed as mathematical models
that have already been developed for you, and it will be your role to apply scientific
computing to get an answer. In addition to that, we provide math modeling in which
numerical methods have powerful potential to give insight into a real-world problem
but you first need to develop the model. We provide a brief overview of the math
modeling process below for inspiration.
Overview of the Math Modeling Process
Math modeling is an iterative process and the key components may be revisited mul-
tiple times and not necessarily even done in order, as we show in Fig. 1.1. We review
the components here and point the reader towards the free resource Math Modeling:
Getting Started and Getting Solutions, Karen M. Bliss et al., SIAM, Philadelphia, 2013
for more information, examples, and practice.
• Defining the Problem Statement If the initial idea behind the problem is broad
or vague, you may need to form a concise problem statement which specifies what
the output of your model will be.
In this text, you will learn the tools to tackle these sorts of messy real-world problems
and gain experience by using them. Within that modeling process in Fig. 1.1, we will
be primarily concerned with solving the resulting model and analyzing the results
(although there will be opportunities to build models as well). In the process of
solving real-world problems, once a mathematical representation is obtained, then
one must choose the right tool to find a solution.
Fig. 1.2 Examples of types of problems resulting from mathematical modeling of real world prob-
lems
The mathematical model itself can take on many forms (we suggest taking an
entire course in just that topic!). Figure 1.2 shows where the topics covered in this
text emerge as tools to find a solution within the modeling process. Here is a brief
look at the types of problems you will be able to solve after you are done with this
text. First note that you have already learned pencil-and-paper techniques for many of
these “types” of problems presented throughout this book, but the purpose here is to
move beyond those to solve more complex problems.
In order to be able to use scientific computing tools, it is necessary to become
familiar with how numbers are stored and used in a computer. The chapter on Number
Representations and Errors provides some background on this as well as examples
of different types of errors that can impact the accuracy of your solution and tools to
measure the accuracy.
In the Numerical Calculus chapter, you will learn tools to approximate derivatives
and integrals using only function values or data. You’ll also learn how to analyze
the accuracy in those approximations. These can be powerful tools in practice. For
example, the function you are working with may not even have an analytic integral.
Sometimes you don’t have a function at all. That instance may occur if a function
evaluation requires output from simulation software that you used to compute some
other quantity of interest. It may be the case that you only have data, for example
collected from a sensor or even given to you as a graph from a collaborator, but you
need to use those values to approximate an integral or a rate of change.
The chapter on Linear Equations is about how to solve problems of the form
Ax = b, where A is a square matrix and x and b are vectors. Linear models that lead
to problems of this form arise across disciplines and are prone to errors if care is
not taken in choosing the right algorithm. We discuss direct methods and iterative
methods as well as ways to make the algorithmic approaches more efficient in some
cases. The topic of eigenvalues is also introduced.
In Chap. 5, we address nonlinear equations of one variable, that is problems of
the form f(x) = 0, and systems of nonlinear equations, where f : ℝⁿ → ℝⁿ. Problems
of this type almost always require iterative methods and several are presented,
compared, and analyzed.
Chapter 6 on Interpolation covers the idea of finding a function that agrees with
a data set or another function at certain points. This is different from trying to fit a model
as closely as possible to data (which would result in a linear or nonlinear regression
or least squares problem and ideas from the previous two chapters). The focus in
this text is on finding polynomials or piecewise polynomials that equal the function
(or data) exactly.
Finally in Chap. 7, we present numerical approaches to solving Differential Equa-
tions, which arise routinely in math modeling, for example in the context of infectious
diseases or population dynamics. Problems in that chapter have the form y′ = f(x, y),
where the solution is a function y(x) that satisfies that differential equation. The
continuous problem is discretized so that approximations at specific x values are
computed. We also consider higher order differential equations and boundary value
problems with an emphasis on accuracy and computational efficiency. Solution tech-
niques here rely on ideas from the previous chapters and are a culminating way to
gain appreciation for and practice with the tools presented throughout the text.
Throughout this text, examples are used to provide insight to the numerical methods
and how they work. In some cases, a few computations, or steps through an algo-
rithm, are shown and other times Python code is provided (along with the output).
Python is an open source, interpreted, cross-platform language that was developed
in the late 1980s (no compiling and it’s free). The motivation was to be flexible with
straightforward syntax. Python has grown to be an industry standard now and a pow-
erful computing tool, with a wide range of libraries for mathematics, data science,
modeling, visualization, and even game design. It is considered “easy” to get started
with and there are a plethora of tutorials, resources, and support forums available to
the scientific community. At the end of each chapter in this book, we point to Python
tools and resources for further exploration. Head to
https://www.python.org
to get started. In particular, you should be sure to install the NumPy and SciPy libraries
which are the standard scientific computation packages, along with Matplotlib for
plotting. For an easy way to install Python and all essential packages for scientific
computing, consider looking up the Anaconda Python distribution.
This book is a not a book about Python programming. For a thorough introduction
to Python for scientific computing, we recommend Hans Petter Langtangen: A Primer
on Scientific Programming with Python, Springer, 2016.
Throughout this book, we make use of the three essential scientific computing
packages mentioned above. NumPy provides basic vector- and matrix-based numer-
ical computing based on the NumPy array data structure and its accompanying func-
tionality such as basic linear algebra, random numbers etc. SciPy provides a larger
selection of higher-level functionality such as interpolation, integration, optimization
etc. that works on top of NumPy. Matplotlib provides advanced plotting functionality
for visualising numerical data and is used in some of the examples to illustrate the
results.
In order to use additional packages such as NumPy, SciPy, and Matplotlib, they
must first be imported in the Python script. This is conventionally done as:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
Some of the code examples in the coming chapters implicitly assume that these pack-
ages have been imported although it is not always shown in the particular example.
When you come across statements in the code examples beginning with the above
np., sp., or plt., assume that it refers to the above packages.
1.4 Background
Recall from calculus the notion of a power series,

C_0 + C_1 x + C_2 x^2 + \cdots = \sum_{n=0}^{\infty} C_n x^n

where the C_n are coefficients and x is a variable. In its simplest definition, a power
series is an infinite degree polynomial. Significant time is spent in Calculus answering
the question “For what values of x will the power series converge?” and multiple
approaches could be used to determine a radius and interval of convergence. Usually
this is followed by an even more important question, “Which functions have a power
series representation?” which leads into Taylor and MacLaurin series. Recall that the
Taylor series representation of a function f (x) about a point a and some radius of
convergence is given by
f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!} (x - a)^2 + \cdots   (1.1)
and a MacLaurin series was the special case with a = 0. This was a powerful new
tool in that complicated functions could be represented as polynomials, which are
easy to differentiate, integrate, and evaluate.
Two series which we will make use of are the geometric series

\frac{1}{1 - x} = \sum_{k=0}^{\infty} x^k = 1 + x + x^2 + \cdots   (|x| < 1)   (1.2)

and the exponential series

\exp x = e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!} = 1 + x + \frac{x^2}{2!} + \cdots   (all x)   (1.3)

Using Euler's formula e^{ix} = \cos x + i \sin x, where i = \sqrt{-1}, we get the following series:
\cos x = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k}}{(2k)!} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots   (all x)   (1.4)

\sin x = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots   (all x)   (1.5)
(If you are unfamiliar with complex numbers, these are just the MacLaurin series for
these functions)
By integrating the geometric series (1.2) we get

\ln(1 - x) = -\sum_{k=0}^{\infty} \frac{x^{k+1}}{k+1} = -x - \frac{x^2}{2} - \frac{x^3}{3} - \cdots   (|x| < 1)   (1.6)
Don’t worry if you do not remember all these details–everything you need can also
be found in any Calculus text to refresh your memory. The following examples are
meant to demonstrate these ideas further.
Replacing x by -x in (1.6) gives

\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots   (-1 < x \le 1)   (1.7)

Example 1 Setting x = 1 in (1.7) gives the series

\ln 2 = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots
Use the first six terms of this series to estimate ln 2. How many terms would be needed
to compute ln 2 with an error less than 10⁻⁶ using this series? (Note the true value
of ln 2 ≈ 0.69314718.)
\ln 2 \approx 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \frac{1}{6} = 0.61666667

which has an error close to 0.08.
Since the series (1.7) is an alternating series of decreasing terms (for 0 < x ≤ 1),
the truncation error is smaller than the first term omitted. To force this truncation
error to be smaller than 10⁻⁶ would therefore require that the first term omitted is
smaller than 1/1,000,000. That is, the first one million terms would suffice.
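As a quick numerical check (an illustrative sketch, not code from the text), the six-term partial sum can be computed directly:

s = 0.0
for k in range(1, 7):            # first six terms of the alternating series
    s += (-1) ** (k + 1) / k
print(s)                          # 0.6166666666666666, an error of about 0.0765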
Example 2 Find the number of terms of the exponential series that are needed for
exp x to have error < 10⁻⁵ for |x| ≤ 2.
First note that the tail of the exponential series truncated after N terms increases
with |x|. Also the truncation error for x > 0 will be greater than that for −x since the
series for exp (−x) will be alternating in sign. It is sufficient therefore to consider
x = 2.
We shall denote by E_N(x) the truncation error in the approximation using N terms:

\exp x \approx \sum_{k=0}^{N-1} \frac{x^k}{k!}
Then, we obtain, for x > 0
E_N(x) = \sum_{k=N}^{\infty} \frac{x^k}{k!} = \frac{x^N}{N!} + \frac{x^{N+1}}{(N+1)!} + \frac{x^{N+2}}{(N+2)!} + \cdots

       = \frac{x^N}{N!} \left( 1 + \frac{x}{N+1} + \frac{x^2}{(N+2)(N+1)} + \cdots \right)

       \le \frac{x^N}{N!} \left( 1 + \frac{x}{N+1} + \frac{x^2}{(N+1)^2} + \cdots \right)

       = \frac{x^N}{N!} \cdot \frac{1}{1 - x/(N+1)}
With x = 2 this gives

E_N(2) \le \frac{2^N}{N!} \cdot \frac{N+1}{N-1}

and we require this quantity to be smaller than 10⁻⁵. Now 2¹¹/11! = 5.1306718 × 10⁻⁵,
while 2¹²/12! = 8.5511197 × 10⁻⁶. We must check the effect of the factor
(N + 1)/(N − 1) = 13/11 = 1.1818182. Since (1.1818182)(8.5511197 × 10⁻⁶) = 1.0105869 × 10⁻⁵,
twelve terms is not quite sufficient. N = 13 terms are needed: (2¹³/13!) · (14/12) = 1.5348164 × 10⁻⁶.
We note that for |x| ≤ 1 in the previous example, we obtain E_N(1) \le \frac{N+1}{N \cdot N!} < 10⁻⁵
for N ≥ 9. For |x| ≤ 1/2, just 7 terms are needed. The number of terms required
increases rapidly with x. These ideas can be used as a basis for range reduction so
that the series would only be used for very small values of x. For example, we could
use the 7 terms to obtain e1/2 and then take
e^2 = \left(e^1\right)^2 = \left(e^{1/2}\right)^4
to obtain e². Greater care would be needed over the precision obtained from the
series to allow for any loss of accuracy in squaring the result twice. The details are
unimportant here; the point is that the series expansion can provide the basis for a
good algorithm.
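The idea can be sketched in a few lines of Python. This is only an illustration of range reduction, not the book's algorithm; the helper function below simply sums the first n terms of the exponential series.

import math

def exp_series(x, n=7):
    # sum of the first n terms of the exponential series
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return total

e_half = exp_series(0.5)           # series used only for the small argument
e2 = (e_half ** 2) ** 2            # e^2 = ((e^(1/2))^2)^2
print(e2, math.exp(2) - e2)        # the ~1.7e-6 series error grows to roughly 3e-5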
The magnitude of the error in any of these approximations to e^x increases rapidly
with |x|, as can be seen from Fig. 1.3. The curve plotted is the error e^x - \sum_{k=0}^{6} x^k/k!.
We see that the error remains very small throughout [−1, 1] but rises sharply outside
this interval, indicating that more terms are needed there. The truncated exponential
series is computed using the Python function
import numpy as np

def expn(x, n):
    """
    Evaluates the first `n` terms of the exponential series.
    """
    s = np.ones_like(x)
    t = s
    for k in range(1, n):
        t = t * x / k
        s = s + t
    return s
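As a usage sketch, the error plotted in Fig. 1.3 can be reproduced along the following lines (the plotting calls assume Matplotlib has been installed as in Sect. 1.3):

import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 601)
err = np.exp(x) - expn(x, 7)    # error of the truncation after seven terms
plt.plot(x, err)
plt.show()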
What have you learned in this chapter? And where does it lead?
This first, short, chapter sets the stage for the remainder of the book. You have
been introduced to some of the underlying issues–and even some of the ways in
which these can be tackled. In many senses this chapter answers the fundamental
question “Why do I need to learn this?”, or even “Why do I care?”
Scientific Computing is by its very nature–almost by definition–a subject that
only becomes important in its ability to help us solve real problems, and those real
problems inevitably start with the need to “translate” a question from some other area
of science into a mathematical model and then to solve the resulting mathematical
problem.
The solution will almost always be a computational, and approximate, solution
since most models are not amenable to paper and pencil approaches. Therefore we
need to study the process of mathematical modeling and the fundamental topics
of scientific computing per se. The latter includes the ability to code your solution
algorithm. We have chosen Python as the language for this book and so this first
chapter included introductions to both modeling and Python programming.
Any computational solution has errors inherent in the process. There’s more on
that in the next chapter, but even in this introduction we see that simplistic approaches
to approximating a function through a power series are often not practical, but that
a careful restructuring of the problem can overcome that. This motivates the subse-
quent study of numerical methods and the importance of controlling errors and the
practicality of any particular approach.
That is what much of the rest of the book addresses. Read on and enjoy the
challenges and successes.
Exercises
1. Consider the motivating modeling problem from the beginning of this section:
“There is an outbreak of the flu. Is it significant?” Brainstorm about this situation
and list some assumptions, variables, and relationships. Try to come up with a
model that will predict the number of infected people at a given time. You don’t
need to find a particular solution. What are the strengths and weaknesses of your
approach? What are some challenges in seeking a solution to this problem?
2. Suppose that two stray cats, a male and a female, have a litter of kittens. Derive
a sequence, using mathematical modeling, that describes how the stray cat pop-
ulation will grow over time. How many cats will there be in two, five, and ten
years?
3. For the above cat problem, propose a humane and cost effective intervention
strategy to control the population and incorporate it into your model. Compare
the results.
4. Suppose that two bicyclists are traveling towards each other. Derive a mathemat-
ical model to determine when they would meet. Clearly define all the assump-
tions, variables, and relationships you choose to come up with the model. Create
a Python program to simulate this scenario and test it for a range of model param-
eters. Do your answers make sense? Could you explain your model to someone
else clearly?
5. Write a script to approximate the natural logarithm using the first 6 terms of
Eq. (1.7). Use it to estimate ln 1.128.
6. How many terms of the series (1.7) are needed to approximate ln 1.128 with
error smaller than 10−4 ? Evaluate this approximation and verify that the error
is indeed within the tolerance.
7. Determine the number of terms of the exponential series that are needed to obtain
e0.1 with error smaller than 10−8 . Evaluate the sum of these terms and verify
that the desired accuracy is achieved.
8. Use the series in (1.6) and (1.7) together with basic properties of the logarithm
function to obtain a series expansion for ln ((1 + x) / (1 − x)) . Show that choos-
ing x = 1/3 then provides a convergent series to compute ln 2. Use this series to
estimate ln 2 with error less than 10⁻⁶. Write a script to approximate the natural
logarithm function using the first six terms of this series, and graph the function,
its approximation, and the error for x in (0, 3).
9. Write a script to estimate the natural logarithm function using n terms of a series
expansion. Graph the error in this approximation to ln x for x ∈ (0, 3) using 10
terms.
10. Estimate π using the identity arctan(1/√3) = π/6 by considering a series expansion
of arctan x. How many terms are needed to have an error of less than 10⁻³?
11. The erf function, or “error function”, is defined by

\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2) \, dt

Find a series expansion for erf(x) by integrating the series for exp(−t²) term by term.
Use the first 10 terms of this series to obtain a graph of this function over the
interval [0, 4] .
12. A car is moving at 18 m/s with an instantaneous acceleration of 1.7 m/s2 . Build
a second degree Taylor polynomial to estimate how far the car goes in the next
second, two seconds, and ten seconds. Discuss at which point you think the
polynomial fails to be a good approximation to estimate the distance that car
travels.
13. In Calculus Early Transcendentals, 8th Edition, by James Stewart, they reference
Physics: Calculus, 2nd Edition, by Eugene Hecht. In deriving the formula for the
period of a pendulum of length L,

T = 2π√(L/g),

Hecht obtains the equation

a_T = −g sin(θ)

for the tangential acceleration of the bob of the pendulum and claims “for small
angles, the value of θ in radians is very nearly the value of sin(θ); they differ by
less than 2% out to about 20°.”
(a) Explain how sin(x) ≈ x is just a first degree Taylor polynomial approxima-
tion.
(b) Write a script to determine the range of x values for which the error in that
approximation is within 2%. Does Hecht’s statement hold?
2 Number Representations and Errors
2.1 Introduction
In a base-β positional system, a positive real number x is written as

x = (a_n a_{n-1} \cdots a_1 a_0 . a_{-1} a_{-2} \cdots a_{-m})_β

and each of the base-β digits a_n, a_{n-1}, ..., a_1, a_0, a_{-1}, a_{-2}, ..., a_{-m} is an integer
in the range 0, 1, 2, ..., β − 1. The representation above could be rewritten using
summation notation as

x = \sum_{k=-m}^{n} a_k β^k
x = f × β^E   (2.1)

f = b_0 + b_1 β^{-1} + b_2 β^{-2} + \cdots + b_N β^{-N} = \sum_{k=0}^{N} b_k β^{-k}.   (2.3)
b_0 = 1   (2.4)

in the binary representation (2.3) of the mantissa. This first bit is not stored explicitly
in normalized binary floating-point representations and is called the implicit bit.
Keep in mind that this bit (although not stored) must be used in any computations.
Throughout the book, we may consider hypothetical computers that have a speci-
fied number of significant figures in order to illustrate certain ideas easily. Remember,
significant figures are the digits of a number beginning with the first nonzero one, with
several rules that apply to zeros: zeros between nonzero digits are significant, and when
there is a decimal point, trailing zeros are significant while leading zeros are not. For
example, 123.45 and 0.012345 both have five significant figures while 10.2345 and
0.0123450 have six.
π ≈ +3.142 × 10⁰
π ≈ +3.14159 × 10⁰
This representation consists of three pieces: the sign, the exponent (in this case 0)
and the mantissa, or fraction, 3.14 . . ..
In the binary floating-point system using 18 significant binary digits, or bits, we
would have

π ≈ +1.10010010000111111 × 2¹

Here the digits following the binary point represent increasing powers of 1/2. Thus
the first five bits represent

\left(1 + \frac{1}{2} + \frac{0}{4} + \frac{0}{8} + \frac{1}{16}\right) × 2¹ = 3.125
Note that most computer systems and programming languages allow quantities
which are known to be (or declared as) integers to be represented in their exact
binary form. However, this restricts the size of integers which can be stored in
a fixed number of bits (referred to as a word). This approach does help avoid the
introduction of any further errors as the result of the representation. The IEEE binary
floating-point standard is the floating-point representation used in practice
(see IEEE Standard 754, Binary Floating-point Arithmetic, Institute of Electrical
and Electronics Engineers, New York 2008). There are two formats that differ in the
number of bits used. IEEE single precision uses a 32-bit word to represent the sign,
exponent and mantissa, and double precision uses 64 bits. Table 2.1 shows how these
bits are used to store the sign, exponent, and mantissa.
In every day computations, we are used to simply rounding numbers to the nearest
decimal. This is generalized by the notion of symmetric rounding in which a number
is represented by the nearest member of the set of representable numbers (such as
in Example 1). Be aware though that this may not be the approach taken by your
Table 2.1 IEEE normalized binary floating-point representations

Numbers of bits                      Single precision    Double precision
Sign                                 1                   1
Exponent                             8                   11
Mantissa (including implicit bit)    24                  53
computer system. The IEEE standard requires the inclusion of symmetric rounding,
rounding towards zero (or chopping), and the directed rounding modes towards
either +∞ or −∞.
Consider Example 1 again. Using chopping in the decimal floating-point approx-
imations of π gives π ≈ +3.141 × 10⁰ and with 7 significant figures would give
π ≈ +3.1415926 × 10⁰. Throughout this text, symmetric rounding will be used
but the reader should remember that there may be additional details to consider
depending on the computer systems he or she is using.
It may be becoming clear that the errors in the floating-point representation of
real numbers will also impact computations and may introduce additional errors. To
gain insight into this issue, consider the following examples.
1. The addition

1.234 + 0.1234 = 1.3574 ≈ 1.357

has a rounding error of 4 × 10⁻⁴, which is also true of the corresponding subtraction

1.234 − 0.1234 = 1.1106 ≈ 1.111

Multiplication of the same pair of numbers has the exact result 1.522756 × 10⁻¹,
which would be rounded to 1.523 × 10⁻¹. Again there is a small rounding error.
2. The somewhat more complicated piece of arithmetic

\frac{1.234}{0.1234} - \frac{1.234}{0.1233}

demonstrates some of the pitfalls more dramatically. Proceeding in the order
suggested by the layout of the formula, we obtain

\frac{1.234}{0.1234} ≈ 1.000 × 10¹, and \frac{1.234}{0.1233} ≈ 1.001 × 10¹

from which we get the result −0.001 × 10¹ which becomes −1.000 × 10⁻²
after normalization.
3. If we perform the last calculation rather differently, we can first compute

\frac{1}{0.1234} - \frac{1}{0.1233} ≈ 8.104 − 8.110 = −6.000 × 10⁻³

Multiplying this result by 1.234 we get −7.404 × 10⁻³, which is much closer to
the correct result which, rounded to the same precision, is −8.110 × 10⁻³.
The examples above highlight that, because number representation is not exact
on a computer, additional errors occur from basic arithmetic. The subtraction of two
numbers of similar magnitudes can be especially troublesome. (The result in part
2 of Example 2 has only one significant figure since the zeros introduced after the
normalization are negligible.) Although this may seem alarming, part 3 demonstrates
that by being careful with the order in which a particular computation takes place,
errors can be avoided.
Something else to be aware of–problems can arise from the fact that a computer can
only store a finite set of numbers. Suppose that, on the same hypothetical machine
as we used in Example 2 there is just one (decimal) digit for the exponent of a
floating-point number. Then, since
3.456 × 10³ × 3.456 × 10⁷ = 3456 × 34560000
                          = 119439360000
                          ≈ 1.194 × 10¹¹
the result of this operation is too large to be represented in our hypothetical machine.
This is called overflow. Similarly, if the (absolute value of a) result is too small to
be represented, it is called underflow. For our hypothetical computer this happens
when the exponent of the normalized result is less than −9.
These ideas indicate that even some seemingly simple computations must be
programmed carefully in order to avoid overflow or underflow, which may cause a
program to fail (although many systems will not fail on underflow but will simply
set such results to zero, but that can often result in a later failure, or in meaningless
answers). The take-away message is not to stress too much about these issues, but
be aware of them when solutions seem questionable or a program fails to execute.
Most computing platforms use only one type of number – IEEE double precision
floating-point, as dictated by the computer’s hardware, and Python is no exception to this.
On typical computers, Python floating-point numbers map to the IEEE double pre-
cision type. Integers in Python are represented separately as integer data types, so
there is no need to worry about numerical precision as long as operations only involve
integers.
From Table 2.1, we see that the double-precision representation uses 11 bits for
the binary exponent, which therefore ranges from about −2¹⁰ to 2¹⁰ = 1024. (The
actual range is not exactly this because of special representations for small numbers
and for ±∞.) The mantissa has 53 bits including the implicit bit. If x = f × 2^E is
a normalized floating-point number then f ∈ [1, 2) is represented by

f = 1 + \sum_{k=1}^{52} b_k 2^{-k}

Since 2¹⁰ = 1024 ≈ 10³, these 53 significant bits are equivalent to approximately
16 significant decimal digits of accuracy.
The fact that the mantissa has 52 bits after the binary point means that the next
machine number greater than 1 is 1 + 2⁻⁵². This gap is called the machine unit,
or machine epsilon. This and other key constants of Python’s arithmetic are easily
obtained from the sys module (import sys first).
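For example (a minimal sketch; the values shown in the comments are the usual IEEE double precision ones):

import sys

print(sys.float_info.epsilon)   # 2.220446049250313e-16, i.e. 2**-52
print(sys.float_info.max)       # about 1.7976931348623157e+308
print(sys.float_info.min)       # about 2.2250738585072014e-308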
Here, and throughout, the notation xen, where x and n are numbers, stands for x × 10ⁿ.
The quantity 10⁻⁸ may be written in practice as 1e−8.
In most cases we wish to use the NumPy package for Python for numerical compu-
tations. NumPy defines its own floating-point and integer data types offering choice
between different levels of precision, e.g. numpy.float16, numpy.float32,
and numpy.float64. These types are compatible with the built-in floating-point
type of ordinary Python and numpy.float64 (double precision) is the default
type. NumPy offers access to an extended precision 80-bit floating-point types as
well on operating systems where this is available. Somewhat confusingly, this data
type is available as numpy.float96 and numpy.float128. The latter two
types might seem to suggest that they offer 96-bit and 128-bit precision, respec-
tively, but the name only relates to how many bits the variables are zero-padded to in
memory. They still offer only 80 bits’ precision at most, and only 64 bits by default
on Windows. We advise only using NumPy’s default numpy.float64 type unless
special requirements dictate otherwise and you know your platform actually offers
the higher precision.
The machine precision-related constants of NumPy’s arithmetic can be obtained
using NumPy’s finfo function (import numpy first).
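For example (again only a sketch; the attribute values correspond to IEEE double precision):

import numpy as np

info = np.finfo(np.float64)
print(info.eps)     # 2.220446049250313e-16
print(info.max)     # 1.7976931348623157e+308
print(info.tiny)    # 2.2250738585072014e-308, the smallest normalized positive number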
In this section, we consider three primary sources of error, although the topic of error
analysis and error control is in and of itself worthy of an entire text. The purpose
here is to introduce the reader to these ideas so later we can use them to understand
the behavior of numerical methods and the quality of the resulting solutions.
We have already seen that these arise from the storage of numbers to a fixed number
of binary or decimal places or significant figures.
If, however, we use this same formula for the larger root together with the fact
that the product of the roots is 10, then, of course, we obtain the values 50.00 and
10.00/50.00 = 0.2000. Again, note that rounding errors can be addressed by the
careful choice of numerical process, or the design of the algorithm. This is the
preferred approach to computing the roots of a quadratic equation!
The previous computations in Example 3 used symmetric rounding but if chopping
were used instead, the larger root would be obtained as 49.99 and the smaller would
then be 10/49.99 = 0.200 as expected.
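The technique can be sketched in general terms as follows. This is an illustrative implementation only, not the text's code; since the quadratic from Example 3 is not reproduced above, the test coefficients below were simply chosen so that the roots are 50 and 0.2, matching the values quoted.

import math

def stable_quadratic_roots(a, b, c):
    """Roots of a*x**2 + b*x + c = 0 (assumed real), computed so that the
    smaller-magnitude root is obtained from the product of the roots."""
    disc = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(disc, b))   # avoids subtracting nearly equal numbers
    return q / a, c / q                        # larger-magnitude root, then the other

print(stable_quadratic_roots(1.0, -50.2, 10.0))   # approximately (50.0, 0.2)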
Truncation error is the name given to the errors which result from the many ways in
which a numerical process can be cut short or truncated. Many numerical methods
are based on truncated Taylor series approximations to functions and thus, this type
of error is of particular importance and will be revisited throughout the text.
To get more comfortable with the notion of truncation error, consider the geometric
series

\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \cdots = \sum_{n=1}^{\infty} \left(\frac{1}{2}\right)^n
If we truncate this series after only the first three terms, then we get 0.875, which
gives a truncation error of 0.125.
Of course, including more terms in the sum before truncating would certainly
lead to a more accurate approximation. The following examples show how to use
the notion of truncation error to determine a desired accuracy.
Example 4 (a) Determine the number of terms of the standard power series for
sin(x) needed to compute sin(1/2) to five significant figures.
The series
\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots = \sum_{n=0}^{\infty} (-1)^n \frac{x^{2n+1}}{(2n+1)!}
has infinite radius of convergence, so we need not worry about the issue of con-
vergence for x = 1/2. Indeed the terms of the series in this case have decreasing
magnitude and alternating sign. Therefore the error in truncating the series after N
terms is bounded by the first term omitted. We have already seen that sin(x) can be
well approximated by x itself for small values so the true value is expected to be a
little less than 0.5. Accuracy to five significant figures is therefore achieved if the
absolute error is smaller than 5 × 10⁻⁶. Thus we seek N such that

\frac{1}{2^{2N+1} (2N+1)!} < 5 × 10⁻⁶,

or equivalently

2^{2N+1} (2N+1)! > 200,000.
The first few odd factorials are 1, 6, 120, 5040, and the first few odd powers of 2 are
2, 8, 32, 128. Now 32 × 120 = 3,840 is too small, while 128 × 5040 = 645,120
readily exceeds our threshold. Therefore the first three terms suffice, so that to five
significant figures we have

\sin(1/2) \approx \frac{1}{2} - \frac{1}{8 \times 6} + \frac{1}{32 \times 120} = 0.47943.
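A quick check of this value (an illustrative sketch, not code from the text):

from math import factorial, sin

approx = sum((-1) ** n * 0.5 ** (2 * n + 1) / factorial(2 * n + 1) for n in range(3))
print(approx, sin(0.5))    # 0.479427... and 0.479426..., both 0.47943 to five figures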
(b) Determine the number of terms N of the exponential series required to estimate
√e = e^{1/2} to five decimal places.
This time we need to use the exponential series to compute e^{1/2}. Again, convergence
of the power series

e = \exp(1) = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \cdots = \sum_{n=0}^{\infty} \frac{1}{n!}

is not an issue. Truncating the series for e^{1/2} after N terms leaves a truncation error bounded by

\frac{1}{2^N N!} \left( 1 + \frac{1}{2(N+1)} + \frac{1}{[2(N+1)]^2} + \cdots \right)

The bracketed geometric series sums to

\frac{1}{1 - \frac{1}{2N+2}} = \frac{2N+2}{2N+1}

so the truncation error is bounded by

\frac{1}{2^N N!} \cdot \frac{2N+2}{2N+1}.
The first few values of this bound for N = 0, 1, 2, ..., 7 are 2, 0.666667, 0.15, 0.02381, 0.002894,
0.000284, 2.34 × 10⁻⁵, and 1.65 × 10⁻⁶. We see that the required tolerance is
achieved by 1.65 × 10⁻⁶ for N = 7.
2.3.3 Ill-Conditioning
Given that the underlying model is sound and the numerical method is “stable” (more
on that later), there still may be inherent issues in the underlying problem you are
trying to solve. This notion is called the conditioning of the problem and is based on
the idea that small changes to the input should lead to small changes in the output.
When that is not the case, we say the underlying problem is ill-conditioned.
A classical example of ill-conditioning is the so-called Wilkinson polynomial.
Consider finding the roots of

p(x) = (x − 1)(x − 2) ⋯ (x − 20) = 0

which, when expanded, has the form

p(x) = x²⁰ + a₁₉x¹⁹ + ⋯ + a₁x + a₀
If the coefficient a₁₉ = −210 is changed by about 2⁻²², then the resulting polynomial
has only 10 real zeros, and five complex conjugate pairs. The largest real root is now
at x ≈ 20.85. A change of only one part in about 2 billion in just one coefficient has
certainly had a significant effect!
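This sensitivity is easy to observe numerically. The following is only an illustrative sketch; numpy.roots introduces rounding errors of its own, so the computed roots will not match the exact-arithmetic statement above precisely, but the dramatic movement of the largest roots is clearly visible.

import numpy as np

coeffs = np.poly(np.arange(1, 21))   # coefficients of (x - 1)(x - 2)...(x - 20)
perturbed = coeffs.copy()
perturbed[1] -= 2.0 ** -22           # perturb a19 = -210 by about 2**-22
print(np.sort_complex(np.roots(perturbed))[-6:])   # the largest roots scatter, some into complex pairs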
Underlying ill-conditioning is often less of an issue than modeling errors or for
example, using an inappropriate method to solve a specific problem. However, under-
standing the behavior of the underlying problem (or model or method) you have in
front of you is always important so that you can make the right choices moving
forward. This idea will be revisited throughout the text.
Example 5 Find the absolute and relative errors in approximating e by 2.7183. What
are the corresponding errors in the approximation 100e ≈ 271.83?
Unfortunately, we cannot find the absolute errors exactly since we do not know a
numerical value for e exactly. However we can get an idea of these errors by using
the more accurate approximation
e ≈ 2.718281828459046.

The absolute error in the approximation e ≈ 2.7183 is then approximately
|2.7183 − 2.7182818| ≈ 1.82 × 10⁻⁵, while the absolute error in 100e ≈ 271.83 is
approximately 1.82 × 10⁻³, which is exactly 100 times the error in approximating e, of
course. The relative error is however unchanged since the factor of 100 affects
numerator and denominator the same way: both relative errors are approximately
6.7 × 10⁻⁶.
When we approximate a function f by another function p on an interval [a, b], there are several common ways to measure the error.
The L∞ metric

‖f − p‖_∞ = \max_{a \le x \le b} |f(x) − p(x)|.

The L₁ metric

‖f − p‖_1 = \int_a^b |f(x) − p(x)| \, dx.

The L₂ or least-squares metric

‖f − p‖_2 = \sqrt{\int_a^b |f(x) − p(x)|^2 \, dx}.
These metrics are often referred to as norms. The first measures the extreme
discrepancy between f and the approximating function p while the others are both
measures of the “total discrepancy” over the interval. The following examples provide
insight.
The plot of the two functions in Fig. 2.1 helps shed light on what we are measuring.
The L ∞ error requires the maximum discrepancy between the two curves. From the
graphs, or using calculus, one can deduce that occurs at θ = π/4, so that

‖sin θ − θ‖_∞ = π/4 − sin(π/4) ≈ 7.83 × 10⁻²

Fig. 2.1 The two curves y = θ and y = sin(θ) on [0, π/4]
The L₁ error is the area of the shaded region between the curves:

‖sin θ − θ‖_1 = \int_0^{\pi/4} (θ − \sin θ) \, dθ = \frac{1}{32}\pi^2 + \frac{1}{2}\sqrt{2} − 1 = 1.5532 × 10⁻²
For a discrete set of points x_0, x_1, ..., x_N, the corresponding measures of error are the largest of the values |f(x_k) − p(x_k)|,

\sum_{k=0}^{N} |f(x_k) - p(x_k)|, and

\sum_{k=0}^{N} |f(x_k) - p(x_k)|^2
The last measure arises when trying to fit experimental data to a model and is called
a least-squares measure. All of these are still referred to as norms as well.
With the previous ideas for measuring and interpreting errors, we briefly revisit
floating-point arithmetic and the importance of algorithm design. In what follows,
consider an arbitrary base β but keeping in mind this is 2 for computers and 10 for
calculators.
Recall that a positive number x will be represented in normalized floating-point
form by

x̂ = f̂ × β^E   (2.5)

where the fraction f̂ consists of a fixed number, (N + 1) say, of base-β digits:
If f̂ is obtained from f by chopping, then d̂_k = d_k for k = 0, 1, ..., N so that (2.7)
becomes

|x − x̂| = β^E \sum_{k=N+1}^{\infty} d_k β^{-k} \le β^{E-N}
Symmetric rounding is equivalent to first adding β/2 to d_{N+1} and then chopping the
result. It follows that, for symmetric rounding,

|x − x̂| \le \frac{β^{E-N}}{2}   (2.8)
Of more interest for floating-point computation is the size of the relative error:

\frac{|x − x̂|}{|x|} = \frac{|f − f̂|}{|f|}

In the same manner as above, we find that

\frac{|f − f̂|}{|f|} \le \frac{1}{β^N} for chopping, and \frac{|f − f̂|}{|f|} \le \frac{1}{2β^N} for symmetric rounding.   (2.9)
Such bounds vary according to the nature of the operation. Note that this level of
analysis is not common in practice for many real-world applications, but the reader
should be aware that indeed thought (and theory) has been given to these ideas!
The following examples will provide further insight into the errors associated with
floating-point arithmetic.
Example 7 Subtraction of nearly equal numbers can result in large relative rounding
errors.

x = (1.234) × 10⁻¹,
y = (1.235) × 10⁻¹

so that

x − y = −(0.001) × 10⁻¹ = −(1.000) × 10⁻⁴
The sum

\frac{1}{3} + \frac{1}{4} + \cdots + \frac{1}{10} = 1.428968
to six decimal places. However, rounding each term to four decimal places we obtain
the approximate sum 1.4290.
The accumulated rounding error is only about 3 × 10⁻⁵. The error for each term
is bounded by 5 × 10⁻⁵, so that the cumulative error bound for the sum is 4 × 10⁻⁴,
which is an order of magnitude bigger than the actual error committed, demonstrating
that error bounds can certainly be significantly worse than the actual error.
This phenomenon is by no means unusual and leads to the study of probabilistic
error analyses for floating-point calculations. For such analyses to be reliable, it
is important to have a good model for the distribution of numbers as they arise
in scientific computing. It is a well-established fact that the fractions of floating-
point numbers are logarithmically distributed. One immediate implication of this
distribution is the rather surprising statement that the proportion of (base-β) floating-
point numbers with leading digit n is given by

\log_β \left( \frac{n+1}{n} \right) = \frac{\log(n+1) − \log n}{\log β}

In particular, 30% of decimal numbers have leading significant digit 1 while only
about 4.6% have leading digit 9.
Example 9 Let n be any positive integer and let x = 1/n. Convince yourself that
(n + 1)x − 1 = x. The Python command

x = (n + 1) * x - 1
should leave the value of x unchanged no matter how many times this is repeated.
The table below shows the effect of doing this ten and thirty times for each n =
1, 2, . . . , 10. The code used to generate one of these columns was:
>>> a = [ ]
f o r n i n range ( 1 ,1 1 ) :
x = 1/n
f o r k i n range ( 1 0 ) :
x = ( n + 1) ∗ x − 1
a . append ( x )
n     x = 1/n             after k = 0, ..., 9    after k = 0, ..., 29
1     1.00000000000000    1.00000000000000       1
2     0.50000000000000    0.50000000000000       0.5
3     0.33333333333333    0.33333333331393       −21
4     0.25000000000000    0.25000000000000       0.25
5     0.20000000000000    0.20000000179016       6.5451e+06
6     0.16666666666667    0.16666666069313       −4.76642e+08
7     0.14285714285714    0.14285713434219       −9.81707e+09
8     0.12500000000000    0.12500000000000       0.125
9     0.11111111111111    0.11111116045436       4.93432e+12
10    0.10000000000000    0.10000020942783       1.40893e+14
Initially the results of ten iterations appear to conform to the expected results, that
x remains unchanged. However the third column shows that several of these values
are already contaminated with errors. By the time thirty iterations are performed,
many of the answers are not recognizable as being 1/n.
On closer inspection, note that the ones which remain fixed are those where n is
a power of 2. This makes some sense since such numbers are representable exactly
in the binary floating-point system. For all the others there is some initially small
representation error: it is the propagation of this error through this computation which
results in final (large!) error.
For example, with n = 3, the initial value x is not exactly 1/3 but is instead

x = \frac{1}{3} + δ
where δ is the rounding error in the floating-point representation of 1/3. Now the
operation
x = (n + 1) * x - 1

has the effect of magnifying this error. One iteration gives (ignoring the final rounding error)

x = 4x − 1 = 4\left(\frac{1}{3} + δ\right) − 1 = \frac{1}{3} + 4δ

which shows that the error is increasing by a factor of 4 with each iteration.
In the particular case of 1/3, it can be shown that the initial representation error in
IEEE double precision floating-point arithmetic is approximately −(1/3)·2⁻⁵⁴. Multiplying
this by 4³⁰ = 2⁶⁰ yields an estimate of the final error equal to −(1/3)·2⁶ = −21.333,
which explains the entry −21 in the final column for n = 3.
Try repeating this experiment to verify that your calculator uses base 10 for its
arithmetic.
What have you learned in this chapter? And where does it lead?
This chapter focused on the types of errors inherent in numerical processes. The
most basic source of these errors arises from the fact that arithmetic in the computer is
not exact except in very special circumstances. Even the representation of numbers–
the floating point system–has errors inherent in it because numbers are represented
in finite binary “words”.
The binary system is almost universal but even in systems using other bases the
same basic issues are present. Errors in representation lead inexorably to errors in
arithmetic and evaluation of the elementary functions that are so critical to almost
all mathematical models and their solutions. Much of this impacts rounding errors
which typically accumulate during a lengthy computation and so must be controlled
as much as possible. The IEEE Floating Point Standards help us know what bounds
there are on these rounding errors, and therefore to estimate them, or even mitigate
them.
It is not only roundoff error that is impacted by the finite nature of the computer.
The numerical processes are themselves finite–unless we are infinitely patient. The
finiteness of processes gives rise to truncation errors. At the simplest level this might
be just restricting the number of terms in a series that we compute. In other settings
it might be a spatial, or temporal, discrete “step-size” that is used. We’ll meet more
situations as we proceed.
Remember that truncation and rounding errors interact. We cannot control one
independent of the other. Theoretically we might get a better series approximation
to a function by taking more terms and so controlling the truncation error. However
trying to do this will often prove futile because rounding errors will render those
additional terms ineffective.
Take all this together with the fact that our models are inevitably only approxi-
mate representations of reality, and you could be forgiven for thinking we are embark-
ing on a mission that is doomed. That is not the case as we’ll soon see. That is
what much of the rest of the book addresses. Read on and enjoy the challenges and
successes, starting in the next chapter with numerical approaches to fundamental
calculus computations.
Exercises
(a) all subsequent terms are zero to four decimal places, and
(b) two successive “sums” are equal to five decimal places.
are needed to estimate cosh (1/2) with a truncation error less than 10−8 ? Check
that your answer achieves the desired accuracy.
6. What is the range of values of x so that the truncation error in the approximation

\exp x = e^x \approx \sum_{k=0}^{5} \frac{x^k}{k!}

is bounded by 10⁻⁸?
7. Find the absolute and relative errors in the decimal approximations of e found
in problem 1 above.
8. What is the absolute error in approximating 1/6 by 0.1667? What is the corre-
sponding relative error?
9. Find the absolute and relative errors in approximating 1/3, 1/5, 1/6, 1/7, 1/9,
and 1/10 by their binary floating-point representations using 12-bit mantissas.
10. Suppose the function e^x is approximated by 1 + x + x²/2 on [0, 1]. Find the
L∞, L₁, and L₂ measures of the error in this approximation.
11. Let x = 1.2576, y = 1.2574. For a hypothetical four-decimal-digit machine,
write down the representations x̂, ŷ of x, y and find the relative errors in the
stored results of x + y, x − y, xy, and x/y using
(a) chopping, and
(b) symmetric rounding.
12. Try to convince yourself of the validity of the statement that floating-point num-
bers are logarithmically distributed using the following experiment.
(a) Write computer code which finds the leading significant (decimal) digit of
a number in [0, 1). (Hint: keep multiplying the number by 10 until it is in
[1, 10) and then use the floor function to find its integer part.)
(b) Use a random number generator to obtain vectors of uniformly distributed
numbers in [0, 1). Form 1000 products of pairs of such numbers and find
how many of these have leading significant digit 1, 2, . . . , 9.
(c) Repeat this with products of three, four and five such factors.
13. Repeat the previous exercise but find the leading hexadecimal (base 16) digits.
See how closely these conform to the logarithmic distribution.
14. Consider Example 9. Try to explain the entry in the final column for n = 5. Use
an argument similar to that used for n = 3. (You will find it easier if you just
bound the initial representation error rather than trying to estimate it accurately.)
3 Numerical Calculus
3.1 Introduction
With the notion of how numbers are stored on a computer and how that representa-
tion impacts computation, we are ready to take the plunge into numerical calculus.
At the heart of calculus I and II students learn about the theory and applications
of differentiation and integration. In this chapter, many of the fundamentals tools
from Calculus are revisited with a new objective: approximating solutions instead of
deriving exact solutions. Many of the ideas here will be familiar to you if you look
back at how you were introduced to Calculus, which makes this a somewhat gentle
transition into numerical methods. But first, you may be wondering why we need
such computational tools after spending several semesters learning pencil and paper
Calculus. We provide some motivating examples below.
The world around us is often described using physics, specifically conserva-
tion laws. For example, conservation of mass, energy, electrical charge, and linear
momentum. These laws state that the specific physical property doesn’t change as the
system evolves. Mathematically, these conservation laws often lead to partial differ-
ential equation models that are used to build large-scale simulation tools. Such tools
provide insight to weather behavior, biological processes in the body, groundwater
flow patterns, and more. The purpose of the underlying algorithms which comprise
those simulation tools is to approximate a solution to the system of partial differential
equations that models the scenario of interest. That can be achieved by using numer-
ical derivatives in those partial differential equations while paying special attention
to the accuracy of the approach.
Throughout this text, we will use the so-called diffusion equation as an application
to demonstrate the usefulness of a variety of numerical methods. We introduce the
diffusion equation briefly in the next model in the context of heat flow, but in general
the diffusion equation is used in a variety of disciplines (but the coefficients have
different meanings). Do not worry if you are unfamiliar with partial differential
Example 1 The need for numerical calculus: Heat flow on a thin insulated wire.
Consider a wire of length L whose temperature is kept at 0 ◦ C at both ends. Let u(x, t)
be the temperature on the wire at time t and position x with an initial temperature
at t = 0 given as u 0 (x) = u(x, 0). A partial differential equation that models the
temperature distribution along the wire is given by
ρc \frac{\partial u}{\partial t} − K \frac{\partial^2 u}{\partial x^2} = f(x, t)   (3.1)
or
ρc u_t − K u_xx = f(x, t).
Here ρ is the density, c is the specific heat, and K is the thermal conductivity of the
wire. The function f (x, t) is a heat source, for example from the electrical resistance
of the wire.
The model is based on the assumption that the rate of change of u (i.e. u_t) is
proportional to the curvature of u (i.e. u_xx). Thus, the sharper the curvature, the
faster the rate of change. If u is linear in space then its second derivative is zero at a
given point, and we say u has reached steady-state.
Suppose we wanted to approximate values of the temperature along the wire at
discrete locations and at specific times until some final time, T_f. The idea is to
partition the time interval 0 ≤ t ≤ T_f using increments of Δt and the length of the
wire 0 ≤ x ≤ L using increments of Δx. This generates a set of discrete points
in space and time, as in Fig. 3.1. We now seek u_i^k ≈ u(x_i, t^k), where we use t^k
here to denote t_k, as an approximation to the temperature at those points for, say,
i = 0, 1, 2, ..., n and k = 0, 1, 2, ..., m. The next step (which we will get to later in
this chapter) will be to discretize Eq. (3.1) using numerical derivatives.
Numerical differentiation also becomes necessary in circumstances where the
actual derivative is not available. In some modeling situations, a function is only
available from a table of values, perhaps from some experimental data, or from a
call to an off-the-shelf executable (such as a simulation tool). However it may be
necessary to estimate this derivative based on only those data points or simulation
output values. Another example we will consider later arises in shooting methods
for solving boundary value problems.
Even for functions of a single variable, the standard calculus technique for opti-
mization of testing for critical points is not always sufficient. It is easy to find exam-
ples where setting the derivative to zero results in an equation which we cannot solve
algebraically.
Fig. 3.1 The grid of points in space (horizontally) and time (vertically) at which we seek to approx-
imate the temperature along the wire
When it comes to integration, there is similar motivation. Sometimes you may only have data or function values rather than an analytic function to integrate. Alternatively, the function may not have an analytic anti-derivative.
Think back to when you were in Calculus and first learned about the derivative
(before you learned all the handy differentiation rules). Likely it was this definition:
f'(x_0) = lim_{h→0} (f(x_0 + h) − f(x_0)) / h.
This was motivated by the idea that the slope of the tangent line at a point on a
curve is the result from looking at slopes of secant lines between two points on the
curve and letting them get closer and closer together.
That definition gives a simple approximation to the derivative with

f'(x_0) ≈ (f(x_0 + h) − f(x_0)) / h    (3.3)
for some small h. Again, this can also be interpreted as, “the slope of the tangent
line can be approximated by the slope of a secant line between nearby points” often
referred to as a forward finite difference approximation. But, what is the error in this
approximation and how small should h be?
Taylor's theorem can provide answers to those questions. Suppose that f'' exists; then Taylor's theorem gives

f(x_0 + h) = f(x_0) + h f'(x_0) + (h²/2) f''(ξ)

for some ξ between x_0 and x_0 + h. We can now express the error in the approximation as

f'(x_0) − (f(x_0 + h) − f(x_0)) / h = −(h/2) f''(ξ)    (3.4)
Recall this sort of error is called the truncation error. Here, the truncation error in approximating the derivative in this way is proportional to the steplength h used in its computation. We often say that the truncation error is first order since h is raised to the first power. This is typically denoted as O(h).
Having an expression for the truncation error can be extremely valuable when it
comes to implementation, especially if this approximation is embedded in a more
complex computer code. The truncation error can be used to validate that your numer-
ical method is actually coded up correctly (that is, bug-free) by conducting a grid
refinement study to verify the theory holds. To see how, suppose that we approximate
the derivative of a function that we know the analytic derivative of using a specific
h. We could then calculate the exact error E_h. The idea is to then start over with h/2 and calculate a new error E_{h/2}. Since our error is first order (i.e. proportional to the steplength) we must have that

E_h / E_{h/2} ≈ h / (h/2) = 2.
where c and δ are constants. This error function has the generic shape illustrated in Fig. 3.2. This places a severe restriction on the accuracy that can be achieved with this formula, but it also provides insight.
For example, let's suppose that machine precision is ε̄ ≈ 10⁻¹⁶. Then should h = 10⁻¹⁶? Actually, no. To see this, note that our error is O(h + ε̄/h), which is minimized when h ≈ √ε̄ = 10⁻⁸.
Example 3 Use Eq. (3.3) with h = 2⁻¹, 2⁻², . . . to estimate f'(0.7) for f(x) = cos x.
where f̃_k is the kth approximation of f', and the absolute relative difference between consecutive estimates is

|f̃_k − f̃_{k−1}| / |f̃_{k−1}|.
As the theory predicts (and described above) this ratio should be approximately
2 since we are halving h each time for this grid refinement study. This occurs until
the error begins to grow slightly towards the end.
We also see in Example 3 that we cannot get very high precision using this
formula. Similar results would be obtained using negative steps −h, called backward
differences. (See the exercises.)
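A grid refinement study like the one in Example 3 takes only a few lines of Python. The following is a minimal sketch (the function and variable names are our own): it halves h repeatedly, prints the error in the forward difference estimate of f'(0.7) for f(x) = cos x, and prints the ratio of successive errors, which should be close to 2 until rounding error takes over.

import numpy as np

def forward_diff(f, x0, h):
    """Forward difference approximation (3.3) to f'(x0)."""
    return (f(x0 + h) - f(x0)) / h

x0 = 0.7
exact = -np.sin(x0)          # true derivative of cos at x0
prev_err = None
for k in range(1, 25):
    h = 2.0 ** (-k)
    err = abs(exact - forward_diff(np.cos, x0, h))
    ratio = prev_err / err if prev_err is not None else float('nan')
    print(f"h = 2^-{k:2d}   error = {err:.3e}   ratio = {ratio:.3f}")
    prev_err = err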
Improved results can be obtained by using (3.3) with both h and −h as follows:

f'(x_0) ≈ (f(x_0 + h) − f(x_0)) / h
f'(x_0) ≈ (f(x_0 − h) − f(x_0)) / (−h)

Averaging these two gives the central difference approximation

f'(x_0) ≈ (f(x_0 + h) − f(x_0 − h)) / (2h)    (3.5)

with truncation error

f'(x_0) − (f(x_0 + h) − f(x_0 − h)) / (2h) = −(h²/6) f⁽³⁾(ξ) = O(h²)    (3.6)

for some ξ ∈ (x_0 − h, x_0 + h).
We will eventually see that the approximation (3.5) can also be obtained as the
result of differentiating the interpolation polynomial agreeing with f at the nodes
x0 − h, x0, and x0 + h.
In Fig. 3.3, the difference between the two approaches is illustrated.
Fig. 3.3 Difference between the two numerical differentiation approaches (3.3) and (3.5)
Writing f̂ for the computed (rounded) values of f, and δ for a bound on the rounding error |f(x) − f̂(x)|, we have

|f'(x_0) − (f̂(x_0 + h) − f̂(x_0)) / h|
  ≤ |f'(x_0) − (f(x_0 + h) − f(x_0)) / h| + |(f(x_0 + h) − f(x_0)) / h − (f̂(x_0 + h) − f̂(x_0)) / h|
  ≤ (h/2) |f''(ξ)| + |f(x_0 + h) − f̂(x_0 + h)| / h + |f(x_0) − f̂(x_0)| / h
  ≤ (h/2) |f''(ξ)| + 2δ / h
Note that (h/2) |f''(ξ)| is the truncation error in the approximation of the first derivative. It is the second term of this error bound which leads to the growth of the error as h → 0 illustrated in Fig. 3.2. A similar effect is present for all numerical differentiation formulas.
Similar approaches to these can be used to approximate higher derivatives. For example, adding third order Taylor expansions for f(x_0 ± h), we obtain

f''(x_0) ≈ (f(x_0 + h) − 2 f(x_0) + f(x_0 − h)) / h²    (3.7)

which has error given by h² f⁽⁴⁾(θ)/12. Similarly, the five point formula is

f''(x_0) ≈ (−f(x_0 + 2h) + 16 f(x_0 + h) − 30 f(x_0) + 16 f(x_0 − h) − f(x_0 − 2h)) / (12h²)    (3.8)
Example 5 Use Eqs. (3.7) and (3.8) to estimate f''(0.7) for f(x) = cos x.

Using (3.7):
h = 0.1:  f''(0.7) ≈ 100 [cos(0.8) − 2 cos(0.7) + cos(0.6)] = −0.764205
h = 0.01: f''(0.7) ≈ 10000 [cos(0.71) − 2 cos(0.7) + cos(0.69)] = −0.764836

Using (3.8):
h = 0.1:  f''(0.7) ≈ −0.764841
h = 0.01: f''(0.7) ≈ −0.764842

This last estimate is exact in all places shown.
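These second-derivative estimates are easy to reproduce. The short sketch below (our own helper names) assumes the three-point formula (3.7) and the standard five-point second-derivative formula stated above for (3.8); the exact value is −cos 0.7 ≈ −0.764842.

import numpy as np

def d2_3point(f, x0, h):
    # Three-point formula (3.7)
    return (f(x0 + h) - 2 * f(x0) + f(x0 - h)) / h**2

def d2_5point(f, x0, h):
    # Five-point second-derivative formula (assumed form of (3.8))
    return (-f(x0 + 2*h) + 16*f(x0 + h) - 30*f(x0)
            + 16*f(x0 - h) - f(x0 - 2*h)) / (12 * h**2)

x0, exact = 0.7, -np.cos(0.7)
for h in (0.1, 0.01):
    print(h, d2_3point(np.cos, x0, h), d2_5point(np.cos, x0, h), exact)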
The derivatives in the model from Eq. (3.1) can be discretized using the above approx-
imate derivatives and an explicit formula for the temperature on the wire at a given
time can be derived. Consider using the forward difference approximation to the time
derivative,
u_t = (u(x, t + Δt) − u(x, t)) / Δt + O(Δt)

so that

u_t(x_i, t_k) ≈ (u_i^{k+1} − u_i^k) / Δt.

For the second derivative term, using Eq. (3.7) gives

u_xx(x_i, t_k) ≈ (u_{i−1}^k − 2 u_i^k + u_{i+1}^k) / (Δx)².

Substituting these into Eq. (3.1) and solving for u_i^{k+1} gives the explicit update formula

u_i^{k+1} = H u_{i−1}^k − (2H − 1) u_i^k + H u_{i+1}^k + (Δt/(ρc)) f_i^k,

where H = K Δt / (ρc (Δx)²) and f_i^k = f(x_i, t_k).
Some things to consider–how would you code this up? How would you perform a
grid refinement study? See the exercises (and subsequent chapters) for more with
this model.
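One possible way to organize such a computation is sketched below. The function name, parameter values, and the combination H = KΔt/(ρc(Δx)²) are assumptions made for illustration; this is not the implementation asked for in the exercises, only a starting point.

import numpy as np

def heat_explicit(u0, f, L, Tf, n, m, rho=1.0, c=1.0, K=1.0):
    """Explicit scheme u_i^{k+1} = H u_{i-1}^k + (1 - 2H) u_i^k + H u_{i+1}^k
    + (dt/(rho c)) f_i^k, with u = 0 at both ends."""
    dx, dt = L / n, Tf / m
    H = K * dt / (rho * c * dx**2)     # stability requires H <= 1/2
    x = np.linspace(0.0, L, n + 1)
    u = u0(x)
    for k in range(m):
        unew = u.copy()
        unew[1:-1] = (H * u[:-2] + (1 - 2 * H) * u[1:-1] + H * u[2:]
                      + dt / (rho * c) * f(x[1:-1], k * dt))
        u = unew
    return x, u

# example: initial temperature sin(pi x), no heat source
x, u = heat_explicit(lambda x: np.sin(np.pi * x), lambda x, t: 0.0 * x,
                     1.0, 0.1, 50, 1000)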
Exercises
1. Use Eq. (3.3) with h = 0.1, 0.01, . . . , 10⁻⁶ to estimate f'(x_0) for
   (a) f(x) = √x, x_0 = 4
   (b) f(x) = sin(x), x_0 = π/4
   (c) f(x) = ln x, x_0 = 1
day 0 2 4 6 8 10 12 14 16 20
height (inches) 8.23 9.66 10.95 12.68 14.20 15.5 16.27 18.78 22.27 24.56
8. The table below shows the position of a drone being tracked by a sensor. Use this to track its velocity and acceleration over time.
t (s) 0 2 4 6 8 10 12
position (m) 0.7 1.9 3.7 5.3 6.3 7.4 8.0
9. More and more seafood is being farm-raised these days. A model used for the
rate of change for a fish population, P(t) in farming ponds is given by
P'(t) = b (1 − P(t)/P_M) P(t) − h P(t)
where b is the birth rate, PM is the maximum number of fish the pond can
support, and h is the rate the fish are harvested.
(a) Use forward differences to discretize this problem and write a script that
takes in an initial population P(0) as well as all model parameters and
outputs (via a plot for example) the population over time.
(b) Demonstrate your model for a carrying capacity PM of 20,000 fish, a birth
rate of b = 6% and a harvesting rate of h = 4% if the pond starts with 5,000
fish.
(c) Perform some numerical experiments to demonstrate how sensitive the resulting populations are to all of the model parameters in part (b).
u_t − K u_xx − K u_yy = f(x, y, t)    (3.10)
# `np` is numpy and `io` is skimage.io; the image array `pic` and the
# surrounding time-step (k) loop are assumed to be set up earlier in the program.
newpic = np.empty_like(pic)
coeff = 1
for j in range(1, pic.shape[1] - 1):
    for i in range(1, pic.shape[0] - 1):
        newpic[i, j] = (pic[i, j] + coeff *
                        (pic[i - 1, j] - 2 * pic[i, j] +
                         pic[i + 1, j]) + coeff *
                        (pic[i, j - 1] - 2 * pic[i, j] +
                         pic[i, j + 1]))
pic[1:-1, 1:-1] = newpic[1:-1, 1:-1]
# Scale image values to correct range for saving
pic /= np.max(np.abs(pic))
io.imsave('blurry.jpg', pic)
The results look like this with K = 1 (Figs. 3.4 and 3.5)
(a) Run this code using a picture of your choice by modifying the jpg file.
Experiment with different time steps in the k-loop. Now consider different
model coefficients. There is a stability requirement you may run in to and
we hope you do! Summarize what you did and provide some illustrative
pictures.
(b) Discussion: Explain in your own words, the implementation here in terms
of the computation of newpic for blurring and sharpening. What is going
on there? Specifically relate the code to the discretized model above. What
is the meaning of coeff with regards to the heat flow (diffusion) model?
3.3 Numerical Integration
You already know more about numerical integration than you may think. Recall that the Riemann integral is defined as a limit of Riemann sums for arbitrary partitions over the interval of integration. When you were first learning about integration, you likely became familiar with the ideas by considering left and right endpoint approximations for finding the area below a function and above the x axis. One could start by taking uniform partitions with some number N of subdivisions. We can then denote the steplength of these subdivisions by
h = (b − a)/N

and write

x_k = a + kh    (k = 0, 1, . . . , N)
Then the Riemann left and right sums are given by

L = h Σ_{k=0}^{N−1} f(x_k) = h Σ_{k=0}^{N−1} f(a + kh)    (3.12)

R = h Σ_{k=1}^{N} f(x_k) = h Σ_{k=1}^{N} f(a + kh)    (3.13)
These Riemann sums are actually (very basic) numerical integration techniques.
They are equivalent to integrating step-functions which agree with f at the appropri-
ate set of nodes. These step-functions are piecewise constant functions (i.e. degree
0 interpolating polynomials). The efficient methods we shall investigate result from
increasing the degree of the interpolating polynomial.
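A direct translation of (3.12) and (3.13) into Python might look like the following sketch (the function and variable names are our own):

import numpy as np

def riemann_sums(f, a, b, N):
    """Left and right Riemann sums (3.12)-(3.13) with N uniform subdivisions."""
    h = (b - a) / N
    x = a + np.arange(N + 1) * h       # x_0, ..., x_N
    L = h * np.sum(f(x[:-1]))          # uses x_0, ..., x_{N-1}
    R = h * np.sum(f(x[1:]))           # uses x_1, ..., x_N
    return L, R

print(riemann_sums(np.sin, 0.0, np.pi, 100))   # both approach 2 as N grows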
Fig. 3.6 Left (a) and right (b) sum approximations to an integral
The general idea is to approximate the integral by a weighted sum of function values. That is, we use approximations of the form
∫_a^b f(x) dx ≈ Σ_{k=0}^{N} c_k f(x_k)    (3.14)
where the nodes x0 , x1 , . . . , x N ∈ [a, b] and the coefficients, or weights, ck are cho-
sen appropriately. This is often referred to as an interpolatory quadrature formula.
Note that Eq. (3.14) is actually a particular form of a Riemann sum where the
nodes are the representative points from the various subintervals in the partition. The
weights would be the widths of the subdivisions of the interval. Clearly, as a user of numerical integration, we want algorithms that are accurate, in the sense that they give good approximations of the true integrals, but simplicity is also desirable. All the methods we shall consider are based on approximating f(x) with a polynomial p(x) on each subinterval and integrating the polynomial exactly:

∫_a^b f(x) dx ≈ ∫_a^b p(x) dx.
Most importantly, the polynomials used have the same values as f at the nodes.
This is called an interpolating polynomial. These ideas can be used to derive specific
nodes and weights that ensure a particular accuracy, resulting in a technique of form
in Eq. (3.14), which is just a weighted sum of function values. It is important to note
here, the numerical integration schemes will be derived in terms of function values
of f (x) at the nodes, but in practice, data values could be used instead.
The following is a definition that we will use repeatedly in this section to derive numerical integration schemes.

Definition A quadrature rule has degree of precision m if it integrates every polynomial of degree at most m exactly, but is not exact for x^{m+1}.
The key idea is that any interpolatory quadrature rule using the nodes x0 , x1 , . . . ,
x N must have degree of precision at least N since it necessarily integrates polyno-
mials of degree up to N exactly. (We shall see later that the degree of precision can
be greater than N .) We illustrate the process with an example.
Example 7 Find the quadrature rule using the nodes x_0 = 0, x_1 = 1, and x_2 = 2 for approximating ∫_0^3 f(x) dx. Use the resulting rule to estimate ∫_0^3 x³ dx.
we obtain

c_0 = 3/4,  c_1 = 0,  c_2 = 9/4

The required quadrature rule is therefore

∫_0^3 f(x) dx ≈ (3 f(0) + 0 f(1) + 9 f(2)) / 4
whereas the true integral is 3⁴/4 = 81/4. This shows that the degree of precision of this formula is 2 since it fails to be exact for x³. Actually, for f(x) = x³, the error is 9/4.
Note though, that the scheme should integrate any second degree polynomial exactly. For example, consider f(x) = 3x² + 2x + 1, so that

∫_0^3 f(x) dx = 39
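The weights in an interpolatory rule such as the one in Example 7 can also be found numerically by solving the linear system of moment equations. The following is a minimal sketch using NumPy and the nodes of Example 7; it also checks the rule on x³ and on the quadratic above.

import numpy as np

nodes = np.array([0.0, 1.0, 2.0])
a, b = 0.0, 3.0
# Row j of V contains x_k^j; the right-hand side holds the exact moments
# integral_a^b x^j dx for j = 0, 1, 2.
V = np.vander(nodes, 3, increasing=True).T
moments = np.array([(b**(j + 1) - a**(j + 1)) / (j + 1) for j in range(3)])
c = np.linalg.solve(V, moments)
print(c)                                   # approximately [0.75, 0., 2.25]
print(c @ nodes**3, (b**4 - a**4) / 4)     # 18 versus 20.25: not exact for x^3
g = lambda x: 3 * x**2 + 2 * x + 1
print(c @ g(nodes))                        # 39, matching the exact integral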
[Figure: the midpoint rule estimate of ∫_a^b f(x) dx is the area of a shaded rectangle]
[Figure: the trapezoid rule estimate of ∫_a^b f(x) dx is the area of a shaded region]
Simpson's Rule

∫_a^b f(x) dx ≈ ((b − a)/6) [f(a) + 4 f((a + b)/2) + f(b)]    (3.17)
[Figure: the Simpson's rule estimate of ∫_a^b f(x) dx is the area of a shaded region]
Example 8 Estimate the integral for the standard normal distribution function

N_{0,1}(1/2) = 1/2 + (1/√(2π)) ∫_0^{1/2} exp(−x²/2) dx

using the midpoint, trapezoid and Simpson's rules.

Midpoint rule:
N_{0,1}(1/2) ≈ 1/2 + (1/√(2π)) (1/2) exp(−(1/4)²/2) ≈ 0.693334

Trapezoid rule:
N_{0,1}(1/2) ≈ 1/2 + (1/√(2π)) (1/4) [exp(0) + exp(−(1/2)²/2)] ≈ 0.687752

Simpson's rule:
N_{0,1}(1/2) ≈ 1/2 + (1/√(2π)) (1/12) [exp(0) + 4 exp(−(1/4)²/2) + exp(−(1/2)²/2)] ≈ 0.691473
The true value is 0.6914625 to 7 decimals. The errors in these approximations are
therefore approximately 0.0019, 0.0037 and 0.00001 respectively. It is obvious that
Simpson’s rule has given a significantly better result. We also see that the error in the
midpoint rule is close to one-half that of the trapezoid rule, which will be explained
soon!
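The three estimates in Example 8 can be reproduced directly from the basic midpoint, trapezoid, and Simpson formulas; the sketch below uses our own small helper functions.

import numpy as np

def midpoint(f, a, b):   return (b - a) * f((a + b) / 2)
def trapezoid(f, a, b):  return (b - a) / 2 * (f(a) + f(b))
def simpson(f, a, b):    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

f = lambda x: np.exp(-x**2 / 2)
for rule in (midpoint, trapezoid, simpson):
    print(rule.__name__, 0.5 + rule(f, 0.0, 0.5) / np.sqrt(2 * np.pi))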
c_0 + c_1 + c_2 = 2h
−c_0 h + c_2 h = 0
c_0 (−h)² + c_2 h² = 2h³/3

The second of these immediately gives us c_0 = c_2, and then the third yields c_0 = h/3. Substituting these into the first equation, we get c_1 = 4h/3 so that Simpson's rule for this interval is

∫_{−h}^{h} f(x) dx ≈ (h/3) [f(−h) + 4 f(0) + f(h)]

For f(x) = x³ the rule gives

(h/3) [(−h)³ + 4(0)³ + h³] = −h⁴/3 + h⁴/3 = 0
which is exact, as claimed. You can show that Simpson’s rule is not exact for f (x) =
x 4 . Similarly, the midpoint rule gives higher precision: it has degree of precision 1.
where h = b − a,

∫_a^b f(x) dx − ((b − a)/2) [f(a) + f(b)] = −((b − a) h²/12) f''(ξ_T)    (3.19)
Example 9 Find the quadrature formulas with maximum degree of precision, the
Gaussian rules, on [−1, 1] using
(a) We now seek the nodes x_0, x_1 and their weights c_0, c_1 to make the integration rule exact for f(x) = 1, x, x², . . . , up to as high a degree polynomial as possible. Since there are four unknowns, it makes sense to try to satisfy four equations. Note that

I(x^k) = ∫_{−1}^{1} x^k dx = 2/(k + 1) if k is even, and 0 if k is odd.
Thus we need:

(i)   c_0 + c_1 = 2
(ii)  c_0 x_0 + c_1 x_1 = 0
(iii) c_0 x_0² + c_1 x_1² = 2/3
(iv)  c_0 x_0³ + c_1 x_1³ = 0

We can assume that neither c_0 nor c_1 is zero since we would then have just a one-point formula. From (ii), we see that c_0 x_0 = −c_1 x_1, and substituting this in (iv), we get

0 = c_0 x_0³ + c_1 x_1³ = c_0 x_0 (x_0² − x_1²)

We have already concluded that c_0 ≠ 0. If x_0 = 0, then (ii) would also give us x_1 = 0 and again the formula would reduce to the midpoint rule. It follows that x_0² = x_1², and therefore that x_0 = −x_1. Now (ii) implies that c_0 = c_1, and from (i), we deduce that c_0 = c_1 = 1. Substituting all of this into (iii), we get 2x_0² = 2/3 so that we can take

x_0 = −1/√3,  x_1 = 1/√3
x_0 = −x_2,  x_1 = 0
c_0 = c_2
2c_0 + c_1 = 2
2c_0 x_0² = 2/3
2c_0 x_0⁴ = 2/5
Example 10 Use the Gaussian rules with two and three nodes to estimate

∫_{−1}^{1} dx/(x + 2)

Compare the results with those obtained using the midpoint, trapezoid and Simpson's rules.
For comparison, let's calculate the midpoint, trapezoid and Simpson's rule estimates:

M = 2 (1/(0 + 2)) = 1
T = 1 (1/((−1) + 2) + 1/(1 + 2)) = 4/3 ≈ 1.33333
S = (1/3) (1 + 4 (1/2) + 1/3) = 10/9 ≈ 1.11111
Note the true value is ln(3) ≈ 1.09861. The two Gaussian rules obtained in Example 9 yield

(3.21): G_2 = 1/(−1/√3 + 2) + 1/(1/√3 + 2) = 1.09091

(3.22): G_3 = (1/9) [5/(−√(3/5) + 2) + 8 (1/2) + 5/(√(3/5) + 2)] ≈ 1.09804
The Gaussian rules have significantly greater accuracy than the simpler rules using
similar numbers of nodes.
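A quick check of these Gaussian estimates in Python is shown below; the nodes and weights used are those derived in Example 9 (two-point rule) and the standard three-point values consistent with the equations above.

import numpy as np

f = lambda x: 1.0 / (x + 2.0)

# Two-point Gauss rule on [-1, 1]: nodes +/- 1/sqrt(3), weights 1
x2, w2 = np.array([-1.0, 1.0]) / np.sqrt(3), np.array([1.0, 1.0])
# Three-point Gauss rule on [-1, 1]: nodes 0, +/- sqrt(3/5), weights 8/9, 5/9
x3 = np.array([-np.sqrt(3/5), 0.0, np.sqrt(3/5)])
w3 = np.array([5/9, 8/9, 5/9])

print(w2 @ f(x2), w3 @ f(x3), np.log(3.0))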
Exercises
1. Derive the formula for the basic trapezoid rule by finding coefficients so that

∫_a^b f(x) dx = α f(a) + β f(b)

for f(x) = 1, x.
2. Find the quadrature rule for the interval [−2, 2] using the nodes −1, 0, +1. What
is its degree of precision?
3. Repeat Exercise 2 for the nodes −2, −1, 0, +1, +2. Use the resulting formula to estimate the area of a semicircle of radius 2.
4. Use the midpoint, trapezoid and Simpson's rules to estimate ∫_0^1 1/(1 + x²) dx. Compare the results to the true value π/4.
5. Use the midpoint, trapezoid and Simpson's rules to estimate ∫_1^2 (1/x) dx. Compare the results to the true value ln 2.
6. Find the Gaussian quadrature rule using three nodes in [0, 1]. Use this rule to estimate ∫_0^1 1/(1 + x²) dx and ∫_0^1 1/(x + 1) dx. Compare the results with those obtained in Exercises 4 and 5.
3.4 Composite Formulas
The accuracy obtained from the trapezoid rule (or any other quadrature formula) can
be improved by using the fact that
∫_a^b f(x) dx = ∫_a^c f(x) dx + ∫_c^b f(x) dx
for some intermediate point c ∈ (a, b) . For example, using c = 1/4 for the trapezoid
rule in Example 8 we get
N_{0,1}(1/2) ≈ 1/2 + (1/√(2π)) (1/8) [exp(0) + exp(−(1/4)²/2)]
             + (1/√(2π)) (1/8) [exp(−(1/4)²/2) + exp(−(1/2)²/2)]
             = 0.690543
reducing the error to about 0.0009 (see Fig. 3.10). Notice that by cutting the steplength h in half, we have reduced the error to approximately one-fourth of its
previous value. This is directly related to our error expression from Eq. (3.19) and our
previous notion of the grid refinement study presented in the context of numerical
derivatives. We’ll discuss this more soon.
We can rewrite the formula for the trapezoid rule using two subdivisions of the
original interval to obtain
∫_a^b f(x) dx ≈ (h/2) [f(a) + f(a + h)] + (h/2) [f(a + h) + f(b)]    (3.23)
             = (h/2) [f(a) + 2 f(a + h) + f(b)]
where now h = (b − a) /2. Geometrically, it is easy to see that this formula
is just the average of the left and right sums for this partition of the original
interval. Algebraically, this is also easy to see: Eq. (3.23) is exactly this aver-
age since the corresponding left and right sums are just h [ f (a) + f (a + h)] and
h [ f (a + h) + f (b)] .
This provides an easy way to obtain the general formula for the composite trape-
zoid rule TN using N subdivisions of the original interval. With h = (b − a) /N , the
left and right sums are given by (3.12) and (3.13), and averaging these we get
∫_a^b f(x) dx ≈ T_N = (L_N + R_N)/2
 = (1/2) [ h Σ_{k=0}^{N−1} f(a + kh) + h Σ_{k=1}^{N} f(a + kh) ]
 = (h/2) [ f(a) + 2 Σ_{k=1}^{N−1} f(a + kh) + f(b) ]    (3.24)
def trapsum(fcn, a, b, N):
    """
    Function for approximating the integral of
    the function `fcn` over the interval [a, b]
    in N segments using the trapezoid rule.
    """
    h = (b - a) / N
    s = (fcn(a) + fcn(b)) / 2
    for k in range(1, N):
        s += fcn(a + k * h)
    return s * h
Here s accumulates the sum of the function values which are finally multiplied by
the steplength h to complete the computation. In the version above the summation is
done using a loop. The explicit loop can be avoided by using Python’s sum function
as follows.
import numpy as np

def trapN(fcn, a, b, N):
    """
    Computes the trapezoid rule for the integral
    of the function `fcn` over [a, b] using N
    subdivisions.
    """
    h = (b - a) / N
    x = a + np.arange(1, N) * h
    return h * ((fcn(a) + fcn(b)) / 2 + sum(fcn(x)))
For the latter version it is essential that the function fcn is written for vector inputs.
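For example, assuming trapN has been defined as above (and NumPy imported as np), a call with a vectorized integrand might look like this:

import numpy as np

for N in (4, 16, 64):
    print(N, trapN(np.sin, 0.0, np.pi, N))   # approaches the exact value 2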
Let’s take a look back at our error formulas. It makes sense that the error formula
(3.19) remains valid for the composite trapezoid rule – except with h = (b − a) /N .
The proof of this result is left as an exercise while we provide the proof for Simpson’s
rule later.
Next, consider this error formula empirically. The error should depend in some
way on the function to be integrated. Since the trapezoid rule integrates polynomials
of degree 1 exactly, it is reasonable to suppose that this dependence is on the second
derivative. (After all, f''(x) ≡ 0 for any polynomial of degree 1.) Furthermore, if a function f varies slowly, we expect ∫_a^b f(x) dx to be approximately proportional to the length of the interval [a, b]. Using a constant steplength, the trapezoid rule estimate of this integral is expected to be proportional to b − a. In that case the error
should be, too. That the error depends on the steplength seems natural and so we
may reasonably suppose that the error
E(T_N) = ∫_a^b f(x) dx − T_N = c (b − a) h^k f''(ξ)

for some "mean value point" ξ, some power k, and some constant c.
We set f(x) = x² to achieve a constant f''. By fixing [a, b], we can determine suitable k and c values. [a, b] = [0, 3] is a convenient choice so that ∫_a^b f(x) dx = 9. The following Python loop generates a table of error values E(T_N) for this integral with N = 1, 2, 4, 8.
for k in range(4):
    N = 2**k
    I = trapN(lambda x: x**2, 0, 3, N)
    print(N, 9 - I)
The results, and the ratios of successive errors, are
Each time the number of subdivisions is doubled, i.e. h is halved, the error is reduced by a factor of 4. This provides strong evidence that k = 2. Finally, for N = 1, it is seen that E(T_N) = −4.5 with b − a = 3, h = 3, and f''(ξ) = 2. This means that the constant c must satisfy

c (3) (3²) (2) = −4.5

so that c = −1/12, in agreement with the error formula (3.19).
Fig. 3.11 Nodes used by the composite trapezoid and midpoint rules
complicated). Similar arguments can be given in support of the other error formulas
(3.18) and (3.20) for the midpoint and Simpson’s rules.
The composite midpoint rule uses the midpoints of each of the subdivisions. Let x_k = a + kh as in the trapezoid rule; then in the midpoint rule we can use the same N subdivisions of [a, b] with the nodes y_k = (x_{k−1} + x_k)/2 = a + (k − 1/2) h for k = 1, 2, . . . , N. This gives

∫_a^b f(x) dx ≈ M_N = h Σ_{k=1}^{N} f(y_k) = h Σ_{k=1}^{N} f(a + (k − 1/2) h)    (3.25)
Figure 3.11 shows the distribution of nodes for the midpoint and trapezoid rules. The composite version of Simpson's rule using h = (b − a)/(2N) uses all the points of the composite midpoint rule (3.25) and the composite trapezoid rule (3.24) together. Rewriting these two rules with this value of h, we get

M_N = 2h Σ_{k=1}^{N} f(y_k) = 2h Σ_{k=1}^{N} f(a + (2k − 1) h)

T_N = h [ f(a) + 2 Σ_{k=1}^{N−1} f(x_k) + f(b) ] = h [ f(a) + 2 Σ_{k=1}^{N−1} f(a + 2kh) + f(b) ]
Now applying Simpson's rule to each of the intervals [x_{k−1}, x_k] we get the composite Simpson's rule formula

S_{2N} = (h/3) [ f(a) + 4 f(a + h) + 2 f(a + 2h) + · · · + 2 f(a + 2(N − 1) h) + 4 f(a + (2N − 1) h) + f(b) ]
       = (h/3) [ f(a) + 4 Σ_{k=1}^{N} f(y_k) + 2 Σ_{k=1}^{N−1} f(x_k) + f(b) ]    (3.26)
       = (T_N + 2 M_N) / 3    (3.27)
The formula in (3.27) can be used to create an efficient implementation of Simpson's rule. It is important to note that Simpson's rule is only defined for an even number of subdivisions, since each subinterval on which Simpson's rule is applied must be subdivided into two pieces.
The following code implements this version using explicit loops. As with the
trapezoid rule, the sum function can be used to shorten the code.
Program Python code for composite Simpson’s rule using a fixed number of subdi-
visions
import numpy as np

def simpsum(fcn, a, b, N):
    """
    Function for approximating the integral of
    the function `fcn` over the interval [a, b]
    in N segments using the Simpson rule.
    """
    h = (b - a) / N
    s = fcn(a) + fcn(b)
    for k in range(1, N, 2):
        s += 4 * fcn(a + k * h)
    for k in range(2, N - 1, 2):
        s += 2 * fcn(a + k * h)
    return s * h / 3
It would be easy to reinitialize the sum so that only a single loop is used, but the
structure is kept simple here to reflect (3.26) as closely as possible.
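Alternatively, (3.27) gives Simpson's rule as a combination of trapezoid and midpoint sums. The following sketch uses our own helper functions (equivalent to trapN and a midpoint analogue) purely to illustrate that identity.

import numpy as np

def trapN(fcn, a, b, N):
    h = (b - a) / N
    x = a + np.arange(1, N) * h
    return h * ((fcn(a) + fcn(b)) / 2 + np.sum(fcn(x)))

def midN(fcn, a, b, N):
    h = (b - a) / N
    y = a + (np.arange(1, N + 1) - 0.5) * h
    return h * np.sum(fcn(y))

def simpson_via_TM(fcn, a, b, N):
    # Composite Simpson's rule S_{2N} = (T_N + 2 M_N) / 3, Eq. (3.27)
    return (trapN(fcn, a, b, N) + 2 * midN(fcn, a, b, N)) / 3

print(simpson_via_TM(lambda x: 1 / (x + 2), -1.0, 1.0, 8), np.log(3.0))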
Using the trapsum and simpsum functions, and a similar program for the midpoint
rule, we obtain the following table of results:
It may not be surprising that Simpson’s rule estimates are much more accurate for
the same numbers of subdivisions – and that their errors decrease faster.
Next, we provide the theorem regarding errors in composite Simpson’s rule esti-
mates. This result will also be helpful in our practical algorithm in the next section.
This theorem establishes that the error formula (3.20) for the basic Simpson’s rule
remains valid for the composite rule.
Theorem 2 Suppose the integral I = ∫_a^b f(x) dx is estimated by Simpson's rule S_N using N subdivisions of [a, b] and suppose that f⁽⁴⁾ is continuous. Then the error in this approximation is given by

I − S_N = −((b − a) h⁴/180) f⁽⁴⁾(ξ)    (3.28)

where h = (b − a)/N for some ξ ∈ (a, b).
∫_{x_k}^{x_{k+1}} f(x) dx − (h/3) [f(x_k) + 4 f(x_k + h) + f(x_{k+1})] = −((x_{k+1} − x_k) h⁴/180) f⁽⁴⁾(ξ_k)

Summing these local errors over the M = N/2 subintervals [x_k, x_{k+1}], each of width 2h, gives

E(S_N) = −(2h⁵/180) Σ_{k=0}^{M−1} f⁽⁴⁾(ξ_k)

Because this fourth derivative is continuous, and therefore bounded, it follows from the intermediate value theorem that there exists a point ξ ∈ (a, b) such that

M f⁽⁴⁾(ξ) = Σ_{k=0}^{M−1} f⁽⁴⁾(ξ_k)

so that

E(S_N) = −((b − a) h⁴/180) f⁽⁴⁾(ξ)
as required.
Example 12 Obtain an error bound for the Simpson's rule estimates found in Example 11.

Note that these error bounds are all significantly larger than the actual errors. In practice, you don't know the exact error, but Theorem 2 can still be used as the basis of practical numerical integration routines since it provides bounds.
Up until now, our discussion and analysis assumes we have access to a func-
tion f (x). However, one of the powerful aspects of numerical calculus is that the
approximations rely only on function values. To this end an analytic function is not
needed and these methods can be used with data alone (which is often the case in
the real-world). The following example shows this and we leave it as an exercise to
modify the numerical integration subroutines in this chapter to handle data instead
of functions as input.
With N = 6, h = (b − a)/N = 24/6 = 4 and the approximation is

∫_0^{24} r(t) dt ≈ (4/2) [r(0) + 2r(4) + 2r(8) + 2r(12) + 2r(16) + 2r(20) + r(24)].
However, we do not have an analytic expression for r(t) and therefore need to estimate those values based on the curve. There is inevitably human error that cannot be avoided, but with a careful eye, suppose the following values are taken in our sum:
[Figure: estimated beetle population rate r (beetles/week) plotted against t (weeks)]
Since we are integrating the population rate, the solution here is the total beetle
population.
Exercises
1. Write a program for the midpoint rule using N subdivisions for ∫_a^b f(x) dx. Use it to find estimates of ln 5 = ∫_1^5 (1/x) dx using N = 1, 2, 4, 8, 16, 32.
2. Use your answers for Exercise 1 to verify that the error in the midpoint rule is
proportional to h 2 .
3. Derive the error formula for the composite trapezoid rule. Use this formula to
obtain bounds on the trapezoid rule errors in Example 11.
4. Repeat Exercises 1 and 2 for the trapezoid rule.
5. Repeat the computation of Exercise 1 for Simpson’s rule with N =2, 4, 8, 16, 32.
Use the results to check that the error is proportional to h 4 .
6. Use the midpoint, trapezoid and Simpson's rules to estimate

N_{0,1}(2) = 1/2 + (1/√(2π)) ∫_0^2 exp(−x²/2) dx

using N = 10 subdivisions.
7. A model for the shape of a lizard's egg can be found by rotating the graph of y = (ax³ + bx² + cx + d)√(1 − x²) around the x-axis. Consider using a = −0.05, b = 0.04, c = 0.1, d = 0.48. Set up an integral to find the volume of the resulting solid of revolution (you may need to consult a calculus text to remember the techniques to do that!). You may want to plot the model equation on the interval [−1, 1]. Use numerical integration to approximate the resulting integral and hence the volume.
8. Duplicate the results in Example 13 by modifying your trapezoid code to work
with data instead of function evaluations.
9. Modify your Simpson’s rule code to work with data instead of function eval-
uations and apply it to the problem in Example 13 using N = 6 subintervals.
Increase the number of subintervals. How does this impact your solution?
10. A sensor was used to track the speed of a bicyclist during the first 5 s of a race.
Use numerical integration to determine the distance she covered.
t(s) 0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
v (m/s) 0 7.51 8.10 8.93 9.32 9.76 10.22 10.56 11.01 11.22 11.22
11. Just to get you outside (and to gain a better understanding of numerical integra-
tion), your mission is to find a leaf and using composite numerical integration,
approximate the area of the leaf using two numerical integration techniques of
your choice. Use at least 10 subintervals.
One approach to this would be to trace the leaf aligned on a coordinate system
and measure different points on the leaf. These could be input to your program.
Which of the methods you chose is more accurate? Can you think of a reason
why it might be important to know the area of a leaf? Include any sketches of
the leaf with your computations as well as the leaf taped to a paper. Be sure your
methodology is clearly explained.
12. Find a map of an interesting place that comes with a scale and has distinct
borders, for example suppose we pick Italy. By searching on google for ’outline
map Italy’ you’ll find several. Do not pick a simple rectangular place, like the
state of Pennsylvania. Find the area of that region using two different composite
numerical integration techniques of your choice.
One approach to this would be to print the map, align Italy on a coordinate system
and measure different points that could be input to your code. Use at least 10
subintervals but if the geometry is interesting you may need to use more–explain
your choice. Using your scale, how would you scale up your computations and
compare to the real area? Be sure your methodology is clearly explained.
13. You are asked to come up with a concept for a new roller coaster at the Flux
Theme Park. Your coaster is to be made of piecewise continuous functions that
satisfy certain design constraints. Once the coaster concept is developed you
will estimate the construction cost based on the assumption that the materials
and labor required to build the coaster are proportional to the length of the track
and the amount of support required.
In addition to being a continuous track, your coaster must satisfy the following
design requirements:
(d) You may need other functions to connect the track pieces–these can be any-
thing you choose–but you should not have any obvious SHARP transitions.
Your coaster must also satisfy the following physical constraints:
(e) The first hill needs to be the tallest.
(f) The first hill should be at least 50’
Calculating the cost: The total construction cost assumes that the cost is given
by C = k1 A + k2 L, where k1 and k2 are cost coefficients, A is the area below
the entire track and L is the length of the entire track. Use k1 = $750/ft2 and
k2 = $185/ft.
Area Below Track: Think about which parts of your coaster need support–for
example, you do not need support for the interior of your loop. You will need to
use approximate integration to calculate the area below your coaster (use Simp-
son’s rule) and a sufficient number of subintervals. Comment on the accuracy of
your calculations and your choice of subintervals.
Length of Track: You should use the arc length formula described below. Most arc length problems result in integrals which cannot be evaluated exactly, with the integral given by

L = ∫_a^b √(1 + (f'(x))²) dx
For this you will need to use numerical integration to approximate the resulting
integral. Use Simpson’s rule with a sufficient number of subintervals.
Submit a detailed report on the coaster design to the CEO of Flux. Clear and
concise technical writing is important. Please take the time to create a report that
you are proud of.
3.5 Practical Numerical Integration
In this section we consider how to estimate an integral with an error smaller than some prescribed tolerance ε. The methods presented here are based on the trapezoid and Simpson's rules, but can easily be extended to other methods.
Recall the error in the composite Simpson’s rule estimate of I from Eq. (3.28)
I − S_N = −((b − a) h⁴/180) f⁽⁴⁾(ξ)
where ξ is some (unknown) mean value point.
This formula is the basis for an approach to practical numerical integration using
the following steps:
1. Find a bound M_4 for |f⁽⁴⁾(x)| on [a, b].
2. Choose N such that ((b − a) h⁴/180) M_4 < ε with h = (b − a)/N, that is, N ≥ (b − a) ((b − a) M_4 / (180 ε))^{1/4}.
3. Evaluate S_N.
import numpy as np
from prg_simpsum import simpsum

def simpson_tol(fcn, a, b, M4, eps):
    # The function name here is reconstructed; the routine chooses N
    # from the error bound (3.28) and then calls simpsum.
    L = abs(b - a)
    N = int(np.ceil(L * np.sqrt(np.sqrt(L * M4 /
                                        180 / eps))))
    # N must be even
    if N % 2 == 1:
        N += 1
    return simpsum(fcn, a, b, N)
Note the use of abs(b-a). This allows the possibility that b < a which is mathe-
matically well-defined, and which is also handled perfectly well by Simpson’s rule.
If b < a, then the steplength h will be negative and the appropriately signed result
will be obtained.
E(S_N) = −((b − a) (2h)⁴/180) f⁽⁴⁾(ξ_1)    (3.31)

E(S_{2N}) = −((b − a) h⁴/180) f⁽⁴⁾(ξ_2) ≈ E(S_N)/16    (3.32)

if f⁽⁴⁾ is approximately constant over [a, b]. If |S_N − S_{2N}| < ε, and 16 (I − S_{2N}) ≈ (I − S_N), then it follows that

|I − S_{2N}| ≈ |S_N − S_{2N}| / 15 < ε/15

so that the second of these estimates is expected to be well within the prescribed tolerance.
Example 15 Recompute the integral I = ∫_{−1}^{1} 1/(x + 2) dx starting with N = 2 and repeatedly doubling N until two Simpson's rule estimates agree to within 10⁻¹⁰.
4 1.1000000000
8 1.0987253487
16 1.0986200427
32 1.0986127864
64 1.0986123200
128 1.0986122906
256 1.0986122888
512 1.0986122887
1024 1.0986122887
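The doubling strategy of Example 15 can be organized as a short loop. The following sketch is our own illustration; it restates a vectorized composite Simpson routine so that the fragment is self-contained.

import numpy as np

def simpsum(fcn, a, b, N):
    """Composite Simpson's rule with N (even) subdivisions."""
    h = (b - a) / N
    x = a + np.arange(N + 1) * h
    return h / 3 * (fcn(x[0]) + fcn(x[-1])
                    + 4 * np.sum(fcn(x[1:-1:2])) + 2 * np.sum(fcn(x[2:-1:2])))

def simpson_doubling(fcn, a, b, tol, N=2):
    old = simpsum(fcn, a, b, N)
    while True:
        N *= 2
        new = simpsum(fcn, a, b, N)
        print(N, f"{new:.10f}")
        if abs(new - old) < tol:
            return new
        old = new

simpson_doubling(lambda x: 1 / (x + 2), -1.0, 1.0, 1e-10)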
15 I ≈ (16h/3) { f(a) + 4 f(a + h) + 2 f(a + 2h) + 4 f(a + 3h) + f(b) }
        − (2h/3) { f(a) + 4 f(a + 2h) + f(b) }

which simplifies to yield

I ≈ (2h/45) [7 f(a) + 32 f(a + h) + 12 f(a + 2h) + 32 f(a + 3h) + 7 f(b)]    (3.35)
This formula is in fact the interpolatory quadrature rule using five equally spaced nodes, which could have been derived just as we did for the other interpolatory rules. The error in using Eq. (3.35) has the form C (b − a) h⁶ f⁽⁶⁾(ξ). The above process of cutting the steplength in half, and then removing the most significant error contribution, could be repeated using Eq. (3.35) to get improved accuracy. This process is the underlying idea behind Romberg integration.
In practice, Romberg integration is typically started from the trapezoid rule rather
than Simpson’s rule. In that case, Simpson’s rule is actually rediscovered as the result
of eliminating the second order error term between T1 and T2 . You will see that
S_{2N} = (4 T_{2N} − T_N)/3    (3.36)
(The verification of this result is left as an exercise.)
Romberg integration can be expressed more conveniently by introducing a notation for the various estimates obtained and arranging them as elements of a lower triangular matrix. This is done as follows. The first column has entries R_{n,1} which are the trapezoid rule estimates using 2^{n−1} subdivisions:

R_{n,1} = T_{2^{n−1}}

The second column has the results that would be obtained using Simpson's rule, so that, for n ≥ 2, we have

R_{n,2} = (4 R_{n,1} − R_{n−1,1})/3

and similarly

R_{n,3} = (16 R_{n,2} − R_{n−1,2})/15

for n ≥ 3.
Continuing in this manner produces the following array:

R_{1,1}
R_{2,1}  R_{2,2}
R_{3,1}  R_{3,2}  R_{3,3}
R_{4,1}  R_{4,2}  R_{4,3}  R_{4,4}
R_{5,1}  R_{5,2}  R_{5,3}  R_{5,4}  R_{5,5}
  ⋮        ⋮        ⋮        ⋮        ⋮     ⋱
In general, later columns are obtained from

R_{n,k+1} = (4^k R_{n,k} − R_{n−1,k}) / (4^k − 1)    (3.37)
Program Python function for Romberg integration as far as the entry R M,M where
M is some maximum level to be used.
import numpy as np
from prg_midpoint import midsum

def romberg(fcn, a, b, ML):
    """
    Integrate `fcn` using Romberg integration
    based on the midpoint rule. Returns the full
    Romberg array to maximum level `ML`.
    """
    N = 1
    R = np.zeros((ML, ML))
    R[0, 0] = (b - a) * (fcn(a) + fcn(b)) / 2
    for L in range(1, ML):
        M = midsum(fcn, a, b, N)
        R[L, 0] = (R[L - 1, 0] + M) / 2
        for k in range(L):
            R[L, k + 1] = ((4**(k + 1) * R[L, k]
                            - R[L - 1, k]) /
                           (4**(k + 1) - 1))
        N *= 2
    return R
Note the use of the appropriate midpoint rule to update the trapezoid rule R(L,1).
You can also verify that
T_{2N} = (T_N + M_N)/2
to allow for a more efficient computation of subsequent trapezoid rule estimates by
ultimately avoiding the need to re-evaluate the function at nodes that have already
been used. (See the exercises for a verification of this identity.)
The following Python commands can be used with the function romberg above
to yield the results shown.
Integral a.
>>> romberg(lambda x: 1 / (x + 2), -1, 1, 6)
[[ 1.33333  0.       0.       0.       0.       0.     ]
 [ 1.16666  1.11111  0.       0.       0.       0.     ]
 [ 1.11666  1.1      1.09925  0.       0.       0.     ]
 [ 1.10321  1.09872  1.09864  1.09863  0.       0.     ]
 [ 1.09976  1.09862  1.09861  1.09861  1.09861  0.     ]
 [ 1.09890  1.09861  1.09861  1.09861  1.09861  1.09861]]
and integral b.
>>> romberg(lambda x: 1 / np.sqrt(2 * np.pi)
            * np.exp(-x**2 / 2), 0, 2, 6)
[[ 0.45293  0.       0.       0.       0.       0.     ]
 [ 0.46843  0.47360  0.       0.       0.       0.     ]
 [ 0.47501  0.47720  0.47744  0.       0.       0.     ]
 [ 0.47668  0.47724  0.47725  0.47724  0.       0.     ]
 [ 0.47710  0.47724  0.47724  0.47724  0.47724  0.     ]
 [ 0.47721  0.47724  0.47724  0.47724  0.47724  0.47724]]
The zeros represent the fact that the Python code uses a matrix to store R and
these entries are never assigned values. In the mathematical algorithm these entries
simply do not exist.
In practice, Romberg integration is used iteratively until two successive values
agree to within a specified tolerance as opposed to performed for a fixed number of
steps. Usually this agreement is sought among entries on the diagonal of the array,
or between the values Rn,n−1 and Rn,n . The subroutine above can be modified for
this (see exercises).
In practice, you may only have data values to work with, implying you have no
control over the length of the subintervals or ultimately, the accuracy. On the other
hand, if you have a function to evaluate, we have shown there are ways to choose the
number of subintervals so that a desired accuracy can be obtained. Here we present
one more approach called adaptive quadrature. Adaptive integration algorithms use
the basic principle of subdividing the range of integration but without the earlier
insistence on using uniform subdivisions throughout the range. We present these ideas using Simpson's rule for the integration scheme in order to relay the key points; however, any basic rule, or indeed Romberg integration, could be applied.
The two primary forms of adaptive quadrature are
1. Decide on an initial subdivision of the interval into N subintervals xk , xk+1
with
a = x0 < x1 < x2 < · · · < x N = b
and then use the composite Simpson's rule with continual doubling of the number of intervals in each of the N original subintervals in turn. (Typical choices of the initial subdivision are five or 20 equal subintervals.)
2. Start with the full interval and apply Simpson’s rule to the current interval and to
each of its two halves. If the results agree to within the required accuracy, then
the partial result is accepted, and the algorithm moves to the next subinterval. If
the estimates do not agree to the required accuracy, then continue working with
one half until accuracy is obtained, then move to the next half.
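A minimal recursive sketch of the second strategy (interval halving with a per-interval tolerance) is shown below. It is our own illustration, not the implementation used for the examples that follow.

import numpy as np

def simpson(f, a, b):
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

def adaptive_simpson(f, a, b, tol):
    """Accept the halved estimate when whole- and half-interval Simpson agree."""
    m = (a + b) / 2
    whole = simpson(f, a, b)
    halves = simpson(f, a, m) + simpson(f, m, b)
    if abs(whole - halves) < tol:
        return halves
    return (adaptive_simpson(f, a, m, tol / 2)
            + adaptive_simpson(f, m, b, tol / 2))

print(adaptive_simpson(lambda x: 1 / (x + 2), -1.0, 1.0, 1e-6), np.log(3.0))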
Example 17 Compute
ln(3) = ∫_{−1}^{1} 1/(x + 2) dx
with error less than 10−6 using adaptive Simpson’s rule with N = 5 initial subdivi-
sions.
On each of the intervals [−1, −3/5], [−3/5, −1/5], [−1/5, 1/5], [1/5, 3/5], and [3/5, 1] we use continual halving of the steplengths until the Simpson's rule estimates agree to within 10⁻⁶/5.
On [−1, −3/5] we get the results:
N = 2 S2 = 0.336507936507936
N = 4 S4 = 0.336474636474636
N = 8 S8 = 0.336472389660795
N = 16 S16 = 0.336472246235863
On the remaining intervals, the converged values are:
[−3/5, −1/5] S16 = 0.251314430427972
[−1/5, 1/5] S8 = 0.200670706389129
[1/5, 3/5] S8 = 0.167054088990710
[3/5, 1] S8 = 0.143100845625304
and summing these, we obtain the approximation ln(3) ≈ 1.098612317668978
which has an error of about 2.9 × 10−8 .
If there were no repetitions of any function evaluations, this would have used
17 + 16 + 8 + 8 + 8 = 57 points in all.
The actual error is much smaller than the required tolerance. One reason for this
is that, as we saw in our discussion of Romberg integration, if |S N − S2N | ≈ ε,
then we expect |I − S2N | ≈ ε/15. The local tolerance requirement in Example 17
is therefore much smaller than is needed for the overall accuracy.
Efficient programming of adaptive quadrature is more complicated than the other
schemes discussed here. It has significant advantages when the integrand varies
more than in this example, but for relatively well-behaved functions such as we have
encountered here, the more easily programmed Romberg technique or the use of an
initial subdivision, as in Example 17 is probably to be preferred.
Exercises
1. Show that

S_2 = (4 T_2 − T_1)/3

and verify the result for ∫_1^2 1/(x + 2) dx.
2. Use the result of the previous exercise to prove Eq. (3.36):

S_{2N} = (4 T_{2N} − T_N)/3
3. Show that T_2 = (T_1 + M_1)/2 and therefore that

T_{2N} = (T_N + M_N)/2
4. Use Romberg integration with four levels to estimate ∫_0^2 4/(1 + x²) dx.
5. Write a program to perform Romberg integration iteratively until two diagonal entries R_{n−1,n−1} and R_{n,n} agree to within a specified accuracy. Use your program to estimate ∫_{−1}^{1} 1/(1 + x²) dx with a tolerance of 10⁻⁸.
6. Write a program which applies composite Simpson’s rule with repeated doubling
of the number of intervals in each of five initial subintervals. Verify the results of
Example 15.
7. Use adaptive Simpson’s rule quadrature with the interval being continually split
to evaluate the integral of Example 17. How do the number of function evaluations
compare?
8. Modify program for Exercise 6 to perform adaptive integration using composite
Simpson’s rule on each interval of a specified uniform partition of the original
interval into N pieces. Use it to evaluate the integral of Example 17 with N = 5,
10 and 20 initial subintervals.
9. The following question was the 2017 Society for Industrial and Applied Mathe-
matics M3 Challenge in modeling. See
https://m3challenge.siam.org/archives/2017
for more details and access to some related data.
From Sea to Shining Sea: Looking Ahead with the United States National
Park Service The National Park System of the United States comprises 417 offi-
cial units covering more than 84 million acres. The 100-year old U.S. National
Park Service (NPS) is the federal bureau within the Department of the Inte-
rior responsible for managing, protecting, and maintaining all units within the
National Park system, including national parks, monuments, seashores, and other
historical sites.
Global change factors such as climate are likely to affect both park resources and visitor experience and, as a result, the NPS's mission to "preserve unimpaired
the natural and cultural resources and values of the National Park System for the
enjoyment, education, and inspiration of this and future generations.” Your team
can provide insight and help strategize with the NPS as it starts its second century
of stewardship of United States’ park system.
3.6 Conclusions and Connections: Numerical Calculus
What have you learned in this chapter? And where does it lead?
In some respects the methods of numerical differentiation are similar to those
of numerical integration, in that they are typically based on using (in this case,
differentiating) an interpolation polynomial. We study interpolation explicitly in
Chap. 6 and have avoided specific use of interpolation in this chapter.
There is, however, one major and important difference between numerical
approaches to integration and differentiation. We saw in the previous sections that
integration is numerically a highly satisfactory operation with results of high accuracy
being obtainable in economical ways. This is because integration tends to smooth
out the errors of the polynomial approximations to the integrand. Unfortunately the
reliability and stability we find in numerical integration is certainly not reflected
for differentiation which tends to exaggerate the error in the original polynomial
approximation. Figure 3.3 illustrates this clearly.
It is worth noting that although numerical differentiation is likely to be unreliable
(especially if high accuracy is needed), its symbolic counterpart is generally fairly
easy. Computer Algebra Systems, such as Maple, need only program a small number
of basic rules in order to differentiate a wide variety of functions. This is in direct
contrast to the situation for integration. We have seen that numerical integration is
reasonably easy, but symbolic integration is very difficult to program well. This is
partly due to the fact that many elementary functions have no antiderivative that can be
written in terms of the same set of elementary functions. Even where this is possible,
there are no simple algorithms to determine which integration techniques should
be applied. This contrast illustrates at a simple level the need for mathematicians,
scientists and engineers to develop a familiarity with both numerical and symbolic
computation. The blending of the two in solving more difficult problems remains
a very active field of scientific research which is blurring the lines among science,
computer science, and pure and applied mathematics.
For the reasons cited numerical differentiation is best avoided whenever possi-
ble. We will see in subsequent sections discussing Newton’s method and the secant
method, that if the true value of the derivative is of secondary importance, then a
simple approximation to the derivative can be satisfactory. In Chap. 7, we consider
the numerical solution of differential equations. The methods there are philosophi-
cally based on approximations to derivatives but we will see that relatively simple
approaches can be powerful. This is due, at least in part, to the fact that solving a
differential equation was historically referred to as integrating said equation. Some
of the stability of numerical integration carries over to that setting.
What we have done on integration is only the beginning. Students of calculus are
well aware that multivariable integration is often much harder than for a function of
a single variable. That is certainly also true for numerical integration. Singular and
infinite integrals create difficulties of their own, as do integrals of oscillatory func-
tions. There is still active research on obtaining or improving accurate approximate
integration “rules” for all these situations and for obtaining high degree polyno-
mial precision in different circumstances. These include Gaussian quadrature, and
variations on that theme.
Python (or rather NumPy and SciPy) provide functions for some of the operations we have discussed in this chapter. For example,

>>> dx = np.diff(x)

generates the vector of differences of consecutive entries of x.
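As a brief illustration of such built-in tools (the particular calls shown are our choice, using functions available in current NumPy and SciPy releases):

import numpy as np
from scipy import integrate

x = np.linspace(0.0, np.pi, 11)
y = np.sin(x)
dx = np.diff(x)                  # differences of consecutive entries of x
dydx = np.gradient(y, x)         # finite-difference estimate of dy/dx at the nodes

val, err_est = integrate.quad(lambda t: 1 / (t + 2), -1.0, 1.0)  # adaptive quadrature
print(val, np.log(3.0))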
Linear Equations
4
4.1 Introduction
The notion of solving systems of linear equations was likely introduced to you in
secondary school. For example, you may have encountered problems like this; All
200 students in your math class went on vacation. Some students rode in vans which
hold 8 students each and some students rode in buses which hold 25 students each.
How many of each type of vehicle did they use if there were 15 vehicles total? If
x is the number of vans and y is the number of buses we have the following two
equations and two unknowns:
8x + 25y = 200
x + y = 15.
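As an aside, such a small system can of course be solved in one line with NumPy (a sketch, not part of the original example):

import numpy as np

A = np.array([[8.0, 25.0],
              [1.0,  1.0]])
b = np.array([200.0, 15.0])
print(np.linalg.solve(A, b))   # x = number of vans, y = number of buses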
There are numerous situations that require the solution to a linear system of
equations. We already saw this in the context of numerical integration when using
quadrature schemes. We will see this later in the study of interpolation, whether by
polynomials or by splines, and again in some finite difference methods for approx-
imate derivatives. In the cases of cubic spline interpolation and finite difference
methods, the resulting matrix of coefficients is tridiagonal and we will see that algo-
rithms can be designed to exploit matrix structure and improve efficiency.
In this section we give an introduction to a deep topic and consider the solution
of a square system of equations with a single right-hand side. These systems are
expressed as
Ax = b (4.1)
where x, b are n-vectors and A is an n × n real matrix. In full, this system would be
written as
⎡ a_11  a_12  a_13  ···  a_1n ⎤ ⎡ x_1 ⎤   ⎡ b_1 ⎤
⎢ a_21  a_22  a_23  ···  a_2n ⎥ ⎢ x_2 ⎥   ⎢ b_2 ⎥
⎢ a_31  a_32  a_33  ···  a_3n ⎥ ⎢ x_3 ⎥ = ⎢ b_3 ⎥    (4.2)
⎢  ⋮     ⋮     ⋮    ⋱     ⋮  ⎥ ⎢  ⋮  ⎥   ⎢  ⋮  ⎥
⎣ a_n1  a_n2  a_n3  ···  a_nn ⎦ ⎣ x_n ⎦   ⎣ b_n ⎦
or even, writing each equation fully,
Example 1 Suppose you are asked to design the first ascent and drop of a roller
coaster. By studying other rides, you decide that the slope of the ascent should be
0.8 and the slope of the drop should be −1.6. You decide to connect these two line
segments L 1 (x) and L 2 (x) with a quadratic function, f (x) = ax 2 + bx + c. Here
x is measured in feet. For the track to be smooth there cannot be any abrupt changes
so you want the linear segments to be tangent to the quadratic part at the transition
points P and Q which are 100 feet apart in the horizontal direction. To simplify
things, you place P at the origin. Write the equations that will ensure your roller
coaster has smooth transitions and express this in matrix form.
A picture often helps. Consider Fig. 4.1. We need to find a, b, c, and d so that the continuity conditions hold, that is f(0) = L_1(0) and f(100) = L_2(100). Then, the smoothness conditions require f'(0) = L_1'(0) and f'(100) = L_2'(100). This gives

[Fig. 4.1: the two linear segments, with L_2(x) = −1.6x + d, joined by the quadratic f(x) between P = 0 and Q = 100]

a(0²) + b(0) + c = 0
a(100²) + b(100) + c = −1.6(100) + d
2a(0) + b = 0.8
2a(100) + b = −1.6
Rearranging terms we can express these four equations in matrix form as

⎡  0    0   1   0 ⎤ ⎡ a ⎤   ⎡   0   ⎤
⎢ 10⁴  10²  1  −1 ⎥ ⎢ b ⎥ = ⎢ −160  ⎥
⎢  0    1   0   0 ⎥ ⎢ c ⎥   ⎢  0.8  ⎥
⎣ 200   1   0   0 ⎦ ⎣ d ⎦   ⎣ −1.6  ⎦
Note, early in the model development phase, it was easy to see that c = 0 (from the
first continuity condition) and b = 0.8 (from the first smoothness condition). From
there it is straightforward to substitute these values and solve for a = −0.012 and
d = 120.
Another important problem of linear algebra arising frequently in applications is
the eigenvalue problem, which requires the solution of the equation
Ax = λx    (4.4)
for the eigenvalues λ and their associated eigenvectors x. There are many techniques
available for this problem and we introduce one iterative method later as well as an
extension using our foundational linear equations techniques.
The ideas presented here are foundational and many variations arise. In practice,
the same linear system may need to be solved but with multiple right-hand sides.
That situation can be handled using the same methods presented in the next few
sections with only slight modifications.
The idea of Gaussian elimination is not new to you (think about the simple van-bus
example above). However, the approach taken now is to streamline the procedure as a
systematic process that can be written as an algorithm. Then, we discuss performance
issues and improvements.
The first step in the Gauss elimination process is to eliminate the unknown x1
from every equation in (4.3) except the first. The way this is achieved is to subtract
from each subsequent equation an appropriate multiple of the first equation. Since
the coefficient of x1 in the first equation is a11 and that in the jth equation is a j1 ,
a j1
it follows that subtracting m = times the first equation from the jth one will
a11
result in the coefficient of x1 giving
a j1
a j1 − ma11 = a j1 − a11 = 0
a11
for j = 2, 3, . . . , n
    m := a_j1 / a_11
    for k = 1, 2, . . . , n
        a_jk := a_jk − m a_1k
    end
    b_j := b_j − m b_1
end
In the general algorithm, and in all that follows, no special notation will be introduced for the updated entries; elements of the matrix and components of the right-hand side will just be overwritten by their new values. With this convention the modified rows of (4.6) represent the system
⎡ a_22  a_23  ···  a_2n ⎤ ⎡ x_2 ⎤   ⎡ b_2 ⎤
⎢ a_32  a_33  ···  a_3n ⎥ ⎢ x_3 ⎥ = ⎢ b_3 ⎥    (4.8)
⎢  ⋮     ⋮    ⋱     ⋮  ⎥ ⎢  ⋮  ⎥   ⎢  ⋮  ⎥
⎣ a_n2  a_n3  ···  a_nn ⎦ ⎣ x_n ⎦   ⎣ b_n ⎦
The procedure can be repeated for (4.8) but next to eliminate x2 and so on until what
remains is a triangular system
⎡ a_11  a_12  a_13  ···  a_1n ⎤ ⎡ x_1 ⎤   ⎡ b_1 ⎤
⎢  0    a_22  a_23  ···  a_2n ⎥ ⎢ x_2 ⎥   ⎢ b_2 ⎥
⎢  0     0    a_33  ···  a_3n ⎥ ⎢ x_3 ⎥ = ⎢ b_3 ⎥    (4.9)
⎢  ⋮     ⋮     ⋮    ⋱     ⋮  ⎥ ⎢  ⋮  ⎥   ⎢  ⋮  ⎥
⎣  0     0     0    ···  a_nn ⎦ ⎣ x_n ⎦   ⎣ b_n ⎦
The first step of the elimination is to subtract multiples of the first row from each
of the other rows. The appropriate multipliers to use for the second and third rows
are 3/1 = 3 and 5/1 = 5 respectively. The resulting system is
⎡ 1   3    5  ⎤ ⎡ x ⎤   ⎡  9  ⎤
⎢ 0  −4  −10 ⎥ ⎢ y ⎥ = ⎢ −14 ⎥
⎣ 0 −10  −20 ⎦ ⎣ z ⎦   ⎣ −30 ⎦
Next (−10) / (−4) = 5/2 of the second row must be subtracted from the third
one in order to eliminate y from the third equation:
⎡ 1   3    5 ⎤ ⎡ x ⎤   ⎡  9  ⎤
⎢ 0  −4  −10 ⎥ ⎢ y ⎥ = ⎢ −14 ⎥
⎣ 0   0    5 ⎦ ⎣ z ⎦   ⎣  5  ⎦
From the third equation, z = 1; substituting this into the second, −4y − 10(1) = −14, and hence y = 1. Finally substituting both these into the first equation, we get x + 3(1) + 5(1) = 9, so that x = 1.
⎡ 7  −7    1    ⋮  1    ⎤
⎢ 0   0   17/7  ⋮  17/7 ⎥
⎣ 0  14   −8    ⋮  6    ⎦
but at the next step we must subtract 14/0 times the second row from the third!
By looking at the partitioned system above, the second row implies that z = 1 and
then this could be used in the third equation to find y and so on. However, one must
consider how to move forward within an algorithm. The issue in Example 3 could
have been addressed simply by exchanging the second and third rows, to obtain the
revised system
⎡ 7  −7    1    ⋮  1    ⎤
⎢ 0  14   −8    ⋮  6    ⎥
⎣ 0   0   17/7  ⋮  17/7 ⎦
which is now a triangular system. The back substitution algorithm can now be used
to complete the solution: [x, y, z]T = [1, 1, 1]T .
Example 3 demonstrates the need for pivoting, which is standard to avoid dividing
by zero.
At each step of the basic Gauss elimination algorithm the diagonal entry that is being
used to generate the multipliers is known as the pivot element. In Example 3, we saw
the effect of a zero pivot. The basic idea of partial pivoting is to avoid zero (or near-
zero) pivot elements by performing row interchanges. The details will be discussed
shortly, but first we revisit Example 3 to highlight the need to avoid near-zero pivots.
Example 4 Solve the system of Example 3 using Gauss elimination with four decimal
place arithmetic.
At the first step, the multiplier used for the second row is (−3) /7 = 0.4286 to
four decimals. The resulting system is then
⎡ 7  −7       1       ⋮  1      ⎤
⎢ 0  −0.0002  2.4286  ⋮  2.4286 ⎥
⎣ 0  14      −8       ⋮  6      ⎦
The multiplier used for the next step is then 14/(−0.0002) = −70000 and the final
entries are
⎡ 7  −7       1       ⋮  1      ⎤
⎢ 0  −0.0002  2.4286  ⋮  2.4286 ⎥
⎣ 0   0       169994  ⋮  170008 ⎦
z = 170008/(169994) = 1.0001
y = (2.4286 − (2.4286) (1.0001)) /(−0.0002) = 1.2143
x = (1 − ((−7) (1.2143) + (1) (1.0001)))/7 = 1.2143
Four decimal place arithmetic has left the results not even accurate to one decimal
place.
The problem in Example 4 is that the second pivot entry is dominated by its
roundoff error (in this case, it is its roundoff error). This results in an unfortunately
large multiplier being used which magnifies the roundoff errors in the other entries
to the point where the accuracy of the final result is affected.
The basic idea of partial pivoting is to search the pivot column to find its largest ele-
ment (in absolute value) on or below the diagonal. The pivot row is then interchanged
with this row so that this largest element becomes the pivot for this step.
>>> pmax
2
Here 8 is indeed the maximum entry in the appropriate part of the vector [0, 2, 8, 6, 4]
and it is the third component of this vector. To get its position in the original vector
we simply add 5 to account for the 5 elements that were omitted from our search.
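The search itself can be done with NumPy's argmax. The following is a small illustrative sketch (the vector v is hypothetical; only its last five entries are chosen to match the values quoted above):

import numpy as np

v = np.array([3, 1, 7, 5, 9, 0, 2, 8, 6, 4])   # hypothetical data; last five entries are [0, 2, 8, 6, 4]
pmax = np.argmax(np.abs(v[5:]))                # search only the last five entries: gives 2
row = 5 + pmax                                 # position of the pivot in the original vector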
def gepp(A, b):
    """
    Solve Ax = b using Gauss elimination with
    partial pivoting for the square matrix `A` and
    vector `b`.
    """
    # Avoid altering the passed-in arrays
    A = A.copy().astype(float)
    b = b.copy().astype(float)
    n = len(b)
    for i in range(n - 1):
        # bring the largest remaining pivot (in absolute value) up to the diagonal
        p = i + np.argmax(np.abs(A[i:, i]))
        A[[i, p]], b[[i, p]] = A[[p, i]], b[[p, i]]
        m = A[i+1:, i] / A[i, i]                # multipliers for this column
        A[i+1:, i:] -= np.outer(m, A[i, i:])
        b[i+1:] -= m * b[i]
    x = np.zeros(n)
    # back substitution
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x
Note that it is not strictly necessary to perform the row interchanges explicitly.
A permutation vector could be stored which carries the information as to the order
in which the rows were used as pivot rows. For our current purpose, the advantage
of such an approach is outweighed by the additional detail that is needed. NumPy’s
vector operations make the explicit row interchanges easy and fast.
To test our algorithm, we shall use the Hilbert matrix which is defined for dimen-
sion n as
          [  1       1/2       1/3     ···     1/n     ]
          [ 1/2      1/3       1/4     ···   1/(n+1)   ]
    H_n = [ 1/3      1/4       1/5     ···   1/(n+2)   ]
          [  ⋮        ⋮         ⋮       ⋱       ⋮      ]
          [ 1/n   1/(n+1)   1/(n+2)    ···   1/(2n−1)  ]
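The hilbert function used in the example below is not a NumPy built-in; it could come from scipy.linalg.hilbert, or be built directly with a small sketch such as the following (an illustrative construction, not the book's listing):

import numpy as np

def hilbert(n):
    # entries h_ij = 1/(i + j - 1) for 1-based indices i, j
    i, j = np.indices((n, n))
    return 1.0 / (i + j + 1)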
Example 6 Solve the Hilbert matrix system of dimension 6 using Gauss elimination
with partial pivoting.
H = hilbert(6)
rsH = np.sum(H, axis=1)
x = gepp(H, rsH)
np.set_printoptions(precision=4)
>>> print(x)
[ 1.  1.  1.  1.  1.  1.]
At this level of accuracy, the results confirm that our algorithm is working. However
if the solution is displayed again using higher precision we get
np.set_printoptions(precision=15)
>>> print(x)
[ 0.999999999999072  1.000000000026699
  0.999999999818265  1.000000000474873
  0.999999999473962  1.000000000207872]
which suggests that there is considerable build-up of roundoff error in this solution.
This loss of accuracy is due to the fact that the Hilbert matrix is a well-known
example of an ill-conditioned matrix. An ill-conditioned system is one where the
accumulation of roundoff error can be severe and/or the solution is highly sensitive
to small changes in the data. In the case of Hilbert matrices, the severity of the
ill-conditioning increases rapidly with the dimension of the matrix.
For the corresponding 10 × 10 Hilbert matrix system to that used in Example 6,
the “solution vector” obtained using the Python code above was
[1.0000, 1.0000, 1.0000, 1.0000, 0.9999, 1.0003, 0.9995, 1.0005, 0.9998, 1.0001]
showing significant errors – especially bearing in mind that Python is working inter-
nally with about 15 significant (decimal) figures.
Why do we believe these errors are due to this notion of ill-conditioning rather
than something inherent in the program?
The NumPy random module can be used to generate random matrices. Using the same approach as above we can choose the right-hand side
vector to force the exact solution to be a vector of ones. The computed solution can
then be compared with the exact one to get information about the performance of
the algorithm. The commands listed below were used to repeat this experiment with
one hundred different matrices of each dimension from 4 × 4 up to 10 × 10.
np.random.seed(42)
E = np.zeros((100, 7))
for n in range(4, 11):
    for k in range(100):
        A = np.random.uniform(-5, 5, size=(n, n))
        b = np.sum(A, axis=1)
        x = gepp(A, b)
        E[k, n - 4] = np.max(np.abs(x - 1))
Each trial generates a random matrix with entries in the interval [−5, 5] . The matrix
E contains the largest individual component-error for each trial. Its largest entry is
therefore the worst single component error in the complete set of 700 trials:
>>> np.max(E)
4.99822405686e-13

This output shows that none of these randomly generated test examples produced an error that was at all large. Of course, some of these random matrices could be
somewhat ill-conditioned themselves. The command
>>> np.sum(E > 1e-14)
43
counts the number of trials for which the worst error was greater than 10^{−14}. There were just 43 such cases out of the 700.
Note that NumPy’s sum function sums all entries of a matrix when no axis argu-
ment is given. Therefore the first np.sum in the command above generates a count
of the number of cases where E > 10−14 across both dimensions. Gauss elimination
with partial pivoting appears to be a highly successful technique for solving linear
systems.
We will revisit the heat equation example from Chap. 3 as an example of how
tridiagonal matrices arise in practice. We will also encounter these in the context of
cubic spline interpolation later. The tridiagonal algorithm presented here is just one
example of how matrix structure can lead to improved efficiency.
A tridiagonal matrix has its only nonzero entries on its main diagonal and immedi-
ately above and below this diagonal. This implies the matrix can be stored simply by
storing three vectors representing these diagonals. Let a generic tridiagonal system
be denoted as
    [ a_1  b_1                               ] [ x_1     ]   [ r_1     ]
    [ c_1  a_2  b_2                          ] [ x_2     ]   [ r_2     ]
    [      c_2  a_3  b_3                     ] [ x_3     ]   [ r_3     ]
    [            ⋱    ⋱    ⋱                 ] [  ⋮      ] = [  ⋮      ]        (4.10)
    [             c_{n−2}  a_{n−1}  b_{n−1}  ] [ x_{n−1} ]   [ r_{n−1} ]
    [                      c_{n−1}  a_n      ] [ x_n     ]   [ r_n     ]
Example 7 Steady state heat equation in one dimension: Consider a simplified version of the heat model on a wire (Eq. 3.1) but set u_t = 0 (i.e. the temperature along the wire is no longer changing in time). This example demonstrates how to express −K u_{xx} = f(x) as a linear system. To see this, we will focus on the mathematics itself and not necessarily worry about the physical parameters. To this end, assume
the temperature at the boundary is fixed at 0 ◦ C and f (x) = 1 ◦ C for all the interior
values of [0, 1] and let K = 1. Using numerical derivatives, approximate the tem-
peratures at five equally spaced points along the interior of the wire by setting up
and solving a linear system.
u5 = 0.5/1.2 = 0.4167
u4 = (0.4167 + 0.4167) /1.25 = 0.6667
u3 = (0.3333 + 0.6667) /1.3333 = 0.7500
u2 = (0.25 + 0.7500) /1.5 = 0.6667
u1 = (0.1667 + 0.6667) /2 = 0.4167
Although plotting this solution is crude for this small number of points, you should
consider doing so and then ask yourself why the shape of this temperature distribution
makes sense.
The important difference between this algorithm and Algorithm 3 is that it requires a much smaller amount of computation. For a square system, we stated earlier that approximately n³/3 multiplications and n²/2 divisions are needed. For a tridiagonal system, using Algorithm 6 reduces these counts to just 3n and 2n respectively.
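Algorithm 6 itself is not reproduced here, but a minimal sketch of a tridiagonal solver along these lines, using the diagonal names a, b, c and right-hand side r from (4.10) (an illustrative sketch, not the book's listing), might be:

import numpy as np

def tridiag_solve(a, b, c, r):
    # a: main diagonal, b: superdiagonal, c: subdiagonal, r: right-hand side
    n = len(a)
    a = a.astype(float)                  # astype makes copies, so the inputs are untouched
    r = r.astype(float)
    for i in range(1, n):                # forward elimination: one multiplier per row
        m = c[i - 1] / a[i - 1]
        a[i] -= m * b[i - 1]
        r[i] -= m * r[i - 1]
    x = np.zeros(n)                      # back substitution
    x[-1] = r[-1] / a[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - b[i] * x[i + 1]) / a[i]
    return x

Counting operations in this sketch, the elimination costs about two multiplications and one division per row and the back substitution about one of each, which is consistent with the 3n multiplications and 2n divisions quoted above.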
Exercises
G(x) = kx^3 + lx^2 + mx + n,   0 ≤ x < 10
(a) Write a new system of equations that ensures the coaster feels smooth at the
transition points.
(b) Express this new system in the form Au = w.
(c) Solve the resulting system and provide a plot for the roller coaster.
10. We now revisit the image distortion example from the numerical calculus chapter
which was modeled with
u_t − K u_{xx} − K u_{yy} = f(x, y, t).
You (may now) know from the chapter on numerical derivatives that explicit
methods can be highly unstable and care must be taken when choosing the time-
step size and spatial discretization. Implicit methods can help overcome this but
require a linear solve at each time step. The backward Euler approximation is
one such approach and this will be covered in depth soon! For now, we can implement that method by simply treating all the values on the right-hand side as evaluated at the time k + 1 instead (we will derive this as a numerical approach to differential equations later). This means we have unknowns on both sides of the equation and, with some rearranging, we can arrive at a linear system of the form Au^{k+1} = b.
(a) To check your understanding. Starting with Eq. (3.11), re-write the right
hand side with all values at k+1 in time. Now re-arrange terms so that your
unknowns are on the left and your knowns are on the right.
(b) How do you now express this as a linear system? You need to translate
your unknowns on the spatial grid into an ordered vector so that you can
determine the entries in the matrix A. For consistency, start at the bottom
and move left to right in x and then up to the next y value, continuing left to
right.
For example, a 3 × 3 grid of unknowns at some fixed time is shown in Fig. 4.2.
What does the resulting linear system look like for the 2-D diffusion equation
at each time step for this 3 × 3 example above? What is b? Generalize this
to any size grid. How has the matrix structure changed in comparison to the
1-D case?
(c) Outline, in pseudo-code, how you would implement solving this problem
using the implicit method in time given an initial condition and boundary
conditions of zero along the edge of the domain? You do not need to imple-
ment the method, but you need to outline a complete method to evolve the
solution in time.
Fig. 4.2 The unknowns u_{i,j}^{k+1} must be mapped to a vector of unknowns. Consider a fixed snapshot in time (so ignore the k superscripts for now). Starting at the bottom, we move from left to right and assign each grid point to a location in the vector. Note this assignment (or ordering of your unknowns) will impact the structure of the matrix
(d) How is the number of flops required different than the explicit case?
11. Consider solving the Bessel equation x²y″ + xy′ + (x² − 1)y = 0 subject to the boundary conditions y(1) = 1, y(15) = 0 using numerical derivatives with N = 280 steps between [1, 15]. You should begin by rewriting the differential equation in the form

y″ + (1/x) y′ + (1 − 1/x²) y = 0
Next, use finite difference approximations to the derivatives and collect like
terms to obtain a tridiagonal linear system for the unknown discrete values of y.
12. Solve y″ + 2xy′ + 2y = 3e^{−x} − 2xe^{−x} subject to y(−1) = e + 3/e, y(1) = 4/e using the second-order finite difference method as in the above problem. Consider N = 10, 20, 40 in [−1, 1]. Plot these solutions and the true solution y = e^{−x} + 3e^{−x²}. Also, on a separate set of axes, plot the three error curves. Estimate the order of the truncation error for this method. Is it what you would expect?
4.3 LU Factorization and Applications

We now consider a way to reduce computational effort when having to solve multiple
linear systems that have the same coefficient matrix, but different right hand side
vectors. For example, consider solving Ax = b and then having to solve Az = c.
Gauss elimination would repeat the same computations on the entries of A the second time, but those operations would still need to be applied to the new right-hand side vector c. This scenario is
fairly common in practice. We demonstrate one example of how it arises in modeling
below before giving the details of how to avoid doing the work on A more than once.
Specifically, this situation can arise when using an implicit discretization for
the time derivative in a linear partial differential equation model (which are the
underlying models in many industrial, large-scale simulation tools). A benefit is that
stability is improved but the new formulation requires a linear solve at each time step.
The backward Euler approximation is one such approach, which is implemented by
simply considering all the values at the next time step. We demonstrate this on the
heat equation.
Example 8 Consider the time dependent 1-D heat equation from Example 1 in
Chap. 3 with an implicit time scheme and express the problem as a linear system.
ρc u_t − K u_{xx} = f(x, t)

with the boundary conditions u(0, t) = u(L, t) = 0 °C and initial temperature u_0(x) = u(x, 0). The idea is to evaluate all the terms in the discretized expressions for u_{xx} and f at k + 1:

ρc (u_i^{k+1} − u_i^k)/Δt − K (u_{i−1}^{k+1} − 2u_i^{k+1} + u_{i+1}^{k+1})/Δx² = f_i^{k+1}.        (4.12)
Collecting the unknown values at time level k + 1 on the left-hand side gives, for five interior nodes, the tridiagonal system

    [ 1+2H    −H                          ] [ u_1^{k+1} ]   [ u_1^k + (Δt/ρc) f_1^{k+1} ]
    [  −H    1+2H    −H                   ] [ u_2^{k+1} ]   [ u_2^k + (Δt/ρc) f_2^{k+1} ]
    [         −H    1+2H    −H            ] [ u_3^{k+1} ] = [ u_3^k + (Δt/ρc) f_3^{k+1} ]
    [                −H    1+2H    −H     ] [ u_4^{k+1} ]   [ u_4^k + (Δt/ρc) f_4^{k+1} ]
    [                       −H    1+2H    ] [ u_5^{k+1} ]   [ u_5^k + (Δt/ρc) f_5^{k+1} ]

where H = KΔt/(ρcΔx²). So, if u^0 is our discrete initial condition vector we could define a
right hand side vector, and solve Au^1 = b to get the vector of discrete approximations u^1 to u(x, t_1) at the nodes. Then the process would be repeated with u^1 to define a new right hand side vector b to solve for u^2, but of course using the same matrix A. This would be repeated until we reached the final time. Performing Gauss elimination at each time step would be roughly O(N³) work each time if we have N interior nodes.
It has already been mentioned that Gauss elimination could be used to solve several
systems with the same coefficient matrix but different right-hand sides simultane-
ously. Essentially the only change is that the operations performed on the elements
of the right-hand side vector must now be applied to the rows of the right-hand side
matrix. Practically, it makes sense to avoid repeating all the work of the original
solution in order to solve the second system. This can be achieved by keeping track
of the multipliers used in the Gauss elimination process the first time. This leads us
to what is called LU factorization of the matrix A in which we obtain two matrices
L, a lower triangular matrix, and U , an upper triangular matrix, with the property
that
A = LU (4.13)
This process requires no more than careful storage of the multipliers used in Gauss
elimination. To see this, consider the first step of the Gauss elimination phase where
we generate the matrix
          [ a_11  a_12  a_13  ···  a_1n ]
          [  0    a_22  a_23  ···  a_2n ]
A^{(1)} = [  0    a_32  a_33  ···  a_3n ]
          [  ⋮     ⋮     ⋮     ⋱    ⋮   ]
          [  0    a_n2  a_n3  ···  a_nn ]

in which

a_jk = a_jk − (a_j1 / a_11) a_1k = a_jk − m_{j1} a_1k
See (4.6) and (4.7). Next, use the multipliers to define
      [  1     0   ···  ···  0 ]
      [ m_21   1    0   ···  0 ]
M_1 = [ m_31   0    1        ⋮ ]        (4.14)
      [  ⋮     ⋮         ⋱     ]
      [ m_n1   0    ···  0   1 ]
Multiplying any matrix by M_1 results in adding m_{j1} times the first row to the jth row. Therefore, multiplying A^{(1)} by M_1 would exactly reverse the elimination that has been performed, in other words

M_1 A^{(1)} = A        (4.15)
A similar equation holds for each step of the elimination phase. It follows that, denoting the final upper triangular matrix produced by Algorithm 3 by U and defining L as the unit lower triangular matrix whose subdiagonal entries are the multipliers m_{jk}, we obtain

LU = A

as desired.
As an algorithm this corresponds to simply overwriting the subdiagonal parts of A with the multipliers so that both the upper and lower triangles can be stored in the same local (matrix) variable. The entire process is shown in the algorithm below.
Input   n × n matrix A
Factorization
    for i=1:n-1
        for j=i+1:n
            a_ji := a_ji / a_ii;
            for k=i+1:n
                a_jk := a_jk − a_ji a_ik
            end
        end
    end
Output  modified matrix A containing upper and lower triangular factors
        L is the lower triangle of A with unit diagonal elements
        U is the upper triangle of A
Note that this algorithm is really the same as Algorithm 3 except that the multipliers
are stored in the lower triangle and the operations on the right-hand side of our linear
system have been omitted. Using the LU factorization to solve Ax = b can be done
in three steps: factor A = LU, solve Lz = b by forward substitution, and then solve Ux = z by back substitution.
Suppose x̃ is a computed solution of Ax = b. The components r_k = b_k − (Ax̃)_k of the residual vector

r = b − Ax̃        (4.17)

for k = 1, 2, . . . , n provide a measure (although not always a very good one) of the extent to which we have failed to satisfy the equations. If we could also solve the
system
Ay = r (4.18)
then, by adding this solution y to the original computed solution x̃, we obtain A(x̃ + y) = Ax̃ + Ay = (b − r) + r = b.
In other words, if we can compute r and solve (4.18) exactly, then x̃ + y is the exact
solution of the original system. Of course, in practice, this is not possible but we
would hope that x̃ + ỹ will be closer to the true solution. (Here ỹ represents the
computed solution of (4.18).) This suggests a possible iterative method, known as
iterative refinement, for solving a system of linear equations.
Unfortunately, however, we do not know the residual vector at the time of the
original solution. That is to say that (4.18) must be solved after the solution x̃ has
been obtained. So what is the advantage? Consider the iterative refinement process that we discussed above. The initial solution x̃ is computed using O(n³) floating-point operations. Then the residual vector is computed as
r = b−Ax̃
after which the correction y is obtained by solving the two triangular systems

Lz = r
Uy = z

and setting

x = x̃ + y
The important difference is that the factorization does not need to be repeated if we
make use of the LU factorization.
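As an illustrative sketch of this idea (hypothetical function names, no pivoting, and assuming NumPy arrays rather than following the book's listings verbatim), the factorization and the two triangular solves might look like:

import numpy as np

def lu_inplace(A):
    # Overwrite a copy of A with its LU factors: multipliers below the
    # diagonal (L has a unit diagonal) and U on and above the diagonal.
    A = A.astype(float)                          # astype makes a copy
    n = A.shape[0]
    for i in range(n - 1):
        A[i+1:, i] /= A[i, i]                    # store the multipliers
        A[i+1:, i+1:] -= np.outer(A[i+1:, i], A[i, i+1:])
    return A

def lu_solve_combined(LU, b):
    # Solve Ax = b given the combined LU factors of A.
    n = len(b)
    z = b.astype(float)
    for i in range(1, n):                        # forward substitution: Lz = b
        z[i] -= LU[i, :i] @ z[:i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):               # back substitution: Ux = z
        x[i] = (z[i] - LU[i, i+1:] @ x[i+1:]) / LU[i, i]
    return x

With the factors stored, each refinement step only requires forming r = b − Ax̃ and repeating the triangular solves, roughly O(n²) work, instead of repeating the O(n³) factorization.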
First we note that this is the same system that we used in Example 4. From the
multipliers used there, we obtain the factorization
    [  7  −7   1 ]   [    1        0      0 ] [ 7     −7        1     ]
    [ −3   3   2 ] = [ −0.4286     1      0 ] [ 0  −0.0002   2.4286   ]
    [  7   7  −7 ]   [    1     −70000    1 ] [ 0      0     169994   ]
Solving Lz = [1, 2, 7] we obtain z = [1, 2.4286, 170008] which is of course the
right-hand side at the end of the elimination phase in Example 4. Now solving
Ux = z, we get the previously computed solution

x̃ = [1.2143, 1.2143, 1.0001]^T.
P LU = A
>>> print(L)
[[ 1.0000e+00  0.0000e+00  0.0000e+00]
 [ 1.0000e+00  1.0000e+00  0.0000e+00]
 [-4.2857e-01  1.1895e-17  1.0000e+00]]
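A printout like this can be obtained from SciPy's factorization routine; a small sketch (assuming SciPy is available) is:

import numpy as np
from scipy.linalg import lu

A = np.array([[7., -7., 1.],
              [-3., 3., 2.],
              [7., 7., -7.]])
P, L, U = lu(A)        # P @ L @ U reproduces A (up to roundoff)
print(L)

For this matrix the permutation exchanges the second and third rows, which is why the multiplier −0.4286 appears in the last row of L and the (3, 2) entry is a numerical zero.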
Exercises
1. Obtain LU factors of
        [ 7  8  8 ]
    A = [ 6  5  4 ]
        [ 1  2  3 ]
using four decimal place arithmetic. Use these factors to solve the following
system of equations
      [ x ]   [ 7 ]
    A [ y ] = [ 5 ]
      [ z ]   [ 2 ]
with iterative refinement.
2. Write a program to perform forward substitution. Test it by solving the following
system
    [ 1              ] [ x_1 ]   [ 0.1 ]
    [ 1  1           ] [ x_2 ]   [ 0.4 ]
    [ 2  3  1        ] [ x_3 ] = [ 1.6 ]
    [ 4  5  6  1     ] [ x_4 ]   [ 5.6 ]
    [ 7  8  9  0  1  ] [ x_5 ]   [ 8.5 ]
3. Write a program to solve a square system of equations using the built-in LU
factorization function and your forward and back substitution programs. Use
these to solve the Hilbert systems of dimensions 6 through 10 with solution
vectors all ones.
4. For the solutions of Exercise 3, test whether iterative refinement offers any
improvement.
5. Investigate what a Vandermonde matrix is (and we will see more about this in Chap. 6). Solve the 10 × 10 Vandermonde system Vx = b where v_{ij} = i^{j−1} and b_i = (i/4)^{10} using LU factorization.
6. Implement the model problem from Example 8, using the implicit discretization
of the heat equation. Compare the computational effort of using Gauss elimination
inside the time stepping loop compared to using the LU factorization calculated
outside the time loop. Consider using a right hand side of all ones and experiment
with different choices for Δt, Δx and final times. As your final time gets larger,
what would you expect the shape of your solution to look like?
4.4 Iterative Methods

The tridiagonal algorithm presented in Sect. 4.2 showed that the computational cost
in solving Ax = b could be reduced if the underlying matrix had a known structure.
If the dimension of the system is large and if the system is sparse (meaning A has
many zero elements) then iterative methods can be attractive over direct methods.
Two of these iterative methods are presented in this section. These two form the basis
of a family of methods which are designed either to accelerate the convergence or to
suit some particular computer architecture.
The methods can be viewed in terms of splitting A into a sum of its parts. Specifically, A = L + U + D, where L and U are the strictly lower and upper triangular parts of A (not to be confused with the LU decomposition), and D is a diagonal matrix with the diagonal of A on its diagonal. Specifically,

l_ij = a_ij if i > j, and 0 otherwise
d_ii = a_ii, and d_ij = 0 for i ≠ j
u_ij = a_ij if i < j, and 0 otherwise
The system Ax = b can then be rewritten componentwise as

x_i = (b_i − Σ_{j≠i} a_ij x_j) / a_ii,   i = 1, 2, . . . , n        (4.19)

This is actually the result of solving the ith equation for x_i in terms of the remaining
unknowns. There is an implicit assumption here that all diagonal elements of A are
nonzero. (It is always possible to reorder the equations of a nonsingular system to
ensure this condition is satisfied.) This rearrangement of the original system lends
itself to an iterative treatment.
For the Jacobi iteration, we generate the next estimated solution vector from the
current one by substituting the current component values in the right-hand side of
(4.19) to obtain the next iterates. In matrix terms, we set
x^{(k+1)} = D^{−1} (b − (L + U) x^{(k)})        (4.20)

where the superscript here represents an iteration counter so that x^{(k)} is the kth (vector) iterate. In component terms, we have

x_1^{(k+1)} = (b_1 − (a_12 x_2^{(k)} + a_13 x_3^{(k)} + ··· + a_1n x_n^{(k)})) / a_11
x_2^{(k+1)} = (b_2 − (a_21 x_1^{(k)} + a_23 x_3^{(k)} + ··· + a_2n x_n^{(k)})) / a_22
···                                                                            (4.21)
x_n^{(k+1)} = (b_n − (a_n1 x_1^{(k)} + a_n2 x_2^{(k)} + ··· + a_{n,n−1} x_{n−1}^{(k)})) / a_nn
Example 10 Perform the first three Jacobi iterations for the solution of the system, written with each equation solved for its diagonal unknown:

x_1 = (26 − 3x_2 − 2x_3) / 6
x_2 = (17 − 2x_1 − x_3) / 5
x_3 = (9 − x_1 − x_2) / 4

Starting from the initial guess x^{(0)} = [0, 0, 0]^T, the first iteration gives

x_1^{(1)} = 26/6 = 4.3333
x_2^{(1)} = 17/5 = 3.4000
x_3^{(1)} = 9/4 = 2.2500
The next two iterations yield

x_1^{(2)} = 1.8833, x_2^{(2)} = 1.2167, x_3^{(2)} = 0.3167

and

x_1^{(3)} = 3.6194, x_2^{(3)} = 2.5833, x_3^{(3)} = 1.4750

which are (slowly) approaching the true solution [3, 2, 1]^T. In fact, eventually, x^{(9)} = [3.0498, 2.0420, 1.0353]^T.
Clearly for this small linear system, an accurate solution with less effort could be
found through a direct solve. The purpose of the example is simply to illustrate how
the iteration works.
Stepping through the algorithm by hand may actually have brought a question to mind. Once x_1^{(k+1)} has been obtained in the first of equations (4.21), why not use this most recent value in place of x_1^{(k)} in the remainder of the updates in (4.21)? That's a great idea – the result is the Gauss–Seidel iteration:
x_1^{(k+1)} = (b_1 − (a_12 x_2^{(k)} + a_13 x_3^{(k)} + ··· + a_1n x_n^{(k)})) / a_11
x_2^{(k+1)} = (b_2 − (a_21 x_1^{(k+1)} + a_23 x_3^{(k)} + ··· + a_2n x_n^{(k)})) / a_22
···                                                                            (4.22)
x_n^{(k+1)} = (b_n − (a_n1 x_1^{(k+1)} + a_n2 x_2^{(k+1)} + ··· + a_{n,n−1} x_{n−1}^{(k+1)})) / a_nn
In matrix terms this is

x^{(k+1)} = D^{−1} (b − L x^{(k+1)} − U x^{(k)})        (4.23)

Note that (4.23) should be understood as an assignment of values from the right-hand side to the left where the entries of x^{(k+1)} are updated sequentially in their natural order.
Example 11 Repeat Example 10 using the Gauss–Seidel iteration.

The first iteration gives

x_1^{(1)} = 26/6 = 4.3333
x_2^{(1)} = (17 − 2(4.3333)) / 5 = 1.6667
x_3^{(1)} = (9 − 4.3333 − 1.6667) / 4 = 0.7500
The next two iterations then produce the estimates

x^{(2)} = [3.2500, 1.9500, 0.9500]^T  and  x^{(3)} = [3.0417, 1.9933, 0.9913]^T

which are much closer to the true solution than the Jacobi iterates for the same computational effort.
Either of the iterations (4.21) or (4.22) uses roughly n 2 multiplications and n divi-
sions per iteration. Comparing this with the operation counts for Gauss elimination,
we see that these iterative methods are likely to be computationally less expensive if
they will converge in fewer than about n/3 iterations. For large sparse systems this
may be the case.
With any iterative method, the question arises, under what conditions will the
method converge? The examples above showed convergence, but unfortunately they
are not guaranteed to behave that way. Consider a slight modification to Example 11
by changing the first coefficient from a 6 to a 1 in the first equation.
A matrix is said to be (strictly) diagonally dominant if

|a_ii| > Σ_{j≠i} |a_ij|

for each i. For our example, we have 6 > 3 + 2, 5 > 2 + 1, and 4 > 1 + 1 whereas in the modified system the first row gives 1 < 3 + 2.
The simplest conditions under which both Jacobi and Gauss–Seidel iterations can
be proved to converge is when the coefficient matrix is diagonally dominant. The
details of these proofs are beyond our present scope. They are essentially multivari-
able versions of the conditions for convergence of fixed-point (function) iteration
which will be discussed for a single equation in the next chapter. The diagonal domi-
nance of the matrix ensures that the various (multivariable) derivatives of the iteration
functions (4.20) and (4.23) are all less than one.
In general, the Gauss–Seidel method will converge faster than the Jacobi method. On the other hand, the updates of a Jacobi iteration can all be performed simultaneously, whereas for Gauss–Seidel, x_2^{(k+1)} cannot be computed until after x_1^{(k+1)} is known. This means the Jacobi method has the potential to be easily parallelizable while Gauss–Seidel does not. There are other iterative schemes which can be employed to try to take some advantage of the ideas of the Gauss–Seidel iteration in a parallel computing environment.
This point is readily borne out by the efficient implementation of the Jacobi
iteration in Python/NumPy where we can take advantage of the matrix operations.
import numpy as np

def jacobit(A, b, Nits):
    """
    Function for computing `Nits` iterations of the
    Jacobi method for Ax = b where `A` must be square.
    """
    D = np.diag(A)
    n = A.shape[0]
    A_D = A - np.diag(D)        # This is L + U
    x = np.zeros(n)
    s = np.zeros((n, Nits))
    for k in range(Nits):
        x = (b - A_D.dot(x)) / D
        s[:, k] = x
    return s
Note the ease of implementation that NumPy’s matrix arithmetic allows for the Jacobi
iteration. The same simplicity could not be achieved for Gauss–Seidel because of
the need to compute each component in turn. The implicit inner loop would need to
be “unrolled” into its component form.
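For comparison, a sketch of a Gauss–Seidel version written in that unrolled component form (gauss_seidel is a hypothetical name, not a listing from the text) could be:

import numpy as np

def gauss_seidel(A, b, Nits):
    n = A.shape[0]
    x = np.zeros(n)
    s = np.zeros((n, Nits))
    for k in range(Nits):
        for i in range(n):
            # entries x[0:i] already hold the new values for this sweep
            sigma = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (b[i] - sigma) / A[i, i]
        s[:, k] = x
    return s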
Example 12 Perform the first six Jacobi iterations for a randomly generated diago-
nally dominant 10 × 10 system of linear equations.
>>> A = 10 * np.eye(10)
>>> A += np.random.rand(10, 10)
>>> b = np.sum(A, axis=1)
Since the random numbers are all in [0, 1] , the construction of A guarantees its
diagonal dominance. The right-hand side is chosen to make the solution a vector of
ones. Using the program listed above we get the table of results below.
>>> S = jacobit(A, b, 6)
Iteration 1 2 3 4 5 6
x1 1.4653 0.8232 1.0716 0.9714 1.0114 0.9954
x2 1.2719 0.8900 1.0447 0.9822 1.0071 0.9971
x3 1.3606 0.8492 1.0599 0.9761 1.0096 0.9962
x4 1.3785 0.8437 1.0615 0.9753 1.0099 0.9961
x5 1.3691 0.8528 1.0592 0.9764 1.0095 0.9962
x6 1.4424 0.8310 1.0677 0.9729 1.0108 0.9957
x7 1.4470 0.8178 1.0725 0.9710 1.0116 0.9954
x8 1.4441 0.8168 1.0727 0.9709 1.0116 0.9953
x9 1.4001 0.8358 1.0652 0.9738 1.0105 0.9958
x10 1.4348 0.8345 1.0670 0.9733 1.0107 0.9957
The Jacobi iteration is particularly easy to program, but the Gauss–Seidel is typi-
cally more efficient. The results of using five Gauss–Seidel iterations for the system
generated in Example 12 are tabulated below.
Iteration 1 2 3 4 5
x1 1.4653 0.9487 1.0021 1.0003 1.
x2 1.2692 0.9746 0.9996 1.0002 1.
x3 1.2563 0.9914 0.9982 1.0001 1.
x4 1.2699 0.9935 0.9977 1.0001 1.
x5 1.1737 1.0066 0.9986 1. 1.
x6 0.9459 1.0092 1.0000 0.9999 1.
x7 1.1264 1.0130 0.9996 0.9999 1.
x8 0.9662 1.0047 0.9999 1. 1.
x9 0.9719 1.0074 0.9999 0.9999 1.
x10 0.9256 1.0016 1.0004 1. 1.
It is apparent that the Gauss–Seidel iteration has given greater accuracy much
more quickly.
Exercises
In the simplest case the “maze” consists of a rectangular grid of passages and the
food is placed at exits on one edge of the grid. In order to decide whether the rat
learns, it is first necessary to determine the probability that the rat gets food as
a result of purely random decisions. Specifically, suppose the maze consists of
a 6 × 5 grid of passages with exits at each edge. Food is placed at each exit on
the left hand (western) edge of the grid. Let’s consider finding the probabilities
Pi j that the rat finds food from the initial position (i, j) where we are using
“matrix coordinates”. An example grid is in Fig. 4.3 and is augmented with the
“probabilities of success” for starting points at each of the exits.
Now, for random decisions, assume that the probability of success (that is getting
food) starting from position (i, j) will be simply the average of the probabilities at
the four neighboring points. This can be written as a simple system of equations:
P_ij = (P_{i−1,j} + P_{i+1,j} + P_{i,j−1} + P_{i,j+1}) / 4

where i = 1, 2, . . . , 6, j = 1, 2, . . . , 5 and the probabilities at the edges are as shown. The array of values would therefore be augmented with P_{0j} = P_{7j} = P_{i6} = 0 and P_{i0} = 1. (The apparent conflict at the corners is irrelevant since there is no passage leading to the corner at (0, 0) for example.)
For our case this could be rewritten as a conventional system of 30 linear
equations in the unknowns Pi j . However this system would be very sparse. That
is, a high proportion of the entries in the coefficient matrix would be 0 meaning
this is a good candidate for an iterative solver. Use the Jacobi or Gauss–Seidel
methods to calculate the probabilities.
7. The model used in the simple case above (Exercise 6) is probably too simple. It is
used to test whether rats learn by statistical comparison of the rat’s performance
with that which would be expected as a result of purely random decisions. To
make a more extensive test of the rat's intelligence, the basic model needs refining.
• The rat never reverses its direction at one intersection (that is, no 180◦ turns)
• Incorporate variable probabilities for different turns with greater turning angles
having lower probabilities. In particular for a rectangular maze try a probability
of 1/2 for straight ahead, 1/4 for each 90◦ turn and 0 for a 180◦ turn.
114 4 Linear Equations
Try other values for the probabilities and extend the idea to diagonal passages,
too.
9. In this problem, we revisit the image distortion application in the exercises of
Sects. 3.2 and 4.2. That exercise highlighted that explicit methods are often prone
to stability problems. Implicit methods can help overcome this but require a linear
solve at each time step. The backward Euler approximation is one such approach
which simply considers all the values on the right hand side evaluated at the time
k + 1 instead, as seen in Exercise 10 of Sect. 3.2. Often this system is too large to
solve directly (i.e. with Gaussian Elimination) and since the matrix A is actually
sparse (contains a lot of zeros) it is suitable for an iterative method, such as Jacobi
or Gauss–Seidel.
(a) What does the resulting linear system look like for the 2-D diffusion equation
at each time step? How has the matrix structure changed in comparison to
the 1-D case?
(b) Determine (with pencil and paper) what the Jacobi iteration gives based on
the matrix splitting for updating the i, jth component of the solution.
(c) Since the matrix has mainly zeros, it would be inefficient to store the whole
matrix and use Gaussian elimination (and the problem is large in size). The
purpose of using the Jacobi method is that we do NOT need to store the
matrix or even form the matrix splitting that the method is based on. You
can manipulate the expression from Eq. (3.11) (with the terms on the right evaluated at k + 1) to represent the Jacobi iteration, which basically forms an iteration by dividing by the diagonal entries, to solve for u_{i,j} in terms of the others.
(d) Solve the 2-d diffusion equation for image distortion in Exercise 10 (Sect. 3.2)
with the Jacobi iteration. Use your own image. Explore different time step
sizes and draw some conclusions on the impact on the final solution.
Some remarks:
• This code should be written to take in an initial value u(0) and solve the
diffusion equation forward in time.
• Do NOT confuse the Jacobi iterates with u^{k+1} and u^k which represent solutions in TIME.
• To simplify the coding (and because we are applying this to an image, which
is a non-physical problem) assume h = 1 in your coding.
• For points along the boundary, we avoid complexity by placing zeros all
around the solution (i.e. assume zero boundary conditions).
• Because Jacobi requires an initial guess to get the iteration started, it is com-
mon to use the solution at the previous time step.
4.5 Linear Least Squares Approximation

We now return to one of the main motivating themes of this text; building mathe-
matical models to understand the world around us and using scientific computing
to find solutions. Models are often used to make predictions, sometimes created to
try to match up well to data. The data may be experimental, historical, or may even
be output from a complicated, computationally expensive function. In any case, one
wants to trust that the model is a good predictor for other points of interest that are
not yet measured or known. Unfortunately, since models are just that–models, they
often are created based on a wide range of assumptions and simplifications that lead
to unknown model parameters.
For example, suppose a team of safety analysts is trying to understand the distance
of a car traveling at a constant speed when a driver is alerted that they need to stop
(for example, when they see an animal about to run into the road). They come up
with the following relationships based on a variety of assumptions (some based on
physics and some based on observing and summarizing human behavior).
1. The reaction distance, that is the distance the car travels when a person realizes
they need to stop and when they actually apply the brakes, is proportional to the
speed.
2. The stopping distance, that is the distance a car travels once the brakes are applied,
is proportional to the square of the speed.
These two assumptions together imply that if x is the speed of the car, then the total
stopping distance from the time the person was alerted would be D(x) = a1 x + a2 x 2 .
Next suppose they consider an experiment in which a car traveled at a fixed speed
until the driver was told to stop and then the distance the car traveled from that
moment until it completely stopped was recorded (see Table 4.1).
Their problem would then be to try to find the values of a1 and a2 so that their
model fit as closely to the data as possible. Then, they may be able to predict the
braking distance of a person driving at 48 km/h using the model. We’ll go through
the process of finding the unknown model parameter in an example later.
For now, let’s consider the general problem; finding a polynomial of a certain
degree, p(x) = a0 + a1 x + a2 x 2 + . . . a M x M that agrees with a set of data,
D = {(x1 , y1 ), (x2 , y2 ), . . . , (x N , y N )} .
Our problem can now be posed as to find the coefficients a_0, a_1, . . . , a_M so that the polynomial agrees with the data as closely as possible at x_1, x_2, . . . , x_N. Mathematically, this is usually posed as a least-squares approximation. That is, the problem is

min_a Σ_{i=1}^{N} (y_i − p(x_i))²        (4.24)
where a is a vector containing the coefficients in p. At first glance it may not be clear at all why this is called linear least squares, but if the x_i are distinct then evaluation of p(x) at those points actually leads to a linear system since

min_a Σ_{i=1}^{N} (y_i − p(x_i))² = min_a (y − Aa)^T (y − Aa),

where A is the coefficient matrix whose ith row contains the powers of x_i appearing in p, and y is the vector of data measurements. We can expand this further using linear algebra. Recall that for a vector, say x, x^T x = Σ_i x_i². So, expanded out, our problem is to minimize (y − Aa)^T (y − Aa), and the minimizing coefficient vector can be shown to satisfy

A^T A a = A^T y,

known as the normal equations. The details of arriving at the normal equations for
this discrete case are a bit beyond the scope of this book and left as an exercise
(with some hints), but the theory is provided below for the continuous least squares
problem.
There are multiple ways to solve the normal equations depending on the size of A. For example, if the x_i are distinct and if N = M, then A is a square, nonsingular matrix and therefore so is A^T, and so this problem reduces to solving Aa = y. Usually, there are more data points than there are unknowns (that is N > M) which means that A is rectangular. Once again though, if the x_i are distinct then one can show that A^T A is a square, nonsingular matrix and the normal equations can be solved directly.
Example 13 Find the model parameters to fit the braking data in Table 4.1 in a least
squares approximation.
Here the discrete model inputs (i.e. xi ) are taken as the five speeds and the model
outputs (i.e. the yi ) are the stopping distances. Since the safety analysts’ model has
Fig. 4.4 Model for stopping distance plotted with experimental data
a_0 = 0, D(x_i) = a_1 x_i + a_2 x_i². The coefficient matrix has rows made up of x_i and x_i²;

    [  40    1600 ]
    [  55    3025 ]
    [  70    4900 ]
    [  90    8100 ]
    [ 100   10000 ]
Recall, A^T is formed by transposing the rows and columns of A so that A^T A is the 2 × 2 matrix and the resulting linear system, A^T A a = A^T y, is given by

    [   27625     2302375 ] [ a_1 ]   [   20155 ]
    [ 2302375   201330625 ] [ a_2 ] = [ 1742875 ]
Solving this gives a_1 = 0.1728 and a_2 = 0.0067. So, the model can now be used to predict the stopping distance for intermediate speeds. For example, to predict the stopping distance if driving 45 km/h we have D(45) = 0.1728(45) + 0.0067(45²) = 21.34 m. Figure 4.4 shows the resulting model curve and data points plotted together.
It is important to assess the quality of the model; how well does the model agree with the data? Calculating the absolute difference between the model at the speed values and the data (i.e. |D(x_i) − y_i|) gives the following values: 0.6007, 1.2873, 1.1689, 4.6655, 2.9130. So the biggest error is roughly 4.7 m, which occurs when trying to match the stopping distance at 90 km/h. Whether or not the maximum error is acceptable,
is of course up to the modeler (or perhaps the customer). They may decide to revisit
their assumptions and come up with an entirely different model.
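As a sketch of how such a fit could be carried out in NumPy (fit_braking_model is a hypothetical name; speeds and distances would hold the measured data of Table 4.1, not reproduced here):

import numpy as np

def fit_braking_model(speeds, distances):
    # Least squares fit of D(x) = a1*x + a2*x**2 via the normal equations
    A = np.column_stack([speeds, speeds**2])   # one row [x_i, x_i**2] per observation
    a = np.linalg.solve(A.T @ A, A.T @ distances)
    return a                                    # array [a1, a2]

Calling it with the five measured speeds and stopping distances should reproduce a_1 ≈ 0.1728 and a_2 ≈ 0.0067.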
c_0 a_0 + c_1 a_1 + ··· + c_M a_M = b_0
c_1 a_0 + c_2 a_1 + ··· + c_{M+1} a_M = b_1
c_2 a_0 + c_3 a_1 + ··· + c_{M+2} a_M = b_2        (4.27)
  ⋮           ⋮                ⋮         =  ⋮
c_M a_0 + c_{M+1} a_1 + ··· + c_{2M} a_M = b_M
Note the special structure of the matrix with constant entries on each “reverse
diagonal”. Such matrices are often called Hankel matrices.
In the discrete case that we investigated above, a similar system was obtained, the
only difference being that the coefficients are defined by the discrete analogues of
(4.28):
c_k = Σ_{i=0}^{N} x_i^k,    b_k = Σ_{i=0}^{N} x_i^k f(x_i)        (4.30)
These are exactly the entries in A^T A and A^T y if you consider y_i = f(x_i). In
either case therefore the least-squares approximation problem is reduced to a linear
system of equations, the normal equations. The matrix of coefficients is necessarily
nonsingular (provided the data points are distinct in the discrete case, with M = N)
and so the problem has a unique solution. The reason this process is called linear
least-squares is that the approximating function is a linear combination of the basis
functions, in this case the monomials 1, x, x 2 , . . . , x M .
Example 14 Find the least squares cubic polynomial approximation to sin x on the interval [0, π].

We must minimize

F(a_0, a_1, a_2, a_3) = ∫_0^π (sin x − a_0 − a_1 x − a_2 x² − a_3 x³)² dx
Fig. 4.5 f (x) = sin(x) and the least squares cubic polynomial approximation (dashed line)
Applying any of our linear equation solvers (LU factorization or Gauss elimination) we obtain the solution

a_0 = −0.0505, a_1 = 1.3122, a_2 = −0.4177, a_3 = 0

so that the required least squares approximation is

sin x ≈ −0.0505 + 1.3122x − 0.4177x²

The plot of the approximation (dashed line) along with the true solution is shown in Fig. 4.5 and the error is shown in Fig. 4.6.
The Python (NumPy) function polyfit performs essentially these operations to
compute discrete least-squares polynomial fitting. See Sect. 4.8.2 for details.
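For instance, a discrete analogue of the cubic fit above could be obtained with polyfit as in the following sketch (the sample points are an assumption chosen for illustration):

import numpy as np

x = np.linspace(0, np.pi, 200)
coeffs = np.polyfit(x, np.sin(x), 3)       # coefficients, highest degree first
p = np.poly1d(coeffs)
print(np.max(np.abs(p(x) - np.sin(x))))    # maximum error at the sample points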
Unfortunately, the system of resulting linear equations for the method presented
above tends to be ill-conditioned, particularly for high-degree approximating polyno-
mials. For example, consider the continuous least-squares polynomial approximation
over [0, 1]. The coefficients of the linear system will then be

h_ij = c(i + j − 1) = ∫_0^1 x^{i+j−2} dx = 1/(i + j − 1)
which is to say the coefficient matrix will be the Hilbert matrix of appropriate size.
As mentioned earlier in the text, as the dimension increases, solutions of such Hilbert
systems become increasingly incorrect.
Fig. 4.6 The error in the least squares cubic approximation to sin x
The same ill-conditioning holds true even for discrete least-squares coefficient
matrices so that alternative methods are needed to obtain highly accurate solutions.
These alternatives use a different basis for the representation of our approximating
polynomial. Specifically a basis is chosen for which the linear system will be well-
conditioned and possibly sparse. The ideal level of sparsity would be for the matrix
to be diagonal. This can be achieved by using orthogonal polynomials as our basis functions.
Let φ_0, φ_1, . . . , φ_M be polynomials of degrees 0, 1, . . . , M respectively. These polynomials form a basis for the space of polynomials of degree ≤ M. Our approximating polynomial p can then be written as a linear combination of these basis polynomials;

p(x) = a_0 φ_0(x) + a_1 φ_1(x) + ··· + a_M φ_M(x)
The same procedure applies–we set all partial derivatives of F to zero to obtain a
system of linear equations for the unknown coefficients;
    [ c_00  c_01  c_02  ···  c_0M ] [ a_0 ]   [ b_0 ]
    [ c_10  c_11  c_12  ···  c_1M ] [ a_1 ]   [ b_1 ]
    [ c_20  c_21  c_22  ···  c_2M ] [ a_2 ] = [ b_2 ]        (4.32)
    [  ···   ···   ···        ··· ] [  ⋮  ]   [  ⋮  ]
    [ c_M0  c_M1  c_M2  ···  c_MM ] [ a_M ]   [ b_M ]
where

c_ij = c_ji = ∫_a^b φ_i(x) φ_j(x) dx
b_i = ∫_a^b f(x) φ_i(x) dx
If the basis polynomials are chosen so that c_ij = ∫_a^b φ_i(x) φ_j(x) dx = 0 whenever i ≠ j, this coefficient matrix is diagonal. Such polynomials are known as orthogonal polynomials over [a, b]. The members of such a set of orthogonal polynomials will depend on the specific interval [a, b]. Note that there are many classes of orthogonal polynomials and actually, other orthogonal functions could be used for approximation depending on the context.
We focus on one particular set of orthogonal polynomials called the Legendre
polynomials Pn (x) . Multiplying orthogonal functions by scalar constants does not
affect their orthogonality and so some normalization is needed. For the Legendre
polynomials one common normalization is to set Pn (1) = 1 for each n. With this
normalization the first few polynomials on the interval [−1, 1] are
P_0(x) = 1,  P_1(x) = x
P_2(x) = (3x² − 1)/2
P_3(x) = (5x³ − 3x)/2        (4.36)
P_4(x) = (35x⁴ − 30x² + 3)/8
Example 15 Find the first three orthogonal polynomials on the interval [0, π] and
use them to find the least-squares quadratic approximation to sin x on this interval.
The first one, φ0 has degree zero and leading coefficient 1. As we define the
subsequent polynomials, we normalize them so the leading coefficient is one. It
follows that
φ0 (x) = 1
Next φ1 (x) must have the form φ1 (x) = x + c for some constant c, and must be
orthogonal to φ_0. That is,

∫_0^π φ_1(x) φ_0(x) dx = ∫_0^π (x + c)(1) dx = π²/2 + cπ = 0,

which gives c = −π/2 and so φ_1(x) = x − π/2. Proceeding in the same way, requiring orthogonality to both φ_0 and φ_1 gives

φ_2(x) = x² − πx + π²/6
The coefficients required in (4.32) are then

c_00 = π,  c_11 = π³/12,  c_22 = π⁵/180
b_0 = 2,   b_1 = 0,       b_2 = π²/3 − 4,
giving the coefficients of the approximation

a_0 = 2/π,  a_1 = 0/(π³/12) = 0,  a_2 = (π²/3 − 4)/(π⁵/180),

so that

sin x ≈ 2/π + ((60π² − 720)/π⁵)(x² − πx + π²/6)
The rearrangement simply allows us to see that this is exactly the same approximation
derived in Example 14.
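A quick numerical check of these coefficients (a sketch using SciPy's quad routine, not part of the text) is:

import numpy as np
from scipy.integrate import quad

phi2 = lambda x: x**2 - np.pi * x + np.pi**2 / 6
c22, _ = quad(lambda x: phi2(x)**2, 0, np.pi)           # should be pi**5/180
b2, _ = quad(lambda x: np.sin(x) * phi2(x), 0, np.pi)   # should be pi**2/3 - 4
print(b2 / c22, (60 * np.pi**2 - 720) / np.pi**5)       # both approximately -0.4177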
One immediate benefit of using an orthogonal basis for least-squares approxima-
tion is that additional terms can be added with little effort because the resulting system
is diagonal. The previously computed coefficients remain valid for the higher-degree
approximation and only the integration and division are needed to get the next coef-
ficient. Moreover, the resulting series expansion that would result from continuing
the process usually converges faster than a simple power series expansion, giving
greater accuracy for less effort.
Keep in mind, there are other functions that may be appropriate for least squares approximations instead of polynomials. For periodic functions, such as waves, including square or triangular waves, periodic basis functions are a suitable choice.
On the interval [−π, π] , the functions sin (mx) and cos (nx) form an orthogonal
family. The least-squares approximation using these functions is the Fourier series.
The coefficients of this series represent the Fourier transform of the original function.
We do not go into those types of problems here but hope the reader will investigate
the wide range of possibilities when it comes to least squares problems since they
arise frequently in practice.
Exercises
9. Use the results of Exercise 7 to obtain the fourth degree least-squares approxi-
mation to arctan x on [0, 1] . Plot the error curve for this approximation.
10. The annual profits for a given company are shown below. Use a least squares
approach to obtain a polynomial to predict the company’s profits. Should the
company close down next year? In two years? When should the company decide
it is losing too much money?
11. A biologist is modeling the trajectory of a frog using a motion sensor to collect
the position of the frog as it leaps. She has the following positions;
Find the quadratic polynomial that best fits the data. Use your quadratic model
to predict the maximum height of the frog and the horizontal distance it might
occur. Plot the data and the polynomial together.
12. Newton’s Law of Cooling (or warming) states that the difference between an
object’s current temperature and the ambient temperature decays at a rate pro-
portional to that difference. This results in an exponential decay model for the
temperature difference as a function of time that could be modeled as

T(t) = T_E + (T_0 − T_E) e^{−kt}
Here, T (t) is the temperature of the object at time t, T0 is the initial temper-
ature, and TE is the temperature of the environment (i.e. the ambient or room
temperature).
Note that the model is not linear in its coefficients and so we need to modify
the data and model (we address the nonlinearities and give more background on
the model in the next chapter). First rearrange the equation as
T − T_E = (T_0 − T_E) e^{−kt}
and note that T − TE and T0 − TE are both negative. If we multiply both sides
by −1, we can use logarithms to obtain a linear least squares problem to obtain
estimates of TE − T0 and k.
(a) Take a cold drink out of the refrigerator in a can or bottle–but don’t open it
yet. Gather data on the temperature of the drink container after 5 min, and at
three minute intervals for another fifteen minutes. You will have six readings
in all.
(b) Use linear least squares to estimate the temperature of your refrigerator and
the constant k for this particular container. It is reasonable to suppose that
the refrigerator temperature is the initial temperature T0 of your container
assuming it has been inside for some time.
(c) How long will it take your drink to reach 15.5 ◦ C or 60 ◦ F? Check your
answer by measuring the temperature after this time.
(d) What does your solution say the temperature of your refrigerator is? Check its accuracy by testing the actual refrigerator.
(e) How could you modify this procedure if you do not know the ambient
temperature?
(f) What would happen if that temperature was itself changing?
4.6 Eigenvalues
If you have had a course in applied linear algebra or differential equations, the notion
of eigenvalues may be familiar. What may be less clear is why they are so important.
Probably everyone reading this in fact uses an enormous eigenvalue problem solver
many times a day. Google's page rank algorithm is, at its heart, a very large eigenvalue
solver. At full size, the matrix system would have dimension in the many billions–
but of course this matrix would be very sparse and most entries are irrelevant to any
particular search.
A widely used tool in statistics and data science is Principal Component Analysis
in which we identify the factors of most critical importance to some phenomenon of
interest, such as identifying which aspects of a student’s pre-college preparation are
most important predictors of their likely degree completion, or which courses they
should be placed in at the start of their college career. This is an eigenvalue problem.
Closely related to this is identifying primary modes, for example the fundamental
frequency in a vibrating structure–again an eigenvalue problem. Identifying such
modes can be critically important in order to avoid resonant vibration in a bridge
that could lead to its failure.
Biologists and environmental scientists are often concerned with population
dynamics under different scenarios. The long term behavior of such population mod-
els typically depends on the eigenvalues of the underlying matrix and these allow
determination of the likelihood of the population gradually dying out, or growing
Recall that λ is an eigenvalue of A, with associated eigenvector v ≠ 0, if

Av = λv        (4.37)

Equivalently (A − λI)v = 0, which has a nontrivial solution v exactly when

det(A − λI) = 0.        (4.38)

One important use of eigenvalues is in measuring the conditioning of a matrix, through the condition number

κ(A) = |λ_max| / |λ_min|        (4.39)
where λmax and λmin are the (absolute) largest and smallest eigenvalues of A. Strictly
speaking a matrix has many condition numbers, and “large” is not itself well-defined
in this context. To get an idea of what “large” means, the condition number of the
6 × 6 Hilbert matrix is around 15 × 10⁶ while that for the well-conditioned matrix

    [ 1  2  3  4  5  6 ]
    [ 2  2  3  4  5  6 ]
    [ 3  3  3  4  5  6 ]
    [ 4  4  4  4  5  6 ]
    [ 5  5  5  5  5  6 ]
    [ 6  6  6  6  6  6 ]

is approximately 100.
We focus on the power method for finding eigenvalues and associated eigenvec-
tors and show how it can be used to estimate condition numbers. In its basic form
the power method is a technique which will provide the largest (in absolute value)
eigenvalue of a square matrix A.
The algorithm is based on the fact that if (λ, v) form an eigen pair, then

v^T A v / v^T v = v^T (λv) / v^T v = λ ||v||² / ||v||² = λ        (4.40)

The ratio v^T A v / v^T v is known as the Rayleigh quotient. We shall use the standard Euclidean norm ||v||₂ = √(v · v) = √(Σ_i v_i²), where v_i is the ith component of v. Repeatedly multiplying an initial vector x_0 by A produces the sequence

x_k = A^k x_0        (4.41)
which explains the origin of the name “power method” for this algorithm.
Variants of the algorithm use different ratios in place of the Rayleigh quotient.
Other possibilities are the (absolute) largest components of the vector, the (absolute)
sum of the vector components, the first elements of the vectors and so on. For vectors
of moderate dimension, the Euclidean norm is easily enough computed that there is
no great advantage in using these cheaper alternatives.
We note that the existence of a dominant eigenvalue for a real matrix necessarily
implies that this dominant eigenvalue is real. The power method and its variants are
therefore designed for finding real eigenvalues of real matrices or complex eigen-
values of complex matrices. They are not suited to the (common) situation of real
matrices with complex eigenvalues.
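A minimal sketch of the basic power method with the Rayleigh quotient (illustrative only; power_method is a hypothetical name, not the book's listing) is:

import numpy as np

def power_method(A, x0, nits):
    x = x0 / np.linalg.norm(x0)
    for _ in range(nits):
        y = A @ x
        lam = x @ y                    # Rayleigh quotient, since ||x|| = 1
        x = y / np.linalg.norm(y)      # normalize for the next step
    return lam, x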
With the arbitrary initial guess x0 = [1, 2, 3, 4, 5, 6]T , the results of the first few
iterations are
We see that the eigenvalue estimates have already settled to three decimals and
the components of the associated eigenvector are also converging quickly.
The power method seems to be reasonable for finding the dominant eigenvalue
of a matrix. The good news is that it can be modified to find other eigenvalues. The
next easiest eigenvalue to obtain is the (absolute) smallest one. To see why, note that
for a nonsingular matrix A, the eigenvalues of A−1 are the reciprocals of those of A
since if (λ, v) are an eigen pair of A, then

Av = λv,

so that

(1/λ) v = A^{−1} v.
It follows that the smallest eigenvalue of A is the reciprocal of the largest eigenvalue
of A−1 . This largest eigenvalue of A−1 could be computed using the power method
– except we do not have A−1 . The following technique, the inverse iteration, can be
implemented by taking advantage of the LU factorization of A to avoid a repeated
linear solve with different right hand sides.
Note, at each iteration, finding v =A−1 xk , is equivalent to solving the system
Av = xk
which can be done with forward and back substitution if the LU factors of A are
known. The algorithm for inverse iteration can be expressed as;
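The listing is not reproduced here, but a sketch along these lines (inverse_iteration is a hypothetical name; the LU factorization is done once with SciPy's lu_factor and reused by lu_solve) could be:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def inverse_iteration(A, x0, nits):
    lu_piv = lu_factor(A)              # factor A = LU once
    x = x0 / np.linalg.norm(x0)
    for _ in range(nits):
        y = lu_solve(lu_piv, x)        # solves A y = x using the stored factors
        mu = x @ y                     # Rayleigh quotient for A^{-1}
        x = y / np.linalg.norm(y)
    return 1.0 / mu, x                 # smallest (in magnitude) eigenvalue of A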
Example 17 Use inverse iteration to compute the smallest eigenvalue of the matrix
A used in Example 16.
Using the same starting vector as in Example 16 the first few estimates of the
largest eigenvalue of the inverse of A are: 0.0110, –1.000, –2.500, –3.0000 which are
settling slowly. After another 22 iterations we have the converged value –3.7285 for
the largest eigenvalue of A−1 so that the smallest eigenvalue of A is –0.2682 with the
approximate associated eigenvector [0.1487, −0.4074, 0.5587, −0.5600, 0.4076,
−0.1413] .
Testing the residual of these estimates by computing Av − λv we get a maximum component of approximately 1.5 × 10⁻⁴, which is consistent with the accuracy of the eigenvalue approximation.
There is even more good news–the power method can be further modified to find
other eigenvalues in addition to the largest and smallest in magnitude. Having some
notion of the locations of those eigenvalues can help and the following theorem
provides that insight.
Theorem (Gerschgorin) Every eigenvalue z of the matrix A lies in at least one of the disks in the complex plane with center a_ii and radius Σ_{j≠i} |a_ij|; that is,

|z − a_ii| ≤ Σ_{j≠i} |a_ij|.
Moreover, if any collection of m of these disks is disjoint from the rest, then exactly
m eigenvalues (counting multiplicities) lie in the union of these disks.
The diagonal entries and the absolute sums of their off-diagonal row-elements
give us the following set of centers and radii:
Center 1 2 3 4 5 6
Radius 20 20 21 23 26 30
These disks are plotted in Fig. 4.7. In this case Gerschgorin’s Theorem gives
us little new information since all the disks are contained in the largest one. Since
the matrix A is symmetric we do know that all of its eigenvalues are real, so the
theorem really only tells us that all eigenvalues lie in [−24, 36] . Since we already
know that λmax = 27.7230 and λmin = −0.2682 to four decimals, we can reduce
this interval to [−24, 27.7230] but this does not really assist in predicting where the
other eigenvalues lie.
Fig. 4.7 The Gerschgorin disks for the matrix of Example 16
is symmetric and so has real eigenvalues. Gerschgorin’s theorem implies that these
lie in the union of the intervals [1, 5] , [5, 7] , [−2, 0] , and [8, 12] . Note that since
the last two of these are each disjoint from the others, we can conclude that one
eigenvalue lies in each of the intervals [−2, 0] , and [8, 12] while the remaining two
lie in [1, 5] ∪ [5, 7] = [1, 7] . The smallest and largest are −1.2409 and 10.3664 to
four decimal places, respectively. There are two eigenvalues in [1, 7] still to be found.
If we can locate the one closest to the mid-point of this interval, then we could use
the trace to obtain a very good approximation to the remaining one.
How then could we compute the eigenvalue closest to a particular value? Specif-
ically suppose we seek the eigenvalue of a matrix A closest to some fixed μ.
Note that if λ is an eigenvalue of A with associated eigenvector v, then λ − μ is
an eigenvalue of A − μI with the same eigenvector v since
(A − μI)v = Av − μv = λv − μv = (λ − μ)v
gives the smallest eigenvalue as −0.907967 to six decimals. Adding 4 gives the
closest eigenvalue of B to 4 as 3.092033.
The trace of B is 3 + 6 − 1 + 10 = 18 and this must be the sum of the eigenvalues. We therefore conclude that the remaining eigenvalue must be very close to 18 − (−1.2409 + 10.3664 + 3.0920) = 5.7825.
If this eigenvalue was sought to greater accuracy, we could use inverse iteration with
an origin shift of, say, 6.
Although we have not often reported them, all the techniques based on the power
method also provide the corresponding eigenvectors.
What about the remaining eigenvalues of our original example matrix from
Example 16? Gerschgorin’s theorem provided no new information. The matrix was
        [ 1  2  3  4  5  6 ]
        [ 2  2  3  4  5  6 ]
    A = [ 3  3  3  4  5  6 ]
        [ 4  4  4  4  5  6 ]
        [ 5  5  5  5  5  6 ]
        [ 6  6  6  6  6  6 ]
and we have largest and smallest eigenvalues λmax = 27.7230 and λmin = −0.2682
to four decimals.
The sum of the eigenvalues is the trace of the matrix, 21, and their product is
det A = −6. Since 27.7230 + (−0.2682) = 27.4548, the sum of the remaining values must be approximately −6.5. Their product must be close to −6/((27.7230)(−0.2682)) = 0.807. Given the size of the product it is
reasonable to suppose there is one close to −5 or −6 with three smaller negative
ones.
Using an origin shift of −5 and inverse iteration on A − (−5) I we get one eigen-
value at −4.5729.
The final three have a sum close to −2 and a product close to 0.8/ (−4.6) =
−0.174. Assuming all are negative, then one close to −1 and two small ones looks
likely. Using an origin shift of −1, we get another eigenvalue of A at −1.0406.
Finally, we need two more eigenvalues with a sum close to −1 and a product
around 0.17. Using an origin shift of −0.5 we get the fifth eigenvalue −0.5066. The
trace now gives us the final eigenvalue as being very close to
21 − (27.7230 − 0.2682 − 4.5729 − 1.0406 − 0.5066) = −0.3347.
What we have seen in this last example is that for a reasonably small matrix, which
we can examine, then the combination of the power method with inverse iteration
and origin shifts can be used to obtain the full eigen structure of a matrix all of
whose eigenvalues are real. Of course this situation is ideal but perhaps not realistic.
It may not be known in advance that the eigenvalues are all real. The possible sign
patterns of the “missing” eigenvalues might be much more difficult to guess than in
this example.
Luckily, efficient and accurate numerical techniques do exist to approximate
eigenvalues of large matrices. Those methods are often studied in detail in
upper-level Matrix Theory and Computations courses; this section serves only as an
introduction to the subject.
Exercises
Consider the matrix
    ⎡ 5 5 5 5 5 ⎤
    ⎢ 4 4 4 4 5 ⎥
A = ⎢ 3 3 3 4 5 ⎥
    ⎢ 2 2 3 4 5 ⎥
    ⎣ 1 2 3 4 5 ⎦
which has entirely real eigenvalues. Find these eigenvalues each accurate to 5 decimal
places.
7. Modify your subroutine for the inverse power method so that it uses the LU
decomposition. Compare the computation costs of the two approaches for finding
the smallest eigenvalue of A.
8. One application of eigenvalues arises in the study of earthquake induced vibra-
tions on multistory buildings. The free transverse oscillations satisfy a system
of second order differential equations of the form
mv′′ = kBv,
import numpy as np
from matplotlib import pyplot as plt

# Eigenvalue example for earthquake induced
# vibrations: define the matrix
B = np.array([[-2, 1, 0, 0, 0],
              [1, -2, 1, 0, 0],
              [0, 1, -2, 1, 0],
              [0, 0, 1, -2, 1],
              [0, 0, 0, 1, -2]])
k = 1000
m = 1250
a = 0.075
# Find the smallest eigenvalue and accompanying
# eigenvector
# ... YOU DO THIS
# Call them lam and v
# Define omega
w = -1 * lam * (k / m)
# YOU define the solutions at each time step
# (THINK 5 x 200)
for i in range(200):
    Vk[:, i] =
9. We will consider a model for population growth that models only the female por-
tion of a population and see how eigenvalues can help explain how the population
evolves. Let L be the maximum age that can be attained by any female of the
species. We divide the lifetime interval [0, L] into n equally spaced subintervals
to form n age classes: age class i consists of those females whose age a satisfies
(i − 1)L/n ≤ a < iL/n.
So for example, the age a of a female in class 3 must satisfy 2L/n ≤ a < 3L/n.
Next, suppose the vector x represents the female population at some particular
point in time such that the ith component of x is the number of females in the
ith age class. We also require the following assumptions to develop our model: for
each age class i, let ai be the average number of daughters born to a female in class i
during one time period (the birth coefficients), and let bi be the fraction of females in
class i that survive to enter class i + 1 (the survival coefficients).
Given the above parameters, xk+1 can be calculated in terms of xk with the linear
model Axk = xk+1 , where
    ⎡ a1   a2   · · ·  an−1   an ⎤
    ⎢ b1   0    · · ·   0     0  ⎥
A = ⎢ 0    b2   · · ·   0     0  ⎥ .
    ⎢            ...             ⎥
    ⎣ 0    0    · · ·  bn−1   0  ⎦
Population Stability: There are three things that can happen to a population over
time; the population (1.) increases with time, (2.) decreases with time, or (3.)
the population stabilizes. In the case of (3.) we say that the population has reached
steady state. Eigenvalues can help us predict when the above situations might
occur. In particular, if A has a positive, dominant eigenvalue, then that eigenvalue
can help determine the long-term behavior of the population.
(a) Explain how the structure of A is consistent with the definitions of the birth
and survival coefficients and the way that the population advances in time.
(b) Suppose that a given population matrix A does have a dominant positive
eigenvalue λ. Given that Axk = xk+1, suppose that for some m the vector xm is
(a multiple of) the eigenvector that corresponds to λ. How can we relate the size
of λ to the three possible scenarios? (Think about 0 < λ < 1, λ = 1, λ > 1.)
(c) Consider the following sets of birth and death parameters (Tables 4.2, 4.3
and 4.4) of a population with L = 20 and n = 5. For each data set, model
the population for three choices of initial populations x0 and explain what
you see. Code up the power method for locating the dominant eigenvalue
and corresponding eigenvector. Use it to analyze these three sets and show
that the results are consistent with the theory you described above.
10. Consider the model in Exercise 9 above. Using whatever resources necessary,
investigate some ‘vital statistics’, which give population information for various
states and counties (actuaries work with this kind of data all the time!). Pick a
‘population’ and using real data that you find, create your own population model
and analyze the long term behavior. If your eigenvalue locator doesn’t work on
the real problem, explain why not.
4.7 Conclusions and Connections: Linear Equations

What have you learned in this chapter? And where does it lead?
The basic problem of solving two (or more) linear equations in two (or more)
unknowns is somewhat familiar. That situation was our starting point using simple
examples to illustrate the basics of eliminating variables from one equation by using
information from others.
That elementary approach is the basis for Gauss elimination. The need to auto-
mate the algorithm for computer application removes some of the choices that we
have in the “hand solution” of small scale systems–“convenient” arithmetic for hand
calculation is no longer a priority for the computer, of course.
Convenience may not be important but other arithmetic considerations certainly
are. The effects of rounding error in a simplistic implementation of Gauss elimination
can be severe–as we saw with a 3 × 3 example. Using pivoting, which is essentially
equivalent to changing the order of the equations, can usually overcome this loss of
precision.
A further modification, LU factorization, allows for subsequent systems with the
same coefficient matrix to be solved much more efficiently. We saw this in practice
for iterative refinement of the computed solution. The basic approach is still that
of Gauss elimination but with the ability to do the manipulations on the matrix just
once. For small systems that does not seem to offer much, but for large systems it
can be significant. Frequently in large time-dependent computational models (such
as weather systems modeling to take an extreme case) the same system needs to be
solved for different data relevant to the next time-step. Changing the computational
effort for subsequent data from proportional to n^3 to just proportional to n^2 (which
is the effect of LU factorization) can result in order of magnitude savings in compu-
tational time and effort. If you want tomorrow’s weather forecast today, rather than
next week, such savings are vital.
The situation just referenced is not completely realistic because the large systems
arising in big time-dependent models are often sparse, meaning lots of zeros in the
coefficient matrix. In such situations it is common to use iterative methods rather than
the so-called direct methods discussed up to now. The two fundamental approaches
are Jacobi and Gauss–Seidel iterations. Here again, though, we have introduced a
topic which is still actively researched. Large sparse linear systems play such an
important role as a basic tool in major computational processes that seeking even
modest improvements in speed, accuracy, storage requirements, memory handling,
and inter-processor communication can pay large dividends.
The final situation in which we studied linear equations was within (linear) least
squares approximation. It is worth reminding you that linear least squares does not
mean fitting a straight line to data–though that can be an example. It refers to the
problem being linear in the coefficients of the approximating function. Much of what
was discussed related to finding least squares polynomial approximations. Yet again
this is an introduction to a much bigger active research field. Fourier series can be
obtained in this same way using linear combinations of trigonometric functions, for
example. The Fast Fourier Transform, FFT, is one important example. It is arguably
the most important workhorse of signal processing which in turn is used, often with
special purpose improvements, in areas as diverse as hearing aid technology, radio
and television transmissions in compressed data, security scanners, and medical
imaging.
The final topic of the chapter was the eigenvalue problem. Eigenvalues, and their
even more powerful cousins singular values, are vital tools for many purposes–
signal processing is again an important one. They are the mathematical basis of
many statistical data analysis techniques such as principal component analysis. As
such eigenvalue problems are critically important to the whole field of data science
or data analytics which has become one of the biggest growth areas for employment
in recent years in industries from finance to health care to physical science and
engineering.
The fundamental Python (NumPy) function for linear equation solving is the
numpy.linalg.solve function, which computes the solution x of a square linear system
Ax = b
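A minimal usage sketch, with a small example system of our own:

>>> import numpy as np
>>> A = np.array([[3.0, 1.0], [1.0, 2.0]])
>>> b = np.array([9.0, 8.0])
>>> x = np.linalg.solve(A, b)
>>> print(x)
[2. 3.]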
For least squares polynomial fitting, NumPy provides the numpy.polyfit function.
Ideally the coefficients of a degree-m polynomial would satisfy
a0 + a1 xk + a2 xk^2 + · · · + am xk^m = fk
at every data point (xk, fk); since this is generally impossible, numpy.polyfit instead
returns the coefficients that minimize
Σ_{k=0}^{n} [ fk − (a0 + a1 xk + a2 xk^2 + · · · + am xk^m) ]²
The vector of coefficients returned by numpy.polyfit starts from the highest
powers of the polynomial. As an example, consider the data
>>> x = np.arange(-1, 1.1, 0.1)
>>> y = np.cos(np.pi * x)
A third-degree polynomial can be fitted to the data using the following command
>>> p = np.polyfit(x, y, 3)
The least-squares cubic approximation is given by the following coefficients
[-7.7208e-15 -2.1039e+00  5.2187e-15  7.2382e-01]
If we neglect the third- and first-degree coefficients, which are essentially zero due
to their small magnitude, the fit is effectively the quadratic −2.1039x² + 0.7238.
4.8.3 Eigenvalues
numpy.linalg.eig The NumPy function eig is the basic routine for computing
eigenvalues and eigenvectors. It returns all eigenvalues and eigenvectors of a
square matrix.
Running
>>> W, V = np.linalg.eig(A)
computes all eigenvalues of A returned as the vector W . The matrix V contains the
corresponding eigenvectors.
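For instance (a sketch, reusing the 6 × 6 matrix A of Example 16 from earlier in the chapter; any square array works the same way), the dominant eigenvalue and its eigenvector can be picked out of W and V:

>>> W, V = np.linalg.eig(A)
>>> i = np.argmax(W)         # for a symmetric matrix the eigenvalues are real
>>> print(W[i])              # approximately 27.7230 for the Example 16 matrix
>>> print(V[:, i])           # the corresponding eigenvector is the i-th column of V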
scipy.linalg.eig The similar SciPy function eig can additionally be used for solv-
ing the generalized eigenvalue problem
Ax = λBx
numpy.linalg.svd The NumPy function svd computes the singular value decomposition
A = U S V^T
The matrices U, V are orthonormal and S is a diagonal matrix containing the singular
values of A along its diagonal. The singular values are related to eigenvalues in the
sense that they are the square-roots of the eigenvalues of the matrix AT A.
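A minimal usage sketch (for a square matrix A; the third return value is V^T, and for rectangular matrices the diagonal factor must be padded to the appropriate shape):

>>> U, s, Vt = np.linalg.svd(A)
>>> print(s)                                    # singular values, largest first
>>> print(np.allclose(A, U @ np.diag(s) @ Vt))  # True (A is square here)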
numpy.linalg.matrix_rank We often need to be able to compute the rank of
a matrix. The rank is the number of linearly independent rows (or columns) of
a matrix. The rank also corresponds to the number of nonzero singular values of
a rectangular matrix (non-zero eigenvalues in case of a square matrix). NumPy’s
function matrix_rank uses SVD to compute the rank of a matrix. It is necessary to
use a certain tolerance in determining whether individual singular values should be
considered “zero”. This is due to roundoff errors, as discussed in Chap. 2. Python
uses a default value proportional to the largest singular value, to the largest dimension
of the matrix, and to the machine precision of the data type in use for storing the
matrix.
The syntax is as follows:
>>> r = np.linalg.matrix_rank(A)
for default tolerance. The tolerance (here 10^-8 for example) can be specified using
>>> r = np.linalg.matrix_rank(A, tol=1e-8)
There are numerous examples of standard linear algebra “arithmetic” throughout this
chapter. Here we also mention a small selection of other useful functions.
numpy.dot computes the product of two vectors, two matrices, or a vector and
a matrix. Depending on the dimensions of the vectors and matrices involved, the
product may correspond to an inner or outer product, or in general just matrix-
vector product. If a, b are two one-dimensional arrays (vectors), the following
>>> np.dot(a, b)
or equivalently
>>> a.dot(b)
corresponds to the inner product between the vectors:
a · b = Σ ak bk
For example,
>>> a = np.arange(1, 4)
>>> b = np.arange(6, 9)
>>> np.dot(a, b)
returns the result 44. Note here that NumPy’s arange function returns an array of
integers from the first argument up to, but excluding the second argument. This is
the standard convention in Python whenever specifying ranges of integers. They
can be thought of as half-open intervals; closed at the low end and open at the
high end.
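The same function handles matrix–vector products as well; a small sketch with our own 2 × 2 example:

>>> M = np.array([[1, 2], [3, 4]])
>>> v = np.array([1, 1])
>>> np.dot(M, v)
array([3, 7])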
numpy.linalg.norm computes the Euclidean norm of a vector (or a matrix)
‖a‖ = √(a · a)
Any p-norm for p a positive integer can also be obtained using np.linalg.norm,
as well as ‖·‖−∞, ‖·‖∞, the Frobenius, and the nuclear norm. For example with
b as above, we get
>>> np.linalg.norm(b)
12.206555615733702
>>> np.linalg.norm(b, 1)
21.0
>>> np.linalg.norm(b, np.inf)
8.0
The default is the Euclidean or 2-norm.
5 Iterative Solution of Nonlinear Equations
5.1 Introduction
We have just seen that although linear functions have been familiar to us since grade
school, large linear systems can be challenging to solve numerically. Moreover, even
though linear models are used frequently in understanding relationships and making
predictions, one could argue that the world around us is truly nonlinear (and possibly
even unpredictable!).
As an example, consider pouring a hot cup of coffee (or tea, or your favorite hot
beverage) and placing it on the kitchen counter. What things affect the temperature of
the coffee? What happens to the rate at which the coffee is cooling over time? Could
you make a sketch of the temperature of the coffee over time? Does it look linear?
Newton’s Law of Cooling is one way to approach this scenario with a well-known
mathematical model.
Newton’s law of cooling states that the rate of change of the temperature of an
object is proportional to the difference between the object and the ambient environ-
ment. In ideal circumstances, the object’s temperature can be modeled over time
as
T(t) = TE + (T0 − TE) e^(−kt)
Here TE is the surrounding temperature of the environment and T0 is the initial
temperature. The parameter k describes how quickly the object cools down and
depends on the object itself. This is a nonlinear model for the temperature. If you
wanted to find the time when the coffee reaches your ideal drinking temperature, say T̂,
then you would need to solve the nonlinear equation T̂ − T(t) = 0. We'll revisit
this model later.
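To get a feel for the model, it can be evaluated directly. This is just a sketch, using the parameter values that appear in Exercise 8 later in this chapter (TE = 22, T0 = 90, k = 0.04); the sample times are our own choice.

import numpy as np

TE, T0, k = 22.0, 90.0, 0.04            # values from Exercise 8 (degrees Celsius)
t = np.linspace(0, 60, 7)               # a few sample times
T = TE + (T0 - TE) * np.exp(-k * t)
print(T)                                # decays toward TE, clearly not linear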
For now consider general problems of the form
f (x) = 0. (5.1)
This problem is more challenging than it might seem, and there are few
functions for which we can directly find the solutions (or roots) of such an equation. To
this end, the methods discussed here compute a sequence of approximate solutions
which we hope converges to this solution. This idea was briefly presented in the
previous chapter in the context of the Jacobi and Gauss-Seidel methods for linear
systems. We provide more details with a brief listing of the basic facts included at the
end of this introductory section for reference purposes. However, a deeper treatment
of the convergence of sequences in general can be found in standard texts on calculus
or elementary real analysis.
Iterative methods start with an initial estimate, or guess, of the solution and accord-
ing to some simple rule, generate a sequence of estimates, or iterates, which we hope
gets closer to the solution of the original problem. In addition to an initial iterate,
iterative methods require a prescribed stopping criteria, which can impact the quality
of the final solution. We shall also discuss the convergence properties of the methods.
That is, it is critical to know under what conditions a certain method will work
and how quickly. The primary motivation for these theoretical aspects is to be able
to make the right choices when selecting a solver and setting the algorithmic param-
eters, thereby avoiding unnecessary computation while still obtaining the required precision in a
computed answer. The following ideas will help in the study and implementation
of iterative solvers.
We say that a sequence (an) converges to the limit L, written
an → L as n → ∞    or    lim_{n→∞} an = L,
if for every (small) ε > 0 there exists a number N such that
|an − L| < ε whenever n > N.
The condition |an − L| < ε can often be usefully rewritten in the form
L − ε < an < L + ε
If an → a and bn → b, then
1. an ± bn → a ± b
2. an bn → ab (In particular, can → ca for any constant c.)
3. an/bn → a/b provided bn, b ≠ 0 (In particular, 1/bn → 1/b.)
5.2 The Bisection Method

Many ideas in this section that lead to numerical methods for solving f (x) = 0
are derived from ideas you saw in Calculus. One of the simplest approaches is the
method of bisection which is based on the Intermediate Value Theorem. Suppose that
f is continuous on an interval [a, b] and that f (a) f (b) < 0 (so that the function is
negative at one end-point and positive at the other) then, by the intermediate value
theorem, there is a solution of Eq. (5.1) between a and b.
The basic situation and the first two iterations of the bisection method are illus-
trated in Fig. 5.1. At each stage we set m to be the midpoint of the current interval
[a, b] . If f (a) and f (m) have opposite signs then the solution must lie between
a and m. If they have the same sign then it must lie between m and b. One of the
endpoints can then be replaced by that midpoint and the entire process is repeated
on the new smaller interval.
In Fig. 5.1, we see that for the original interval [0, 1.1] , m = 0.55 and f (a) , f (m)
have the same sign. The endpoint a is replaced by m, which is to say m becomes
the new a, and a new m is computed. It is the midpoint of [0.55, 1.1], m = 0.825 as
shown. This time f (a) and f (m) have opposite signs, f (a) f (m) < 0, and so b
would be replaced by m.
The process can be continued until an approximate solution has been found to
a desired accuracy, or tolerance. One possible stopping criterion is that the
length of the interval b − a is smaller than a specified tolerance.
The bisection method is summarized below.
Example 1 Use the bisection method to solve the equation
x − cos x − 1 = 0 (5.2)
with an error smaller than 0.05.
Since f is continuous, the intermediate value theorem shows that there is a solu-
tion of (5.2) in [0, π/2] . We set a = 0, b = π/2. For the first iteration, m =
π/4, f (π/4) = π/4 − cos π/4 − 1 = −0.9217 < 0. So the midpoint, m now
becomes the left endpoint of the search interval, i.e. the solution must be in [π/4, π/2]
so we set a = π/4.
As the algorithm progresses, it can be useful to organize all the necessary informa-
tion in a table. For the second iteration m = 3π/8, f(3π/8) = 3π/8 − cos(3π/8) −
1 = −0.2046 < 0. So the solution lies in [3π/8, π/2], i.e. we set a = 3π/8 for
the next iteration.
At this point the process stops because b − a ≈ 0.09 < 0.1 and the solution is
1.2836 with error less than 0.05. Note that the solution is the midpoint of the final
interval and so is less than ε/2 away from either end of the interval.
Example 2 Solve equation (5.2) used in Example 1 using the bisection method in
Python. Start with the interval [0, π/2] and obtain the solution with an error less than
10−5 .
from numpy import cos, pi   # math.cos/math.pi would also work here

def eq1(x):
    return x - cos(x) - 1
Then the following instructions in the Python command window will achieve the
desired result:
a = 0
b = pi / 2
tol = 1e-5
fa = eq1(a)
fb = eq1(b)
while abs(b - a) > 2 * tol:
    m = (a + b) / 2
    fm = eq1(m)
    if fa * fm <= 0:
        b = m
    else:
        a = m
m = (a + b) / 2

>>> print(m)
1.283432589901804
Note that we could use 2∗tol in the test for the while statement because of the final
step after the completion of the loop which guarantees that this final solution has
the required accuracy. Also we used abs(b-a) to avoid any difficulty if the starting
values were reversed.
Of course, the trailing digits in the final answer do not imply that level of precision
in the solution – we have only ensured that the true solution is within 10−5 of this
value.
It should be fairly apparent that the commands used here correspond to those of
Algorithm 12. One key difference between that algorithm and the Python commands
above is that the latter depend critically on the name of the function file whereas in
the algorithm the function is one of the inputs. This can be achieved by defining a
function for the bisection method:
def bisect(fcn, a, b, tol):
    """
    Solve the equation fcn(x)=0 in the interval
    [a, b] to accuracy 'tol' using the bisection
    method. 'fcn' is a function.
    """
    fa = fcn(a)
    fb = fcn(b)
    while abs(b - a) > 2 * tol:
        m = (a + b) / 2
        fm = fcn(m)
        if fa * fm <= 0:
            b = m
        else:
            a = m
    return (a + b) / 2
To see the use of this bisection algorithm function, we resolve the same equation
to greater accuracy with the single command:
>>> s = bisect(eq1, 0, pi / 2, 1e-6)
>>> print(s)
1.283428844831521
The program above will work provided the function is continuous and provided
that the initial values fa and fb have opposite signs. It would be easy to build in a
check that this condition is satisfied by inserting the lines
if fa * fb > 0:
    print('Two endpoints have same sign')
    return
immediately before the start of the while loop.
Example 3 Use the bisection method to find the positive solutions of the equation
f(x) = e^x − 5x + 2 = 0.
At each iteration the interval is halved, so that
bn − an = (bn−1 − an−1) / 2
which implies that
bn − an = (b0 − a0) / 2^n
and therefore bn − an → 0. The left endpoints (an) form an increasing sequence and
the right endpoints (bn) a decreasing one; both are bounded, so they converge to limits
L and M respectively. It follows that L = M; that is, the two sequences have
the same limit.
Finally, because f is a continuous function,
f(an) f(bn) → f(L) f(M) = [f(L)]² ≥ 0,
but the algorithm is designed to ensure that f (an ) f (bn ) ≤ 0 for every n. The only
possibility is that f (L) = 0 which is to say this limit is a solution of the equation
f (x) = 0.
Exercises

5.3 Fixed Point Iteration
Although the bisection method provides a simple and reliable technique for solving
an equation, if the function f is more complicated (perhaps its values must be
obtained from the numerical solution of a differential equation) or if the task is to be
performed repeatedly for different values of some parameters, then a more efficient
technique is needed.
Fixed point iteration applies to problems in the special case that f (x) = 0 can be
rewritten in the form
x = g (x) . (5.3)
This new rearrangement can be used to define an iterative process as follows. Given
an initial iterate x0 , or estimate of the solution, then a sequence can be defined by
(xn ) with
xn = g (xn−1 ) n = 1, 2, . . . . (5.4)
Provided that g is continuous, we see that, if this iterative sequence converges then the
terms get closer and closer together. Eventually we obtain, to any required accuracy,
xn ≈ xn−1
which is to say
xn−1 ≈ g (xn−1 )
so that xn is, approximately, a solution of Eq. (5.3).
Example 4 Consider again Eq. (5.2), x − cos x − 1 = 0, which can be rearranged
as x = cos x + 1. The iteration function is then g(x) = cos x + 1. Consider the initial
guess x0 = 1.374 (which is roughly the second midpoint in the bisection iteration in
Example 2). The algorithm would give the following sequence
which appears to be converging to the same solution obtained with the bisection
method (although nothing has been formally proven about convergence).
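The iterates just described can be generated with a few lines of Python. This is only a sketch (the variable names and the number of iterations are our own choices), using the iteration function g(x) = cos x + 1 and the starting value x0 = 1.374 mentioned above.

import numpy as np

x = 1.374
for n in range(10):
    x = np.cos(x) + 1        # one fixed point step x_{n} = g(x_{n-1})
    print(n + 1, x)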
Given a problem f (x) = 0, there may be multiple ways to reformulate it as a
fixed point iteration. For example, the same equation could have been rewritten as
x = cos−1 (x − 1). Using the starting value x0 = 1.374, the first few iterations yield
the sequence
So the bad news is that after 18 iterations, the iteration is moving away from the
solution in both directions and is eventually not defined.
There is also good news; it is possible to determine in advance if a particular
re-formulation will converge and obtain a desired accuracy in the approximate solu-
tion. A graphical illustration of the process of function iteration helps to explain this.
Solving an equation of the form x = g (x) is equivalent to finding the intersection
of the graphs y = x and y = g (x). Figure 5.3 illustrates two convergent function
iterations. In each case, xn is the point at which the vertical line x = xn−1 meets the
curve y = g (x). The horizontal line at this height, y = g (xn−1 ) meets the graph of
y = x at x = xn = g (xn−1 ). The process can be continued indefinitely.
In the first case, where g is a decreasing function, the iterates converge in an
oscillatory manner to the solution – giving a web-like picture. This can be useful in
determining the accuracy of the computed solution. In the second case, the iterates
converge monotonically – either ascending or descending a staircase.
In Fig. 5.3, we see that function iteration can converge independent of whether
the iteration function is increasing or decreasing near the solution. What turns out to
be critical is how fast the function changes.
Let s denote the solution of x = g(x) and define the error
en = xn − s
for each n.
The fact that g(s) = s along with the Taylor series expansion of g about the
solution gives
en+1 = xn+1 − s = g(xn) − g(s)
     = g(s) + (xn − s) g′(s) + ((xn − s)²/2!) g″(s) + · · · − g(s)
     = en g′(s) + (en²/2!) g″(s) + · · ·    (5.5)
If |en| is small enough that higher-order terms can be neglected, we get
en+1 ≈ en g′(s)    (5.6)
which implies that the error will be reduced if |g′(s)| < 1. Keep in mind, we do not
know s in advance, but the following theorem establishes a sufficient condition for
convergence of an iteration. The result is that convergence is guaranteed if
|g′(x)| < 1 throughout an interval containing the solution. The condition that g is
twice differentiable was used to simplify the analysis above. We see in the next
theorem that only the first derivative is strictly needed.
Theorem 13 Suppose that g is continuously differentiable on [a, b] and that
(i) g(x) ∈ [a, b] for every x ∈ [a, b], and
(ii) there is a constant K < 1 such that |g′(x)| ≤ K for all x ∈ [a, b].
Then the equation x = g(x) has a unique solution s ∈ [a, b], and the iteration
x0 ∈ [a, b];  xn = g(xn−1),  n = 1, 2, . . .
converges to this solution.

Proof By condition (i), g(a) and g(b) both lie in [a, b], so that
a − g(a) ≤ 0 ≤ b − g(b)
and the continuous function x − g(x) changes sign on [a, b].
The intermediate value theorem implies that there exists s ∈ [a, b] such that s =
g (s) . The fact that s is the only such solution follows from the mean value theorem.
Suppose that t = g(t) for some t ∈ [a, b]. By the mean value theorem, g(s) −
g(t) = (s − t) g′(ξ) for some ξ between s and t. But g(s) = s and g(t) = t so that
s − t = (s − t) g′(ξ)
or
(s − t)(1 − g′(ξ)) = 0
By condition (ii), |g′(ξ)| < 1, which implies s − t = 0, proving that the solution is
unique.
The convergence of the iteration is also established by appealing to the mean value
theorem. By condition (i), we know that xn ∈ [a, b] for every n since x0 ∈ [a, b].
Then, for some ξn ∈ [a, b],
|en+1| = |g(xn) − g(s)| = |xn − s| · |g′(ξn)|
       = |en| · |g′(ξn)| ≤ K |en|
so that |en| ≤ K^n |e0| → 0 as n → ∞, and the iteration converges to s.
Example 5 Consider the equation e^x − 5x + 2 = 0 from Example 3, together with the
rearrangements
(i)   x = (e^x + 2)/5,
(ii)  x = e^x − 4x + 2,   and
(iii) x = ln(5x − 2)
We begin by examining the first few iterations using each of these rearrangements
with the iteration functions defined as Python functions. In each case we used the
initial guess x1 = 2.2.
The Python commands used for the first of these were as follows. The modifica-
tions required for the other cases should be apparent.
x = [2.2]
for k in range(1, 15):
    x.append(iter1(x[k - 1]))
where the function iter1 is defined as
from numpy import exp

def iter1(x):
    return (exp(x) + 2) / 5
Note that the use of NumPy's exp rather than exp from the Python standard library's
math module is necessary as the latter cannot handle very large function arguments.
For large arguments that result in an overflow the above code will output the following
warning
RuntimeWarning : o v e r f l o w e n c o u n t e r e d i n exp
r e t u r n ( exp ( x ) + 2 ) / 5
The results for the three rearrangements are:
The results imply that the first two rearrangements are not converging, while the
third appears to be settling down, slowly. Theorem 13 helps explain why.
(i) For the first rearrangement, g(x) = (e^x + 2)/5, giving g′(x) = e^x/5 > 1 for
every x > ln 5 ≈ 1.6. The theorem indicates this iteration should not converge to
the solution in [2, 3]. However, we see that 0 < g′(x) < 1 for all x < ln 5. Also, if
0 < x < ln 5 < 2, then g(x) ∈ (0.06, 0.943). It follows that the conditions of the
theorem are satisfied on this interval, and therefore that this iteration will converge
to this solution. With x1 = 0.9, the first few iterations yield:
0.8919, 0.8919, 0.8880, 0.8860, 0.8851, 0.8846, 0.8844, 0.8843, 0.8843, 0.8842, 0.8842
(ii) For the second rearrangement, g(x) = e^x − 4x + 2, giving g′(x) = e^x − 4. Thus
g′(x) > 1 for every x > ln 5, and the iteration should not be expected to converge
to the solution in [2, 3]. Unfortunately for this rearrangement, you can verify that
|g′(x)| > 1 near the smaller solution as well (g′(x) ≈ −1.6 near x = 0.88), so the
iteration fails to converge to either solution.
[Figure: graphs of y = x and y = e^x − 4x + 2]

(iii) For the third rearrangement, g(x) = ln(5x − 2), so g′(x) = 5/(5x − 2) and
|g′(x)| < 1 near the solution in [2, 3]; the iteration converges, although not especially
quickly, as the results above suggested. The loop can be coded so that only the last
three iterates are stored:
x0, x1, x2 = 1, 2, iter3(2)   # iter3(x) returns log(5*x - 2); tol, maxits set earlier
for iteration in range(1, maxits):
    x0 = x1
    x1 = x2
    x2 = iter3(x1)
    if abs(x2 - x1) < tol and abs(x1 - x0) < tol:
        break
>>> print(iteration)
18
>>> print(x2)
2.19370682
Note that in this implementation, we do not store the full array of iterates–only the
last three. The initial values x0 = 1, x1 = 2 are essentially arbitrary; we just need to
ensure that the iteration loop gets started. Here, the values of x0, x1 and x2 are
continually overwritten as better estimates of the solution are computed. In this case,
15 iterations were enough to meet the prescribed accuracy. (It is common to test
for convergence by examining just the last two iterates, but in the absence of other
information this test is safer – even if a little too conservative.)
Although we have proved convergence, we can look closer at the accuracy as well
by looking at the Taylor expansion of the error in Eq. (5.5) which shows that
en+1 = en g′(s) + (en²/2!) g″(s) + · · ·
Note that in Example 5, rearrangement (i) converged to the smaller solution for
appropriate starting points. The solution is less than 0.9, and therefore 0 < g′(s) <
g′(0.9) < 0.5 so that as the iterates converge the errors satisfy
en+1 < en / 2
meaning each iterate reduces the error by at least this factor, as we saw in the numeri-
cal results. Sometimes we can do much better, however. If we obtain a rearrangement
such that g′(s) = 0 then subsequent errors will be proportional to the squares of the
prior ones. In particular, if in addition g″(s) ≠ 0, then from Eq. (5.5) en+1 ≈ (g″(s)/2) en²,
which is quadratic convergence; this observation motivates Newton's method in the
next section.
Exercises
1. The equation
3x 3 − 5x 2 − 4x + 4 = 0
has a solution near x = 0.7. (See Exercise 1 in Sect. 5.2.) Carry out the first five
iterations for each of the rearrangements
(i) x = 5/3 + 4/(3x) − 4/(3x²),   and   (ii) x = 1 + (3x³ − 5x²)/4
starting with x0 = 0.7.
2. Which of the iterations in Exercise 1 will converge to the solution near 0.7? Prove
your assertion using Theorem 13. Find this solution using a tolerance of 10−6 .
3. Find a convergent rearrangement for the solution of the equation e x − 3x − 1 = 0
in [1, 3] . Use it to locate this solution using the tolerance 10−6 .
4. Intervals containing the three solutions of e x − 100x 2 = 0 were found in Exer-
cise 3 of Sect. 5.2. Each of the following rearrangements yields a convergent
iteration for one of these solutions. Verify that they are all rearrangements of the
original equation, and determine which will converge to which solution. Use the
appropriate iterations to locate the solutions with tolerance 10−6 .
(i)   x = exp(x/2)/10
(ii)  x = 2(ln x + ln 10)
(iii) x = −exp(x/2)/10
Newton’s method is one of the most widely used nonlinear solvers and it can be
derived using basic ideas from Calculus and the notion of approximating a nonlinear
function with a linear one. Moreover, the convergence of the iterative schemes in
Example 5 implied that rapid (quadratic) convergence was achieved when g′(s) = 0.
This is the motivation behind Newton’s, or the Newton–Raphson, method. The
iteration can be derived in terms of the first-order Taylor polynomial approximation,
but essentially a local linear model using the simple tangent line to f (x) is the
underlying idea.
To see this, consider the original nonlinear problem (5.1) f (x) = 0. The first-
order Taylor expansion of f about a point x0 is
f(x) ≈ f(x0) + (x − x0) f′(x0),
which looks a lot like the equation of the tangent line to f (x) at x0 . If the
point x0 is close to the required solution s, then we expect that setting the right-
hand side to zero should give a better approximation of this solution. Solving
f(x0) + (x − x0) f′(x0) = 0 (which is equivalent to finding where the tangent
line crosses the x-axis) provides an expression for the next approximation
x1 = x0 − f(x0)/f′(x0)
This process would then be repeated to generate a sequence of tangent lines, and
roots of those tangent lines to get a general form of the Newton iteration formula
xn+1 = xn − f(xn)/f′(xn)    (n = 0, 1, 2, . . .)    (5.7)

Viewed as a fixed point iteration, Newton's method uses the iteration function
g(x) = x − f(x)/f′(x), for which
g′(x) = 1 − f′(x)/f′(x) + f(x) f″(x)/[f′(x)]²
      = f(x) f″(x)/[f′(x)]²
so that g′(s) = 0 at a solution s (where f(s) = 0), which is exactly the condition for
quadratic convergence identified above.
Example 6 Find the positive square root of a real number c by Newton’s method.
To find √c, the nonlinear problem can be posed as
x² − c = 0.
With f(x) = x² − c and f′(x) = 2x, the Newton iteration becomes
xn+1 = xn − (xn² − c)/(2xn) = (xn + c/xn)/2
To demonstrate how the iteration would progress, let c = 3 and choose x0 = 1/2.
Then we get
x1 = (1/2 + 3/(1/2))/2 = 3.25
x2 = (3.25 + 3/3.25)/2 = 2.0865385,
2
x3 = 1.7621632
x4 = 1.7320508
x5 = 1.7320508
so with an initial guess perhaps not very close to the solution we have agreement to
seven decimals after the fourth iteration. It's also worth noting that quadratic conver-
gence can be observed by looking at the sequence of errors: 1.2321 × 10⁰, 1.5179 ×
10⁰, 3.5449 × 10⁻¹, 3.0112 × 10⁻², 2.5729 × 10⁻⁴, 1.9106 × 10⁻⁸; the size of the
exponent roughly doubles at each iteration.
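The computation in Example 6 takes only a few lines of Python; this is a sketch (the loop count is our own choice), using c = 3 and the starting value x0 = 1/2 from the example.

c = 3.0
x = 0.5                      # the initial guess x0 from Example 6
for _ in range(5):
    x = (x + c / x) / 2      # one Newton step for x**2 - c = 0
    print(x)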
Next we consider the implementation of Newton’s method for solving f (x) = 0
in Python. The necessary inputs are the function f, its derivative, an initial guess and
the required accuracy in the solution. In the implementation below, no limit is placed
on the number of iterations. This would be needed for a robust implementation of
Newton’s (or any iterative) method. Our objective is not to create such software, but
rather to get an idea of the basic ideas. Python and SciPy in particular have some
robust equation-solvers built in. We shall discuss those briefly later.
def newton(fcn, df, g, tol):
    """
    Solve the equation fcn(x)=0 to accuracy
    'tol' < 1 using Newton's method. Use initial
    guess 'g'. 'fcn' and 'df' must be functions
    and the latter is the derivative of the
    former.
    """
    old = g + 1  # Ensure iteration starts.
    while abs(g - old) > tol:
        old = g
        g = old - fcn(old) / df(old)
    sol = g
    return sol
We know that one solution lies in [2, 3] and so take the initial guess 2.5. The
function e^x − 5x + 2 and its derivative were defined as Python functions called eq1
and deq1 respectively after which the following Python command gave the result
shown.
>>> s = newton(eq1, deq1, 2.5, 1e-10)
>>> print(s)
2.1937416678456176
To get an idea of the speed with which Newton’s method found this solution, g was
printed in each iteration in the Python command window. The successive iterates
were:
2.2657507308869795
2.1990020960841123
2.1937726756131535
2.193741668931969
2.193741667845617
2.1937416678456176
Just six iterations were needed to obtain the result to high precision.
Upon looking at the iteration in general, it is clear that there are several things that
could go wrong with the Newton iteration. Equation (5.7) means that if f′(xn) is
small, then the correction made to the iterate xn will be large, i.e. if the derivative of
f is zero (or very small) near the solution, then Newton’s method may not converge.
Such a situation can arise if there are two solutions very close together, or, as in
Fig. 5.6, when the gradient of f is small everywhere except very close to the solution.
Example 8 Consider the equation tan−1 (x − 1) = 0.5 which has its unique solution
x = 1.5463 to four decimals.
The function f (x) = tan−1 (x − 1) − 0.5 is graphed in Fig. 5.6 along with the
first two iterations of Newton’s method with the rather poor starting point x0 = 4.2
which yields x1 ≈ −4.43 and x2 ≈ 53. The oscillations get steadily wilder reaching
inf and then nan in just ten more iterations using Python:
Iterate Value
x0 4.2
x1 −4.43132
x2 53.1732
x3 −2810.47
x4 1.63627e+07
x5 −2.86693e+14
x6 1.70205e+29
x7 −3.10205e+58
x8 1.99267e+117
x9 −4.25185e+234
x10 inf
x11 nan
Of course, this particular equation can be solved very easily: rewrite it in the
form x = 1 + tan(1/2) = 1.5463025 to obtain the “solution”. This rearrangement
transforms the original problem from one of equation solving to one of function
evaluation which is still a real computational problem as we shall see in the next
chapter.
The following theorem explains the conditions under which we can expect New-
ton’s method to converge. First, note that in Fig. 5.6, the function f is convex (concave
up) to the left of the solution and concave (concave down) to the right– i.e it has an
inflection point close to the solution causing issues. The global convergence theo-
rem, Theorem 14, has hypotheses which eliminate the possibility of any points of
inflection near the solution.
Theorem 14 Let f be twice differentiable on an interval [a, b] and satisfy the con-
ditions
(i) f(a) f(b) < 0
(ii) f′ has no zeros on [a, b]
(iii) f″ does not change sign on [a, b]
(iv) |f(a)/f′(a)| < b − a and |f(b)/f′(b)| < b − a.
Then Newton's method converges to the unique solution of f(x) = 0 in [a, b] for any
starting point x0 ∈ [a, b].
Proof The first condition establishes the existence of a solution since f must change
sign in the interval. The second and third conditions force both f and f′ to be strictly
monotone. These guarantee the uniqueness of the solution and the absence of any
monotone. These guarantee the uniqueness of the solution and the absence of any
inflection points. The final condition ensures that a Newton iteration from either
endpoint a or b will generate a point in the interval (a, b) . In combination these last
two conditions ensure that all the iterates remain in the interval. The details of the
proof are omitted.
Our motivation for Newton’s method was the desire to achieve quadratic conver-
gence. The next example provides the theoretical analysis of Newton's method for
square roots, as in Example 6.
Example 9 Show that the iteration xn+1 = (xn + c/xn)/2 converges quadratically to
√c for any x0 > 0.

We have
xn+1 − √c = (xn² − 2xn√c + c)/(2xn) = (xn − √c)²/(2xn)
Similarly
xn+1 + √c = (xn + √c)²/(2xn)
which implies that
(xn+1 − √c)/(xn+1 + √c) = [(xn − √c)/(xn + √c)]² = [(xn−1 − √c)/(xn−1 + √c)]⁴
                        = · · · = [(x0 − √c)/(x0 + √c)]^(2^(n+1))
Since x0 > 0, it follows that |(x0 − √c)/(x0 + √c)| < 1 and hence
(xn+1 − √c)/(xn+1 + √c) → 0 as n → ∞. Therefore xn → √c for every choice x0 > 0.
Since even powers of real numbers are positive, it also follows from the analysis
above that xn > √c for every n ≥ 1 so that
en+1 = en²/(2xn) ≤ en²/(2√c)
Exercises
1. The equation
3x 3 − 5x 2 − 4x + 4 = 0
has a solution near x = 0.7. (See Exercise 1 in Sect. 5.2.) Carry out the first four
iterations of Newton’s method to obtain this solution.
2. Use Newton’s method to obtain the solution of the equation e x − 3x − 1 = 0 in
[1, 3] using the tolerance 10−12 .
3. Intervals containing the three solutions of e x − 100x 2 = 0 were found in Exer-
cise 3 of Sect. 5.2. Use Newton’s method to locate the solutions with tolerance
10−10 .
4. For the equation in Exercise 3, two solutions are close to x = 0. Try to find
the critical value c such that if x0 > c then the solution near 0.1 is obtained,
while if x0 < c the negative solution is located. Now try to justify your answers
theoretically. (We are trying to find the regions of attraction for each of these
solutions.)
5. Show that Newton’s method for finding reciprocals by solving 1/x − c = 0
results in the iteration
xn+1 = xn (2 − cxn )
Show that this iteration function satisfies |g′(x)| < 1 for x ∈ (1/(2c), 3/(2c)).
Therefore xn+1 < 1/c. Show also that if xn < 1/c then xn+1 > xn . It follows
that, for n ≥ 1, (xn ) is an increasing sequence which converges quadratically to
1/c.
7. For Newton’s iteration for finding 1/c with c ∈ [1, 2) and x0 = 3/4, show that
six iterations will yield an error smaller than 2−65 .
8. Consider the coffee cooling scenario above. Use TE = 22 °C, T0 = 90 °C and
suppose your ideal temperature to drink the coffee is T̂ = 80 °C. Find the time
that the coffee reaches that temperature by solving T̂ − T (t) = 0 if k = 0.04
using Newton’s method and compare the convergence properties to your results
with the bisection method. Experiment with different initial iterates.
5.5 The Secant Method

In the previous section, we saw that Newton's method is a powerful tool for solving
equations, obtaining accurate solutions at a fast rate given the appropriate conditions.
We also saw that some issues can arise. One challenge in the implementation of
Newton’s method is that derivatives are required. In many real-world applications,
the derivative may not be available if, for example, the function itself is the result
of some other computation. One important example of this considered later is in
shooting methods for the solution of differential equations. The secant method is
an approach to recover some of the power of Newton’s method without using any
derivative information.
Newton’s method was described by the idea that the next iterate is the point at
where the tangent line (at the current estimate of the solution) cuts the x-axis. The
secant method uses the point at which the secant, or chord, line joining the two
previous iterates crosses the x-axis. Figure 5.7 shows this.
To define the iteration, consider the equation of the secant line joining two points
on the curve y = f(x) at x = x0, x1. The slope is (f(x1) − f(x0))/(x1 − x0) and so the
equation of the secant line is
y − f(x1) = [(f(x1) − f(x0))/(x1 − x0)] (x − x1).    (5.8)
Setting y = 0 in (5.8) and calling the solution x2 gives the next iterate,
x2 = x1 − f(x1) (x1 − x0)/(f(x1) − f(x0)).
[Fig. 5.7: secant lines on the curve y = f(x) and the successive iterates x0, x1, x2, x3, x4]
In general, the secant iteration is
xn+1 = xn − f(xn) (xn − xn−1)/(f(xn) − f(xn−1))    (5.9)
Note this is exactly the Newton iteration with the true derivative replaced by the
approximation
f′(xn) ≈ (f(xn) − f(xn−1))/(xn − xn−1),
which we know from the previous chapter on numerical differentiation may not be
a good approximation! In this context, however, this simple approximation does the
trick.
The secant method appears to have fast convergence but we can quantify the rate
more precisely. There are convergence theorems for this method similar to those for
Newton’s method but they are beyond the scope of this book. The main conclusion is
that when the secant method converges, it does so at a superlinear rate. Specifically,
if en is the error xn − s, then the sequence of errors satisfies
en+1 ≈ cenα
where α = (1 + √5)/2 ≈ 1.6. This can be interpreted as saying that the number
of correct decimal places increases by about 60% with each iteration, compared to
Newton’s method in which the number of decimal places nearly doubles.
Example 10 Apply the secant method to the square root problem x² − c = 0 of
Example 6.

The secant iteration (5.9) gives
xn+1 = xn − (xn² − c)(xn − xn−1)/(xn² − xn−1²) = xn − (xn² − c)/(xn + xn−1)
     = (xn xn−1 + c)/(xn + xn−1)
For comparison, Newton's iteration can be written
xn+1 = (xn² + c)/(2xn) = (xn xn + c)/(xn + xn)
which is similar, especially when the iterates are close to the solution.
For c = 3, with the initial guesses x0 = 1/2, x1 = 3/4, the next few secant iter-
ations are:
2.700000000000000
1.456521739130435
1.667887029288703
1.737709153632850
1.731944200430867
1.732050633712870
1.732050807574228
1.732050807568877.
Six iterations are accurate to 10 decimal places. So we still see fast convergence
without having or using any derivative information.
Example 11 Consider the coffee model from Newton's law of cooling,
T(t) = TE + (T0 − TE) e^(−kt).
Suppose you do not know the cooling rate k for your favorite mug, but you have
temperature data for your coffee over time. Formulate a mathematical approach to
estimating the model parameter k as a nonlinear least squares problem and use the
secant method to approximate k. Use TE = 24.5 and T0 = 93.5, and the temperature
data given in the table below.
The idea is to try to approximate a value of k so that the coffee cooling model is
“close” to the experimental data. This can be posed as
min_k f(k) = (1/2) Σ_{i=1}^{20} (Ti(k) − T̄i)²,    (5.10)
Times (min)   0      1      2      3      4      5      6      7
T (°C)        93.50  76.9   65.9   54.6   45.3   40.3   37.9   32.95
Times (min)   8      9      10     11     12     13     14     15
T (°C)        30.99  29.14  28.14  27.95  26.54  24.90  25.38  25.22
Times (min)   16     17     18     19     20
T (°C)        25.07  24.89  24.81  24.76  24.63
where here T̄i are the temperature measurements from above and T is a vector of
model temperature values at the same times the experimental data was taken as a
function of k. If we want to find the k value that minimizes f (k) then we need to set
f′(k) = 0, where this is the derivative with respect to k. So in the case of optimization,
the nonlinear equation to be solved is actually in terms of the derivative of f, which
is typically referred to as the objective function. Differentiating with respect to k
gives the following nonlinear problem:
f′(k) = Σ_{i=1}^{20} (Ti(k) − T̄i) (−ti (T0 − TE) e^(−k ti)),    (5.11)
where here we have applied the chain rule in expression Eq. (5.10) and the derivative
of the coffee model expression with respect to k. Since this is a 1D problem, insight
can easily be gained by looking at plots over a range of k values. This can help
identify an initial iterate for example. Figure 5.8 shows the graph of the objective
function and Fig. 5.9 shows the plot of Eq. (5.11). The problem appears to have a
solution somewhere between 0.25 and 0.3.
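For reference, Eq. (5.11) is easy to code once the data are stored in arrays. This is a sketch using the measurements from the table above and the values TE = 24.5 and T0 = 93.5 given in Example 11; the resulting function could then be handed to the secant method (or any other root finder).

import numpy as np

TE, T0 = 24.5, 93.5
times = np.arange(0, 21)             # 0, 1, ..., 20 minutes
Tbar = np.array([93.50, 76.9, 65.9, 54.6, 45.3, 40.3, 37.9, 32.95,
                 30.99, 29.14, 28.14, 27.95, 26.54, 24.90, 25.38, 25.22,
                 25.07, 24.89, 24.81, 24.76, 24.63])

def dfdk(k):
    """Derivative (5.11) of the least squares objective with respect to k.
    (The t = 0 term contributes nothing to the sum.)"""
    T = TE + (T0 - TE) * np.exp(-k * times)
    return np.sum((T - Tbar) * (-times * (T0 - TE) * np.exp(-k * times)))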
Fig. 5.8 Plot of the objective function over a range of k values to help see where the minimum
occurs
Fig. 5.9 Plot of the derivative for a range of k values. Where this graph is zero is the corresponding
k value for the minimum of the objective function
Fig. 5.10 Plot of the model output using the optimal value of k along with the experimental
temperature values
The secant method requires two initial iterates. We will use 0.25 and 0.255 and a
stopping criterion of 0.001. The secant algorithm gives the iterates 0.2780,
0.2835, 0.2850, and 0.2851, with a final objective function value of 199.1363 and
a derivative value of 0.0529. Figure 5.10 shows how the model using the optimal
value of k agrees with the experimental temperature values.
Writing computer code for the secant method is left as an exercise, but should be
straightforward using the Newton’s method script to start from.
Exercises
1. The equation
3x 3 − 5x 2 − 4x + 4 = 0
has a solution near x = 0.7. (See Exercise 1 in Sect. 5.2.) Carry out the first four
iterations of the secant method to obtain this solution. Compare the results with
those of Newton’s method.
2. Write a program to solve an equation using the secant method. Test it by checking
your answers to the previous exercises.
3. Use the secant method to obtain the solution of the equation e x − 5x + 2 = 0
using the tolerance 10−12 and compare the results and convergence behavior to
the previous examples.
4. Use the secant method to obtain the solution of the equation e x − 3x − 1 = 0 in
[1, 3] using the tolerance 10−12 .
5. Intervals containing the three solutions of e x − 100x 2 = 0 were found in Exer-
cise 3 of Sect. 5.2. Use the secant method to locate the solutions with tolerance
10−10 . Compare the numbers of iterations used with those needed by Newton’s
method.
6. Show that the secant method for finding reciprocals by solving 1/x − c = 0
results in the division-free iteration
xn+1 = xn + xn−1 − c xn xn−1.
Carry out the first three iterations for finding 1/7 using x0 = 0.1, x1 = 0.2.
7. For this exercise, we will use some ideas from calculus to model the curve of a
hanging cable such as telephone wire and find its length. A cable hanging under
its own weight assumes the shape of a catenary. In the special case that the ground
is level and two poles are the same height, the lowest point on the curve will be
at the midpoint of the distance between the two poles. Denote that point as x = 0
on the ground. The curve representing the shape of the cable for some parameters
h 0 and λ is given by
y = h0 + λ cosh(x/λ).    (5.12)
The length of the cable will depend on the shape of the catenary and the amount of
sag in the middle. The amount of sag desired would be determined by engineers
to allow for weather conditions in the area. The greater the sag, the less likely the
cable will be blown down in a storm, for example.
Suppose the following:
• The two poles are 200 ft apart
• The cable sags 15 ft in the middle
• The height at which the cable is attached to the pole is 50 ft.
Find h 0 , λ using the iterative methods presented in this chapter. To do this you
need to determine an equation in one unknown (it will be nonlinear) and find
the solution via the bisection method, Newton’s method, and the secant method.
Report on the iteration histories for each method. Finally, determine the length
of the cable.
HINTS:
• The height at the midpoint is y(0) = h0 + λ and the height at an endpoint
is y(100) = h0 + λ cosh(100/λ). How can you use this information and the sag
condition to find one nonlinear equation in one unknown, f (λ)?
• Once you know λ how can you find h 0 ?
• The arc length of the resulting curve can be found using standard calculus
techniques:
Length = 2 ∫₀^100 √(1 + sinh²(x/λ)) dx
Use elementary identities and properties of hyperbolic functions. (That is, look in
your calculus text.)
• You should get λ = 335.7 and Length = 202.97 ft.
5.6 Newton's Method in Higher Dimensions

The focus of this section is solving systems of nonlinear equations of the form
f (x) = 0 (5.13)
where f is a vector function of the vector variable x. Later, we provide details on the
specific case of two equations in two unknowns but we begin with the more general
case of n equations in n unknowns.
Looking more closely at Eq. (5.13), we have
       ⎡ f1(x1, x2, . . . , xn) ⎤   ⎡ 0 ⎤
       ⎢ f2(x1, x2, . . . , xn) ⎥   ⎢ 0 ⎥
f(x) = ⎢          ...           ⎥ = ⎢ ... ⎥ = 0
       ⎣ fn(x1, x2, . . . , xn) ⎦   ⎣ 0 ⎦
So the problem is now to find the n-dimensional vector x = [x1 , x2 , . . . , xn ]T that
simultaneously satisfies the equations f i (x) = 0 for i = 1, 2, . . . n. Deep insight into
techniques requires some background in multivariable Calculus and Linear Algebra,
but some of the key ideas are presented here. The thing to keep in mind is that when
it comes to vector valued functions, some of the tools we used in solving f (x) = 0
for a scalar equation will no longer apply. For example, consider the Newton iterate
defined by
x+ = xc − f(xc)/f′(xc),
where here the subscripts + and c refer to the next and current Newton iterate so
as to not cause confusion with the subscripts in xi (i = 1, 2, . . . , n) which are the
components of x.
What goes wrong when f is a function from R^n into R^n? We need to be careful
about what we mean by a derivative first. In higher dimensions like this, the Taylor
expansion of f (x) actually leads to using a Jacobian Matrix, J defined in terms of
the partial derivatives of the f i ,
∂ fi
Ji j = .
∂x j
So if the derivative is now a matrix, then dividing by f
(xc ) isn’t actually defined
in that sense. Even more food for thought–what would go wrong with the secant
approach?
However, Newton’s method for such a system is still based on a first order Taylor
expansion:
f (x + h) ≈ f (x) + J h (5.14)
where J is the Jacobian matrix of f evaluated at x. As in the one-dimensional case,
the Newton iteration is derived from setting the right hand side of (5.14) to 0. This
leads to the iterative formula
x+ = xc − J⁻¹ f(xc).    (5.15)
In practice, the inverse of the Jacobian is not usually formed, rather Eq. (5.15) is
re-arranged so that one computes the Newton step by solving the linear system
Jh = − f (xc )
x+ = xc + h
Thus, the implementation of the iteration (5.15) requires the ability to solve general
linear systems of equations, which you are already now familiar with. Also, fast
quadratic convergence can still be maintained under analogous conditions.
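A compact sketch of this procedure in Python (the function name and stopping rule are our own choices; the Newton step is obtained from numpy.linalg.solve rather than by forming the inverse of the Jacobian):

import numpy as np

def newton_system(fcn, jac, x0, tol=1e-10, maxits=50):
    """Newton's method for a system f(x) = 0.
    `fcn` returns the vector f(x); `jac` returns the Jacobian matrix at x."""
    x = np.array(x0, dtype=float)
    for _ in range(maxits):
        h = np.linalg.solve(jac(x), -fcn(x))   # solve J h = -f for the Newton step
        x = x + h
        if np.linalg.norm(h) < tol:
            break
    return x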
Solving systems of nonlinear equations arises in the context of applied optimiza-
tion problems. Across science and engineering disciplines, optimization problems
abound. Decision makers want to minimize cost, minimize risk, maximize profits,
or maximize customer satisfaction. Other times, within a design process–optimal
parameters may be sought to fit a model to experimental data or to make a prod-
uct give a desired output. In these instances, a mathematical representation of the
stakeholder’s ‘goal’ is called an objective function and such a problem can be posed
as
min_{x ∈ Ω} f(x)    (5.16)
but here f maps a vector of n decision variables to a real number. In Eq. (5.16), Ω
defines a set of constraints on x that may be simple requirements like xi > 0 for
each i or they may be more complicated expressions that are themselves linear or
nonlinear equations or inequalities. Some of those ideas may be familiar to you if
you recall using Lagrange multipliers in a multivariable Calculus course. We will
not go into much detail about that here, but to see how this is related to nonlinear
equations, recall that a critical point in multivariable calculus means that all the
partial derivatives of f are zero, i.e.
∇f(x) = 0,   where the components of the gradient are   (∇f)i = ∂f/∂xi.
More care must be paid to whether or not the method is converging to a maximum
or a minimum and most optimization algorithms have built-in mechanisms to ensure
they are searching in an appropriate direction (i.e. the function values are decreasing
as the optimization progresses).
Newton’s method for optimization problems then requires solving the system of
nonlinear equations ∇f(x) = 0, leading to
x+ = xc − H⁻¹ ∇f(xc),    (5.17)
where
Hij = ∂²f/∂xi ∂xj
is the Hessian matrix of second partial derivatives. Also note that this matrix is
naturally symmetric since mixed partial derivatives are equal.
Revisiting the coffee example from the previous section, there are certainly more
accurate approaches than Newton’s law of cooling. In reality, coffee cools quickly
at the beginning (by a steaming mechanism) and after a while it cools mainly by
conduction. To study the conduction mechanism, we could discard the hottest data
when the steaming is most active, leaving say m data points. This means though,
that we would need to estimate an initial temperature to use in the conduction-only
approach. Now, the reformulated objective function also has the initial temperature
as decision variables. So, we simultaneously seek k and T0 for the following problem:
min_{k, T0} f(k, T0) = (1/2) Σ_{i=1}^{m} (Ti(k, T0) − T̂i)².    (5.18)
Now the system of nonlinear equations comes from setting the gradient to zero:
∇f(k, T0) = ⎡ ∂f/∂k  ⎤ = ⎡ Σ_{i=1}^{m} (Ti(k, T0) − T̂i) dTi/dk  ⎤
            ⎣ ∂f/∂T0 ⎦   ⎣ Σ_{i=1}^{m} (Ti(k, T0) − T̂i) dTi/dT0 ⎦
and since T(t) = TE + (T0 − TE) e^(−kt), it follows that dTi/dk = −ti (T0 − TE) e^(−k ti)
and dTi/dT0 = e^(−k ti).
In the next section we provide the specific details about how to implement New-
ton’s method for problems with two unknowns. Since the Jacobian matrix is only
2 × 2, it is straightforward to write down its inverse explicitly. We should stress that the method is
generalizable to any size problem of course, using many of the ideas presented in
Chap. 4 on linear equations.
We demonstrate the above ideas for n = 2, that is two equations in two unknowns.
In this more manageable scenario, it is straightforward to get a formula to directly
update the Newton iterate without having to solve a linear system at each iteration.
For two unknowns, the problem can be expressed as
f 1 (x, y) = 0 (5.19)
f 2 (x, y) = 0
Writing the current iterate as (xc, yc) and expanding each function to first order, the
Newton step (h, k) is found by setting the two linear approximations
f1 + h f1x + k f1y   and   f2 + h f2x + k f2y
(simultaneously) to zero, where the functions and partial derivatives are evaluated at
(xc, yc). This leads to the solutions
h = (−f1 f2y + f2 f1y) / (f1x f2y − f1y f2x)    (5.20)
k = ( f1 f2x − f2 f1x) / (f1x f2y − f1y f2x)
and the next iterate is (xc + h, yc + k).
Example 12 Find the coordinates of the intersections in the first quadrant of the
ellipse 4x 2 + y 2 = 4 and the curve x 2 y 3 = 1 illustrated below.
One possible technique for solving this pair of equations would be to eliminate
x 2 between the two equations to obtain a fifth degree polynomial in y. This could
in turn be solved using Newton’s method for a single equation. Our purpose here
is to illustrate the use of Newton’s method for a pair of equations. The use of this
substitution is left as an exercise to serve as a check on this process.
One of the solutions is close to (0.4, 1.8) and we shall use this as a starting point for
the iteration. The partial derivatives of the two functions f1(x, y) = 4x² + y² − 4 and f2(x, y) = x²y³ − 1 are

f1x = 8x,   f1y = 2y
f2x = 2xy³,   f2y = 3x²y²

and at the starting point (0.4, 1.8) these take the values

f1x = 3.2,   f1y = 3.6
f2x = 4.6656,   f2y = 1.5552

Hence f1x f2y − f1y f2x = (3.2)(1.5552) − (3.6)(4.6656) = −11.81952. Now
applying (5.20) we get h ≈ 0.0046 and k ≈ 0.0293, so that the next iterate is approximately (0.4046, 1.8293). A very small number of iterations provides this solution to very high accuracy, and starting near the other intersection yields the second solution just as quickly.
Newton’s method is fairly easy to implement for the case of two equations in
two unknowns. We first need to define functions for the equations and the partial
derivatives. For the equations in Example 12, these can be given by:
import numpy as np

def eq2(v):
    """
    The function's input `v` and the output are
    both 2-entry vectors.
    """
    x, y = v
    f = np.empty(2)
    f[0] = 4 * x**2 + y**2 - 4
    f[1] = x**2 * y**3 - 1
    return f
and
def Deq2(v):
    """
    Compute the Jacobian for the function `eq2`.
    """
    x, y = v
    J = np.empty((2, 2))
    J[0, 0] = 8 * x
    J[0, 1] = 2 * y
    J[1, 0] = 2 * x * y**3
    J[1, 1] = 3 * x**2 * y**2
    return J
The newton2 function will need both the function and its partial derivatives as
well as a starting vector and a tolerance. The following code can be used.
def newton2(fcn, Jac, g, tol):
    """
    Solve two equations given by fcn(x) = 0 to
    accuracy `tol` < 1 using Newton's method. `g`
    is an initial guess. `fcn` and `Jac` must be
    functions where the latter is the partial
    derivatives of the former.
    """
    old = np.zeros_like(g)
    old[0] = g[0] + 1
    while max(abs(g - old)) > tol:
        old = g
        f = fcn(old)
        f1 = f[0]
        f2 = f[1]
        J = Jac(old)
        f1x = J[0, 0]
        f1y = J[0, 1]
        f2x = J[1, 0]
        f2y = J[1, 1]
        D = f1x * f2y - f1y * f2x
        h = (f2 * f1y - f1 * f2y) / D
        k = (f1 * f2x - f2 * f1x) / D
        g = old + np.array((h, k))
    return g
Then the following command can be used to generate the second of the solutions
in Example 12:
>>> s = newton2(eq2, Deq2, [.8, 1.2], 1e-8)
>>> print(s)
[0.82168162 1.13989352]
Exercises
Distance from A 0 220 440 660 880 1100 1320 1540 1760
Height above A 0 20 49 83 115 140 150 135 112
Distance from A 1980 2200 2420 2640 2860 3080 3300 3520 3740
Height above A 88 63 39 16 −3 −18 −27 −30 −28
Distance from A 3960 4180 4400 4620 4840 5060 5280
Height above A −25 −21 −15 −6 −7 26 50
HINTS:
• First, consider one arc of the cable with endpoints of different heights. Suppose
the two endpoints are x = 0 and x = L . The cable will still hang under the shape
of a catenary and the equation for the shape of the curve is given by
y = h + λ cosh((x + c)/λ)     (5.22)
where we need to determine the appropriate h, λ, and c. We can eliminate h as
an unknown to obtain our system of two equations in two unknowns, λ and c.
• The heights at the endpoints are y(0) = h + λ cosh(c/λ) and y(L) = h + λ cosh((L + c)/λ). The difference in these heights is the same as the difference between the ground levels at the poles (which is a known value)! So one equation is
λ cosh((L + c)/λ) − λ cosh(c/λ) = y(L) − y(0).     (5.23)
The second equation can be determined from the sag condition. We can now use
Newton’e method for two equations and two unknowns to find λ and c.
• To get your code running, try this for the simple case when the ground is level for
L = 220. In this case y(0) − y(L) = 0 and symmetry yields c = −110. Using
Newton’s method for scalar equations would give λ = 606.66 so a good initial
guess for your 2-D code (to see that it is working) would be c = −100, λ = 600.
Try as an initial guess for the uneven terrain c = 2(y(L) − y(0)) − 110 and
λ = 600.
5.6 Newton’s Method in Higher Dimensions 183
•You need use Newton’s method on each interval to find the corresponding c, λ
values of each catenary. You can then compute the arc length on each interval and
sum them for the total length of the cable.
• You need to determine h. This must be done relative to some h 0 so that the
cable is continuous and h 0 must be determined so that the cable is always kept
25 ft above the ground.
6. When on a trip to the carnival with your friends, your best friend gets on the Ferris
wheel, accidentally forgetting their candy apple. Being the great friend that you
are, you decide to throw it to them. You’ve only got one apple and one chance
to make the perfect shot to deliver your friend their delicious treat. Luckily, you
have taken Calculus, and know that you can calculate the perfect moment to make
your throw by using the parametric representation of the apple trajectory and the
equations of motion of the Ferris wheel. For the apple the trajectory is given by
x = f (t) = 22 − 8.03(t − t0 ),
y = g(t) = 1 + 11.7(t − t0 ) − 4.9(t − t0 )2 ,
Here t0 is the time you launch the apple. As you start calculating, you realize that
you have the worst throwing arm of anyone you know. It may take you several
iterations to get this right! Here is your methodology!
(a) Given the apple trajectory, find the speed and angle in relation to t at which
the apple is thrown when t = t0 .
(b) Now that you know the angle and speed of the candy apple, you need to
find out where your friend is. So, using the Ferris wheel equation, find the
location of your friend at t = 0 (Hint: They should be at the bottom of the
Ferris wheel).
(c) Find the number of rotations per minute of the Ferris wheel.
(d) Set up a mathematical formulation for this scenario that represents what you
are trying to do (throw the apple and have your friend catch it). You will
have two unknowns: the initial time to throw the candy apple, and the time
when your friend will receive the apple.
(e) This is a system of two nonlinear equations with two unknowns. Use New-
ton’s method to solve this problem. It may help to play around and try a few
different values of the initial launch time and apple-catch time to understand
how close and far you are.
7. For this problem, you will consider the optimal shape and construction of a trash
dumpster.
(a) First, locate a dumpster. Carefully study the dimensions and describe all
aspects of the dumpster. Second, determine the volume of the dumpster.
Include a sketch of the dumpster. Digital pictures can accompany, but a
sketch with labeled components is required. Your job is to find the dimen-
sions of a container with the same volume that minimizes construction costs.
For example, something like in Fig. 5.13
(b) While maintaining the same general shape and method of construction,
determine the dimensions such a container with the same volume should
have in order to minimize the cost of construction. Review from Calculus
how you find the minimum of a function of several variables.
Use the following assumptions in your analysis:
• Again, the volume of the dumpster must be the same.
• The angle of the top of the dumpster must be preserved. Relate the height
of the back panel to the front panel, then use the volume constraint to sim-
plify this further
• The sides, back, and front are to be made of 12-gauge (0.1046 inch thick)
steel sheets, which cost $0.87 per square foot (including cuts or bends)
• The base is made from a 10-gauge (0.1345 inch thick) steel sheet which
costs $1.79 per square foot
• Lids cost approximately $85.00 each regardless of dimensions
• Welding costs approximately $0.37 per foot of material with labor included
(c) Describe how any of your assumptions or simplifications may affect your design. For example, are there any fixed costs you did not include in the design? Would they affect the validity of your results? If you were hired as a consultant on this investigation, what would your conclusion be? Would you recommend altering the design of the dumpster? If so, describe the savings that would result. It is important to be as specific as possible when explaining your procedures, mathematical model, experiments, measurements, and computations.
8. Consider solving the refined coffee model problem in which the period of “steam-
ing” is ignored. To do this, disregard all the data that is hotter than 65 ◦ C from
data given in Example 11.
It is also important to incorporate stopping criteria that represent failure, since we cannot
necessarily know in advance whether a chosen iteration and starting point will lead
to convergence at all. Leaving your computer in a futile infinite loop of meaningless
calculations is something to be avoided!
Knowing information about your problem can also help guide choices in terms
of which algorithm to choose in general. In the absence of derivatives, the secant
method still maintains fast convergence. We should note that using a finite differ-
ence approximate derivative (for example, something from Chap. 3) within Newton’s
method is also a possibility. Knowing the rate of convergence of a method also gives a way of testing your code: solve a problem whose answer you know, examine the successive errors, and check that their exponents shrink at the expected rate.
Finally, we presented the setting for systems of nonlinear equations. Both the
model for a hanging cable exercise and the coffee cooling model parameter estimation
example have been extended from one unknown to two. The details for implementing
Newton’s method for two equations and two unknowns are provided (in addition to
the scalar algorithms) so that you have the solvers all ready to go and tackle the
applied problems in this chapter.
Several issues pertaining to solution of single equations with one unknown have
been discussed. The introduction to Newton’s method in two variables gives just
a glimpse of a major field. Systems of several (or even many) equations in many
unknowns present a major challenge–and one that arises frequently.
Some numerical methods themselves are built from a basic structure but with
weights or coefficients that are not known, so-called undetermined coefficients. Fre-
quently these must be found by solving systems of equations–sometimes linear,
often nonlinear. Mathematical models for physical or other situations often depend
on model parameters which need to be estimated from data. Again this typically
results in systems of equations. We saw a little of that in the previous chapter with
linear least squares approximation leading to systems of linear equations. Often the
appropriate model does not depend linearly on the parameters and so the problem
becomes one of systems of nonlinear equations.
Many real world problems are posed in a way that seeks a “best” answer by some
definition of best. Typically this results in a mathematical formulation that requires
optimization. Optimization problems can be of (essentially) infinite dimension where
the objective is to find the function that (say) minimizes energy consumption. Many
times the resulting optimization problem becomes one of minimizing a function of
many variables, perhaps subject to some constraints on the variables. In Calculus
you saw that such problems can be recast as nonlinear equations by setting all partial
derivatives to zero. Optimization is much bigger than this, but what we see is that
solution of nonlinear equations is a far-reaching topic with many aspects that have
not been explored here.
There are still many opportunities for you to contribute in the future!
5.8 Python Functions for Equation Solving
For many of the tasks we shall discuss, including the solution of equations, Python has functions using efficient algorithms in the third-party packages SciPy and NumPy. The optimize module in the scipy package provides several functions for equation solving.
fsolve In the case of solving nonlinear equations of a single variable, the basic
Python (SciPy) function is fsolve.
This function takes two required arguments, the function f and the starting point
x0 . It then finds the solution of f (x) = 0 nearest to x0 . Optional arguments are
possible extra arguments to the function f , the partial derivatives of f , the tolerance,
as well as several additional parameters to configure fsolve if desired.
The fsolve function uses the Powell hybrid method (not discussed in this book).
(Newton’s method cannot be used in this general context because of its need for the
derivative.)
The command
>>> s = fsolve(eq1, 1, xtol=1e-8)
>>> print(s)
gives the output
[1.25643121]
fsolve can also solve multi-variate equations, i.e. where the function takes a vector
as input. Another alternative to fsolve in scipy.optimize is the function root which
lets you choose among a number of different solution methods.
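As an illustration (a sketch, not taken from the text), both fsolve and root can be applied directly to the two-equation system of Example 12, redefined compactly here:

import numpy as np
from scipy.optimize import fsolve, root

def eq2(v):                                # the two equations of Example 12
    x, y = v
    return np.array([4 * x**2 + y**2 - 4, x**2 * y**3 - 1])

s = fsolve(eq2, [0.8, 1.2], xtol=1e-8)
print(s)                                   # roughly [0.8217, 1.1399], as with newton2
sol = root(eq2, [0.4, 1.8], method='hybr')
print(sol.x)                               # the other first-quadrant intersection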
For more details on these and other SciPy functions, consult the SciPy doc-
umentation or type help(scipy.optimize) at the Python prompt after importing
scipy.optimize.
The numpy package also provides useful functions for equation solving.
roots For the special case of finding roots of polynomial equations, we can use
roots.
This function will find the roots of the polynomial equation p (x) = 0 where p
is represented in Python by a NumPy array of its coefficients, beginning with the
highest degree term. The polynomial x 2 + 2x − 3 is therefore represented by the
array [1, 2, −3]. Any zero coefficients must be included.
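For example, the quadratic x² + 2x − 3 = (x + 3)(x − 1) has roots −3 and 1:

import numpy as np
print(np.roots([1, 2, -3]))    # approximately [-3.  1.] (the order may vary)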
F(x) = Σ_{k=1}^{n} | fk(x) |²
The details of the algorithms used are topics for more advanced courses.
6 Interpolation
6.1 Introduction
What is interpolation and why do we need it? The short answer to “What is inter-
polation?” is that we seek to find a function of a particular form that goes through
some specified data points. Our focus will be entirely on single variable interpola-
tion. Much of what we will be doing is based on polynomial interpolation in which
we seek a polynomial of specified degree that agrees with given data.
So, why do we need interpolation? Many of the standard elementary and special
functions of mathematics cannot be evaluated exactly–even if we assume a computer
with infinite arithmetic precision. Even where we have good algorithms to approxi-
mate these functions they may be very expensive to compute. Thus we seek simpler
functions that agree with the data and which can be used as surrogates for the under-
lying function. Of course in most practical situations we don’t even know what that
underlying function is!
Interpolation is a powerful tool used in industry, often used to replace complex,
computationally expensive functions with friendlier, cheaper ones. In engineering
design, interpolation is often used to create functions (usually low degree polyno-
mials) so that experimental data can be incorporated into a continuous model. This
idea is similar to least-squares approximations (as in Chap. 4 or with regards to the
coffee model) but instead of looking for a polynomial that might “fit” data closely,
the interpolating polynomial is required to equal, or agree with, the data at certain
locations. When data is subject to error (experimental error, for example) it is com-
mon to use approximation methods which force the approximating function to pass
close to all the data points without necessarily going through them.
One special interpolation technique, called spline interpolation, is the basis of
much computer aided design. In particular, Bezier splines were developed originally
for the French automobile industry in the early 1960s for this specific purpose. Bezier
splines are a generalization of the cubic splines to general planar curves. They are
not necessarily interpolation splines but make use of what are called “control points”
which are used to “pull” a curve in a particular direction. Many computer graphics
programs use similar techniques for generating smooth curves through particular
points.
Spline interpolation was an important tool in creating the early digitally animated
movies such as Toy Story. The ability to have visually smooth curves that do not
deteriorate under scaling for example is critical to creating good images. Another
image processing application of the basic idea is the removal of unwanted pieces in
a photograph. We remove the telephone wire, for example, and need to fill the void
with appropriate shape and color information. One approach to this is interpolation
to ensure a smooth transition across such an artificial boundary. Digital photogra-
phy relies on interpolation to complete the image from the basic red, green or blue
receptors.
First we provide some background and justification for choosing polynomials as
the basic tools for approximating functions. Note that any continuous function can
be approximated to any required accuracy by a polynomial (we provide the theorem
below) and, surprisingly, polynomials are the only functions which can, theoretically
at least, be evaluated exactly.
The first of these reasons is based on a famous theorem of Weierstrass: if f is continuous on [a, b], then for any ε > 0 there is a polynomial p such that |f(x) − p(x)| < ε for all x ∈ [a, b]. Therefore there exists a sequence of polynomials such that || f − pn ||∞ → 0 as n → ∞.
The second reason for choosing polynomials is less obvious. To gain some insight
we briefly describe Horner’s rule for efficient evaluation of a polynomial.
Suppose we’d like to evaluate
"""
6.1 Introduction 191
n = len ( a )
p = a [ −1]
f o r k i n range ( 2 , n + 1 ) :
p = p ∗ x + a[−k ]
return p
Note that indexing arrays with a negative index as above (a[–k]) in Python indexes
the array from the back end. That is, the last entry in an array can be accessed as
a[–1], the next-to-last as a[–2] etc.
For example the polynomial −x 2 + 2x + 4 could be evaluated at x = 3 by
>>> horner([4, 2, -1], 3)
which returns the expected value 1. NumPy’s function polyval is essentially similar
– except that it has the coefficient vector in the reverse order.
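For instance, the same evaluation reads as follows with polyval, where the coefficients are listed from the highest degree down:

import numpy as np
print(np.polyval([-1, 2, 4], 3))    # 1, matching horner([4, 2, -1], 3)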
Considering this as a piece of hand calculation, we see a significant saving of
effort. In the code above each of the n − 1 steps entails a multiplication and an
addition, so that for a degree n − 1 polynomial, a total of n − 1 additions and n − 1
multiplications are needed while direct computation requires more. Evaluation of
ak x^k requires k multiplications, and therefore the complete operation would need

(n − 1) + (n − 2) + · · · + 2 + 1 = n(n − 1)/2
multiplications and n − 1 additions. Horner’s rule is also numerically more stable,
though this is not obvious without resorting to highly pathological examples.
Consider, for example, evaluating p(x) = x⁵ + 2x⁴ + 3x³ + 4x² + 5x + 6 for x = 0.4321.
We get
p (x) = ({[(x + 2) x + 3] x + 4} x + 5) x + 6
= (((((0.4321) + 2) (0.4321) + 3) (0.4321) + 4) (0.4321) + 5) (0.4321) + 6
= 9.234159
Exercises
1. Use Horner’s rule to evaluate 10 k=0 kx
10−k for x = 0.7.
2. Repeat the previous exercise using the function horner and using NumPy’s poly-
val command.
3. Modify the horner function to accept NumPy array inputs. Use it to evaluate the polynomial in Exercise 1 for x from −1 to 2 in steps of 0.1. Use the resulting data to plot a graph of this
polynomial.
6.2 Lagrange Interpolation

Polynomial interpolation is the basic notion that we build a polynomial which agrees
with the data from the function f of interest (or simply data). The Lagrange inter-
polation polynomial has the property that it takes the same values as f at a finite set
of distinct points.
Actually, this idea isn’t entirely new to you! The first N + 1 terms of the Tay-
lor expansion of f about a point x0 form a polynomial, the Taylor polynomial, of
degree N ,
p(x) = f(x0) + (x − x0) f′(x0) + · · · + ((x − x0)^N / N!) f^(N)(x0)

which satisfies the interpolation conditions p^(k)(x0) = f^(k)(x0) for k = 0, 1, . . . , N. Taylor's theorem also provides the error in using p(x) to approximate f(x):

f(x) − p(x) = ((x − x0)^{N+1} / (N + 1)!) f^{(N+1)}(ξ)

for some ξ lying between x and x0.
for some ξ lying between x and x0 .
Note that in many data-fitting applications, we are only given function values and
the corresponding derivative values are not available; secondly, this approach will
usually provide good approximations only near the base point x0 . So using Taylor
polynomials exclusively just isn’t enough.
The Taylor polynomial does however illustrate the general polynomial interpola-
tion approach: first find a formula for the polynomial (of minimum degree) which
satisfies the required interpolation conditions, and then find an expression for the
error of the resulting approximation. In reality of course, that error can’t be com-
puted explicitly since, if it were, then we could evaluate the function exactly in the
first place.
In general, if m < N, then the linear system will have no solution. It will have infinitely many solutions if m > N but, provided the matrix A with elements aij = x_{i−1}^{j−1} is nonsingular (which it is), a unique solution exists if m = N. We thus expect
our interpolation polynomial to have degree N (or less if it turns out that a N = 0).
Although Eq. (6.3) provides us with a theoretical way to find coefficients, a more
practical method to finding the polynomial p satisfying (6.1) is needed.
To see how this is achieved, we consider a more specific goal: find polynomials
l j ( j = 0, 1, . . . , N ) of degree at most N such that
l_j(xk) = δjk, where δjk = 1 if j = k and δjk = 0 if j ≠ k.     (6.4)

Then the polynomial

p(x) = Σ_{j=0}^{N} f(xj) l_j(x)     (6.5)
will have degree at most N and will satisfy the interpolation conditions (6.1).
Before obtaining these polynomials explicitly, you can see that p defined by (6.5)
satisfies the desired interpolation conditions:

p(xk) = Σ_{j=0}^{N} f(xj) l_j(xk) = Σ_{j=0}^{N} f(xj) δjk = f(xk)
by (6.4).
Next, the requirement that l_j(xk) = 0 whenever j ≠ k means that l_j(x) must have factors (x − xk) for each such k. There are N such factors so that

l_j(x) = c ∏_{k≠j} (x − xk) = c (x − x0) · · · (x − x_{j−1})(x − x_{j+1}) · · · (x − xN)

has degree N and satisfies l_j(xk) = 0 for k ≠ j. It remains to choose the constant c so that l_j(xj) = 1. This requirement implies that

c = 1 / ∏_{k≠j} (xj − xk)

so that

l_j(x) = ∏_{k≠j} (x − xk)/(xj − xk) = [(x − x0) · · · (x − x_{j−1})(x − x_{j+1}) · · · (x − xN)] / [(xj − x0) · · · (xj − x_{j−1})(xj − x_{j+1}) · · · (xj − xN)]     (6.6)
These polynomials are called the Lagrange basis polynomials and the polynomial p
given by (6.5) is the Lagrange interpolation polynomial.
The existence of this polynomial has thus been established by finding it,
although the linear equations (6.3) indicated this as well. These same considerations
also show the uniqueness of the interpolation polynomial of degree at most N satis-
fying (6.1), which relies on the fact that the matrix A mentioned above is nonsingular.
The proof of this, by showing that the Vandermonde determinant

V = det
| 1  x0  x0²  · · ·  x0^N |
| 1  x1  x1²  · · ·  x1^N |
| · · ·                        |
| 1  xN  xN²  · · ·  xN^N |

is nonzero whenever the nodes are distinct, is not included here.
Example 2 Find the Lagrange interpolation polynomial for the following data and
use it to estimate f (1.4) .
x 1 2 3
f (x) 1.0000 1.4142 1.7321
p(x) = f(x0) (x − x1)(x − x2)/((x0 − x1)(x0 − x2)) + f(x1) (x − x0)(x − x2)/((x1 − x0)(x1 − x2)) + f(x2) (x − x0)(x − x1)/((x2 − x0)(x2 − x1))

= 1.0000 (x − 2)(x − 3)/((1 − 2)(1 − 3)) + 1.4142 (x − 1)(x − 3)/((2 − 1)(2 − 3)) + 1.7321 (x − 1)(x − 2)/((3 − 1)(3 − 2))

which gives f(1.4) ≈ 1.1772 to 4 decimal places. Note that for this data, f(x) = √x and so the true value is √1.4 = 1.1832, giving an absolute error of 6.0160 × 10−3.
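A direct evaluation of this Lagrange form is easily coded; the following short sketch (not from the text) reproduces the estimate of Example 2:

import numpy as np

xk = np.array([1.0, 2.0, 3.0])
fk = np.array([1.0000, 1.4142, 1.7321])

def lagrange_eval(x):
    p = 0.0
    for j in range(len(xk)):
        lj = 1.0
        for k in range(len(xk)):
            if k != j:
                lj *= (x - xk[k]) / (xk[j] - xk[k])   # basis polynomial l_j(x)
        p += fk[j] * lj
    return p

print(lagrange_eval(1.4))    # about 1.1772, compared with sqrt(1.4) = 1.1832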
With any approximation, an expression for the error guides practical implementa-
tion. The proof of the error formula is accomplished with the repeated use of Rolle’s
theorem which states that between any two zeros of a differentiable function there
must be a point at which the derivative is zero.
f(x) − p(x) = ((x − x0)(x − x1) · · · (x − xN) / (N + 1)!) f^{(N+1)}(ξ)     (6.7)

To establish this, write

L(x) = ∏_{k=0}^{N} (x − xk) = (x − x0)(x − x1) · · · (x − xN).
Now, for x ∈ [a, b] but not one of the nodes, consider the function E defined by E(t) = f(t) − p(t) − c L(t), where the constant c is chosen so that E(x) = 0. It now follows that E(t) vanishes at the N + 2 distinct points x0, x1, . . . , xN, and x. By Rolle's theorem, between each successive pair of these there is a point at which E′ vanishes. Repeating this argument for E′, E″, . . . , E^{(N)} implies that there is a point ξ ∈ [a, b] such that

E^{(N+1)}(ξ) = 0,

from which the error formula (6.7) follows.
Note that Theorem 16 remains valid for x outside [a, b] with a minor adjustment to
the range of possible values of ξ. In such a case, this theorem would cease being about
interpolation. The process of extrapolation in which we attempt to obtain values of
a function outside the spread of the data is numerically much less satisfactory since
the error increases rapidly as x moves away from the interval [a, b] .
To see this, it is sufficient to study a graph of the function L (x) . In Fig. 6.1 we
have a graph of this function for the nodes 1, 1.5, 2, 2.5, 3 so that L(x) = (x − 1)(x − 1.5)(x − 2)(x − 2.5)(x − 3).
We see that the magnitude of this function remains small for x ∈ [1, 3] , but that
it grows rapidly outside this interval. If f (N +1) varies only slowly then this is an
accurate reflection of the behavior of the error term for Lagrange interpolation using
these same nodes.
Intuitively, one would expect the accuracy to improve as the number of nodes
increases. Theorem 16 bears this out to some extent since the (N + 1)! is likely
to dominate the numerator in (6.7) provided that the higher-order derivatives of f
remain bounded. Unfortunately, in practice, this may not be the case. Moreover,
the inclusion of additional data points in the Lagrange interpolation formula is not
completely straightforward. See the following example.
Example 3 Repeat the computation of Example 2 with the additional data point
f (1.2) = 1.0954.
Since we now have a new data point, the previous calculations don’t help and p(x)
must be reconstructed. Taking the nodes in numerical order, the new polynomial is
p(x) = 1 (x − 1.2)(x − 2)(x − 3)/((1 − 1.2)(1 − 2)(1 − 3)) + 1.0954 (x − 1)(x − 2)(x − 3)/((1.2 − 1)(1.2 − 2)(1.2 − 3))
+ 1.4142 (x − 1)(x − 1.2)(x − 3)/((2 − 1)(2 − 1.2)(2 − 3)) + 1.7321 (x − 1)(x − 1.2)(x − 2)/((3 − 1)(3 − 1.2)(3 − 2)).
By the error formula (6.7), the error in this approximation is

f(1.4) − p(1.4) = ((1.4 − 1)(1.4 − 1.2)(1.4 − 2)(1.4 − 3)/4!) f^(4)(ξ)

for some ξ between 1.2 and 2. In this case, f^(4)(x) = −(15/16) x^{−7/2}, and for x ∈ [1.2, 2] it follows that |f^(4)(x)| ≤ 1/2. The error is therefore bounded by (0.0768/24)(0.5) = 1.6000 × 10−3. The true error is actually smaller than the predicted error since we computed the error bound using a maximum bound on |f^(4)(x)|.
There are many different ways of representing this interpolation polynomial, some
of which we propose in the next section. The principal motivation is to find conve-
nient forms which can be implemented efficiently and are easily programmed. The
Lagrange form does not fulfill either of these criteria. However, the most important
aspects of the Lagrange interpolation polynomial are that it provides a relatively
straightforward way to prove the existence and uniqueness of the polynomial. More-
over, it allows for a simpler derivation of the error formula. Since the polynomial
itself is unique, all the alternatives are just different ways of writing down the same
thing–so there is no need to repeat the error analysis for the various forms.
As well as being able to choose the representation of the interpolation polynomial,
in some circumstances, one may be in a position to choose the nodes themselves.
Consideration of the remainder term (6.7) shows that it could be advantageous to
198 6 Interpolation
choose the interpolation points so that |L (x)| is kept as small as possible over the
whole interval of interest. This motivation leads to the choice of the Chebyshev
interpolation points. For details of Chebyshev interpolation, the reader is referred to
a more advanced numerical analysis text. Several such are listed among the references
and further reading.
Exercises
x 1 3 4
f(x) 3 1 1
Write down the Lagrange interpolation polynomial using the nodes 0.0, 0.1, and
0.2, and then using 0.3 as well. Estimate the value of sin 0.26 using each of these
polynomials.
5. Obtain error bounds for the approximations in Exercise 4.
6. Show that for any x ∈ (0.0, 0.8) , the Lagrange interpolation quadratic using the
three nearest nodes from the table in Exercise 4 results in an error less than
6.5 × 10−5 .
7. Show that if x ∈ (0.1, 0.7) then the Lagrange interpolation polynomial using the
four closest points in the table in Exercise 4 results in an error less than 2.5 × 10−6 .
8. Repeat Exercise 5 for the function ln (1 − x) for the same nodes. How many
points will be needed to ensure that the Lagrange interpolation polynomial will
introduce no new errors to four decimal places?
6.3 Difference Representations
The zeroth divided differences are simply the function values, f[xk] = f(xk).
Note that (6.8) implies a connection between first divided differences and first
derivatives. Recall that the secant iteration in Chap. 5 used the approximation

f′(xn) ≈ (f(xn) − f(xn−1)) / (xn − xn−1),

which is to say

f′(xn) ≈ f[xn−1, xn].

Further, by the mean value theorem, it follows that

f[xn−1, xn] = (f(xn) − f(xn−1)) / (xn − xn−1) = f′(ξ)
for some point ξ between xn−1 and xn . Soon we will show that there is a more general
connection between divided differences and derivatives of the same order.
k 0 1 2
xk 2 3 5
f (xk ) 4 2 1
Applying (6.8),

f[x0, x1] = (f[x1] − f[x0]) / (x1 − x0) = (2 − 4)/(3 − 2) = −2
f[x0, x2] = (f[x2] − f[x0]) / (x2 − x0) = (1 − 4)/(5 − 2) = −1
f[x1, x2] = (f[x2] − f[x1]) / (x2 − x1) = (1 − 2)/(5 − 3) = −1/2
Substituting (6.9) into (6.10) with k = 0 and using increasing values of m, we then
get
If we now let p be the polynomial consisting of all but the last term of this
expression:
p (xk ) = f (xk ) (k = 0, 1, . . . , N )
which, using (6.13), is p (xm+1 ) . This completes the induction step and, therefore,
the proof.
Recall that by the uniqueness of the Lagrange interpolation polynomial, this for-
mula (6.12) is just a rearrangement of the Lagrange polynomial. This special form is
known as Newton’s divided difference interpolation polynomial. It is a particularly
useful form of the polynomial as it allows the data points to be introduced one at a
time without any of the waste of effort this entails for the Lagrange formula.
Another implication of (6.12) is that the divided difference f[x0, x1, . . . , xk] is
the leading (degree k) coefficient of the interpolation polynomial which agrees with
f at the nodes x0 , x1 , . . . , xk . By the uniqueness of this polynomial, this coefficient
must be independent of the order of the nodes. This observation means that divided
differences depend only on the set of nodes, not on their order.
The uniqueness of the interpolation polynomial also provides an important con-
nection between divided differences and derivatives. We’ve seen that first divided
differences are also approximations to first derivatives, or, using the mean value
theorem, are first derivatives at some “mean value” point.
Subtracting Eq. (6.12) from (6.11), we obtain
but from Theorem 16, and, specifically, Eq. (6.7), we already know that
f(x) − p(x) = ((x − x0)(x − x1) · · · (x − xN) / (N + 1)!) f^{(N+1)}(ξ)

It follows that

f[x0, x1, . . . , xN, x] = f^{(N+1)}(ξ) / (N + 1)!
where the “mean value point” ξ lies somewhere in the interval spanned by x0 , x1 , . . . ,
x N , and x.
Example 5 Use Newton's divided difference formula to estimate √0.3 and √1.1 from the following data:

x 0.2 0.6 1.2 1.5
√x 0.4472 0.7746 1.0954 1.2247
Taking the data in the order given results in the following table of divided differ-
ences.
k  xk   f[xk]   f[xk, xk+1]   f[xk, xk+1, xk+2]   f[xk, xk+1, xk+2, xk+3]
0  0.2  0.4472  0.8185        −0.2837             0.1296
1  0.6  0.7746  0.5347        −0.1153
2  1.2  1.0954  0.4310
3  1.5  1.2247
Here, for example, the entry f [x1 , x2 ] = 0.5347 results from (1.0954 − 0.7746)/
(1.2 − 0.6). So for x = 0.3, from the table of differences we obtain the successive
approximations, here shown rounded to four decimal places:

0.4472 + (0.3 − 0.2)(0.8185) = 0.5291,

and then,

0.5291 + (0.3 − 0.2)(0.3 − 0.6)(−0.2837) = 0.5376,

and, finally,

0.5376 + (0.3 − 0.2)(0.3 − 0.6)(0.3 − 1.2)(0.1296) = 0.5411.

For x = 1.1 it is natural to reorder the data so that the nodes closest to 1.1 are used first, which gives the following table of divided differences.
k  xk   f[xk]   f[xk, xk+1]   f[xk, xk+1, xk+2]   f[xk, xk+1, xk+2, xk+3]
0  1.2  1.0954  0.4310        −0.1153             0.1296
1  1.5  1.2247  0.5002        −0.2448
2  0.6  0.7746  0.8185
3  0.2  0.4472
Program Python function for divided difference interpolation using nearest nodes
first.
import numpy as np

def ddiff1(xdat, ydat, x):
    """
    Divided difference interpolation of the data (xdat, ydat),
    evaluated at the points in `x`, using the nodes nearest to
    each evaluation point first. Returns the interpolated values
    and the final divided difference table.
    """
    N = x.size
    M = xdat.size
    D = np.zeros((M, M))
    y = np.zeros(N)
    for k in range(N):
        # Sort contents of input data arrays
        xtst = x[k]
        ind = np.argsort(np.abs(xtst - xdat))
        xsort = xdat[ind]
        D[:, 0] = ydat[ind]
        # Begin divided differences
        for j in range(M):
            for i in range(M - j - 1):
                D[i, j + 1] = ((D[i + 1, j] - D[i, j])
                               / (xsort[i + j + 1] - xsort[i]))
        # End divided differences
        # Compute interpolation
        xdiff = xtst - xsort
        prod = 1  # Holds the product of xdiff
        for i in range(M):
            y[k] = y[k] + prod * D[0, i]
            prod = prod * xdiff[i]
    return y, D
For the data of Example 5, the following Python commands yield the results shown.
>>> X = np.array([0.2, 0.6, 1.2, 1.5])
>>> Y = np.sqrt(X)
>>> x = np.array([0.3, 1.1])
>>> y, D = ddiff1(X, Y, x)
>>> print(y)
[0.54106893 1.05032546]
as we obtained earlier.
If your problem has many data points it may be desirable to limit the degree of
the interpolation polynomial. This can be incorporated into the code by restricting
the variable M to be bounded by some maximum degree.
To evaluate Newton’s formula at many points (in order to graph the function, for
example) it may be economical to avoid re-sorting the data for each argument. If the
order of the data is not to be changed, then it is typically the case that all data points
are used throughout. However if the number of data points is also large, the resulting
graph may exhibit the natural tendency of a polynomial to “wiggle”.
With data vectors X, Y and x = np.arange(1, 2.01, 0.01), we can use the Python commands
>>> y, D = ddiff1(X, Y, x)
>>> plt.plot(X, Y, 'k*', x, y, 'k')
>>> plt.show()
These generate the graph shown in Fig. 6.2.
Example 7 Graph the interpolation polynomial for data from the function

y = 1/(1 + x²)

using nodes at the integers in [−5, 5].
If the degree of the interpolating polynomial is restricted so that the nearest four
nodes (to the current point) are used to generate a local cubic interpolating polyno-
mial, then the resulting curve fits the original function much better. The commands
>>> plt.plot(X, Y, 'k*', x, runge(x))
>>> y1 = interp1d(X, Y, kind='cubic')
>>> plt.plot(x, y1(x), 'k')
>>> plt.show()
produce the graph shown in Fig. 6.4. At a glance, the two appear to agree well.
The function interp1d is an interpolation function imported from scipy.
interpolate, which will be revisited later.
In the last example, the data points were equally-spaced and so the divided dif-
ference formula can be rewritten in special forms. There are efficient ways of imple-
menting divided difference interpolation to reduce the computational complexity.
Among these, perhaps the most widely used is the algorithm of Aitken, which com-
putes the values of the interpolation polynomials directly without the need for explicit
computation of the divided differences. The details of Aitken’s algorithm are left to
subsequent, more advanced courses. We turn our attention to the special case of finite
difference interpolation for data at equally-spaced nodes.
In introducing this approach to finite difference interpolation, we assumed that
x0 was chosen so that x ∈ (x0 , x1 ) . If x lies much closer to x1 than to x0 , then
the order of the nodes is not exactly as we desired. In such a situation, it would
be preferable to choose x0 so that x ∈ (x−1 , x0 ) and then use the data points in the
order x0 , x−1 , x1 , x−2 , . . . The examples above have demonstrated some practical
difficulties with polynomial interpolation. Locally restricting the degree of the poly-
nomial used can result in better approximation to a smooth function – but such an
approximation does not reflect the smoothness of the function being approximated.
The next section on spline interpolation presents a way to address this difficulty.
Exercises
1. Compute the divided differences f [x0 , x1 ] , f [x0 , x2 ] , and f [x0 , x1 , x2 ] for the
data
k 0 1 2
xk 0 2 5
f (xk ) 1 3 7
2. Write down the quadratic polynomial p which has the values p (0) = 1, p (2) =
3, p (5) = 7. (Hint: use Exercise 1.)
3. Use the divided difference interpolation formula to estimate f (0.97) from the
data
4. Use divided difference interpolation to obtain values of cos 0.17, cos 0.45 and
cos 0.63 from the following data
5. Write a script for divided difference interpolation which uses all the data, in
the same order, at all points. Use this to plot the divided difference interpolation
polynomial for the data in Exercise 4. Also plot the error function for this interpo-
lation polynomial over the interval [0, 1] . (Note the behavior in the extrapolation
region.)
6. Modify your code from Exercise 5 so that you use data closest to the point of
interest first, and can restrict the degree of the local polynomial being used. Test
your code by computing the interpolating function and the error function over the
interval [0, 1.2].
6.4 Splines
The previous section demonstrated ways to build polynomials to agree with a function
or data at specified locations. It was also shown that using a single polynomial is
not always a good choice in approximating values between those points. Using
divided difference or finite difference formulas alleviated some of the pitfalls by
using only those data points that are close to the current point of interest. However,
it is sometimes advantageous to approximate the function (or interpolate data) using
piecewise polynomials on different intervals.
The basic idea is probably not new to you, especially at the simplest level of
piecewise linear functions. For example, suppose you look at the temperature forecast
for the day and see it is going to be 15 ◦ C at 10:00AM and 19 ◦ C at 12:00 PM, then you
may consider predicting the temperature at 11:00AM by thinking about connecting
those ‘data points’ with a line. You can determine the line between (10, 15) and
(12, 19) to get y = 2x − 5 (where y is the temperature and x is the time) so at x = 11
it would predict 17 ◦ C. Here, there was an assumption made that the temperature
would behave linearly between times. Also, you are not using any information about
what the temperature was at say 6:00AM. Instead, you are just trusting the line
between those two data points.
For data with more complicated behavior (or curvature), a higher degree polynomial on each interval may be a better choice. The problem with the piecewise polynomials
generated by, say, divided difference interpolation using cubic pieces (which would
mean using the four nearest nodes to each point) is that the transition from one piece
to the next is typically not smooth. This situation is illustrated in Fig. 6.5. The nodes
used were 0, 1, 1.25, 1.5, 1.75, 2, 3. The data are plotted along with the divided
difference piecewise cubic using the nearest four nodes. There is an abrupt change
(indeed a discontinuity) at the point at which the node 0 is dropped from the set
being used. For these nodes, this happens as soon as |x − 1.75| < |x| , which occurs
at 7/8.
The so-called spline interpolation can eliminate such loss of smoothness at the
transitions.
1. s is a piecewise polynomial such that, on each subinterval [xk, xk+1], s has degree at most m, and
2. s is m − 1 times differentiable everywhere.
Example 8 Show that the following function is a spline of degree 1 (i.e. a linear
spline)
s(x) = x + 1 for x ∈ [−1, 0],
s(x) = 2x + 1 for x ∈ [0, 1],
s(x) = x + 2 for x ∈ [1, 3].
All that is needed is to show that s is continuous at the internal knots 0, 1. Now
As x → 0− s (x) → 1
As x → 0+ s (x) → 1
As x → 1− s (x) → 3
As x → 1+ s (x) → 3
Since the highest degree is 3, we are checking to see if this is a cubic spline, i.e. we must check continuity of s, s′, and s″ at x = 1, 2.
At both points, s is continuous–check that s(1) = 1, and s(2) = 5 using the piece-
wise definitions. However, you can check that s(x) fails to have continuous second
derivatives at x = 1, so by definition, this cannot be a spline function.
Interpolation using low-degree splines can provide accurate and efficient methods
for approximating a function or in some cases, creating a model out of data. Let’s
investigate linear splines in more depth and then proceed to the special case of cubic
spline interpolation. Specifically, a linear spline, or spline of degree 1, is a continuous
function whose graph consists of pieces which are all straight lines. Suppose then
we are given the values fk = f(xk) of a function f at the knots x0 < x1 < · · · < xn. On the subinterval [xk, xk+1], the linear spline is simply the straight line through (xk, fk) and (xk+1, fk+1),

y = fk + ((fk+1 − fk)/(xk+1 − xk)) (x − xk)     (6.15)

which can be written in the form

sk(x) = ak + bk (x − xk)
Example 10 Find the linear spline that interpolates the following data.
x 1 3 4 5.5
f (x) 1 27 64 166.375
Here, x0 = 1, x1 = 3, x2 = 4, x3 = 5.5,
a0 = f 0 = 1
a1 = f 1 = 27
a2 = f 2 = 64
212 6 Interpolation
and

b0 = f[x0, x1] = (27 − 1)/(3 − 1) = 13
b1 = f[x1, x2] = (64 − 27)/(4 − 3) = 37
b2 = f[x2, x3] = (166.375 − 64)/(5.5 − 4) = 68.25
Therefore

s(x) = 1 + 13(x − 1) = 13x − 12 for 1 ≤ x ≤ 3
s(x) = 27 + 37(x − 3) = 37x − 84 for 3 ≤ x ≤ 4
s(x) = 64 + 68.25(x − 4) = 68.25x − 209 for 4 ≤ x ≤ 5.5
The spline is continuous at the interior knots and agrees with the data values at those
locations. Note that the linear spline is not defined outside the range of the data.
Extrapolation using a linear spline would be especially suspect.
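Since a linear spline is simply piecewise linear interpolation, NumPy's interp function reproduces the values of s(x) directly; a small check (not from the text):

import numpy as np

xk = np.array([1.0, 3.0, 4.0, 5.5])
fk = np.array([1.0, 27.0, 64.0, 166.375])
print(np.interp(2.0, xk, fk))    # 14.0, i.e. 13*2 - 12 from the first piece
print(np.interp(4.5, xk, fk))    # 98.125, i.e. 68.25*4.5 - 209 from the last piece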
This can be achieved by collecting data (i.e. measurements) that describe the shape
of your hand. A rudimentary approach would be simply to place your hand on paper
and trace it and then use a ruler to measure points along your hand relative to some
location you designate as the origin. Suppose you got n points describing your hand.
This gives two data sets: (i, xi) for i = 1, . . . , n and (i, yi) for i = 1, . . . , n. Evaluate your linear spline
function at a denser set of intermediate points between 1 and n (for example, at
increments of 0.05) to obtain the intermediate x and y approximate locations on
your hand (let’s call those vectors u and v for example). Finally, plot your data along
with the intermediate values (Fig. 6.6).
>>> n = x.size
>>> s = np.arange(1, n + 1)
>>> t = np.arange(1, n, 0.05)
>>> u_func = interp1d(s, x)
>>> v_func = interp1d(s, y)
>>> plt.figure(figsize=(3, 4))
>>> plt.plot(x, y, 'k*')
>>> plt.plot(u_func(t), v_func(t), 'k-')
>>> plt.show()
The above example may remind you of connecting the dots when you were a
child. When it comes to digital arts applications, a much smoother image would be
better. A higher degree polynomial may help. Probably the most commonly used
spline functions are cubic splines which we will derive in detail below. Following
the notation used with linear splines above, the components of a cubic spline can be
written as
sk (x) = ak + bk (x − xk ) + ck (x − xk )2 + dk (x − xk )3 (6.16)
which guarantees the continuity of s. The first two derivatives must be continuous
giving the following conditions;
ak = f k (k = 0, 1, . . . , n − 1) (6.20)
h k = xk+1 − xk .
and
2ck + 6dk h k = 2ck+1 (k = 0, 1, . . . , n − 2)
This last equation gives
dk = (ck+1 − ck) / (3 hk)     (6.24)

and substituting (6.24) into (6.22) yields

bk = δk − ck hk − (ck+1 − ck) hk / 3
   = δk − (hk/3)(ck+1 + 2ck)     (6.25)
To this end, if the coefficients ck can be determined, then Eqs. (6.24) and (6.25) can
be used to complete the definition of the spline components. Substituting for bk , bk+1,
and dk in (6.23), we arrive at a linear system of equations for the coefficients ck , as
follows:
δk − (hk/3)(ck+1 + 2ck) + 2ck hk + (ck+1 − ck) hk = δk+1 − (hk+1/3)(ck+2 + 2ck+1)

Collecting terms, we obtain

hk ck + 2(hk + hk+1) ck+1 + hk+1 ck+2 = 3(δk+1 − δk)     (6.26)
Definition 20 The natural cubic spline is defined by imposing the additional condi-
tions
s″(a) = s″(b) = 0
This requirement implies that the spline function continues with straight lines
outside the interval [a, b] while maintaining its smoothness. This mimics the behavior
of the physical spline beyond the extreme knots.
For the natural cubic spline, this implies that

s″(a) = s″0(x0) = 2c0 = 0, so that c0 = 0,

while

s″(b) = s″n−1(xn) = 2cn−1 + 6dn−1 hn−1 = 0
Introducing the spurious coefficient cn = 0, this last equation becomes

dn−1 = (cn − cn−1) / (3 hn−1)
which is to say that (6.24) remains valid for k = n − 1. It also extends the validity
of (6.26) to k = n − 2.
The full tridiagonal system is

H [c1, c2, . . . , cn−1]ᵀ = 3 [δ1 − δ0, δ2 − δ1, . . . , δn−1 − δn−2]ᵀ

where H is the (n − 1) × (n − 1) tridiagonal matrix with diagonal entries 2(h0 + h1), 2(h1 + h2), . . . , 2(hn−2 + hn−1) and with off-diagonal entries h1, h2, . . . , hn−2 immediately above and below the diagonal.
Example 12 Find the natural cubic spline which fits the data
x 1 4 9 16 25
f (x) 1 2 3 4 5
Since there are five knots, n = 4 and we should obtain a 3 × 3 tridiagonal system
of linear equations for the unknown coefficients c1 , c2 , c3 .
Recall that ak = f k for k = 0, 1, 2, 3 so
a0 = 1, a1 = 2, a2 = 3, a3 = 4
In matrix terms:
[ 16  5  0 ] [ c1 ]   [ −0.4000 ]
[  5 24  7 ] [ c2 ] = [ −0.1714 ]
[  0  7 32 ] [ c3 ]   [ −0.0952 ]
The solution to this system is

c1 = −0.0246, c2 = −0.0012, c3 = −0.0027

(with c0 = c4 = 0 for the natural spline), and then, from (6.24),

d0 = (c1 − c0)/(3h0) = −2.7353 × 10−3
d1 = 1.5595 × 10−3
d2 = −7.0671 × 10−5
d3 = 1.0031 × 10−4
b0 = δ0 − (h0/3)(c1 + 2c0) = 3.5795 × 10−1
b1 = 2.8410 × 10−1
b2 = 1.5489 × 10−1
b3 = 1.2736 × 10−1
Putting it all together, components of the natural cubic interpolation spline are then
defined by using the appropriate values from above in
sk (x) = ak + bk (x − xk ) + ck (x − xk )2 + dk (x − xk )3
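As a check, the coefficient computation of Example 12 can be reproduced in a few lines of NumPy (a sketch, not the book's code):

import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0, 25.0])
f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
h = np.diff(x)                        # [3, 5, 7, 9]
delta = np.diff(f) / h                # first divided differences
H = (np.diag(2 * (h[:-1] + h[1:]))
     + np.diag(h[1:-1], 1) + np.diag(h[1:-1], -1))
c = np.zeros(5)
c[1:4] = np.linalg.solve(H, 3 * np.diff(delta))    # c0 = c4 = 0
d = np.diff(c) / (3 * h)
b = delta - h * (c[1:] + 2 * c[:-1]) / 3
print(b[0], d[0])                     # about 0.35795 and -0.0027353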
The absolute errors in the example values above are approximately 3.8051 × 10−2
and 2.7049 × 10−2 which are of a similar order of magnitude to those that would
be expected from using local cubic divided difference interpolation. (The data in
Example 12 is from the square-root function.) This is to be expected since the error in
natural cubic spline interpolation is again of order h 4 as it is for the cubic polynomial
interpolation. The proof of this result for splines is much more difficult however. It
is left to subsequent courses and more advanced texts.
Typically, the natural cubic spline will have better smoothness properties because,
although the approximating polynomial pieces are local, they are affected by the
distant data. In order to get a good match to a particular curve, we must be careful
about the placement of the knots. We shall consider this further, though briefly, after
discussing the implementation of natural cubic spline interpolation.
The function cspline first computes the coefficients using the equations derived
earlier. The tridiagonal system is solved here using NumPy’s linear equation solver
numpy.linalg.solve. We shall discuss techniques for solving such systems in detail in
Chap. 7. For now we shall content ourselves with the “black box” provided with
Python/NumPy.
Note the use of array arithmetic operations in generating the remaining coefficients
once the c’s have been computed. Note, too, the use of np.diag to initialize the matrix
elements.
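The cspline listing itself is not reproduced in this extract, so the following is a minimal sketch of how such a function might be organized, consistent with the equations derived above (the name and calling sequence are taken from the examples that follow; details of the original may differ):

import numpy as np

def cspline(xdat, ydat, x):
    """Natural cubic spline interpolation of (xdat, ydat), evaluated at x (sketch)."""
    xdat = np.asarray(xdat, dtype=float)
    ydat = np.asarray(ydat, dtype=float)
    x = np.asarray(x, dtype=float)
    n = xdat.size - 1                     # number of spline pieces
    h = np.diff(xdat)                     # knot spacings h_k
    delta = np.diff(ydat) / h             # first divided differences
    # Tridiagonal system for c_1, ..., c_{n-1}; c_0 = c_n = 0 for a natural spline
    H = (np.diag(2 * (h[:-1] + h[1:]))
         + np.diag(h[1:-1], 1) + np.diag(h[1:-1], -1))
    c = np.zeros(n + 1)
    c[1:n] = np.linalg.solve(H, 3 * np.diff(delta))
    # Remaining coefficients from (6.24) and (6.25), using array arithmetic
    d = np.diff(c) / (3 * h)
    b = delta - h * (c[1:] + 2 * c[:-1]) / 3
    a = ydat[:-1]
    # Locate the piece containing each evaluation point and evaluate it
    k = np.clip(np.searchsorted(xdat, x) - 1, 0, n - 1)
    dx = x - xdat[k]
    return a[k] + dx * (b[k] + dx * (c[:-1][k] + dx * d[k]))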
This function assumes that the arguments x are all in the interval spanned by the
knots. It would evaluate the spline outside this interval by using the first component
for any x-value to the left, and the last component for values to the right of the span
of the knots. It is not difficult to modify the code to extend the spline with straight
line components beyond this interval but we do not trouble with this here.
Example 13 Plot the error function s(x) − √x for the data of Example 12.
We can use the previously created Python function cspline with the following
commands to generate the graph in Fig. 6.7.
>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> from prg_cubicspline import cspline
>>> X = np.array([1, 4, 9, 16, 25])
>>> Y = np.array([1, 2, 3, 4, 5])
>>> x = np.arange(1, 25, .4)
>>> y = cspline(X, Y, x)
>>> plt.plot(x, y - np.sqrt(x))
>>> plt.show()
Here the vectors X, Y are the knots and the corresponding data values, x is a vector
of points at which to evaluate the natural cubic spline interpolating this data and y is
the corresponding vector of spline values.
We see that this error function varies quite smoothly, as we hoped. Also we note
that the most severe error occurs very near x = 3 which was the first point we used in
the earlier example. The evidence presented there was representative of the overall
performance (Fig. 6.7).
In motivating the study of spline interpolation, we used an example of local cubic
divided difference interpolation where the resulting piecewise polynomial had a
discontinuity. See Fig. 6.5. The corresponding cubic spline interpolant is plotted in
Fig. 6.8. This was generated using the Python commands
>>> X = np.concatenate(((0,), np.arange(1, 2.25, .25), (3,)))
>>> Y = np.concatenate(((1,), np.log(X[1:])))
>>> x = np.linspace(X[0], X[-1], 200)
>>> y = cspline(X, Y, x)
>>> plt.plot(x, y, 'k-')
>>> plt.plot(X, Y, 'k*')
>>> plt.show()
It is immediately apparent that the cubic spline has handled the changes in this
curve much more readily. The main cause of difficulty for the local polynomial
interpolation used earlier is the fact that the function has a minimum and an inflection
point very close together – and the inflection point appears to be very close to the
discontinuity of the local divided difference interpolation function.
In general, we want to place more nodes in regions where the function is changing
rapidly (especially in its first derivative). We shall consider briefly the question of
knot placement for cubic spline interpolation in the next example.
Example 14 Use cubic spline interpolation for data from the function 1/(1 + x²) with eleven knots in [−5, 5].
First, for equally spaced knots, we can use the Python commands
>>> X = np.arange(-5, 6)
>>> Y = runge(X)
>>> x = np.arange(-5, 5.05, .05)
>>> y = cspline(X, Y, x)
>>> plt.plot(X, Y, 'k*', x, runge(x), 'k-', x, y, 'k-')
>>> plt.show()
>>> plt.figure()
>>> plt.plot(x, runge(x) - y, 'k-')
>>> plt.show()
to obtain the following two plots in Figs. 6.9 and 6.10. In Fig. 6.9, we see that the
graphs of Runge and its cubic spline interpolant using integer knots in [−5, 5] are
essentially indistinguishable. The error function in Fig. 6.10 shows that the fit is
especially good near the ends of the range (where polynomial interpolation failed).
The errors are largest either side of ±1 which is where the inflection points are, and
where the gradient is changing rapidly. To get a more uniform fit we would probably
want to place more of the knots in this region.
In Fig. 6.11 we see the effect on the error function of using knots at
Fig. 6.11 Interpolation error with new knots
There are still just eleven knots but the magnitude of the maximum error has been
reduced by an approximate factor of 10.
We have seen that natural cubic splines can be used to obtain smooth interpolating
functions even for data where polynomial interpolation cannot reproduce the correct
behavior.
Exercises
x 0 1 3 4 6
f(x) 5 4 3 2 1
3. Write a program to compute the linear spline interpolating a given set of data. Use it to plot the linear spline interpolant for the data in Exercise 2. Also plot the linear interpolating spline for the function 1/(1 + x²) using knots −5, −4, . . . , 5. (If all we want is the plot, there is a simpler way, of course. Just plot(X, Y) where X and Y are the vectors of knots and their corresponding data values.)
4. Find the natural cubic interpolation spline for the data
x 1 2 3 4 5
ln x 0.0000 0.6931 1.0986 1.3863 1.6094
• Hints: As usual, the a-coefficients are just the data values:
5. Compute the natural cubic spline interpolant for data from the function √x exp(−x) using knots at the integers in [0, 8]. Compare the graph of this spline with that of √x exp(−x).
6. We can continue a natural cubic spline outside the span of the knots with straight
lines while preserving the continuity of the first two derivatives. Show that to
do this we can use s−1 (x) = a0 + b0 (x − x0 ) and sn (x) = an + bn (x − xn )
where
an = f(xn), and

bn = (cn−1 hn−1)/3 + δn−1
7. Modify the program for natural cubic spline interpolation to include straight line extensions outside the span of the knots. Use this modified code to plot the natural cubic spline interpolant to 1/(1 + x²) over the interval [−6, 6] using knots −5, −4, . . . , 5.
12. Repeat the above exercise but this time use a picture of yourself to create a
line-drawing self portrait.
6.5 Conclusions and Connections: Interpolation
What have you learned in this chapter? And where does it lead?
The chapter began with polynomial interpolation in which we seek a polynomial
of specified degree that agrees with given data. The first approach was the “obvi-
ous” one of using our knowledge of solving linear systems of equations to find the
Lagrange interpolation polynomial by solving the Vandermonde system for the coef-
ficients. However, that is both inefficient and, because of ill-conditioning, subject to computational error.
The use of the Lagrange basis polynomials is equivalent to transforming that
system to a diagonal form, and is a more practical approach. Even so the addition of
new data points is cumbersome–and this is a very real issue when for example the data
comes from a time series and new observations become available–such as adding the
temperature at 11:00 to a sequence of temperature readings at 10:00, 10:15, 10:30
and 10:45 in forecasting the temperature at 11:15. (Think of the Weather Channel
app preparing its future radar to forecast the progress of a winter storm. They would
certainly want to use the most recent true data to update those estimates.)
Difference schemes allow a more efficient use of the data, including adding new
data points. These divided difference approaches have a very long history–which is
why they bear Newton’s name, of course–but they retain their importance today. Part
of the significance of difference representations lies in the ability to recenter the data
so that local data assumes greater importance relative to more distant data points. It
also allows us to halt computation (effectively reducing the degree of the polynomial)
either when no further improvement results from using more data, or when all data
within some reasonable radius of the point of interest has been exhausted. That is
the interpolation can be local.
Traditional polynomial, or even spline, interpolation is not always appropriate
although it provides initial insight into a deep field of applied mathematics. In some
scenarios, other methods will give better results, and it is helpful to draw a distinction
between the approximation of a smooth function whose tabulated values are known
very accurately, and the situation where the data itself is subject to error. (Most practi-
cal problems have some aspects of both of these.) In the former situation, polynomial
approximation may be applied with advantage over a wide interval, sometimes using
polynomials of moderate or high degree. In the latter case, it is invariably desirable
to use lower degree polynomials over restricted ranges. One important technique
is approximation by cubic splines in which the function is represented by different
cubic polynomials in different intervals.
Spline interpolation can be viewed as a very special form of local polynomial
interpolation. Here the basic idea is to use low degree polynomials which connect as
smoothly as possible as we move through the data. The example we focused on was
cubic spline interpolation where the resulting function retains continuity in its slope
and curvature at each data point. (We enforce those through continuity of the first and
second derivatives.) Spline interpolation was an important tool in creating the early
digitally animated movies such as Toy Story. The ability to have visually smooth
curves that do not deteriorate under scaling for example is critical to creating good
images. In the context of animation, splines have been largely superseded in recent
years by subdivision surfaces. These are similarly smooth curves (or surfaces in
higher dimension) where certain points act as control points which you can envision
as elastic connections to the curve rather than fixed interpolation points. This is one
of many potential research topics that has its origins in the methods of this chapter.
6.6 Python Interpolation Functions

Python has several functions for interpolation available in SciPy. These include
polynomial as well as spline interpolation. The function interp1d from
scipy.interpolate performs one-dimensional interpolation; its kind argument selects
the type of interpolation from the following values:
’linear’ default value – performs linear interpolation between the data values.
’nearest’ interpolates intermediate values to the nearest data value, i.e. piece-wise
constant interpolation.
’zero’ zero-degree spline interpolation.
’slinear’ first-degree (linear) spline interpolation.
’quadratic’ second-degree spline interpolation.
’cubic’ third-degree spline interpolation.
The interp1d function returns a function itself. The returned function can be used
to evaluate the interpolation at an array of points supplied to it.
Example 15 We can use the following data to interpolate a set of points using SciPy’s
interp1d function
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

X = np.concatenate(((0,), np.arange(1, 2.25, .25), (3,)))
Y = np.concatenate(((1,), np.log(X[1:])))
The interpolating functions are obtained using
y1 = interp1d(X, Y)
y2 = interp1d(X, Y, 'nearest')
y3 = interp1d(X, Y, 'zero')
y4 = interp1d(X, Y, 'slinear')
y5 = interp1d(X, Y, 'quadratic')
y6 = interp1d(X, Y, 'cubic')
The interpolation y1 uses the default argument ’linear’. So far we have the functions
y1 to y6 available for evaluating the interpolation at arbitrary points. We can do this
directly in plotting the interpolations
x = np.arange(0, 3, 0.02)
plt.plot(x, y1(x), 'k')
plt.plot(x, y2(x), 'k')
plt.plot(x, y3(x), 'k')
plt.plot(x, y4(x), 'k')
plt.plot(x, y5(x), 'k')
plt.plot(x, y6(x), 'k')
plt.show()
which are shown in Fig. 6.12.
We see that the interpolations obtained using ’linear’ (default) and ’slinear’ coin-
cide.
SciPy similarly has the functions interp2d and interpn for interpolation in 2
dimensions and in n dimensions, respectively. In addition, the scipy.interpolate
module contains several additional functions and classes for performing various
types of interpolation. We advise you to explore the SciPy documentation for further
details.
numpy.polyfit was introduced in Sect. 4.8.2. It is intended for least squares fitting
of polynomials to data. As such, it can also be used to compute an interpolating
polynomial for a set of nodes and corresponding function values. See Sect. 4.8.2
for an example.
We can use numpy.polyfit configured for degree one less than the number of
nodes to obtain full polynomial interpolation in Python. The resulting polynomial
can be evaluated using the function numpy.polyval.
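For illustration, a minimal sketch with some hypothetical nodes and values:

import numpy as np

# hypothetical nodes and corresponding function values
x_nodes = np.array([0.0, 1.0, 2.0, 3.0])
y_nodes = np.exp(x_nodes)

# degree one less than the number of nodes gives the interpolating polynomial
coeffs = np.polyfit(x_nodes, y_nodes, x_nodes.size - 1)

# evaluate the interpolant at intermediate points
x_eval = np.linspace(0.0, 3.0, 7)
y_eval = np.polyval(coeffs, x_eval)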
7 Differential Equations
Later in the chapter both higher-order initial value problems (or systems of them)
and the solution of boundary value problems (where the conditions are specified
at two distinct points) are also considered briefly. A typical second-order two-point
boundary value problem has the form
y'' = f(x, y, y');  y(a) = ya,  y(b) = yb    (7.2)
Taylor series are the basis for many of the methods to follow. That is, some
methods can be derived by considering
y(x1) = y(x0 + h0) = y0 + h0 y'(x0) + (h0²/2) y''(x0) + · · ·    (7.3)

for some steplength h0. Truncating the series after a certain number of terms or
approximating terms using other function values are some ways to arrive at numerical
schemes to approximate the solution at x1 . The process can then be repeated for
subsequent steps. An approximate value y1 ≈ y (x1 ) is used with a steplength h 1 to
obtain an approximation y2 ≈ y (x2 ) , and so on throughout the domain of interest.
Although Taylor series provides one way to derive many of the methods dis-
cussed here, there are other approaches. Many of the methods can also be viewed as
applications of simple numerical integration rules to the integration of the function
f (x, y (x)) . The simplest method – from either of these viewpoints – is Euler’s
method for which there is also an insightful graphical explanation. We should note,
you already got a flavor for this in the section on numerical calculus, but we present
here from a beginning viewpoint. To get an initial understanding, we’ll consider
Euler's method as a graphical technique using an example. Figure 7.1 shows a slope
field for the differential equation y' = (6x² − 1)y which, with the initial condition
y(0) = 1, we shall use as a basic example for much of this chapter. This particular
initial value problem is

y' = (6x² − 1)y,  y(0) = 1    (7.5)
Following a tangent line for a short distance gives an approximation to the solution
at a nearby point. The tangent line at that point can then be used for the next step in
the solution process. That is the conceptual idea behind Euler’s method.
As mentioned above, Euler’s method can be derived algebraically from a Taylor
approximation. Consider, for these derivations, a fixed steplength h and let x0 = 0
and y0 = 1 as in (7.5). We denote x0 + kh by xk for k = 0, 1, . . . A first-order Taylor
approximation to y (x1 ) is
y1 = y0 + h f (x0 , y0 ) (7.6)
In terms of notation, keep in mind that apart from y0 , we use yk to represent our
approximation to the solution value y (xk ) . This is in contrast to our notation for
interpolation and integration where we often used f k = f (xk ) .
The general step, from (xk, yk) to xk+1, takes the same form:

yk+1 = yk + h f(xk, yk)    (7.7)

Also note that (7.7) is not quite the desired generalization of (7.6). In the first step
(7.6), the exact values are used in the right-hand side. In subsequent steps, previously
computed approximate values are used.
The notion that many methods for differential equations can be derived in terms
of the approximate integration of the right-hand side of the differential equation
(7.1) is worth revisiting. To see this, consider the first step of the solution. By the
fundamental theorem of calculus, we have
y(x1) − y(x0) = ∫_{x0}^{x1} f(x, y(x)) dx.    (7.8)

Approximating this integral by h f(x0, y0), the left-endpoint rectangle rule, gives
y1 = y0 + h f(x0, y0), which is equivalent to the approximation (7.6) derived above for Euler's method.
This suggests that alternative techniques could be based on better numerical inte-
gration rules such as the mid-point rule or Simpson’s rule. This is not completely
straightforward since such formulas would require knowledge of the solution at
points where it is not yet known. This is discussed more later.
The Python implementation of Euler’s method is below. The inputs required are
the function f, the interval [a, b] over which the solution is sought, the initial value
y0 = y (a), and the number of steps to be used. The following code can be used.
import numpy as np

def euler(fcn, a, b, y0, N):
    """
    Function for generating an Euler solution to
    y' = f(x, y) in `N` steps with initial
    condition y[a] = y0.
    """
    h = (b - a) / N
    x = a + np.arange(N + 1) * h
    y = np.zeros(x.size)
    y[0] = y0
    for k in range(N):
        y[k + 1] = y[k] + h * fcn(x[k], y[k])
    return (x, y)
The output here consists of a table of values xk , yk .
Example 1 Apply Euler's method to the solution of the initial value problem (7.5),
y' = (6x² − 1)y; y(0) = 1. Begin with N = 4 steps and repeatedly double this
number of steps up to 256.
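For the N = 4 case, the computation amounts to a call such as the following (a minimal
sketch reusing the euler function above); the resulting values are tabulated below.

f = lambda x, y: (6 * x**2 - 1) * y    # right-hand side of (7.5)
x, y = euler(f, 0, 1, 1, 4)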
x y
0 1.0000
0.25 0.7500
0.5 0.6328
0.75 0.7119
1.0 1.1346
It is often useful to check computer code with some hand calculations. We step
through Euler’s method here by hand to get a better idea of how it works. With
x0 = 0, y0 = 1, N = 4, and h = 1/4, we obtain
y1 = y0 + (1/4)(6x0² − 1)y0 = 3/4
Similar tables, with more entries, are generated for the other values of N. The
values obtained for y(1), and their errors, are tabulated below. (Note that the true
solution of (7.5) is y = exp(2x³ − x), so that y(1) = e.)
N        4       8       16      32      64      128     256
y_N      1.1346  1.6357  2.0527  2.3420  2.5169  2.6139  2.6651
|Error|  1.5837  1.0826  0.6656  0.3763  0.2014  0.1044  0.0532
Although convergence is slow, the errors are decreasing. The solutions are plotted,
along with the true solution, in Fig. 7.2.
As seen in the numerical calculus section, the ratio of successive errors can be used
to verify that code is working properly by comparing values to what the truncation
error theory predicts. Here the ratios of the successive final errors in Example 1 are
approximately 0.6836, 0.6148, 0.5352, 0.5183, and 0.5095 which seem to be settling
down close to 1/2 which is the same factor by which the steplength h is reduced.
This suggests that the overall error in Euler's method is first order, i.e. the error is O(h).
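These ratios can be computed directly from the tabulated errors, for example:

errors = np.array([1.5837, 1.0826, 0.6656, 0.3763, 0.2014, 0.1044, 0.0532])
ratios = errors[1:] / errors[:-1]    # successive error ratios, settling toward 1/2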
Let’s take a closer look at the truncation error analysis. The successive approxi-
mate solution points are generated by following the tangents to the different solution
curves through these points. At each step the approximate solution follows the tangent
line. At the next step, it follows the tangent to the solution to the original differential
equation that passes through this erroneous point. Although all the solution curves
have similar behavior, those below the desired solution are all less steep at corre-
sponding x-values and the convex nature of the curves implies that the tangent will
lie below the curve resulting in further growth in the error.
So, there are two components to this error; one is a contribution due to the straight
line approximation used at each step. Second, there is a significant contribution
resulting from the accumulation of errors from previous steps. Their effect is that the
linear approximation is not tangent to the required solution but to another solution of
the differential equation passing through the current point. These two types of error
contributions are called the local and global truncation errors.
The situation is illustrated in Fig. 7.3 for the second step of Euler’s method. In
this case, the local contribution to the error at step 2 is relatively small, with a much
larger contribution coming from the effect of the (local and global) error in the first
step.
For the first step, we have
y1 = y0 + h f(x0, y0)
y(x1) = y0 + h y'(x0) + (h²/2) y''(ξ) = y0 + h f(x0, y0) + (h²/2) y''(ξ)

so that

|y1 − y(x1)| = (h²/2) |y''(ξ)| ≤ K h²
where K is a bound on the second derivative of the solution. A similar local truncation
error will occur at each step of the solution process.
We see that the local truncation error for Euler's method is O(h²), yet we observed
that the global truncation error appears to be just O(h). By the time we have computed
our estimate yN of y(b), we have committed N of these truncation errors, from which
we may conclude that the global truncation error has the form

E = N K h² = K (b − a) h

since N h = b − a. Thus the global truncation error is indeed of order h. The stated
error is the leading contribution to the rigorously defined global truncation error,
which is of order h. (The rigorous derivation is more complicated than is appropriate
here.)
In addition to the truncation error theory presented here, in practice there will also
be round-off error, which will tend to accumulate. Those effects can be analyzed in
a manner similar to that for the global truncation error. The result is that the global
roundoff error bound is inversely proportional to the steplength h. As with numerical
differentiation earlier in this text, this places a restriction on the accuracy that can
be achieved. However, for systems working with IEEE double precision arithmetic,
the effect of roundoff error is typically much smaller than the global truncation error
unless we are seeking very high accuracy and therefore using very small steplengths.
Euler’s method is straightforward to implement. Unfortunately, it is not always
going to be sufficient to generate accurate solutions to differential equations. More
accurate methods are developed in the next sections that build on similar ideas pre-
sented above.
Exercises
Exercises 1–6, and 8 concern the differential equation
y' = 2yx − y
10. A rumor is spread when "information" passes from person to person. One
assumption about how a rumor spreads is that people acquire the rumor from
public sources, such as the Internet, and that the rate of acquisition is
proportional to the number of people who do not yet know the rumor.
Plot your solutions over time and explain the behavior of the graphs.
11. The above model is not particularly realistic. Develop your own model, using a
differential equation, that describes the rate of spreading of a rumor. You may
want to revisit the modeling ideas from Chap. 1. Try to solve your problem
numerically. What challenges arise during this process? If you are able to get a
solution, what are the strengths and weaknesses of your modeling approach?
7.2 Runge–Kutta Methods

The Runge–Kutta methods, which we explore next, use additional points between xk
and xk+1 which allow the resulting approximation to agree with more terms of the
Taylor series. This approach can also be derived by using higher-order numerical
integration methods to approximate
y(xk+1) − y(xk) = ∫_{xk}^{xk+1} f(x, y) dx
The basic ideas behind these methods are similar to those already presented. We shall
discuss the most commonly used fourth-order Runge–Kutta method in some detail,
without presenting its full derivation.
Consider the second-order Taylor expansion about the point (x0 , y0 ) using a
steplength h:
y(x1) ≈ y0 + h y'0 + (h²/2) y''0

which gives the approximate value

y1 = y0 + h f(x0, y0) + (h²/2) y''0    (7.9)

This formula has error of order O(h³). However, an approximation for y''0 is required
for this expression to be useful. Since this term is multiplied by h², it follows that an
approximation to y''0 which is itself accurate to order O(h) will maintain the overall
error in (7.9) at O(h³).
For the sake of notation moving forward, denote the slope y'0 = f(x0, y0) by k1.
Also let α ∈ [0, 1] and consider an "Euler step" of length αh. This gives the
approximate solution value y0 + αh k1 at x0 + αh, and we denote by k2 the slope at
the point (x0 + αh, y0 + αh k1), so that k2 = f(x0 + αh, y0 + αh k1). The simplest
divided difference estimate of y''0 is then

y''0 ≈ (k2 − k1)/(αh)    (7.10)

and substituting this into (7.9) gives

y1 = y0 + h k1 + (h²/2) (k2 − k1)/(αh)
   = y0 + h [ k1 (1 − 1/(2α)) + k2/(2α) ]    (7.11)
which is the general form of the Runge–Kutta second-order formulas. The corre-
sponding formula for the nth step in the solution process is
yn+1 = yn + h [ k1 (1 − 1/(2α)) + k2/(2α) ]    (7.12)
where now
k1 = f(xn, yn)
k2 = f(xn + αh, yn + αh k1)    (7.13)
Because the approximation (7.10) has error of order O(h), it follows that the
local truncation error in (7.11) is O(h³), as desired.
For implementation, a user has choices for the values of α. There are three common
choices, α = 1/2, 1, or 2/3, which are used in (7.12) to give different numerical
schemes.
α = 1/2: Corrected Euler, or Midpoint, method

yn+1 = yn + h k2    (7.14)

where k2 = f(xn + h/2, yn + h k1/2). This method is simply the application of the
midpoint rule for integration of f(x, y) over the current step with the (approximate)
value at the midpoint being obtained by a preliminary (half-) Euler step.
α = 1: Modified Euler method
yn+1 = yn + (h/2)(k1 + k2)    (7.15)
with k2 = f (xn + h, yn + hk1 ) . This method results from using the trapezoid rule
to estimate the integral where a preliminary (full) Euler step is taken to obtain the
(approximate) value at xn+1 .
α = 2/3: Heun’s method
yn+1 = yn + (h/4)(k1 + 3 k2)    (7.16)
with k2 = f (xn + 2h/3, yn + 2hk1 /3) . Heun’s method is different in that it cannot
be derived from a numerical integration approach.
The first two of these are illustrated in Figs. 7.4 and 7.5. In each of those figures,
the second curve represents the solution to the differential equation passing through
the appropriate point. For the corrected Euler this “midpoint” slope k2 is then applied
for the full step from x0 to x1 . For the modified Euler method, the slope used is the
average of the slope k1 at x0 and the (estimated) slope at x1 , k2 . One can see that
both these methods result in significantly improved estimates of the average slope
of the true solution over [x0 , x1 ] than Euler’s method which just uses k1 .
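All three choices can be implemented in a single routine; a minimal sketch patterned
on the euler function above (the name rk2 and its interface are just for illustration):

import numpy as np

def rk2(fcn, a, b, y0, N, alpha=0.5):
    # general second-order Runge-Kutta step (7.12)-(7.13) with parameter alpha
    h = (b - a) / N
    x = a + np.arange(N + 1) * h
    y = np.zeros(x.size)
    y[0] = y0
    for k in range(N):
        k1 = fcn(x[k], y[k])
        k2 = fcn(x[k] + alpha * h, y[k] + alpha * h * k1)
        y[k + 1] = y[k] + h * (k1 * (1 - 1 / (2 * alpha)) + k2 / (2 * alpha))
    return x, y

Setting alpha to 1/2, 1, or 2/3 reproduces the corrected Euler, modified Euler, and
Heun methods respectively.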
As mentioned above, the local truncation error for all these Runge–Kutta methods is
O(h³), yet they are called "second-order" methods. The reason for this is that by the
time we have computed our approximate value of y(b) = y(a + Nh), we have committed
N of these local truncation errors (just like we saw with Euler's method). The effect is
that the global truncation error is of order O(N h³), which is O(h²) since N h = b − a
is constant.
It follows that reducing the steplength by a factor of 1/2 should now result in
approximately a 75% improvement in the final answer. That is the error should be
reduced by a factor of (about) 1/4. We shall verify this with examples.
For the Corrected Euler (midpoint) method (7.14) we obtain the results below.
The true solution (y = exp(2x³ − x)) is tabulated for comparison.
The other two methods yield the results tabulated here in somewhat less detail.
All of these methods are easily programmed. (See the Exercises.) The values with
only 2 steps may look somewhat pessimistic at first, but compare with the forward
Euler method, which took roughly 8 steps to get to nearly the same accuracy.
Example 3 Apply the three Runge–Kutta methods (7.14)–(7.16) to the same initial
value problem with N = 10 and N = 100. Compare their errors and verify that the
global truncation error is O(h²).
For this particular example, the modified Euler method appears to be performing
somewhat better than the other two. The errors at x = 1 are
For comparison, the error in Euler's method for N = 10 is 0.9358, meaning second-order
Runge–Kutta gives a significant improvement.
For N = 100, we of course do not tabulate all the results. The errors at x = 1 are
now
We turn now to the classical fourth-order Runge–Kutta method, RK4, which takes the form

yn+1 = yn + (h/6)(k1 + 2 k2 + 2 k3 + k4)    (7.17)

where

k1 = f(xn, yn)
k2 = f(xn + h/2, yn + h k1/2)
k3 = f(xn + h/2, yn + h k2/2)
k4 = f(xn + h, yn + h k3)
The derivation of this result, and of the fact that its global truncation error is O(h⁴),
are not included here. The basic idea is again that of obtaining approximations which
agree with more terms of the Taylor expansion.
Formula (7.17) can also be viewed as an approximate application of Simpson's rule
to the integration of the slope over [xn, xn+1]. Here k1 is the slope at xn, k2 and k3
are both estimated slopes at the midpoint xn + h/2, and k4 is then an estimate of this
slope at xn + h = xn+1. With this interpretation, (7.17) is equivalent to Simpson's rule
where the value at the midpoint is replaced by the average of the two estimates. This
gives insight into the truncation error: recall that we already saw that Simpson's rule
has an error of order O(h⁴). Moreover, the second-order Runge–Kutta methods (7.14)
and (7.15) have errors of the same order as their corresponding integration rules.
Program Python code for the classical Runge–Kutta fourth-order method RK4
(7.17).
import numpy as np

def RK4(fcn, a, b, y0, N):
    """
    Solve y' = f(x, y) in `N` steps using
    fourth-order Runge-Kutta with initial
    condition y[a] = y0.
    """
    h = (b - a) / N
    x = a + np.arange(N + 1) * h
    y = np.zeros(x.size)
    y[0] = y0
    for k in range(N):
        k1 = fcn(x[k], y[k])
        k2 = fcn(x[k] + h / 2, y[k] + h * k1 / 2)
        k3 = fcn(x[k] + h / 2, y[k] + h * k2 / 2)
        k4 = fcn(x[k] + h, y[k] + h * k3)
        y[k + 1] = y[k] + h * (k1 + 2 * (k2 + k3) + k4) / 6
    return x, y
Example 4 Apply RK4 to our usual example with both N = 10 and N = 100.
The results show much better accuracy than the previous approaches; the error
in y(1) is 4.2884 × 10⁻⁴. With N = 100, this error is reduced to only 5.1888 × 10⁻⁸,
which reflects the reduction by a factor of 10⁴ that we should anticipate for a
fourth-order method when the number of steps is increased by a factor of 10.
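For instance, the N = 10 computation and its error at x = 1 can be obtained with a
call such as the following (a minimal sketch reusing the RK4 function and numpy
import above):

f = lambda x, y: (6 * x**2 - 1) * y        # right-hand side of (7.5)
x, y = RK4(f, 0, 1, 1, 10)
err = abs(y[-1] - np.exp(2 * 1**3 - 1))    # compare with the true value exp(2x^3 - x) at x = 1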
Higher-order Runge–Kutta methods can give high-accuracy answers with small
numbers of steps for a wide range of differential equations. There are however sit-
uations when even these methods are insufficient. The so-called stiff differential
equations are one such case.
Exercises
Exercises 1–5, and 8 concern the differential equation
y' = x/y
1. For the initial condition y (0) = 3, use the corrected Euler’s method with steps
h = 1, 1/2 and 1/4 to approximate y (1) .
2. Repeat Exercise 1 for the modified Euler and Heun’s methods.
3. Write a script to implement the three second-order Runge–Kutta methods we
have discussed. Use them to solve the initial value problem of Exercises 1 and
2 over [0, 4] with N = 10, 20, 50, 100 steps.
4. Tabulate the errors in the approximate values of y(4) in Exercise 3. Verify that
their errors appear to be O(h²). Which of the methods appears best for this
problem?
5. Repeat Exercise 3 for different initial conditions: y (0) = 1, 2, 3, 4, 5 and plot
the solutions for N = 100 on the same axes.
6. Show that if f (x, y) is a function of x alone, then the modified Euler method is
just the trapezoid rule with step h.
7. Use the classical Runge–Kutta RK4 method with steplengths h = 10^−k for k =
1, 2, 3 to solve the initial value problem y' = x + y² with y(0) = 0 on [0, 1].
Tabulate the results for x = 0, 0.1, 0.2, . . . , 1 and graph the solutions.
8. Repeat Exercise 7 for the initial value problem of Exercise 1. Verify that the
errors are of order O(h⁴).
9. Show that if f (x, y) is a function of x alone, then RK4 is just Simpson’s rule
with step h/2.
10. Solve the differential equation y' = −x tan y with y(0) = π/6 for x ∈ [0, 1].
Compare your solution at the points between 0 and 1 using increments of 0.1
with the values obtained using RK4 with h = 0.1 and h = 0.05.
11. Repeat Exercise 9 from Sect. 7.1 using the methods from this section. How do
your errors compare?
12. Star Trek fans may remember when the cute, fuzzy Tribbles quickly became
trouble on the starship Enterprise. One Tribble was trapped in the ship, and a
single Tribble produces a litter of 10 Tribbles every 12 h. Mr. Spock
claimed that after 3 days, there were already 1,771,561 Tribbles.
7.3 Multistep Methods

The methods considered next are like the Runge–Kutta methods in that they use
more than one point to compute the estimated value of yk+1 . They differ in that all
the information is taken from tabulated points.
The simplest example is based on a second-order Taylor expansion
yn+1 = yn + h y'n + (h²/2) y''n = yn + h f(xn, yn) + (h²/2) y''n.    (7.18)
Using a backward difference approximation to the second derivative, y''n ≈ (fn − f_{n−1})/h
where fk = f(xk, yk), gives the two-step Adams–Bashforth formula

yn+1 = yn + (h/2)(3 fn − f_{n−1})    (7.19)

More generally, Adams methods have the form

yn+1 = yn + h ∑_{k=0}^{N} βk f_{n+1−k}    (7.20)
The coefficients βk are chosen to give the maximum order of agreement with the
Taylor expansion. Like the Runge–Kutta methods, these Adams methods can be
viewed as the application of numerical integration rules but differ in that nodes are
incorporated from outside the immediate interval. The cases where β0 = 0 are the
general N-step Adams–Bashforth methods which provide explicit formulas for yn+1.
An important observation in (7.20) is that since the coefficient β0 multiplies the
slope f_{n+1} = f(x_{n+1}, y_{n+1}), then if β0 ≠ 0, the right-hand side of (7.20) depends
on the very quantity yn+1 which we are trying to approximate. This gives an implicit
formula for yn+1 which could be solved iteratively (perhaps with a nonlinear solver
that we covered earlier). Adams methods with β0 ≠ 0 are known as Adams–Moulton
methods.
Let’s derive the Adams–Moulton method which agrees with the Taylor expansion
up to third order. We start with
y(xn+1) ≈ yn + h y'n + (h²/2) y''n + (h³/6) y'''n    (7.21)
and can then use difference approximations to the higher derivatives. Since the
Adams–Moulton formulas are implicit, we can use the (better) symmetric approxi-
mations to the derivatives:
y''n ≈ (y'_{n+1} − y'_{n−1})/(2h) = (f_{n+1} − f_{n−1})/(2h)

and

y'''n ≈ (y'_{n+1} − 2 y'n + y'_{n−1})/h² = (f_{n+1} − 2 fn + f_{n−1})/h²
(Recall these approximations from the Numerical Calculus section.) Substituting them
into (7.21) and collecting terms gives the two-step Adams–Moulton formula

yn+1 = yn + (h/12)(5 f_{n+1} + 8 fn − f_{n−1})    (7.22)

The corresponding one-step Adams–Moulton formula,

yn+1 = yn + (h/2)(f_{n+1} + fn)    (7.23)

is just the trapezoid rule applied to the slope.
When it comes to implementation, a user has choices about how to get multi-step
methods started. To use the two-step Adams–Bashforth formula (7.19) to obtain y1
appears to require knowledge of y−1 which would not be available. For higher-order
methods using more steps, even more points are needed. The usual solution to this
problem is to take a small number of steps of a Runge–Kutta method of the desired
order to generate enough values to allow the Adams method to proceed.
As an example, the modified Euler method (a second-order Runge–Kutta method)
could generate y1 after which the second-order (two-step) Adams–Bashforth method
would be used to generate y2 , y3 , . . .. For a fourth order Adams method, one could
use RK4 to generate y1 , y2 , y3 which is enough points to continue with the Adams
method.
In terms of computational efficiency, once the Adams methods are started, subse-
quent points are generated without the need for any of the intermediate points required
by the Runge–Kutta methods. For example, the second-order Adams–Bashforth
method would use only about half the computational effort of a second-order Runge–
Kutta method with the same basic steplength. This is a general advantage of Adams
methods.
Example 5 Use the two-step Adams–Bashforth method to solve the initial value
problem y' = (6x² − 1)y; y(0) = 1 on [0, 1] with h = 1/4.
x y f
0 1 −1
1/4 0.8164 −0.5103
1/2 0.8164 + (1/8) (3 (−0.5103) − (−1)) = 0.7501 0.3750
3/4 0.7501 + (1/8) (3 (0.3750) − (−0.5103)) = 0.9545 2.2669
1 0.9545 + (1/8) (3 (2.2669) − 0.3750) = 1.7577 8.7884
Example 6 Use the Adams–Bashforth two-step method to solve the same initial
value problem as in Example 5 using N = 10 steps.
import numpy as np

def ab2(fcn, a, b, y0, N):
    """
    Solve y' = f(x, y) on [a, b] in `N` steps using
    two-step Adams-Bashforth, with initial
    condition y[a] = y0.
    """
    h = (b - a) / N
    x = a + np.arange(N + 1) * h
    y = np.zeros(x.size)
    f = np.zeros_like(y)
    y[0] = y0
    f[0] = fcn(x[0], y[0])
    # Use modified Euler for first step.
    k1 = f[0]
    k2 = fcn(x[0] + h, y[0] + h * k1)
    y[1] = y[0] + h * (k1 + k2) / 2
    f[1] = fcn(x[1], y[1])
    # Use two-step formula for remaining points.
    for k in range(1, N):
        y[k + 1] = y[k] + h * (3 * f[k] - f[k - 1]) / 2
        f[k + 1] = fcn(x[k + 1], y[k + 1])
    return x, y
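The second column of Table 7.1 below can then be produced with a call such as
(a minimal sketch reusing the right-hand side from Example 1):

x, y = ab2(lambda x, y: (6 * x**2 - 1) * y, 0, 1, 1, 10)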
Table 7.1 Results for Adams methods for the initial value problem of Example 6
x        AB2, N = 10   ModEuler, N = 10   AB2, N = 20   ABM23, N = 10
0 1.0000 1.0000 1.0000 1.0000
0.1000 0.9077 0.9077 0.9063 0.9077
0.2000 0.8338 0.8297 0.8310 0.8329
0.3000 0.7844 0.7778 0.7805 0.7828
0.4000 0.7648 0.7557 0.7599 0.7628
0.5000 0.7824 0.7690 0.7757 0.7800
0.6000 0.8496 0.8282 0.8400 0.8468
0.7000 0.9908 0.9531 0.9756 0.9878
0.8000 1.2549 1.1824 1.2288 1.2525
0.9000 1.7441 1.5936 1.6956 1.7455
1.0000 2.6850 2.3484 2.5886 2.7022
The choice of the modified Euler method to generate y1 (or y[1] in the Python code)
is somewhat arbitrary; any of the second-order Runge–Kutta methods could have
been used.
The results are included in the second column of Table 7.1. For comparison,
the third column of Table 7.1 shows the results obtained using the modified Euler
method (Example 3) with the same number of steps. These two methods have global
truncation errors of the same order, O(h²). The explicit Adams–Bashforth method
appears much inferior. But we must note that each step of AB2 requires only one
new evaluation of the right-hand side, whereas the modified Euler method (or any
second-order Runge–Kutta method) requires two such evaluations per step.
It would therefore be fairer to compare the two-step Adams–Bashforth method
using N = 20 with the modified Euler method using N = 10. These results form
the third set of output. As expected the error is reduced by (approximately) a factor
of 1/4 compared to N = 10. Although the error is still greater than that for the
modified Euler method, it is now comparable with those for the other second-order
Runge–Kutta methods in Example 3.
Until now we haven't discussed the role of Adams–Moulton methods. Since they
are implicit, they require more computational effort than the Adams–Bashforth meth-
ods. A practical approach is to use them in conjunction with Adams–Bashforth meth-
ods as predictor-corrector pairs. The basic idea is that an explicit method is used
to estimate yn+1 , after which this estimate can be used in the right-hand side of
the Adams–Moulton (implicit) formula to obtain an improved estimate. Thus the
Adams–Bashforth method is used to predict yn+1 and the Adams–Moulton method
is then used to correct this estimate.
The code for the two-step Adams–Bashforth method is readily modified to add
the corrector step. The basic loop becomes
for k in range(1, N):
    y[k + 1] = y[k] + h * (3 * f[k] - f[k - 1]) / 2
    f[k + 1] = fcn(x[k + 1], y[k + 1])
    y[k + 1] = y[k] + h * (5 * f[k + 1] + 8 * f[k] - f[k - 1]) / 12
    f[k + 1] = fcn(x[k + 1], y[k + 1])
The results are shown as the final column of Table 7.1. The predictor-corrector
method provides smaller error than the modified Euler method using the same number
of points for this example. This is a fair comparison since both methods entail two
new evaluations of the slope per step.
To re-emphasize the connection between differential equations and numerical
integration formulas, let’s revisit the derivation of the coefficients for the Adams
methods. If f (x, y) has no explicit dependence on y then it follows that
yn+1 − yn = ∫_{xn}^{xn+1} f(x) dx = h ∫_0^1 f(xn + ht) dt.
The coefficients of the three-step Adams–Bashforth formula can therefore be found
by requiring that the quadrature rule

∫_0^1 F(t) dt ≈ β1 F(0) + β2 F(−1) + β3 F(−2)

is exact for all quadratic functions F. Imposing this condition for F(t) = 1, t, t² in
turn, we get the equations

F(t) = 1 :  β1 + β2 + β3 = 1
F(t) = t :  −β2 − 2β3 = 1/2
F(t) = t² :  β2 + 4β3 = 1/3
Adding the last two of these yields β3 = 5/12, and thence we obtain
β2 = −4/3, β1 = 23/12. Therefore the three-step Adams–Bashforth formula is

yn+1 = yn + (h/12)(23 fn − 16 f_{n−1} + 5 f_{n−2})    (7.24)

(Note that the coefficients must sum to one, providing an easy check.)
The three-step formula (7.24) has fourth-order local truncation error, so that its
global truncation error is of order O(h³).
The corresponding three-step (fourth-order) Adams–Moulton formula is based on
the quadrature formula
∫_0^1 F(t) dt ≈ β0 F(1) + β1 F(0) + β2 F(−1) + β3 F(−2)
9. Derive the fourth-order Adams–Bashforth formula. (This will use nodes 0, −1,
−2, −3 and be exact for cubic polynomials.)
10. Apply the third and fourth order Adams–Bashforth methods AB3 and AB4 to
the standard problem using N = 20 steps on [0, 4]. Compare the results with
those of the fourth-order Runge–Kutta method RK4 with N = 10.
11. Repeat Exercise 10, doubling the numbers of steps used twice (N = 40 and
N = 80 for the Adams–Bashforth methods). What are your estimates of the
orders of their truncation errors?
12. Implement the Adams predictor-corrector method. Compare its performance with
that of the fourth-order Runge–Kutta method for the problem y' = 3x²y; y(0) = 1
on [0, 1]. Use 10 and then 20 steps and explain what you see.
Calculate the ratio of successive errors and explain what you see.
13. Use the predictor-corrector methods with AB3 and AB4 as predictors and AM4
as corrector. Use N = 20, 40, 80. Compare the results with those for RK4 using
N = 10, 20, 40.
7.4 Systems of Differential Equations

We now focus on systems of differential equations, which are used to model interde-
pendent scenarios. Models that couple quantities that are interacting and changing
over time are often approached mathematically in this way. The first model we dis-
cussed in this text, that of the spread of a disease, is an example where three popula-
tions interact and change; susceptible individuals, infected individuals, and recovered
individuals (often called the SIR model). Given an initial population containing a
certain number of sick individuals, one can predict how those three populations are
changing in time under certain assumptions about the rates that the disease spreads
and people recover. Another common model that uses systems of differential equa-
tions is the predator-prey model from population dynamics. We provide an example
of this shortly.
Consider the problem of solving a system of first-order initial value problems
such as
y' = f(x, y),  y(x0) = y0,    (7.26)

which is written in vector form. Here y is an n-dimensional vector function of x, so
that y : ℝ → ℝⁿ, and f is an n-dimensional vector-valued function of n + 1 variables,
f : ℝⁿ⁺¹ → ℝⁿ. Note that we are making no distinction here between the function
and its values. The meaning should hopefully be clear from the context.
The techniques available for initial value problems of this sort are essentially the
same as those for single differential equations that we have already discussed. The
primary difference is that, at each step, a vector step must be taken.
Since the classical fourth-order Runge–Kutta method RK4 is both straightforward
to program and efficient, let's concentrate on its use for solving systems. The Python
code below is almost identical to that in Sect. 7.2. The only difference is the use of
vector quantities for the initial conditions and in the main loop.
import numpy as np

def RK4sys(fcn, a, b, y0, N):
    """
    Solve y' = f(x, y) on the interval [a, b] in
    `N` steps for a system of differential
    equations using fourth-order Runge-Kutta with
    initial condition y[a] = y0 where `y0` is a
    vector.
    """
    h = (b - a) / N
    x = a + np.arange(N + 1) * h
    y = np.zeros((x.size, y0.size))
    y[0, :] = y0
    for k in range(N):
        k1 = fcn(x[k], y[k, :])
        k2 = fcn(x[k] + h / 2, y[k, :] + h * k1 / 2)
        k3 = fcn(x[k] + h / 2, y[k, :] + h * k2 / 2)
        k4 = fcn(x[k] + h, y[k, :] + h * k3)
        y[k + 1, :] = y[k, :] + h * (k1 + 2 * (k2 + k3) + k4) / 6
    return x, y
Example 8 Predator-Prey model. Let R(t), F(t) represent the populations of rabbits
and foxes at time t in a closed ecosystem. We assume that they behave like a
simple food chain in that the rabbits feed on grass and the foxes feed on the rabbits.
A simple predator-prey model for this system can be used to predict the sizes of the
two populations. It has the form

R' = αR − βRF
F' = −γF + δRF
Here α represents the rate at which the rabbit population would grow in the absence
of foxes, γ is the rate at which the fox population would decline in the absence of
the rabbits. The two nonlinear terms represent the interaction of these two popula-
tions. All the coefficients are assumed positive, with the attached signs implying the
“direction” of their influence. With t measured in months, find the populations of
rabbits and foxes after 4 months given initial populations R(0) = 500, F(0) = 200,
and with α = 0.7, γ = 0.2, β = 0.005, δ = 0.001. (The first two of these imply that
the unchecked rabbit population would more than double in one month, and the
unchallenged fox population declines about 20% per month.)
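The right-hand side is supplied as a Python function predprey returning the vector
of slopes. A minimal sketch, consistent with the model form and parameter values
above:

def predprey(x, y):
    # y[0] = rabbits R, y[1] = foxes F
    dR = 0.7 * y[0] - 0.005 * y[0] * y[1]
    dF = -0.2 * y[1] + 0.001 * y[0] * y[1]
    r = np.array([dR, dF])
    return r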
where y[0] represents the rabbit population R, and y[1] represents the foxes F.
The solution is then computed by the commands
>>> RF0 = np.array([500, 200])
>>> sol_x, sol_y = RK4sys(predprey, 0, 4, RF0, 100)
A meaningful visualization of the solution is obtained by plotting the rabbit pop-
ulation against the fox population at the corresponding time:
>>> plt.plot(sol_y[:, 1], sol_y[:, 0], 'k-')
The resulting plot is shown in Fig. 7.6. It is evident that the rabbit population
declines steadily while the foxes multiply until their population peaks at about 260.
At this time the foxes appear to have “won” and both populations decrease.
A second-order initial value problem y'' = f(x, y, y') with y(x0) = y0, y'(x0) = y'0
can be rewritten as a first-order system by setting

u1 = y,  u2 = y'

so that

u1' = u2,            u1(x0) = y0
u2' = f(x, u1, u2),  u2(x0) = y'0
which could now be solved in just the same way as was used for the predator-prey
model in Example 8.
As an example, consider the equation y'' = −y with initial conditions y(0) = 1,
y'(0) = 1. In system form this becomes

u1' = u2,   u1(0) = 1
u2' = −u1,  u2(0) = 1
This system can now be solved using the same approach as above. The true solution
to this equation is y = sin(x) + cos(x). Simply plotting the approximation and the
true solution together would show no apparent differences between the two because
the error is very small. Indeed, the error in the solution at the first step is
roughly 10⁻¹⁰ and after 100 steps, although it has grown due to the accumulation of
error at each step, it is only about 10⁻⁸.
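As a minimal sketch (taking the interval to be [0, 2π], which is an assumption here),
the system can be passed to RK4sys as follows:

fcn = lambda x, u: np.array([u[1], -u[0]])              # u1' = u2, u2' = -u1
x, u = RK4sys(fcn, 0, 2 * np.pi, np.array([1.0, 1.0]), 100)
err = np.abs(u[:, 0] - (np.sin(x) + np.cos(x)))         # error in y = u1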
Note that the examples used so far have been initial value problems. For higher-
order differential equations it is common to be given boundary conditions at two
distinct points. A technique for solving boundary value problems is the subject of
the next section.
Exercises
1. Solve the initial value system y1' = x + y1 − 2y2, y2' = 2y1 + y2 with y(0) =
(0, 0) over the interval [0, 4] using 200 steps of the fourth-order Runge–Kutta
method. Plot the functions y1, y2 on the same axes.
2. Solve the same predator-prey equations as in Example 8 over longer time spans
[0, 10] and [0, 20] using h = 0.1 in both cases.
3. Vary the starting conditions for the predator-prey problem of Exercise 2. Fix
B (0) = 500 and use A (0) = 100, 200, 300 over the intervals [0, 20] , [0, 20]
and [0, 30] respectively. Plot the solutions on the same axes.
4. As mentioned in the beginning of this section, the spread of a disease (often
called an SIR model) can be modeled with a system of differential equations.
Investigate these ideas and choose a disease to study. Use the ideas in this section
to model how the populations are changing over some time span. Consider a
range of different initial conditions and model parameters and explain what you
see.
5. Solve the Bessel equation x²y'' + xy' + (x² − 1)y = 0 with initial conditions
y(1) = 1, y'(1) = 0 over the interval [1, 15] using N = 280 steps. Plot the
computed solution.
6. Solve the second-order nonlinear equation y'' = 4xy' + 2(1 − 2x²)y over the
interval [0, 2] using N = 100 steps with the initial conditions y(0) = 0, y'(0) =
1. Show that the true solution is y = x e^(x²) and plot the error. Explain what you
see.
7. Solve the second-order nonlinear equation y'' = 2yy' over the interval [0, 1.5]
using N = 150 steps for the following initial conditions:
(a) (y(0), y'(0)) = (0, 1)  (b) (y(0), y'(0)) = (0, 2)  (c) (y(0), y'(0)) = (1, 1)
(d) (y(0), y'(0)) = (1, −1).
Plot all the solutions on the same axes.
8. A projectile is fired with initial speed 1000 m/s at angle α0 from the horizontal.
A good model for the path uses an air-resistance force proportional to the square
of the speed. This leads to the equations
x'' + c x' √((x')² + (y')²) = 0,   y'' + c y' √((x')² + (y')²) = −g
s' = Π − βsz − δs
z' = βsz + γr − αsz
r' = δs + αsz − γr
Here, susceptible people can die via natural causes at a death rate δ; they are then
taken out of the human population and added to the 'removed' population.
Humans in the removed class may become zombies if they are resurrected via
the parameter γ. Humans can become zombies if they encounter a zombie via the
interaction parameter β. Zombies leaving the model and entering the removed
class because a human won a battle with a zombie (by chopping the head off)
is accounted for with the parameter α. Π is a birth rate of humans, but we will
assume that this attack is happening quickly enough that this can be set to zero.
(a) What are some assumptions that were made to arrive at this model?
(b) Solve this system using RK4 for systems to predict these populations after 5
days. Assume an initial human population of 500, 1 zombie, and 0 removed
people. For model parameters, use Π = 0, α = 0.005, β = 0.0095, γ =
0.0001, δ = 0.0001.
(c) Provide a plot of the zombies and the humans over time.
(d) Play with the model parameters and provide some illustrative plots. Are we
doomed?
7.5 Boundary Value Problems: Shooting Methods

To motivate the need for the numerical technique in this section, we start with an
example and then generalize the solution approach. We consider using a model for
projectile motion to determine a launch angle and hit a specific target. This example
also demonstrates how several of the ideas in this text can be combined to propose
a solution to a real-world problem.
Example 10 Find the angle α at which a projectile must be launched from the origin
with an initial speed of 500 m/s to hit a target with coordinates (1000, 150) in meters.
Represent this scenario mathematically and outline an approach to solve this problem
numerically.
One approach to modeling this while including air resistance has the drag force
proportional to the square of the speed. The vector differential equation is then of
the form
r'' + c v r' = −g j
where v = |r | is the speed of the projectile. Here c is the constant of proportionality
which will depend on many physical parameters including the density and viscosity
of the medium (in this example, air), the size, mass, cross-section, and aerodynamic
properties of the projectile. This is subject to initial conditions r(0) = (0, 0) and
r'(0) = 500 (cos α, sin α).
The resulting nonlinear equation cannot be solved exactly, and we require numerical
methods. Using ideas from the previous section, we can convert this higher-order
system to a system of first-order differential equations. Writing the vector
differential equation in component form, we have
x'' + c √((x')² + (y')²) x' = 0
y'' + c √((x')² + (y')²) y' = −g
Each of these is a second order equation and can be converted to a system of first
order equations. Let’s define
u1 = x,  u2 = x',  u3 = y,  u4 = y'

so that

u1' = u2
u2' = −c u2 √(u2² + u4²)
u3' = u4
u4' = −c u4 √(u2² + u4²) − g.
For a given launch angle α, the initial conditions are known and so the system may
be solved, for example using RK4.
However, we have two remaining difficulties; we do not know α and we do not
know the “flight time” for which the solution must be computed.
Note that for a given α we could compute the solution until we have x = u 1 >
1000 using a small time step. Next, local interpolation (linear would suffice) could
be used to estimate the height u 3 when u 1 = 1000. Let’s denote this value of u 3
as φ(α). So, ultimately, our objective is to then solve the nonlinear equation of one
variable
φ(α) = 150,
which could be done for example, with the secant method.
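A minimal sketch of such a function φ, using the RK4sys solver from the previous
section with illustrative values for the drag constant c, the time horizon T, and the
number of steps N (none of which are specified above):

import numpy as np

def phi(alpha, c=0.001, v0=500.0, T=10.0, N=1000):
    # c, T and N are illustrative placeholder values
    g = 9.81
    def fcn(t, u):
        speed = np.sqrt(u[1]**2 + u[3]**2)
        return np.array([u[1], -c * u[1] * speed, u[3], -c * u[3] * speed - g])
    u0 = np.array([0.0, v0 * np.cos(alpha), 0.0, v0 * np.sin(alpha)])
    t, u = RK4sys(fcn, 0, T, u0, N)
    k = np.argmax(u[:, 0] > 1000)                       # first step with x > 1000
    frac = (1000 - u[k - 1, 0]) / (u[k, 0] - u[k - 1, 0])
    return u[k - 1, 2] + frac * (u[k, 2] - u[k - 1, 2])  # interpolated height at x = 1000

Solving φ(α) = 150 can then be handed to the secant method.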
Embedded within the proposed solution, this example illustrates the basic idea
of shooting methods for the solution of a two-point boundary-value problem. The
idea is to embed the solution to a related initial value problem in an equation-solver.
The resulting equation is then solved so that the final solution to the initial-value
problem also satisfies the boundary conditions. The specific details will of course
vary according to the particular type of equation and the nature of the boundary
conditions.
In what follows, the focus is on the solution of a second-order differential equation
with boundary conditions which specify the values of the solution at two distinct
points. That is, the problem is defined as
y'' = f(x, y, y')    (7.29)

subject to the boundary conditions

y(a) = ya,  y(b) = yb    (7.30)

The shooting method replaces this with the related initial value problem

y'' = f(x, y, y'),  y(a) = ya,  y'(a) = z    (7.31)

which is assumed to have a solution for each value of the unknown parameter z. Our objective is to
find the appropriate value of z so that the solution of (7.31) hits the “target” value
y (b) = yb .
We need to express this as an equation for the unknown z. Let the solution of
(7.31) for any value of z be denoted by y(x; z) and define the function F(z) by

F(z) = y(b; z) − yb    (7.32)

We then seek a solution of the equation

F(z) = 0    (7.33)

for then y(b; z) = yb, and so the function y(x; z) is a solution of the differential
equation (7.29) which satisfies the boundary conditions (7.30).
To this end, we have a nonlinear equation (7.33) for z that depends on output
from a differential equation solver. One approach is to use the secant method since
it does not require any derivative information about F. The secant method requires
two initial guesses (or estimates) for z, let's call them z0 and z1. Then we would need
to solve the initial value problem (7.31) for each of them. These yield F (z 0 ) and
F (z 1 ) after which we can apply the secant method (Sect. 5.5) to generate our next
estimate z 2 . The secant iteration can then proceed as usual – the only difference
being the need to solve a second-order initial-value problem on each iteration.
This approach exemplifies the need for efficient techniques for solving some
of the fundamental problems arising in the real world. The need for an efficient
equation solver is much easier to appreciate when evaluating the “function” entails
the accurate solution of a system of differential equations (or in many real-world
application, large systems of partial differential equations that comprise an off-the-
shelf industrial simulation tool). Keep in mind, the underlying differential equation
likely needs to be solved repeatedly for different initial conditions, meaning the need
for efficient methods of solving differential equations becomes apparent as well.
The process is illustrated using examples below (the projectile motion application
will be revisited in the exercises). The first iteration is described in considerable
detail. Later we show how to set up the function F as a Python script so that the
secant method can be used just as if it were a conventional function defined by some
algebraic expression (but it is actually treated like a “black-box”).
Example 11 Solve the Bessel equation x²y'' + xy' + (x² − 1)y = 0 with boundary
conditions y(1) = 1, y(15) = 0 by the shooting method.
Taking initial guesses z0 = 0 and z1 = 1 for the unknown initial slope y'(1), and
solving the corresponding initial value problems, gives F(z0) = 0.2694 and
F(z1) = 0.5356. The first secant iteration is then

z2 = z1 − F(z1) (z1 − z0)/(F(z1) − F(z0))
   = 1 − 0.5356 (1 − 0)/(0.5356 − 0.2694)
   = −1.012021
The next iteration begins with the solution of the differential equation with initial
conditions y(1) = 1, y'(1) = −1.012021, which will give F(z2) = −9.5985 × 10⁻⁵.
Note that this is the solution to reasonable accuracy using just one iteration of
the secant method. The two initial guesses and the final solution are plotted together
in Fig. 7.7 along with the target point. The final curve agrees with the target and is
marked with a ‘*’.
This example is misleadingly very successful, which is not always the case. For
that example, Bessel’s equation is a linear differential equation even though its coef-
ficients are nonlinear functions of the independent variable. The general solution of
such a differential equation is therefore a linear combination of two linearly inde-
pendent solutions. (In this particular case, these are usually written as the Bessel
functions of the first and second kinds J1 and Y1 .) It follows that y (15) is just a lin-
ear combination of J1 (15) and Y1 (15) . Details of Bessel functions are not important
to this argument, what is important is that the equation F (z) = 0 is then a linear
equation. Since the secant method is based on linear interpolation, it will find the
solution in only one iteration in this case.
In general we can’t expect to see such fast convergence. However, the procedure
is basically the same and we demonstrate the algorithm further on the next example;
a nonlinear differential equation.
We take as initial guesses y'(0) = 0, which yields the solution y(x) = 1, and
y'(0) = −1.
To get started, set the initial condition vector to [1, z0] with z0 = 0 and use RK4 for
systems to solve the problem. This gives F(z0) = 2. Next, repeat the process with
z1 = −1 and get the approximate value y(200) ≈ −0.6895. This gives F(z1) =
0.3105.
To complete the first secant iteration, we set
z2 = z1 − F(z1)(z1 − z0)/(F(z1) − F(z0)) = −1.1838
The solution of the initial value problem corresponding to this initial slope is then
computed.
This procedure can be automated by having a script that calculates F (z) for any
initial slope y (0) = z. The steps used above show how to achieve that. Note that
a function evaluation requires a call to the RK4sys solver. With that subroutine in
hand, the regular secant method can be applied to the equation F (z) = 0, giving
z = −1.3821. The actual error at the right-hand end of the interval is about 5 × 10−9 .
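This procedure can be coded directly; a minimal sketch using the RK4sys solver, with
the equation y'' = 2yy' and the boundary data y(0) = 1, y(1) = 0 of Exercise 3 below
taken purely for illustration:

import numpy as np

def F(z, fcn, a=0.0, b=1.0, ya=1.0, yb=0.0, N=200):
    # a, b, ya, yb are illustrative placeholder values for the boundary data
    x, u = RK4sys(fcn, a, b, np.array([ya, z]), N)
    return u[-1, 0] - yb                    # computed y(b) minus the target value

fcn = lambda x, u: np.array([u[1], 2 * u[0] * u[1]])   # y'' = 2*y*y' in system form
z0, z1 = 0.0, -1.0
F0, F1 = F(z0, fcn), F(z1, fcn)
for _ in range(8):                          # secant iteration on F(z) = 0
    z2 = z1 - F1 * (z1 - z0) / (F1 - F0)
    z0, F0 = z1, F1
    z1, F1 = z2, F(z2, fcn)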
Note that the boundary conditions could be specified in a number of ways. For
example if we had specified the slope of the solution at a different value, we would
simply have compared the computed slope with that value.
Unfortunately, the method may be sensitive to the initial conditions. Recall the
motivating example, finding the launch angle for a projectile to hit a specific target.
In the second-order air-resistance model, the differential equations are a pair of nonlin-
ear second-order equations which we can rewrite as a system of four equations. The
major difference is that the boundary conditions specify a position whereas the inde-
pendent variable is time. That is, the boundary conditions are given at an unspecified
value of the independent variable. This problem can still be solved with the shooting
method, but one needs to check the y-coordinate at the point where the x-coordinate
is closest to the desired target and use that to measure how far we are off-target. See
the exercises for more details.
Exercises
1. Solve the Bessel equation x²y'' + xy' + (x² − 4)y = 0 with the boundary con-
ditions y(1) = 1, y(10) = 1.
2. Solve the Bessel equation x²y'' + xy' + (x² − 4)y = 0 with the boundary con-
ditions y(1) = 10, y'(10) = −1. (This is similar to Exercise 1 except that the
right-hand boundary condition is given in terms of the slope. Note that the other
component of the solution to the system gives the values of the slope.)
3. Solve the equation y'' = 2yy' with boundary conditions (a) y(0) = 0, y(1) = 1,
(b) y(0) = −1, y(1) = 1, (c) y(0) = 1, y(1) = 0.
4. A projectile is launched from the origin with initial speed 200 m/s. Find the
appropriate launch angle so that the projectile hits its target which is at coordi-
nates (100, 30). Assume the air resistance is proportional to the square of the
speed with the constant 0.01 so that the differential equations are
x'' + 0.01 x' √((x')² + (y')²) = 0,   y'' + 0.01 y' √((x')² + (y')²) = −g
For a given launch angle, compute the trajectory over a period of 10 s using a
time-step of 0.1 s. Find the x-value in your solution that is closest to 100. For
h (α) take the corresponding y-value and solve h (α) − 30 = 0 by the secant
method.
(typically the slope) as an unknown and then solve for that unknown such that the
final point hits the target. The methods of Chap. 5 can be combined with our initial
value problem methods to solve the resulting “equation”–one which we can certainly
not expect to be able to write down explicitly.
In looking back at numerical integration and linear equations, you were actually
already introduced to another approach to boundary value problems–specifically the
heat equation problems–using finite difference approaches to solve for all the inter-
mediate points. This observation serves at least two purposes: one is that there are
other approaches that we have not covered, and another is that the solution of differ-
ential equations is an appropriate final topic for this book because it relies on almost
every topic we have studied.
Final topic, yes–but certainly not the end of the story! As well as more advanced
study on the solution of ordinary differential equations in the manner we have dis-
cussed, the enormous issue of solving partial differential equations is the focus of
whole journals on current research. Many of the approaches also utilize optimization
techniques as well as all the topics we have discussed to address large complicated
models. We do not even try to catalog major fields within that realm as they are
simply too numerous! If you have been fascinated by some of what you’ve done in
this course, you should certainly consider further study, possibly including graduate
study as there is much important work still to be done.
Python has several functions available for solving ordinary differential equations
through the scipy.integrate module.
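The primary routine is solve_ivp. A minimal sketch of a call, using the example
problem (7.5) purely for illustration:

from scipy.integrate import solve_ivp

f = lambda x, y: (6 * x**2 - 1) * y    # right-hand side of (7.5), for illustration
sol = solve_ivp(f, (0, 1), [1.0])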
The output sol is an object containing various information about the solution.
Most importantly, the fields t and y contain the time points used in the solution
and the function values at these points, respectively.
There are optional arguments to this function to specify the time points at which
to evaluate the solution and various other details.
We can plot the solution found in this example as follows
>>> plt.plot(sol.t, sol.y.reshape(-1), 'k-')
>>> plt.show()
This generates the plot of y in Fig. 7.8. Note that the methods described here are from
SciPy version 1.0.0. Previous versions of SciPy use an older interface through the
function odeint.
Other functions SciPy’s integrate module has several other differential equation
solvers. Particularly, the two above solution methods are recommended for “non-
stiff” problems while two other methods Radau and BDF are intended for “stiff”
problems. The method LSODA can handle automatic detection of stiffness and
switching to an appropriate solver method. We have not defined what is meant by
stiffness here and we do not cover the details. We encourage you to explore the
SciPy documentation for further details.
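For example, selecting a stiff solver only requires changing the method argument
(a sketch reusing the illustrative problem above):

from scipy.integrate import solve_ivp

f = lambda x, y: (6 * x**2 - 1) * y
sol = solve_ivp(f, (0, 1), [1.0], method='BDF')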
Further Reading and Bibliography
1. Barlow JL, Bareiss EH (1985) On roundoff error distributions in floating-point and logarithmic
arithmetic. Computing 34:325–347
2. Buchanan JL, Turner PR (1992) Numerical methods and analysis. McGraw-Hill, New York
3. Burden RL, Faires JD (1993) Numerical analysis, 5th edn. PWS-Kent, Boston
4. Cheney EW, Kincaid D (1994) Numerical mathematics and computing, 3rd edn. Brooks/Cole,
Pacific Grove
5. Clenshaw CW, Curtis AR (1960) A method for numerical integration on an automatic computer.
Numer Math 2:197–205
6. Clenshaw CW, Olver FWJ (1984) Beyond floating-point. J ACM 31:319–328
7. Davis PJ, Rabinowitz P (1984) Methods of numerical integration, 2nd edn. Academic Press,
New York
8. Feldstein A, Goodman R (1982) Loss of significance in floating-point subtraction and addition.
IEEE Trans Comput 31:328–335
9. Feldstein A, Turner PR (1986) Overflow, underflow and severe loss of precision in floating-point
addition and subtraction. IMA J Num Anal 6:241–251
10. Hamming RW (1970) On the distribution of numbers. Bell Syst Tech J 49:1609–1625
11. Hecht E (2000) Physics: calculus, 2nd edn. Cengage, Boston
12. IEEE (2008) 754–2008 - IEEE standard for floating-point arithmetic. IEEE, New York
13. Kincaid D, Cheney EW (1991) Numerical analysis. Brooks/Cole, Pacific Grove
14. Knuth DE (1969) The art of computer programming, seminumerical algorithms, vol 2. Addison-
Wesley, Reading
15. Langtangen HP (2016) A primer on scientific programming with python. Springer, Berlin
16. Munz P et al (2009) When zombies attack: mathematical modeling of an outbreak of zombie
infection. In: Tchuenche JM, Chiyaka C (eds) Infectious disease modelling research progress,
pp 133–150
17. Olver FWJ (1978) A new approach to error arithmetic. SIAM J Num Anal 15:369–393
18. Schelin CW (1983) Calculator function approximation. Am Math Monthly 90:317–325
19. Skeel R (1992) Roundoff error and the patriot missile. SIAM News 25:11
20. Stewart J (2016) Calculus: early transcendentals, 8th edn. Cengage
21. Turner PR (1982) The distribution of leading significant digits. IMA J. Num. Anal. 2:407–412
22. Turner PR (1984) Further revelations on l.s.d. IMA J Num Anal 4:225–231
23. Volder J (1959) The CORDIC computing technique. IRE Trans Comput 8:330–334
24. Walther J (1971) A unified algorithm for elementary functions. AFIPS Conf Proc 38:379–385
25. Wilkinson JH (1963) Rounding errors in algebraic processes. Notes on Applied Science. HMSO,
London