MA261 Notes Part 1
Contents

1 Introduction
2 Getting Started
  2.1 Basic Concepts
    2.1.1 Convergence rates
    2.1.2 Conditioning (this subsection is not examinable)
    2.1.3 Floating point numbers (this subsection is not examinable)
  2.2 Some ODE Basics
    2.2.1 Solvability of the Initial Value Problem (this subsection is not examinable)
  2.3 Discrete Gronwall Lemma
  2.4 The Forward/Backward Euler method
    2.4.1 Convergence analysis
    2.4.2 Stability analysis
Chapter 1
Introduction
Numerical analysis is the mathematical study of algorithms for solving problems arising in many
different areas, e.g., physics, engineering, biology, statistics, economics, social sciences. In general
starting from some real world problem, the following steps have to be performed:
1. Modelling: the problem is formulated in mathematical terms, e.g., as differential equation.
In general the resulting problem cannot be solved analytically (without approximation).
2. Analysis: the mathematical model is analysed for example with respect to well-posedness,
e.g., existence and uniqueness of a solution, sensitivity to errors in the data. Also terms with
different importance in the model can be identified to possibly reduce the complexity of the
model.
3. Discretization: the problem is approximated by (a sequence of) finite dimensional prob-
lems. The discretization is chosen to maintain important properties of the analytical prob-
lem, e.g., that the density is positive.
4. Numerical analysis: the discretization is studied again with respect to well-posedness, but
most importantly the error between the solution of the finite dimensional problem and the
mathematical model is estimated and convergence of the numerical solution is established.
5. Implementation: the finite dimensional problems are solved using a computer program.
This can be a cyclic procedure where, for example, the analysis in step two influences the modelling, i.e., step one is refined. The numerical simulation can show that additional effects have to be taken into account, so the modelling has to be refined, and so on.
This module will focus on all these points for some simple settings to make you familiar with
central underlying ideas. The modelling techniques and the numerical schemes presented are an
important building block used for solving more complex problems. We will be focusing on problems
described by ordinary differential equations.
The following example demonstrates how the above steps are applied - you don’t need to under-
stand the mathematical details of each step!
Example 1. Consider the problem of a steel rope of length L > 1m clamped between two poles
which are 1m apart so that the rope is almost taut. Now the position of the rope is to be modelled
in the case where an acrobat is standing in the middle. A sketch of the problem is shown in
Figure 1.2.
1. Modelling: first we make the assumption that the rope can be represented as a function
y : [0, 1] → R. The shape of the rope is such that its bending energy E is minimal. For E
one finds (neglecting for example gravity) the formula
E(y) := (c/2) ∫_0^1 y'(x)^2 / √(1 + y'(x)^2) dx − ∫_0^1 f(x) y(x) dx .
[Figure 1.2: Sketch of the problem and a computed solution.]
Here c depends on the material of the rope and f is the load (the acrobat) on the rope. So
we seek y ∈ V := {v ∈ C^2((0,1)) : v(0) = v(1) = 0} so that E(y) ≤ E(v) for all v ∈ V.
Both the function f and the constant c have to be determined by measurements and therefore contain data errors.
This is a very complex problem. So we make a simplification and assume that the displacement of the rope is small, e.g., y' is small. Then we can replace E by
Ē(y) := (c/2) ∫_0^1 y'(x)^2 dx − ∫_0^1 f(x) y(x) dx .
2. Analysis: this problem has a unique solution. One can also show for example that y < 0 if f < 0, i.e.,
when the force is pointing downward, the displacement is also downwards along the whole
length of the rope. This matches our intuition.
3. Discretization: instead of approximating the function y at all points in [0, 1] we compute approximations y_i to y at the points x_i = ih for i = 1, . . . , N − 1 with h = 1/N. We can replace the second derivative
for n ≥ 1. We compute this sequence up to Y^P which is then our final approximation. It can be seen that Y^n → Y (n → ∞), but since we cannot compute an infinite number of iterates, we have a further termination error caused by stopping the computation after P steps.
4. Numerical analysis: the matrix A is regular and thus the discrete problem has a unique
solution. For v ∈ C^4((0, 1)) there exists a constant M > 0 so that for all x ∈ (0, 1) the following error estimate holds: |v''(x) − (1/h^2)(v(x + h) − 2v(x) + v(x − h))| ≤ M h^2. The same estimate holds for the discrete values Y:
max_{i=1,...,N−1} |ȳ(x_i) − y_i| ≤ C h^2 .
The following example is again a taster of things to come. It demonstrates how some simple manipulation can be used to simplify a model, reducing the number of parameters it depends on considerably and thus making the model much easier to analyse:
Example 2. A mass spring system with friction proportional to the velocity is modelled by the
second order ODE
μ x''(t) + β x'(t) + γ x(t) = 0 .
Here x(t) is the position of the (point) mass at time t, thus x' is the velocity and x'' the acceleration at time t. The three constants in the model are the mass μ > 0, the amount of friction β > 0, and the restoring force of the spring γ > 0. To make the problem well posed the initial position of the mass x(0) = x_0 and the initial velocity x'(0) = v_0 have to be prescribed.
There are different ways to derive this model, one of them is to start with Newton’s second law
μa = F (mass × acceleration = applied forces). We choose our coordinate system in such a way that x = 0 corresponds to the rest position of the spring, so x > 0 means the spring is stretched. The restoring force is assumed to be proportional to the amount of stretching s, so F_r = −γs. This is a modelling assumption; we could also have a nonlinear spring where the restoring force depends nonlinearly on the stretching, e.g., F = −ks^3. The force of friction is assumed to be directly proportional to the velocity of the mass x', so F_f = −βx'. As said before a = x'' and due to our choice of coordinate system s = x, so:
μ x''(t) = −β x'(t) − γ x(t) ,
which is the model stated above.
This model is linear and can be easily solved using the approach of characteristic polynomials
discussed in MA133:
x(t) = e^{−βt/(2μ)} (A cos(ωt) + B sin(ωt)) ,   with ω = √(4γμ − β^2)/(2μ) ,
where we made the assumption that β is small so that β^2 < 4γμ (the system is under damped). The constants A and B are determined from the initial conditions.
The problem seems to depend on three parameters, although we know from studying the exact solution that the type of behaviour of the system depends on the ratio between β^2 and 4γμ: if β^2/(4γμ) < 1 the system oscillates while for β^2/(4γμ) > 1 the system is over damped and will not oscillate at all. One modelling technique is to non-dimensionalize the model and in that step try to reduce the number of parameters and isolate the parameters that mainly determine the behaviour of the problem. To this end one first needs to fix the physical units of each part of the model, e.g., x could be measured in meters (m), the velocity x' could then be meters per second (m/s), and acceleration is (m/s^2). The mass μ could be in kilograms (kg) and (to make things fit) we assume that β has units kg/s and γ kg/s^2 (we will discuss this in detail later on). Now let us fix a typical time
scale T, a length scale L, introduce the scaled time τ = t/T, and rescale the position x(t) in the form χ(τ) = x(Tτ)/L. Using the chain rule we can easily see that χ'(τ) = x'(Tτ)T/L, χ''(τ) = x''(Tτ)T^2/L and thus
(μL/T^2) χ''(τ) + (βL/T) χ'(τ) + γL χ(τ) = 0 .
Note that χ, τ do not have any units (e.g. t, T both have some time unit like seconds and their ratio is unitless). We can now divide through by μL and multiply with T^2 to arrive at
χ''(τ) + (Tβ/μ) χ'(τ) + (T^2 γ/μ) χ(τ) = 0 ,
and note that the two remaining constants Tβ/μ, T^2 γ/μ are also unitless. We now have many different ways to choose T (note that the equation does not depend on our choice for L). We could choose T to make Tβ/μ = 1 or alternatively T^2 γ/μ = 1, e.g., T = √(μ/γ), which leads to a coefficient in front of the friction term of Tβ/μ = (β/μ)√(μ/γ) = √(β^2/(γμ)) =: ω^2. Our model thus reduces to
χ'' + ω^2 χ' + χ = 0 .
We are only left with the single factor ω^2 = √(β^2/(γμ)) and we can discuss the behaviour of the solution
to this model (or simulate it) depending on the size of this one parameter. The damping regime
now depends on ω^2 being less than or greater than one. After understanding the behaviour of this non-dimensionalized problem one can then look at the values of the parameters in the problem, e.g., μ, β, γ, to figure out which regime a given spring mass system belongs to. Of course in this simple
case we have not really learned anything new but the concept is more widely applicable.
Using Newton’s second law is one way of deriving the equations of motion for the mass. Another
approach is based on Hamiltonian dynamics which we will also briefly cover in this module. Let
us for now consider the frictionless case. Define the Hamiltonian
H(p, q) := (μ/2) q^2 + (γ/2) p^2
and consider a particle moving such that H(x(t), x'(t)) is constant, in other words (d/dt) H(x(t), x'(t)) = 0. Using the chain rule it is easy to see that
(d/dt) H(x(t), x'(t)) = μ x'' x' + γ x' x = x' (μ x'' + γ x) ,
so that H(x(t), x'(t)) is constant if and only if either x is stationary (i.e. x' = 0) or x solves the second order problem
μ x'' + γ x = 0 .
We looked a bit at the modelling aspects of this problem and did some analysis; we can now turn to discretizing the problem. In this case we have an exact solution, so looking at discretization
methods for this problem seems a bit pointless but the circumstance that we know what the solution
should look like allows us to study the behaviour of a given method much more easily and we can
deduce something for more complicated cases where we do not know the exact solution. Of course
this only makes sense if we assume that the method we are studying is applicable to more general
problems. In this module we will focus on methods for solving first order nonlinear systems, i.e.,
ODEs of the form
y'(t) = f(t, y(t))
where y : [0, T] → R^m for m ≥ 1. We can easily rewrite our mass spring system in that form by introducing the vector y(t) = (y_1(t), y_2(t)) = (x(t), x'(t)), so that y'(t) = (x'(t), x''(t)) = (x'(t), −(γ/μ)x(t)) = (y_2(t), −(γ/μ)y_1(t)), which is of the right form if we define f(y) = (y_2, −(γ/μ)y_1).
A simple approach to discretize y(t) is to look for approximations y_n ≈ y(t_n) where t_0 = 0 < t_1 < t_2 < · · · < t_N = T are some fixed points in time, for example t_n = nh with h = T/N. The derivative y'(t_n) can be approximated using a finite difference quotient, for example
y'(t_n) ≈ (y(t_{n+1}) − y(t_n))/h ≈ (y_{n+1} − y_n)/h
(we will have to make these ≈ much more precise if we want to understand what is going on).
Since y'(t_n) = f(y(t_n)) ≈ f(y_n) we arrive at the so-called forward Euler method:
y_{n+1} = y_n + h f(y_n) ,
which is a very easy method to implement: given the initial condition y_0 we can directly compute y_1 = y_0 + h f(y_0), then y_2 = y_1 + h f(y_1), and so on up to y_N = y_{N−1} + h f(y_{N−1}).
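Since the update rule is explicit, it takes only a few lines to implement. Here is a minimal Python sketch (the language and all names are our own choice, not part of the notes), written for a time-independent right hand side f:

    import numpy as np

    def forward_euler(f, y0, T, N):
        # Forward Euler: y_{n+1} = y_n + h f(y_n) with h = T/N.
        # Returns the approximation of y(T).
        h = T / N
        y = np.asarray(y0, dtype=float)
        for _ in range(N):
            y = y + h * f(y)
        return y

For a system of ODEs, f simply returns a vector and the same loop applies unchanged.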
Applying this to our mass spring problem we get
y_{n+1,1} = y_{n,1} + h y_{n,2} ,   y_{n+1,2} = y_{n,2} − h (γ/μ) y_{n,1} .
In the following we set γ = μ = 1 and use as initial data y(0) = (x_0, v_0) = (1, 1), so that the exact solution is simply
y(t) = (cos(t) + sin(t), − sin(t) + cos(t)) .
At T = 2π the solution should be back at (1, 1), so let us check what value y_N has for N = 2π/h for different values of h, e.g., h_i = (2π/100)·2^{−i}, i.e., we use N_i = 100·2^i points, for i = 0, 1, . . . , 5:
i N yN |y(T ) − yN |
0 101 [1.20766198 1.2277517 ] 3.08212e-01
1 201 [1.10139465 1.10595473] 1.46654e-01
2 401 [1.05003655 1.05112221] 7.15342e-02
3 801 [1.02484773 1.02511256] 3.53278e-02
4 1601 [1.01238062 1.01244602] 1.75552e-02
5 3201 [1.00617943 1.00619568] 8.75053e-03
We have used a very large number of points t_n for the final simulation and the solution is still not all that accurate: the error has only just dropped below 1%. Depending on the application this might or might not be an acceptable level of error, and the computational effort to reach it might or might not be acceptable. But it does seem worthwhile to investigate methods that achieve a smaller error with the same computational cost, or the same error with less computational cost; we will
study some such approaches in this module. The results seem to indicate that the error is going
to zero with increasing N - in fact it looks like the error is halving each time N is doubled, i.e.,
the error is proportional to 1/N ∼ h. We will see later in this module that this is in fact the
case. Computing only one period of the oscillation is often not of interest but instead the long
time behaviour is to be simulated, so let us redo the above computation with T = 200π (which is
actually not that long):
i N yN |y(T ) − yN |
0 10001 [-2.00707e+07,5.08076e+08] 5.08472e+08
1 20001 [1.48842e+04,2.27772e+04] 2.72078e+04
2 40001 [1.31599e+02,1.45952e+02] 1.95108e+02
3 80001 [1.16376e+01,1.19422e+01] 1.52607e+01
4 160001 [3.42277e+00,3.44495e+00] 3.44204e+00
5 320001 [1.85158e+00,1.85458e+00] 1.20644e+00
Not so good: the best that can be said is that it does seem to be converging, but the errors are huge! As the next simulation shows, instead of staying on a constant level curve of H (i.e. H(y(t)) = H(y(0))) the value of H seems to be increasing; to verify this we add H(y_N) to our output (the expected value is H(1, 1) = 1). We also increase i a bit more:
So decreasing h (or increasing N) to compute the error at a fixed time does seem to work, although the required work can be very high if the error is to be small or the time period somewhat longer. Instead of changing h we will now fix h and increase T just to show that effect a bit more:
To show the time evolution of the discrete solution for different values of h see the left figure in the following plot (only every 15th approximate value is plotted). On the right you can see the same simulation with a larger value of T, using the same value of h as for the curve with the same colour on the left. The plots show the evolution of the system in phase space, i.e., the x-axis represents the position of the mass and the y-axis its velocity. Another way of thinking of these plots is in terms of the Hamiltonian H: H should be constant, i.e., the mass should remain on a single level curve of H, and these level curves are circles around the origin.
[Phase space plots of the forward Euler approximations: position against velocity for two simulations, the right one over a longer time interval.]
We will see later on that the forward or explicit Euler method suffers from stability issues in the
case that h is too large (this is not the problem here...). Nevertheless, we can try a method that
we will later prove to be more stable: the backward or implicit Euler method. The approach to
derive the approximation is the same as for the forward Euler method, except that we use the
approximation at t = t_{n+1} instead of at t = t_n, i.e., y'(t_{n+1}) ≈ (y(t_{n+1}) − y(t_n))/h ≈ (y_{n+1} − y_n)/h. Since y'(t_{n+1}) = f(y(t_{n+1})) ≈ f(y_{n+1}) we arrive at the so-called backward Euler method:
y_{n+1} = y_n + h f(y_{n+1}) .
The method is in general not quite as easy to code up but since f is linear in our case it is still
fairly easy to do:
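For a linear right hand side f(y) = Ay each backward Euler step amounts to solving the linear system (I − hA)y_{n+1} = y_n. A minimal sketch (assuming γ = μ = 1 as before; names are illustrative):

    import numpy as np

    def backward_euler_linear(A, y0, T, N):
        # Backward Euler for y' = A y: each step solves (I - h A) y_{n+1} = y_n.
        h = T / N
        M = np.eye(len(y0)) - h * A   # constant matrix, could be factored once
        y = np.asarray(y0, dtype=float)
        for _ in range(N):
            y = np.linalg.solve(M, y)
        return y

    A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # mass spring system
    print(backward_euler_linear(A, [1.0, 1.0], 2 * np.pi, 100))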
[Phase space plots of the backward Euler approximations.]
Note that now the mass is slowing down as it would if friction were added (recall that the x-axis in these plots represents the position and the y-axis the velocity).
In summary: were we to use the forward Euler method to compute the orbit of a satellite around
earth, the satellite would always be spinning off into space - preferable perhaps to the trajectory
predicted by the backward Euler method but still not correct... But also note that both methods
converge, i.e., if we fix a point in time and reduce h enough the error can (in theory) be made as
small as we want it to be. We have some experimental indication of this for the forward Euler
method, the following table indicates that the same is true for the backward Euler method:
Can we derive a method that converges and maintains the energy of the system, i.e., guarantees
that H(yn ) = H(y0 ) for all n? The answer is yes and the method is just as simple to implement
as the forward Euler method. The method is often referred to as the symplectic Euler method and
you will need to look closely to see the difference to the forward Euler method described above:
y_{n+1,1} = y_{n,1} + h y_{n,2} ,   y_{n+1,2} = y_{n,2} − h (γ/μ) y_{n+1,1} .
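In code the difference to forward Euler really is minimal: the velocity update uses the freshly computed position. A sketch (again assuming γ = μ = 1; the printed quantity is H = (q^2 + v^2)/2, which should stay near its initial value 1):

    import numpy as np

    def symplectic_euler(y0, T, N):
        # Symplectic Euler: update the position with the old velocity,
        # then the velocity with the *new* position.
        h = T / N
        q, v = float(y0[0]), float(y0[1])
        for _ in range(N):
            q = q + h * v
            v = v - h * q   # forward Euler would use the old q here
        return np.array([q, v])

    y = symplectic_euler([1.0, 1.0], 200 * np.pi, 100000)
    print(y, 0.5 * (y[0] ** 2 + y[1] ** 2))   # energy stays close to 1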
[Phase space plots of the symplectic Euler approximations.]
Chapter 2
Getting Started
In this chapter we will introduce a few concepts without being too formal. The ideas will then be
expanded on in the following chapters.
We use the abbreviations C^m(a, b) for C^m((a, b)) and C^∞(I) := ⋂_{m∈N} C^m(I). It follows that
C^∞(I) ⊂ . . . ⊂ C^m(I) ⊂ . . . ⊂ C^0(I) .
Theorem 1 (Taylor Theorem). Let f ∈ C^m(a, b) and x_0 ∈ (a, b) be given. Then there exists a function ω_m : R → R with lim_{x→x_0} ω_m(x) = 0, so that
f(x) = P_m(x) + R_m(x)   with   R_m(x) = ω_m(x)(x − x_0)^m ,
where
P_m(x) = Σ_{k=0}^{m} (1/k!) f^{(k)}(x_0) (x − x_0)^k ,
and there are different important expressions for the remainder term:
1. Lagrange representation: for fixed x ∈ (a, b) there is a ξ between x_0 and x so that
R_m(x) := (1/(m+1)!) f^{(m+1)}(ξ) (x − x_0)^{m+1} .
2. Integral representation:
R_m(x) := (1/m!) ∫_{x_0}^{x} f^{(m+1)}(t) (x − t)^m dt .
(i) g(t) = O(h(t)) for t → 0 iff there are constants C > 0 and δ > 0 so that |g(t)| ≤ C|h(t)| for all 0 < t < δ.
(ii) g(t) = o(h(t)) for t → 0 iff there is a δ > 0 and a function c : (0, δ) → R with lim_{t→0} c(t) = 0 so that |g(t)| ≤ c(t)|h(t)| for all 0 < t < δ.
If a sequence (x_n)_n converges to x and there are p ≥ 1 and λ > 0 with
lim_{n→∞} |x_{n+1} − x| / |x_n − x|^p = λ ,
then we say (x_n)_n converges to x with order p. The largest p with this property is said to be the convergence rate or the rate of convergence of the sequence (x_n)_n.
Suppose z : [0, h_0] → R^q and z(h) → z_0 for h → 0. If there exists a p > 0 with
|z(h) − z_0| = O(h^p)
then we say that the convergence is of order p. The largest p with this property is said to be the
convergence rate or rate of convergence of z(h) to z0 .
If p = 1, this is called linear convergence. If p = 2 this is quadratic convergence.
This can of course not be used to prove convergence, but we can get a good indication of the convergence rate through experiments, and we can use a theoretically proven rate of convergence to verify that an implementation is correct by comparing the experimental order of convergence (EOC) with the theoretical convergence rate. A major issue with this approach is that computing the error requires knowledge of the exact solution z_0. In the example we discussed in the introduction, where we applied the forward Euler method to the linear mass spring problem, we knew the exact solution and so could compute the error. In general we want to use approximations in complex cases where no exact solution is available; then some other approach to determining the order of convergence has to be used. In summary, the EOC is a good tool to check the correctness of an implementation, but using it requires simplifying/modifying the problem to the point that an exact solution is available.
A typical approach is the so-called method of manufactured solutions. For an ODE this would for example entail picking an exact solution Y and then computing the right hand side f = f(t, y) so that Y is the exact solution to the ODE y'(t) = f(t, y). The trick is to choose Y so that f is not too trivial, e.g., depends nonlinearly on y. For example Y(t) = sin(t) and f(t, y) = cos(t) would allow
us to compute errors but not really challenge the ODE solver. Instead we could take Y(t) = sin(t) and f(t, y) = √(1 − y^2), which would work at least for a restricted time interval.
Applying this to ODEs we will have z0 = Y (T ) and z(h) is the approximation at the final time
using a scheme with step size h.
Looking back at the errors computed for the mass spring problem we see that the error seems to behave proportionally to h = 1/N, as we already mentioned. Using the concept of a rate of convergence this means that the scheme converges linearly, i.e., with order p = 1. Let us compute the EOC to confirm this:
i N yN |y(T ) − yN | EOC
0 101 [1.20766e+00,1.22775e+00] 3.08212e-01
1 201 [1.10139e+00,1.10595e+00] 1.46654e-01 1.07151e+00
2 401 [1.05004e+00,1.05112e+00] 7.15342e-02 1.03571e+00
3 801 [1.02485e+00,1.02511e+00] 3.53278e-02 1.01783e+00
4 1601 [1.01238e+00,1.01245e+00] 1.75552e-02 1.00891e+00
5 3201 [1.00618e+00,1.00620e+00] 8.75053e-03 1.00445e+00
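The EOC column is computed from successive errors: since h is halved in each row, EOC_i = log(e_{i−1}/e_i)/log 2. A small sketch reproducing the column from the error values in the table:

    import math

    errors = [3.08212e-01, 1.46654e-01, 7.15342e-02,
              3.53278e-02, 1.75552e-02, 8.75053e-03]
    for i in range(1, len(errors)):
        # EOC between two runs with step sizes h and h/2
        print(i, math.log(errors[i - 1] / errors[i]) / math.log(2))

The printed values approach 1, consistent with first order convergence.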
The next example also shows how even very small errors in some part of the algorithm can strongly influence the error in the solution:
Example 6. Consider the linear system of equations
[1.2969  0.8648] [x_1]   [0.86419999]
[0.2161  0.1441] [x_2] = [0.14400001] =: b .
The exact solution is (x_1, x_2) = (0.9911, −0.4870).
Due to some error, we obtain
b̄ = (0.8642, 0.1440)
instead of the exact right hand side. The relative errors in the first and second components are merely |0.86419999 − 0.8642|/0.86419999 ≈ 1.15·10^{−8} and |0.14400001 − 0.1440|/0.14400001 ≈ 6.94·10^{−8}. So the error is quite small. But the solution to the new problem is (x̄_1, x̄_2) = (2, −2), which means that the error in the solution to the linear system of equations is more than 100%.
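This is easy to reproduce, for example with NumPy (our choice of tool):

    import numpy as np

    A = np.array([[1.2969, 0.8648], [0.2161, 0.1441]])
    b = np.array([0.86419999, 0.14400001])
    b_bar = np.array([0.8642, 0.1440])
    print(np.linalg.solve(A, b))       # approx ( 0.9911, -0.4870)
    print(np.linalg.solve(A, b_bar))   # approx ( 2.0, -2.0)
    print(np.linalg.cond(A))           # of order 10^8: badly conditioned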
The amplification of errors, as shown in the previous example, is characterized by the conditioning
of the problem.
Definition 6. A problem is said to be well conditioned if small errors in the data lead to small errors in the solution and badly conditioned otherwise. We will provide two different notions of conditioning for the problem of computing the value f(x_0) for a given function f : U → R^n and given data x_0 ∈ U, where U is an open subset of R^m. We call this the problem (f, x_0).
Example 7. The solution to the linear system Ay = b is the problem (f, x_0) where f(x) = A^{-1}x and x_0 = b. The problem given in Example 6 was apparently badly conditioned.
Theorem 2. Let x_0 = (x_1, . . . , x_m) ∈ U and let x_0 + Δx ∈ U be some perturbation of the data with |Δx| ≪ 1. If f : U → R^n is once continuously differentiable then the error Δf_i(x_0) = f_i(x_0 + Δx) − f_i(x_0) (i = 1, . . . , n) in the evaluation of f_i is up to leading order equal to
Σ_{j=1}^{m} (∂f_i/∂x_j)(x_0) Δx_j .
k_{ij} := (∂f_i/∂x_j)(x_0) · x_j / f_i(x_0)
3. f(x_1, x_2) = x_1 + x_2:
k_{1j} = x_j / (x_1 + x_2) .
Thus k_{1j} is arbitrarily large if x_1 x_2 < 0 and the absolute values of x_1, x_2 are similar; in this case addition is badly conditioned. Otherwise it is well conditioned.
4. Subtraction is badly conditioned if x_1 x_2 > 0 and the absolute values of x_1, x_2 are similar.
We see that subtracting two positive numbers of similar size is not well conditioned. This can be a problem on a computer since non-exact arithmetic has to be used.
Example 9. Consider the case where numbers are rounded after the third decimal, i.e., instead of x = 0.9995 and y = 0.9984, the approximations x̄ = 1 and ȳ = 0.998 have to be used. Then instead of x − y = 0.0011 the computation x̄ − ȳ = 0.002 is performed. The relative error in the data is about 0.0005 while the relative error after evaluation is about 0.82, so we have an amplification of more than 1000. For the condition number we compute k_{1j} ≈ 910. This problem is known as cancellation or loss of significance.
K_abs = ‖A^{-1}‖ := sup_{y≠0} ‖A^{-1}y‖ / ‖y‖ .
Since there is an x ∈ R^m with ‖Ax‖ = ‖A‖ ‖x‖, the number ‖A^{-1}‖ · ‖A‖ is a good estimate for the condition number of the problem (f, b). Consider the matrix A from Example 6. We can show that ‖A^{-1}‖ ‖A‖ ≈ 10^9, which shows that the problem is badly conditioned.
The following section describes in detail how numbers are represented on a computer and how
arithmetic operations are performed. That section also includes some more examples showing the
issue of cancellation.
Usually one uses the notation
(∗)   ±a = 0.m_1 . . . m_r · b^{±E}
with mantissa M = 0.m_1 . . . m_r and exponent E = e_{s−1} b^{s−1} + . . . + e_0 b^0, where m_i, e_i ∈ {0, . . . , b − 1} and E ∈ N. For normalization purposes one assumes that m_1 ≠ 0 if a ≠ 0.
For given (b, r, s) let the set A = A(b, r, s) denote all real numbers a ∈ R with the representation
(∗).
To store a number a ∈ D = [−a_max, −a_min] ∪ {0} ∪ [a_min, a_max] a mapping from D to A(b, r, s) is defined: fl : D → A with |fl(a) − a| = min_{â∈A} |â − a|.
Remark. The floating point representation allows one to store real numbers of very different magnitude, e.g., the speed of light c ≈ 0.29979·10^9 m/s or the electron mass m_0 ≈ 0.911·10^{−30} kg.
We usually use the decimal system with b = 10 while on computers a binary representation is used
with b = 2. The constants r, s ∈ N depend on the architecture of the computer and the desired
accuracy.
Lemma 1. The set A(b, r, s) is finite. Its largest and smallest positive elements are a_max = (1 − b^{−r}) · b^{b^s − 1} and a_min = b^{−b^s}, respectively.
Proof. Left as an exercise.
Remark. Usually for a ∈ (−a_min, a_min) one defines fl(a) = 0 (“underflow”). If |a| > a_max (“overflow”) many programs set a = NaN (not a number) and the computation has to be terminated.
Theorem 3 (Rounding errors). The absolute error is given by
|a − fl(a)| ≤ (1/2) b^{−r} · b^E ,
where E is the exponent of a. For the relative error caused by fl(a) for a ≠ 0 the estimate
|fl(a) − a| / |a| ≤ (1/2) b^{−r+1}
holds.
Definition 11. The machine epsilon ε_M := (1/2) b^{−r+1} is the difference between 1 and the next larger representable number. Defining ε := (fl(a) − a)/a, one has fl(a) = a + aε = a(1 + ε) and |ε| ≤ ε_M.
Proof (Theorem 3). In the worst case fl(a) will differ from a by half a unit in the last position of the mantissa of a: |a − fl(a)| ≤ (1/2) b^{−r} b^E. Since we are assuming a normalized representation (m_1 ≠ 0) it follows that |a| ≥ b^{−1} b^E and therefore
|fl(a) − a| / |a| ≤ ((1/2) b^{−r} b^E) / (b^{−1} b^E) = (1/2) b^{−r+1} .
Example 11 (IEEE format). A common format is the IEEE format. It provides standards for single and double precision floating point numbers. A double precision number is stored using 64 bits (8 bytes):
x = ±m · 2^{c−1022} .
One bit is used to store the sign. 52 bits are used for the mantissa m = 2^{−1} + m_2 2^{−2} + · · · + m_{53} 2^{−53} (the first position is one due to normalization). The characteristic c = c_0 2^0 + · · · + c_{10} 2^{10} ∈ [1, 2046] can be stored in the remaining 11 bits. Here m_i, c_i ∈ {0, 1}. By storing the exponent in the form c − 1022, i.e., without a sign, the range of numbers is doubled. The two excluded cases c = 0 and c = 2047 are used to store x = 0 and NaN, respectively. We have a_max = 2^{1024} ≈ 1.8·10^{308}, a_min = 2^{−1022} ≈ 2.2·10^{−308}, and ε_M = (1/2)·2^{−52} ≈ 10^{−16}.
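These constants can be inspected directly in Python; note that NumPy reports the spacing between 1 and the next representable number, i.e., 2^{-52}, which is twice the rounding error bound ε_M used above:

    import numpy as np

    print(np.finfo(np.float64).eps)    # 2**-52, spacing at 1
    print(1.0 + 2.0 ** -53 == 1.0)     # True: 2**-53 rounds back to 1
    print(np.finfo(np.float64).max)    # ~1.8e308, cf. a_max
    print(np.finfo(np.float64).tiny)   # ~2.2e-308, smallest normalized number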
Definition 12 (Machine operations). The basic operations ⋆ ∈ {+, −, ×, /} are replaced by machine operations ⊛. In general
a ⊛ b = fl(a ⋆ b) = (a ⋆ b)(1 + ε)
with |ε| ≤ ε_M.
Example 12 (Loss of significance). In this example we use b = 10 and r = 6. We study the problem (f, x_0) with
f(x) = √(x + 1) − √x ,   x_0 = 100 .
As x gets large, √(x + 1) and √x are of very similar magnitude and subtracting the two values is ill conditioned, as we already saw. Assume that we can compute the square roots also up to six decimals: fl(√101) = 0.100499·10^2 and fl(√101) ⊖ fl(√100) = 0.499000·10^{−1} instead of fl(√101 − √100) = 0.498756·10^{−1}. So we have lost 3 significant figures from the available 6. Rewriting f in the form
f(x) = (√(x+1) − √x)(√(x+1) + √x) / (√(x+1) + √x) = 1 / (√(x+1) + √x)
removes the problem of loss of significance because adding √(x+1) and √x is well conditioned: fl(√101) ⊕ fl(√100) = 0.200499·10^2 and 1 ⊘ (0.200499·10^2) = 0.498755·10^{−1}.
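The same effect is easy to demonstrate in single precision, which carries roughly seven decimal digits:

    import numpy as np

    # sqrt(x+1) - sqrt(x) versus the rewritten form 1/(sqrt(x+1) + sqrt(x)),
    # both evaluated in single precision; the true value is ~4.9999988e-4.
    x = np.float32(1.0e6)
    naive = np.sqrt(x + np.float32(1)) - np.sqrt(x)
    stable = np.float32(1) / (np.sqrt(x + np.float32(1)) + np.sqrt(x))
    print(naive, stable)

The subtraction retains only a digit or two, while the rewritten form is accurate to full single precision.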
Observe that
fl(z/ε_M) = fl(z)/fl(ε_M) ≤ z(1 + δ)/ε_M = z·2^{23} + δz·2^{23}
in single precision floating point, so the error term δz is magnified by 2^{23}. It is possible to compensate for this:
Example 13. Let
f(x) = log(x + 1)/x
and consider x ≈ 0. Then lim_{x→0} f(x) = 1, but the direct evaluation suffers from cancellation for small x.
A similar problem occurs for (1 − cos x)/x^2 with x ≈ 0. Using the Taylor expansion
cos x = 1 − x^2/2! + x^4/4! − · · ·
we have
(1 − cos x)/x^2 = 1/2! − x^2/4! + x^4/6! − (x^6/8!) cos ξ ,
which can be evaluated for small x without cancellation.
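A quick double precision experiment shows the difference between the direct evaluation and the truncated series:

    import math

    # (1 - cos x)/x^2 evaluated directly versus via the Taylor series
    # 1/2 - x^2/24 + x^4/720; the limit for x -> 0 is 1/2.
    for x in [1e-4, 1e-6, 1e-8]:
        naive = (1.0 - math.cos(x)) / x ** 2
        series = 0.5 - x ** 2 / 24.0 + x ** 4 / 720.0
        print(x, naive, series)

For x = 10^{-8} the direct evaluation returns 0, since cos x rounds to exactly 1, while the series remains accurate.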
The following definition is closely related to the notion of well-posedness given previously.
Definition 13. An algorithm is called stable if small changes in the initial data produce only
small changes in the final results. Otherwise it is called unstable.
Let E0 > 0 be an initial error, and En be the error after n steps. Typically we can have
(i) Linear growth: En ≈ CnE0 , a constant C ∈ R.
(ii) Exponential growth: En ≈ C n E0 , C > 1. This occurs for example if En = CEn−1 . Expo-
nential growth is not acceptable.
n 1 2 3 4 5 6 7 8
(II) 0.33333 0.11111 0.03704 0.01235 0.00412 0.00137 0.00046 0.00015
(I) 0.33333 0.1111 0.03699 0.01216 0.00337 −0.00161 −0.01147 −0.04755
The exact value is 0.00015 in the obtainable precision. Even with r = 8 we obtain with (II)
0.00015242 and with (I) 0.00010407; the exact value is 0.00015242. The corresponding relative
errors are approximately 0.27 · 10−6 and 0.31.
Overall stability means that errors in previous steps are not amplified.
Example 16 (Error amplification). We want to compute the integrals I_k := ∫_0^1 x^k/(x+5) dx.
(A) Observe that
I_0 = ln(6) − ln(5)
and
I_k + 5 I_{k−1} = 1/k   (k ≥ 1), since
∫_0^1 (x^k/(x+5) + 5 x^{k−1}/(x+5)) dx = ∫_0^1 x^{k−1} dx = 1/k .
Here we use Ī_k to denote the computed value taking rounding errors into account. Obviously I_k is monotone decreasing and I_k ↘ 0 (k → ∞), but this is not observed for the computed values; we even have Ī_4 < 0. On a standard PC we found Ī_21 = −0.158·10^{−1} and Ī_39 = 8.960·10^{10}. This is a typical example of error accumulation. In the scheme the error in I_{k−1} is amplified by the factor 5 to compute I_k.
(B) If one computes the values for I_k exactly, one observes that I_9 = I_10 up to the first three decimals. Using the backwards iteration I_{k−1} = (1/5)(1/k − I_k) we obtain
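Both iterations are easily tried out; a sketch in double precision (the start index 50 for the backward iteration is an arbitrary choice):

    import math

    # (A) forward: I_k = 1/k - 5 I_{k-1} amplifies the error in I_0 by 5 per step
    I = math.log(6.0 / 5.0)
    for k in range(1, 31):
        I = 1.0 / k - 5.0 * I
    print(I)   # far from the true I_30, which is below 1/150

    # (B) backward: I_{k-1} = (1/k - I_k)/5 damps errors by a factor 5,
    # so even the crude start value I_50 = 0 yields I_0 to full precision
    I = 0.0
    for k in range(50, 0, -1):
        I = (1.0 / k - I) / 5.0
    print(I, math.log(6.0 / 5.0))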
Example 17 (Computing the solution to a quadratic equation). Consider the quadratic equation
y^2 − p y + q = 0
for p, q ∈ R and 0 ≠ q < p^2/4. The two solutions are y_{1,2} = y_{1,2}(p, q) = p/2 ± √(p^2/4 − q). Also p = y_1 + y_2 and q = y_1 y_2. From this we can conclude that ∂_p y_1 + ∂_p y_2 = 1 and y_2 ∂_p y_1 + y_1 ∂_p y_2 = 0. Therefore,
∂_p y_1 = y_1/(y_1 − y_2) ,   ∂_p y_2 = y_2/(y_2 − y_1) .
Similarly we can conclude that ∂_q y_1 + ∂_q y_2 = 0 and y_2 ∂_q y_1 + y_1 ∂_q y_2 = 1. Therefore,
∂_q y_1 = 1/(y_2 − y_1) ,   ∂_q y_2 = 1/(y_1 − y_2) .
The condition numbers for y_1(p, q) are
k_{1,p} = (p/y_1) ∂_p y_1 = (1 + y_2/y_1)/(1 − y_2/y_1) ,   k_{1,q} = (q/y_1) ∂_q y_1 = −(y_2/y_1)/(1 − y_2/y_1) .
Similar results can be obtained for the condition numbers k_{2,p} and k_{2,q} for y_2(p, q). This shows that the computation of the roots is badly conditioned if the two roots are close together, i.e., y_2/y_1 is close to one.
For |y_2/y_1| ≪ 1 the problem is well conditioned. We could employ the following algorithm to compute the roots:
u = p^2/4 ,   v = u − q ,   w = √v .
For p < 0 we should first compute y_2 = p/2 − w to avoid cancellation effects. For the second root we can use two different approaches:
(I) y_1 = p/2 + w   or   (II) y_1 = q/y_2 .
For q ≪ p^2/4 we have w ≈ |p|/2 and (I) is prone to cancellation effects. Errors made in p and w are carried over to y_1: up to leading order
|Δy_1/y_1| ≤ |1/(1 + 2w/p)| |Δp/p| + |1/(1 + p/(2w))| |Δw/w| .
Both factors are much greater than one since q ≪ p^2/4. The method (II) is on the other hand stable: up to leading order
|Δy_1/y_1| ≤ |Δq/q| + |Δy_2/y_2| .
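A sketch of the resulting algorithm, combining the sign choice for p with variant (II):

    import math

    def stable_roots(p, q):
        # Roots of y^2 - p y + q = 0 with q < p^2/4: compute the
        # larger-magnitude root without cancellation, then use q = y1*y2.
        w = math.sqrt(p * p / 4.0 - q)
        y1 = p / 2.0 + w if p >= 0 else p / 2.0 - w
        y2 = q / y1   # variant (II)
        return y1, y2

    print(stable_roots(1e8, 1.0))   # (1e8, 1e-8); computing p/2 - w gives 0.0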
reusing the symbol F for the right hand side. By introducing an extended solution vector
and suitable right hand side it is possible to reduce this problem to a homogeneous, first order
ODE of the form
y'(t) = f(y(t))
and this is the type of problem we are going to study throughout this lecture, or its non-homogeneous counterpart y'(t) = f(y(t), t), although the latter is equivalent to the homogeneous form after introducing another dependent variable satisfying the ODE τ'(t) = 1.
We will consequently assume that y(t) = (y_i(t))_{i=1}^m for some given m ≥ 1. To make the problem well posed we also need to provide an initial value, so we will assume that some y_0 ∈ R^m is given and will look for solutions to the initial value problem
(∗)   y'(t) = f(t, y(t)) ,   y(0) = y_0 .
R = {(t, y) | a_1 ≤ t ≤ a_2 , b_1 ≤ y ≤ b_2}
and if (t_0, y_0) is an interior point of R, then (∗) has a unique solution y = g(t) which passes through (t_0, y_0).
Sketch of Proof. This requires complete metric spaces and the Banach fixed point theorem. By assumption, |f(t, y)| ≤ K and |∂f/∂y| ≤ L. It follows that |f(t, y_1) − f(t, y_2)| ≤ L|y_1 − y_2|.
Choose a > 0 such that La < 1 and |t − t0 | ≤ a and |y − y0 | ≤ Ka. Let X be the set of all
continuous functions y = g(t) on |t − t0 | ≤ a with |g(t) − y0 | ≤ Ka, and so X is a complete
metric space. Define a mapping T of X into itself by
T g = h ,   h(t) = y_0 + ∫_{t_0}^{t} f(s, g(s)) ds   (|h(t) − y_0| ≤ Ka) .
Furthermore,
|h_1(t) − h_2(t)| = |∫_{t_0}^{t} [f(s, g_1(s)) − f(s, g_2(s))] ds| ≤ La sup_t |g_1(t) − g_2(t)| .
1. study certain properties of the solution, e.g., long time behaviour by looking at the stability
of fixed points. This was discussed in MA133.
2. approximate the solution, e.g., simplify the model or use numerical methods. We will look
at both these aspects in this module.
Suppose z_{n+1} ≤ C z_n + D for all n ≥ 0, with constants C > 1 and D ≥ 0. Then
z_n ≤ C^n z_0 + D (C^n − 1)/(C − 1)   for all n ≥ 0.   (2.1)
We prove this by induction. For n = 0 the bound (2.1) reads z_0 ≤ z_0, which is obviously satisfied. We now assume that (2.1) holds for a fixed n and prove that it is true
for n + 1. We have
z_{n+1} ≤ C z_n + D
     ≤ C (C^n z_0 + D (C^n − 1)/(C − 1)) + D
     = C^{n+1} z_0 + D (C^{n+1} − C)/(C − 1) + D
     = C^{n+1} z_0 + D ((C^{n+1} − C)/(C − 1) + (C − 1)/(C − 1))
     = C^{n+1} z_0 + D (C^{n+1} − 1)/(C − 1)
and the induction is complete.
Remark. As mentioned, this is a discrete version of the Gronwall Lemma, which states that if a smooth enough function u satisfies the differential inequality u'(t) ≤ β(t)u(t) for t ≥ a then
u(t) ≤ u(a) e^{∫_a^t β(s) ds} .
In other words, u is bounded by the solution to the differential equation v'(t) = β(t)v(t), v(a) = u(a). In the same way the right hand side in the bound of the discrete Gronwall lemma is the solution to the difference equation ξ_{n+1} = C ξ_n + D.
Y'(t + τh) ≈ (Y(t + h) − Y(t))/h ,
for some τ ∈ [0, 1]. That this is a reasonable approximation can be seen using Taylor expansion, assuming Y ∈ C^3 (see first assignment). In the case that τ = 1/2, i.e., we are aiming at approximating the time derivative exactly in the middle of the interval [t, t + h], we arrive at
Y'(t + h/2) = (Y(t + h) − Y(t))/h + O(h^2) .
If we are not approximating in the middle of the interval, i.e., τ ≠ 1/2, we end up with
Y'(t + τh) = (Y(t + h) − Y(t))/h + O(h) .
This type of superconvergence in some points is a typical property of many finite difference approximations to derivatives, i.e., they have a higher convergence rate in some isolated points than in the rest of the domain.
A finite difference approximation can be used to compute an approximation y_{n+1} to the exact solution Y at a point in time t_{n+1} = t_n + h, given approximations to Y at some earlier points in time.
For example, taking τ = 0 and t = t_n:
f(t_n, Y(t_n)) = Y'(t_n) = (Y(t_{n+1}) − Y(t_n))/h + O(h) .
Replacing Y(t_n) by y_n and Y(t_{n+1}) by y_{n+1} and ignoring the O(h) term yields
(y_{n+1} − y_n)/h = f(t_n, y_n) ,
which provides an explicit formula to compute y_{n+1} given y_n:
y_{n+1} = y_n + h f(t_n, y_n) .
y_1 = y_0 + h f(t_0, y_0) ,
y_2 = y_1 + h f(t_1, y_1) ,
y_3 = y_2 + h f(t_2, y_2) ,
. . .
This is known as the forward or explicit Euler method:
y_{n+1} = y_n + h f(t_n, y_n) .
The following algorithm provides an approximation to Y(T) given h = T/N for some N > 0:
  t = t_0 , y = y_0
  While t < T
    y = y + h f(t, y)
    t = t + h
If we start by evaluating the finite difference approximation at t = t_n, τ = 1 we arrive at
(y_{n+1} − y_n)/h = f(t_{n+1}, y_{n+1}) .
This does not lead to an explicit formula for y_{n+1}. The approximation at t_{n+1} has to be computed by finding the root δ_{n+1} of
F_n(δ; y_n, t_n, h) = δ − f(t_n + h, y_n + hδ)
and setting y_{n+1} = y_n + h δ_{n+1}. The root can be computed for example with Newton's method, iterating
δ^{k+1} = δ^k − DF(δ^k)^{-1} F(δ^k) .
The following algorithm provides an approximation to δ with F(δ) = 0 for a given initial guess δ_0 and tolerance 0 < ε ≪ 1:
  δ = δ_0
  While |F(δ)| > ε
    δ = δ − DF(δ)^{-1} F(δ)
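Putting the pieces together, a single backward Euler step can be sketched as follows (this assumes the Jacobian Df of f with respect to y is available; all names are our own):

    import numpy as np

    def backward_euler_step(f, Df, t, y, h, eps=1e-12, maxit=20):
        # Solve F(d) = d - f(t+h, y+h*d) = 0 with Newton's method,
        # then set y_{n+1} = y_n + h*d.
        d = f(t, y)   # initial guess: the explicit slope
        for _ in range(maxit):
            F = d - f(t + h, y + h * d)
            if np.linalg.norm(F) <= eps:
                break
            DF = np.eye(len(y)) - h * Df(t + h, y + h * d)   # Jacobian of F
            d = d - np.linalg.solve(DF, F)
        return y + h * d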
Remark. We will see later that both the forward and backward Euler method converge with order
one. We will be discussing approaches to improve the accuracy and also why using a more complex
implicit method can sometimes be a good idea.
Using the finite difference quotient in the middle of the interval to take advantage of the higher
convergence rate is not so straightforward:
(y_{n+1} − y_n)/h = f(t_{n+1/2}, y_{n+1/2}) ,
since y_{n+1/2} is not part of the sequence we are computing. We can either use the approximation on an interval of length 2h:
(y_{n+1} − y_{n−1})/(2h) = f(t_n, y_n) ,
which we can use to compute y_{n+1} assuming we know both y_n and y_{n−1}. This type of method is called a multistep method and is discussed later in the lecture. A second approach is to replace f(t_{n+1/2}, y_{n+1/2}) by an approximation, either
f(t_{n+1/2}, y_{n+1/2}) ≈ (1/2) (f(t_n, y_n) + f(t_{n+1}, y_{n+1}))
or
f(t_{n+1/2}, y_{n+1/2}) ≈ f(t_{n+1/2}, (1/2)(y_n + y_{n+1})) .
The first approach is often called the Crank-Nicolson method while the second is called the implicit midpoint method. Both are implicit. These methods look very similar to the backward Euler method but note that they have a higher complexity, since more evaluations of f are required in each step and evaluation of f can be very expensive. On the other hand the higher convergence rate of the finite difference approximation at the midpoint could improve the overall convergence rate of the method, assuming we haven't messed things up by approximating f(t_{n+1/2}, y_{n+1/2}).
Remark (Alternative derivation). Starting from y'(t) = f(y(t)) we can use integration over the time interval [t_n, t_{n+1}] to derive an approximation:
y(t_{n+1}) − y(t_n) = ∫_{t_n}^{t_{n+1}} y'(t) dt = ∫_{t_n}^{t_{n+1}} f(y(t)) dt .
Now to get a numerical scheme we need to approximate the integral on the right. We will study more sophisticated methods later in the course but for now we can use an approximation based on a single point in the interval:
y(t_{n+1}) − y(t_n) = ∫_{t_n}^{t_{n+1}} f(y(t)) dt ≈ (t_{n+1} − t_n) f(y(t_n + τ(t_{n+1} − t_n))) = h f(y(t_n + τh)) ,
for some τ ∈ [0, 1]. As we will have noticed (hopefully) we have rediscovered the forward Euler (τ = 0) and the backward Euler (τ = 1) methods.
Remark. The definition of the approximation error is not unique or more to the point, the way to
measure the error is not unique. For example instead of looking at the maximum error over the
time interval we could study an average error or the error eN at the final time only as we will do
in the assignment.
During the derivation of the schemes we replaced the time derivative with a finite difference
quotient by dropping higher derivative terms in the Taylor expansion. Recall for example the
formula used for the forward Euler method
Y(t_{n+1}) = Y(t_n) + h f(t_n, Y(t_n)) + O(h^2) ,
which indicates that we are introducing an error of the magnitude of h^2 when going from Y(t_n) to Y(t_{n+1}). So in each step we are making an error of magnitude h^2, but we are taking N ∼ 1/h steps to get to the final time. The question is: how do these errors add up over all the time steps?
Forward Euler
For h > 0 let yn+1 = yn + hf (tn , yn ) be the sequence produced by the forward Euler method,
tn = nh the sequence of points in time. We also introduce the evaluation of the exact solution at
these points in time, i.e., Yn := Y (tn ) and finally we denote the error at each of these points in
time by en := |yn − Yn |. Keep in mind that tn , yn , Yn , en all depend on h. Next we introduce the
local truncation error which is conceptually the error introduced by inserting the exact solution
into the numerical scheme:
Definition 15 (Local truncation error for forward Euler method). For t ∈ [0, T − h] we define the truncation error for the forward Euler method to be
τ(t) := Y(t + h) − Y(t) − h f(t, Y(t)) ,
or, at the discrete points in time, the local truncation error for the forward Euler method is
τ_n := Y_{n+1} − Y_n − h f(t_n, Y_n) .
Remark. From our definition we have Y_{n+1} = Y_n + h f(Y_n) + τ_n, and our derivation based on Taylor expansion shows that the truncation error converges quadratically to 0, i.e., τ_n = O(h^2).
We will now drop f's dependency on t to simplify the notation and furthermore assume that f is Lipschitz continuous (an assumption commonly used in the existence theory of ODEs), i.e., |f(u) − f(v)| ≤ L|u − v|. Then
e_{n+1} = |y_n + h f(y_n) − (Y_n + h f(Y_n) + τ_n)| ≤ e_n + h|f(y_n) − f(Y_n)| + τ_n ≤ e_n + Lh e_n + τ_n ≤ (1 + Lh)e_n + τ_n .
We thus conclude that going from step n to n + 1 leads to an amplification of the error e_n by 1 + Lh, plus an additional O(h^2) error coming from the truncation error in that step.
We can now apply the same estimate to en to get
e_{n+1} ≤ (1 + Lh)e_n + τ_n ≤ (1 + Lh)^2 e_{n−1} + (1 + Lh)τ_{n−1} + τ_n ≤ · · · ≤ (1 + Lh)^{n+1} e_0 + Σ_{i=0}^{n} (1 + Lh)^i τ_{n−i} .
Since e_0 = |y_0 − Y(0)| = |y_0 − y_0| = 0 and using τ_n ≤ Ch^2, where C depends neither on h nor on n (it depends on the second derivative of Y over the whole time interval), we get
e_{n+1} ≤ Ch^2 Σ_{i=0}^{n} (1 + Lh)^i
for all 0 ≤ n < N, where N = T/h, i.e., h = T/N. We still need to estimate the size of the final geometric sum:
Σ_{i=0}^{n} (1 + Lh)^i = ((1 + Lh)^{n+1} − 1)/((1 + Lh) − 1) = ((1 + Lh)^{n+1} − 1)/(Lh) .
Using n + 1 ≤ N = T/h, and therefore h ≤ T/(n+1) so that Lh ≤ LT/(n+1), we obtain
Σ_{i=0}^{n} (1 + Lh)^i ≤ (1/(Lh)) ((1 + LT/(n+1))^{n+1} − 1) .
Using that (1 + a/i)^i ≤ e^a we conclude
Σ_{i=0}^{n} (1 + Lh)^i ≤ (1/(Lh)) (exp(LT) − 1) = O(h^{-1}) .
Theorem 5 (Convergence of the forward Euler method). If the exact solution is in C^2 and the right hand side f is Lipschitz continuous with Lipschitz constant L, then the approximation error converges linearly to 0 for h → 0:
E(h) ≤ ((exp(LT) − 1)/(Lh)) max_{0≤i≤N} τ_i = O(h) .
So we have proven (what we already guessed from our numerical experiments) that the forward Euler method converges linearly to the exact solution. The constant in the error bound depends on (i) the end time, (ii) the second derivative of the exact solution, i.e., max_{0≤t≤T} |Y''(t)|, and (iii) the Lipschitz constant of the right hand side f.
Remark. The result given above for the forward Euler method is very typical for many numerical
schemes used to solve ODEs where the error at a time step can often be expressed as the error at
the previous time step plus the truncation error at that step. Since these methods are often derived
by truncating a Taylor expansion, the convergence rate for the truncation error is straightforward
to obtain. Modifying the above argument (i) using that e0 = 0 and (ii) the truncation errors add
up in the form of a geometric sum, one can derive the overall convergence rate from the rate of
the truncation error. Of course one needs to keep in mind that the solution to the ODE (or the
right hand side function f ) has to be smooth enough to carry out the truncated Taylor expansion
in the first place.
Backwards Euler
The analysis for the backward Euler method is almost identical to the above. In this case y_{n+1} = y_n + h f(y_{n+1}) and τ_n := Y_{n+1} − h f(Y_{n+1}) − Y_n.
Since we are interested in h → 0 we can assume that hL ≤ 1 − ε for some ε ∈ (0, 1); then rearranging terms leads to
e_{n+1} ≤ e_n/(1 − hL) + τ_n/(1 − hL) ≤ · · · ≤ e_0/(1 − hL)^{n+1} + Σ_{i=0}^{n} τ_{n−i}/(1 − hL)^{i+1} ,
where β = 1/(1 − hL). We have β − 1 = hL/(1 − hL) and also
β = 1 + hL/(1 − hL) ≤ 1 + hL/ε
since we assumed that hL ≤ 1 − ε. Using that 1 + x ≤ e^x we finally have β ≤ e^{hL/ε} and thus, using again nh ≤ Nh = T:
β^n ≤ e^{nhL/ε} ≤ e^{TL/ε} .
Putting it all together we have shown that
Theorem 6 (Convergence of the backward Euler method). If the exact solution Y ∈ C^2 and the right hand side f is Lipschitz continuous with Lipschitz constant L, then the approximation error converges linearly to 0 for h → 0. For h ≤ (1 − ε)/L for some ε ∈ (0, 1) the error is bounded by
E(h) ≤ ((1 − hL)/(hL)) (exp(TL/ε) − 1) max_{0≤i≤N} τ_i = O(h) .
For the linear problem y'(t) = Ay(t) the forward Euler method gives y_{n+1} = (I + hA)y_n while the backward Euler method requires solving (I − hA)y_{n+1} = y_n.
The decoupling argument shows that it is enough to study stability for linear problems in the case of complex valued scalar problems (those correspond to either real valued problems or to 2 × 2 systems). So in the following we again denote with Y the now possibly complex valued solution which satisfies
Y'(t) = λY(t)
with a complex constant λ. Since we are interested in the ODE setting where the origin is a stable fixed point, we assume the real part of λ is less than zero: λ ∈ C, Re λ < 0. We then know that the origin is a stable fixed point and thus |Y(t)| → 0 for t → ∞ independent of the initial condition Y_0; in fact Y(t) = Y_0 e^{λt} and so
|Y(t)| = e^{(Re λ)t} |Y_0| .
In particular:
1. |Y(t)| is monotonically decreasing and converges to 0,
2. if λ ∈ R then Y(t) is strictly monotonically decreasing if Y_0 > 0 and strictly monotonically increasing if Y_0 < 0. Consequently, Y(t) has the same sign as Y_0 for all t.
Now one can ask the question: under which conditions does the sequence (yn )n behave in the
same way?
Forward Euler
As we saw above
y_{n+1} = (1 + hλ)^{n+1} Y_0
and thus |y_n| = |1 + hλ|^n |Y_0|. To get this to converge to zero requires |1 + hλ| < 1; since Re λ < 0 this will hold for h sufficiently small but not for too large values of h. In fact
1 > |1 + hλ| = √((1 + h Re λ)^2 + h^2 (Im λ)^2)
and thus
h < 2|Re λ| / |λ|^2 ,
recalling that our assumption was that Re λ < 0. There are two interesting cases here:
1. Im λ = 0: the condition reduces to h < 2/|Re λ|, or perhaps easier to remember, h|λ| < 2.
2. Re λ → 0: in this case the condition for convergence of the forward Euler scheme is h < 0, so not achievable. On the other hand the exact solution satisfies |Y(t)| = |Y_0| (the origin is a centre), so aiming for |y_n| → 0 does not make sense. But since |y_n| = |1 + hλ|^n |Y_0| the issue is not only that the discrete approximation does not converge to zero but in fact it grows monotonically without bounds. This is in fact the setting of our mass spring system from the introduction where our experiments showed that the right long time behaviour is not achievable with the forward Euler method.
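The threshold is easy to observe numerically; for λ = −10 the critical step size is 2/|λ| = 0.2:

    # Forward Euler for y' = lambda*y, Y0 = 1: stable only for h < 2/|lambda|.
    lam, T = -10.0, 10.0
    for h in [0.19, 0.21]:
        y = 1.0
        for _ in range(int(round(T / h))):
            y = (1.0 + h * lam) * y
        print(h, y)   # decays for h = 0.19, grows for h = 0.21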
Backwards Euler
Doing the same for the backward Euler scheme we arrive at (1 − hλ)y_{n+1} = y_n, or
y_{n+1} = Y_0 / (1 − hλ)^{n+1} .
We again focus on the stable case, i.e., Re λ < 0. Now |y_n| → 0 if and only if |1 − hλ| > 1, i.e.,
(1 − h Re λ)^2 + h^2 (Im λ)^2 > 1 ,
which holds for any value of h, again using that Re λ < 0. Note that the condition is satisfied for any h even in the case that Re λ ≤ 0, and in fact even for quite a lot of the right half of the complex plane. In the purely imaginary case the condition is still satisfied for any h > 0, which means that |y_n| → 0 although the exact solution is a centre with |Y(t)| = |Y_0|. But we already saw this in the introduction where our simulations showed that for the mass spring system the implicit Euler method always leads to approximations that converge to zero.
A way to visualize the stability property of the two schemes is to plot the stability region in the complex plane, i.e., for the forward Euler method all complex values z = hλ such that |1 + z| < 1; for the implicit Euler method we shade the left half plane (although, as pointed out, most of the right half plane should also be shaded):
[Stability regions in the complex z = hλ plane: the disc |1 + z| < 1 around −1 for the forward Euler method (left) and the shaded left half plane for the backward Euler method (right).]
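Plots like these can be produced directly from the two conditions, for example with matplotlib (a sketch; plotting details are our own choice):

    import numpy as np
    import matplotlib.pyplot as plt

    # Stability regions in the complex z = h*lambda plane: forward Euler
    # is stable for |1 + z| < 1, backward Euler for |1 - z| > 1.
    x, y = np.meshgrid(np.linspace(-3, 2, 400), np.linspace(-2, 2, 400))
    z = x + 1j * y
    fig, axes = plt.subplots(1, 2)
    axes[0].contourf(x, y, (np.abs(1 + z) < 1).astype(float), levels=[0.5, 1.5])
    axes[0].set_title("forward Euler")
    axes[1].contourf(x, y, (np.abs(1 - z) > 1).astype(float), levels=[0.5, 1.5])
    axes[1].set_title("backward Euler")
    plt.show()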
Definition 16 (Absolute stability). The stability concept described so far is referred to as absolute
stability.
Methods like the backward Euler method which are stable for all step sizes h are called unconditionally (absolutely) stable or simply A-stable, while methods like the forward Euler method which require h to be small enough for stability are called conditionally stable.
A more formal discussion will be carried out later in the lecture.
We have so far focused on the limiting behaviour of yn for n → ∞ in the case of linear problems
with a stable fixed point at the origin. As pointed out above, if λ ∈ R and λ < 0 then the
exact solution is monotone decreasing or increasing depending on the sign of the initial condition
Y0 ∈ R. We will now look at conditions for h which guarantee that this behaviour carries over to
the sequence (yn )n when using the forward or backward Euler methods.
Starting with the forward Euler method we have y_n = (1 − h|λ|)^n Y_0 where now λ is real and negative. We can assume for simplicity that Y_0 > 0; then we want to find a step size condition so that 0 < y_{n+1} < y_n, which is equivalent to 0 < 1 − h|λ| < 1 as can be easily seen. Since h|λ| > 0 the second condition is always satisfied while the first condition requires h|λ| < 1. It is worth now comparing this to the condition for absolute stability derived previously, which in the current setting is |1 − h|λ|| < 1, i.e., −1 < 1 − h|λ| < 1. So, as is to be expected, achieving monotonicity requires a harder restriction on the time step than absolute stability did: for |y_n| → 0 we require h < 2/|λ| while monotonicity requires h < 1/|λ|.
Turning our attention to the backward Euler method with y_n = Y_0/(1 + h|λ|)^n we see that monotonicity always holds, since 0 < 1/(1 + h|λ|) < 1 for any step size h > 0.