
Lecture Notes

Numerical Analysis

First Version 2014

Instructor: Paramjeet Singh


Preface

These notes were originally prepared during Fall 2014 for Math Numerical Analysis. In writing these
notes, it was not my intention to add to the glut of Numerical Analysis texts; they were designed to
complement the course text, Numerical Analysis, Ninth edition, by Burden and Faires. As such, these
notes follow the conventions of that text fairly closely. If you are at all serious about pursuing study of
Numerical Analysis, you should consider acquiring that text, or any one of a number of other fine texts
by, e.g., Atkinson, or Cheney and Kincaid.
Special thanks go to the students of batch 2013, who suffered through early versions of these notes,
which were riddled with more errors than they have now; I hope the current version contains fewer.
Working through additional homework questions and example problems will help you learn the material.
CHAPTER 1 (4 LECTURES)
FLOATING POINT ARITHMETIC AND ERRORS

1. Numerical analysis
Numerical analysis is the area of mathematics and computer science that creates, analyzes, and imple-
ments algorithms for obtaining numerical solutions to problems involving continuous variables. Such
problems arise throughout the natural sciences, social sciences, engineering, medicine, and business.
Since the mid 20th century, the growth in power and availability of digital computers has led to an
increasing use of realistic mathematical models in science and engineering, and numerical analysis
of increasing sophistication is needed to solve these more detailed models of the world. The formal
academic area of numerical analysis ranges from quite theoretical mathematical studies to computer
science issues. A major advantage of numerical techniques is that a numerical answer can be obtained
even when a problem has no analytical solution. However, a result from numerical analysis is, in
general, an approximation, which can be made as accurate as desired; for example, we can approximate
the values of √2, π, etc.
With the increasing availability of computers, the new discipline of scientific computing, or com-
putational science, emerged during the 1980s and 1990s. The discipline combines numerical analysis,
symbolic mathematical computations, computer graphics, and other areas of computer science to make
it easier to set up, solve, and interpret complicated mathematical models of the real world.

1.1. Common perspectives in numerical analysis. Numerical analysis is concerned with all as-
pects of the numerical solution of a problem, from the theoretical development and understanding of
numerical methods to their practical implementation as reliable and efficient computer programs. Most
numerical analysts specialize in small subfields, but they share some common concerns, perspectives,
and mathematical methods of analysis. These include the following:
• When presented with a problem that cannot be solved directly, they try to replace it with a
“nearby problem” that can be solved more easily. Examples are the use of interpolation in
developing numerical integration methods and root-finding methods.
• There is widespread use of the language and results of linear algebra, real analysis, and func-
tional analysis (with its simplifying notation of norms, vector spaces, and operators).
• There is a fundamental concern with error, its size, and its analytic form. When approximating
a problem, it is prudent to understand the nature of the error in the computed solution.
Moreover, understanding the form of the error allows creation of extrapolation processes to
improve the convergence behaviour of the numerical method.
• Numerical analysts are concerned with stability, a concept referring to the sensitivity of the
solution of a problem to small changes in the data or the parameters of the problem. Numerical
methods for solving problems should be no more sensitive to changes in the data than the
original problem to be solved. Moreover, the formulation of the original problem should be
stable or well-conditioned.
In this chapter, we introduce and discuss some basic concepts of scientific computing. We begin
with discussion of floating-point representation and then we discuss the most fundamental source of
imperfection in numerical computing namely roundoff errors. We also discuss source of errors and then
stability of numerical algorithms.

2. Floating-point representation of numbers


Any real number is represented by an infinite sequence of digits. For example,
\[ \frac{8}{3} = 2.66666\cdots = \left( \frac{2}{10^1} + \frac{6}{10^2} + \frac{6}{10^3} + \cdots \right) \times 10^1. \]

Figure 1. Numerical Approximations

This is an infinite series, but a computer uses a finite amount of memory to represent numbers. Thus
only a finite number of digits may be used to represent any number, no matter which representation
method is chosen.
For example, we can chop the infinite decimal representation of 8/3 after 4 digits:
\[ \frac{8}{3} \approx \left( \frac{2}{10^1} + \frac{6}{10^2} + \frac{6}{10^3} + \frac{6}{10^4} \right) \times 10^1 = 0.2666 \times 10^1. \]
Generalizing this, we say that the number is stored with n decimal digits, and we call n the precision.
For each real number x, we associate a floating-point representation, denoted by fl(x), given by
\[ fl(x) = \pm (0.a_1 a_2 \ldots a_n)_{\beta} \times \beta^{e}, \]
where the β-based fraction is called the mantissa (all a_i are integers) and e is known as the exponent.
This representation is called the β-based floating-point representation of x, and we take base β = 10 in this course.
For example,
\[ 42.965 = 4 \times 10^1 + 2 \times 10^0 + 9 \times 10^{-1} + 6 \times 10^{-2} + 5 \times 10^{-3} = 0.42965 \times 10^2, \]
\[ -0.00234 = -0.234 \times 10^{-2}. \]
The number 0 is written as 0.00…0 × 10^e. Likewise, we can use the binary number system, and any real
x can be written as
\[ x = \pm q \times 2^{m} \]
with 1/2 ≤ q ≤ 1 and some integer m. Both q and m are expressed in terms of binary digits. For
example,
\[ (1001.1101)_2 = 1 \times 2^3 + 1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-4} = (9.8125)_{10}. \]
Remark 2.1. The above representation is not unique.
For example, 0.2666 × 10^1 = 0.02666 × 10^2, etc.
Definition 2.1 (Normal form). A non-zero floating-point number is in normal form if the value of the
mantissa lies in (−1, −0.1] or [0.1, 1).
Therefore, we normalize the representation by requiring a_1 ≠ 0. Not only is the precision limited to a
finite number of digits, but the range of the exponent is also restricted: there are integers m and M
such that −m ≤ e ≤ M.

Definition 2.2 (Overflow and underflow). An overflow is obtained when a number is too large to fit
into the floating-point system in use, i.e., e > M. An underflow is obtained when a number is too small,
i.e., e < −m. When overflow occurs in the course of a calculation, it is generally fatal. But underflow
is non-fatal: the system usually sets the number to 0 and continues. (Matlab does this, quietly.)
2.1. Rounding and chopping. Let x be any real number and fl(x) be its machine approximation.
There are two ways to do the “cutting” to store a real number
\[ x = \pm (0.a_1 a_2 \ldots a_n a_{n+1} \ldots) \times 10^{e}, \quad a_1 \neq 0. \]
(1) Chopping: We ignore the digits after a_n and write
\[ fl(x) = \pm (0.a_1 a_2 \ldots a_n) \times 10^{e}. \]
(2) Rounding: Rounding is defined by
\[ fl(x) = \begin{cases} \pm (0.a_1 a_2 \ldots a_n) \times 10^{e}, & 0 \le a_{n+1} < 5 \ \text{(rounding down)}, \\ \pm [(0.a_1 a_2 \ldots a_n) + (0.00\ldots01)] \times 10^{e}, & 5 \le a_{n+1} < 10 \ \text{(rounding up)}. \end{cases} \]
Example 1.
\[ fl\left( \frac{6}{7} \right) = \begin{cases} 0.86 \times 10^{0} & \text{(rounding)}, \\ 0.85 \times 10^{0} & \text{(chopping)}. \end{cases} \]
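The chopping and rounding rules are easy to experiment with. The following Python sketch (our own helper, not from the course text; the name fl and the use of base-10 logarithms are our choices) mimics n-digit decimal chopping and rounding. Since Python floats are binary, tiny spurious digits can appear in the printed results.

import math

def fl(x, n, mode="chop"):
    # n-significant-digit decimal representation of x, by chopping or rounding
    if x == 0:
        return 0.0
    s = -1.0 if x < 0 else 1.0
    a = abs(x)
    e = math.floor(math.log10(a)) + 1      # exponent e with mantissa in [0.1, 1)
    shifted = a / 10**e * 10**n            # mantissa shifted to an n-digit integer part
    m = math.trunc(shifted) if mode == "chop" else math.floor(shifted + 0.5)
    return s * m / 10**n * 10**e

print(fl(6/7, 2, "round"))   # 0.86, as in Example 1
print(fl(6/7, 2, "chop"))    # 0.85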

Example 2. Find the largest interval in which fl(x) must lie to approximate √2 with relative error
at most 10⁻⁵.
Sol. We require
\[ \left| \frac{\sqrt{2} - fl(x)}{\sqrt{2}} \right| \le 10^{-5}. \]
Therefore
\[ |\sqrt{2} - fl(x)| \le \sqrt{2} \cdot 10^{-5}, \]
\[ -\sqrt{2} \cdot 10^{-5} \le \sqrt{2} - fl(x) \le \sqrt{2} \cdot 10^{-5}, \]
\[ \sqrt{2} - \sqrt{2} \cdot 10^{-5} \le fl(x) \le \sqrt{2} + \sqrt{2} \cdot 10^{-5}. \]
Hence the interval (in decimals) is [1.4141994 · · · , 1.4142277 · · · ].

3. Errors in numerical approximations


Definition 3.1 (Absolute and relative error). If fl(x) is the approximation to the exact value x, then
the absolute error is |x − fl(x)| and the relative error is |x − fl(x)|/|x|.
Remark: As a measure of accuracy, the absolute error may be misleading and the relative error is more
meaningful.
3.1. Chopping and Rounding Errors. Let x be any real number we want to represent in a computer,
and let fl(x) be the representation of x in the computer. What is the largest possible value of
|x − fl(x)|/|x|? In other words, in the worst case, how much accuracy do we lose to round-off or
chopping errors?
Chopping errors: Let
\[ x = (0.a_1 a_2 \ldots a_n a_{n+1} \ldots) \times 10^{e} = \left( \frac{a_1}{10} + \frac{a_2}{10^2} + \cdots + \frac{a_n}{10^n} + \frac{a_{n+1}}{10^{n+1}} + \cdots \right) \times 10^{e} = \left( \sum_{i=1}^{\infty} \frac{a_i}{10^i} \right) \times 10^{e}, \quad a_1 \neq 0, \]
\[ fl(x) = (0.a_1 a_2 \ldots a_n) \times 10^{e} = \left( \sum_{i=1}^{n} \frac{a_i}{10^i} \right) \times 10^{e}. \]
Therefore
\[ |x - fl(x)| = \left( \sum_{i=n+1}^{\infty} \frac{a_i}{10^i} \right) \times 10^{e}. \]
Now since each a_i ≤ 9 = 10 − 1,
\[ |x - fl(x)| \le \left( \sum_{i=n+1}^{\infty} \frac{10-1}{10^i} \right) \times 10^{e} = (10-1) \left( \frac{1}{10^{n+1}} + \frac{1}{10^{n+2}} + \cdots \right) \times 10^{e} = (10-1) \, \frac{10^{-(n+1)}}{1 - \frac{1}{10}} \times 10^{e} = 10^{e-n}. \]
Therefore the absolute error bound is
\[ E_a = |x - fl(x)| \le 10^{e-n}. \]
Now
\[ |x| = (0.a_1 a_2 \ldots a_n \ldots)_{10} \times 10^{e} \ge 0.1 \times 10^{e} = \frac{1}{10} \times 10^{e}. \]
Therefore the relative error bound is
\[ E_r = \frac{|x - fl(x)|}{|x|} \le \frac{10^{-n} \times 10^{e}}{10^{-1} \times 10^{e}} = 10^{1-n}. \]

Rounding errors: For rounding,
\[ fl(x) = \begin{cases} (0.a_1 a_2 \ldots a_n)_{10} \times 10^{e} = \left( \sum_{i=1}^{n} \frac{a_i}{10^i} \right) \times 10^{e}, & 0 \le a_{n+1} < 5, \\[1ex] (0.a_1 a_2 \ldots a_{n-1}[a_n + 1])_{10} \times 10^{e} = \left( \frac{1}{10^n} + \sum_{i=1}^{n} \frac{a_i}{10^i} \right) \times 10^{e}, & 5 \le a_{n+1} < 10. \end{cases} \]
For 0 ≤ a_{n+1} < 5 = 10/2,
\[ |x - fl(x)| = \left( \sum_{i=n+1}^{\infty} \frac{a_i}{10^i} \right) \times 10^{e} = \left( \frac{a_{n+1}}{10^{n+1}} + \sum_{i=n+2}^{\infty} \frac{a_i}{10^i} \right) \times 10^{e} \le \left( \frac{10/2 - 1}{10^{n+1}} + \sum_{i=n+2}^{\infty} \frac{10-1}{10^i} \right) \times 10^{e} \]
\[ = \left( \frac{10/2 - 1}{10^{n+1}} + \frac{1}{10^{n+1}} \right) \times 10^{e} = \frac{1}{2} \, 10^{e-n}. \]
For 5 ≤ a_{n+1} < 10,
\[ |x - fl(x)| = \left| \sum_{i=n+1}^{\infty} \frac{a_i}{10^i} - \frac{1}{10^n} \right| \times 10^{e} = \left( \frac{1}{10^n} - \frac{a_{n+1}}{10^{n+1}} - \sum_{i=n+2}^{\infty} \frac{a_i}{10^i} \right) \times 10^{e} \le \left( \frac{1}{10^n} - \frac{a_{n+1}}{10^{n+1}} \right) \times 10^{e}. \]
Since −a_{n+1} ≤ −10/2, therefore
\[ |x - fl(x)| \le \left( \frac{1}{10^n} - \frac{10/2}{10^{n+1}} \right) \times 10^{e} = \frac{1}{2} \, 10^{e-n}. \]
Therefore, in both cases the absolute error bound is
\[ E_a = |x - fl(x)| \le \frac{1}{2} \, 10^{e-n}. \]
Also, the relative error bound is
\[ E_r = \frac{|x - fl(x)|}{|x|} \le \frac{1}{2} \, \frac{10^{-n} \times 10^{e}}{10^{-1} \times 10^{e}} = \frac{1}{2} \, 10^{1-n} = 5 \times 10^{-n}. \]
4. Significant Figures
The term significant digits is often used loosely to describe the number of decimal digits that appear
to be accurate. The definition given below is more precise.
Looking at an approximation 2.75303 to an actual value of 2.75194, we note that the three most
significant digits are equal, and therefore one may state that the approximation has three significant
digits of accuracy. One problem with simply looking at the digits is given by the following two examples:
(1) 1.9 as an approximation to 1.1 may appear to have one significant digit, but with a relative
error of 0.73 this seems unreasonable.
(2) 1.9999 as an approximation to 2.0001 may appear to have no significant digits, but its relative
error is 0.00010, which is almost the same relative error as that of the approximation 1.9239 to 1.9237.
Thus, we need a more mathematical definition of the number of significant digits. Let the number x
and its approximation x* be written in decimal form. The number of significant digits tells us about
how many positions x and x* agree. More precisely, we say that x* has m significant digits of x if
the absolute error |x − x*| has zeros in the first m decimal places, counting from the leftmost nonzero
(leading) position of x, followed by a digit from 0 to 5.
Examples:
5.1 has 1 significant digit of 5: |5 − 5.1| = 0.1.
0.51 has 1 significant digit of 0.5: |0.5 − 0.51| = 0.01.
4.995 has 3 significant digits of 5: 5 − 4.995 = 0.005.
4.994 has 2 significant digits of 5: 5 − 4.994 = 0.006.
0.57 has all significant digits of 0.57.
1.4 has 0 significant digits of 2: 2 − 1.4 = 0.6.
In terms of relative errors, the number x* is said to approximate x to m significant digits (or
figures) if m is the largest nonnegative integer for which
\[ \frac{|x - x^{*}|}{|x|} \le 0.5 \times 10^{-m}. \]
If the relative error is greater than 0.5, then we will simply state that the approximation has zero
significant digits.

For example, if we approximate π with 3.14, then the relative error is
\[ E_r = \frac{|\pi - 3.14|}{\pi} \approx 0.00051 \le 0.005 = 0.5 \times 10^{-2}, \]
and therefore it is correct to two significant digits.
Also, 4.994 has 2 significant digits of 5, as the relative error is (5 − 4.994)/5 = 0.0012 = 0.12 × 10⁻² ≤
0.5 × 10⁻².
Some numbers are exact because they are known with complete certainty. Most exact numbers are
integers: exactly 12 inches are in a foot, there might be exactly 23 students in a class. Exact numbers
can be considered to have an infinite number of significant figures.

5. Rules for mathematical operations


In carrying out calculations, the general rule is that the accuracy of a calculated result is limited
by the least accurate measurement involved in the calculation. In addition and subtraction, the result
is rounded off so that it has the same number of digits as the measurement having the fewest decimal
places (counting from left to right). For example,
100 (assume 3 significant figures) +23.643 (5 significant figures) = 123.643,
which should be rounded to 124 (3 significant figures).
In addition to inaccurate representation of numbers, the arithmetic performed in a computer is not
exact. The arithmetic involves manipulating binary digits by various shifting, or logical, operations.
Let the floating-point representations fl(x) and fl(y) be given for the real numbers x and y, and let
the symbols ⊕, ⊖, ⊗ and ⊘ represent machine addition, subtraction, multiplication, and division
operations, respectively. We will assume a finite-digit arithmetic given by

x ⊕ y = fl(fl(x) + fl(y)),   x ⊖ y = fl(fl(x) − fl(y)),
x ⊗ y = fl(fl(x) × fl(y)),   x ⊘ y = fl(fl(x) ÷ fl(y)).


This arithmetic corresponds to performing exact arithmetic on the floating-point representations of x
and y and then converting the exact result to its finite-digit floating-point representation.
Example 3. Suppose that x = 5/7 and y = 1/3. Use five-digit chopping to calculate x + y, x − y, x × y,
and x ÷ y.

Sol. Here x = 5/7 = 0.714285 · · · and y = 1/3 = 0.33333 · · · .
Using five-digit chopping, the values of x and y are
\[ fl(x) = 0.71428 \times 10^{0} \quad \text{and} \quad fl(y) = 0.33333 \times 10^{0}. \]
Thus,
\[ x \oplus y = fl(fl(x) + fl(y)) = fl(0.71428 \times 10^{0} + 0.33333 \times 10^{0}) = fl(1.04761 \times 10^{0}) = 0.10476 \times 10^{1}. \]
The true value is x + y = 5/7 + 1/3 = 22/21, so we have
Absolute Error E_a = |22/21 − 0.10476 × 10¹| = 0.190 × 10⁻⁴.
Relative Error E_r = (0.190 × 10⁻⁴)/(22/21) = 0.182 × 10⁻⁴.
Similarly we can perform the other calculations.
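As a quick check of Example 3, the following Python sketch simulates the machine operation x ⊕ y = fl(fl(x) + fl(y)) with five-digit chopping (the helper fl below is our own construction, as in the earlier sketch, and is not part of the course text):

import math

def fl(x, n=5):
    # five-digit decimal chopping
    if x == 0:
        return 0.0
    s = -1.0 if x < 0 else 1.0
    a = abs(x)
    e = math.floor(math.log10(a)) + 1
    return s * math.trunc(a / 10**e * 10**n) / 10**n * 10**e

x, y = 5/7, 1/3
machine_sum = fl(fl(x) + fl(y))              # x (+) y = 0.10476e1
exact = 22/21
print(machine_sum)                           # 1.0476
print(abs(exact - machine_sum))              # absolute error ~ 0.190e-4
print(abs(exact - machine_sum) / exact)      # relative error ~ 0.182e-4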
Now we derive some rules for absolute and relative errors when we perform addition/subtraction or
multiplication/division.

Error in addition/subtraction of numbers:


Let x1 and x2 denote two real numbers.
Let
X = x1 + x2 .

Let the errors in the two components be δx1 and δx2, respectively, and the error in the sum be δX. Then
\[ X + \delta X = (x_1 + \delta x_1) + (x_2 + \delta x_2) = (x_1 + x_2) + (\delta x_1 + \delta x_2), \]
so δX = δx1 + δx2 and hence
\[ |\delta X| \le |\delta x_1| + |\delta x_2|. \]
Dividing by X we get
\[ \left| \frac{\delta X}{X} \right| \le \left| \frac{\delta x_1}{X} \right| + \left| \frac{\delta x_2}{X} \right|, \]
which is the maximum relative error. Therefore, if two numbers are added, the magnitude of the absolute
error in the result is at most the sum of the magnitudes of the absolute errors of the components. The
same result holds if we replace addition with subtraction.

Error in product/division of two numbers:


Let X = x1 x2. Then
\[ \delta X = \frac{\partial X}{\partial x_1} \delta x_1 + \frac{\partial X}{\partial x_2} \delta x_2 = x_2 \, \delta x_1 + x_1 \, \delta x_2, \]
\[ \implies \frac{\delta X}{x_1 x_2} = \frac{\delta x_1}{x_1} + \frac{\delta x_2}{x_2}, \qquad \left| \frac{\delta X}{X} \right| \le \left| \frac{\delta x_1}{x_1} \right| + \left| \frac{\delta x_2}{x_2} \right|. \]
Therefore the maximum relative error in a product of two numbers is the sum of the relative errors of
its components,
\[ E_r = \left| \frac{\delta X}{X} \right| \le \left| \frac{\delta x_1}{x_1} \right| + \left| \frac{\delta x_2}{x_2} \right|, \]
and the absolute error is given by
\[ E_a = \left| \frac{\delta X}{X} \right| \times |X|. \]
A similar result can be obtained for division.
Further we show some examples of arithmetic with different exponents.
Example 4. Add the following floating-point numbers 0.4546e3 and 0.5433e7.
Sol. The operands have unequal exponents. To add these floating-point numbers, we align both
operands to the larger exponent:
0.5433e7 + 0.0000e7 = 0.5433e7.
(Aligning exponents turns 0.4546e3 into 0.00004546e7, which becomes 0.0000e7 after truncation to
four mantissa digits.)
Example 5. Subtract the following floating-point numbers:
1. 0.5424e − 99 from 0.5452e − 99
2. 0.3862e − 7 from 0.9682e − 7
Sol. 1. On subtracting we get 0.0028e−99. This is a floating-point number but not in normalized
form. To convert it into normalized form, shift the mantissa to the left; we get 0.28e−101. This is
called an underflow condition.
2. Similarly, after subtraction we get 0.5820e−7.
Example 6. Multiply the following floating point numbers: 0.1111e74 and 0.2000e80.
Sol. On multiplying we obtain 0.1111e74 × 0.2000e80 = 0.2222e153. This shows overflow condition of
normalized floating-point numbers.
Example 7. Find the relative error in the calculation of 7.342/0.241, where the numbers 7.342 and
0.241 are correct to three decimal places. Determine the smallest interval in which the true result lies.

Sol. Let x1/x2 = 7.342/0.241 = 30.4647.
Here the errors satisfy |δx1| = |δx2| ≤ (1/2) × 10⁻³ = 0.0005, for rounding.
Therefore the relative error is
\[ E_r \le \frac{0.0005}{7.342} + \frac{0.0005}{0.241} = 0.0021. \]
The absolute error is
\[ E_a \le 0.0021 \times \frac{x_1}{x_2} = 0.0639. \]
Hence the true value of 7.342/0.241 lies between 30.4647 − 0.0639 = 30.4008 and 30.4647 + 0.0639 = 30.5286.
Example 8. The error in the measurement of the area of a circle is not allowed to exceed 0.5%. How
accurately should the radius be measured?

Sol. The area of the circle is A = πr². Then
\[ \frac{\partial A}{\partial r} = 2\pi r. \]
The percentage error in A is (δA/A) × 100 = 0.5, therefore δA = (0.5/100) A = πr²/200. Hence the
percentage error in r is
\[ \frac{\delta r}{r} \times 100 = \frac{100}{r} \cdot \frac{\delta A}{\partial A / \partial r} = \frac{100}{r} \cdot \frac{\pi r^2 / 200}{2 \pi r} = 0.25. \]

6. Loss of significance
Roundoff errors are inevitable and difficult to control. Other types of errors which occur in com-
putation may be under our control. The subject of numerical analysis is largely preoccupied with
understanding and controlling errors of various kinds.
One of the most common error-producing calculations involves the cancellation of significant digits due
to the subtraction of nearly equal numbers (or the addition of one very large number and one very small
number, or the multiplication of a small number by a quite large one).
The phenomenon can be illustrated with the following examples.
Example 9. If x = 0.3721478693 and y = 0.3720230572, what is the relative error in the computation
of x − y using five decimal digits of accuracy?

Sol. We can compute with ten decimal digits of accuracy and take that as ‘exact’:
x − y = 0.0001248121.
Both x and y will be rounded to five digits before subtraction. Thus
fl(x) = 0.37215,
fl(y) = 0.37202,
fl(x) − fl(y) = 0.13000 × 10⁻³.
The relative error, therefore, is
\[ E_r = \left| \frac{(x-y) - (fl(x) - fl(y))}{x-y} \right| \approx 0.04 = 4\%. \]
Example 10. Use four-digit rounding arithmetic and the formula for the roots of a quadratic equation
to find the most accurate approximations to the roots of the following quadratic equation. Compute the
absolute and relative errors.
\[ 1.002x^2 + 11.01x + 0.01265 = 0. \]

Sol. The quadratic formula states that the roots of ax² + bx + c = 0 are
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Using the above formula, the roots of 1.002x² + 11.01x + 0.01265 = 0 are approximately (using long
format)
\[ x_1 = -0.00114907565991, \qquad x_2 = -10.98687487643590. \]
We use four-digit rounding arithmetic to find approximations to the roots, written x1* and x2*. These
approximations are given by
\[ x_{1,2}^{*} = \frac{-11.01 \pm \sqrt{(-11.01)^2 - 4 \cdot 1.002 \cdot 0.01265}}{2 \cdot 1.002} = \frac{-11.01 \pm \sqrt{121.2 - 0.05070}}{2.004} = \frac{-11.01 \pm 11.00}{2.004}. \]
Therefore we find the first root:
\[ x_1^{*} = -0.004990, \]
which has the absolute error |x1 − x1*| = 0.00384095 and relative error |x1 − x1*|/|x1| = 3.34265968
(very high).
We find the second root
\[ x_2^{*} = \frac{-11.01 - 11.00}{2.004} = -10.98, \]
which has absolute error
|x2 − x2*| = 0.006874876,
and relative error
|x2 − x2*|/|x2| = 0.000626127.
The quadratic formula for the calculation of the first root encounters the subtraction of nearly equal
numbers, which causes loss of significance. Therefore, we use the alternate quadratic formula, obtained
by rationalizing the numerator, to calculate x1; the approximation is given by
\[ x_1^{*} = \frac{-2c}{b + \sqrt{b^2 - 4ac}} = -0.001149, \]
which has relative error
\[ \frac{|x_1 - x_1^{*}|}{|x_1|} = 6.584 \times 10^{-5}. \]
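The effect of the rearrangement can be reproduced in a few lines of Python. The helper rnd below (our own naming, a stand-in for four-digit rounding arithmetic; applying it after every operation is our assumption about where the roundings occur) recovers the numbers of Example 10.

import math

def rnd(x, n=4):
    # round x to n significant decimal digits
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1
    return round(x / 10**e, n) * 10**e

a, b, c = 1.002, 11.01, 0.01265
disc = rnd(math.sqrt(rnd(rnd(b*b) - rnd(4*a*c))))   # 11.00
x1_naive  = rnd(rnd(-b + disc) / rnd(2*a))          # -0.004990 (cancellation)
x1_stable = rnd(rnd(-2*c) / rnd(b + disc))          # -0.001149 (rationalized)
print(x1_naive, x1_stable)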
Example 11. The quadratic formula is used for computing the roots of the equation ax² + bx + c = 0,
a ≠ 0, which are given by
\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider the equation x² + 62.10x + 1 = 0 and discuss the numerical results.

Sol. Using the quadratic formula and 8-digit rounding arithmetic, we obtain the two roots
x1 = −0.01610723,
x2 = −62.08390.
We use these values as “exact values”. Now we perform the calculations with 4-digit rounding arithmetic.
We have \( \sqrt{b^2 - 4ac} = \sqrt{62.10^2 - 4.000} = \sqrt{3856 - 4.000} = 62.06 \) and
\[ fl(x_1) = \frac{-62.10 + 62.06}{2.000} = -0.02000. \]

The relative error in computing x1 is
\[ \frac{|fl(x_1) - x_1|}{|x_1|} = \frac{|-0.02000 + 0.01610723|}{|-0.01610723|} = 0.2417. \]
In calculating x2,
\[ fl(x_2) = \frac{-62.10 - 62.06}{2.000} = -62.10. \]
The relative error in computing x2 is
\[ \frac{|fl(x_2) - x_2|}{|x_2|} = \frac{|-62.10 + 62.08390|}{|-62.08390|} = 0.259 \times 10^{-3}. \]
In this equation b² = 62.10² is much larger than 4ac = 4, hence b and \( \sqrt{b^2 - 4ac} \) are two nearly
equal numbers. The calculation of x1 involves the subtraction of two nearly equal numbers, but x2
involves the addition of two nearly equal numbers, which does not cause serious loss of significant
figures.
To obtain a more accurate 4-digit rounding approximation for x1, we change the formulation by
rationalizing the numerator, that is,
\[ x_1 = \frac{-2c}{b + \sqrt{b^2 - 4ac}}. \]
Then
\[ fl(x_1) = \frac{-2.000}{62.10 + 62.06} = -2.000/124.2 = -0.01610. \]
The relative error in computing x1 is now reduced to 0.62 × 10⁻³.

Note: However, if we rationalize the numerator in x2 to get
\[ x_2 = \frac{-2c}{b - \sqrt{b^2 - 4ac}}, \]
the use of this formula involves not only the subtraction of two nearly equal numbers but also division
by a small number. This degrades the accuracy:
\[ fl(x_2) = \frac{-2.000}{62.10 - 62.06} = -2.000/0.04000 = -50.00. \]
The relative error in x2 becomes 0.19.
Nested Arithmetic: Accuracy loss due to round-off error can also be reduced by rearranging calcu-
lations, as shown in the next example. Polynomials should always be expressed in nested form before
performing an evaluation, because this form minimizes the number of arithmetic calculations. One
way to reduce round-off error is to reduce the number of computations.
Example 12. Evaluate f(x) = 1.5 + 3.2x − 6.1x² + x³ at x = 4.71 using three-digit arithmetic directly
and with nesting.

Sol. The exact result of the evaluation is (taking more digits):
Exact: f(4.71) = 1.5 + 3.2 × 4.71 − 6.1 × 4.71² + 4.71³ = −14.263899.
Now, using three-digit rounding arithmetic, we obtain
f(4.71) = 1.5 + 3.2 × 4.71 − 6.1 × 4.71² + 4.71³
        = 1.5 + 15.1 − 6.1 × 22.2 + 22.2 × 4.71
        = 1.5 + 15.1 − 135 + 105 = −13.4.
Similarly, if we use three-digit chopping, then
f(4.71) = 1.5 + 3.2 × 4.71 − 6.1 × 4.71² + 4.71³
        = 1.5 + 15.0 − 6.1 × 22.1 + 22.1 × 4.71
        = 1.5 + 15.0 − 134 + 104 = −13.5.
The relative error in the case of three-digit rounding is
\[ \left| \frac{-14.263899 + 13.4}{-14.263899} \right| \approx 0.06, \]
and for three-digit chopping is
\[ \left| \frac{-14.263899 + 13.5}{-14.263899} \right| \approx 0.05. \]
As an alternative approach, we write the polynomial f(x) in nested form as
\[ f(x) = 1.5 + x(3.2 + x(-6.1 + x)). \]
Using three-digit chopping arithmetic now produces
f(4.71) = 1.5 + 4.71(3.2 + 4.71(−6.1 + 4.71))
        = 1.5 + 4.71(3.2 + 4.71(−1.39)) = 1.5 + 4.71(3.2 − 6.54)
        = 1.5 + 4.71(−3.34) = 1.5 − 15.7 = −14.2.
In a similar manner, three-digit rounding gives the answer −14.3.
The relative error in the case of three-digit chopping is
\[ \left| \frac{-14.263899 + 14.2}{-14.263899} \right| \approx 0.0045, \]
and for three-digit rounding is
\[ \left| \frac{-14.263899 + 14.3}{-14.263899} \right| \approx 0.0025. \]
Nesting has reduced the relative errors for both the approximations.
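Nested (Horner) evaluation costs one multiplication and one addition per coefficient. A minimal Python sketch (ours, not from the course text) for the polynomial of Example 12:

def horner(coeffs, x):
    # evaluate a polynomial with coefficients [a_n, ..., a_1, a_0], highest power first
    result = 0.0
    for a in coeffs:
        result = result * x + a   # one multiplication and one addition per step
    return result

# f(x) = x^3 - 6.1 x^2 + 3.2 x + 1.5 at x = 4.71 (full double precision)
print(horner([1.0, -6.1, 3.2, 1.5], 4.71))   # -14.263899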
Example 13. How should we evaluate y = x − sin x when x is small?

Sol. Since sin x ≈ x when x is small, direct evaluation causes a loss of significant figures.
Alternatively, if we use the Taylor series for sin x, we obtain
\[ y = x - \left( x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots \right) = \frac{x^3}{6} - \frac{x^5}{6 \times 20} + \frac{x^7}{6 \times 20 \times 42} - \cdots \]
\[ = \frac{x^3}{6} \left( 1 - \frac{x^2}{20} \left( 1 - \frac{x^2}{42} \left( 1 - \frac{x^2}{72} (\cdots) \right) \right) \right). \]

7. Algorithms and Stability


An algorithm is a procedure that describes, in an unambiguous manner, a finite sequence of steps
to be performed in a specified order. The object of the algorithm is to implement a procedure to solve
a problem or approximate a solution to the problem. One criterion we will impose on an algorithm
whenever possible is that small changes in the initial data produce correspondingly small changes in
the final results. An algorithm that satisfies this property is called stable; otherwise it is unstable.
Some algorithms are stable only for certain choices of initial data, and are called conditionally stable.
The words condition and conditioning are used to indicate how sensitive the solution of a problem may
be to small changes in the input data. A problem is well-conditioned if small changes in the input data
can produce only small changes in the results. On the other hand, a problem is ill-conditioned if small
changes in the input data can produce large changes in the output.
For certain types of problems, a condition number can be defined. If that number is large, it indicates
an ill-conditioned problem. In contrast, if the number is modest, the problem is recognized as
well-conditioned.
The condition number can be calculated in the following manner:
\[ \kappa = \frac{\text{relative change in output}}{\text{relative change in input}} = \frac{\left| \dfrac{f(x) - f(x^{*})}{f(x)} \right|}{\left| \dfrac{x - x^{*}}{x} \right|} \approx \left| \frac{x f'(x)}{f(x)} \right|. \]
For example, if f(x) = 10/(1 − x²), then the condition number is
\[ \kappa = \left| \frac{x f'(x)}{f(x)} \right| = \frac{2x^2}{|1 - x^2|}. \]
The condition number can be quite large for |x| ≈ 1. Therefore, the function is ill-conditioned there.
Example 14. Compute and interpret the condition number for
(a) f (x) = sin x for x = 0.51π.
(b) f (x) = tan x for x = 1.7.
Sol. (a) The condition number is given by
\[ \kappa = \left| \frac{x f'(x)}{f(x)} \right|. \]
For x = 0.51π, f′(x) = cos(0.51π) = −0.03141 and f(x) = sin(0.51π) = 0.99951.
∴ κ = 0.05035.
Since the condition number is < 1, we conclude that the relative error is attenuated.
(b) f(x) = tan x, f(1.7) = −7.6966, f′(x) = 1/cos²x, f′(1.7) = 1/cos²(1.7) = 60.2377.
κ = |1.7 × 60.2377 / (−7.6966)| = 13.305.
Thus, the function is ill-conditioned.
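A numerical spot-check of these condition numbers (a sketch; the central-difference step h = 1e-6 is an arbitrary choice of ours):

import math

def condition(f, x, h=1e-6):
    # kappa ~ |x f'(x) / f(x)|, with f' estimated by a central difference
    dfdx = (f(x + h) - f(x - h)) / (2 * h)
    return abs(x * dfdx / f(x))

print(condition(math.sin, 0.51 * math.pi))   # ~0.05  -> well-conditioned
print(condition(math.tan, 1.7))              # ~13.3  -> ill-conditioned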
In the following we study an example to create a stable algorithm.

7.1. Creating Algorithms. Another theme that occurs repeatedly in numerical analysis is the dis-
tinction between numerical algorithms that are stable and those that are not. Informally speaking, a
numerical process is unstable if small errors made at one stage of the process are magnified and prop-
agated in subsequent stages and seriously degrade the accuracy of the overall calculation.
An algorithm can be thought of as a sequence of problems, i.e. a sequence of function evaluations.
In this case we consider the algorithm for evaluating f (x) to consist of the evaluation of the sequence
x1 , x2 , · · · , xn . We are concerned with the condition of each of the functions f1 (x1 ), f2 (x2 ), · · · , fn−1 (xn−1 )
where f (x) = fi (xi ) for all i. An algorithm is unstable if any fi is ill-conditioned, i.e. if any fi (xi ) has
condition much worse than f (x).
Example 15. Write an algorithm to calculate the expression f(x) = √(x+1) − √x when x is quite
large. By considering the condition number κ of the subproblem of evaluating the function, show that
such a function evaluation is not stable. Suggest a modification which makes it stable.

Sol. Consider
\[ f(x) = \sqrt{x+1} - \sqrt{x}, \]
so that there is a potential loss of significance when x is large. Taking x = 12345 as an example, one
possible algorithm is
x0 := x = 12345
x1 := x0 + 1
x2 := √x1
x3 := √x0
f(x) := x4 := x2 − x3.

The loss of significance occurs with the final subtraction. We can rewrite the last step in the form
f3(x3) = x2 − x3 to show how the final answer depends on x3. As f3′(x3) = −1, we have the condition
\[ \kappa(x_3) = \left| \frac{x_3 f_3'(x_3)}{f_3(x_3)} \right| = \left| \frac{x_3}{x_2 - x_3} \right|, \]
from which we find κ(x3) ≈ 2.2 × 10⁴ when x = 12345. Note that this is the condition of a subproblem
arrived at during the algorithm. To find an alternative algorithm we write
\[ f(x) = (\sqrt{x+1} - \sqrt{x}) \, \frac{\sqrt{x+1} + \sqrt{x}}{\sqrt{x+1} + \sqrt{x}} = \frac{1}{\sqrt{x+1} + \sqrt{x}}. \]
This suggests the algorithm
x0 := x = 12345
x1 := x0 + 1
x2 := √x1
x3 := √x0
x4 := x2 + x3
f(x) := x5 := 1/x4.
In this case f3(x3) = 1/(x2 + x3), giving a condition for the subproblem of
\[ \kappa(x_3) = \left| \frac{x_3 f_3'(x_3)}{f_3(x_3)} \right| = \left| \frac{x_3}{x_2 + x_3} \right|, \]
which is approximately 0.5 when x = 12345, and indeed in any case where x is much larger than 1.
Thus the first algorithm is unstable and the second is stable for large values of x. In general such
analyses are not usually so straightforward but, in principle, stability can be analysed by examining
the condition of a sequence of subproblems.
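The two algorithms can be compared directly in Python. In double precision the cancellation at x = 12345 is mild, so the sketch below uses x = 10¹² (our choice, to make the loss of significance visible):

import math

x = 1e12
unstable = math.sqrt(x + 1) - math.sqrt(x)        # subtracts nearly equal numbers
stable   = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))
print(unstable)   # about 5e-07, but only the first few printed digits are correct
print(stable)     # 4.99999999999875e-07, correct to full double precision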
Example 16. Write an algorithm to calculate the expression f (x) = sin(a + x) − sin a, when x =
0.0001. By considering the condition number κ of the subproblem of evaluating the function, show that
such a function evaluation is not stable. Suggest a modification which makes it stable.
Sol. Let x = 0.0001
x0 = 0.0001
x1 = a + x0
x2 = sin x1
x3 = sin a
x4 = x2 − x3 .
Now, to check how the final subtraction depends on x3, we consider the function f3(x3) = x2 − x3, for which
\[ \kappa(x_3) = \left| \frac{x_3 f_3'(x_3)}{f_3(x_3)} \right| = \left| \frac{x_3}{x_2 - x_3} \right|. \]
Since x2 ≈ x3, we obtain a very large condition number, which shows that the last step is not stable.
Now we modify the above algorithm. We write the equivalent form
f (x) = sin(a + x) − sin a = 2 sin(x/2) cos(a + x/2).
The modified algorithm is the following
x0 = 0.0001
x1 = x0 /2
x2 = sin x1
x3 = cos(a + x1 )
x4 = 2x2 x3 .
Now we consider the function f3(x3) = 2 x2 x3, for which
\[ \kappa(x_3) = \left| \frac{x_3 f_3'(x_3)}{f_3(x_3)} \right| = \left| \frac{x_3 \cdot 2 x_2}{2 x_2 x_3} \right| = 1. \]
Thus the condition number is quite good, so this form is acceptable.
Remarks
(1) Accuracy tells us the closeness of the computed solution to the true solution of the problem.
Accuracy depends on the conditioning of the problem as well as the stability of the algorithm.
(2) Stability alone does not guarantee accurate results. Applying a stable algorithm to a well-conditioned
problem yields an accurate solution. Inaccuracy can result from applying a stable algorithm to an ill-
conditioned problem, or an unstable algorithm to a well-conditioned problem.

Exercises
(1) Compute the absolute error and relative error in the approximations of x by x*.
a. x = π, x* = 22/7   b. x = √2, x* = 1.414   c. x = 8!, x* = 39900.
(2) Find the largest interval in which x* must lie to approximate x with relative error at most 10⁻⁴
for each value of x.
a. π   b. e   c. √3   d. ∛7.
(3) A rectangular parallelepiped has sides of length 3 cm, 4 cm, and 5 cm, measured to the nearest
centimeter. What are the best upper and lower bounds for the volume of this parallelepiped?
What are the best upper and lower bounds for the surface area?
(4) Use three-digit rounding arithmetic to perform the following calculations. Compute the abso-
lute error and relative error with the exact value determined to at least five digits.
a. √3 + (√5 + √7)   b. (121 − 0.327) − 119   c. −10π + 6e − 3/62   d. (π − 22/7)/(1/17).
(5) Find the relative error in taking the difference of the numbers √5.5 = 2.345 and √6.1 = 2.470.
The numbers should be correct to four significant figures.
(6) Associative and distributive laws are not always valid in case of normalized floating-point
representation.
i. Let a = 0.5555e1, b = 0.4545e1, c = 0.4535e1. Show that
a(b − c) ≠ ab − ac.
ii. Further, let a = 0.5665e1, b = 0.5556e − 1, c = 0.5644e1. Show that
(a + b) − c ≠ (a − c) + b.
(7) Calculate the values of x² + 2x − 2 and (2x − 2) + x², where x = 0.7320e0, using normalized
floating-point arithmetic, and prove that they are not the same. Compare with the value of (x² − 2) + 2x.
(8) Use four-digit rounding arithmetic and the formula to find the most accurate approximations
to the roots of the following quadratic equations. Compute the absolute errors and relative
errors.
\[ \frac{1}{3}x^2 + \frac{123}{4}x - \frac{1}{6} = 0. \]
(9) Find the root of smallest magnitude of the equation x2 − 1000x + 25 = 0 using quadratic
formula. Work in floating-point arithmetic using a four-decimal place mantissa.
(10) Suppose two points (x0, y0) and (x1, y1) are on a straight line with y1 ≠ y0. Two formulas are
available to find the x-intercept of the line:
\[ x = \frac{x_0 y_1 - x_1 y_0}{y_1 - y_0}, \qquad \text{and} \qquad x = x_0 - \frac{(x_1 - x_0) y_0}{y_1 - y_0}. \]
y1 − y0 y1 − y0
Use the data (x0 , y0 ) = (1.31, 3.24) and (x1 , y1 ) = (1.93, 4.76) and three-digit rounding arith-
metic to compute the x-intercept both ways. Which method is better and why?
(11) Consider the identity
\[ \int_{0}^{x} \sin(xt) \, dt = \frac{1 - \cos(x^2)}{x}. \]

Explain the difficulty in using the right-hand fraction to evaluate this expression when x is
close to zero. Give a way to avoid this problem and be as precise as possible.
(12) a. Consider the stability (by calculating the condition number) of √(1 + x) − 1 when x is near
0. Rewrite the expression to rid it of subtractive cancellation.
b. Rewrite e^x − cos x to be stable when x is near 0.
(13) Suppose that a function f (x) = ln(x + 1) − ln(x), is computed by the following algorithm for
large values of x using six digit rounding arithmetic
x0 : = x = 12345
x1 : = x0 + 1
x2 : = ln x1
x3 : = ln x0
f (x) := x4 : = x2 − x3 .
By considering the condition κ(x3 ) of the subproblem of evaluating the function, show that
such a function evaluation is not stable. Also propose the modification of function evaluation
so that algorithm will become stable.
(14) Assume a 3-digit mantissa with rounding.
a. Evaluate y = x³ − 3x² + 4x + 0.21 for x = 2.73.
b. Evaluate y = [(x − 3)x + 4]x + 0.21 for x = 2.73.
Compare and discuss the errors obtained in parts (a) and (b).
(15) a. How many multiplications and additions are required to determine a sum of the form
\[ \sum_{i=1}^{n} \sum_{j=1}^{i} a_i b_j \, ? \]
b. Modify the sum in part (a) to an equivalent form that reduces the number of computations.
(16) Let P(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0 be a polynomial, and let x_0 be given. Construct
an algorithm to evaluate P(x_0) using nested multiplication.
(17) Construct an algorithm that has as input an integer n ≥ 1, numbers x_0, x_1, · · · , x_n, and a
number x, and that produces as output the product (x − x_0)(x − x_1) · · · (x − x_n).

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley & Sons, 3rd
edition, 2004.
CHAPTER 2 (8 LECTURES)
ROOTS OF NON-LINEAR EQUATIONS IN ONE VARIABLE

1. Introduction
Finding one or more roots (or zeros) of the equation
f(x) = 0
is one of the more commonly occurring problems of applied mathematics. In most cases explicit
solutions are not available and we must be satisfied with being able to find a root to any specified
degree of accuracy. The numerical procedures for finding roots are called iterative methods. These
problems arise in a variety of applications.
The growth of a population can often be modeled over short periods of time by assuming that the
population grows continuously with time at a rate proportional to the number present at that time.
Suppose that N (t) denotes the number in the population at time t and λ denotes the constant birth
rate of the population. Then the population satisfies the differential equation
\[ \frac{dN(t)}{dt} = \lambda N(t), \]
whose solution is N(t) = N₀ e^{λt}, where N₀ denotes the initial population.
This exponential model is valid only when the population is isolated, with no immigration. If
immigration is permitted at a constant rate I, then the differential equation becomes
\[ \frac{dN(t)}{dt} = \lambda N(t) + I, \]
whose solution is
\[ N(t) = N_0 e^{\lambda t} + \frac{I}{\lambda} \left( e^{\lambda t} - 1 \right). \]
Suppose a certain population contains N (0) = 1000000 individuals initially, that 424000 individuals
immigrate into the community in the first year, and that N (1) = 1564000 individuals are present at
the end of one year. To determine the birth rate of this population, we need to find λ in the equation
\[ 1564000 = 1000000 \, e^{\lambda} + \frac{424000}{\lambda} \left( e^{\lambda} - 1 \right). \]
It is not possible to solve explicitly for λ in this equation, but numerical methods discussed in this
chapter can be used to approximate solutions of equations of this type to an arbitrarily high accuracy.
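For instance, a minimal Python sketch (ours, using the interval-halving idea developed in the next section; the starting bracket [0.05, 0.2] is an assumption we checked by a sign change) locates λ:

import math

def f(L):
    # residual of 1564000 = 1000000 e^L + (424000/L)(e^L - 1)
    return 1_000_000 * math.exp(L) + 424_000 / L * (math.exp(L) - 1) - 1_564_000

a, b = 0.05, 0.2          # f(a) < 0 < f(b), so a root is bracketed
for _ in range(50):       # halve the bracket 50 times
    c = a + (b - a) / 2
    if f(a) * f(c) > 0:
        a = c
    else:
        b = c
print(c)                  # a birth rate of roughly 0.11 per year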
Definition 1.1 (Simple and multiple root). A zero (root) has a “multiplicity”, which refers to the
number of times its associated factor appears in the equation. A root having multiplicity one is
called a simple root. For example, f(x) = (x − 1)(x − 2) has simple roots at x = 1 and x = 2, but
g(x) = (x − 1)² has a root of multiplicity 2 at x = 1, which is therefore not a simple root.
A root with multiplicity m ≥ 2 is called a multiple or repeated root. For example, in the equation
(x − 1)² = 0, x = 1 is a multiple (double) root.
If a polynomial has a multiple root, its derivative also shares that root.
Let α be a root of the equation f(x) = 0, and imagine writing it in the factored form
\[ f(x) = (x - \alpha)^{m} \varphi(x) \]
with some integer m ≥ 1 and some continuous function φ(x) for which φ(α) ≠ 0. Then we say that α
is a root of f(x) of multiplicity m.
Now we study some iterative methods to solve the non-linear equations.

2. The Bisection Method


2.1. Method. Let f(x) be a continuous function on some given interval [a, b] that satisfies the
condition f(a)f(b) < 0. Then, by the Intermediate Value Theorem, the function f(x) must have at least one
root in [a, b]. The bisection method repeatedly bisects the interval [a, b] and then selects the subinterval
in which a root must lie for further processing. It is a very simple and robust method, but it is also
relatively slow. Usually [a, b] is chosen to contain only one root α.

Figure 1. Bisection method

Example 1. The sum of two numbers is 20. If each number is added to its square root, the
product of the resulting sums is 155.55. Perform five iterations of the bisection method to determine the
two numbers.¹
Sol. Let x and y be the two numbers. Then
x + y = 20.
Now x is added to √x and y is added to √y. The product of these sums is
\[ (x + \sqrt{x})(y + \sqrt{y}) = 155.55, \]
\[ \therefore \ (x + \sqrt{x})(20 - x + \sqrt{20 - x}) = 155.55. \]
We write the above equation as the root-finding problem
\[ f(x) = (x + \sqrt{x})(20 - x + \sqrt{20 - x}) - 155.55 = 0. \]
As f(6)f(7) < 0, there is a root in the interval (6, 7).
Below are the iterations of the bisection method for finding the root. Therefore the root is
approximately 6.53125.

n    a         b         c         sign f(a)f(c)

1    6.000000  7.000000  6.500000  > 0
2    6.500000  7.000000  6.750000  < 0
3    6.500000  6.750000  6.625000  < 0
4    6.500000  6.625000  6.562500  < 0
5    6.500000  6.562500  6.531250  < 0

If x = 6.53125, then y = 20 − 6.53125 = 13.4688.
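The table can be reproduced with a few lines of Python (a sketch of the method as described above; the print format is our own):

import math

def f(x):
    return (x + math.sqrt(x)) * (20 - x + math.sqrt(20 - x)) - 155.55

a, b = 6.0, 7.0
for n in range(1, 6):            # five bisection iterations, as in Example 1
    c = a + (b - a) / 2          # midpoint, in the overflow-safe form
    print(n, a, b, c)
    if f(a) * f(c) > 0:
        a = c                    # root lies in [c, b]
    else:
        b = c                    # root lies in [a, c]
# last midpoint c = 6.53125, so y = 20 - 6.53125 = 13.46875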


Further, we discuss the convergence of the approximate solution to the exact solution. First we
define the usual meaning of convergence and the order of convergence.

¹ Choice of initial approximations: Initial approximations to the root are often known from the physical significance of
the problem. Graphical methods are used to find the zeros of f(x) = 0, and any value in the neighborhood of a root can be
taken as the initial approximation.
If the given equation f(x) = 0 can be written as f1(x) = f2(x), then the point of intersection of the graphs
y = f1(x) and y = f2(x) gives the root of the equation. Any value in the neighborhood of this point can be taken as
the initial approximation.

Definition 2.1 (Convergence). A sequence {xn} is said to converge to a point α with order p if
there exists a constant c such that
\[ \lim_{n \to \infty} \frac{|x_{n+1} - \alpha|}{|x_n - \alpha|^{p}} = c, \quad n \ge 0. \]
The constant c is known as the asymptotic error constant. If we write e_n = |x_n − α|, where e_n denotes
the absolute error in the n-th iteration, then in the limiting case we can write
\[ e_{n+1} = c \, e_n^{p}. \]
Two cases are given special attention.
(i) If p = 1 (and c < 1), the sequence is linearly convergent.
(ii) If p = 2, the sequence is quadratically convergent.
Definition 2.2. Let {βn} be a sequence which converges to zero and let {xn} be any sequence. If there
exist a constant c > 0 and an integer N > 0 such that
\[ |x_n - \alpha| \le c |\beta_n|, \quad \forall n \ge N, \]
then we say that {xn} converges to α with rate O(βn). We write
\[ x_n = \alpha + O(\beta_n). \]
Example: Define two sequences for n ≥ 1,
\[ x_n = \frac{n+1}{n^2}, \qquad y_n = \frac{n+2}{n^3}. \]
Both sequences have limit 0, but the sequence {yn} converges to this limit much faster than the
sequence {xn}. Now
\[ |x_n - 0| = \frac{n+1}{n^2} < \frac{n+n}{n^2} = 2 \cdot \frac{1}{n} = 2\beta_n \]
and
\[ |y_n - 0| = \frac{n+2}{n^3} < \frac{n+2n}{n^3} = 3 \cdot \frac{1}{n^2} = 3\tilde{\beta}_n. \]
Hence the rate of convergence of {xn} to zero is similar to the convergence of {1/n} to zero, whereas
{yn} converges to zero at a rate similar to the more rapidly convergent sequence {1/n²}. We express
this by writing
\[ x_n = 0 + O\!\left( \frac{1}{n} \right) \quad \text{and} \quad y_n = 0 + O\!\left( \frac{1}{n^2} \right). \]
2.2. Convergence analysis. Now we analyze the convergence of the iterations generated by the
bisection method.
Theorem 2.3. Suppose that f ∈ C[a, b] and f (a) · f (b) < 0. Then the bisection method generates a
sequence {cn } approximating a zero α of f with linear convergence.
Proof. Let [a1 , b1 ], [a2 , b2 ], · · · , [an , bn ], · · · , denote the successive intervals produced by the bisection
algorithm. Thus
a = a1 ≤ a2 ≤ · · · ≤ b1 = b
b = b1 ≥ b2 ≥ · · · ≥ a1 = a.
This implies {an } and {bn } are monotonic and bounded and hence convergent.
Since
\[ b_1 - a_1 = (b - a), \]
\[ b_2 - a_2 = \frac{1}{2}(b_1 - a_1) = \frac{1}{2}(b - a), \]
\[ \cdots\cdots \]
\[ b_n - a_n = \frac{1}{2^{n-1}} (b - a), \tag{2.1} \]
we have
\[ \lim_{n \to \infty} (b_n - a_n) = 0. \]

Here b − a denotes the length of the original interval with which we started. Taking limits,
\[ \lim_{n \to \infty} a_n = \lim_{n \to \infty} b_n = \alpha \ \text{(say)}. \]
Since f is a continuous function,
\[ \lim_{n \to \infty} f(a_n) = f\!\left( \lim_{n \to \infty} a_n \right) = f(\alpha). \]
The bisection method ensures that
\[ f(a_n) f(b_n) \le 0, \]
which implies
\[ \lim_{n \to \infty} f(a_n) f(b_n) = f^{2}(\alpha) \le 0 \implies f(\alpha) = 0. \]
Thus the common limit of {an} and {bn} is a zero of f in [a, b].
Since the root α is in either the interval [an, cn] or [cn, bn], we have
\[ |\alpha - c_n| < c_n - a_n = b_n - c_n = \frac{1}{2}(b_n - a_n). \]
Combining this with (2.1), we obtain the further bound
\[ e_n = |\alpha - c_n| < \frac{1}{2^{n}} (b - a). \]
Therefore
\[ e_{n+1} < \frac{1}{2^{n+1}} (b - a), \qquad \therefore \ e_{n+1} < \frac{1}{2} e_n. \]
This shows that the iterates cn converge to α as n → ∞. By the definition of convergence, we can say
that the bisection method converges linearly with rate 1/2.

Illustrations: 1. Since the method brackets the root, it is guaranteed to converge; however, it can
be very slow.
2. Computing cn: It might happen that at a certain iteration n, computation of cn = (an + bn)/2
gives an overflow. It is better to compute cn as
\[ c_n = a_n + \frac{b_n - a_n}{2}. \]
3. Stopping Criteria: Since this is an iterative method, we must determine some stopping criteria
that will allow the iteration to stop. We can use the following criteria, in terms of the absolute error
and relative error:
\[ |c_{n+1} - c_n| \le \varepsilon, \qquad \frac{|c_{n+1} - c_n|}{|c_{n+1}|} \le \varepsilon, \]
provided |c_{n+1}| ≠ 0.
The criterion |f(cn)| ≤ ε can be misleading, since it is possible to have |f(cn)| very small even if cn is
not close to the root.
Let us now find the minimum number of iterations N needed with the bisection method to achieve a
certain desired accuracy. The interval length after N iterations is (b − a)/2^N. So, to obtain an
accuracy of ε, we must have
\[ \frac{b-a}{2^{N}} \le \varepsilon, \qquad \text{i.e.,} \qquad 2^{-N}(b-a) \le \varepsilon, \]
or
\[ N \ge \frac{\log(b-a) - \log \varepsilon}{\log 2}. \]
Note that the number N depends only on the initial interval [a, b] bracketing the root.

4. If a function just touches the x-axis, for example f(x) = x², then we cannot find a and b such
that f(a)f(b) < 0, even though x = 0 is a root of f(x) = 0.
5. For functions with a singularity at which the sign reverses, the bisection method may converge on
the singularity. An example is f(x) = 1/x: we can choose a and b such that f(a)f(b) < 0, but the
function is not continuous, and the theorem that a root exists is not applicable.
Example 2. Use the bisection method to find a solution accurate to within 10⁻² for x³ − 7x² + 14x − 6 = 0
on [0, 1].
Sol. The number of iterations satisfies
\[ N \ge \frac{\log(1-0) - \log(10^{-2})}{\log 2} = 6.6439. \]
Thus, a minimum of 7 iterations will be needed to obtain the desired accuracy using the bisection
method. This yields the following results for mid-points cn and f (cn ):

n an bn cn signf (a)f (c)


1 0 1 0.5 >0
2 0.5 1 0.75 <0
3 0.5 0.75 0.625 <0
4 0.5 0.625 0.5625 >0
5 0.5625 0.625 0.59375 <0
6 0.5625 0.59375 0.578125 <0
7 0.578125 0.59375 0.5859375

3. Fixed-point iteration method


A fixed point for a function is a number at which the value of the function does not change when
the function is applied. The terminology was first used by the Dutch mathematician L. E. J. Brouwer
(1882-1962) in the early 1900s.
The number α is a fixed point for a given function g if g(α) = α.
In this section we consider the problem of finding solutions to fixed-point problems and the connec-
tion between the fixed-point problems and the root-finding problems we wish to solve. Root-finding
problems and fixed-point problems are equivalent classes in the following sense:
Given a root-finding problem f (x) = 0, we can define functions g with a fixed point at x in a number of
ways. Conversely, if the function g has a fixed point at α, then the function defined by f (x) = x − g(x)
has a zero at α.
Although the problems we wish to solve are in the root-finding form, the fixed-point form is easier to
analyze, and certain fixed-point choices lead to very powerful root-finding techniques.
Example 3. Determine any fixed points of the function g(x) = x2 − 2.
Sol. A fixed point x for g has the property that
\[ x = g(x) = x^2 - 2, \]
which implies that
\[ 0 = x^2 - x - 2 = (x+1)(x-2). \]
A fixed point for g occurs precisely when the graph of y = g(x) intersects the graph of y = x, so g has
two fixed points, one at x = −1 and the other at x = 2.

Fixed-point iterations: We now consider solving an equation x = g(x) for a root α by the iteration
\[ x_{n+1} = g(x_n), \quad n \ge 0, \]
with x0 as an initial guess to α.
Each solution of x = g(x) is called a fixed point of g.

For example, consider solving x² − 3 = 0. We can write
1. x = x² + x − 3, or more generally x = x + c(x² − 3), c ≠ 0.
2. x = 3/x.
3. x = (1/2)(x + 3/x).
Let x0 = 2.

Table 1. Iterations for the three choices

n    1       2     3
0    2.0     2.0   2.0
1    3.0     1.5   1.75
2    9.0     2.0   1.732143
3    87.0    1.5   1.73205


Now √3 = 1.73205, and it is clear that the third choice works, but why do the other two fail?
Which choices of g lead to a correct approximation is answered by the convergence result below
(which requires |g′(α)| < 1 and a ≤ g(x) ≤ b, ∀x ∈ [a, b], in a neighborhood of the root α).
Lemma 3.1. Let g(x) be a continuous function on [a, b] and assume that a ≤ g(x) ≤ b for all x ∈ [a, b].
Then x = g(x) has at least one solution in [a, b].
Proof. Let g be a continuous function on [a, b] with a ≤ g(x) ≤ b for all x ∈ [a, b], and consider
φ(x) = g(x) − x.
If g(a) = a or g(b) = b then the proof is trivial, hence we assume that g(a) ≠ a and g(b) ≠ b. Since
a ≤ g(x) ≤ b, this forces
g(a) > a and g(b) < b.
Now
φ(a) = g(a) − a > 0
and
φ(b) = g(b) − b < 0.
Since φ is continuous and φ(a)φ(b) < 0, by the Intermediate Value Theorem φ has at least one zero
in [a, b], i.e., there exists some α such that
g(α) = α, α ∈ [a, b].
Graphically, the roots are the intersection points of y = x & y = g(x) as shown in the Figure.

Figure 2. An example of Lemma



Theorem 3.2 (Contraction Mapping Theorem). Let g and g′ be continuous functions on [a, b], and
assume that g satisfies a ≤ g(x) ≤ b for all x ∈ [a, b]. Furthermore, assume that there is a positive
constant λ < 1 with
\[ |g'(x)| \le \lambda, \quad \forall x \in (a, b). \]
Then
1. x = g(x) has a unique solution α in the interval [a, b].
2. The iterates x_{n+1} = g(x_n), n ≥ 0, converge to α for any choice of x0 ∈ [a, b].
3. \[ |\alpha - x_n| \le \frac{\lambda^{n}}{1 - \lambda} |x_1 - x_0|, \quad n \ge 0. \]
4. The convergence is linear.
Proof. Let g and g′ be continuous functions on [a, b] and assume that a ≤ g(x) ≤ b for all x ∈ [a, b].
By the previous Lemma, there exists at least one solution to x = g(x). By the Mean-Value Theorem,
for any x, y ∈ [a, b] there exists a point c between them such that
\[ g(x) - g(y) = g'(c)(x - y), \qquad \text{so} \qquad |g(x) - g(y)| \le \lambda |x - y|, \quad 0 < \lambda < 1. \]
1. Let x = g(x) have two solutions, say α and β, in [a, b]; then α = g(α) and β = g(β). Now
\[ |\alpha - \beta| = |g(\alpha) - g(\beta)| \le \lambda |\alpha - \beta| \implies (1 - \lambda)|\alpha - \beta| \le 0 \implies \alpha = \beta, \]
since 0 < λ < 1. Therefore x = g(x) has a unique solution in [a, b], which we call α.
2. To check the convergence of the iterates {xn}, we observe that they all remain in [a, b], since
xn ∈ [a, b] implies x_{n+1} = g(x_n) ∈ [a, b]. Now
\[ |\alpha - x_{n+1}| = |g(\alpha) - g(x_n)| = |g'(c_n)| \, |\alpha - x_n| \tag{3.1} \]
for some cn between α and xn. Hence
\[ |\alpha - x_{n+1}| \le \lambda |\alpha - x_n| \le \lambda^{2} |\alpha - x_{n-1}| \le \cdots \le \lambda^{n+1} |\alpha - x_0|. \]
As n → ∞, λⁿ → 0, which implies xn → α. Also
\[ |\alpha - x_n| \le \lambda^{n} |\alpha - x_0|. \tag{3.2} \]
|α − xn | ≤ λn |α − x0 |. (3.2)
3. To find the bound:
Since
|α − x0 | = |α − x1 + x1 − x0 |
≤ |α − x1 | + |x1 − x0 |
≤ λ|α − x0 | + |x1 − x0 |
=⇒ (1 − λ)|α − x0 | ≤ |x1 − x0 |
1
=⇒ |α − x0 | ≤ |x1 − x0 |
1−λ
λn
=⇒ λn |α − x0 | ≤ |x1 − x0 |
1−λ
Therefore using (3.2)
λn
|α − xn | ≤ λn |α − x0 | ≤ |x1 − x0 |
1−λ
λn
=⇒ |α − xn | ≤ |x1 − x0 |.
1−λ

4. Now, by equation (3.1),
\[ \frac{|\alpha - x_{n+1}|}{|\alpha - x_n|} = |g'(c_n)|, \]
for some cn between α and xn. Taking the limit n → ∞, and noting that xn → α implies cn → α, we
have
\[ \lim_{n \to \infty} \frac{|\alpha - x_{n+1}|}{|\alpha - x_n|} = |g'(\alpha)|. \]
If |g′(α)| < 1, the above formula shows that the iterates are linearly convergent with rate (asymptotic
error constant) |g′(α)|. If in addition g′(α) ≠ 0, then the formula proves that the convergence is exactly
linear, with no higher order of convergence possible.
Illustrations: 1. In practice, it is often difficult to find an interval [a, b] for which the condition
a ≤ g(x) ≤ b is satisfied. On the contrary, if |g′(α)| > 1, then the iteration method x_{n+1} = g(x_n) will
not converge to α. When |g′(α)| = 1, no conclusion can be drawn, and even if convergence were to
occur, the method would be far too slow to be practical.
2. If
\[ |\alpha - x_n| \le \frac{\lambda^{n}}{1 - \lambda} |x_1 - x_0| < \varepsilon, \]
where ε is the desired accuracy, this bound can be used to find the number of iterations needed to
achieve the accuracy ε. Also, from part 2, |α − xn| ≤ λⁿ|α − x0| ≤ λⁿ max{x0 − a, b − x0} < ε can be
used to find the number of iterations.
3. The possible behavior of the fixed-point iterates {xn} is shown in Figure 3 for various values of g′(α).
[Figure 3. Convergent and non-convergent sequences x_{n+1} = g(x_n)]
To see the convergence, consider the case of x1 = g(x0), the height of y = g(x) at x0. We bring
the number x1 back to the x-axis by using the line y = x and the height y = x1. We continue this with
each iterate, obtaining a stair-step behavior when g′(α) > 0. When g′(α) < 0, the iterates oscillate
around the fixed point α, as can be seen in the figure. In the first figure (on top) the iterations converge
monotonically; in the second they are oscillatory convergent; in the third figure the iterations diverge;
and in the last figure they are oscillatory divergent.

Theorem 3.3. Let α be a root of x = g(x), and let g(x) be p times continuously differentiable for all
x ∈ [α − δ, α + δ] with g(x) ∈ [α − δ, α + δ], for some p ≥ 2. Furthermore assume
\[ g'(\alpha) = \cdots = g^{(p-1)}(\alpha) = 0. \tag{3.3} \]
Then, if the initial guess x0 is sufficiently close to α, the iteration
\[ x_{n+1} = g(x_n), \quad n \ge 0, \]
will have order of convergence p.

Proof. Let g(x) be p times continuously differentiable for all x ∈ [α − δ, α + δ], with g(x) ∈ [α − δ, α + δ],
satisfying the conditions in equation (3.3) stated above. Now expand g(xn) in a Taylor polynomial
about α:
\[ x_{n+1} = g(x_n) = g(\alpha + x_n - \alpha) = g(\alpha) + (x_n - \alpha) g'(\alpha) + \cdots + \frac{(x_n - \alpha)^{p-1}}{(p-1)!} g^{(p-1)}(\alpha) + \frac{(x_n - \alpha)^{p}}{p!} g^{(p)}(\xi_n), \]
for some ξn between xn and α. Using equation (3.3) and g(α) = α, we obtain
\[ x_{n+1} - \alpha = \frac{(x_n - \alpha)^{p}}{p!} g^{(p)}(\xi_n) \implies \frac{x_{n+1} - \alpha}{(x_n - \alpha)^{p}} = \frac{g^{(p)}(\xi_n)}{p!} \implies \frac{\alpha - x_{n+1}}{(\alpha - x_n)^{p}} = (-1)^{p-1} \frac{g^{(p)}(\xi_n)}{p!}. \]
Taking limits as n → ∞ on both sides,
\[ \lim_{n \to \infty} \frac{\alpha - x_{n+1}}{(\alpha - x_n)^{p}} = \lim_{n \to \infty} (-1)^{p-1} \frac{g^{(p)}(\xi_n)}{p!} = (-1)^{p-1} \frac{g^{(p)}(\alpha)}{p!}. \]
By the definition of convergence, the iterations have order of convergence p.
Example 4. Consider the equation x³ − 7x + 2 = 0 in [0, 1]. Write a fixed-point iteration which will
converge to the solution.

Sol. We rewrite the equation in the form x = (x³ + 2)/7 and define the fixed-point iteration
\[ x_{n+1} = \frac{1}{7}(x_n^3 + 2). \]
Now g(x) = (x³ + 2)/7, so g : [0, 1] → [0, 1] and |g′(x)| = 3x²/7 ≤ 3/7 < 1 for all x ∈ [0, 1].

Hence, by the Contraction Mapping Theorem, the sequence {xn} defined above converges to the
unique solution of the given equation. Starting with x0 = 0.5, we compute the iterates as follows.
x1 = 0.303571429
x2 = 0.28971083
x3 = 0.289188016.
Therefore root correct to three decimals is 0.289.
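A minimal Python sketch of this iteration (the tolerance 10⁻⁶ and the iteration cap are our own choices):

def g(x):
    return (x**3 + 2) / 7

x = 0.5                          # starting guess in [0, 1]
for _ in range(20):
    x_new = g(x)
    if abs(x_new - x) < 1e-6:    # stop when successive iterates agree
        break
    x = x_new
print(x_new)                     # ~0.289169, the root of x^3 - 7x + 2 in [0, 1]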
Example 5. The equation e^x = 4x² has a root in [4, 5]. Show that we cannot find that root using
x = g(x) = (1/2)e^{x/2} with the fixed-point iteration method. Can you find another iterative formula
which will locate that root? If yes, find three iterations with x0 = 4.5. Also find the error bound.

Sol. Here g(x) = (1/2)e^{x/2} and g′(x) = (1/4)e^{x/2} > 1 for all x ∈ (4, 5); therefore, this fixed-point
iteration fails to converge to the root in [4, 5].
Now consider x = g(x) = ln(4x²), for which |g′(x)| = 2/x < 1 for all x ∈ (4, 5).
Also 4 ≤ g(x) ≤ 5, so this fixed-point iteration converges to the root in [4, 5].
Using the fixed-point iteration method with x0 = 4.5 gives the iterates
x1 = g(x0) = ln(4 × 4.5²) = 4.3944,
x2 = 4.3469,
x3 = 4.3253.
Now λ = max_{4 ≤ x ≤ 5} |g′(x)| = g′(4) = 0.5.
We have the error bound
\[ |\alpha - x_3| \le \frac{0.5^{3}}{1 - 0.5} |4.3944 - 4.5| = 0.0264. \]
Example 6. The equation x³ + 4x² − 10 = 0 has a unique root in [1, 2]. Write fixed-point
representations and discuss which of them converge to the unique solution.

Sol. We discuss the possible ways of writing g(x).
(1) x = g1(x) = x − x³ − 4x² + 10.
For g1(x) = x − x³ − 4x² + 10, we have g1(1) = 6 and g1(2) = −12, so g1 does not map [1, 2] into
itself. Moreover, g1′(x) = 1 − 3x² − 8x, so |g1′(x)| > 1 for all x in [1, 2]. Although the Convergence
Theorem does not guarantee that the method must fail for this choice of g, there is no reason
to expect convergence.
(2) x³ = 10 − 4x²  ⟹  x² = 10/x − 4x  ⟹  x = [(10/x) − 4x]^{1/2} = g2(x).
With g2(x) = [10/x − 4x]^{1/2}, we can see that g2 does not map [1, 2] into [1, 2], and the sequence
{xn} is not defined when x0 = 1.5. Moreover, there is no interval containing α ≈ 1.365
such that |g2′(x)| < 1, because |g2′(α)| ≈ 3.4. There is no reason to expect that this method will
converge.
(3) 4x² = 10 − x³  ⟹  x = (1/2)(10 − x³)^{1/2} = g3(x).
For the function g3(x) = (1/2)(10 − x³)^{1/2}, we have
\[ g_3'(x) = -\frac{3}{4} x^{2} (10 - x^{3})^{-1/2} < 0 \quad \text{on } [1, 2], \]
so g3 is strictly decreasing on [1, 2]. However, |g3′(2)| ≈ 2.12, so the condition |g3′(x)| ≤ λ < 1
fails on [1, 2]. A closer examination of the sequence {xn} starting with x0 = 1.5 shows that
it suffices to consider the interval [1, 1.5] instead of [1, 2]. On this interval it is still true that
g3′(x) < 0 and g3 is strictly decreasing, but, additionally,
\[ 1 < 1.28 \approx g_3(1.5) \le g_3(x) \le g_3(1) = 1.5, \]
for all x ∈ [1, 1.5]. This shows that g3 maps the interval [1, 1.5] into itself. It is also true that
|g3′(x)| ≤ |g3′(1.5)| ≈ 0.66 on this interval, so the Convergence Theorem confirms the convergence.
(4) x³ + 4x² = 10  ⟹  x²(x + 4) = 10  ⟹
\[ x = \left( \frac{10}{x+4} \right)^{1/2} = g_4(x). \]
For g4(x) we have
\[ |g_4'(x)| = \frac{5}{\sqrt{10}\,(4+x)^{3/2}} \le \frac{5}{\sqrt{10}\,(5)^{3/2}} < 0.15, \quad \text{for all } x \in [1, 2]. \]
The bound on the magnitude of g4′(x) is much smaller than the bound (found in (3)) on the
magnitude of g3′(x), which explains the more rapid convergence using g4.
(5) The sequence defined by
\[ g_5(x) = x - \frac{x^{3} + 4x^{2} - 10}{3x^{2} + 8x} \]
converges much more rapidly than our other choices. In the next sections (Newton’s method)
we will see where this choice came from and why it is so effective.
Starting with x0 = 1.5, the following table shows some of the iterates.
n    (1)      (2)            (3)           (4)           (5)
0    1.5      1.5            1.5           1.5           1.5
1    -0.875   0.8165         1.286953768   1.348399725   1.373333333
2    6.732    2.9969         1.402540804   1.367376372   1.365262015
3    -4.697   (-8.65)^{1/2}  1.345458374   1.364957015   1.365230014
4                            1.375170253   1.365264748   1.365230013
5                            1.360094193   1.365225594
6                            1.367846968   1.365230576
Example 7. Use a fixed-point method to determine a solution to within 10−4 for x = tan x, for x in
[4, 5].
Sol. Using g(x) = tan x and x0 = 4 gives x1 = g(x0 ) = tan 4 = 1.158, which is not in the interval [4, 5].
So we need a different fixed-point function.
If we note that x = tan x implies that
1/x = 1/tan x
=⇒ x = x − 1/x + 1/tan x.
Starting with x0 = 4 and taking g(x) = x − 1/x + 1/tan x,
we obtain x1 = 4.61369, x2 = 4.49596, x3 = 4.49341, x4 = 4.49341.
As x3 and x4 agree to five decimals, it is reasonable to assume that these values are sufficiently accurate.
Example 8. The iterates xn+1 = 2 − (1 + c)xn + c xn³ will converge to α = 1 for some values of the constant
c (provided that x0 is sufficiently close to α). Find the values of c for which convergence occurs. For
what values of c, if any, is the convergence quadratic?
Sol. Fixed-point iteration
xn+1 = g(xn )
with
g(x) = 2 − (1 + c)x + cx³.
If α = 1 is a fixed point, then for convergence we need |g'(α)| < 1:
|g'(1)| = |−(1 + c) + 3c| = |2c − 1| < 1
=⇒ 0 < c < 1.
For these values of c, g''(α) = 6c ≠ 0.
For quadratic convergence we need
g'(α) = 0 and g''(α) ≠ 0.
This gives c = 1/2.
Example 9. Which of the following iterations
a. xn+1 = (1/4)(xn² + 6/xn)
b. xn+1 = 4 − 6/xn²
is suitable to find a root of the equation x³ = 4x² − 6 in the interval [3, 4]? Estimate the number of
iterations required to achieve 10⁻³ accuracy, starting from x0 = 3.
Sol. a. Let g(x) = (1/4)(x² + 6/x), which is continuous in [3, 4], but g'(x) > 1 for all x ∈ (3, 4). So this
choice of g(x) is not suitable.
b. g(x) = 4 − 6/x², which is continuous in [3, 4] and g(x) ∈ [3, 4] for all x ∈ [3, 4].
Also |g'(x)| = |12/x³| < 1 for all x ∈ (3, 4). Then the Contraction Mapping Theorem implies
that a unique fixed point exists in [3, 4]. To find an approximation that is accurate to within 10⁻³,
we need to determine the number of iterations n so that
|α − xn| ≤ (λⁿ/(1 − λ)) |x1 − x0| < 10⁻³.
Here λ = max_{3≤x≤4} |g'(x)| = 4/9, and using the fixed-point method with x0 = 3 we have x1 = g(x0) = 10/3, so
((4/9)ⁿ/(1 − 4/9)) |10/3 − 3| < 10⁻³.
Solving for n, we get n = 8.
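Both the theoretical iteration count and the iteration itself can be verified with a short Python sketch (our own illustration, not part of the original notes):

import math

# bound: lambda^n / (1 - lambda) * |x1 - x0| < 10^{-3}, with lambda = 4/9
lam, gap = 4 / 9, abs(10 / 3 - 3)
n = math.ceil(math.log(1e-3 * (1 - lam) / gap) / math.log(lam))
print(n)  # 8

x = 3.0
for _ in range(n):      # iteration b: x_{n+1} = 4 - 6/x_n^2
    x = 4 - 6 / x**2
print(round(x, 6))      # ≈ 3.514112, the root of x^3 = 4x^2 - 6 in [3, 4]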

4. Iteration method based on first degree equation


4.1. The Secant Method. Let f (x) = 0 be the given non-linear equation.
Let (x0, f(x0)) and (x1, f(x1)) be two points on the curve y = f(x). Then the equation of the secant
line joining these two points on the curve y = f(x) is given by
y − f(x1) = [(f(x1) − f(x0))/(x1 − x0)] (x − x1).
Let the intersection point of the secant line with the x-axis be (x2, 0); then at x = x2, y = 0. Therefore
0 − f(x1) = [(f(x1) − f(x0))/(x1 − x0)] (x2 − x1)
x2 = x1 − [(x1 − x0)/(f(x1) − f(x0))] f(x1).
Here x0 and x1 are two approximations of the root. The point (x2 , 0) can be taken as next approx-
imation of the root. This method is called the secant or chord method and successive iterations are
given by
xn+1 = xn − [(xn − xn−1)/(f(xn) − f(xn−1))] f(xn), n = 1, 2, . . .
Geometrically, in this method we replace the unknown function by a straight line or chord passing
through (x0 , f (x0 )) and (x1 , f (x1 )) and we take the point of intersection of the straight line with the
x-axis as the next approximation to the root and continue the process.
Figure 4. Secant method

Illustrations:
1. Stopping Criterion: We can use the following stopping criteria
|xn − xn−1| < ε,  or  |(xn − xn−1)/xn| < ε,
where ε is the given accuracy.
2. We can combine the secant method with the bisection method and bracket the root, i.e., we choose
initial approximations x0 and x1 in such a manner that f (x0 )f (x1 ) < 0. At each stage we bracket the
root. The method is known as ‘Method of False Position’ or ‘Regula Falsi Method’.
Example 10. Apply the secant method to find the root of the equation e^x = 3x with relative error
< 0.5%. (The printed equation e^x = cos x appears to be a misprint: the iterates below are exactly those of f(x) = e^x − 3x.)
Sol. Let f(x) = e^x − 3x = 0.
The successive iterations of the secant method are given by
xn+1 = xn − [(xn − xn−1)/(f(xn) − f(xn−1))] f(xn), n = 1, 2, . . .
We take initial guesses x0 = −1.1 and x1 = −1, and let en denote the relative error at the n-th step; we obtain
x2 = 0.2709, e1 = |(x2 − x1)/x2| × 100% = 469.09%.
x3 = 0.4917, e2 = |(x3 − x2)/x3| × 100% = 44.9%.
x4 = 0.5961, e3 = |(x4 − x3)/x4| × 100% = 17.51%.
x5 = 0.6170, e4 = |(x5 − x4)/x5| × 100% = 3.4%.
x6 = 0.6190, e5 = |(x6 − x5)/x6| × 100% = 0.32%.
We obtain an error less than 0.5% and accept x6 = 0.6190 as the root with the prescribed accuracy.
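A minimal Python sketch of the secant iteration with this relative-error stopping rule (our own code; function and variable names are assumptions, not from the notes):

import math

def secant(f, x0, x1, tol_percent):
    # Secant iteration with stopping rule |(x2 - x1)/x2| * 100 < tol_percent
    while True:
        x2 = x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
        err = abs((x2 - x1) / x2) * 100
        x0, x1 = x1, x2
        if err < tol_percent:
            return x2

root = secant(lambda x: math.exp(x) - 3 * x, -1.1, -1.0, 0.5)
print(root)  # ≈ 0.619, agreeing with x6 = 0.6190 above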
Example 11. Let f ∈ C²[a, b]. If α is a simple root of f(x) = 0, then show that the sequence {xn}
generated by the secant method has order of convergence 1.618.
Sol. We assume that α is a simple root of f (x) = 0 then f (α) = 0.
Let xn = α + en , where en is the error at n-th step.
An iterative method is said to have order of convergence p if
|xn+1 − α| = C |xn − α|^p.
Or equivalently
|en+1 | = C|en |p .
Successive iterations in the secant method are given by
xn+1 = xn − [(xn − xn−1)/(f(xn) − f(xn−1))] f(xn), n = 1, 2, . . .
The error equation is written as
en+1 = en − [(en − en−1)/(f(α + en) − f(α + en−1))] f(α + en).
By expanding f(α + en) and f(α + en−1) in Taylor series, we obtain the error equation
en+1 = en − (en − en−1)[en f'(α) + (1/2)en² f''(α) + . . .] / ((en − en−1)[f'(α) + (1/2)(en + en−1) f''(α) + . . .])
= en − [en + (1/2)en² f''(α)/f'(α) + . . .][1 + (1/2)(en−1 + en) f''(α)/f'(α) + . . .]^{−1}
= en − [en + (1/2)en² f''(α)/f'(α) + . . .][1 − (1/2)(en−1 + en) f''(α)/f'(α) + . . .]
= (1/2)(f''(α)/f'(α)) en en−1 + O(en² en−1 + en en−1²).
Therefore
en+1 ≈ A en en−1,
where the constant A = (1/2) f''(α)/f'(α).
This relation is called the error equation. Now, by the definition of the order of convergence, we expect
a relation of the following type:
en+1 = C enᵖ.
Lowering the index by one, we obtain en = C en−1ᵖ, i.e., en−1 = C^{−1/p} en^{1/p}.
Hence
C enᵖ = A en C^{−1/p} en^{1/p}
=⇒ enᵖ = A C^{−(1+1/p)} en^{1+1/p}.
Comparing the powers of en on both sides, we get
p = 1 + 1/p,
which gives two values of p; one is p = (1 + √5)/2 ≈ 1.618 and the other is negative (we neglect the
negative value, as the order of convergence is non-negative).
Therefore, the order of convergence of the secant method is about 1.618, less than 2 but better than linear.
4.2. Newton’s Method. Let f (x) = 0 be the given non-linear equation.
Let the tangent line at point (x0 , f (x0 )) on the curve y = f (x) intersect with the x-axis at (x1 , 0). The
equation of tangent is given by
y − f(x0) = f'(x0)(x − x0).
Here the number f'(x0) gives the slope of the tangent at x0. At x = x1,
0 − f(x0) = f'(x0)(x1 − x0)
x1 = x0 − f(x0)/f'(x0).
Here x0 is the initial approximation of the root.
This is called Newton's method, and successive iterations are given by
xn+1 = xn − f(xn)/f'(xn), n = 0, 1, . . . .
The method can be obtained directly from the secant method by taking limit xn−1 → xn . In the
limiting case the chord joining the points (xn−1 , f (xn−1 )) and (xn , f (xn )) becomes the tangent at
(xn , f (xn )).
In this case problem of finding the root of the equation is equivalent to finding the point of intersection
of the tangent to the curve y = f (x) at point (xn , f (xn )) with the x-axis.

Figure 5. Newton’s method


Example 12. Use Newton's Method to compute √2.
Sol. This number satisfies the equation f(x) = 0, where f(x) = x² − 2.
Since f'(x) = 2x, it follows that in Newton's Method we can obtain the next iterate from the previous
iterate xn by
xn+1 = xn − (xn² − 2)/(2xn) = xn/2 + 1/xn.
Starting with x0 = 1, we obtain
x1 = 1/2 + 1/1 = 1.5
x2 = 1.5/2 + 1/1.5 = 1.41666667
x3 = 1.41421569
x4 = 1.41421356
x5 = 1.41421356.
Since the fourth and fifth iterates agree to eight decimal places, we assume that 1.41421356 is a
correct solution of f(x) = 0, to at least eight decimal places.
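A minimal Python sketch of Newton's method reproducing this computation (our own illustration; names are arbitrary):

def newton(f, df, x0, n_iter):
    # Newton's method: x_{n+1} = x_n - f(x_n)/f'(x_n)
    x = x0
    for _ in range(n_iter):
        x = x - f(x) / df(x)
    return x

# f(x) = x^2 - 2, f'(x) = 2x, as above
print(newton(lambda x: x**2 - 2, lambda x: 2 * x, 1.0, 5))  # 1.4142135623...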

4.2.1. The Newton’s Method can go bad.


• Once the Newton’s Method catches scent of the root, it usually hunts it down with amazing
speed. But since the method is based on local information, namely f (xn ) and f 0 (xn ), the
Newton’s Method’s sense of smell is deficient.
• If the initial estimate is not close enough to the root, the Newton’s Method may not converge,
or may converge to the wrong root.
• Let f(x) be twice continuously differentiable on the closed finite interval [a, b], and suppose the following
conditions are satisfied:
(i) f(a) f(b) < 0.
(ii) f'(x) ≠ 0, ∀x ∈ [a, b].
(iii) Either f''(x) ≥ 0 or f''(x) ≤ 0, ∀x ∈ [a, b].
(iv) The tangent to the curve at either endpoint intersects the x-axis within the interval [a, b].
In other words, at the end points a, b,
|x − a| = |f(a)|/|f'(a)| < b − a,  |x − b| = |f(b)|/|f'(b)| < b − a.
Then the Newton’s method converges to the unique solution α of f (x) = 0 in [a, b] for any
choice of x0 ∈ [a, b].
Conditions (i) and (ii) guarantee that there is one and only one solution in [a, b]. Condition
(iii) states that the graph of f(x) is either concave from above or concave from below, and
furthermore, together with condition (ii), implies that f'(x) is monotone on [a, b].

Figure 6. An example where Newton’s method will not work.

The following example shows that choice of initial guess is very important for convergence.
Example 13. Use Newton's Method to find a non-zero solution of x = 2 sin x.
Sol. Let f (x) = x − 2 sin x.
Then f'(x) = 1 − 2 cos x. Since f(1)f(2) < 0, a root lies in (1, 2).
The Newton’s iterations are given by
xn+1 = xn − f(xn)/f'(xn) = xn − (xn − 2 sin xn)/(1 − 2 cos xn) = 2(sin xn − xn cos xn)/(1 − 2 cos xn); n ≥ 0.
Let x0 = 1.1. The next six estimates, to 3 decimal places, are:
x1 = 8.453, x2 = 5.256, x3 = 203.384, x4 = 118.019, x5 = −87.471, x6 = −203.637.
Therefore the iterations diverge.
Note that choosing x0 = π/3 ≈ 1.0472 leads to immediate disaster, since then 1 − 2 cos x0 = 0 and
therefore x1 does not exist. The trouble was caused by the choice of x0, as f'(x0) ≈ 0.
Let's see whether we can do better. Draw the curves y = x and y = 2 sin x. A quick sketch shows that
they meet a bit past π/2. If we take x0 = 1.5, the next five estimates are
x1 = 2.076558, x2 = 1.910507, x3 = 1.895622, x4 = 1.895494, x5 = 1.895494.
Figure 7. One more example of where Newton’s method will not work.

Example 14. Find, correct to 5 decimal places, the x-coordinate of the point on the curve y = ln x
which is closest to the origin. Use the Newton’s Method.
Sol. Let (x, ln x) be a general point on the curve, and let S(x) be the square of the distance from
(x, ln x) to the origin. Then
S(x) = x² + ln² x.
We want to minimize the distance. This is equivalent to minimizing the square of the distance. Now
the minimization process takes the usual route. Note that S(x) is only defined when x > 0. We have
S'(x) = 2x + (2 ln x)/x = (2/x)(x² + ln x).
Our problem thus comes down to solving the equation S'(x) = 0. We can use Newton's Method
directly on S'(x), but the calculations are more pleasant if we observe that S'(x) = 0 is equivalent to
x² + ln x = 0.
Let f(x) = x² + ln x. Then f'(x) = 2x + 1/x, and we get the recurrence relation
xk+1 = xk − (xk² + ln xk)/(2xk + 1/xk), k = 0, 1, · · ·
We need to find a suitable starting point x0 . Experimentation with a calculator suggests that we take
x0 = 0.65.
Then x1 = 0.6529181, and x2 = 0.65291864.
Since x1 agrees with x2 to 5 decimal places, we can perhaps decide that, to 5 places, the minimum
distance occurs at x = 0.65292.
4.3. Convergence Analysis.
Theorem 4.1. Let f ∈ C²[a, b]. If α is a simple root of f(x) = 0 and f'(α) ≠ 0, then Newton's method
generates a sequence {xn} converging quadratically to the root α for any initial approximation x0 near to
α.
Proof. The proof is based on analyzing Newton's method as the fixed point iteration scheme
xn+1 = g(xn) = xn − f(xn)/f'(xn), n ≥ 0
with
g(x) = x − f(x)/f'(x).
We first find an interval [α − δ, α + δ] such that g(x) ∈ [α − δ, α + δ] and for which |g'(x)| ≤ λ, λ ∈ (0, 1),
for all x ∈ (α − δ, α + δ).
Since f' is continuous and f'(α) ≠ 0, the continuous function f' is non-zero at α and hence remains
non-zero in a neighborhood of α.
Thus g is defined and continuous in a neighborhood of α. Also, in that neighborhood,
g'(x) = 1 − [f'(x)f'(x) − f(x)f''(x)]/[f'(x)]² = f(x)f''(x)/[f'(x)]².    (4.1)
Now since f(α) = 0, therefore
g'(α) = f(α)f''(α)/[f'(α)]² = 0.
Since g' is continuous with g'(α) = 0 and 0 < λ < 1, there exists a number δ > 0 such that
|g'(x)| ≤ λ, ∀x ∈ [α − δ, α + δ].
Now we will show that g maps [α − δ, α + δ] into [α − δ, α + δ].
If x ∈ [α − δ, α + δ], the Mean Value Theorem implies that for some number c between x and α,
|g(x) − α| = |g(x) − g(α)| = |g'(c)| |x − α| ≤ λ|x − α| < |x − α|.
It follows that if |x − α| < δ =⇒ |g(x) − α| < δ.
Hence, g maps [α − δ, α + δ] into [α − δ, α + δ].
All the hypotheses of the Fixed-Point Convergence Theorem (Contraction Mapping) are now satisfied,
so the sequence xn converges to the root α. Further, from Eq. (4.1),
g''(α) = f''(α)/f'(α) ≠ 0,
which proves that the convergence is of second order provided f''(α) ≠ 0.
Remark 4.1. Newton's method converges at least quadratically. If g''(α) = 0, then higher-order
convergence is expected.
Example 15. The function f (x) = sin x has a zero on the interval (3, 4), namely, x = π. Perform
three iterations of Newton’s method to approximate this zero, using x0 = 4. Determine the absolute
error in each of the computed approximations. What is the apparent order of convergence?
Sol. Consider f (x) = sin x. In the interval (3, 4), f has a zero α = π.
Also, f'(x) = cos x.
Newton's iterations are given by
xn+1 = xn − f(xn)/f'(xn), n ≥ 0.
With x0 = 4, we have
x1 = x0 − f(x0)/f'(x0) = 4 − sin 4/cos 4 = 2.8422,
x2 = x1 − f(x1)/f'(x1) = 2.8422 − sin 2.8422/cos 2.8422 = 3.1509,
x3 = x2 − f(x2)/f'(x2) = 3.1509 − sin 3.1509/cos 3.1509 = 3.1416.
The absolute errors are:
e0 = |x0 − α| = 0.8584,
e1 = |x1 − α| = 0.2994,
e2 = |x2 − α| = 0.0093,
e3 = |x3 − α| = 2.6876 × 10−7 .
If p is the order of convergence, then
e2/e1 = (e1/e0)^p.
The corresponding order(s) of convergence are
p = ln(e2/e1)/ln(e1/e0) = ln(0.0093/0.2994)/ln(0.2994/0.8584) = 3.296,
p = ln(e3/e2)/ln(e2/e1) = ln(2.6876 × 10⁻⁷/0.0093)/ln(0.0093/0.2994) = 3.010.
We obtain better than third-order convergence, which is better than the theoretical (second-order)
bound guarantees.
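The apparent order can be computed directly from the tabulated errors; a small Python sketch of our own, using the error values above:

import math

# absolute errors from Example 15 (Newton's method for sin x with x0 = 4)
e = [0.8584, 0.2994, 0.0093, 2.6876e-7]

# estimate the order p from successive errors, using e_{n+1} ≈ C e_n^p
for n in range(1, len(e) - 1):
    p = math.log(e[n + 1] / e[n]) / math.log(e[n] / e[n - 1])
    print(round(p, 3))  # 3.296, then 3.010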

4.4. Newton's method for multiple roots. Let α be a root of f(x) = 0 with multiplicity m. In
this case we can write
f(x) = (x − α)^m φ(x), with φ(α) ≠ 0.
Then
f(α) = f'(α) = · · · = f^{(m−1)}(α) = 0, f^{(m)}(α) ≠ 0.
Recall that we can regard Newton's method as a fixed point method:
xn+1 = g(xn), g(x) = x − f(x)/f'(x).
Then we substitute f(x) = (x − α)^m φ(x) to obtain
g(x) = x − (x − α)^m φ(x) / [m(x − α)^{m−1} φ(x) + (x − α)^m φ'(x)]
= x − (x − α) φ(x) / [mφ(x) + (x − α)φ'(x)].
Therefore we obtain
g'(α) = 1 − 1/m ≠ 0.
For m > 1, this is nonzero, and therefore Newton’s method is only linearly convergent.
There are ways of improving the speed of convergence of Newton’s method, creating a modified method
that is again quadratically convergent. In particular, consider the fixed point iteration formula
xn+1 = g(xn), g(x) = x − m f(x)/f'(x),
in which we assume to know the multiplicity m of the root α being sought. Then modifying the above
argument on the convergence of Newton’s method, we obtain
g'(α) = 1 − m · (1/m) = 0,
and the iteration method will be quadratically convergent. But most of the time we don’t know the
multiplicity.
One method of handling the problem of multiple roots of a function f is to define
µ(x) = f(x)/f'(x).
If α is a zero of f of multiplicity m with f(x) = (x − α)^m φ(x), then
µ(x) = (x − α)^m φ(x) / [m(x − α)^{m−1} φ(x) + (x − α)^m φ'(x)]
= (x − α) φ(x) / [mφ(x) + (x − α)φ'(x)]
also has a zero at α. However, φ(α) ≠ 0, so
φ(α) / [mφ(α) + (α − α)φ'(α)] = 1/m ≠ 0,
and α is a simple zero of µ(x). Newton’s method can then be applied to µ(x) to give
g(x) = x − µ(x)/µ'(x) = x − [f(x)/f'(x)] / ({[f'(x)]² − f(x)f''(x)}/[f'(x)]²),
which simplifies to
g(x) = x − f(x)f'(x) / ([f'(x)]² − f(x)f''(x)).
If g has the required continuity conditions, functional iteration applied to g will be quadratically con-
vergent regardless of the multiplicity of the zero of f. Theoretically, the only drawback to this method
is the additional calculation of f 00 (x) and the more laborious procedure of calculating the iterates. In
practice, however, multiple roots can cause serious round-off problems because the denominator of the
above expression consists of the difference of two numbers that are both close to 0.
Example 16. Let f(x) = e^x − x − 1. Show that f has a zero of multiplicity 2 at x = 0. Show that
Newton's method with x0 = 1 converges to this zero but not quadratically.
Sol. We have f(x) = e^x − x − 1, f'(x) = e^x − 1, and f''(x) = e^x.
Now f(0) = 1 − 0 − 1 = 0, f'(0) = 1 − 1 = 0, and f''(0) = 1. Therefore f has a zero of multiplicity 2
at x = 0.
Starting with x0 = 1, the iterations are given by
xn+1 = xn − f(xn)/f'(xn):
x1 = 0.58198, x2 = 0.31906, x3 = 0.16800, x4 = 0.08635, x5 = 0.04380, x6 = 0.02206.
By using the modified Newton's Method
xn+1 = xn − f(xn)f'(xn) / ([f'(xn)]² − f(xn)f''(xn)).
Starting with x0 = 1.0, we obtain
x1 = −0.023421, x2 = −0.0084527, x3 = −0.000011889.
We observe that the modified Newton's method converges faster to the root 0.
Example 17. The equation f(x) = x³ − 7x² + 16x − 12 = 0 has a double root at x = 2.0. Starting
with x0 = 1, find the root correct to three decimals with Newton's method and its modified version.
Sol. First we apply the simple Newton's method; successive iterations are given by
xn+1 = xn − (xn³ − 7xn² + 16xn − 12)/(3xn² − 14xn + 16), n = 0, 1, 2, . . .
Starting with x0 = 1.0, we obtain
x1 = 1.4, x2 = 1.652632, x3 = 1.806484, x4 = 1.89586
x5 = 1.945653, x6 = 1.972144, x7 = 1.985886, x8 = 1.992894
x9 = 1.996435, x10 = 1.998214, x11 = 1.999106, x12 = 1.999553.
The root correct to 3 decimal places is x12 = 2.000.
If we apply the modified Newton's method (with m = 2), then
xn+1 = xn − 2(xn³ − 7xn² + 16xn − 12)/(3xn² − 14xn + 16), n = 0, 1, 2, . . .
Starting with x0 = 1.0, we obtain
x1 = 1.8, x2 = 1.984615, x3 = 1.999884.
The root correct to 3 decimal places is 2.000, and in this case we need far fewer iterations to get the
desired accuracy.
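A sketch of the modified iteration in Python (our own code; it assumes the multiplicity m = 2 is known, as in this example):

def modified_newton(f, df, x0, m, n_iter):
    # modified Newton iteration x_{n+1} = x_n - m f(x_n)/f'(x_n)
    # for a root whose multiplicity m is known in advance
    x = x0
    for _ in range(n_iter):
        x = x - m * f(x) / df(x)
    return x

f = lambda x: x**3 - 7 * x**2 + 16 * x - 12   # double root at x = 2
df = lambda x: 3 * x**2 - 14 * x + 16
print(modified_newton(f, df, 1.0, 2, 3))      # ≈ 1.999884, as in Example 17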
We end this chapter by solving an example with all three methods studied previously.
Example 18. The function f(x) = tan πx − 6 has a zero at (arctan 6)/π ≈ 0.447431543. Use eight
iterations of each of the following methods to approximate this root. Which method is most successful
and why?
a. Bisection method in interval [0,1].
b. Secant method with x0 = 0 and x1 = 0.48.
c. Newton’s method with x0 = 0.4.
Sol. It is important to note that f has several roots on the interval [0, 5] (to see this, make a plot).
a. Since f has several roots in [0, 5], the bisection method may converge to a different root in that interval.
Therefore, it is a better idea to choose the interval [0, 1]. In this case we have the following results:
after 8 iterations the answer is 0.447265625.

n a b c
0 0 1 0.5
1 0 0.5 0.25
2 0.25 0.5 0.375
3 0.375 0.5 0.4375
4 0.4375 0.5 0.46875
5 0.4375 0.46875 0.453125
6 0.4375 0.46875 0.4453125
7 0.4375 0.4453125 0.44921875
8 0.4453125 0.44921875 0.447265625

b. The Secant method diverges for x0 = 0 and x1 = 0.48.


The Secant method converges for some other choices of initial guesses, for example, x0 = 0.4 and
x1 = 0.48. A few iterations are given:
x2 = 4.1824045, x3 = 4.29444232, x4 = 4.57230361, x5 = 0.444112051,
x6 = 0.446817663, x7 = 0.447469928, x8 = 0.447431099, x9 = 0.447431543.
c. We have
f(x) = tan(πx) − 6, and f'(x) = π/cos²(πx).
Since the function f has several roots, some initial guesses may lead to convergence to a different root.
Indeed, for x0 = 0, Newton’s method converges to a different root. For Newton’s method, therefore, it
is suggested that we use x0 = 0.4 in order to converge to given root.
Starting with x0 = 0.4, we obtain
x1 = 0.488826408, x2 = 0.480014377, x3 = 0.467600335, x4 = 0.455142852,
x5 = 0.448555216, x6 = 0.447455353, x7 = 0.447431554, x8 = 0.447431543.
We see that for these particular examples and initial guesses, the Newton’s method and the Se-
cant method give very similar convergence behaviors. The Newton’s method converges slightly faster
though. The bisection method converges much slower than the two other methods, as expected.

Exercises
(1) Use the bisection method to find solutions accurate to within 10−3 for the following problems.
a. x − 2⁻ˣ = 0 for 0 ≤ x ≤ 1   b. e^x − x² + 3x − 2 = 0 for 0 ≤ x ≤ 1
c. x + 1 − 2 sin(πx) = 0 for 0 ≤ x ≤ 0.5 and 0.5 ≤ x ≤ 1.
(2) Find an approximation to ∛25 correct to within 10⁻³ using the bisection algorithm.
(3) Find a bound for the number of iterations needed to achieve an approximation by bisection
method with accuracy 10−2 to the solution of x3 − x − 1 = 0 lying in the interval [1, 2]. Find
an approximation to the root with this degree of accuracy.
(4) Sketch the graphs of y = x and y = 2 sin x. Use the bisection method to find an approximation
to within 10−3 to the first positive value of x with x = 2 sin x.
(5) Let f (x) = (x + 2)(x + 1)2 x(x − 1)3 (x − 2). To which zero of f does the bisection method
converge when applied on the following intervals?
a. [-1.5, 2.5] b. [-0.5, 2.4] c. [-0.5, 3] d. [-3, -0.5].
(6) The function defined by f (x) = sin(πx) has zeros at every integer. Show that when −1 < a < 0
and 2 < b < 3, the bisection method converges to
a. 0, if a + b < 2 b. 2, if a + b > 2 c. 1, if a + b = 2.
(7) For each of the following equations, use the given interval or determine an interval [a, b] on
which fixed-point iteration will converge. Estimate the number of iterations necessary to obtain
approximations accurate to within 10−2 , and perform the calculations.
a. 2 + sin x − x = 0, use [2, 3]   b. x³ − 2x − 5 = 0, use [2, 3]   c. 3x² − e^x = 0   d. x − cos x = 0.
(8) Use the fixed-point iteration method to find smallest and second smallest positive roots of the
equation tan x = 4x, correct to 4 decimal places.
(9) Show that g(x) = π + 0.5 sin(x/2) has a unique fixed point on [0, 2π]. Use fixed-point iteration
to find an approximation to the fixed point that is accurate to within 10−2 . Also estimate the
number of iterations required to achieve 10−2 accuracy, and compare this theoretical estimate
to the number actually needed.
(10) Find all the zeros of f(x) = x² + 10 cos x by using the fixed-point iteration method for an
appropriate iteration function g. Find the zeros accurate to within 10⁻².
(11) What is the order of convergence of the iteration
xn+1 = xn(xn² + 3a)/(3xn² + a)
as it converges to the fixed point α = √a?
(12) Let A be a given positive constant and g(x) = 2x − Ax².
a. Show that if fixed-point iteration converges to a nonzero limit, then the limit is α = 1/A,
so the inverse of a number can be found using only multiplications and subtractions.
b. Find an interval about 1/A for which fixed-point iteration converges, provided x0 is in that
interval.
(13) Consider the root-finding problem f(x) = 0 with root α, with f'(x) ≠ 0. Convert it to the
fixed-point problem
x = x + cf (x) = g(x)
with c a nonzero constant. How should c be chosen to ensure rapid convergence of
xn+1 = xn + cf (xn )
to α (provided that x0 is chosen sufficiently close to α)? Apply your way of choosing c to the
root-finding problem x³ − 5 = 0.
(14) Show that if A is any positive number, then the sequence defined by
xn = (1/2)xn−1 + A/(2xn−1), for n ≥ 1,
converges to √A whenever x0 > 0. What happens if x0 < 0?
(15) Use secant method to find solutions accurate to within 10−3 for the following problems.
a. −x³ − cos x = 0 with x0 = −1 and x1 = 0.
b. x − cos x = 0, x ∈ [0, π/2].
c. e^x + 2⁻ˣ + 2 cos x − 6 = 0, x ∈ [1, 2].
(16) Use Newton’s method to find solutions accurate to within 10−3 to the following problems.
a. x − e⁻ˣ = 0 for 0 ≤ x ≤ 1.
b. 2x cos 2x − (x − 2)² = 0 for 2 ≤ x ≤ 3 and 3 ≤ x ≤ 4.
(17) Use Newton's method to approximate the positive root of 2 cos x = x⁴ correct to six decimal
places.
(18) A calculator is defective: it can only add, subtract, and multiply. Use the equation 1/x = 1.732,
the Newton’s Method, and the defective calculator to find 1/1.732 correct to 4 decimal places.
(19) The fourth degree polynomial f(x) = 230x⁴ + 18x³ + 9x² − 221x − 9 = 0 has two real zeros, one
in [−1, 0] and other in [0, 1]. Attempt to approximate these zeros to within 10−6 using Secant
and Newton’s methods.
(20) Use Newton's method to solve the equation
1/2 + (1/4)x² − x sin x − (1/2) cos 2x = 0
with x0 = π/2. Iterate using Newton's method until an accuracy of 10⁻⁵ is obtained. Explain
why the result seems unusual for Newton’s method. Also, solve the equation with x0 = 5π and
x0 = 10π.
(21) Find all positive roots of the equation
10 ∫₀ˣ e^{−t²} dt = 1
with six correct decimals using Newton's method.


(22) a. Apply Newton's method to the function
f(x) = √x for x ≥ 0, f(x) = −√(−x) for x < 0,
with the root α = 0. What is the behavior of the iterates? Do they converge, and if so, at what
rate?
b. Do the same but with
f(x) = ∛(x²) for x ≥ 0, f(x) = −∛(x²) for x < 0.

(23) Apply Newton's method with x0 = 0.8 to the equation f(x) = x³ − x² − x + 1 = 0, and
verify that the convergence is only of first-order. Further show that root α = 1 has multiplicity
2 and then apply the modified Newton’s method with m = 2 and verify that the convergence
is of second-order.
(24) Use Newton's method to approximate, to within 10⁻⁴, the value of x that produces the point
on the graph of y = x² that is closest to (1, 0).
(25) Use Newton's method and the modified Newton's method to find a solution of
cos(x + √2) + x(x/2 + √2) = 0, for −2 ≤ x ≤ −1,
accurate to within 10⁻³.
(26) The circle below has radius 1, and the longer circular arc joining A and B is twice as long as
the chord AB. Find the length of the chord AB, correct to four decimal places. Use Newton’s
method.

(27) A particle starts at rest on a smooth inclined plane whose angle θ is changing at a constant
rate
dθ/dt = ω < 0.
At the end of t seconds, the position of the object is given by
x(t) = −(g/(2ω²)) [(e^{ωt} − e^{−ωt})/2 − sin ωt].
Suppose the particle has moved 1.7 ft in 1 s. Find, to within 10−5 , the rate ω at which θ
changes. Assume that g = 32.17 ft/s².
(28) An object falling vertically through the air is subjected to viscous resistance as well as to the
force of gravity. Assume that an object with mass m is dropped from a height s0 and that the
height of the object after t seconds is
s(t) = s0 − (mg/k)t + (m²g/k²)(1 − e^{−kt/m}),
where g = 32.17 ft/s² and k represents the coefficient of air resistance in lb-s/ft. Suppose
s0 = 300 ft, m = 0.25 lb, and k = 0.1 lb-s/ft. Find, to within 0.01 s, the time it takes this
quarter-pounder to hit the ground.
(29) It costs a firm C(q) dollars to produce q grams per day of a certain chemical, where
C(q) = 1000 + 2q + 3q^{2/3}.
The firm can sell any amount of the chemical at $4 a gram. Find the break-even point of the
firm, that is, how much it should produce per day in order to have neither a profit nor a loss.
Use the Newton’s method and give the answer to the nearest gram.

Appendix A. Algorithms
Algorithm (Bisection):
To determine a root of f (x) = 0 that is accurate within a specified tolerance value ε, given values a
and b such that f (a) f (b) < 0.
Repeat:
Define c = (a + b)/2.
If f(a) f(c) < 0, then set b = c; otherwise set a = c.
Until |a − b| ≤ ε (tolerance value).
Print root as c.

Algorithm (Fixed-point):
To find the fixed point of g in an interval [a, b], given the equation x = g(x) with an initial guess
x0 ∈ [a, b]
1. n = 1.
2. xn = g(xn−1 ).
3. If |xn − xn−1| < ε or |xn − xn−1|/|xn| < ε, then go to step 5.
4. n → n + 1; go to 2.
5. End of Procedure.

Algorithm (Secant):
1. Give inputs and take two initial guesses x0 and x1 .
2. Start iterations
x2 = x1 − [(x1 − x0)/(f1 − f0)] f1.
3. If |xn − xn−1| < ε or |xn − xn−1|/|xn| < ε,
then stop and print the root.
then stop and print the root.
4. Repeat the iterations (step 2). Also check if the number of iterations has exceeded the maximum
number of iterations.

Algorithm (Newton’s method):


Let f : R → R be a differentiable function. The following algorithm computes an approximate solution
x of the equation f (x) = 0.
1. Choose an initial guess x0.
2. For n = 0, 1, 2, . . . do:
if |f(xn)| is sufficiently small
then x* = xn
return x*
end
3. xn+1 = xn − f(xn)/f'(xn)
If |xn+1 − xn| is sufficiently small
then x* = xn+1
return x*
end
4. end (for main loop)

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley & Sons, 3rd
edition, 2004.
CHAPTER 3 (4 LECTURES)
DIRECT METHODS FOR SOLVING LINEAR SYSTEMS

1. Introduction
Systems of simultaneous linear equations are associated with many problems in engineering and
science, as well as with applications to the social sciences and the quantitative study of business and
economic problems. These problems occur in a wide variety of disciplines, directly in real world problems
as well as in the solution process for other problems.
The principal objective of this Chapter is to discuss the numerical aspects of solving linear systems of
equations having the form
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     (1.1)
an1 x1 + an2 x2 + · · · + ann xn = bn.
This is a linear system of n equations in n unknowns x1, x2, . . . , xn. This system can simply be written
in the matrix equation form A x = b:
[ a11 a12 · · · a1n ]   [ x1 ]   [ b1 ]
[ a21 a22 · · · a2n ] × [ x2 ] = [ b2 ]     (1.2)
[  ⋮   ⋮   ⋱    ⋮  ]   [  ⋮ ]   [  ⋮ ]
[ an1 an2 · · · ann ]   [ xn ]   [ bn ]
This equation has a unique solution x = A⁻¹b when the coefficient matrix A is non-singular. Unless
otherwise stated, we shall assume that this is the case under discussion. If A−1 is already available,
then x = A−1 b provides a good method of computing the solution x.
If A−1 is not available, then in general A−1 should not be computed solely for the purpose of obtaining
x. More efficient numerical procedures will be developed in this chapter. We study broadly two
categories Direct and Iterative methods. We start with direct method to solve the linear system in
this chapter.

2. Gaussian Elimination
Direct methods, which are technique that give a solution in a fixed number of steps, subject only to
round-off errors, are considered in this chapter. Gaussian elimination is the principal tool in the direct
solution of system (1.2). The method is named after Carl Friedrich Gauss (1777-1855).
To solve larger system of linear equation, we consider a following n × n system
a11 x1 + a12 x2 + a13 x3 + · · · + a1n xn = b1 (E1 )
a21 x1 + a22 x2 + a23 x3 + · · · + a2n xn = b2 (E2 )
............................................................
ai1 x1 + ai2 x2 + ai3 x3 + · · · + ain xn = bi (Ei )
............................................................
an1 x1 + an2 x2 + an3 x3 + · · · + ann xn = bn (En).
Here Ei denotes the i-th row of the coefficient matrix A, i = 1, 2, · · · , n.
Let a11 ≠ 0 and eliminate x1 from E2, E3, · · · , En.
Define multipliers mi1 = ai1/a11, for each i = 2, 3, · · · , n.
We replace each Ei by Ei − mi1 E1 and each bi by bi − mi1 b1, for i = 2, 3, · · · , n.
This will eliminate x1 from these rows.
We repeat this procedure sequentially: for j = 2, 3, · · · , n − 1, we perform the operations
Ei −→ Ei − (aij/ajj) Ej, for each i = j + 1, j + 2, · · · , n,
bi −→ bi − (aij/ajj) bj, for each i = j + 1, j + 2, · · · , n,
provided ajj ≠ 0. This eliminates xj from each row below the j-th row. The resulting matrix is triangular and has
the augmented-matrix form
[ a11 a12 · · · a1n | b1 ]
[  0  a22 · · · a2n | b2 ]
[  ⋮   ⋮   ⋱    ⋮  |  ⋮ ]
[  0   0  · · · ann | bn ]
Solving the n-th equation for xn gives
xn = bn/ann.
Solving the (n − 1)st equation for xn−1 and using the known value of xn yields (back substitution)
xn−1 = (bn−1 − an−1,n xn)/an−1,n−1.
Continuing this process, we obtain
xi = (bi − ai,i+1 xi+1 − ai,i+2 xi+2 − · · · − ain xn)/aii = (bi − Σ_{j=i+1}^{n} aij xj)/aii,
for each i = n − 1, n − 2, · · · , 2, 1.

Partial Pivoting: In the elimination process we divide by aii at each stage, assuming that aii ≠ 0.
These elements are known as pivot elements. If at any stage of elimination one of the pivots becomes
small (or zero), then we bring another element into the pivot position by interchanging rows. This process
is called Gauss elimination with partial pivoting.
Example 1. Solve the system of equations using Gauss elimination. This system has exact solution
(known from other sources!) x1 = 2.6, x2 = −3.8, x3 = −5.0.

6x1 + 2x2 + 2x3 = −2
2x1 + (2/3)x2 + (1/3)x3 = 1
x1 + 2x2 − x3 = 0.
Sol. Let us use a floating-point representation with 4 digits, in which all operations are rounded. We
label the rows as E1, E2 and E3. The augmented matrix (coefficient matrix with right hand side) is
given by
given by
 
6.000 2.000 2.000 −2.000
2.000 0.6667 0.3333 1.000 
1.000 2.000 −1.000 0.0
Multipliers are m21 = 2/6 = 0.3333 and m31 = 1/6 = 0.1667.
We write E2 as E2 − 0.3333E1 and E3 as E3 − 0.1667E1 .
 
6.000 2.000 2.000 −2.000
 0.0 0.0001000 −0.3333 1.667 
0.0 1.667 −1.333 0.3334
Multiplier is m32 = 1.667/0.0001 = 16670, and we write E3 as E3 − 16670E2.
 
6.000 2.000 2.000 −2.000
 0.0 0.0001000 −0.3333 1.667 
0.0 0.0 5555 −27790
Using back substitution, we obtain
x3 = −5.003
x2 = 0.0
x1 = 1.335.
We observe that the computed solution is not compatible with the exact solution.
The difficulty is in a22. This coefficient is very small (almost zero), which means that it carried
essentially infinite relative error, and this error propagated through every computation involving it.
To avoid this, we interchange the second and third rows and then continue the elimination.
In this case (after interchanging) the multiplier is m32 = 0.00005999, and we write the new E3 as
E3 − 0.00005999E2.
 
6.000 2.000 2.000 −2.000
 0.0 1.667 −1.337 0.3334  .
0.0 0.0 −0.3332 1.667
Using back substitution, we obtain
x3 = −5.003
x2 = −3.801
x1 = 2.602.
We see that after partial pivoting, we get the desired solution.
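For illustration, here is a compact Python sketch of Gaussian elimination with partial pivoting and back substitution (our own code, not the notes' algorithm; in full double precision it solves the system of Example 1 accurately):

def gauss_solve(A, b):
    # Gaussian elimination with partial pivoting, then back substitution
    n = len(A)
    A = [row[:] for row in A]
    b = b[:]
    for j in range(n - 1):
        # bring the entry of largest magnitude in column j to the pivot row
        p = max(range(j, n), key=lambda i: abs(A[i][j]))
        A[j], A[p] = A[p], A[j]
        b[j], b[p] = b[p], b[j]
        for i in range(j + 1, n):
            m = A[i][j] / A[j][j]
            for k in range(j, n):
                A[i][k] -= m * A[j][k]
            b[i] -= m * b[j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][k] * x[k] for k in range(i + 1, n))) / A[i][i]
    return x

# the system of Example 1; exact solution (2.6, -3.8, -5.0)
print(gauss_solve([[6, 2, 2], [2, 2/3, 1/3], [1, 2, -1]], [-2, 1, 0]))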
Example 2. Given the linear system
x1 − x2 + αx3 = −2,
−x1 + 2x2 − αx3 = 3,
αx1 + x2 + x3 = 2.
a. Find value(s) of α for which the system has no solutions.
b. Find value(s) of α for which the system has an infinite number of solutions.
c. Assuming a unique solution exists for a given α, find the solution.
Sol. Augmented matrix is given by
 
1 −1 α −2
−1 2 −α 3 
α 1 1 2
Multipliers are m21 = −1 and m31 = α. Performing E2 → E2 + E1 and E3 → E3 − αE1, we obtain
[ 1  −1    α    | −2     ]
[ 0   1    0    |  1     ]
[ 0  1+α  1−α²  | 2(1+α) ]
Multiplier is m32 = 1 + α, and we perform E3 → E3 − m32 E2:
[ 1  −1    α   | −2  ]
[ 0   1    0   |  1  ]
[ 0   0  1−α²  | 1+α ]
a. If α = 1, then the last row of the reduced augmented matrix says that 0 · x3 = 2, so the system has no
solution.
b. If α = −1, then we see that the system has infinitely many solutions.
c. If α ≠ ±1, then the system has a unique solution:
x3 = 1/(1 − α), x2 = 1, x1 = −1/(1 − α).
Remark 2.1. Unique solution, no solution, or infinitely many solutions.
(1) If we have a leading one in every column, then the system has a unique solution.
(2) If we have a row of zeros equal to a non-zero number on the right side, then the system has no
solution.
(3) If we do not have a leading one in every column in a homogeneous system (i.e., a system where
all the equations equal zero), or we have a row of zeros, then the system has infinitely many solutions.
Example 3. Solve the system by Gauss elimination
4x1 + 3x2 + 2x3 + x4 = 1
3x1 + 4x2 + 3x3 + 2x4 = 1
2x1 + 3x2 + 4x3 + 3x4 = −1
x1 + 2x2 + 3x3 + 4x4 = −1.
Sol. We write augmented matrix and solve the system
 
4 3 2 1 1
3 4 3 2 1 
 
2 3 4 3 −1
1 2 3 4 −1
Multipliers are m21 = 3/4, m31 = 1/2, and m41 = 1/4.
Replace E2 with E2 − m21 E1 , E3 with E3 − m31 E1 and E4 with E4 − m41 E1 .
 
4 3 2 1 1
0 7/4 3/2 5/4 1/4 
 
0 3/2 3 5/2 −3/2
0 5/4 5/2 15/4 −5/4
Multipliers are m32 = 6/7 and m42 = 5/7.
Replace E3 with E3 − m32 E2 and E4 with E4 − m42 E2 , we obtain
 
4 3 2 1 1
0 7/4 3/2 5/4 1/4 
 
0 0 12/7 10/7 −12/7
0 0 10/7 20/7 −10/7
Multiplier is m43 = 5/6, and we replace E4 with E4 − m43 E3.
 
4 3 2 1 1
0 7/4 3/2 5/4 1/4 
 
0 0 12/7 10/7 −12/7
0 0 0 5/3 0
Using back substitution successively for x4 , x3 , x2 , x1 , we obtain x4 = 0, x3 = −1, x2 = 1, x1 = 0.

Complete Pivoting: In the first stage of elimination, we search the largest element in magnitude
from the entire matrix and bring it at the position of first pivot. We repeat the same process at every
step of elimination. This process require interchange of both rows and columns.

Scaled Partial Pivoting: In this approach, the algorithm select the largest relative entries as the
pivot elements at each stage of elimination. At the beginning, a scale factor must be computed for
each equation in the system. We define
si = max_{1≤j≤n} |aij| (1 ≤ i ≤ n).

These numbers are recorded in the scale vector s = [s1, s2, · · · , sn]. Note that the scale vector does not
change throughout the procedure.
In starting the forward elimination process, we do not arbitrarily use the first equation as the pivot
equation. Instead, we use the equation for which the ratio |ai1|/si is greatest. We repeat the process
at each later stage with the same scale factors.
Example 4. Solve the system
2.11x1 − 4.21x2 + 0.921x3 = 2.01
4.01x1 + 10.2x2 − 1.12x3 = −3.09
1.09x1 + 0.987x2 + 0.832x3 = 4.21
by using scaled partial pivoting.
Sol. The augmented matrix is
 
2.11 −4.21 0.921 2.01
4.01 10.2 −1.12 −3.09
1.09 0.987 0.832 4.21.
The scale factors are s1 = 4.21, s2 = 10.2, & s3 = 1.09. We need to pick the largest (2.11/4.21 =
0.501, 4.01/10.2 = 0.393, 1.09/1.09 = 1), which is the third entry, and interchange row 1 and row 3 and
interchange s1 and s3 to get
 
1.09 0.987 0.832 4.21
4.01 10.2 −1.12 −3.09
2.11 −4.21 0.921 2.01.
Performing E2 → E2 − 3.68E1 , E3 → E3 − 1.94E1 , we obtain
 
1.09 0.987 0.832 4.21
 0 6.57 −4.18 −18.6 
0 −6.12 −0.689 −6.16.
Now comparing (6.57/10.2 = 0.6444, 6.12/4.21 = 1.45), the second ratio is largest so we need to
interchange row 2 and row 3 and interchange scale factor accordingly.
 
1.09 0.987 0.832 4.21
 0 −6.12 −0.689 −6.16 
0 6.57 −4.18 −18.6.
Performing E3 → E3 + 1.07E2 , we get
 
1.09 0.987 0.832 4.21
 0 −6.12 −0.689 −6.16 
0 0 −4.92 −25.2.
Backward substitution gives x3 = 5.12, x2 = 0.43, x1 = −0.436.
Example 5. Solve the system
3x1 − 13x2 + 9x3 + 3x4 = −19
−6x1 + 4x2 + x3 − 18x4 = −34
6x1 − 2x2 + 2x3 + 4x4 = 16
12x1 − 8x2 + 6x3 + 10x4 = 26
by hand using scaled partial pivoting. Justify all row interchanges and write out the transformed matrix
after you finish working on each column.
Sol. The augmented matrix is
 
3 −13 9 3 −19
−6 4 1 −18 −34
 
6 −2 2 4 16 
12 −8 6 10 26
and the scale factors are s1 = 13, s2 = 18, s3 = 6, & s4 = 12. We need to pick the largest (3/13, 6/18, 6/6, 12/12),
which is the third entry, and interchange row 1 and row 3 and interchange s1 and s3 to get
 
6 −2 2 4 16
−6 4 1 −18 −34
 
 3 −13 9 3 −19
12 −8 6 10 26
with s1 = 6, s2 = 18, s3 = 13, s4 = 12. Performing E2 → E2 − (−6/6)E1 , E3 → E3 − (3/6)E1 , and
E4 → E4 − (12/6)E1 , we obtain
 
6 −2 2 4 16
0
 2 3 −14 −18 
0 −12 8 1 −27
0 −4 2 2 −6.
Comparing (|a22 |/s2 = 2/18, |a32 |/s3 = 12/13, |a42 |/s4 = 4/12), the largest is the third entry so we
need to interchange row 2 and row 3 and interchange s2 and s3 to get
 
6 −2 2 4 16
0 −12 8 1 −27
 
0 2 3 −14 −18
0 −4 2 2 −6
with s1 = 6, s2 = 13, s3 = 18, s4 = 12. Performing E3 → E3 − (2/12)E2 and E4 → E4 − (−4/12)E2 ,
we get
 
6 −2 2 4 16
0 −12 8 1 −27 
 
0 0 13/3 −83/6 −45/2
0 0 −2/3 5/3 3
Comparing (|a33 |/s3 = (13/3)/18, |a43 |/s4 = (2/3)/12), the largest is the first entry so we do not
interchange rows. Performing E4 → E4 − (−2/13)E3 , we get the final reduced matrix
 
6 −2 2 4 16
0 −12 8 1 −27 
 
0 0 13/3 −83/6 −45/2
0 0 0 −6/13 −6/13
Backward substitution gives x1 = 3, x2 = 1, x3 = −2, x4 = 1.
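The pivot-row selection rule of scaled partial pivoting can be sketched in Python as follows (our own illustration, applied to the matrix of Example 5; names are arbitrary):

def scaled_pivot_row(A, s, j, rows):
    # row (among the candidates) with the largest relative entry |a_ij|/s_i
    return max(rows, key=lambda i: abs(A[i][j]) / s[i])

A = [[3, -13, 9, 3], [-6, 4, 1, -18], [6, -2, 2, 4], [12, -8, 6, 10]]
s = [max(abs(a) for a in row) for row in A]   # scale factors 13, 18, 6, 12
print(scaled_pivot_row(A, s, 0, range(4)))    # 2: the third equation becomes the first pivot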
Example 6. Solve this system of linear equations:
0.0001x + y = 1
x+y =2
using no pivoting, partial pivoting, and scaled partial pivoting. Carry at most five significant digits
of precision (rounding) to see how finite precision computations and roundoff errors can affect the
calculations.
Sol. By direct substitution, it is easy to verify that the true solution is x = 1.0001 and y = 0.99990 to
five significant digits.
For no pivoting, the first equation in the original system is the pivot equation, and the multiplier is
1/0.0001 = 10000. The new system of equations is
0.0001x + y = 1
9999y = 9998
We obtain y = 9998/9999 ≈ 0.99990 and x = 1. Notice that we have lost the last significant digit in
the correct value of x.
We repeat the solution process using partial pivoting for the original system. We see that the second
entry is larger, so the second equation is used as the pivot equation. We can interchange the two
equations, obtaining

x+y =2
0.0001x + y = 1

which gives y = 0.99980/0.99990 ≈ 0.99990 and x = 2 − y = 2 − 0.99990 = 1.0001.


Both computed values of x and y are correct to five significant digits.
We repeat the solution process using scaled partial pivoting for the original system. Since the scaling
constants are s = (1, 1) and the ratios for determining the pivot equation are (0.0001/1, 1/1), the
second equation is now the pivot equation. We do not actually interchange the equations and use
the second equation as the first pivot equation. The rest of the calculations are as above for partial
pivoting. The computed values of x and y are correct to five significant digits.

2.1. Operation Counts. We count the number of operations required to solve the system Ax = b.
Both the amount of time required to complete the calculations and the subsequent round-off error
depend on the number of floating-point arithmetic operations needed to solve a routine problem.
In general, the amount of time required to perform a multiplication or division on a computer is
approximately the same and is considerably greater than that required to perform an addition or
subtraction. The actual differences in execution time, however, depend on the particular computing
system.
To demonstrate the counting operations for a given method, we will count the operations required to
solve a typical linear system of n equations in n unknowns using Gauss elimination Algorithm. We will
keep the count of the additions/subtractions separate from the count of the multiplications/divisions
because of the time differential.
The first step is to calculate the multipliers. Then the replacement of the equation Ei by (Ei − mij Ej) requires
that mij be multiplied by each term in Ej and then each term of the resulting equation be subtracted
from the corresponding term in Ei. The following table states the operation counts for going from
A to U at each step 1, 2, · · · , n − 1.

Step     Divisions     Multiplications     Additions/Subtractions
1        n − 1         (n − 1)²            (n − 1)²
2        n − 2         (n − 2)²            (n − 2)²
⋮        ⋮             ⋮                   ⋮
n − 2    2             4                   4
n − 1    1             1                   1
Total    n(n−1)/2      n(n−1)(2n−1)/6      n(n−1)(2n−1)/6
Therefore the total number of additions/subtractions from A to U is n(n−1)(2n−1)/6.
The total number of multiplications/divisions is n(n−1)/2 + n(n−1)(2n−1)/6 = n(n²−1)/3.
Now we count the operations for the right hand side vector b. We have:
Total number of additions/subtractions: (n − 1) + (n − 2) + · · · + 2 + 1 = n(n−1)/2.
Total number of multiplications/divisions: (n − 1) + (n − 2) + · · · + 2 + 1 = n(n−1)/2.
Lastly we count the additions/subtractions and multiplications/divisions for finding the solution by
the back-substitution method. Recall that
xn = bn/ann.
For each i = n − 1, n − 2, · · · , 2, 1, we have
xi = (bi − Σ_{j=i+1}^{n} aij xj)/aii.
Therefore the total number of additions/subtractions is 0 + 1 + · · · + (n − 1) = n(n−1)/2, and the
total number of multiplications/divisions is 1 + 2 + · · · + n = n(n+1)/2.
Thus the total number of operations to obtain the solution of a system of n linear equations in n
variables using Gaussian elimination is:
Total number of additions/subtractions: n(n−1)(2n+5)/6.
Total number of multiplications/divisions: n(n² + 3n − 1)/3.
For large n, the total number of multiplications and divisions is approximately n3 /3, as is the total
number of additions and subtractions. Thus the amount of computation and the time required increases
with n in proportion to n3 , as shown in Table.
n      Multiplications/Divisions     Additions/Subtractions
3      17                            11
10     430                           375
50     44,150                        42,875
100    343,300                       338,250
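These closed-form totals are easy to check with a short Python sketch (our own; it reproduces the table above):

# closed-form totals derived above
def mult_div(n):
    return n * (n**2 + 3 * n - 1) // 3

def add_sub(n):
    return n * (n - 1) * (2 * n + 5) // 6

for n in (3, 10, 50, 100):
    print(n, mult_div(n), add_sub(n))
# reproduces the table: (3, 17, 11), (10, 430, 375), (50, 44150, 42875), (100, 343300, 338250)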

3. The LU Factorization:
When we use matrix multiplication, another interpretation can be given to Gauss elimination: the
matrix A can be factored into the product of two triangular matrices.
Let Ax = b be the system to be solved, where A is the n × n coefficient matrix. The linear system can be reduced
to the upper triangular system Ux = g with
U =
[ u11 u12 · · · u1n ]
[  0  u22 · · · u2n ]
[  ⋮   ⋮   ⋱    ⋮  ]
[  0   0  · · · unn ]
Here the uij are the entries of the reduced (eliminated) coefficient matrix. Introduce an auxiliary lower
triangular matrix L based on the multipliers mij as follows:
L =
[  1    0    0   · · ·      0 ]
[ m21   1    0   · · ·      0 ]
[ m31  m32   1   · · ·      0 ]
[  ⋮    ⋮         ⋱         ⋮ ]
[ mn1  mn2  · · · mn,n−1    1 ]
Theorem 3.1. Let A be a non-singular matrix and let L and U be defined as above. If U is produced
without pivoting then
LU = A.
This is called LU factorization of A.
We can use Gaussian elimination to solve a system by LU decomposition. Suppose that A has been
factored into the triangular form A = LU , where L is lower triangular and U is upper triangular. Then
we can solve for x more easily by using a two-step process.
First we let y = U x and solve the lower triangular system Ly = b for y. Once y is known, the upper
triangular system Ux = y provides the solution x. One can check that the total number of operations
is the same as for Gauss elimination.
Example 7. We require to solve the following system of linear equations using LU decomposition.
x1 + 2x2 + 4x3 = 3
3x1 + 8x2 + 14x3 = 13
2x1 + 6x2 + 13x3 = 4.
(a) Find the matrices L and U using Gauss elimination.
(b) Using those values of L and U , solve the system of equations.
Sol. We first apply the Gaussian elimination on the matrix A and collect the multipliers m21 , m31 ,
and m32 .
We have
 
1 2 4
A = 3 8 14
2 6 13
Multipliers are m21 = 3, m31 = 2.
E2 → E2 − 3E1 and E3 → E3 − 2E1 .  
1 2 4
∼ 0 2 2
0 2 5
Multiplier is m32 = 2/2 = 1 and we perform E3 → E3 − E2 .
 
1 2 4
∼ 0 2 2
0 0 3
We observe that m21 = 3, m31 = 2, and m32 = 1. Therefore,
    
1 2 4 1 0 0 1 2 4
A = 3 8 14 = 3 1 0 0 2 2 = LU
2 6 13 2 1 1 0 0 3
Therefore,
Ax = b =⇒ LU x = b.
Assuming U x = y, we obtain,
Ly = b
i.e.     
1 0 0 y1 3
3 1 0 y2  = 13 .
2 1 1 y3 4
Using forward substitution, we obtain y1 = 3, y2 = 4, and y3 = −6. Now
    
1 2 4 x1 3
U x = y =⇒ 0 2 2 x2  =  4  .
0 0 3 x3 −6
Now, using the backward substitution process, we obtain the final solution as x3 = −2, x2 = 4, and
x1 = 3.
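The two-step solve can be sketched in Python as follows (our own illustration, using the L and U of Example 7):

def lu_solve(L, U, b):
    # forward substitution for Ly = b (l_ii = 1), then back substitution for Ux = y
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][k] * y[k] for k in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))) / U[i][i]
    return x

# the factors computed in Example 7
L = [[1, 0, 0], [3, 1, 0], [2, 1, 1]]
U = [[1, 2, 4], [0, 2, 2], [0, 0, 3]]
print(lu_solve(L, U, [3, 13, 4]))  # [3.0, 4.0, -2.0]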
Example 8. (a) Determine the LU factorization for matrix A in the linear system AX = B, where
   
1 1 0 3 1
2 1 −1 1  1
A=  3 −1 −1 2  and B = −3
   (3.1)
−1 2 3 −1 4
(b) Then use the factorization to solve the system


x1 + x2 + 3x4 = 8
2x1 + x2 − x3 + x4 = 7
3x1 − x2 − x3 + 2x4 = 14
−x1 + 2x2 + 3x3 − x4 = −7.
Sol. (a) We take the coefficient matrix and apply Gauss elimination.
Multipliers are m21 = 2, m31 = 3, and m41 = −1.
Sequence of operations E2 → E2 − 2E1, E3 → E3 − 3E1, E4 → E4 − (−1)E1.
 
1 1 0 3
0 −1 −1 −5
 
0 −4 −1 −7
0 3 3 2
Multipliers are m32 = 4 and m42 = −3.
E3 → E3 − 4E2, E4 → E4 − (−3)E2.
 
1 1 0 3
0 −1 −1 −5 
∼
0 0

3 13 
0 0 0 −13
The multipliers mij and the upper triangular matrix produce the following factorization
    
1 1 0 3 1 0 0 0 1 1 0 3
2 1 −1 1  2 1 0 0 0 −1 −1 −5  = LU.
 
A= =
3 −1 −1 2   3 4 1 0 0 0 3 13 
−1 2 3 −1 −1 −3 0 1 0 0 0 −13
(b)
      
1 0 0 0 1 1 0 3 x1 8
2 1 0 0 0 −1 −1 −5  x2   7 
Ax = (LU )x = 
3
    =  
4 1 0 0 0 3 13  x3   14 
−1 −3 0 1 0 0 0 −13 x4 −7
We first introduce the substitution y = U x. Then B = L(U x) = Ly. That is,
    
1 0 0 0 y1 8
2 1 0 0 y2   7 
Ly = 3
  =  
4 1 0 y3   14 
−1 −3 0 1 y4 −7
This system is solved for y by a simple forward-substitution process:

y1 = 8, y2 = −9, y3 = 26, y4 = −26.


We then solve Ux = y for x, the solution of the original system; that is,
    
1 1 0 3 x1 8
0 −1 −1 −5  x2   −9 
   =  
0 0 3 13  x3   26 
0 0 0 −13 x4 −26
Using backward substitution we obtain x4 = 2, x3 = 0, x2 = −1, x1 = 3.
Example 9. Show that the LU Factorization Algorithm requires
a. (1/3)n³ − (1/3)n multiplications/divisions and (1/3)n³ − (1/2)n² + (1/6)n additions/subtractions.
b. Show that solving Ly = b, where L is a lower-triangular matrix with lii = 1 for all i, requires
(1/2)n² − (1/2)n multiplications/divisions and (1/2)n² − (1/2)n additions/subtractions.
c. Show that solving Ax = b by first factoring A into A = LU and then solving Ly = b and U x = y
requires the same number of operations as the Gaussian elimination algorithm.

Sol. a. We have already counted the arithmetic operations in detail for Gauss elimination; here we
give the same counts for the LU factorization.
The total number of additions/subtractions from A to U is n(n−1)(2n−1)/6 = (1/3)n³ − (1/2)n² + (1/6)n,
and the total number of multiplications/divisions is n(n−1)/2 + n(n−1)(2n−1)/6 = n(n²−1)/3 =
(1/3)n³ − (1/3)n.
These counts remain the same for factorizing the matrix A into L and U.

b. Solving Ly = b, where L is a lower-triangular matrix with lii = 1 for all i, requires
total additions/subtractions 0 + 1 + · · · + (n − 1) = n(n−1)/2, and
total multiplications/divisions 0 + 1 + · · · + (n − 1) = n(n−1)/2.
These operations are counted in the same manner as for back substitution; since lii is always 1, one
division is saved at each step.

c. Solving Ax = b via A = LU uses the factorization counts from part a together with the substitution
counts from part b (for Ly = b) and the back-substitution counts (for Ux = y); adding these gives
exactly the operation counts of the Gaussian elimination algorithm.

Exercises
(1) Use Gaussian elimination with backward substitution and two-digit rounding arithmetic to
solve the following linear systems. Do not reorder the equations. (The exact solution to each
system is x1 = −1, x2 = 1, x3 = 3.)
a.

−x1 + 4x2 + x3 = 8
(5/3)x1 + (2/3)x2 + (2/3)x3 = 1
2x1 + x2 + 4x3 = 11.

b.

4x1 + 2x2 − x3 = −5
(1/9)x1 + (1/9)x2 − (1/3)x3 = −1
x1 + 4x2 + 2x3 = 9.

(2) Using the four-digit arithmetic solve the following system of equations by Gaussian elimination
with and without partial pivoting

0.729x1 + 0.81x2 + 0.9x3 = 0.6867


x1 + x2 + x3 = 0.8338
1.331x1 + 1.21x2 + 1.1x3 = 1.000.

This system has exact solution, rounded to four places x1 = 0.2245, x2 = 0.2814, x3 = 0.3279.
Compare your answers!
(3) Use the Gaussian elimination algorithm to solve the following linear systems, if possible, and
determine whether row interchanges are necessary:
a.
x1 − x2 + 3x3 = 2
3x1 − 3x2 + x3 = −1
x1 + x2 = 3.
b.
2x1 − x2 + x3 − x4 = 6
x2 − x3 + x4 = 5
x4 = 5
x3 − x4 = 3.
(4) Use Gaussian elimination and three-digit chopping arithmetic to solve the following linear
system, and compare the approximations to the actual solution [0, 10, 1/7]T .
3.03x1 − 12.1x2 + 14x3 = −119
−3.03x1 + 12.1x2 − 7x3 = 120
6.11x1 − 14.2x2 + 21x3 = −139.
(5) Repeat the above exercise using Gaussian elimination with partial and scaled partial pivoting
and three-digit rounding arithmetic.
(6) Suppose that
2x1 + x2 + 3x3 = 1
4x1 + 6x2 + 8x3 = 5
6x1 + αx2 + 10x3 = 5,
with |α| < 10. For which of the following values of α will there be no row interchange required
when solving this system using scaled partial pivoting?
a. α = 6 b. α = 9 c. α = −3.
(7) Modify the LU Factorization Algorithm so that it can be used to solve a linear system, and
then solve the following linear systems.
2x1 − x2 + x3 = −1
3x1 + 3x2 + 9x3 = 0
3x1 + 3x2 + 5x3 = 4.

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley & Sons, 3rd
edition, 2004.

Appendix A. Algorithms
Algorithm (Gauss Elimination)
(1) Give inputs matrix A and right hand vector b and compute order N = max(size(A)).
(2) Perform Gaussian elimination
for j = 2 : N
for i = j : N
Calculate multipliers: m = A(i, j − 1)/A(j − 1, j − 1);
Replace entries: A(i, :) = A(i, :) − A(j − 1, :) ∗ m;
Replace b: b(i) = b(i) − m ∗ b(j − 1);
end for i
end for j.
(3) Perform back substitution


x = zeros(N, 1);
Compute x(N ) = b(N )/A(N, N );
for i = N − 1 : −1 : 1,
x(i) = (b(i) − A(i, i + 1 : N ) ∗ x(i + 1 : N ))/A(i, i);
end for i.

(4) Display the result x(i).


(5) End of program. Stop.
CHAPTER 4 (5 LECTURES)
ITERATIVE TECHNIQUES IN MATRIX ALGEBRA

1. Introduction
In this chapter we will study iterative techniques to solve linear systems. An initial approximation
(or approximations) will be found, and new approximations are then determined based on how well
the previous approximations satisfied the equation. The objective is to find a way to minimize the
difference between the approximations and the exact solution. To discuss iterative methods for solving
linear systems, we first need to determine a way to measure the distance between n-dimensional column
vectors. This will permit us to determine whether a sequence of vectors converges to a solution of the
system. In actuality, this measure is also needed when the solution is obtained by the direct methods
presented in Chapter 3. Those methods required a large number of arithmetic operations, and using
finite-digit arithmetic leads only to an approximation to an actual solution of the system. We end
the chapter by presenting a way to find the dominant eigenvalue and an associated eigenvector. The
dominant eigenvalue plays an important role in the convergence of any iterative method.
1.1. Norms of Vectors and Matrices.
1.2. Vector Norms. Let Rn denote the set of all n−dimensional column vectors with real-number
components. To define a distance in Rn we use the notion of a norm, which is the generalization of
the absolute value on R, the set of real numbers.
Definition 1.1. A vector norm on Rⁿ is a function ‖·‖ from Rⁿ into R with the following properties:
(1) ‖x‖ ≥ 0 for all x ∈ Rⁿ.
(2) ‖x‖ = 0 if and only if x = 0.
(3) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ Rⁿ (triangle inequality).
(4) ‖αx‖ = |α| ‖x‖ for all x ∈ Rⁿ and α ∈ R.
Definition 1.2. The l2 and l∞ norms for the vector x = (x1, x2, . . . , xn)ᵗ are defined by
‖x‖₂ = (Σ_{i=1}^{n} |xi|²)^{1/2} and ‖x‖∞ = max_{1≤i≤n} |xi|.
Note that each of these norms reduces to the absolute value in the case n = 1.
The l2 norm is called the Euclidean norm of the vector x because it represents the usual notion of
distance from the origin in case x is in R1 = R, R2 , or R3 . For example, the l2 norm of the vector
x = (x1 , x2 , x3 )t gives the length of the straight line joining the points (0, 0, 0) and (x1 , x2 , x3 ).
Example 1. Determine the l2 norm and the l∞ norm of the vector x = (−1, 1, −2)ᵗ.
Sol. The vector x = (−1, 1, −2)ᵗ in R³ has norms
‖x‖₂ = √((−1)² + (1)² + (−2)²) = √6
and
‖x‖∞ = max{|−1|, |1|, |−2|} = 2.
Definition 1.3 (Distance between vectors in Rⁿ). If x = (x1, x2, · · · , xn)ᵗ and y = (y1, y2, · · · , yn)ᵗ
are vectors in Rⁿ, the l2 and l∞ distances between x and y are defined by
‖x − y‖₂ = (Σ_{i=1}^{n} (xi − yi)²)^{1/2},
‖x − y‖∞ = max_{1≤i≤n} |xi − yi|.
Definition 1.4 (Matrix Norm). A matrix norm on the set of all n × n matrices is a real-valued
function ‖·‖ defined on this set, satisfying for all n × n matrices A and B and all real numbers α:
(1) ‖A‖ ≥ 0,
(2) ‖A‖ = 0 if and only if A is O, the matrix with all 0 entries,
(3) ‖αA‖ = |α| ‖A‖,
(4) ‖A + B‖ ≤ ‖A‖ + ‖B‖,
(5) ‖AB‖ ≤ ‖A‖ ‖B‖.
If ‖·‖ is a vector norm on Rⁿ, then
‖A‖ = max_{‖x‖=1} ‖Ax‖
is a matrix norm.
The matrix norms we will consider have the forms
kAk∞ = max kAxk∞ ,
kxk∞ =1

and
kAk2 = max kAxk2 .
kxk2 =1

Theorem 1.5. If A = (a_{ij}) is an n × n matrix, then
\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|.

Example 2. Determine \|A\|_\infty for the matrix
A = \begin{pmatrix} 1 & 2 & -1 \\ 0 & 3 & -1 \\ 5 & -1 & 1 \end{pmatrix}.  (1.1)
Sol. We have
\sum_{j=1}^{3} |a_{1j}| = |1| + |2| + |-1| = 4, \qquad \sum_{j=1}^{3} |a_{2j}| = |0| + |3| + |-1| = 4,
and
\sum_{j=1}^{3} |a_{3j}| = |5| + |-1| + |1| = 7.
So the above theorem implies that \|A\|_\infty = \max\{4, 4, 7\} = 7.
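The row-sum formula is easy to check numerically; a small sketch (variable names ours) that also compares with NumPy's built-in infinity norm:

import numpy as np

A = np.array([[1, 2, -1],
              [0, 3, -1],
              [5, -1, 1]], dtype=float)

row_sums = np.abs(A).sum(axis=1)     # the three row sums 4, 4, 7
print(row_sums.max())                # 7.0, by Theorem 1.5
print(np.linalg.norm(A, np.inf))     # 7.0, NumPy's built-in check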


Definition 1.6 (Eigenvalue and eigenvector). Let A be a square matrix. A number λ is called an eigenvalue of A if there exists a nonzero vector x such that Ax = λx. Here x is called a corresponding eigenvector.
Definition 1.7 (Characteristic polynomial). The characteristic polynomial of A is defined as
P (λ) = det(A − λI).
λ is an eigenvalue of matrix A if and only if λ is a root of the characteristic polynomial, i.e., P (λ) = 0.
Definition 1.8 (Spectral Radius). The spectral radius ρ(A) of a matrix A is defined by
ρ(A) = max |λ|, where λ is an eigenvalue of A.
Theorem 1.9. If A is an n × n matrix, then
\|A\|_2 = \left[ \rho(A^T A) \right]^{1/2}.
The proof of this theorem requires more information concerning eigenvalues. We illustrate the
procedure by taking an example.

Example 3. Determine the l_2 norm of
A = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 2 & 1 \\ -1 & 1 & 2 \end{pmatrix}.

Sol. We calculate \rho(A^T A), therefore we need to find the eigenvalues of A^T A.
A^T A = \begin{pmatrix} 1 & 1 & -1 \\ 1 & 2 & 1 \\ 0 & 1 & 2 \end{pmatrix} \begin{pmatrix} 1 & 1 & 0 \\ 1 & 2 & 1 \\ -1 & 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 2 & -1 \\ 2 & 6 & 4 \\ -1 & 4 & 5 \end{pmatrix}.
The eigenvalues satisfy the characteristic equation
0 = \det(A^T A - \lambda I) = \det \begin{pmatrix} 3-\lambda & 2 & -1 \\ 2 & 6-\lambda & 4 \\ -1 & 4 & 5-\lambda \end{pmatrix} = -\lambda^3 + 14\lambda^2 - 42\lambda = -\lambda(\lambda^2 - 14\lambda + 42)
\implies \lambda = 0, \; 7 \pm \sqrt{7}.
Therefore \rho(A^T A) = 7 + \sqrt{7}. Thus
\|A\|_2 = \left[ \rho(A^T A) \right]^{1/2} = \sqrt{7 + \sqrt{7}} \approx 3.106.
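The value can be verified numerically; a minimal sketch (names ours), computing the eigenvalues of A^T A with NumPy:

import numpy as np

A = np.array([[1, 1, 0],
              [1, 2, 1],
              [-1, 1, 2]], dtype=float)

eigvals = np.linalg.eigvals(A.T @ A)   # 0, 7 - sqrt(7), 7 + sqrt(7)
print(np.sqrt(eigvals.real.max()))     # ~3.106
print(np.linalg.norm(A, 2))            # same value from NumPy's 2-norm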

1.3. Convergent Matrices. In studying iterative matrix techniques, it is of particular importance


to know when powers of a matrix become small (that is, when all the entries approach zero). Matrices
of this type are called convergent.
Definition 1.10. An n × n matrix A is convergent if
\lim_{k \to \infty} A^k = 0.

Example 4. Show that the matrix
A = \begin{pmatrix} \tfrac{1}{2} & 0 \\ \tfrac{1}{4} & \tfrac{1}{2} \end{pmatrix}
is a convergent matrix.
Sol. Computing powers of A, we obtain:
A^2 = \begin{pmatrix} \tfrac{1}{4} & 0 \\ \tfrac{1}{4} & \tfrac{1}{4} \end{pmatrix}, \qquad A^3 = \begin{pmatrix} \tfrac{1}{8} & 0 \\ \tfrac{3}{16} & \tfrac{1}{8} \end{pmatrix},
and, in general,
A^k = \begin{pmatrix} \tfrac{1}{2^k} & 0 \\ \tfrac{k}{2^{k+1}} & \tfrac{1}{2^k} \end{pmatrix}.
So A is a convergent matrix because
\lim_{k \to \infty} \frac{1}{2^k} = 0 \quad \text{and} \quad \lim_{k \to \infty} \frac{k}{2^{k+1}} = 0,
\therefore \lim_{k \to \infty} A^k = 0,
which implies matrix A is convergent.
Note that the convergent matrix A in this Example has spectral radius \rho(A) = \tfrac{1}{2}, because \tfrac{1}{2} is the
only eigenvalue of A. This illustrates an important connection that exists between the spectral radius
of a matrix and the convergence of the matrix, as detailed in the following result.

Theorem 1.11. The following statements are equivalent.
(i) A is a convergent matrix.
(ii) \lim_{k \to \infty} \|A^k\| = 0, for all natural norms.
(iii) \rho(A) < 1.
(iv) \lim_{k \to \infty} A^k x = 0, for every x.
The proof of this theorem can be found in advanced texts of numerical analysis.

2. Iterative Methods
The linear system Ax = b may have a large order. For such systems Gauss elimination is often too
expensive in either computation time or computer memory requirements or both.
In an iterative method, a sequence of progressively better iterates is produced to approximate the solution.
Jacobi and Gauss-Seidel Method: We start with an example. Let us consider a system of equations
9x1 + x2 + x3 = 10
2x1 + 10x2 + 3x3 = 19
3x1 + 4x2 + 11x3 = 0.
One class of iterative methods for solving this system is as follows. We write
x_1 = \frac{1}{9}(10 - x_2 - x_3)
x_2 = \frac{1}{10}(19 - 2x_1 - 3x_3)
x_3 = \frac{1}{11}(0 - 3x_1 - 4x_2).
Let x^{(0)} = [x_1^{(0)} \; x_2^{(0)} \; x_3^{(0)}] be an initial approximation of the solution x. Then define the iteration
x_1^{(k+1)} = \frac{1}{9}(10 - x_2^{(k)} - x_3^{(k)})
x_2^{(k+1)} = \frac{1}{10}(19 - 2x_1^{(k)} - 3x_3^{(k)})
x_3^{(k+1)} = \frac{1}{11}(0 - 3x_1^{(k)} - 4x_2^{(k)}), \qquad k = 0, 1, 2, \dots.
This is called the Jacobi method, or method of simultaneous replacements. The method is named after the German mathematician Carl Gustav Jacob Jacobi.
We start with [0 \; 0 \; 0] and obtain
x_1^{(1)} = 1.1111, \quad x_2^{(1)} = 1.9000, \quad x_3^{(1)} = 0.0,
x_1^{(2)} = 0.9000, \quad x_2^{(2)} = 1.6778, \quad x_3^{(2)} = -0.9939,
etc.
Another approach to solve the same system is the following:
x_1^{(k+1)} = \frac{1}{9}(10 - x_2^{(k)} - x_3^{(k)})
x_2^{(k+1)} = \frac{1}{10}(19 - 2x_1^{(k+1)} - 3x_3^{(k)})
x_3^{(k+1)} = \frac{1}{11}(0 - 3x_1^{(k+1)} - 4x_2^{(k+1)}), \qquad k = 0, 1, 2, \dots.
This method is called Gauss-Seidel, or method of successive replacements. It is named after the German mathematicians Carl Friedrich Gauss and Philipp Ludwig von Seidel. Starting with [0 \; 0 \; 0], we obtain
x_1^{(1)} = 1.1111, \quad x_2^{(1)} = 1.6778, \quad x_3^{(1)} = -0.9131,
x_1^{(2)} = 1.0262, \quad x_2^{(2)} = 1.9687, \quad x_3^{(2)} = -0.9958.

General Approach: We consider a system Ax = b of order n and for i = 1, 2, \dots, n, we write the i-th equation
a_{i1} x_1 + a_{i2} x_2 + \dots + a_{ii} x_i + \dots + a_{in} x_n = b_i,
\left( \sum_{\substack{j=1 \\ j \ne i}}^{n} a_{ij} x_j \right) + a_{ii} x_i = b_i.
The Jacobi iterative method is obtained by solving the i-th equation for x_i to obtain (provided a_{ii} \ne 0)
x_i = \frac{1}{a_{ii}} \left[ \sum_{\substack{j=1 \\ j \ne i}}^{n} (-a_{ij} x_j) + b_i \right].
For each k \ge 0, generate the components x_i^{(k+1)} from x_i^{(k)} as
x_i^{(k+1)} = \frac{1}{a_{ii}} \left[ \sum_{\substack{j=1 \\ j \ne i}}^{n} (-a_{ij} x_j^{(k)}) + b_i \right].
Now we write the above iterative scheme in matrix form. To write the matrix form, we take
a_{ii} x_i^{(k+1)} = \sum_{\substack{j=1 \\ j \ne i}}^{n} (-a_{ij} x_j^{(k)}) + b_i,
that is,
D x^{(k+1)} = -(L + U) x^{(k)} + b,
where D, L and U are the diagonal, strictly lower triangular and strictly upper triangular parts of A, respectively, and A = L + D + U. Here b = [b_1 \; b_2 \; \cdots \; b_n]^t.
If D^{-1} exists, then the Jacobi iterative scheme is
x^{(k+1)} = -D^{-1}(L + U) x^{(k)} + D^{-1} b.
We write T_j = -D^{-1}(L + U) and B = D^{-1} b to obtain
x^{(k+1)} = T_j x^{(k)} + B.
Matrix T_j is called the iteration matrix.
For Gauss-Seidel, we write the i-th equation as
a_{i1} x_1 + a_{i2} x_2 + \dots + a_{i,i-1} x_{i-1} + a_{ii} x_i + a_{i,i+1} x_{i+1} + \dots + a_{in} x_n = b_i,
\sum_{j=1}^{i-1} a_{ij} x_j + a_{ii} x_i + \sum_{j=i+1}^{n} a_{ij} x_j = b_i.
\therefore x_i = \frac{1}{a_{ii}} \left[ -\sum_{j=1}^{i-1} a_{ij} x_j - \sum_{j=i+1}^{n} a_{ij} x_j + b_i \right].
The iterative scheme is given, for each k \ge 0, by
x_i^{(k+1)} = \frac{1}{a_{ii}} \left[ -\sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} + b_i \right].
In matrix form
(D + L) x^{(k+1)} = -U x^{(k)} + b,
where D, L and U are the diagonal, strictly lower triangular and strictly upper triangular parts of A, so that
x^{(k+1)} = -(D + L)^{-1} U x^{(k)} + (D + L)^{-1} b,
x^{(k+1)} = T_g x^{(k)} + B, \qquad k = 0, 1, 2, \dots.
Here T_g = -(D + L)^{-1} U is called the iteration matrix, and B = (D + L)^{-1} b.
Stopping Criteria: Since these techniques are iterative, we require a stopping criterion. If ε is the desired accuracy, then we can use the following:
\frac{\|x^{(k)} - x^{(k-1)}\|_\infty}{\|x^{(k)}\|_\infty} < \varepsilon.
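Both iterations, together with this stopping criterion, fit in a few lines of code. The following is a minimal Python sketch (function names, tolerance and iteration cap are our choices), applied to the 3 × 3 system solved above; both runs converge to the exact solution (1, 2, -1)^t:

import numpy as np

def jacobi(A, b, x0, eps=1e-6, kmax=100):
    D = np.diag(np.diag(A))
    T = -np.linalg.solve(D, A - D)        # T_j = -D^{-1}(L + U)
    B = np.linalg.solve(D, b)             # B   =  D^{-1} b
    x = x0.copy()
    for _ in range(kmax):
        x_new = T @ x + B
        if np.linalg.norm(x_new - x, np.inf) < eps * np.linalg.norm(x_new, np.inf):
            return x_new
        x = x_new
    return x

def gauss_seidel(A, b, x0, eps=1e-6, kmax=100):
    n = len(b)
    x = x0.copy()
    for _ in range(kmax):
        x_old = x.copy()
        for i in range(n):
            s1 = A[i, :i] @ x[:i]           # uses already-updated components
            s2 = A[i, i+1:] @ x_old[i+1:]   # uses previous components
            x[i] = (b[i] - s1 - s2) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < eps * np.linalg.norm(x, np.inf):
            return x
    return x

A = np.array([[9.0, 1, 1], [2, 10, 3], [3, 4, 11]])
b = np.array([10.0, 19, 0])
print(jacobi(A, b, np.zeros(3)))        # -> [1, 2, -1]
print(gauss_seidel(A, b, np.zeros(3)))  # -> [1, 2, -1]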
Example 5. Use the Gauss-Seidel method to approximate the solution of the following system:
4x_1 + x_2 - x_3 = 3
2x_1 + 7x_2 + x_3 = 19
x_1 - 3x_2 + 12x_3 = 31.
Continue the iterations until two successive approximations are identical when rounded to three significant digits.
Sol. To begin, write the system in the form
x_1 = \frac{1}{4}(3 - x_2 + x_3)
x_2 = \frac{1}{7}(19 - 2x_1 - x_3)
x_3 = \frac{1}{12}(31 - x_1 + 3x_2).
As
|a_{11}| = 4 > |a_{12}| + |a_{13}| = 2,
|a_{22}| = 7 > |a_{21}| + |a_{23}| = 3,
|a_{33}| = 12 > |a_{31}| + |a_{32}| = 4,
the coefficient matrix is strictly diagonally dominant. Therefore the Gauss-Seidel iterations will converge.
Starting with the initial vector x^{(0)} = [0, 0, 0]^t, the first approximation is
x_1^{(1)} = 0.7500, \quad x_2^{(1)} = 2.5000, \quad x_3^{(1)} = 3.1458.
Similarly
x^{(2)} = [0.9115, 2.0045, 3.0085]^t,
x^{(3)} = [1.0010, 1.9985, 2.9995]^t,
x^{(4)} = [1.000, 2.000, 3.000]^t.

2.1. Convergence analysis of iterative methods. To study the convergence of general iteration techniques, we need to analyze the formula
x^{(k+1)} = T x^{(k)} + B, \qquad \text{for each } k = 0, 1, \dots,
where x^{(0)} is arbitrary. The next lemma and theorem provide the key for this study.
Lemma 2.1. If the spectral radius \rho(T) < 1, then (I - T)^{-1} exists, and
(I - T)^{-1} = I + T + T^2 + \dots = \sum_{k=0}^{\infty} T^k.

Proof. If T v = \lambda v, then T^k v = \lambda^k v. Since \rho(T) < 1, it follows that
\lim_{k \to \infty} T^k = 0.
Since the above limit is zero, the matrix series
I + T + T^2 + \dots + T^k + \dots
is convergent. Now, by multiplying the matrix (I - T) by this series, we obtain
(I - T)(I + T + T^2 + \dots + T^k + \dots) = I.
Thus
(I - T)^{-1} = \sum_{k=0}^{\infty} T^k.
k=0

Theorem 2.2 (Necessary and sufficient condition). A necessary and sufficient condition for the convergence of an iterative method is that the eigenvalues of the iteration matrix T satisfy the inequality
\rho(T) < 1.
Proof. Let \rho(T) < 1. The sequence of vectors x^{(k)} generated by the iterative method satisfies
x^{(1)} = T x^{(0)} + B,
x^{(2)} = T x^{(1)} + B = T(T x^{(0)} + B) + B = T^2 x^{(0)} + (T + I)B,
\dots
x^{(k)} = T^k x^{(0)} + (T^{k-1} + T^{k-2} + \dots + T + I)B.
Since \rho(T) < 1, this implies
\lim_{k \to \infty} T^k x^{(0)} = 0.
Therefore, using Lemma 2.1 for the geometric series in B,
\lim_{k \to \infty} x^{(k)} = (I - T)^{-1} B.
Therefore, x^{(k)} converges to the unique solution of x = T x + B.
Conversely, assume that the sequence x^{(k)} converges to x. Now
x - x^{(k)} = T x + B - T x^{(k-1)} - B = T(x - x^{(k-1)}) = T^2 (x - x^{(k-2)}) = \dots = T^k (x - x^{(0)}).
Let z = x - x^{(0)}; then
\lim_{k \to \infty} T^k z = \lim_{k \to \infty} (x - x^{(k)}) = x - \lim_{k \to \infty} x^{(k)} = x - x = 0.
Since x^{(0)}, and hence z, is arbitrary, this forces
\rho(T) < 1.

Theorem 2.3. If A is strictly diagonally dominant in Ax = b, then the iterative methods (Jacobi and Gauss-Seidel) always converge for any initial starting vector.

Proof. We assume that A is strictly diagonally dominant, hence a_{ii} \ne 0 and
|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|, \qquad i = 1, 2, \dots, n.
Jacobi iterations are given by
x^{(k+1)} = -D^{-1}(L + U) x^{(k)} + D^{-1} b = T_j x^{(k)} + B.
The method is convergent iff \rho(T_j) < 1. Since \rho(T_j) \le \|T_j\|_\infty, it suffices to bound the norm. The entries of T_j = -D^{-1}(L + U) are -a_{ij}/a_{ii} for j \ne i and 0 for j = i, so by Theorem 1.5,
\|T_j\|_\infty = \max_{1 \le i \le n} \sum_{\substack{j=1 \\ j \ne i}}^{n} \frac{|a_{ij}|}{|a_{ii}|} < 1
by strict diagonal dominance. This shows the convergence of the Jacobi method.

Further we prove the convergence of the Gauss-Seidel method. Gauss-Seidel iterations are given by
x^{(k+1)} = -(D + L)^{-1} U x^{(k)} + (D + L)^{-1} b = T_g x^{(k)} + B.
Let \lambda be an eigenvalue of the iteration matrix T_g and let x be a corresponding eigenvector. Then
T_g x = \lambda x
-(D + L)^{-1} U x = \lambda x
-U x = \lambda (D + L) x.
Componentwise, for i = 1, 2, \dots, n,
-\sum_{j=i+1}^{n} a_{ij} x_j = \lambda \left[ \sum_{j=1}^{i} a_{ij} x_j \right] = \lambda a_{ii} x_i + \lambda \sum_{j=1}^{i-1} a_{ij} x_j,
so that
\lambda a_{ii} x_i = -\lambda \sum_{j=1}^{i-1} a_{ij} x_j - \sum_{j=i+1}^{n} a_{ij} x_j,
|\lambda| \, |a_{ii}| \, |x_i| \le |\lambda| \sum_{j=1}^{i-1} |a_{ij}| \, |x_j| + \sum_{j=i+1}^{n} |a_{ij}| \, |x_j|.
Since x is an eigenvector, x \ne 0, so we may normalize \|x\|_\infty = 1 and choose i with |x_i| = 1; then |x_j| \le 1 for all j. Hence
|\lambda| \left( |a_{ii}| - \sum_{j=1}^{i-1} |a_{ij}| \right) \le \sum_{j=i+1}^{n} |a_{ij}|
\implies |\lambda| \le \frac{\sum_{j=i+1}^{n} |a_{ij}|}{|a_{ii}| - \sum_{j=1}^{i-1} |a_{ij}|} < 1,
where the last inequality follows from strict diagonal dominance, since |a_{ii}| - \sum_{j<i} |a_{ij}| > \sum_{j>i} |a_{ij}|. This implies the spectral radius \rho(T_g) < 1, and hence Gauss-Seidel is convergent.

Example 6. The linear system
2x_1 - x_2 + x_3 = -1,
2x_1 + 2x_2 + 2x_3 = 4,
-x_1 - x_2 + 2x_3 = -5
has the solution (1, 2, -1)^T.
(a) Show that \rho(T_j) = \frac{\sqrt{5}}{2} > 1. (b) Show that \rho(T_g) = \tfrac{1}{2}.
Sol. We write A = L + D + U.
(a)
T_j = -D^{-1}(L + U) = \begin{pmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{pmatrix} \begin{pmatrix} 0 & 1 & -1 \\ -2 & 0 & -2 \\ 1 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0.5 & -0.5 \\ -1 & 0 & -1 \\ 0.5 & 0.5 & 0 \end{pmatrix}.
The spectral radius \rho(T_j) of the matrix T_j is defined by
\rho(T_j) = \max |\lambda|, where \lambda is an eigenvalue of T_j.
The eigenvalues of T_j are
\lambda = 0, \; \pm \frac{\sqrt{5}}{2} i.
Thus \rho(T_j) = \frac{\sqrt{5}}{2} > 1.
(b)
T_g = -(D + L)^{-1} U = -\begin{pmatrix} 0.5 & 0 & 0 \\ -0.5 & 0.5 & 0 \\ 0 & 0.25 & 0.5 \end{pmatrix} \begin{pmatrix} 0 & -1 & 1 \\ 0 & 0 & 2 \\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0.5 & -0.5 \\ 0 & -0.5 & -0.5 \\ 0 & 0 & -0.5 \end{pmatrix}.
The eigenvalues are -\tfrac{1}{2}, -\tfrac{1}{2}, 0. Thus \rho(T_g) = \tfrac{1}{2} < 1.
The spectral radius of the iteration matrix is greater than one for the Jacobi method and less than one for Gauss-Seidel. Therefore the Gauss-Seidel iterations converge (while the Jacobi iterations do not).

3. The SOR method


We observed that the convergence of an iterative technique depends on the spectral radius of the
matrix associated with the method. One way to select a procedure to accelerate convergence is to
choose a method whose associated matrix has minimal spectral radius. These techniques are known
as Successive Over-Relaxation (SOR). The SOR method is devised by applying extrapolation to the
Gauss-Seidel method. This extrapolation takes the form of a weighted average between the previous
iterate and the computed Gauss-Seidel iterate, successively for each component. We multiply the Gauss-Seidel value, denoted here \hat{x}_i^{(k+1)}, by a weight ω, and to calculate x_i^{(k+1)} we modify the Gauss-Seidel procedure to
x_i^{(k+1)} = x_i^{(k)} + \omega (\hat{x}_i^{(k+1)} - x_i^{(k)})
x_i^{(k+1)} = (1 - \omega) x_i^{(k)} + \omega \, \hat{x}_i^{(k+1)}.

The last term is calculated by Gauss-Seidel and we write
x_i^{(k+1)} = (1 - \omega) x_i^{(k)} + \frac{\omega}{a_{ii}} \left[ b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \right].

The choice of relaxation factor ω is not necessarily easy, and depends upon the properties of the
coefficient matrix. If A is a symmetric and positive definite matrix and 0 < ω < 2, then the SOR
method converges for any choice of initial approximate vector x(0) .
Important Note: If a matrix A is symmetric, it is positive definite if and only if all its leading principal submatrices (minors) have a positive determinant.
Example 7. Consider a linear system Ax = b, where
A = \begin{pmatrix} 3 & -1 & 1 \\ -1 & 3 & -1 \\ 1 & -1 & 3 \end{pmatrix}, \qquad b = \begin{pmatrix} -1 \\ 7 \\ -7 \end{pmatrix}.
a. Check that the SOR method with value ω = 1.25 of the relaxation parameter can be used to solve this system.
b. Compute the first iteration by the SOR method starting at the point x^{(0)} = (0, 0, 0)^t.
Sol. a. Let us verify the sufficient condition for using the SOR method. We have to check if the matrix A is symmetric positive definite. A is symmetric as A = A^T, so let us check positive definiteness:
\det(3) = 3 > 0, \qquad \det \begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix} = 8 > 0, \qquad \det(A) = 20 > 0.
All leading principal minors are positive and so the matrix A is positive definite. We know that for symmetric positive definite matrices the SOR method converges for values of the relaxation parameter ω from the interval 0 < ω < 2. Therefore the SOR method with value ω = 1.25 can be used to solve this system.
b. The iterations of the SOR method are easier to compute by elements than in the vector form. Write the system as equations and write down the equations for the Gauss-Seidel iterations:
x_1^{(k+1)} = (-1 + x_2^{(k)} - x_3^{(k)})/3
x_2^{(k+1)} = (7 + x_1^{(k+1)} + x_3^{(k)})/3
x_3^{(k+1)} = (-7 - x_1^{(k+1)} + x_2^{(k+1)})/3.
Now multiply the right hand side by the parameter ω and add to it the component of x^{(k)} from the previous iteration multiplied by the factor (1 - ω):
x_1^{(k+1)} = (1 - \omega) x_1^{(k)} + \omega(-1 + x_2^{(k)} - x_3^{(k)})/3
x_2^{(k+1)} = (1 - \omega) x_2^{(k)} + \omega(7 + x_1^{(k+1)} + x_3^{(k)})/3
x_3^{(k+1)} = (1 - \omega) x_3^{(k)} + \omega(-7 - x_1^{(k+1)} + x_2^{(k+1)})/3.
For k = 0:
x_1^{(1)} = (1 - 1.25) \cdot 0 + 1.25 \cdot (-1 + 0 - 0)/3 = -0.41667
x_2^{(1)} = (1 - 1.25) \cdot 0 + 1.25 \cdot (7 - 0.41667 + 0)/3 = 2.7431
x_3^{(1)} = (1 - 1.25) \cdot 0 + 1.25 \cdot (-7 + 0.41667 + 2.7431)/3 = -1.6001.
The next three iterations are
x(2) = (1.4972, 2.1880, −2.2288)t ,
x(3) = (1.0494, 1.8782, −2.0141)t ,
x(4) = (0.9428, 2.0007, −1.9723)t .
The exact solution is x = (1, 2, −2)t .
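The componentwise SOR sweep is equally short in code; a minimal Python sketch (the function name and sweep count are ours) that reproduces the iterates above:

import numpy as np

def sor(A, b, omega, x0, sweeps=25):
    n = len(b)
    x = x0.astype(float).copy()
    for _ in range(sweeps):
        for i in range(n):
            # Gauss-Seidel value for component i
            gs = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
            x[i] = (1 - omega) * x[i] + omega * gs   # weighted average
    return x

A = np.array([[3.0, -1, 1], [-1, 3, -1], [1, -1, 3]])
b = np.array([-1.0, 7, -7])
print(sor(A, b, 1.25, np.zeros(3)))   # -> (1, 2, -2)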

4. Error Bounds and Iterative Refinement


Definition 4.1. Suppose x̃ ∈ Rn is an approximation to the solution of the linear system defined by
Ax = b. The residual vector for x̃ with respect to this system is r = b − Ax̃.
It seems intuitively reasonable that if x̃ is an approximation to the solution x of Ax = b and the
residual vector r = b − Ax̃ has the property that krk is small, then kx − x̃k would be small as well. This
is often the case, but certain systems, which occur frequently in practice, fail to have this property.
Example 8. The linear system Ax = b given by
\begin{pmatrix} 1 & 2 \\ 1.0001 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 3 \\ 3.0001 \end{pmatrix}
has the unique solution x = (1, 1)^t. Determine the residual vector for the poor approximation \tilde{x} = (3, -0.0001)^t.
Sol. We have
r = b - A\tilde{x} = \begin{pmatrix} 3 \\ 3.0001 \end{pmatrix} - \begin{pmatrix} 1 & 2 \\ 1.0001 & 2 \end{pmatrix} \begin{pmatrix} 3 \\ -0.0001 \end{pmatrix} = \begin{pmatrix} 0.0002 \\ 0 \end{pmatrix},
so \|r\|_\infty = 0.0002. Although the norm of the residual vector is small, the approximation \tilde{x} = (3, -0.0001)^t is obviously quite poor; in fact, \|x - \tilde{x}\|_\infty = 2.
Theorem 4.2. Suppose that \tilde{x} is an approximation to the solution of Ax = b, A is a nonsingular matrix, and r is the residual vector for \tilde{x}. Then for any natural norm,
\|x - \tilde{x}\| \le \|A^{-1}\| \cdot \|r\|,
and if x \ne 0 and b \ne 0,
\frac{\|x - \tilde{x}\|}{\|x\|} \le \|A\| \cdot \|A^{-1}\| \frac{\|r\|}{\|b\|}.
Proof. Since r = b - A\tilde{x} = Ax - A\tilde{x} and A is nonsingular, we have x - \tilde{x} = A^{-1} r, so
\|x - \tilde{x}\| = \|A^{-1} r\| \le \|A^{-1}\| \cdot \|r\|.
Moreover, since b = Ax, we have \|b\| \le \|A\| \cdot \|x\|. So 1/\|x\| \le \|A\|/\|b\| and
\frac{\|x - \tilde{x}\|}{\|x\|} \le \frac{\|A\| \cdot \|A^{-1}\|}{\|b\|} \|r\|.

Condition Numbers: The inequalities in the above theorem imply that \|A^{-1}\| and \|A\| \cdot \|A^{-1}\| provide an indication of the connection between the residual vector and the accuracy of the approximation. In general, the relative error \|x - \tilde{x}\|/\|x\| is of most interest, and this error is bounded by the product of \|A\| \cdot \|A^{-1}\| with the relative residual for this approximation, \|r\|/\|b\|. Any convenient norm can be used; the only requirement is that it be used consistently throughout.
Definition 4.3. The condition number of the nonsingular matrix A relative to a norm \|\cdot\| is
K(A) = \|A\| \cdot \|A^{-1}\|.
With this notation, the inequalities in the above theorem become
\|x - \tilde{x}\| \le K(A) \frac{\|r\|}{\|A\|}
and
\frac{\|x - \tilde{x}\|}{\|x\|} \le K(A) \frac{\|r\|}{\|b\|}.
For any nonsingular matrix A and natural norm \|\cdot\|,
1 = \|I\| = \|A \cdot A^{-1}\| \le \|A\| \cdot \|A^{-1}\| = K(A).
A matrix A is well-conditioned if K(A) is close to 1, and is ill-conditioned when K(A) is significantly greater than 1. Conditioning in this context refers to the relative security that a small residual vector implies a correspondingly accurate approximate solution. When K(A) is very large, the solution of Ax = b
will be very sensitive to relatively small changes in b; equivalently, a relatively small residual will quite possibly correspond to a relatively large error in \tilde{x} as compared with x. These comments are also valid when the changes are made to A rather than to b.
 
Example 9. Suppose \bar{x} = \begin{pmatrix} 0.98 \\ 1.1 \end{pmatrix} is an approximate solution for the linear system Ax = b, where
A = \begin{pmatrix} 3.9 & 1.6 \\ 6.8 & 2.9 \end{pmatrix}, \qquad b = \begin{pmatrix} 5.5 \\ 9.7 \end{pmatrix}.
Find a bound for the relative error \frac{\|x - \bar{x}\|}{\|x\|}.
Sol. The residual is given by
r = b - A\bar{x} = \begin{pmatrix} 5.5 \\ 9.7 \end{pmatrix} - \begin{pmatrix} 3.9 & 1.6 \\ 6.8 & 2.9 \end{pmatrix} \begin{pmatrix} 0.98 \\ 1.1 \end{pmatrix} = \begin{pmatrix} -0.0820 \\ -0.1540 \end{pmatrix}.
The bound for the relative error is (for the infinity norm)
\frac{\|x - \bar{x}\|}{\|x\|} \le \|A\| \, \|A^{-1}\| \frac{\|r\|}{\|b\|}.
Also
\det(A) = 0.43,
\therefore A^{-1} = \frac{1}{0.43} \begin{pmatrix} 2.9 & -1.6 \\ -6.8 & 3.9 \end{pmatrix} = \begin{pmatrix} 6.7442 & -3.7209 \\ -15.8140 & 9.0698 \end{pmatrix},
\|A\|_\infty = 9.7, \quad \|A^{-1}\|_\infty = 24.8837, \quad \|r\|_\infty = 0.1540, \quad \|b\|_\infty = 9.7.
\therefore \frac{\|x - \bar{x}\|}{\|x\|} \le \|A\| \, \|A^{-1}\| \frac{\|r\|}{\|b\|} = 3.8321.
Example 10. Determine the condition number for the matrix
A = \begin{pmatrix} 1 & 2 \\ 1.0001 & 2 \end{pmatrix}.
Sol. We saw in the previous example that the very poor approximation (3, -0.0001)^t to the exact solution (1, 1)^t had a residual vector with small norm, so we should expect the condition number of A to be large. We have \|A\|_\infty = \max\{|1| + |2|, |1.0001| + |2|\} = 3.0001, which would not be considered large. However,
A^{-1} = \begin{pmatrix} -10000 & 10000 \\ 5000.5 & -5000 \end{pmatrix}, \quad \text{so} \quad \|A^{-1}\|_\infty = 20000,
and for the infinity norm, K(A) = (20000)(3.0001) = 60002. The size of the condition number for this example should certainly keep us from making hasty accuracy decisions based on the residual of an approximation.
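The computation above is easy to reproduce; a small sketch (names ours) using NumPy, whose built-in np.linalg.cond with p = np.inf returns the same K(A):

import numpy as np

A = np.array([[1.0, 2.0], [1.0001, 2.0]])

Ainv = np.linalg.inv(A)
K = np.linalg.norm(A, np.inf) * np.linalg.norm(Ainv, np.inf)
print(K)                           # ~60002
print(np.linalg.cond(A, np.inf))   # NumPy's built-in condition number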
Example 11. Find the condition number K(A) of the matrix
A = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix}, \qquad |c| \ne 1.
When does A become ill-conditioned? What does this say about the linear system Ax = b? How is K(A) related to \det(A)?
Sol. The matrix A is well conditioned if K(A) is near 1. K(A) with respect to the norm \|\cdot\|_\infty is given as
K(A) = \|A\|_\infty \|A^{-1}\|_\infty.
Here \det(A) = 1 - c^2 and \operatorname{adj}(A) = \begin{pmatrix} 1 & -c \\ -c & 1 \end{pmatrix}. Thus
A^{-1} = \frac{1}{1 - c^2} \begin{pmatrix} 1 & -c \\ -c & 1 \end{pmatrix}.
Thus \|A\|_\infty = 1 + |c| and
\|A^{-1}\|_\infty = \frac{1}{|1 - c^2|} + \frac{|c|}{|1 - c^2|} = \frac{1 + |c|}{|1 - c^2|}.
Hence the condition number
K(A) = \frac{(1 + |c|)^2}{|1 - c^2|}.
Thus A is ill-conditioned when |c| is near 1. When the condition number is large, the solution of the system Ax = b is sensitive to small changes in A (and in b). If the determinant of A is small (here, when |c| is near 1), then the condition number of A will be very large.
4.1. The Residual Correction Method. A further use of this error estimation procedure is to define an iterative method for improving the computed value x. Let x^{(0)} be the initial computed value for x, generally obtained by using Gaussian elimination. Define
r^{(0)} = b - A x^{(0)} = A(x - x^{(0)}).
Then the error e^{(0)} = x - x^{(0)} satisfies
A e^{(0)} = r^{(0)}.
Solving this system by Gaussian elimination, we obtain an approximate value of e^{(0)}. Using it, we define an improved approximation
x^{(1)} = x^{(0)} + e^{(0)}.
Now we repeat the entire process, calculating
r^{(1)} = b - A x^{(1)},
x^{(2)} = x^{(1)} + e^{(1)},
where e^{(1)} is the approximate solution of
A e^{(1)} = r^{(1)}, \qquad e^{(1)} = x - x^{(1)}.
Continue this process until there is no further decrease in the size of the error vector.
For example, use a computer with four-digit floating-point decimal arithmetic with rounding, and
use Gaussian elimination with pivoting. The system to be solved is
x1 + 0.5x2 + 0.3333x3 = 1
0.5x1 + 0.3333x2 + 0.25x3 = 0
0.3333x1 + 0.25x2 + 0.2x3 = 0
Then
x(0) = [8.968, −35.77, 29.77]t
r(0) = [−0.005341, −0.004359, −0.0005344]t
e(0) = [0.09216, −0.5442, 0.5239]t
x(1) = [9.060, −36.31, 30.29]t
r(1) = [−0.0006570, −0.0003770, −0.0001980]t
e(1) = [0.001707, −0.01300, 0.01241]t
x(2) = [9.062, −36.32, 30.30]t .
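The residual correction loop takes only a few lines of code. Below is a minimal Python sketch (the function name and step count are ours); in practice one would reuse the factorization of A from the first solve, while here, for brevity, a full solve is done at each step:

import numpy as np

def iterative_refinement(A, b, x0, steps=3):
    x = x0.astype(float).copy()
    for _ in range(steps):
        r = b - A @ x               # residual r^(k) = b - A x^(k)
        e = np.linalg.solve(A, r)   # correction from A e^(k) = r^(k)
        x = x + e                   # improved approximation
    return x

A = np.array([[1.0, 0.5, 0.3333],
              [0.5, 0.3333, 0.25],
              [0.3333, 0.25, 0.2]])
b = np.array([1.0, 0.0, 0.0])
x0 = np.array([8.968, -35.77, 29.77])   # the rounded elimination result above
print(iterative_refinement(A, b, x0))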

5. Power method for approximating eigenvalues


The eigenvalues of an n × n matrix A are obtained by solving its characteristic equation
\det(A - \lambda I) = 0,
\lambda^n + c_{n-1} \lambda^{n-1} + c_{n-2} \lambda^{n-2} + \dots + c_0 = 0.
For large values of n, the polynomial equations like this one are difficult, time-consuming to solve
and sensitive to rounding errors. In this section we look at an alternative method known as Power
Method for approximating eigenvalues. The method is an iterative method used to determine the
dominant eigenvalue - that is the eigenvalue with largest magnitude. By modifying the method it can
be used to determine other eigenvalues. One useful feature of power method is that it produces not
only eigenvalue but also associated eigenvector.
To apply the power method, we assume that n × n matrix A has n eigenvalues λ1 , λ2 , · · · , λn (which
we don’t know) with associated eigenvectors v (1) , v (2) , · · · , v (n) . We say matrix A is diagonalizable.
We write
Av (i) = λi v (i) , i = 1, 2, · · · , n.
We assume that these eigenvalues are ordered so that λ1 is the dominant eigenvalue (with correspond-
ing eigenvector v (1) ).
From linear algebra, if A is diagonalizable, then it has n linearly independent eigenvectors v (1) , v (2) , · · · , v (n) .
An n×n matrix need not have n linearly independent eigenvectors. When it does not the Power method
may still be successful, but it is not guaranteed to be.
As the n eigenvectors v (1) , v (2) , · · · , v (n) are linearly independent, they must form a basis for Rn .
We select an arbitrary nonzero starting vector x(0) and express it as a linear combination of basis
vectors as
x^{(0)} = c_1 v^{(1)} + c_2 v^{(2)} + \dots + c_n v^{(n)}.
We assume that c_1 \ne 0. (If c_1 = 0, the power method may not converge, and a different x^{(0)} must be used as an initial approximation.)
Then we repeatedly carry out matrix-vector multiplication, using the matrix A to produce a sequence of vectors. Specifically, we have
x^{(1)} = A x^{(0)},
x^{(2)} = A x^{(1)} = A^2 x^{(0)},
\vdots
x^{(k)} = A x^{(k-1)} = A^k x^{(0)}.
In general, we have x^{(k)} = A^k x^{(0)}, k = 1, 2, 3, \dots. Substituting the value of x^{(0)}, we obtain
x^{(k)} = A^k x^{(0)} = c_1 A^k v^{(1)} + c_2 A^k v^{(2)} + \dots + c_n A^k v^{(n)}
= c_1 \lambda_1^k v^{(1)} + c_2 \lambda_2^k v^{(2)} + \dots + c_n \lambda_n^k v^{(n)}
= \lambda_1^k \left[ c_1 v^{(1)} + c_2 \left( \frac{\lambda_2}{\lambda_1} \right)^k v^{(2)} + \dots + c_n \left( \frac{\lambda_n}{\lambda_1} \right)^k v^{(n)} \right].
Now, from our original assumption that \lambda_1 is larger in absolute value than the other eigenvalues, it follows that each of the ratios
\left| \frac{\lambda_2}{\lambda_1} \right|, \left| \frac{\lambda_3}{\lambda_1} \right|, \dots, \left| \frac{\lambda_n}{\lambda_1} \right| < 1.
Therefore each of the factors
\left( \frac{\lambda_2}{\lambda_1} \right)^k, \left( \frac{\lambda_3}{\lambda_1} \right)^k, \dots, \left( \frac{\lambda_n}{\lambda_1} \right)^k
must approach 0 as k approaches infinity. This implies the approximation
A^k x^{(0)} \approx \lambda_1^k c_1 v^{(1)}, \qquad c_1 \ne 0.
Since v^{(1)} is a dominant eigenvector, it follows that any scalar multiple of v^{(1)} is also a dominant eigenvector. Thus we have shown that A^k x^{(0)} approaches a multiple of the dominant eigenvector of A.
The entries of A^k x^{(0)} may grow with k, therefore we scale the powers of A^k x^{(0)} in an appropriate manner to ensure that the limit is finite and nonzero. The scaling begins by choosing the initial guess x^{(0)} to be a unit vector relative to the maximum norm, that is, \|x^{(0)}\|_\infty = 1. Then we compute y^{(1)} = A x^{(0)}, and the next approximation is taken as
x^{(1)} = \frac{y^{(1)}}{\|y^{(1)}\|_\infty}.
We repeat the procedure and stop using the following stopping criterion:
\frac{\|x^{(k)} - x^{(k-1)}\|_\infty}{\|x^{(k)}\|_\infty} < \varepsilon,
where ε is the desired accuracy.
Example 12. Calculate four iterations of the power method with scaling to approximate a dominant eigenvector of the matrix
A = \begin{pmatrix} 1 & 2 & 0 \\ -2 & 1 & 2 \\ 1 & 3 & 1 \end{pmatrix}.
Sol. Using x^{(0)} = [1, 1, 1]^T as the initial approximation, we obtain
y^{(1)} = A x^{(0)} = \begin{pmatrix} 1 & 2 & 0 \\ -2 & 1 & 2 \\ 1 & 3 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 1 \\ 5 \end{pmatrix},
and by scaling we obtain the approximation
x^{(1)} = \frac{1}{5} \begin{pmatrix} 3 \\ 1 \\ 5 \end{pmatrix} = \begin{pmatrix} 0.60 \\ 0.20 \\ 1.00 \end{pmatrix}.
Similarly we get
y^{(2)} = A x^{(1)} = \begin{pmatrix} 1.00 \\ 1.00 \\ 2.20 \end{pmatrix} = 2.20 \begin{pmatrix} 0.45 \\ 0.45 \\ 1.00 \end{pmatrix} = 2.20 \, x^{(2)},
y^{(3)} = A x^{(2)} = \begin{pmatrix} 1.35 \\ 1.55 \\ 2.80 \end{pmatrix} = 2.8 \begin{pmatrix} 0.48 \\ 0.55 \\ 1.00 \end{pmatrix} = 2.8 \, x^{(3)},
y^{(4)} = A x^{(3)} = 3.1 \begin{pmatrix} 0.51 \\ 0.51 \\ 1.00 \end{pmatrix},
etc.
After four iterations, we observe that the dominant eigenvector is approximately
x = \begin{pmatrix} 0.51 \\ 0.51 \\ 1.00 \end{pmatrix},
and the scaling factors are approaching the dominant eigenvalue; the current estimate is \lambda \approx 3.1 (the exact dominant eigenvalue of this matrix is 3, with eigenvector (0.5, 0.5, 1)^t).
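A minimal Python sketch of the power method with scaling (names ours); applied to the matrix of Example 12 it reproduces the behaviour above, with the scale factors tending to the exact dominant eigenvalue 3:

import numpy as np

def power_method(A, x0, kmax=50, eps=1e-8):
    x = x0 / np.linalg.norm(x0, np.inf)
    lam = 0.0
    for _ in range(kmax):
        y = A @ x
        lam_new = y[np.argmax(np.abs(y))]   # signed scale factor
        x_new = y / lam_new                 # rescale so ||x||_inf = 1
        if np.linalg.norm(x_new - x, np.inf) < eps:
            return lam_new, x_new
        lam, x = lam_new, x_new
    return lam, x

A = np.array([[1.0, 2, 0], [-2, 1, 2], [1, 3, 1]])
lam, v = power_method(A, np.ones(3))
print(lam, v)   # -> 3.0 and the eigenvector (0.5, 0.5, 1)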
Remark 5.1. The power method is useful to compute eigenvalues, but it gives only the dominant one. To find other eigenvalues we use properties of the matrix, such as the fact that the sum of all eigenvalues equals the trace of the matrix. Also, if λ is an eigenvalue of A then λ^{-1} is an eigenvalue of A^{-1}. Hence the smallest (in magnitude) eigenvalue of A corresponds to the dominant eigenvalue of A^{-1}.
5.1. Inverse Power method. The Inverse Power method is a modification of the Power method that is used to determine the eigenvalue of A that is closest to a specified number σ.
We consider A - σI; its eigenvalues are \lambda_1 - \sigma, \lambda_2 - \sigma, \dots, \lambda_n - \sigma, where \lambda_1, \lambda_2, \dots, \lambda_n are the eigenvalues of A. Now the eigenvalues of (A - \sigma I)^{-1} are
\frac{1}{\lambda_1 - \sigma}, \frac{1}{\lambda_2 - \sigma}, \dots, \frac{1}{\lambda_n - \sigma}.
The eigenvalue of the original matrix A that is closest to σ corresponds to the eigenvalue of largest magnitude of the shifted and inverted matrix (A - \sigma I)^{-1}.
To find the eigenvalue closest to σ, we apply the Power method to obtain the eigenvalue μ of (A - \sigma I)^{-1}, and then we recover the eigenvalue λ of the original problem by \lambda = 1/\mu + \sigma. This method is called shift-and-invert. To compute y = (A - \sigma I)^{-1} x we solve the system (A - \sigma I) y = x; we need not compute the inverse of the matrix.

Example 13. Apply the inverse power method with x^{(0)} = [1, 1, 1]^T to the matrix
A = \begin{pmatrix} -4 & 14 & 0 \\ -5 & 13 & 0 \\ -1 & 0 & 2 \end{pmatrix}
with σ = 19/3.
Sol. For the inverse power method, we consider
A - \frac{19}{3} I = \begin{pmatrix} -\frac{31}{3} & 14 & 0 \\ -5 & \frac{20}{3} & 0 \\ -1 & 0 & -\frac{13}{3} \end{pmatrix}.
Starting with x^{(0)} = [1, 1, 1]^T, the step y^{(1)} = (A - \sigma I)^{-1} x^{(0)} amounts to solving (A - \sigma I) y^{(1)} = x^{(0)}, i.e.,
\begin{pmatrix} -\frac{31}{3} & 14 & 0 \\ -5 & \frac{20}{3} & 0 \\ -1 & 0 & -\frac{13}{3} \end{pmatrix} \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}.
Solving the above system by Gauss elimination (LU decomposition), we get a = -6.6, b = -4.8, and c = 1.2923. Therefore y^{(1)} = (-6.6, -4.8, 1.2923)^T. We normalize it by taking -6.6 as the scale factor:
x^{(1)} = \frac{1}{-6.6} y^{(1)} = (1, 0.7273, -0.1958)^T.
Therefore the first approximation of the eigenvalue of A near 19/3 is
\frac{1}{-6.6} + \frac{19}{3} = 6.1818.
Repeating the above procedure, we obtain the eigenvalue (which is 6).
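A minimal Python sketch of the shift-and-invert iteration (names ours); as the notes recommend, it solves the shifted system at each step rather than forming the inverse:

import numpy as np

def inverse_power(A, sigma, x0, kmax=50, eps=1e-6):
    M = A - sigma * np.eye(A.shape[0])    # shifted matrix A - sigma*I
    x = x0 / np.linalg.norm(x0, np.inf)
    lam = sigma
    for _ in range(kmax):
        y = np.linalg.solve(M, x)         # (A - sigma*I) y = x, no explicit inverse
        mu = y[np.argmax(np.abs(y))]      # dominant eigenvalue of (A - sigma*I)^{-1}
        x = y / mu
        lam_new = 1.0 / mu + sigma        # eigenvalue of A closest to sigma
        if abs(lam_new - lam) < eps:
            return lam_new, x
        lam = lam_new
    return lam, x

A = np.array([[-4.0, 14, 0], [-5, 13, 0], [-1, 0, 2]])
lam, v = inverse_power(A, 19/3, np.ones(3))
print(lam)   # -> 6, the eigenvalue of A nearest 19/3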

Important Remark: Although the power method worked well in these examples, we must say something about cases in which the power method may fail. There are basically three such cases:
1. Using the power method when A is not diagonalizable. Recall that A has n linearly independent eigenvectors if and only if A is diagonalizable. Of course, it is not easy to tell by just looking at A whether it is diagonalizable.
2. Using the power method when A does not have a dominant eigenvalue, or when the dominant eigenvalue is such that |\lambda_1| = |\lambda_2|.
3. If the entries of A contain significant error, the powers A^k will have significant roundoff error in their entries.

Exercises
(1) Find the l_\infty and l_2 norms of the vectors.
a. x = (3, -4, 0, \tfrac{2}{3})^t
b. x = (\sin k, \cos k, 2^k)^t for a fixed positive integer k.
(2) Find the l_\infty norm of the matrix:
\begin{pmatrix} 4 & -1 & 7 \\ -1 & 4 & 0 \\ -7 & 0 & 4 \end{pmatrix}.
(3) The following linear system Ax = b has x as the actual solution and \bar{x} as an approximate solution. Compute \|x - \bar{x}\|_\infty and \|A\bar{x} - b\|_\infty. Also compute \|A\|_\infty.
x1 + 2x2 + 3x3 = 1
2x1 + 3x2 + 4x3 = −1
3x1 + 4x2 + 6x3 = 2,
x = (0, −7, 5)t
x̄ = (−0.2, −7.5, 5.4)t .
(4) Find the first two iterations of Jacobi and Gauss-Seidel using x(0) = 0:
4.63x1 − 1.21x2 + 3.22x3 = 2.22
−3.07x1 + 5.48x2 + 2.11x3 = −3.17
1.26x1 + 3.11x2 + 4.57x3 = 5.11.

(5) The linear system
x_1 - x_3 = 0.2
-\tfrac{1}{2} x_1 + x_2 - \tfrac{1}{4} x_3 = -1.425
x_1 - \tfrac{1}{2} x_2 + x_3 = 2
has the solution (0.9, -0.8, 0.7)^T.
a. Is the coefficient matrix strictly diagonally dominant?
b. Compute the spectral radius of the Gauss-Seidel iteration matrix.
c. Perform four iterations of the Gauss-Seidel iterative method to approximate the solution.
d. What happens in part (c) when the first equation in the system is changed to x_1 - 2x_3 = 0.2?
(6) Show that Gauss-Seidel method does not converge for the following system of equations

2x1 + 3x2 + x3 = −1
3x1 + 2x2 + 2x3 = 1
x1 + 2x2 + 2x3 = 1.

(7) Find the first two iterations of the SOR method with ω = 1.1 for the following linear systems,
using x(0) = 0 :

4x1 + x2 − x3 = 5
−x1 + 3x2 + x3 = −4
2x1 + 2x2 + 5x3 = 1.

(8) Compute the condition numbers of the following matrices relative to \|\cdot\|_\infty.
a. \begin{pmatrix} 3.9 & 1.6 \\ 6.8 & 2.9 \end{pmatrix}
b. \begin{pmatrix} 0.04 & 0.01 & -0.01 \\ 0.2 & 0.5 & -0.2 \\ 1 & 2 & 4 \end{pmatrix}.
(9) (i) Use Gaussian elimination and three-digit rounding arithmetic to approximate the solutions
to the following linear systems. (ii) Then use one iteration of iterative refinement to improve
the approximation, and compare the approximations to the actual solutions.
a.

0.03x1 + 58.9x2 = 59.2


5.31x1 − 6.10x2 = 47.0.

Actual solution (10, 1)t .


b.

3.3330x1 + 15920x2 + 10.333x3 = 7953


2.2220x1 + 16.710x2 + 9.6120x3 = 0.965
−1.5611x1 + 5.1792x2 − 1.6855x3 = 2.714.

Actual solution (1, 0.5, −1)t .


(10) The linear system Ax = b given by
\begin{pmatrix} 1 & 2 \\ 1.0001 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 3 \\ 3.0001 \end{pmatrix}
has solution (1, 1)^t. Use four-digit rounding arithmetic to find the solution of the perturbed system
\begin{pmatrix} 1 & 2 \\ 1.000011 & 2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 3.00001 \\ 3.00003 \end{pmatrix}.
Is matrix A ill-conditioned?

(11) Determine the largest eigenvalue and the corresponding eigenvector of the following matrix correct to three decimals using the power method with x^{(0)} = (-1, 2, 1)^t:
\begin{pmatrix} 1 & -1 & 0 \\ -2 & 4 & -2 \\ 0 & -1 & 2 \end{pmatrix}.
(12) Use the inverse power method to approximate the most dominant eigenvalue of the matrix, until a tolerance of 10^{-2} is achieved, with x^{(0)} = (1, -1, 2)^t:
\begin{pmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}.
(13) Find the eigenvalue of the matrix
\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}
nearest to 3, using the inverse power method.

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Willey and Sons, 3rd
edition, 2004.

Appendix A. Algorithms
Algorithm (Gauss-Seidel):
(1) Input matrix A = [a_{ij}], b, XO = x^{(0)}, tolerance TOL, maximum number of iterations N
(2) Set k = 1
(3) While (k ≤ N) do steps 4–7
(4) For i = 1, 2, \dots, n
x_i = \frac{1}{a_{ii}} \left[ -\sum_{j=1}^{i-1} a_{ij} x_j - \sum_{j=i+1}^{n} a_{ij} XO_j + b_i \right]
(5) If ||x − XO|| < TOL, then OUTPUT (x_1, x_2, \dots, x_n)
STOP
(6) k = k + 1
(7) For i = 1, 2, \dots, n
Set XO_i = x_i
(8) OUTPUT (x1 , x2 , · · · , xn )
STOP.
Algorithm (Power Method):
(1) Start
(2) Define matrix A and initial guess x
(3) Calculate y = Ax
(4) Find the largest element in magnitude of matrix y and assign it to K.
(5) Calculate fresh value x = (1/K) ∗ y
(6) If |K(n) − K(n − 1)| > error, go to step 3.
(7) Stop
CHAPTER 5 (8 LECTURES)
POLYNOMIAL INTERPOLATION AND APPROXIMATIONS

1. Introduction
Polynomials are used as the basic means of approximation in nearly all areas of numerical analysis.
They are used in the solution of equations and in the approximation of functions, of integrals and
derivatives, of solutions of integral and differential equations, etc. Polynomials have simple structure,
which makes it easy to construct effective approximations and then make use of them. For this reason,
the representation and evaluation of polynomials is a basic topic in numerical analysis. We discuss this
topic in the present chapter in the context of polynomial interpolation, the simplest and certainly the
most widely used technique for obtaining polynomial approximations.
Definition 1.1 (Polynomial). A polynomial Pn (x) of degree ≤ n is, by definition, a function of the
form
P_n(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n \qquad (1.1)
with certain coefficients a0 , a1 , · · · , an . This polynomial has (exact) degree n in case its leading coeffi-
cient an is nonzero.
The power form (1.1) is the standard way to specify a polynomial in mathematical discussions. It is
a very convenient form for differentiating or integrating a polynomial. But, in various specific contexts,
other forms are more convenient. For example, the following shifted power form may be helpful.
P(x) = a_0 + a_1 (x - c) + a_2 (x - c)^2 + \dots + a_n (x - c)^n. \qquad (1.2)
It is good practice to employ the shifted power form with the center c chosen somewhere in the interval
[a, b] when interested in a polynomial on that interval.
Definition 1.2 (Newton form). A further generalization of the shifted power form is the following
Newton form
P(x) = a_0 + a_1 (x - c_1) + a_2 (x - c_1)(x - c_2) + \dots + a_n (x - c_1)(x - c_2) \cdots (x - c_n).
This form plays a major role in the construction of an interpolating polynomial. It reduces to the
shifted power form if the centers c1 , · · · , cn , all equal c, and to the power form if the centers c1 , · · · , cn ,
all equal zero.

2. Lagrange Interpolation
In this chapter, we consider the interpolation problem. Suppose we do not know the function f , but
a few information (data) about f . Now we try to compute a function g that approximates f .

2.1. Polynomial Interpolation. The polynomial interpolation problem, also called Lagrange inter-
polation, can be described as follows: Given (n + 1) data points (x_i, y_i), i = 0, 1, \dots, n, find a polynomial P of lowest possible degree such that
y_i = P(x_i), \qquad i = 0, 1, \dots, n.
Such a polynomial is said to interpolate the data. Here yi may be the value of some unknown function
f at xi , i.e. yi = f (xi ).
One reason for considering the class of polynomials in approximation of functions is that they uniformly
approximate continuous function.
Theorem 2.1 (Weierstrass Approximation Theorem). Suppose that f is defined and continuous on
[a, b]. For any ε > 0, there exists a polynomial P (x) defined on [a, b] with the property that
|f (x) − P (x)| < ε, ∀x ∈ [a, b].

Another reason for considering the class of polynomials in approximation of functions is that the
derivatives and indefinite integrals of a polynomial are easy to compute.

Theorem 2.2 (Existence and Uniqueness). Given a real-valued function f (x) and n + 1 distinct points
x0 , x1 , · · · , xn , there exists a unique polynomial Pn (x) of degree ≤ n which interpolates the unknown
f (x) at points x0 , x1 , · · · , xn .

Proof. Existence: Let x0 , x1 , · · · , xn be the given n + 1 discrete data points. We will prove the result
by the mathematical induction.
The Theorem clearly holds for n = 0, only one data point is given and we can take constant polynomial
P0 (x) = f (x0 ), ∀x.
Assume that the Theorem holds for n ≤ k, i.e. there is a polynomial Pk with degree ≤ k such that
Pk (xi ) = f (xi ), for 0 ≤ i ≤ k.
Now we try to construct a polynomial of degree at most k + 1 to interpolate (xi , f (xi )), 0 ≤ i ≤ k + 1.
Let
Pk+1 (x) = Pk (x) + c(x − x0 )(x − x1 ) · · · (x − xk ).
For x = xk+1 ,

Pk+1 (xk+1 ) = f (xk+1 ) = Pk (xk+1 ) + c(xk+1 − x0 )(xk+1 − x1 ) · · · (xk+1 − xk )

\implies c = \frac{f(x_{k+1}) - P_k(x_{k+1})}{(x_{k+1} - x_0)(x_{k+1} - x_1) \cdots (x_{k+1} - x_k)}.
Since xi are distinct, the polynomial Pk+1 (x) is well-defined and degree of Pk+1 ≤ k + 1. Now

Pk+1 (xi ) = Pk (xi ) + 0 = Pk (xi ) = f (xi ), 0 ≤ i ≤ k

and
Pk+1 (xk+1 ) = f (xk+1 )
Above two equations implies
Pk+1 (xi ) = f (xi ), 0 ≤ i ≤ k + 1.
Therefore P_{k+1}(x) interpolates f(x) at all k + 2 nodal points. By mathematical induction, the result is true for all n.
Uniqueness: Let there are two such polynomials Pn and Qn such that

Pn (xi ) = f (xi )

Qn (xi ) = f (xi ), 0 ≤ i ≤ n.
Define
Sn (x) = Pn (x) − Qn (x)
Since for both Pn and Qn , degree ≤ n, which implies the degree of Sn is also ≤ n.
Also
Sn (xi ) = Pn (xi ) − Qn (xi ) = f (xi ) − f (xi ) = 0, 0 ≤ i ≤ n.
This implies Sn has at least n + 1 zeros which is not possible as degree of Sn is at most n.
This implies
Sn (x) = 0, ∀x

=⇒ Pn (x) = Qn (x), ∀x.


Therefore interpolating polynomial is unique.

2.2. Linear Interpolation. We determine a polynomial
P(x) = ax + b \qquad (2.1)
where a and b are arbitrary constants satisfying the interpolating conditions f(x_0) = P(x_0) and f(x_1) = P(x_1). We have
f(x_0) = P(x_0) = a x_0 + b,
f(x_1) = P(x_1) = a x_1 + b.
Lagrange interpolation: Solving for a and b, we obtain
a = \frac{f(x_0) - f(x_1)}{x_0 - x_1}, \qquad b = \frac{f(x_0) x_1 - f(x_1) x_0}{x_1 - x_0}.
Substituting these values in equation (2.1), we obtain
P(x) = \frac{f(x_0) - f(x_1)}{x_0 - x_1} \, x + \frac{f(x_0) x_1 - f(x_1) x_0}{x_1 - x_0}
\implies P(x) = \frac{x - x_1}{x_0 - x_1} f(x_0) + \frac{x - x_0}{x_1 - x_0} f(x_1)
\implies P(x) = l_0(x) f(x_0) + l_1(x) f(x_1),
where l_0(x) = \frac{x - x_1}{x_0 - x_1} and l_1(x) = \frac{x - x_0}{x_1 - x_0}.
These functions l_0(x) and l_1(x) are called the Lagrange Fundamental Polynomials and they satisfy the following conditions:
l_0(x) + l_1(x) = 1,
l_0(x_0) = 1, \quad l_0(x_1) = 0,
l_1(x_0) = 0, \quad l_1(x_1) = 1,
\implies l_i(x_j) = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \ne j. \end{cases}
Higher-order Lagrange interpolation: In this section we take a different approach and assume
that the interpolation polynomial is given as a linear combination of n + 1 polynomials of degree n.
This time, we set the coefficients as the interpolated values, \{f(x_i)\}_{i=0}^{n}, while the unknowns are the polynomials. We thus let
P_n(x) = \sum_{i=0}^{n} f(x_i) \, l_i(x),
where li (x) are n + 1 polynomials of degree n. Note that in this particular case, the polynomials li (x)
are precisely of degree n (and not ≤ n). However, Pn (x), given by the above equation may have a
lower degree. In either case, the degree of Pn (x) is n at the most. We now require that Pn (x) satisfies
the interpolation conditions
Pn (xj ) = f (xj ), 0 ≤ j ≤ n.
By substituting x_j for x we have
P_n(x_j) = \sum_{i=0}^{n} f(x_i) \, l_i(x_j), \qquad 0 \le j \le n.
Therefore we may conclude that li (x) must satisfy
li (xj ) = δij , i, j = 0, 1, · · · , n
where \delta_{ij} is the Kronecker delta, defined as
\delta_{ij} = \begin{cases} 1, & i = j \\ 0, & i \ne j. \end{cases}
Each polynomial li (x) has n + 1 unknown coefficients. The conditions given above through delta
provide exactly n + 1 equations that the polynomials li (x) must satisfy and these equations can be

solved in order to determine all li (x)’s. Fortunately there is a shortcut. An obvious way of constructing
polynomials li (x) of degree n that satisfy the condition is the following:
l_i(x) = \frac{(x - x_0)(x - x_1) \cdots (x - x_{i-1})(x - x_{i+1}) \cdots (x - x_n)}{(x_i - x_0)(x_i - x_1) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n)}.
The uniqueness of the interpolating polynomial of degree ≤ n given n + 1 distinct interpolation points
implies that the polynomials li (x) given by above relation are the only polynomials of degree n.
Note that the denominator does not vanish since we assume that all interpolation points are distinct.
We can write the formula for li (x) in a compact form using the product notation.
l_i(x) = \frac{(x - x_0)(x - x_1) \cdots (x - x_{i-1})(x - x_{i+1}) \cdots (x - x_n)}{(x_i - x_0)(x_i - x_1) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_n)} = \frac{W(x)}{(x - x_i) W'(x_i)}, \qquad i = 0, 1, \dots, n,
where
W (x) = (x − x0 ) · · · (x − xi−1 )(x − xi )(x − xi+1 ) · · · (x − xn )
∴ W 0 (xi ) = (xi − x0 ) · · · (xi − xi−1 )(xi − xi+1 ) · · · (xi − xn ).
The Lagrange interpolating polynomial can be written as
P_n(x) = \sum_{i=0}^{n} f(x_i) \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{(x - x_j)}{(x_i - x_j)}.

Example 1. Given the following four data points, find a polynomial in Lagrange form to interpolate the data.
x_i : 0  1  3  5
y_i : 1  2  6  7
Sol. The Lagrange functions are given by
l_0(x) = \frac{(x - 1)(x - 3)(x - 5)}{(0 - 1)(0 - 3)(0 - 5)} = -\frac{1}{15} (x - 1)(x - 3)(x - 5),
l_1(x) = \frac{(x - 0)(x - 3)(x - 5)}{(1 - 0)(1 - 3)(1 - 5)} = \frac{1}{8} x(x - 3)(x - 5),
l_2(x) = \frac{(x - 0)(x - 1)(x - 5)}{(3 - 0)(3 - 1)(3 - 5)} = -\frac{1}{12} x(x - 1)(x - 5),
l_3(x) = \frac{(x - 0)(x - 1)(x - 3)}{(5 - 0)(5 - 1)(5 - 3)} = \frac{1}{40} x(x - 1)(x - 3).
The interpolating polynomial in the Lagrange form is
P3 (x) = l0 (x) + 2l1 (x) + 6l2 (x) + 7l3 (x).
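The Lagrange form can be evaluated directly from the data, without expanding the polynomial. A minimal Python sketch (the function name is ours):

def lagrange_eval(xs, ys, x):
    # Evaluate the Lagrange interpolating polynomial at x.
    total = 0.0
    n = len(xs)
    for i in range(n):
        li = 1.0
        for j in range(n):
            if j != i:
                li *= (x - xs[j]) / (xs[i] - xs[j])   # fundamental polynomial l_i(x)
        total += ys[i] * li
    return total

xs, ys = [0, 1, 3, 5], [1, 2, 6, 7]
print([lagrange_eval(xs, ys, x) for x in xs])   # [1.0, 2.0, 6.0, 7.0], the data reproduced
print(lagrange_eval(xs, ys, 2.0))               # an interpolated value between nodes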

Example 2. Let f(x) = \sqrt{x - x^2} and P_2(x) be the interpolation polynomial on x_0 = 0, x_1 and x_2 = 1. Find the largest value of x_1 in (0, 1) for which f(0.5) - P_2(0.5) = -0.25.
Sol. If f(x) = \sqrt{x - x^2} then our nodes are [x_0, x_1, x_2] = [0, x_1, 1] and f(x_0) = 0, f(x_1) = \sqrt{x_1 - x_1^2}, and f(x_2) = 0. Therefore
l_0(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)} = \frac{(x - x_1)(x - 1)}{x_1},
l_1(x) = \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)} = \frac{x(x - 1)}{x_1 (x_1 - 1)},
l_2(x) = \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)} = \frac{x(x - x_1)}{1 - x_1}.
\therefore P_2(x) = l_0(x) f(x_0) + l_1(x) f(x_1) + l_2(x) f(x_2)
= \frac{(x - x_1)(x - 1)}{x_1} \cdot 0 + \frac{x(x - 1)}{x_1 (x_1 - 1)} \cdot \sqrt{x_1 - x_1^2} + \frac{x(x - x_1)}{1 - x_1} \cdot 0
= -\frac{x(x - 1)}{\sqrt{x_1 (1 - x_1)}}.
If we now consider f(x) - P_2(x), then
f(x) - P_2(x) = \sqrt{x - x^2} + \frac{x(x - 1)}{\sqrt{x_1 (1 - x_1)}}.
Hence f(0.5) - P_2(0.5) = -0.25 implies
\sqrt{0.5 - 0.5^2} + \frac{0.5(0.5 - 1)}{\sqrt{x_1 (1 - x_1)}} = -0.25.
Solving for x_1 gives
x_1^2 - x_1 = -1/9
or
(x_1 - 1/2)^2 = 5/36,
which gives x_1 = \frac{1}{2} - \sqrt{\frac{5}{36}} or x_1 = \frac{1}{2} + \sqrt{\frac{5}{36}}.
The largest of these is therefore
x_1 = \frac{1}{2} + \sqrt{\frac{5}{36}} \approx 0.8727.
2.3. Error Analysis for Polynomial Interpolation. We are given nodes x0 , x1 , · · · , xn , and the
corresponding function values f (x0 ), f (x1 ), · · · , f (xn ), but the we don’t know the expression for the
function. Let Pn (x) be the polynomial of order ≤ n that passes through the n + 1 points (x0 , f (x0 )),
(x1 , f (x1 )),· · · , (xn , f (xn )).
Question: What is the error between f (x) and Pn (x) even we don’t know f (x) in advance?
Definition 2.3 (Truncation error). The polynomial P (x) coincides with f (x) at all nodal points and
may deviates at other points in the interval. This deviation is called the truncation error and we write
En (f ; x) = f (x) − P (x).
Theorem 2.4. Suppose that x_0, x_1, \dots, x_n are distinct numbers in [a, b] and f \in C^{n+1}[a, b]. Let P(x) be the unique polynomial of degree \le n that passes through the n + 1 nodal points. Then for all x \in [a, b] there exists \xi = \xi(x) \in (a, b) such that
E_n(f; x) = f(x) - P(x) = \frac{(x - x_0) \cdots (x - x_n)}{(n + 1)!} f^{(n+1)}(\xi).
Proof. Let x_0, x_1, \dots, x_n be distinct numbers in [a, b] and f \in C^{n+1}[a, b]. Let P(x) be the unique polynomial of degree \le n that passes through the n + 1 discrete points. Since f(x_i) = P(x_i), the error f(x) - P(x) vanishes at every nodal point, so fix x distinct from the nodes. For any t in the domain, define a function g(t), t \in [a, b], by
g(t) = f(t) - P(t) - [f(x) - P(x)] \frac{(t - x_0) \cdots (t - x_n)}{(x - x_0) \cdots (x - x_n)}. \qquad (2.2)
Now g \in C^{n+1}[a, b], since f \in C^{n+1}[a, b] and P is a polynomial. Also g(t) = 0 at the n + 2 points t = x, x_0, x_1, \dots, x_n. Therefore g satisfies the conditions of the generalized Rolle's Theorem, which states that between n + 2 zeros of a function there is at least one zero of its (n + 1)-th derivative. Hence there exists a point \xi \in (a, b), depending on x, such that
g^{(n+1)}(\xi) = 0.
Now differentiate the function g(t) (n + 1) times to obtain
g^{(n+1)}(t) = f^{(n+1)}(t) - P^{(n+1)}(t) - [f(x) - P(x)] \frac{(n + 1)!}{(x - x_0) \cdots (x - x_n)}
= f^{(n+1)}(t) - [f(x) - P(x)] \frac{(n + 1)!}{(x - x_0) \cdots (x - x_n)},
where P^{(n+1)}(t) = 0 since P(x) is a polynomial of degree at most n. Setting g^{(n+1)}(\xi) = 0 and solving for f(x) - P(x), we obtain
f(x) - P(x) = \frac{(x - x_0) \cdots (x - x_n)}{(n + 1)!} f^{(n+1)}(\xi).

Corollary 2.5. If |f^{(n+1)}(\xi)| \le M, then we can obtain a bound on the error:
|f(x) - P(x)| \le \frac{M}{(n + 1)!} \max_{x \in [a, b]} |(x - x_0) \cdots (x - x_n)|.
The next example illustrates how the error formula can be used to prepare a table of data that will
ensure a specified interpolation error within a specified bound.
Example 3. Suppose a table is to be prepared for the function f (x) = ex , for x in [0, 1]. Assume
the number of decimal places to be given per entry is d ≥ 8 and that the difference between adjacent
x−values, the step size, is h. What step size h will ensure that linear interpolation gives an absolute
error of at most 10−6 for all x in [0, 1]?
Sol. Let x_0, x_1, \dots be the numbers at which f is evaluated, x be in [0, 1], and suppose i satisfies x_i \le x \le x_{i+1}. The error in linear interpolation is
|f(x) - P(x)| = \left| \frac{f''(\xi)}{2} (x - x_i)(x - x_{i+1}) \right| = \frac{|f''(\xi)|}{2} |x - x_i| \, |x - x_{i+1}|.
The step size is h, so x_i = ih, x_{i+1} = (i + 1)h, and
|f(x) - P(x)| \le \frac{1}{2} |f''(\xi)| \, |(x - ih)(x - (i + 1)h)|.
Hence
|f(x) - P(x)| \le \frac{1}{2} \max_{\xi \in [0,1]} e^{\xi} \max_{x_i \le x \le x_{i+1}} |(x - ih)(x - (i + 1)h)|
\le \frac{e}{2} \max_{x_i \le x \le x_{i+1}} |(x - ih)(x - (i + 1)h)|.
Consider the function g(x) = (x - ih)(x - (i + 1)h), for ih \le x \le (i + 1)h. Because
g'(x) = (x - (i + 1)h) + (x - ih) = 2\left( x - ih - \frac{h}{2} \right),
the only critical point of g is at x = ih + h/2, with |g(ih + h/2)| = (h/2)^2 = h^2/4. Since g(ih) = 0 and g((i + 1)h) = 0, the maximum value of |g(x)| in [ih, (i + 1)h] must occur at the critical point, which implies that
|f(x) - P(x)| \le \frac{e}{2} \max_{x_i \le x \le x_{i+1}} |g(x)| \le \frac{e}{2} \cdot \frac{h^2}{4} = \frac{e h^2}{8}.
Consequently, to ensure that the error in linear interpolation is bounded by 10^{-6}, it is sufficient for h to be chosen so that
\frac{e h^2}{8} \le 10^{-6}.
This implies that h < 1.72 \times 10^{-3}. Because n = (1 - 0)/h must be an integer, a reasonable choice for the step size is h = 0.001.
Example 4. Determine the step size h that can be used in the tabulation of a function f (x), a ≤ x ≤ b,
at equally spaced nodal points so that the truncation error of the quadratic interpolation is less than ε.

Sol. Let x_{i-1}, x_i, x_{i+1} be three equispaced points with spacing h. The truncation error of the quadratic interpolation is given by
|f(x) - P_2(x)| \le \frac{M}{3!} \max_{a \le x \le b} |(x - x_{i-1})(x - x_i)(x - x_{i+1})|,
where M = \max_{a \le x \le b} |f^{(3)}(x)|.
To simplify the calculation, let
x - x_i = t h,
\therefore x - x_{i-1} = x - (x_i - h) = (t + 1)h \quad \text{and} \quad x - x_{i+1} = x - (x_i + h) = (t - 1)h.
\therefore |(x - x_{i-1})(x - x_i)(x - x_{i+1})| = h^3 |t(t + 1)(t - 1)| = g(t) \; \text{(say)}.
Now g(t) attains its extreme values where
\frac{dg}{dt} = 0,
which gives t = \pm \frac{1}{\sqrt{3}}. At the end points of the interval, g becomes zero. For both values t = \pm \frac{1}{\sqrt{3}}, we obtain
\max_{x_{i-1} \le x \le x_{i+1}} |g(t)| = h^3 \frac{2}{3\sqrt{3}}.
The truncation error requirement
|f(x) - P_2(x)| < \varepsilon
then gives
\frac{h^3}{9\sqrt{3}} M < \varepsilon
\implies h < \left[ \frac{9\sqrt{3}\,\varepsilon}{M} \right]^{1/3}.

3. Neville’s Method
Neville’s method can be applied in the situation that we want to interpolate f (x) at a given point
x = p with increasingly higher order Lagrange interpolation polynomials.
For concreteness, consider three distinct points x_0, x_1, and x_2 at which we can evaluate f(x) exactly: f(x_0), f(x_1), f(x_2). From each of these three points we can construct an order zero (constant) "polynomial" to approximate f(p):
f(p) \approx P_0(p) = f(x_0), \qquad (3.1)
f(p) \approx P_1(p) = f(x_1), \qquad (3.2)
f(p) \approx P_2(p) = f(x_2). \qquad (3.3)
Of course this isn’t a very good approximation so we turn to first order Lagrange polynomials
p − x1 p − x0
f (p) ≈ P0,1 (p) = f (x0 ) + f (x1 )
x0 − x1 x1 − x0
p − x2 p − x1
f (p) ≈ P1,2 (p) = f (x1 ) + f (x2 ).
x1 − x2 x2 − x1
There is also P0,2 , but we won’t concern ourselves with that one.
If we note that f (xi ) = Pi (x), we find
p − x1 p − x0
P0,1 (p) = P0 (p) + P1 (p)
x0 − x1 x1 − x0
(p − x1 )P0 (p) − (p − x0 )P1 (p)
=
x0 − x1
and similarly
(p − x2 )P1 (p) − (p − x1 )P2 (p)
P1,2 (p) =
x1 − x2
In general we want to multiply P_i(x) by (x - x_j) where j \ne i (i.e., x_j is a point that is NOT interpolated by P_i(x)). We take the difference of two such products and divide by the difference between the added points. The result is a polynomial P_{i,i+1} of one degree higher than either of the two used to construct it, and it interpolates all the points of the two constructing polynomials combined. This idea can be extended to construct the second degree polynomial P_{0,1,2}:
P_{0,1,2}(p) = \frac{(p - x_2) P_{0,1}(p) - (p - x_0) P_{1,2}(p)}{x_0 - x_2}.
A little algebra will convince you that
P_{0,1,2}(p) = \frac{(p - x_1)(p - x_2)}{(x_0 - x_1)(x_0 - x_2)} f(x_0) + \frac{(p - x_0)(p - x_2)}{(x_1 - x_0)(x_1 - x_2)} f(x_1) + \frac{(p - x_0)(p - x_1)}{(x_2 - x_0)(x_2 - x_1)} f(x_2),
which is just the second degree Lagrange polynomial interpolating the points x_0, x_1, x_2. This shouldn't surprise you, since this is the unique polynomial of degree \le 2 interpolating these three points.
Example: We are given the function
f(x) = \frac{1}{x}.
We want to approximate the value f(3). First we evaluate the function at the three points:
x_i :    2     2.5   4
f(x_i) : 0.5   0.4   0.25
We can first make three separate zero-order approximations
f(3) \approx P_0(3) = f(x_0) = 0.5,
f(3) \approx P_1(3) = f(x_1) = 0.4,
f(3) \approx P_2(3) = f(x_2) = 0.25.
From these we proceed to construct P_{0,1} and P_{1,2} by using the Neville formula:
f(3) \approx P_{0,1}(3) = \frac{(3 - x_1) P_0(3) - (3 - x_0) P_1(3)}{x_0 - x_1} = 0.3,
f(3) \approx P_{1,2}(3) = \frac{(3 - x_2) P_1(3) - (3 - x_1) P_2(3)}{x_1 - x_2} = 0.35.
So we can add these numbers to our table:
x_i   f(x_i)   P_{i,i+1}
2     0.5
2.5   0.4      0.3
4     0.25     0.35
Finally we can compute P_{0,1,2} using P_{0,1} and P_{1,2}:
f(3) \approx P_{0,1,2}(3) = \frac{(3 - x_2) P_{0,1}(3) - (3 - x_0) P_{1,2}(3)}{x_0 - x_2} = 0.325.
x_i   f(x_i)   P_{i,i+1}   P_{i,i+1,i+2}
2     0.5
2.5   0.4      0.3
4     0.25     0.35        0.325
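Neville's tableau is computed naturally column by column. A minimal Python sketch (names ours) that reproduces the table above:

def neville(xs, ys, p):
    # Q[i][j] holds P_{i-j,...,i}(p); column 0 is the data itself.
    n = len(xs)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][0] = ys[i]
    for j in range(1, n):
        for i in range(j, n):
            Q[i][j] = ((p - xs[i-j]) * Q[i][j-1]
                       - (p - xs[i]) * Q[i-1][j-1]) / (xs[i] - xs[i-j])
    return Q

xs, ys = [2.0, 2.5, 4.0], [0.5, 0.4, 0.25]
Q = neville(xs, ys, 3.0)
print(Q[1][1], Q[2][1])   # 0.3, 0.35
print(Q[2][2])            # 0.325 (the exact value 1/3 is 0.333...)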
Example 5. Neville’s method is used to approximate f (0.4) as follows. Complete the table.
xi Pi (0.4) Pi,i+1 (0.4) Pi,i+1,i+2 Pi,i+1,i+2,i+3
0 1
0.25 2 P0,1 (0.4)=2.6
0.5 P2 P1,2 (0.4) P0,1,2 (0.4)
0.75 8 P2,3 (0.4) = 2.4 P1,2,3 (0.4) = 2.96 P0,1,2,3 (0.4) = 3.016

Sol.
P_{2,3}(0.4) = \frac{(0.4 - 0.75) P_2 - (0.4 - 0.5) P_3}{0.5 - 0.75} = 2.4
\implies P_2 = 4.
P_{1,2}(0.4) = \frac{(0.4 - 0.5) P_1 - (0.4 - 0.25) P_2}{0.25 - 0.5} = \frac{(-0.1)(2) - (0.15)(4)}{-0.25} = 3.2.
P_{0,1,2}(0.4) = \frac{(0.4 - 0.5) P_{0,1}(0.4) - (0.4 - 0) P_{1,2}(0.4)}{0 - 0.5} = \frac{(-0.1)(2.6) - (0.4)(3.2)}{-0.5} = 3.08.
Example 6. In Neville’s method, suppose xi = i, for i = 0, 1, 2, 3 and it is known that P0,1 (x) =
x + 1, P1,2 (x) = 3x − 1, and P1,2,3 (1.5) = 4. Find P2,3 (1.5) and P0,1,2,3 (1.5).
Sol. Here x0 = 0, x1 = 1, x2 = 2, x3 = 3.
(x − x2 )P0,1 (x) − (x − x0 )P1,2 (x) (x − 2)(x + 1) − x(3x + 1)
P0,1,2 (x) = = = x2 + 1.
x0 − x2 −2

(1.5 − x1 )P2,3 (1.5) − (1.5 − x3 )P1,2 (1.5)


P1,2,3 (1.5) = =4
x3 − x1
=⇒ P2,3 (1.5) = 5.5
Now
P0,1,2 (1.5) = 3.25
(1.5 − 3)P0,1,2 (1.5) − (1.5 − 0)P1,2,3 (1.5)
P0,1,2,3 (1.5) =
0−3
= 3.625.

4. Newton’s divided difference interpolation


Suppose that Pn (x) is the n-th order Lagrange polynomial that agrees with the function f at the
distinct numbers x0 , x1 , · · · , xn . Although this polynomial is unique, there are alternate algebraic
representations that are useful in certain situations. The divided differences of f with respect to
x0 , x1 , · · · , xn are used to express Pn (x) in the form
Pn (x) = a0 + a1 (x − x0 ) + a2 (x − x0 )(x − x1 ) + · · · + an (x − x0 ) · · · (x − xn−1 ), (4.1)
for appropriate constants a0 , a1 , · · · , an .
Now we determine the first of these constants a0 . For this we substitute x = x0 in Pn (x) and we obtain
a0 = Pn (x0 ) = f (x0 ).
Similarly, when Pn (x) is evaluated at x1 , the only nonzero terms in the evaluation of Pn (x1 ) are the
constant and linear terms,
f (x0 ) + a1 (x1 − x0 ) = Pn (x1 ) = f (x1 ),
so
a_1 = \frac{f(x_1) - f(x_0)}{x_1 - x_0} = f[x_0, x_1].
The ratio f[x_0, x_1] = \frac{f(x_1) - f(x_0)}{x_1 - x_0} is called the first divided difference of f(x), and in general
f[x_i, x_{i+1}] = \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}.
The remaining divided differences are defined recursively. The second divided difference of three points x_i, x_{i+1}, x_{i+2} is defined as
f[x_i, x_{i+1}, x_{i+2}] = \frac{f[x_{i+1}, x_{i+2}] - f[x_i, x_{i+1}]}{x_{i+2} - x_i}.

Now if we substitute x = x_2 and the values of a_0 and a_1 in Eq. (4.1), we obtain
P(x_2) = f(x_2) = f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0}(x_2 - x_0) + a_2 (x_2 - x_0)(x_2 - x_1)
\implies a_2 = \frac{f(x_0)}{(x_0 - x_1)(x_0 - x_2)} + \frac{f(x_1)}{(x_1 - x_0)(x_1 - x_2)} + \frac{f(x_2)}{(x_2 - x_0)(x_2 - x_1)}
= \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = f[x_0, x_1, x_2].
The process ends with the single n-th divided difference,
a_n = f[x_0, x_1, \dots, x_n] = \frac{f[x_1, x_2, \dots, x_n] - f[x_0, x_1, \dots, x_{n-1}]}{x_n - x_0} = \sum_{i=0}^{n} \frac{f(x_i)}{\prod_{\substack{j=0 \\ j \ne i}}^{n} (x_i - x_j)}.

We can write the Newton divided difference formula in the following fashion (and we will prove it in the next theorem):
P_n(x) = f(x_0) + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + \dots + f[x_0, x_1, \dots, x_n](x - x_0)(x - x_1) \cdots (x - x_{n-1})
= f(x_0) + \sum_{i=1}^{n} f[x_0, x_1, \dots, x_i] \prod_{j=0}^{i-1} (x - x_j).
We can also construct the Newton interpolating polynomial as given in the next result.
Theorem 4.1. The unique polynomial of degree ≤ n that passes through (x0 , y0 ), (x1 , y1 ), · · · , (xn , yn )
is given by
Pn (x) = f [x0 ] + f [x0 , x1 ](x − x0 ) + f [x0 , x1 , x2 ](x − x0 )(x − x1 ) + · · · +
f [x0 , · · · , xn ](x − x0 )(x − x1 ) · · · (x − xn−1 )
Proof. We prove it by induction. The unique polynomial of degree 0 that passes through (x0 , y0 ) is
obviously
P0 (x) = y0 = f [x0 ].
Suppose that the polynomial Pk (x) of order ≤ k that passes through (x0 , y0 ), (x1 , y1 ), · · · , (xk , yk ) is
Pk (x) = f [x0 ] + f [x0 , x1 ](x − x0 ) + f [x0 , x1 , x2 ](x − x0 )(x − x1 ) + · · · +
f [x0 , · · · , xk ](x − x0 )(x − x1 ) · · · (x − xk−1 )
Write P_{k+1}(x), the unique polynomial of degree \le k + 1 that passes through (x_0, y_0), (x_1, y_1), \dots, (x_k, y_k), (x_{k+1}, y_{k+1}), as
P_{k+1}(x) = f[x_0] + f[x_0, x_1](x - x_0) + f[x_0, x_1, x_2](x - x_0)(x - x_1) + \dots +
f[x_0, \dots, x_k](x - x_0)(x - x_1) \cdots (x - x_{k-1}) + C(x - x_0)(x - x_1) \cdots (x - x_{k-1})(x - x_k).
We only need to show that
C = f [x0 , x1 , · · · , xk , xk+1 ].
For this, let Q_k(x) be the unique polynomial of degree \le k that passes through (x_1, y_1), \dots, (x_k, y_k), (x_{k+1}, y_{k+1}). Define
R(x) = P_k(x) + \frac{x - x_0}{x_{k+1} - x_0} [Q_k(x) - P_k(x)].
Then:
• R(x) is a polynomial of degree \le k + 1.
• R(x_0) = P_k(x_0) = y_0,
R(x_i) = P_k(x_i) + \frac{x_i - x_0}{x_{k+1} - x_0} (Q_k(x_i) - P_k(x_i)) = P_k(x_i) = y_i, \quad i = 1, \dots, k,
R(x_{k+1}) = Q_k(x_{k+1}) = y_{k+1}.

By the uniqueness, R(x) = P_{k+1}(x). The leading coefficient of P_{k+1}(x) is C. The leading coefficient of R(x) is the leading coefficient of \frac{x - x_0}{x_{k+1} - x_0} [Q_k(x) - P_k(x)], which is
\frac{1}{x_{k+1} - x_0} \left( \text{leading coefficient of } Q_k(x) - \text{leading coefficient of } P_k(x) \right).
On the other hand, the leading coefficient of Q_k(x) is f[x_1, \dots, x_{k+1}], and the leading coefficient of P_k(x) is f[x_0, \dots, x_k]. Therefore
C = \frac{f[x_1, \dots, x_{k+1}] - f[x_0, \dots, x_k]}{x_{k+1} - x_0} = f[x_0, x_1, \dots, x_{k+1}].

The generation of the divided differences is organized in a divided difference table, as illustrated in the following example.

Example 7. Given the following four data points, find a polynomial in Newton form to interpolate the data (the same exercise was done by Lagrange interpolation).
x_i : 0  1  3  5
y_i : 1  2  6  7
Sol. To write the Newton form, we draw the divided difference table as follows:
x_i   y_i   first d.d.   second d.d.   third d.d.
0     1     1            1/3           -17/120
1     2     2            -3/8
3     6     1/2
5     7
P_3(x) = f(x_0) + (x - 0) f[0, 1] + (x - 0)(x - 1) f[0, 1, 3] + (x - 0)(x - 1)(x - 3) f[0, 1, 3, 5]
= 1 + x + \frac{1}{3} x(x - 1) - \frac{17}{120} x(x - 1)(x - 3).
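Both the divided difference table and the evaluation of the Newton form are short loops. A minimal Python sketch (names ours); newton_eval uses nested multiplication, the usual way of evaluating the Newton form:

def divided_differences(xs, ys):
    # Returns the coefficients f[x0], f[x0,x1], ..., f[x0,...,xn].
    n = len(xs)
    coef = [float(y) for y in ys]
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            coef[i] = (coef[i] - coef[i-1]) / (xs[i] - xs[i-j])
    return coef

def newton_eval(xs, coef, x):
    # Nested (Horner-like) evaluation of the Newton form.
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result

xs, ys = [0, 1, 3, 5], [1, 2, 6, 7]
coef = divided_differences(xs, ys)
print(coef)                                     # [1.0, 1.0, 0.333..., -0.1416...] = [1, 1, 1/3, -17/120]
print([newton_eval(xs, coef, x) for x in xs])   # [1.0, 2.0, 6.0, 7.0]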
Note that the x_i can be re-ordered but must be distinct. When the order of the x_i is changed, one obtains the same polynomial, written in a different form.
Theorem 4.2. Let f \in C^n[a, b] and let x_0, \dots, x_n be distinct numbers in [a, b]. Then there exists \xi \in (a, b) such that
f[x_0, x_1, x_2, \dots, x_n] = \frac{f^{(n)}(\xi)}{n!}.
Proof. Let
P_n(x) = f(x_0) + \sum_{k=1}^{n} f[x_0, x_1, \dots, x_k](x - x_0)(x - x_1) \cdots (x - x_{k-1})
be the interpolating polynomial of f in Newton's form. Define
g(x) = f(x) - P_n(x).
Since P_n(x_i) = f(x_i) for i = 0, 1, \dots, n, the function g has n + 1 distinct zeros in [a, b]. By the generalized Rolle's Theorem there exists \xi \in (a, b) such that
g^{(n)}(\xi) = f^{(n)}(\xi) - P_n^{(n)}(\xi) = 0.
Here
P_n^{(n)}(x) = n! \, f[x_0, x_1, \dots, x_n].
Therefore
f[x_0, x_1, \dots, x_n] = \frac{f^{(n)}(\xi)}{n!}.
Example 8. Let f(x) = x^n for some integer n \ge 0. Let x_0, x_1, \dots, x_m be m + 1 distinct numbers. What is f[x_0, x_1, \dots, x_m] for m = n? For m > n?
Sol. Since we can write
f[x_0, x_1, \dots, x_m] = \frac{f^{(m)}(\xi)}{m!},
\therefore f[x_0, x_1, \dots, x_n] = \frac{n!}{n!} = 1.
If m > n, then f^{(m)}(x) = 0 as f(x) is a monomial of degree n; thus f[x_0, x_1, \dots, x_m] = 0.
4.1. Newton interpolation for equally spaced points. Newton's divided-difference formula takes a simpler form when the nodes are arranged consecutively with equal spacing. Let the n + 1 points x0, x1, · · ·, xn be arranged consecutively with equal spacing
h = (xn − x0)/n = xi+1 − xi, i = 0, 1, · · ·, n − 1.
Then each xi = x0 + ih, i = 0, 1, · · ·, n. For any x ∈ [a, b], we can write x = x0 + sh, s ∈ R; then x − xi = (s − i)h.
Now the Newton interpolating polynomial is given by
Pn(x) = f(x0) + Σ_{k=1}^{n} f[x0, x1, · · ·, xk] (x − x0) · · · (x − xk−1)
      = f(x0) + Σ_{k=1}^{n} f[x0, x1, · · ·, xk] (s − 0)h (s − 1)h · · · (s − k + 1)h
      = f(x0) + Σ_{k=1}^{n} f[x0, x1, · · ·, xk] s(s − 1) · · · (s − k + 1) h^k
      = f(x0) + Σ_{k=1}^{n} f[x0, x1, · · ·, xk] k! h^k C(s, k),
where the (generalized) binomial coefficient is
C(s, k) = s(s − 1) · · · (s − k + 1)/k!.
Now we introduce the forward difference operator
∆f(xi) = f(xi+1) − f(xi),
∆^k f(xi) = ∆^{k−1}[∆f(xi)] = ∆^{k−1}[f(xi+1) − f(xi)], i = 0, 1, · · ·, n − 1.
Using the ∆ notation, we can write
f[x0, x1] = (f(x1) − f(x0))/(x1 − x0) = (1/h) ∆f(x0),
f[x0, x1, x2] = (f[x1, x2] − f[x0, x1])/(x2 − x0) = ((1/h)∆f(x1) − (1/h)∆f(x0))/(2h) = (1/(2! h^2)) ∆^2 f(x0).
In general,
f[x0, x1, · · ·, xk] = (1/(k! h^k)) ∆^k f(x0).
Therefore
Pn(x) = Pn(x0 + sh) = f(x0) + Σ_{k=1}^{n} C(s, k) ∆^k f(x0).
This is Newton's forward difference interpolation formula.
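To make the procedure concrete, here is a short Python sketch (ours, not part of the original notes; the function name is our own) that builds the forward difference table and evaluates Newton's forward formula:

import numpy as np

def newton_forward(xs, ys, x):
    """Evaluate Newton's forward difference polynomial at x.
    xs must be equally spaced; ys are the corresponding f-values."""
    n = len(xs)
    h = xs[1] - xs[0]
    # Forward difference table: column k holds the k-th differences
    table = np.zeros((n, n))
    table[:, 0] = ys
    for k in range(1, n):
        table[:n - k, k] = table[1:n - k + 1, k - 1] - table[:n - k, k - 1]
    s = (x - xs[0]) / h
    result, coeff = table[0, 0], 1.0
    for k in range(1, n):
        coeff *= (s - (k - 1)) / k      # builds C(s, k) incrementally
        result += coeff * table[0, k]
    return result

# Data of Example 9 below: tan x near x = 0.71
xs = [0.70, 0.72, 0.74, 0.76, 0.78]
ys = [0.84229, 0.87707, 0.91309, 0.95045, 0.98926]
print(newton_forward(xs, ys, 0.71))   # ≈ 0.85953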
If the interpolation nodes are arranged in reverse order as xn, xn−1, · · ·, x0, a similar formula for the interpolating polynomial is obtained. In this case Newton's divided difference formula can be written as
Pn(x) = f(xn) + Σ_{k=1}^{n} f[xn, xn−1, · · ·, xn−k] (x − xn) · · · (x − xn−k+1).
If the nodes are equally spaced with spacing
h = (xn − x0)/n, xi = xn − (n − i)h, i = n, n − 1, · · ·, 0,
let x = xn + sh, so that x − xn−i = (s + i)h. Therefore
Pn(x) = f(xn) + Σ_{k=1}^{n} f[xn, xn−1, · · ·, xn−k] (x − xn) · · · (x − xn−k+1)
      = f(xn) + Σ_{k=1}^{n} f[xn, xn−1, · · ·, xn−k] (s)h (s + 1)h · · · (s + k − 1)h
      = f(xn) + Σ_{k=1}^{n} f[xn, xn−1, · · ·, xn−k] (−1)^k C(−s, k) h^k k!,
where the binomial coefficient is extended to all real values of s:
C(−s, k) = (−s)(−s − 1) · · · (−s − k + 1)/k! = (−1)^k s(s + 1) · · · (s + k − 1)/k!.
Like the forward difference operator, we introduce the backward difference operator, denoted by the symbol ∇ (nabla):
∇f(xi) = f(xi) − f(xi−1),
∇^k f(xi) = ∇^{k−1}[∇f(xi)] = ∇^{k−1}[f(xi) − f(xi−1)].
Then
f[xn, xn−1] = (1/h) ∇f(xn),
f[xn, xn−1, xn−2] = (f[xn, xn−1] − f[xn−1, xn−2])/(xn − xn−2) = ((1/h)∇f(xn) − (1/h)∇f(xn−1))/(2h) = (1/(2! h^2)) ∇^2 f(xn).
In general,
f[xn, xn−1, xn−2, · · ·, xn−k] = (1/(k! h^k)) ∇^k f(xn).
Therefore, using the backward difference operator, the divided-difference formula can be written as
Pn(x) = f(xn) + Σ_{k=1}^{n} (−1)^k C(−s, k) ∇^k f(xn).
This is Newton's backward difference interpolation formula.
Example 9. Using the following table for tan x, approximate its value at x = 0.71 using Newton interpolation.

xi     | 0.70    0.72    0.74    0.76    0.78
tan xi | 0.84229 0.87707 0.91309 0.95045 0.98926
Sol. As the point x = 0.71 lies near the beginning of the table, we use Newton's forward interpolation. The forward difference table is:

xi    f(xi)    ∆f(xi)   ∆^2 f(xi)  ∆^3 f(xi)  ∆^4 f(xi)
0.70  0.84229  0.03478  0.00124    0.00010    0.00001
0.72  0.87707  0.03602  0.00134    0.00011
0.74  0.91309  0.03736  0.00145
0.76  0.95045  0.03881
0.78  0.98926

Here x0 = 0.70, h = 0.02, and x = 0.71 = x0 + sh gives s = 0.5. The Newton forward difference polynomial is given by
P4(x) = f(x0) + s∆f(x0) + [s(s − 1)/2!] ∆^2 f(x0) + [s(s − 1)(s − 2)/3!] ∆^3 f(x0) + [s(s − 1)(s − 2)(s − 3)/4!] ∆^4 f(x0).
Substituting the values from the table (the first entry of each column), we obtain
P4(0.71) ≈ tan(0.71) ≈ 0.8596.
Example 10. Show that the cubic polynomials
P(x) = 3 − 2(x + 1) + 0(x + 1)(x) + (x + 1)(x)(x − 1)
and
Q(x) = −1 + 4(x + 2) − 3(x + 2)(x + 1) + (x + 2)(x + 1)(x)
both interpolate the following data. Why does this not violate the uniqueness property of interpolating polynomials?

x    | −2 −1  0  1  2
f(x) | −1  3  1 −1  3

Sol. The divided difference table is:

x    f(x)  first d.d.  second d.d.  third d.d.  fourth d.d.
−2   −1    4           −3           1           0
−1    3    −2           0           1
 0    1    −2           3
 1   −1     4
 2    3
In the formulation of P(x), the second node (−1) is taken as the starting point, while in the formulation of Q(x) the first node (−2) is taken as the starting point.
Alternatively (without drawing the table), P(−2) = Q(−2) = −1, P(−1) = Q(−1) = 3, P(0) = Q(0) = 1, P(1) = Q(1) = −1, and P(2) = Q(2) = 3, so both cubics interpolate the given data. The interpolating polynomial is unique, but its written form is not: if P(x) and Q(x) are expanded, they are identical.
5. Curve Fitting : Principles of Least Squares

Least squares, also called “regression analysis”, is one of the most commonly used methods in numerical computation. Essentially it is a technique for solving a set of equations in which there are more equations than unknowns, i.e. an overdetermined system. Least squares is a computational procedure for fitting an equation to a set of experimental data points. The criterion of “best” fit is that the sum of the squares of the differences between the observed data points (xi, yi) and the values calculated from the fitting equation is minimum. The goal is to find the parameter values of the model which best fit the data. The least squares method finds its optimum when the sum E of squared residuals
E = Σ_{i=1}^{n} ei^2
is a minimum. A residual is defined as the difference between the actual value of the dependent variable and the value predicted by the model:
ei = yi − f(xi).
Least squares fit of a straight line: Suppose that we are given a data set (x1, y1), (x2, y2), · · ·, (xn, yn) of observations from an experiment, and we wish to fit a straight line of the form f(x) = a + bx to these data. The residuals are
ei = yi − (a + bxi).
Note that ei is a function of the parameters a and b. We need to find a and b such that
E = Σ_{i=1}^{n} ei^2
is minimum. The necessary conditions for a minimum are
∂E/∂a = 0, ∂E/∂b = 0.
These conditions yield
∂E/∂a = Σ_{i=1}^{n} (−2)[yi − (a + bxi)] = 0
=⇒ Σ_{i=1}^{n} yi = na + b Σ_{i=1}^{n} xi,    (5.1)
∂E/∂b = Σ_{i=1}^{n} (−2xi)[yi − (a + bxi)] = 0
=⇒ Σ_{i=1}^{n} xi yi = a Σ_{i=1}^{n} xi + b Σ_{i=1}^{n} xi^2.    (5.2)
Equations (5.1)–(5.2) are called the normal equations; they are solved to obtain the desired values of a and b.
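As an illustration (our sketch, not part of the original notes), the following Python code forms and solves the normal equations (5.1)–(5.2) for a straight-line fit; it reproduces Example 11 below:

import numpy as np

def fit_line(x, y):
    """Least squares line y = a + b*x via the normal equations (5.1)-(5.2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    A = np.array([[n, x.sum()], [x.sum(), (x**2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])
    a, b = np.linalg.solve(A, rhs)
    return a, b

x = [0.2, 0.4, 0.6, 0.8, 1.0]
y = [0.447, 0.632, 0.775, 0.894, 1.0]
print(fit_line(x, y))   # a ≈ 0.3392, b ≈ 0.684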
Example 11. Obtain the least squares straight-line fit to the following data.

x    | 0.2   0.4   0.6   0.8   1
f(x) | 0.447 0.632 0.775 0.894 1
Sol. The normal equations for fitting a straight line y = a + bx are
Σ_{i=1}^{5} f(xi) = 5a + b Σ_{i=1}^{5} xi,
Σ_{i=1}^{5} xi f(xi) = a Σ_{i=1}^{5} xi + b Σ_{i=1}^{5} xi^2.
From the data, we have Σ xi = 3, Σ f(xi) = 3.748, Σ xi^2 = 2.2, and Σ xi f(xi) = 2.5224. Therefore
5a + 3b = 3.748, 3a + 2.2b = 2.5224.
The solution of this system is a = 0.3392 and b = 0.684, so the required approximation is y = 0.3392 + 0.684x. The least squares error is
E = Σ_{i=1}^{5} [f(xi) − (0.3392 + 0.684 xi)]^2 = 0.00245.
Example 12. Find the least squares approximation of second degree for the discrete data

x    | −2 −1 0 1  2
f(x) | 15  1 1 3 19
Sol. We fit a second degree polynomial y = a + bx + cx^2. By the principle of least squares, we minimize
E = Σ_{i=1}^{5} [yi − (a + bxi + cxi^2)]^2.
The necessary conditions for a minimum are
∂E/∂a = 0, ∂E/∂b = 0, ∂E/∂c = 0.
The normal equations for fitting a second degree polynomial are
Σ f(xi) = 5a + b Σ xi + c Σ xi^2,
Σ xi f(xi) = a Σ xi + b Σ xi^2 + c Σ xi^3,
Σ xi^2 f(xi) = a Σ xi^2 + b Σ xi^3 + c Σ xi^4.
We have Σ xi = 0, Σ f(xi) = 39, Σ xi^2 = 10, Σ xi^3 = 0, Σ xi^4 = 34, Σ xi f(xi) = 10, and Σ xi^2 f(xi) = 140. From the given data,
5a + 10c = 39,
10b = 10,
10a + 34c = 140.
The solution of this system is a = −37/35, b = 1, and c = 31/7. The required approximation is
y = (1/35)(−37 + 35x + 155x^2).
Example 13. Use the method of least squares to fit the curve f(x) = c0 x + c1/√x to the following data. Also find the least squares error.

x    | 0.2 0.3 0.5 1 2
f(x) | 16  14  11  6 3
Sol. By the principle of least squares, we minimize the error
E(c0, c1) = Σ_{i=1}^{5} [f(xi) − c0 xi − c1/√xi]^2.
We obtain the normal equations
c0 Σ xi^2 + c1 Σ √xi = Σ xi f(xi),
c0 Σ √xi + c1 Σ (1/xi) = Σ f(xi)/√xi.
We have
Σ √xi = 4.1163, Σ (1/xi) = 11.8333, Σ xi^2 = 5.38,
Σ xi f(xi) = 24.9, Σ f(xi)/√xi = 85.0151.
The normal equations are therefore
5.38 c0 + 4.1163 c1 = 24.9,
4.1163 c0 + 11.8333 c1 = 85.0151,
whose solution is c0 = −1.1836, c1 = 7.5961. Therefore, the least squares fit is
f(x) = 7.5961/√x − 1.1836x.
The least squares error is
E = Σ_{i=1}^{5} [f(xi) − 7.5961/√xi + 1.1836 xi]^2 = 1.6887.
Example 14. Obtain the least squares fit of the form y = ab^x to the following data.

x    | 1   2   3   4   5   6   7   8
f(x) | 1.0 1.2 1.8 2.5 3.6 4.7 6.6 9.1

Sol. Taking logarithms, the curve y = ab^x takes the linear form Y = A + Bx, where Y = log y, A = log a, and B = log b. Hence the normal equations are
Σ_{i=1}^{8} Yi = 8A + B Σ_{i=1}^{8} xi,
Σ_{i=1}^{8} xi Yi = A Σ_{i=1}^{8} xi + B Σ_{i=1}^{8} xi^2.
From the data, we form the following table.
x   y    Y = log y   xY      x^2
1   1.0  0.0000      0.0000   1
2   1.2  0.0792      0.1584   4
3   1.8  0.2553      0.7659   9
4   2.5  0.3979      1.5916  16
5   3.6  0.5563      2.7815  25
6   4.7  0.6721      4.0326  36
7   6.6  0.8195      5.7365  49
8   9.1  0.9590      7.6720  64
Σ   36   30.5        3.7393  22.7385  204

Putting in the values, we obtain
8A + 36B = 3.7393, 36A + 204B = 22.7385
=⇒ A = −0.1660, B = 0.1407
=⇒ a = 0.68, b = 1.38.
The required curve is y = (0.68)(1.38)^x.
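A quick computational check (ours, not from the notes) of the log-linearization just performed:

import numpy as np

# Fit y = a * b**x by least squares on Y = log10(y) = A + B*x.
x = np.arange(1, 9, dtype=float)
y = np.array([1.0, 1.2, 1.8, 2.5, 3.6, 4.7, 6.6, 9.1])
Y = np.log10(y)
# Normal equations for the linearized model
n = len(x)
A_mat = np.array([[n, x.sum()], [x.sum(), (x**2).sum()]])
A, B = np.linalg.solve(A_mat, np.array([Y.sum(), (x * Y).sum()]))
a, b = 10**A, 10**B
print(a, b)   # roughly a = 0.68, b = 1.38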
Remark 5.1. If the data values are large, we can make them small by shifting the origin and scaling appropriately.
Example 15. Show that the line of best fit to the following data is y = 0.7x + 11.28.

x | 0  5  10 15 20 25
y | 12 15 17 22 24 30

Sol. Here n = 6. We fit a line of the form y = A + Bx. Let u = (x − 15)/5, v = y − 20, and fit a line of the form v = a + bu.
x y u v uv u2
0 12 −3 −8 24 9
5 15 −2 −5 10 4
10 17 −1 −3 3 1
15 22 0 2 0 0
20 24 1 4 4 1
25 30 2 10 20 4
Σ −3 0 61 19
The normal equations are,
0 = 6a − 3b
61 = −3a + 19b.
Solving these, a = 1.7428 and b = 3.4857. Therefore the equation of the line is v = 1.7428 + 3.4857u. Changing back to the original variables, we obtain
y − 20 = 1.7428 + 3.4857 (x − 15)/5
=⇒ y = 11.2857 + 0.6971x.
Exercises
(1) Find the unique polynomial P (x) of degree 2 or less such that
P (1) = 1, P (3) = 27, P (4) = 64
using Lagrange interpolation. Evaluate P (1.05).
(2) For the given functions f(x), let x0 = 1, x1 = 1.25, and x2 = 1.6. Construct Lagrange interpolation polynomials of degree at most one and at most two to approximate f(1.4), and find the absolute error.
a. f(x) = sin πx  b. f(x) = ∛(x − 1)  c. f(x) = log10(3x − 1)  d. f(x) = e^{2x} − x.
(3) Let P3 (x) be the Lagrange interpolating polynomial for the data (0, 0), (0.5, y), (1, 3) and (2, 2).
Find y if the coefficient of x3 in P3 (x) is 6.
(4) Let f (x) = ln(1 + x), x0 = 1, x1 = 1.1. Use Lagrange linear interpolation to find the
approximate value of f (1.04) and obtain a bound on the truncation error.
(5) Construct the Lagrange interpolating polynomials for the following functions, and find a bound
for the absolute error on the interval [x0 , xn ].
a. f (x) = e2x cos 3x, x0 = 0, x1 = 0.3, x2 = 0.6, n = 2.
b. f (x) = sin(ln x), x0 = 2.0, x1 = 2.4, x2 = 2.6, n = 2.
c. f (x) = cos x + sin x, x0 = 0, x1 = 0.25, x2 = 0.5, x3 = 1.0, n = 3.
(6) Use the following values and four-digit rounding arithmetic to construct a third degree Lagrange
polynomial approximation to f (1.09). The function being approximated is f (x) = log10 (tan x).
Use this knowledge to find a bound for the error in the approximation.
f (1.00) = 0.1924, f (1.05) = 0.2414, f (1.10) = 0.2933, f (1.15) = 0.3492.
(7) Use the Lagrange interpolating polynomial of degree three or less and four-digit chopping
arithmetic to approximate cos 0.750 using the following values. Find an error bound for the
approximation.
cos 0.698 = 0.7661, cos 0.733 = 0.7432, cos 0.768 = 0.7193, cos 0.803 = 0.6946.
The actual value of cos 0.750 is 0.7317 (to four decimal places). Explain the discrepancy between the actual error and the error bound.
(8) Determine the spacing h in a table of equally spaced values of the function f(x) = √x between 1 and 2, so that interpolation with a quadratic polynomial will yield an accuracy of 5 × 10^{−8}.
(9) It is suspected that the high amounts of tannin in mature oak leaves inhibit the growth of the
winter moth larvae that extensively damage these trees in certain years. The following table
lists the average weight of two samples of larvae at times in the first 28 days after birth. The
first sample was reared on young oak leaves, whereas the second sample was reared on mature
leaves from the same tree.
a. Use Lagrange interpolation to approximate the average weight curve for each sample.
b. Find an approximate maximum average weight for each sample by determining the maximum
of the interpolating polynomial.
Day 0 6 10 13 17 20 28
Sample 1 average weight (mg) 6.67 17.33 42.67 37.33 30.10 29.31 28.74
Sample 2 average weight (mg) 6.67 16.11 18.89 15.00 10.56 9.44 8.89
(10) Use Neville’s method to obtain the approximations for Lagrange interpolating polynomials of
degrees one, two, and three to approximate each of the following:
a. f (8.4) if f (8.1) = 16.94410, f (8.3) = 17.56492, f (8.6) = 18.50515, f (8.7) = 18.82091
b. f(−1/3) if f(−0.75) = −0.07181250, f(−0.5) = −0.02475000, f(−0.25) = 0.33493750, f(0) = 1.10100000.
(11) Use Neville's method to approximate √3 with the following functions and values.
a. f(x) = 3^x and the values x0 = −2, x1 = −1, x2 = 0, x3 = 1, and x4 = 2.
b. f(x) = √x and the values x0 = 0, x1 = 1, x2 = 2, x3 = 4, and x4 = 5.
c. Compare the accuracy of the approximation in parts (a) and (b).
(12) Let P3 (x) be the interpolating polynomial for the data (0, 0), (0.5, y), (1, 3), and (2, 2). Use
Neville’s method to find y if P3 (1.5) = 0.
(13) Neville’s Algorithm is used to approximate f (0) using f (−2), f (−1), f (1), and f (2). Suppose
f (−1) was understated by 2 and f (1) was overstated by 3. Determine the error in the original
calculation of the value of the interpolating polynomial to approximate f (0).
(14) If linear interpolation is used to interpolate the error function
f(x) = (2/√π) ∫_0^x e^{−t^2} dt,
show that the error of linear interpolation using data (x0, f0) and (x1, f1) cannot exceed (x1 − x0)^2 / (2√(2πe)).
(15) Using Newton divided difference interpolation, construct interpolating polynomials of degree
one, two, and three for the following data. Approximate the specified value using each of the
polynomials.
f (0.43) if f (0) = 1, f (0.25) = 1.64872, f (0.5) = 2.71828, f (0.75) = 4.4816.
(16) Show that the polynomial interpolating (in Newton form) the following data has degree 3.

x −2 −1 0 1 2 3
f (x) 1 4 11 16 13 −4
(17) Let f (x) = ex , show that f [x0 , x1 , . . . , xm ] > 0 for all values of m and all distinct equally spaced
nodes {x0 < x1 < · · · < xm }.
(18) Show that the interpolating polynomial for f (x) = xn+1 at n + 1 nodal points x0 , x1 , · · · , xn is
given by
xn+1 − (x − x0 )(x − x1 ) · · · (x − xn ).
(19) The following data are given for a polynomial P (x) of unknown degree
x 0 1 2 3
f (x) 4 9 15 18
Determine the coefficient of x3 in P (x) if all fourth-order forward differences are 1.
(20) Let i0 , i1 , · · · , in be a rearrangement of the integers 0, 1, · · · , n. Show that f [xi0 , xi1 , · · · , xin ] =
f [x0 , x1 , · · · , xn ].
(21) Let f (x) = 1/(1 + x) and let x0 = 0, x1 = 1, x2 = 2. Calculate the divided differences f [x0 , x1 ]
and f [x0 , x1 , x2 ]. Using these divided differences, give the quadratic polynomial P2 (x) that
interpolates f (x) at the given node points {x0 , x1 , x2 }. Graph the error f (x) − P2 (x) on the
interval [0, 2].
(22) Construct the interpolating polynomial that fits the following data using Newton forward and
backward
x 0 0.1 0.2 0.3 0.4 0.5
f (x) −1.5 −1.27 −0.98 −0.63 −0.22 0.25
difference interpolation. Hence find the values of f (x) at x = 0.15 and 0.45.
(23) For a function f, the forward divided differences are given by
x0 = 0.0   f[x0]
x1 = 0.4   f[x1]        f[x0, x1]       f[x0, x1, x2] = 50/7
x2 = 0.7   f[x2] = 6    f[x1, x2] = 10
Determine the missing entries in the table.
(24) A fourth-degree polynomial P (x) satisfies ∆4 P (0) = 24, ∆3 P (0) = 6, and ∆2 P (0) = 0, where
∆P (x) = P (x + 1) − P (x). Compute ∆2 P (10).
(25) Show that
f[x0, x1, x2, · · ·, xn, x] = f^{(n+1)}(ξ(x)) / (n + 1)!.
(26) Use the method of least squares to fit the linear and quadratic polynomial to the following
data.
x −2 −1 0 1 2
f (x) 15 1 1 3 19
(27) By the method of least square fit a curve of the form y = axb to the following data.
x 2 3 4 5
y 27.8 62.1 110 161
(28) Use the method of least squares to fit a curve y = c0 /x + c1 x to the following data.
x 0.1 0.2 0.4 0.5 1 2
y 21 11 7 6 5 6
(29) Experiment with a periodic process provided the following data :
t◦ 0 50 100 150 200
y 0.754 1.762 2.041 1.412 0.303
Estimate the parameters a and b in the model y = a + b sin t, using least squares approximation.
Appendix A. Algorithms
Algorithm (Lagrange Interpolation):
• Read the degree n of the polynomial Pn(x).
• Read the values x(i) and y(i) = f(xi), i = 0, 1, . . . , n.
• Read the point of interpolation p.
• Calculate the Lagrange fundamental polynomials l(i) using the following loop:
for i = 0 to n
    l(i) = 1.0
    for j = 0 to n
        if j ≠ i
            l(i) = l(i) ∗ (p − x(j))/(x(i) − x(j))
    end j
end i
• Calculate the approximate value of the function at x = p using the following loop:
sum = 0.0
for i = 0 to n
    sum = sum + l(i) ∗ y(i)
end i
• Print sum.
Algorithm (Newton Divided-Difference Interpolation):
Given n distinct interpolation points x(1), · · ·, x(n) and the values of a function f at these points, the following computes the matrix of divided differences:
D = zeros(n, n);
for i = 1 : n
    D(i, 1) = y(i);
end i
for j = 2 : n
    for k = j : n
        D(k, j) = (D(k, j − 1) − D(k − 1, j − 1))/(x(k) − x(k − j + 1));
    end k
end j
Now compute the value at the interpolating point p using nesting:
f p = D(n, n);
for i = n − 1 : −1 : 1
    f p = f p ∗ (p − x(i)) + D(i, i);
end i
Print the matrix D and f p.
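For readers working in Python rather than pseudocode, here is a direct transliteration (ours, with our own function name) of the divided-difference algorithm with nested evaluation:

import numpy as np

def newton_divided_difference(x, y, p):
    """Newton form interpolation: build the divided difference table,
    then evaluate at p by nested multiplication."""
    n = len(x)
    D = np.zeros((n, n))
    D[:, 0] = y
    for j in range(1, n):
        for k in range(j, n):
            D[k, j] = (D[k, j-1] - D[k-1, j-1]) / (x[k] - x[k-j])
    # Nested evaluation using the diagonal entries D[i, i]
    fp = D[n-1, n-1]
    for i in range(n-2, -1, -1):
        fp = fp * (p - x[i]) + D[i, i]
    return fp

# Example 7 data: interpolating (0,1), (1,2), (3,6), (5,7), evaluated at x = 2
print(newton_divided_difference([0, 1, 3, 5], [1, 2, 6, 7], 2.0))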
Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley & Sons, 3rd edition, 2004.
CHAPTER 6 (4 LECTURES)
NUMERICAL INTEGRATION
1. Introduction
The general problem is to find the approximate value of the integral of a given function f(x) over an interval [a, b]:
I = ∫_a^b f(x)dx.    (1.1)
The problem can be solved using the Fundamental Theorem of Calculus, by finding an antiderivative F of f, that is, F'(x) = f(x), and then
∫_a^b f(x)dx = F(b) − F(a).
But finding an antiderivative is not an easy task in general, so this is certainly not a good approach for numerical computation.
In this chapter we study methods for deriving integration rules. We also consider composite versions of these rules and the errors associated with them.

2. Elements of numerical integration
The basic method of approximating the integral is called numerical quadrature and uses a sum of the type
∫_a^b f(x)dx ≈ Σ_{i=0}^{n} λi f(xi).    (2.1)
The method of quadrature is based on polynomial interpolation. We choose a set of distinct nodes {x0, x1, x2, · · ·, xn} in the interval [a, b]. Then we approximate the function f(x) by an interpolating polynomial, say the Lagrange interpolating polynomial:
f(x) = Pn(x) + en(x)
     = Σ_{i=0}^{n} f(xi) li(x) + [f^{(n+1)}(ξ)/(n + 1)!] Π_{i=0}^{n} (x − xi).
Here ξ = ξ(x) ∈ (a, b) and
li(x) = Π_{j=0, j≠i}^{n} (x − xj)/(xi − xj), 0 ≤ i ≤ n.
Therefore
∫_a^b f(x)dx = ∫_a^b Pn(x)dx + ∫_a^b en(x)dx
             = Σ_{i=0}^{n} f(xi) ∫_a^b li(x)dx + [1/(n + 1)!] ∫_a^b f^{(n+1)}(ξ) Π_{i=0}^{n} (x − xi)dx
             = Σ_{i=0}^{n} λi f(xi) + E(f),
where
λi = ∫_a^b li(x)dx.
The error in the numerical quadrature is given by
E(f) = [1/(n + 1)!] ∫_a^b f^{(n+1)}(ξ) Π_{i=0}^{n} (x − xi)dx.
We can also use Newton divided difference interpolation to approximate the function f (x).
3. Newton-Cotes Formula
Let all nodes be equally spaced with spacing h = (b − a)/n; the number h is also called the step length. Let x0 = a and xn = b; then xi = a + ih, i = 0, 1, · · ·, n. The general quadrature formula is
∫_a^b f(x)dx = Σ_{i=0}^{n} λi f(xi) + E(f).
This formula is called a Newton-Cotes formula when all points are equally spaced. We now derive rules by taking interpolating polynomials of degree one and two.
3.1. Trapezoidal Rule. We derive the trapezoidal rule for approximating ∫_a^b f(x)dx using the linear Lagrange polynomial. Let x0 = a, x1 = b, and h = b − a. Then
∫_{x0}^{x1} f(x) dx = ∫_{x0}^{x1} P1(x)dx + E(f).
We calculate both integrals separately:
∫_{x0}^{x1} P1(x)dx = ∫_{x0}^{x1} [l0(x)f(x0) + l1(x)f(x1)] dx
= f(x0) ∫_{x0}^{x1} (x − x1)/(x0 − x1) dx + f(x1) ∫_{x0}^{x1} (x − x0)/(x1 − x0) dx
= f(x0) [(x − x1)^2 / 2(x0 − x1)]_{x0}^{x1} + f(x1) [(x − x0)^2 / 2(x1 − x0)]_{x0}^{x1}
= [(x1 − x0)/2] [f(x0) + f(x1)]
= (h/2) [f(a) + f(b)].
E(f) = ∫_{x0}^{x1} [f^{(2)}(ξ)/2!] (x − x0)(x − x1) dx = (1/2) ∫_{x0}^{x1} f^{(2)}(ξ)(x − x0)(x − x1) dx.
Since (x − x0)(x − x1) does not change sign on [x0, x1], by the Weighted Mean-Value Theorem there exists a point c ∈ (x0, x1) such that
E(f) = [f^{(2)}(c)/2] ∫_{x0}^{x1} (x − x0)(x − x1) dx
     = [f^{(2)}(c)/2] [−(x1 − x0)^3/6]
     = −(h^3/12) f^{(2)}(c).
Thus the integration formula is
∫_a^b f(x)dx = (h/2)[f(a) + f(b)] − (h^3/12) f^{(2)}(c).
Geometrically, this is the area of the trapezium (trapezoid) with width h and parallel ordinates f(a) and f(b).
3.2. Simpson's Rule. We take the second degree Lagrange interpolating polynomial, with n = 2, x0 = a, x1 = (a + b)/2, x2 = b, h = (b − a)/2. Then
∫_{x0}^{x2} f(x) dx = ∫_{x0}^{x2} P2(x)dx + E(f),
∫_{x0}^{x2} P2(x)dx = ∫_{x0}^{x2} [l0(x)f(x0) + l1(x)f(x1) + l2(x)f(x2)] dx
                   = λ0 f(x0) + λ1 f(x1) + λ2 f(x2).
The values of the multipliers λ0, λ1, and λ2 are given by
λ0 = ∫_{x0}^{x2} (x − x1)(x − x2) / [(x0 − x1)(x0 − x2)] dx.
To simplify this integral, we substitute x = x0 + ht, dx = h dt, and change the limits to 0 and 2 accordingly:
λ0 = h ∫_0^2 (t − 1)(t − 2) / [(0 − 1)(0 − 2)] dt = h/3.
Similarly,
λ1 = ∫_{x0}^{x2} (x − x0)(x − x2) / [(x1 − x0)(x1 − x2)] dx = h ∫_0^2 (t − 0)(t − 2) / [(1 − 0)(1 − 2)] dt = 4h/3,
λ2 = ∫_{x0}^{x2} (x − x0)(x − x1) / [(x2 − x0)(x2 − x1)] dx = h ∫_0^2 (t − 0)(t − 1) / [(2 − 0)(2 − 1)] dt = h/3.
Now the error is given by
E(f) = (1/3!) ∫_{x0}^{x2} f'''(ξ)(x − x0)(x − x1)(x − x2)dx.
Since (x − x0)(x − x1)(x − x2) changes sign in the interval [x0, x2], we cannot apply the Weighted Mean-Value Theorem directly (as we did for the trapezoidal rule). Also
∫_{x0}^{x2} (x − x0)(x − x1)(x − x2) dx = 0.
We can add an interpolation point without affecting the integral of the interpolating polynomial, leaving the error unchanged. We can therefore do the error analysis of Simpson's rule with a single point added; since adding any point in [a, b] does not affect the integral, we simply double the midpoint, so that our node set is {x0 = a, x1 = (a + b)/2, x1 = (a + b)/2, x2 = b}. Examining the next interpolating polynomial gives
E(f) = (1/4!) ∫_{x0}^{x2} f^{(4)}(ξ)(x − x0)(x − x1)^2 (x − x2)dx.
Now the product (x − x0)(x − x1)^2 (x − x2) does not change sign on [x0, x2], so by the Weighted Mean-Value Theorem there exists a point c ∈ (x0, x2) such that
E(f) = [f^{(4)}(c)/24] ∫_{x0}^{x2} (x − x0)(x − x1)^2 (x − x2)dx
     = −[f^{(4)}(c)/2880] (x2 − x0)^5
     = −(h^5/90) f^{(4)}(c).
Hence
∫_a^b f(x)dx = (h/3)[f(a) + 4f((a + b)/2) + f(b)] − (h^5/90) f^{(4)}(c).
This rule is called Simpson's 1/3 rule.
Similarly, by taking the third order Lagrange interpolating polynomial with the four nodes a = x0, x1, x2, x3 = b and h = (b − a)/3, we get the next integration formula, known as Simpson's 3/8 rule:
∫_a^b f(x)dx = (3h/8)[f(x0) + 3f(x1) + 3f(x2) + f(x3)] − (3/80) h^5 f^{(4)}(c).
Definition 3.1. The degree of accuracy, or precision, of a quadrature formula is the largest positive integer n such that the formula is exact for x^k, for each k = 0, 1, · · ·, n.
In other words, an integration method of the form
∫_a^b f(x)dx = Σ_{i=0}^{n} λi f(xi) + [1/(n + 1)!] ∫_a^b f^{(n+1)}(ξ) Π_{i=0}^{n} (x − xi)dx
is said to be of order n if it gives exact results, with zero error term, for all polynomials of degree ≤ n.
The trapezoidal rule has degree of precision one, and Simpson's rule has three.
Example 1. Find the value of the integral
I = ∫_0^1 dx/(1 + x)
using the trapezoidal and Simpson's rules. Also obtain bounds on the errors and compare with the exact value.
Sol. Here
f(x) = 1/(1 + x).
By the trapezoidal rule with a = 0, b = 1, h = b − a = 1,
IT = (h/2)[f(a) + f(b)] = (1/2)[1 + 1/2] = 0.75.
The exact value is
Iexact = ln 2 = 0.693147,
so the error is |0.75 − 0.693147| = 0.056853. The error bound for the trapezoidal rule is
|E(f)| ≤ (h^3/12) max_{0≤ξ≤1} |f''(ξ)| = (1/12) max_{0≤ξ≤1} 2/(1 + ξ)^3 = 1/6.
Similarly, using Simpson's rule with h = (b − a)/2 = 1/2, we obtain
IS = (h/3)[f(0) + 4f(1/2) + f(1)] = (1/6)(1 + 8/3 + 1/2) = 0.69444,
with error |0.69444 − 0.693147| = 0.001297. The error bound for Simpson's rule is
|E(f)| ≤ (h^5/90) max_{0≤ξ≤1} |f^{(4)}(ξ)| = (1/2880) max_{0≤ξ≤1} 24/(1 + ξ)^5 = 0.008333.
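As a quick check (our sketch, not part of the notes), the two rules of Example 1 can be evaluated in a few lines of Python:

import math

f = lambda x: 1.0 / (1.0 + x)
a, b = 0.0, 1.0

# Trapezoidal rule on one interval, h = b - a
I_trap = (b - a) / 2 * (f(a) + f(b))
# Simpson's 1/3 rule, h = (b - a)/2
h = (b - a) / 2
I_simp = h / 3 * (f(a) + 4 * f((a + b) / 2) + f(b))

exact = math.log(2.0)
print(I_trap, abs(I_trap - exact))   # 0.75, error ≈ 0.056853
print(I_simp, abs(I_simp - exact))   # 0.69444..., error ≈ 0.001297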
Example 2. Find, by the method of undetermined coefficients, the quadrature formula
∫_0^1 f(x)/√(x(1 − x)) dx = α1 f(0) + α2 f(1/2) + α3 f(1)
which is exact for polynomials of the highest possible degree. Then use the formula to evaluate
∫_0^1 dx/√(x − x^3).
Sol. We make the method exact for polynomials up to degree 2:
f(x) = 1 : I1 = ∫_0^1 dx/√(x(1 − x)) = α1 + α2 + α3,
f(x) = x : I2 = ∫_0^1 x dx/√(x(1 − x)) = (1/2)α2 + α3,
f(x) = x^2 : I3 = ∫_0^1 x^2 dx/√(x(1 − x)) = (1/4)α2 + α3.
Now, with the substitution t = 2x − 1 (so that x(1 − x) = (1 − t^2)/4),
I1 = ∫_0^1 dx/√(x(1 − x)) = ∫_{−1}^{1} dt/√(1 − t^2) = [sin^{−1} t]_{−1}^{1} = π,
and similarly I2 = π/2, I3 = 3π/8. Therefore
α1 + α2 + α3 = π,
(1/2)α2 + α3 = π/2,
(1/4)α2 + α3 = 3π/8.
Solving these equations, we obtain α1 = π/4, α2 = π/2, α3 = π/4. Hence
∫_0^1 f(x)/√(x(1 − x)) dx = (π/4)[f(0) + 2f(1/2) + f(1)].
Now
I = ∫_0^1 dx/√(x − x^3) = ∫_0^1 [1/√(1 + x)] dx/√(x(1 − x)),
so here f(x) = 1/√(1 + x). Using the formula above,
I = (π/4)[1 + 2√2/√3 + √2/2] = 2.62331.
The exact value of the given integral is I = 2.6220575.
4. Composite Integration
As the order of an integration method is increased, the order of the derivative appearing in the error term also increases, so a higher-order method can be used only if the integrand is differentiable up to the required order. Alternatively, we can apply lower-order methods by dividing the whole interval into subintervals and using a Newton-Cotes or Gauss quadrature method on each subinterval separately.
Composite Trapezoidal Method: We divide the interval [a, b] into N subintervals with step size h = (b − a)/N and nodal points a = x0 < x1 < · · · < xN = b, where xi = x0 + ih, i = 1, 2, · · ·, N − 1. Now
I = ∫_a^b f(x)dx = ∫_{x0}^{x1} f(x)dx + ∫_{x1}^{x2} f(x)dx + · · · + ∫_{xN−1}^{xN} f(x)dx.
Using the trapezoidal rule for each of the integrals on the right side, we obtain
I = (h/2)[(f(x0) + f(x1)) + (f(x1) + f(x2)) + · · · + (f(xN−1) + f(xN))]
    − (h^3/12)[f^{(2)}(ξ1) + f^{(2)}(ξ2) + · · · + f^{(2)}(ξN)]
  = (h/2)[f(x0) + f(xN) + 2 Σ_{i=1}^{N−1} f(xi)] − (h^3/12) Σ_{i=1}^{N} f^{(2)}(ξi).
This formula is the composite trapezoidal rule, where xi−1 ≤ ξi ≤ xi, i = 1, 2, · · ·, N. The error associated with this approximation is
E(f) = −(h^3/12) Σ_{i=1}^{N} f^{(2)}(ξi).
If f ∈ C^2[a, b], the Extreme Value Theorem implies that f^{(2)} attains its maximum and minimum on [a, b]. Since
min_{x∈[a,b]} f^{(2)}(x) ≤ f^{(2)}(ξi) ≤ max_{x∈[a,b]} f^{(2)}(x),
summing over i gives
N min_{x∈[a,b]} f^{(2)}(x) ≤ Σ_{i=1}^{N} f^{(2)}(ξi) ≤ N max_{x∈[a,b]} f^{(2)}(x),
and
min_{x∈[a,b]} f^{(2)}(x) ≤ (1/N) Σ_{i=1}^{N} f^{(2)}(ξi) ≤ max_{x∈[a,b]} f^{(2)}(x).
By the Intermediate Value Theorem, there is a c ∈ (a, b) such that
f^{(2)}(c) = (1/N) Σ_{i=1}^{N} f^{(2)}(ξi).
Therefore
E(f) = −(h^3/12) N f^{(2)}(c),
or, since h = (b − a)/N,
E(f) = −[(b − a)/12] h^2 f^{(2)}(c).
Composite Simpson's Method: Simpson's rule requires three abscissas per pair of subintervals, so choose an even integer N (producing an odd number of nodes) with h = (b − a)/N. As before, we write
I = ∫_a^b f(x)dx = ∫_{x0}^{x2} f(x)dx + ∫_{x2}^{x4} f(x)dx + · · · + ∫_{xN−2}^{xN} f(x)dx.
Using Simpson's rule for each of the integrals on the right side, we obtain
I = (h/3)[(f(x0) + 4f(x1) + f(x2)) + (f(x2) + 4f(x3) + f(x4)) + · · · + (f(xN−2) + 4f(xN−1) + f(xN))]
    − (h^5/90)[f^{(4)}(ξ1) + f^{(4)}(ξ2) + · · · + f^{(4)}(ξN/2)]
  = (h/3)[f(x0) + 2 Σ_{i=1}^{N/2−1} f(x2i) + 4 Σ_{i=1}^{N/2} f(x2i−1) + f(xN)] − (h^5/90) Σ_{i=1}^{N/2} f^{(4)}(ξi).
This formula is called the composite Simpson's rule. The error in the rule is
E(f) = −(h^5/90) Σ_{i=1}^{N/2} f^{(4)}(ξi).
If f ∈ C^4[a, b], the Extreme Value Theorem implies that f^{(4)} attains its maximum and minimum on [a, b]. Since
min_{x∈[a,b]} f^{(4)}(x) ≤ f^{(4)}(ξi) ≤ max_{x∈[a,b]} f^{(4)}(x),
summing gives
(N/2) min_{x∈[a,b]} f^{(4)}(x) ≤ Σ_{i=1}^{N/2} f^{(4)}(ξi) ≤ (N/2) max_{x∈[a,b]} f^{(4)}(x),
and
min_{x∈[a,b]} f^{(4)}(x) ≤ (2/N) Σ_{i=1}^{N/2} f^{(4)}(ξi) ≤ max_{x∈[a,b]} f^{(4)}(x).
By the Intermediate Value Theorem, there is a c ∈ (a, b) such that
f^{(4)}(c) = (2/N) Σ_{i=1}^{N/2} f^{(4)}(ξi).
Therefore
E(f) = −(h^5/180) N f^{(4)}(c),
or, since h = (b − a)/N,
E(f) = −[(b − a)/180] h^4 f^{(4)}(c).
Example 3. Determine the number of subintervals n and the step size h required to approximate
∫_0^2 dx/(x + 4)
to within 10^{−5}, and compute the approximation using the composite Simpson's rule.
Sol. Here f(x) = 1/(x + 4), so f^{(4)}(x) = 24/(x + 4)^5 and
max_{x∈[0,2]} |f^{(4)}(x)| = 24/4^5.
Now the error in the composite Simpson's rule is
E(f) = −[(b − a)/180] h^4 f^{(4)}(c).
To get the desired accuracy, we require
(2h^4/180)(24/4^5) < 10^{−5}
=⇒ h < 0.44267,
which gives n ≥ 6 (n even). Taking 6 subintervals with h = 2/6 = 1/3 and using the composite Simpson's rule, we obtain
IS = (1/9)[f(0) + 4{f(1/3) + f(1) + f(5/3)} + 2{f(2/3) + f(4/3)} + f(2)] = 0.405466.
Example 4. Determine the values of h (or n) that will ensure an approximation error of less than 0.00002 when approximating ∫_0^π sin x dx, employing (a) the composite trapezoidal rule and (b) the composite Simpson's rule.
Sol. (a) The error form for the composite trapezoidal rule for f(x) = sin x on [0, π] is
|(π h^2/12) f''(c)| = |(π h^2/12)(− sin c)| = (π h^2/12)|sin c|.
To ensure sufficient accuracy with this technique we need
(π h^2/12)|sin c| ≤ π h^2/12 < 0.00002.
Since h = π/n implies n = π/h, we need
π^3/(12n^2) < 0.00002, which implies n > [π^3/(12 × 0.00002)]^{1/2} ≈ 359.44,
and the composite trapezoidal rule requires n ≥ 360.
(b) The error form for the composite Simpson's rule for f(x) = sin x on [0, π] is
|(π h^4/180) f^{(4)}(c)| = |(π h^4/180) sin c| = (π h^4/180)|sin c|.
To ensure sufficient accuracy with this technique we need
(π h^4/180)|sin c| ≤ π h^4/180 < 0.00002.
Using again the fact that n = π/h gives
π^5/(180 n^4) < 0.00002, which implies n > [π^5/(180 × 0.00002)]^{1/4} ≈ 17.07.
So the composite Simpson's rule requires only n ≥ 18. Composite Simpson's rule with n = 18 gives
∫_0^π sin x dx ≈ (π/54)[2 Σ_{i=1}^{8} sin(iπ/9) + 4 Σ_{i=1}^{9} sin((2i − 1)π/18)] = 2.0000104.
This is accurate to within about 10−5 because the true value is − cos(π) − (− cos(0)) = 2.
Example 5. The area A inside the closed curve y^2 + x^2 = cos x is given by
A = 4 ∫_0^α (cos x − x^2)^{1/2} dx,
where α is the positive root of the equation cos x = x^2.
(a) Compute α to three correct decimals.
(b) Use the trapezoidal rule to compute the area A with an absolute error less than 0.05.
Sol. (a) Using Newton's method to find the root of
f(x) = cos x − x^2 = 0,
we obtain the iteration scheme
x_{k+1} = x_k + (cos x_k − x_k^2)/(sin x_k + 2x_k), k = 0, 1, 2, · · ·
Starting with x0 = 0.5, we obtain
x1 = 0.5 + 0.62758/1.47942 = 0.92420,
x2 = 0.92420 − 0.25169/2.64655 = 0.82911,
x3 = 0.82911 − 0.011882/2.39554 = 0.82414,
x4 = 0.82414 − 0.000033/2.38226 = 0.82413.
Hence the value of α correct to three decimals is 0.824.
(b) Substituting the value of α, we obtain
A = 4 ∫_0^{0.824} (cos x − x^2)^{1/2} dx.
Using the composite trapezoidal method with h = 0.824, 0.412, and 0.206 respectively, we obtain the following approximations of the area A:
A = [4(0.824)/2][1 + 0.017753] = 1.67725,
A = [4(0.412)/2][1 + 2(0.864047) + 0.017753] = 2.262578,
A = [4(0.206)/2][1 + 2(0.967688 + 0.864047 + 0.658115) + 0.017753] = 2.470951.
5. Gauss Quadrature
In a numerical integration method, if both the nodes xi and the multipliers λi are unknown, the method is called Gaussian quadrature. We obtain the unknowns by making the method exact for polynomials of degree as high as possible. The formulas are derived for the interval [−1, 1]; any interval [a, b] can be transformed to [−1, 1] by the substitution x = At + B with a = −A + B and b = A + B, which after solving gives
x = [(b − a)/2] t + (b + a)/2.
As observed for Newton-Cotes quadrature, we can write any integral as
∫_{−1}^{1} f(x)dx = Σ_{i=0}^{n} λi f(xi) + ∫_{−1}^{1} [f^{(n+1)}(c)/(n + 1)!] (x − x0) · · · (x − xn) dx
                 = Σ_{i=0}^{n} λi f(xi) + E(f).
If the product does not change sign on the interval concerned, we can write the error as
E(f) = [f^{(n+1)}(c)/(n + 1)!] ∫_{−1}^{1} (x − x0) · · · (x − xn) dx = [f^{(n+1)}(c)/(n + 1)!] C,
where
C = ∫_{−1}^{1} (x − x0) · · · (x − xn) dx.
We can compute the value of C by putting f(x) = x^{n+1}, which gives
∫_{−1}^{1} x^{n+1} dx = Σ_{i=0}^{n} λi xi^{n+1} + C
=⇒ C = ∫_{−1}^{1} x^{n+1} dx − Σ_{i=0}^{n} λi xi^{n+1}.
The number C is called the error constant. Using this notation, the error term is
E(f) = [C/(n + 1)!] f^{(n+1)}(c).
Gauss-Legendre Integration Methods: The technique described below can be used to determine the nodes and coefficients of formulas that give exact results for higher-degree polynomials.
One-point formula: The formula is
∫_{−1}^{1} f(x)dx = λ0 f(x0).
The method has two unknowns, λ0 and x0. Making the method exact for f(x) = 1 and f(x) = x, we obtain
f(x) = 1 : ∫_{−1}^{1} dx = 2 = λ0,
f(x) = x : ∫_{−1}^{1} x dx = 0 = λ0 x0 =⇒ x0 = 0.
Therefore the one-point formula is
∫_{−1}^{1} f(x)dx = 2f(0).
The error in the approximation is
E(f) = (C/2!) f''(ξ),
where the error constant C is
C = ∫_{−1}^{1} x^2 dx − 2(0)^2 = 2/3.
Hence
E(f) = (1/3) f''(ξ), −1 < ξ < 1.
Two-point formula:
∫_{−1}^{1} f(x)dx = λ0 f(x0) + λ1 f(x1).
The method has four unknowns. Making the method exact for f(x) = 1, x, x^2, x^3, we obtain
f(x) = 1 : ∫_{−1}^{1} dx = 2 = λ0 + λ1,    (5.1)
f(x) = x : ∫_{−1}^{1} x dx = 0 = λ0 x0 + λ1 x1,    (5.2)
f(x) = x^2 : ∫_{−1}^{1} x^2 dx = 2/3 = λ0 x0^2 + λ1 x1^2,    (5.3)
f(x) = x^3 : ∫_{−1}^{1} x^3 dx = 0 = λ0 x0^3 + λ1 x1^3.    (5.4)
Eliminating λ0 from the second and fourth equations gives
λ1 x1^3 − λ1 x1 x0^2 = 0, i.e. λ1 x1 (x1 − x0)(x1 + x0) = 0.
Since λ1 ≠ 0, x0 ≠ x1 and x1 ≠ 0 (if x1 = 0, then by the second equation x0 = 0), we get x1 = −x0. Substituting into the second equation gives λ0 = λ1, and the first equation then gives λ0 = λ1 = 1. The third equation gives x0^2 = 1/3, i.e. x0 = ±1/√3 and x1 = ∓1/√3. Therefore, the two-point formula is
∫_{−1}^{1} f(x)dx = f(−1/√3) + f(1/√3).
The error is given by
E(f) = (C/4!) f^{(4)}(ξ), with C = ∫_{−1}^{1} x^4 dx − [(−1/√3)^4 + (1/√3)^4] = 2/5 − 2/9 = 8/45.
Hence the error in the two-point formula is
E(f) = (1/135) f^{(4)}(ξ), −1 < ξ < 1.
Three-point formula: Taking n = 2, we have
∫_{−1}^{1} f(x)dx = λ0 f(x0) + λ1 f(x1) + λ2 f(x2).
The method has six unknowns. Making the method exact for f(x) = 1, x, x^2, x^3, x^4, x^5, we obtain
f(x) = 1 : 2 = λ0 + λ1 + λ2
f(x) = x : 0 = λ0 x0 + λ1 x1 + λ2 x2
f(x) = x^2 : 2/3 = λ0 x0^2 + λ1 x1^2 + λ2 x2^2
f(x) = x^3 : 0 = λ0 x0^3 + λ1 x1^3 + λ2 x2^3
f(x) = x^4 : 2/5 = λ0 x0^4 + λ1 x1^4 + λ2 x2^4
f(x) = x^5 : 0 = λ0 x0^5 + λ1 x1^5 + λ2 x2^5
Solving these equations, we obtain λ0 = λ2 = 5/9, λ1 = 8/9, x0 = ±√(3/5), x1 = 0, and x2 = ∓√(3/5). Therefore the formula is
∫_{−1}^{1} f(x)dx = (1/9)[5f(−√(3/5)) + 8f(0) + 5f(√(3/5))].
The error in the three-point formula is
E5 = (C/6!) f^{(6)}(ξ),
where
C = ∫_{−1}^{1} x^6 dx − (1/9)[5(−√(3/5))^6 + 8 × 0 + 5(√(3/5))^6] = 2/7 − 6/25 = 8/175.
Therefore
E5 = (1/15750) f^{(6)}(ξ), −1 < ξ < 1.
Note: The nodes of the Gauss-Legendre rules are the roots of the Legendre polynomials. Written as monic polynomials of degree n, the first few Legendre polynomials are
P0(x) = 1,
P1(x) = x,
P2(x) = x^2 − 1/3,
P3(x) = x^3 − (3/5)x.
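In practice one rarely solves these nonlinear systems by hand; libraries tabulate the nodes and weights. A short Python sketch (ours) using NumPy's Gauss-Legendre routine, including the affine map from [a, b] to [−1, 1]:

import numpy as np

def gauss_legendre(f, a, b, n):
    """n-point Gauss-Legendre quadrature on [a, b]."""
    t, w = np.polynomial.legendre.leggauss(n)   # nodes/weights on [-1, 1]
    x = 0.5 * (b - a) * t + 0.5 * (b + a)       # map nodes to [a, b]
    return 0.5 * (b - a) * np.sum(w * f(x))

# Example 6 below: integral of 2x/(1+x^4) on [1, 2]
f = lambda x: 2 * x / (1 + x**4)
for n in (1, 2, 3):
    print(n, gauss_legendre(f, 1.0, 2.0, n))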
Example 6. Evaluate
I = ∫_1^2 2x/(1 + x^4) dx
using the Gauss-Legendre 1- and 2-point formulas, and compare with the exact value.
Sol. First we transform the interval [1, 2] to [−1, 1] by taking x = (t + 3)/2, dx = dt/2:
I = ∫_1^2 2x/(1 + x^4) dx = ∫_{−1}^{1} 8(t + 3)/(16 + (t + 3)^4) dt.
Let
f(t) = 8(t + 3)/(16 + (t + 3)^4).
By the 1-point formula,
I ≈ 2f(0) = 0.4948.
By the 2-point formula,
I ≈ f(−1/√3) + f(1/√3) = 0.5434.
The exact value of the integral is
I = ∫_1^2 2x/(1 + x^4) dx = tan^{−1} 4 − π/4 = 0.5404.
Example 7. Evaluate
I = ∫_{−1}^{1} (1 − x^2)^{3/2} cos x dx
using the Gauss-Legendre 3-point formula.
Sol. Using the Gauss-Legendre 3-point formula with f(x) = (1 − x^2)^{3/2} cos x, we obtain
I ≈ (1/9)[5f(−√(3/5)) + 8f(0) + 5f(√(3/5))]
  = (1/9)[10 (2/5)^{3/2} cos(√(3/5)) + 8]
  = 1.08979.
Example 8. Evaluate
I = ∫_0^1 dx/(1 + x)
by subdividing the interval [0, 1] into two equal parts and then using the Gauss-Legendre three-point formula
∫_{−1}^{1} f(x)dx ≈ (1/9)[5f(−√(3/5)) + 8f(0) + 5f(√(3/5))].
Sol. Let
I = ∫_0^1 dx/(1 + x) = ∫_0^{1/2} dx/(1 + x) + ∫_{1/2}^1 dx/(1 + x) = I1 + I2.
Now substitute x = (t + 1)/4 in I1 and x = (z + 3)/4 in I2 to change the limits to [−1, 1]; then dx = dt/4 and dx = dz/4 respectively. Therefore
I1 = ∫_{−1}^{1} dt/(t + 5) = (1/9)[5/(5 − √(3/5)) + 8/5 + 5/(5 + √(3/5))] = 0.405464,
I2 = ∫_{−1}^{1} dz/(z + 7) = (1/9)[5/(7 − √(3/5)) + 8/7 + 5/(7 + √(3/5))] = 0.287682.
Hence
I = I1 + I2 = 0.405464 + 0.287682 = 0.693146.
Exercises
(1) Given
I = ∫_0^1 x e^x dx.
Approximate the value of I using the trapezoidal and Simpson's one-third methods. Also obtain the error bounds and compare with the exact value of the integral.
(2) Evaluate
I = ∫_0^1 dx/(1 + x^2)
using the trapezoidal and Simpson's rules with 4 and 6 subintervals. Compare with the exact value of the integral.
(3) Approximate the following integrals using the trapezoidal and Simpson formulas.
a. I = ∫_{−0.25}^{0.25} (cos x)^2 dx   b. I = ∫_e^{e+1} 1/(x ln x) dx.
Find a bound for the error using the error formula, and compare this to the actual error.
(4) The quadrature formula ∫_0^2 f(x)dx = c0 f(0) + c1 f(1) + c2 f(2) is exact for all polynomials of degree less than or equal to 2. Determine c0, c1, and c2.
(5) Determine the values of a, b, and c such that the formula
∫_0^h f(x)dx = h[af(0) + bf(h/3) + cf(h)]
is exact for polynomials of degree as high as possible. Also obtain the degree of precision.
(6) The length of the curve represented by a function y = f(x) on an interval [a, b] is given by the integral
I = ∫_a^b √(1 + [f'(x)]^2) dx.
Use the trapezoidal rule and Simpson's rule with 4 and 6 subintervals to compute the length of the graph of the ellipse with equation 4x^2 + 9y^2 = 36.
(7) Determine the values of n and h required to approximate
∫_0^2 e^{2x} sin 3x dx
to within 10^{−4}. Use the composite trapezoidal and composite Simpson's rules.
(8) The equation
(1/√(2π)) ∫_0^x e^{−t^2/2} dt = 0.45
can be solved for x by applying Newton's method to the function
f(x) = (1/√(2π)) ∫_0^x e^{−t^2/2} dt − 0.45, with f'(x) = (1/√(2π)) e^{−x^2/2}.
Note that Newton's method requires the evaluation of f(xk) at various xk, which can be estimated using a quadrature formula. Find a solution of f(x) = 0 with error no more than 10^{−5} using Newton's method starting with x0 = 0.5 and the composite Simpson's rule.
(9) A car laps a race track in 84 seconds. The speed of the car at each 6-second interval is
determined by using a radar gun and is given from the beginning of the lap, in feet/second, by
the entries in the following table.
Time 0 6 12 18 24 30 36 42 48 54 60 66 72 78 84
Speed 124 134 148 156 147 133 121 109 99 85 78 89 104 116 123
How long is the track?
(10) A particle of mass m moving through a fluid is subjected to a viscous resistance R, which is a function of the velocity v. The relationship between the resistance R, velocity v, and time t is given by the equation
t = ∫_{v(t0)}^{v(t)} m/R(u) du.
Suppose that R(v) = −v√v for a particular fluid, where R is in newtons and v is in meters/second. If m = 10 kg and v(0) = 10 m/s, approximate the time required for the particle to slow to v = 5 m/s.
(11) Evaluate the integral
∫_{−1}^{1} e^{−x^2} cos x dx
by using Gaussian quadrature with n = 1 and n = 2.
(12) Determine constants a, b, c, and d that will produce a quadrature formula
∫_{−1}^{1} f(x)dx = af(−1) + bf(1) + cf'(−1) + df'(1)
that has degree of precision 3.
(13) Compute by Gaussian quadrature with n = 2 and compare with the exact value of the integral:
∫_3^{3.5} x/√(x^2 − 4) dx.
(14) Evaluate
I = ∫_0^1 sin x/(2 + x) dx
by subdividing the interval [0, 1] into two equal parts and then using Gaussian quadrature with n = 2.
(15) Determine the coefficients in the formula
∫_0^{2h} x^{−1/2} f(x)dx = (2h)^{1/2}[A0 f(0) + A1 f(h) + A2 f(2h)] + R
and calculate the remainder R when f^{(3)}(x) is constant.
(16) Consider approximating integrals of the form
I = ∫_0^1 √x f(x)dx,
in which f(x) has several continuous derivatives on [0, 1].
(a) Find a formula
∫_0^1 √x f(x)dx ≈ w1 f(x1) = I1
which is exact if f(x) is any linear polynomial.
(b) To find a formula
∫_0^1 √x f(x)dx ≈ w1 f(x1) + w2 f(x2) = I2
which is exact for all polynomials of degree ≤ 3, set up a system of four equations with unknowns w1, w2, x1, x2. Verify that
x1 = (1/9)(5 + 2√(10/7)), x2 = (1/9)(5 − 2√(10/7)),
w1 = (1/15)(5 + √(7/10)), w2 = 2/3 − w1
is a solution of the system.
(c) Apply I1 and I2 to the evaluation of
I = ∫_0^1 √x e^{−x} dx = 0.37894469164.
Appendix A. Algorithms
Algorithm (Composite Trapezoidal Method):
Step 1 : Inputs: function f (x); end points a and b; and N number of subintervals.
Step 2 : Set h = (b − a)/N .
Step 3 : Set sum = 0
Step 4 : For i = 1 to N − 1
Step 5 : Set x = a + h ∗ i
Step 6 : Set sum = sum+2 ∗ f (x)
end
Step 7 : Set sum = sum+f (a) + f (b)
Step 8 : Set ans = sum∗(h/2)
End
Algorithm (Composite Simpson's Method):
Step 1 : Inputs: function f (x); end points a and b; and N number of subintervals (even).
Step 2 : Set h = (b − a)/N .
Step 3 : Set sum = 0
Step 4 : For i = 1 to N − 1
Step 5 : Set x = a + h ∗ i
Step 6 : If i%2 = 0
sum = sum+2 ∗ f (x)
else
sum = sum+4 ∗ f (x)
end
Step 7 : Set sum = sum+f (a) + f (b)
Step 8 : Set ans = sum∗(h/3)
End
Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley & Sons, 3rd edition, 2004.
CHAPTER 7 (4 LECTURES)
INITIAL-VALUE PROBLEMS FOR ORDINARY DIFFERENTIAL EQUATIONS
1. Introduction
Differential equations are used to model problems in science and engineering that involve the change
of some variable with respect to another. Most of these problems require the solution of an initial-value
problem, that is, the solution to a differential equation that satisfies a given initial condition.
In common real-life situations, the differential equation that models the problem is too complicated to
solve exactly, and one of two approaches is taken to approximate the solution. The first approach is to
modify the problem by simplifying the differential equation to one that can be solved exactly and then
use the solution of the simplified equation to approximate the solution to the original problem. The
other approach, which we will examine in this chapter, uses methods for approximating the solution of
the original problem. This is the approach that is most commonly taken because the approximation
methods give more accurate results and realistic error information.
In this chapter, we discuss numerical methods for solving ordinary differential equation initial-value problems (IVPs) of the form
dy/dt = f(t, y), t ∈ R, y(t0) = y0,    (1.1)
where y is a function of t, f is a function of t and y, and t0 is the initial point. The numerical values of y(t) on an interval containing t0 are to be determined.
We divide the domain [a, b] into subintervals by the points
a = t0 < t1 < · · · < tN = b,
called mesh points or grid points. With equal spacing h, the uniform mesh points are given by ti = t0 + ih, i = 0, 1, 2, . . . The set of values y0, y1, · · ·, yN is the numerical solution of the initial-value problem (IVP).
2. Existence and Uniqueness of Solutions
Definition 2.1. A function f (t, y) is said to satisfy a Lipschitz condition in the variable y on some
domain if a constant L > 0 exists with
|f (t, y1 ) − f (t, y2 )| ≤ L |y1 − y2 |,
whenever (t, y1 ) and (t, y2 ) are in domain. The constant L is called a Lipschitz constant for f .
Example 1. Let f(t, x) = x^2 e^{−t^2} sin t be defined on
D = {(t, x) ∈ R^2 : 0 ≤ x ≤ 2}.
Show that f satisfies a Lipschitz condition.
Sol. Let (t, x1), (t, x2) ∈ D. Then
|f(t, x1) − f(t, x2)| = |x1^2 e^{−t^2} sin t − x2^2 e^{−t^2} sin t|
                     = |e^{−t^2} sin t| |x1 + x2| |x1 − x2|
                     ≤ (1)(4)|x1 − x2|.
Thus we may take L = 4, and f satisfies a Lipschitz condition in x on D with Lipschitz constant 4.
Example 2. Show that f(t, y) = t|y| satisfies a Lipschitz condition on the set D = {(t, y) : 1 ≤ t ≤ 2 and −3 ≤ y ≤ 4}.
Sol. For each pair of points (t, y1) and (t, y2) in D, we have
|f(t, y1) − f(t, y2)| = |t| | |y1| − |y2| | ≤ |t| |y1 − y2| ≤ 2|y1 − y2|.
Thus f satisfies a Lipschitz condition on D in the variable y with Lipschitz constant L = 2.
Theorem 2.2. If f(t, y) is continuous in a ≤ t ≤ b, −∞ < y < ∞, and
|f(t, y1) − f(t, y2)| ≤ L|y1 − y2|
for some positive constant L (i.e. f is Lipschitz continuous in y), then the IVP (1.1) has a unique solution in the interval [a, b].
Example 3. Show that there is a unique solution to the initial-value problem
y' = 1 + t sin(ty), 0 ≤ t ≤ 2, y(0) = 0.
Sol.
|f(t, y1) − f(t, y2)| = |(1 + t sin(ty1)) − (1 + t sin(ty2))|.
Holding t constant and applying the Mean Value Theorem to the function f(t, y) = 1 + t sin(ty), we find that when y1 < y2, a number ξ in (y1, y2) exists with
(f(t, y2) − f(t, y1))/(y2 − y1) = (∂/∂y) f(t, ξ) = t^2 cos(ξt).
Thus
|f(t, y1) − f(t, y2)| = |y2 − y1| |t^2 cos(ξt)| ≤ 4 |y2 − y1|,
and f satisfies a Lipschitz condition in the variable y with Lipschitz constant L = 4. Additionally, f(t, y) is continuous for 0 ≤ t ≤ 2 and −∞ < y < ∞, so the Existence Theorem implies that a unique solution exists to this initial-value problem.
2.1. Picard method. This method is also known as the method of successive approximations. We consider the IVP
dy/dt = f(t, y), t ∈ R, y(t0) = y0,
where f(t, y) is a continuous function on the given domain. The initial value problem is equivalent to the integral equation
y(t) = y0 + ∫_{t0}^{t} f(s, y(s))ds.
We can compute the solution y(t) at any time t by integrating this equation; note, however, that y itself appears inside the integral, so we start the procedure from some initial approximation of y(t). The successive approximations to the solution are given by
y0(t) = y0, y_{k+1}(t) = y0 + ∫_{t0}^{t} f(s, yk(s))ds, k = 0, 1, 2, · · ·
Equivalently, a solution of this equation is a continuous limit function φ of the iterates
φ0(t) = y0, φ_{k+1}(t) = y0 + ∫_{t0}^{t} f(s, φk(s))ds, k = 0, 1, 2, · · ·
Example 4. Consider the initial value problem y' = ty, y(0) = 1.
Sol. The integral equation corresponding to this problem is
y(t) = 1 + ∫_0^t s y(s)ds,
and the successive approximations are given by
φ0(t) = 1, φ_{k+1}(t) = 1 + ∫_0^t s φk(s)ds, k = 0, 1, 2, . . .
Thus
φ1(t) = 1 + ∫_0^t s ds = 1 + t^2/2,
φ2(t) = 1 + ∫_0^t s(1 + s^2/2) ds = 1 + t^2/2 + t^4/(2 · 4),
and it may be established by induction that
φk(t) = 1 + (t^2/2) + (1/2!)(t^2/2)^2 + · · · + (1/k!)(t^2/2)^k.
We recognize φk(t) as the k-th partial sum of the series expansion of φ(t) = e^{t^2/2}. This series converges for all t, so φk(t) → φ(t) as k → ∞, for all t ∈ R. Indeed, φ is the solution of the given initial value problem.
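The iteration is easy to automate with a symbolic package; here is a short sketch (ours, not from the notes) using SymPy for the IVP of Example 4:

import sympy as sp

t, s = sp.symbols('t s')

# Picard iteration for y' = t*y, y(0) = 1:
# phi_{k+1}(t) = 1 + integral of s*phi_k(s) ds from 0 to t
phi = sp.Integer(1)
for _ in range(4):
    phi = 1 + sp.integrate(s * phi.subs(t, s), (s, 0, t))
    print(sp.expand(phi))   # partial sums of exp(t**2/2)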
2.2. Taylor's Series method. Consider the one dimensional initial value problem
y' = f(t, y), y(t0) = y0,
where f is a function of the two variables t and y and (t0, y0) is a known point on the solution curve. If the existence of all higher order derivatives of y is assumed at some point t = ti, then by Taylor series the value of y at a neighboring point ti + h can be written as
y(ti + h) = y(ti) + hy'(ti) + (h^2/2!) y''(ti) + (h^3/3!) y'''(ti) + · · ·
Since yi is known at ti, y'(ti) can be found by computing f(ti, yi). Similarly, higher derivatives of y at ti can be computed by differentiating the relation y' = f(t, y). Hence the value of y at the neighboring point ti + h can be obtained by summing the above series. If the series is terminated after the p-th derivative term, the resulting formula is called the Taylor series approximation to y of order p, and the local error is of order h^{p+1}.
Example 5. Given the IVP y' = x^2 y − 1, y(0) = 1, use the Taylor series method of order 4 with step size 0.1 to find y at x = 0.1 and x = 0.2.
Sol. From the given IVP,
y' = x^2 y − 1, y'' = 2xy + x^2 y', y''' = 2y + 4xy' + x^2 y'', y^{(4)} = 6y' + 6xy'' + x^2 y'''.
∴ y'(0) = −1, y''(0) = 0, y'''(0) = 2, y^{(4)}(0) = −6.
The fourth-order Taylor formula is
y(xi + h) = y(xi) + hy'(xi) + (h^2/2) y''(xi) + (h^3/3!) y'''(xi) + (h^4/4!) y^{(4)}(xi) + O(h^5).
Therefore
y(0.1) = 1 + (0.1)(−1) + 0 + (0.1)^3 (2)/6 + (0.1)^4 (−6)/24 = 0.900308.
Similarly,
y(0.2) = 0.80227.
3. Numerical methods for IVP
We consider the IVP
dy/dt = f(t, y), t ∈ R,    (3.1)
y(t0) = y0.    (3.2)
Its integral form is
y(t) = y0 + ∫_{t0}^{t} f(s, y(s))ds.
3.1. Euler's Method: The Euler method is named after the Swiss mathematician Leonhard Euler (1707–1783). It is one of the simplest methods for solving the IVP. Consider the IVP given in Eqs. (3.1)–(3.2), and assume that all nodes ti are equally spaced with spacing h, so that ti+1 = ti + h. By the definition of the derivative,
y'(t0) ≈ (1/h)[y(t0 + h) − y(t0)].
Applying this approximation to the given IVP at the point t = t0, where y'(t0) = f(t0, y0), gives
(1/h)[y(t1) − y(t0)] ≈ f(t0, y0)
=⇒ y(t1) ≈ y(t0) + hf(t0, y0).
In general, we write
ti+1 = ti + h,
yi+1 = yi + hf(ti, yi),
where yi ≈ y(ti). This method is called Euler's method.
Alternatively, we can derive this method from a Taylor series. We write
y(ti+1) = y(ti + h) = y(ti) + hy'(ti) + (h^2/2!) y''(ti) + · · ·
If we truncate the series after y'(ti), we obtain
y(ti+1) ≈ y(ti) + hy'(ti) = y(ti) + hf(ti, y(ti))
=⇒ yi+1 = yi + hf(ti, yi).
If the truncation error contains the term h^{p+1}, then the order of the numerical method is p. Therefore, Euler's method is a first-order method.
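A minimal Python sketch of Euler's method (our code, not from the notes):

import math

def euler(f, t0, y0, h, n_steps):
    """Euler's method: y_{i+1} = y_i + h*f(t_i, y_i)."""
    t, y = t0, y0
    values = [(t, y)]
    for _ in range(n_steps):
        y = y + h * f(t, y)
        t = t + h
        values.append((t, y))
    return values

# Example 6 below: y' = -2y + 2 - exp(-4t), y(0) = 1, h = 0.1
f = lambda t, y: -2*y + 2 - math.exp(-4*t)
for t, y in euler(f, 0.0, 1.0, 0.1, 2):
    print(t, y)   # y(0.1) = 0.9, y(0.2) ≈ 0.85297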
3.2. The Improved or Modified Euler's method. We write the integral form of the IVP as
dy/dt = f(t, y) ⇐⇒ y(t) = y(t0) + ∫_{t0}^{t} f(s, y(s))ds.
Approximating the integral over [t0, t1] by the trapezium rule gives
y(t1) ≈ y(t0) + (h/2)[f(t0, y(t0)) + f(t1, y(t1))], t1 = t0 + h,
and using Euler's method to approximate y(t1) ≈ y(t0) + hf(t0, y(t0)) inside the bracket:
y(t1) = y(t0) + (h/2)[f(t0, y(t0)) + f(t1, y(t0) + hf(t0, y(t0)))].
Hence the modified Euler scheme is
K1 = hf(t0, y0),
K2 = hf(t1, y0 + K1),
y1 = y0 + (K1 + K2)/2.
In general, the modified Euler scheme is given by
ti+1 = ti + h,
K1 = hf(ti, yi),
K2 = hf(ti+1, yi + K1),
yi+1 = yi + (K1 + K2)/2.
Example 6. Consider
y' + 2y = 2 − e^{−4t}, y(0) = 1.
Taking step size 0.1, find y at t = 0.1 and 0.2 by Euler's method.
Sol.
y' = −2y + 2 − e^{−4t} = f(t, y), y(0) = 1,
f(0, 1) = −2(1) + 2 − 1 = −1.
By Euler's method with step size h = 0.1,
t1 = t0 + h = 0 + 0.1 = 0.1,
y1 = y0 + hf(0, 1) = 1 + 0.1(−1) = 0.9,
∴ y1 = y(0.1) = 0.9.
t2 = t0 + 2h = 0 + 2 × 0.1 = 0.2,
y2 = y1 + hf(0.1, 0.9) = 0.9 + 0.1(−2 × 0.9 + 2 − e^{−0.4}) = 0.9 + 0.1(−0.47032) = 0.852968,
∴ y2 = y(0.2) = 0.852968.
Example 7. For the IVP y' = t + y, y(0) = 1, calculate y on the interval [0, 0.6] with h = 0.2 using the modified Euler's method.
Sol.
y' = t + y = f(t, y), t0 = 0, y0 = 1, h = 0.2, t1 = 0.2,
K1 = hf(t0, y0) = 0.2(1) = 0.2,
K2 = hf(t1, y0 + K1) = 0.2 f(0.2, 1.2) = 0.28,
y1 = y(0.2) = y0 + (K1 + K2)/2 = 1.24.
Similarly we can compute the solution at the other points.
Example 8. Show that the following initial-value problem has a unique solution:
y' = t^{−2}(sin 2t − 2ty), 1 ≤ t ≤ 2, y(1) = 2.
Find y(1.1) and y(1.2) with step size h = 0.1 using the modified Euler's method.
Sol. Here
y' = t^{−2}(sin 2t − 2ty) = f(t, y).
Holding t constant,
|f(t, y1) − f(t, y2)| = |t^{−2}(sin 2t − 2ty1) − t^{−2}(sin 2t − 2ty2)| = (2/|t|)|y1 − y2| ≤ 2|y1 − y2|.
Thus f satisfies a Lipschitz condition in the variable y with Lipschitz constant L = 2. Additionally, f(t, y) is continuous for 1 ≤ t ≤ 2 and −∞ < y < ∞, so the Existence Theorem implies that a unique solution exists.
Now we apply the modified Euler's method with t0 = 1, y0 = 2, h = 0.1, t1 = 1.1:
K1 = hf(t0, y0) = hf(1, 2) = −0.309072,
K2 = hf(t1, y0 + K1) = hf(1.1, 1.690928) = −0.24062,
y1 = y(1.1) = y0 + (1/2)(K1 + K2) = 1.725152.
Next, with y1 = 1.725152, h = 0.1, t2 = 1.2:
K1 = −0.24684,
K2 = −0.19947,
y2 = y(1.2) = 1.50199.
Example 9. Given the initial-value problem
y' = (2/t) y + t^2 e^t, 1 ≤ t ≤ 2, y(1) = 0,
(i) use Euler's method with h = 0.1 to approximate the solution on the interval [1, 1.6];
(ii) use the answers generated in part (i) and linear interpolation to approximate y at t = 1.04 and t = 1.55.
Sol. Given
y' = (2/t) y + t^2 e^t = f(t, y), t0 = 1.0, y(t0) = 0.0, h = 0.1.
By Euler's method, the approximations at successive time levels are given by
y(ti+1) = y(ti) + hf(ti, y(ti)).
∴ y(t1) = y(1.1) = y(1) + hf(1, 0) = 0.0 + 0.1[0.0 + (1.0)^2 e^{1.0}] = 0.271828, with t1 = 1.1.
y(t2) = y(1.2) = 0.271828 + 0.1[(2/1.1)(0.271828) + (1.1)^2 e^{1.1}] = 0.684756,
y(t3) = y(1.3) = 0.684756 + 0.1[(2/1.2)(0.684756) + (1.2)^2 e^{1.2}] = 1.27698.
Similarly,
y(t4) = y(1.4) = 2.09355,
y(t5) = y(1.5) = 3.18745,
y(t6) = y(1.6) = 4.62082.
Now using linear interpolation, approximate values of y can be found as follows:
y(1.04) ≈ [(1.04 − 1.1)/(1.0 − 1.1)] y(1.0) + [(1.04 − 1.0)/(1.1 − 1.0)] y(1.1) = 0.10873120,
y(1.55) ≈ [(1.55 − 1.6)/(1.5 − 1.6)] y(1.5) + [(1.55 − 1.5)/(1.6 − 1.5)] y(1.6) = 3.90413500.
3.3. Runge-Kutta Methods: This is the one of the most important method to solve the IVP. These
techniques were developed around 1900 by the German mathematicians C. Runge and M. W. Kutta.
If we apply Taylor’s Theorem directly then we require that the function have higher-order derivatives.
The class of Runge-Kutta methods does not involve higher-order derivatives which is the advantage of
this class.
Euler’s method is an example of the Runge-Kutta method of first-order and modified Euler’s method
is an example of second-order Runge-Kutta method.
Third-order Runge-Kutta methods: Like-wise modified Euler’s, using Simpson’s rule to approxi-
mate the integral, we obtain the following Runge-Kutta method of order three.
ti+1 = ti + h
K1 = hf (ti , yi )
K2 = hf (ti + h/2, yi + K1 /2)
K3 = hf (ti + h, yi − K1 + 2K2 )
1
yi+1 = yi + (K1 + 4K2 + K3 ).
6
There are several different Runge-Kutta methods of order three. A commonly quoted one is Heun's third-order method, given by
ti+1 = ti + h
yi+1 = yi + (h/4)[f(ti, yi) + 3f(ti + 2h/3, yi + (2h/3) f(ti + h/3, yi + (h/3) f(ti, yi)))].
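Unwinding the nested evaluations, Heun's third-order step can be sketched in Python as follows (my own transcription of the formula above):

    def heun3_step(f, t, y, h):
        # y_{i+1} = y_i + (h/4)(k1 + 3*k3), with the stages nested as above.
        k1 = f(t, y)
        k2 = f(t + h / 3.0, y + (h / 3.0) * k1)
        k3 = f(t + 2.0 * h / 3.0, y + (2.0 * h / 3.0) * k2)
        return y + (h / 4.0) * (k1 + 3.0 * k3)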
Runge-Kutta methods of order three are not much used in practice. The most common Runge-Kutta method in use is the fourth-order method, given as follows.
Fourth-order Runge-Kutta method:
ti+1 = ti + h
K1 = hf (ti , yi )
K2 = hf (ti + h/2, yi + K1 /2)
K3 = hf (ti + h/2, yi + K2 /2)
K4 = hf (ti + h, yi + K3 )
yi+1 = yi + (1/6)(K1 + 2K2 + 2K3 + K4) + O(h^5).
The local truncation error in a Runge-Kutta method is the error committed at each step because of the truncated Taylor series; this error is unavoidable. The fourth-order Runge-Kutta method has a local truncation error of O(h^5).
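The scheme translates directly into code. A minimal Python sketch (the function name rk4 is mine; f(t, y) is assumed to return a float):

    def rk4(f, t0, y0, h, n):
        # Classical fourth-order Runge-Kutta: n steps of size h from (t0, y0).
        t, y = t0, y0
        for _ in range(n):
            K1 = h * f(t, y)
            K2 = h * f(t + h / 2.0, y + K1 / 2.0)
            K3 = h * f(t + h / 2.0, y + K2 / 2.0)
            K4 = h * f(t + h, y + K3)
            y += (K1 + 2.0 * K2 + 2.0 * K3 + K4) / 6.0
            t += h
        return y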
Example 10. Using the fourth-order Runge-Kutta method, solve dy/dt = (y^2 − t^2)/(y^2 + t^2) with y(0) = 1 at t = 0.2 and 0.4.
Sol.
f(t, y) = (y^2 − t^2)/(y^2 + t^2), t0 = 0, y0 = 1, h = 0.2
K1 = hf(t0, y0) = 0.2 f(0, 1) = 0.200
K2 = hf(t0 + h/2, y0 + K1/2) = 0.2 f(0.1, 1.1) = 0.19672
K3 = hf(t0 + h/2, y0 + K2/2) = 0.2 f(0.1, 1.09836) = 0.1967
K4 = hf(t0 + h, y0 + K3) = 0.2 f(0.2, 1.1967) = 0.1891
y1 = y0 + (1/6)(K1 + 2K2 + 2K3 + K4) = 1 + 0.19599 = 1.196
∴ y(0.2) = 1.196.

Now
t1 = t0 + h = 0.2
K1 = hf(t1, y1) = 0.1891
K2 = hf(t1 + h/2, y1 + K1/2) = 0.2 f(0.3, 1.2906) = 0.1795
K3 = hf(t1 + h/2, y1 + K2/2) = 0.2 f(0.3, 1.2858) = 0.1793
K4 = hf(t1 + h, y1 + K3) = 0.2 f(0.4, 1.3753) = 0.1688
y2 = y(0.4) = y1 + (1/6)(K1 + 2K2 + 2K3 + K4) = 1.196 + 0.1792 = 1.3752.
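Using the rk4 sketch given after the method above, this example can be checked in a couple of lines (illustrative usage):

    f = lambda t, y: (y**2 - t**2) / (y**2 + t**2)
    print(rk4(f, 0.0, 1.0, 0.2, 1))   # ≈ 1.196
    print(rk4(f, 0.0, 1.0, 0.2, 2))   # ≈ 1.3752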

4. Numerical solution of systems and second-order equations
The Euler and Runge-Kutta methods can be applied to find numerical solutions of systems of differential equations. A second-order equation can be converted into a system of first-order differential equations. The application of these numerical methods is illustrated in the following examples.

Example 11. Solve the following system:
dx/dt = 3x − 2y
dy/dt = 5x − 4y
x(0) = 3, y(0) = 6.
Find the solution by Euler's method at t = 0.1 and t = 0.2, taking time increment h = 0.1.

Sol. Given t0 = 0, x0 = 3, y0 = 6, h = 0.1.
Write f(t, x, y) = 3x − 2y, g(t, x, y) = 5x − 4y.
By Euler's method,
x1 = x(0.1) = x0 + hf(t0, x0, y0) = 3 + 0.1(3 × 3 − 2 × 6) = 2.7
y1 = y(0.1) = y0 + hg(t0, x0, y0) = 6 + 0.1(5 × 3 − 4 × 6) = 5.1.
Similarly,
x2 = x(0.2) = x1 + hf(t1, x1, y1) = 2.7 + 0.1(3 × 2.7 − 2 × 5.1) = 2.49
y2 = y(0.2) = y1 + hg(t1, x1, y1) = 5.1 + 0.1(5 × 2.7 − 4 × 5.1) = 4.41.
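Euler's method for a system simply advances every component using the old values. A short Python sketch of this example (the names are my own):

    def euler_system_step(x, y, h):
        # dx/dt = 3x - 2y, dy/dt = 5x - 4y; both slopes use the current values.
        return x + h * (3 * x - 2 * y), y + h * (5 * x - 4 * y)

    x, y = 3.0, 6.0
    for _ in range(2):
        x, y = euler_system_step(x, y, 0.1)
    print(x, y)   # ≈ 2.49, 4.41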

Example 12. Solve the following system
dy/dx = 1 + xz, dz/dx = −xy
for x = 0.3 using the fourth-order Runge-Kutta method, given y(0) = 0, z(0) = 1.

Sol. Given
dy/dx = 1 + xz = f(x, y, z), dz/dx = −xy = g(x, y, z)
x0 = 0, y0 = 0, z0 = 1, h = 0.3
K1 = hf(x0, y0, z0) = 0.3 f(0, 0, 1) = 0.3
L1 = hg(x0, y0, z0) = 0.3 g(0, 0, 1) = 0
K2 = hf(x0 + h/2, y0 + K1/2, z0 + L1/2) = 0.3 f(0.15, 0.15, 1) = 0.345
L2 = hg(x0 + h/2, y0 + K1/2, z0 + L1/2) = −0.00675

K3 = hf(x0 + h/2, y0 + K2/2, z0 + L2/2) = 0.34485
L3 = hg(x0 + h/2, y0 + K2/2, z0 + L2/2) = −0.007762
K4 = hf(x0 + h, y0 + K3, z0 + L3) = 0.3893
L4 = hg(x0 + h, y0 + K3, z0 + L3) = −0.03104.
Hence
y1 = y(0.3) = y0 + (1/6)(K1 + 2K2 + 2K3 + K4) = 0.34483
z1 = z(0.3) = z0 + (1/6)(L1 + 2L2 + 2L3 + L4) = 0.9899.
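All of the (K_j, L_j) bookkeeping above is exactly what a vector-valued RK4 does componentwise. A sketch using numpy arrays (my own formulation; F returns the vector of right-hand sides):

    import numpy as np

    def rk4_system(F, x0, Y0, h, n):
        # Classical RK4 where Y is a numpy array; one K-vector replaces (K_j, L_j).
        x, Y = x0, np.asarray(Y0, dtype=float)
        for _ in range(n):
            K1 = h * F(x, Y)
            K2 = h * F(x + h / 2.0, Y + K1 / 2.0)
            K3 = h * F(x + h / 2.0, Y + K2 / 2.0)
            K4 = h * F(x + h, Y + K3)
            Y = Y + (K1 + 2.0 * K2 + 2.0 * K3 + K4) / 6.0
            x += h
        return Y

    # Example 12: Y = [y, z] with dy/dx = 1 + xz, dz/dx = -xy.
    F = lambda x, Y: np.array([1.0 + x * Y[1], -x * Y[0]])
    print(rk4_system(F, 0.0, [0.0, 1.0], 0.3, 1))   # ≈ [0.3448, 0.9900]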
Example 13. Consider the following Lotka-Volterra system, in which u is the number of prey and v is the number of predators:
du/dt = 2u − uv, u(0) = 1.5
dv/dt = −9v + 3uv, v(0) = 1.5.
Use the fourth-order Runge-Kutta method with step-size h = 0.2 to approximate the solution at t = 0.2.
Sol.
du/dt = 2u − uv = f(t, u, v)
dv/dt = −9v + 3uv = g(t, u, v),
u0 = 1.5, v0 = 1.5, h = 0.2.
K1 = hf(t0, u0, v0) = 0.15
L1 = hg(t0, u0, v0) = −1.35
K2 = hf(t0 + h/2, u0 + K1/2, v0 + L1/2) = 0.370125
L2 = hg(t0 + h/2, u0 + K1/2, v0 + L1/2) = −0.7054
K3 = hf(t0 + h/2, u0 + K2/2, v0 + L2/2) = 0.2874
L3 = hg(t0 + h/2, u0 + K2/2, v0 + L2/2) = −0.9052
K4 = hf(t0 + h, u0 + K3, v0 + L3) = 0.5023
L4 = hg(t0 + h, u0 + K3, v0 + L3) = −0.4348.
Therefore
u(0.2) = 1.5 + (1/6)(0.15 + 2 × 0.370125 + 2 × 0.2874 + 0.5023) = 1.8279
v(0.2) = 1.5 + (1/6)(−1.35 − 2 × 0.7054 − 2 × 0.9052 − 0.4348) = 0.6657.
Example 14. Solve by the fourth-order Runge-Kutta method for x = 0.2:
d^2y/dx^2 = x (dy/dx)^2 − y^2, y(0) = 1, y'(0) = 0.
Sol. Let
dy/dx = z = f(x, y, z).
Therefore
dz/dx = xz^2 − y^2 = g(x, y, z).

Now
x0 = 0, y0 = 1, z0 = 0, h = 0.2
K1 = hf(x0, y0, z0) = 0.0
L1 = hg(x0, y0, z0) = −0.2
K2 = hf(x0 + h/2, y0 + K1/2, z0 + L1/2) = −0.02
L2 = hg(x0 + h/2, y0 + K1/2, z0 + L1/2) = −0.1998
K3 = hf(x0 + h/2, y0 + K2/2, z0 + L2/2) = −0.02
L3 = hg(x0 + h/2, y0 + K2/2, z0 + L2/2) = −0.1958
K4 = hf(x0 + h, y0 + K3, z0 + L3) = −0.0392
L4 = hg(x0 + h, y0 + K3, z0 + L3) = −0.1905.

Hence
y1 = y(0.2) = y0 + (1/6)(K1 + 2K2 + 2K3 + K4) = 0.9801
z1 = y'(0.2) = z0 + (1/6)(L1 + 2L2 + 2L3 + L4) = −0.1970.
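The same rk4_system sketch given after Example 12 handles this reduction directly (illustrative usage; the state vector is Y = [y, z]):

    F = lambda x, Y: np.array([Y[1], x * Y[1]**2 - Y[0]**2])
    print(rk4_system(F, 0.0, [1.0, 0.0], 0.2, 1))   # ≈ [0.9801, -0.1970]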
Example 15. The motion of a swinging pendulum is described by the second-order differential equation
d^2θ/dt^2 + (g/L) sin θ = 0, θ(0) = π/6, θ'(0) = 0,
where θ is the angle with the vertical at time t, the length of the pendulum is L = 2 ft, and g = 32.17 ft/s^2. With h = 0.1 s, find the angle θ at t = 0.1 using the fourth-order Runge-Kutta method.

Sol. First we convert the given second-order initial value problem into a system of simultaneous first-order initial value problems. Setting dθ/dt = y, we obtain the system:
dθ/dt = y = f(t, θ, y), θ(0) = π/6
dy/dt = −(g/L) sin θ = g(t, θ, y), y(0) = 0.
Here t0 = 0, θ0 = π/6, and y0 = 0. By the fourth-order Runge-Kutta method with h = 0.1:

K1 = hf (t0 , θ0 , y0 ) = 0.00000000
L1 = hg(t0 , θ0 , y0 ) = −0.80425000
K2 = hf (t0 + 0.5h, θ0 + 0.5K1 , y0 + 0.5L1 ) = −0.04021250
L2 = hg(t0 + 0.5h, θ0 + 0.5K1 , y0 + 0.5L1 ) = −0.80425000
K3 = hf (t0 + 0.5h, θ0 + 0.5K2 , y0 + 0.5L2 ) = −0.04021250
L3 = hg(t0 + 0.5h, θ0 + 0.5K2 , y0 + 0.5L2 ) = −0.77608129
K4 = hf (t0 + h, θ0 + K3 , y0 + L3 ) = −0.07760813
L4 = hg(t0 + h, θ0 + K3 , y0 + L3 ) = −0.74759884.
θ1 = θ0 + (1/6)(K1 + 2K2 + 2K3 + K4) = 0.48385575.
Therefore, θ(0.1) ≈ θ1 = 0.48385575.

Exercises
(1) Show that each of the following initial-value problems (IVP) has a unique solution, and find
the solution.
a. y' = y cos t, 0 ≤ t ≤ 1, y(0) = 1.    b. y' = (2/t) y + t^2 e^t, 1 ≤ t ≤ 2, y(1) = 0.
(2) Apply Picard's method to generate y0(t), y1(t), y2(t), and y3(t) for the initial-value problem
y' = −y + t + 1, 0 ≤ t ≤ 1, y(0) = 1.
(3) Consider the following initial-value problem:
x' = t(x + t) − 2, x(0) = 2.
Use the Euler method with step size h = 0.2 to compute x(0.6).
(4) Given the initial-value problem
y' = 1/t^2 − y/t − y^2, 1 ≤ t ≤ 2, y(1) = −1,
with exact solution y(t) = −1/t:
a. Use Euler’s method with h = 0.05 to approximate the solution, and compare it with the
actual values of y.
b. Use the answers generated in part (a) and linear interpolation to approximate the following
values of y, and compare them to the actual values.
i. y(1.052) ii. y(1.555) iii. y(1.978).
(5) Solve the following IVP by the second-order Runge-Kutta method:
y' = −y + 2 cos t, y(0) = 1.
Compute y(0.2), y(0.4), and y(0.6) with mesh length 0.2.
(6) Compute solutions to the following problems with a second-order Taylor method. Use step size h = 0.2.
a. y' = (cos y)^2, 0 ≤ x ≤ 1, y(0) = 0.    b. y' = 20/(1 + 19e^{−x/4}), 0 ≤ x ≤ 1, y(0) = 1.
(7) A projectile of mass m = 0.11 kg shot vertically upward with initial velocity v(0) = 8 m/s is slowed due to the force of gravity, Fg = −mg, and due to air resistance, Fr = −kv|v|, where g = 9.8 m/s^2 and k = 0.002 kg/m. The differential equation for the velocity v is given by
mv' = −mg − kv|v|.
a. Find the velocity after 0.1, 0.2, · · · , 1.0 s.
b. To the nearest tenth of a second, determine when the projectile reaches its maximum height
and begins falling.
(8) Use the fourth-order Runge-Kutta method to solve the IVP at x = 0.8 for
dy/dx = √(x + y), y(0.4) = 0.41
with step length h = 0.2.
(9) Water flows from an inverted conical tank with circular orifice at the rate
dx/dt = −0.6πr^2 √(2g) (√x / A(x)),
where r is the radius of the orifice, x is the height of the liquid level from the vertex of the cone, and A(x) is the area of the cross section of the tank x units above the orifice. Suppose r = 0.1 ft, g = 32.1 ft/s^2, and the tank has an initial water level of 8 ft and initial volume of 512(π/3) ft^3. Use the Runge-Kutta method of order four to find the following.
a. The water level after 10 min with h = 20 s.
b. When the tank will be empty, to within 1 min.

(10) The following system represents a much simplified model of nerve cells:
dx/dt = x + y − x^3, x(0) = 0.5
dy/dt = −x/2, y(0) = 0.1
where x(t) represents the voltage across the boundary of the nerve cell and y(t) is the permeability of the cell wall at time t. Solve this system using the fourth-order Runge-Kutta method to generate the profile up to t = 0.2 with step size 0.1.
(11) Use the Runge-Kutta method of order four to solve
y'' − 3y' + 2y = 6e^{−t}, 0 ≤ t ≤ 1, y(0) = y'(0) = 2
for t = 0.2 with step size 0.2.

Bibliography
[Burden] Richard L. Burden, J. Douglas Faires and Annette Burden, “Numerical Analysis,” Cengage
Learning, 10th edition, 2015.
[Atkinson] K. Atkinson and W. Han, “Elementary Numerical Analysis,” John Wiley and Sons, 3rd edition, 2004.

Appendix A. Algorithms
Algorithm for second-order Runge-Kutta method:
for i = 0, 1, 2, … do
  ti+1 = ti + h = t0 + (i + 1)h
  K1 = hf(ti, yi)
  K2 = hf(ti+1, yi + K1)
  yi+1 = yi + (1/2)(K1 + K2)
end for
Algorithm for fourth-order Runge-Kutta method:
for i = 0, 1, 2, … do
  ti+1 = ti + h
  K1 = hf(ti, yi)
  K2 = hf(ti + h/2, yi + K1/2)
  K3 = hf(ti + h/2, yi + K2/2)
  K4 = hf(ti+1, yi + K3)
  yi+1 = yi + (1/6)(K1 + 2K2 + 2K3 + K4)
end for
