MATH3230B Numerical Analysis
Tutorial 8
1 Recall:
1. Least-squares solution for general non-square linear systems:
Let $A$ be an $m \times n$ matrix with $m > n$, and consider the general linear system
\[ Ax = b. \]
The least-squares solution seeks a vector $x$ that minimizes the error $Ax - b$ in the least-squares sense, that is,
\[ \min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2. \]
We assume that the columns of $A$ are linearly independent. We define
\[ f(x) = \|Ax - b\|_2^2. \]
The minimizer of $f(x)$ satisfies the normal equation
\[ A^T A x = A^T b. \]
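As an illustration of the normal equation, the following sketch fits a straight line to three data points. The matrix, the right-hand side, and the helper names (`transpose`, `matmul`, `solve`) are hypothetical choices made for this example and are not part of the tutorial. Exact `Fraction` arithmetic is used so that the orthogonality of the residual to the columns of $A$ can be checked without rounding.

```python
from fractions import Fraction

def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination in exact arithmetic."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for k in range(n):
        # Swap in a row with a nonzero pivot (M is assumed nonsingular).
        pivot_row = next(i for i in range(k, n) if aug[i][k] != 0)
        aug[k], aug[pivot_row] = aug[pivot_row], aug[k]
        pivot = aug[k][k]
        aug[k] = [v / pivot for v in aug[k]]
        for i in range(n):
            if i != k:
                factor = aug[i][k]
                aug[i] = [vi - factor * vk for vi, vk in zip(aug[i], aug[k])]
    return [row[n] for row in aug]

# Hypothetical overdetermined system: fit y = x0 + x1 * t at t = 0, 1, 2.
A = [[1, 0], [1, 1], [1, 2]]
b = [1, 2, 2]

At = transpose(A)
AtA = matmul(At, A)                                      # A^T A
Atb = [row[0] for row in matmul(At, [[v] for v in b])]   # A^T b
x = solve(AtA, Atb)                                      # normal equation

# The residual A x - b is orthogonal to every column of A.
residual = [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]
normal_residual = [sum(a * r for a, r in zip(col, residual)) for col in At]

print(x)                # [Fraction(7, 6), Fraction(1, 2)]
print(normal_residual)  # [Fraction(0, 1), Fraction(0, 1)]
```

Forming $A^T A$ explicitly is fine for a small, well-conditioned example like this one; it is used here only because it mirrors the normal equation directly.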
2. Floating point error
Recall that most computers adopt the binary system. A machine number is a string of bits whose value is decoded as the following normalized floating-point number:
\[ a = (-1)^s \, q \times 2^{(-1)^p \cdot m}. \]
For the 32-bit floating-point binary storage format (also called single precision), we have
1 bit for the sign ($s$), 8 bits for the exponent (1 bit for $p$, 7 bits for $m$), and 23 bits for the mantissa.
For the 64-bit floating-point binary storage format (also called double precision), we have
1 bit for the sign ($s$), 11 bits for the exponent (1 bit for $p$, 10 bits for $m$), and 52 bits for the mantissa.
During the decimal-binary conversion, small roundoff errors occur.
Rounding is also commonly adopted in scientific computing. If $x$ is rounded to $\tilde{x}$ with $n$ digits after the decimal point, we have the error estimate
\[ |x - \tilde{x}| \le \frac{1}{2} \times 10^{-n}. \]
In addition to rounding the input, rounding is also needed after most arithmetic operations. The relative roundoff error is at most $2^{-24}$ (32-bit) or $2^{-53}$ (64-bit); this bound is called the unit roundoff error. Together, these rules form floating-point arithmetic.
Given a real number $x$, let $fl(x)$ be the floating-point representation of $x$, which means
\[ \left| \frac{fl(x) - x}{x} \right| \le 2^{-\beta} := \epsilon_m, \]
where $\epsilon_m$ is the machine precision (machine unit roundoff error). Then we can write
\[ fl(x) = x(1 + \epsilon) \]
with $|\epsilon| \le \epsilon_m$.
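The bound on the relative error can be checked numerically. The sketch below assumes that Python's `float` is a 64-bit double, which holds on mainstream platforms, and emulates 32-bit storage by packing a value into 4 bytes with the standard `struct` module. The sample values and the helper names `fl32`, `fl64`, and `relative_error` are hypothetical choices for this illustration.

```python
import struct
from fractions import Fraction

def fl64(x):
    """Nearest double to the exact rational x, returned as an exact Fraction."""
    return Fraction(float(x))

def fl32(x):
    """Nearest single to x, obtained by round-tripping through 4 bytes.

    x is first rounded to a double; for the samples below this intermediate
    step does not change which single-precision number is selected.
    """
    single = struct.unpack("<f", struct.pack("<f", float(x)))[0]
    return Fraction(single)

def relative_error(x, approx):
    """Exact relative error |approx - x| / |x|."""
    return abs(approx - x) / abs(x)

EPS32 = Fraction(1, 2**24)   # unit roundoff, 32-bit
EPS64 = Fraction(1, 2**53)   # unit roundoff, 64-bit

# None of these rationals has a finite binary expansion.
samples = [Fraction(1, 10), Fraction(1, 3), Fraction(2, 7)]
errors32 = [relative_error(x, fl32(x)) for x in samples]
errors64 = [relative_error(x, fl64(x)) for x in samples]

for x, e32, e64 in zip(samples, errors32, errors64):
    print(x, float(e32), float(e64), e32 <= EPS32, e64 <= EPS64)
```

Working with `Fraction` keeps the error computation itself exact, so the comparison against $2^{-24}$ and $2^{-53}$ is not affected by further rounding.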
2 Exercise:
1. Consider the following tridiagonal matrix
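\[
A = \begin{pmatrix}
b_1 & c_1 & & & \\
a_2 & b_2 & c_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & a_{n-1} & b_{n-1} & c_{n-1} \\
 & & & a_n & b_n
\end{pmatrix}
\]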
(a) What is the LU factorization of A?
(b) Using the result of (a), solve the following linear system
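\[
\begin{pmatrix}
4 & -1 & & & \\
-1 & 4 & -1 & & \\
 & -1 & 4 & -1 & \\
 & & -1 & 4 & -1 \\
 & & & -1 & 4
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}
=
\begin{pmatrix} 2 \\ 6 \\ 0 \\ 6 \\ 2 \end{pmatrix}
\]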
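Solution. (a) By direct LU decomposition, we have
\[
A =
\begin{pmatrix}
1 & & & & 0 \\
l_2 & 1 & & & \\
 & l_3 & 1 & & \\
 & & \ddots & \ddots & \\
0 & & & l_n & 1
\end{pmatrix}
\begin{pmatrix}
b_1 & c_1 & & & \\
 & v_2 & c_2 & & \\
 & & \ddots & \ddots & \\
 & & & v_{n-1} & c_{n-1} \\
 & & & & v_n
\end{pmatrix}
\]
where $v_1 = b_1$ and
\[ l_k = \frac{a_k}{v_{k-1}}, \qquad v_k = b_k - l_k c_{k-1}, \qquad k = 2, \ldots, n. \]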
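(b) By LU factorization, we have
\[
L = \begin{pmatrix}
1 & & & & \\
-\frac{1}{4} & 1 & & & \\
 & -\frac{4}{15} & 1 & & \\
 & & -\frac{15}{56} & 1 & \\
 & & & -\frac{56}{209} & 1
\end{pmatrix}
\]
and
\[
U = \begin{pmatrix}
4 & -1 & & & \\
 & \frac{15}{4} & -1 & & \\
 & & \frac{56}{15} & -1 & \\
 & & & \frac{209}{56} & -1 \\
 & & & & \frac{780}{209}
\end{pmatrix}
\]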
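Solving
\[
L \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}
= \begin{pmatrix} 2 \\ 6 \\ 0 \\ 6 \\ 2 \end{pmatrix}
\]
we have
\[
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}
= \begin{pmatrix} 2 \\ \frac{13}{2} \\ \frac{26}{15} \\ \frac{181}{28} \\ \frac{780}{209} \end{pmatrix}
\]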
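Then solving
\[
U \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}
= \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}
\]
we have
\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix}
= \begin{pmatrix} 1 \\ 2 \\ 1 \\ 2 \\ 1 \end{pmatrix}
\]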
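The recurrences from part (a) and the two triangular solves from part (b) can be sketched in a few lines. The function names `tridiagonal_lu` and `solve_tridiagonal` and the list-based storage of the three diagonals are assumptions made for this illustration. No pivoting is performed, so every $v_k$ is assumed to be nonzero, and exact `Fraction` arithmetic is used so that the intermediate values can be compared with the fractions above.

```python
from fractions import Fraction

def tridiagonal_lu(a, b, c):
    """Factor a tridiagonal matrix as A = L U without pivoting.

    a: subdiagonal   a_2..a_n     (length n - 1)
    b: diagonal      b_1..b_n     (length n)
    c: superdiagonal c_1..c_{n-1} (length n - 1)
    Returns (l, v): the subdiagonal of L and the diagonal of U.
    """
    v = [Fraction(b[0])]                      # v_1 = b_1
    l = []
    for k in range(1, len(b)):
        lk = Fraction(a[k - 1]) / v[k - 1]    # l_k = a_k / v_{k-1}
        l.append(lk)
        v.append(b[k] - lk * c[k - 1])        # v_k = b_k - l_k c_{k-1}
    return l, v

def solve_tridiagonal(a, b, c, rhs):
    """Solve A x = rhs using the factorization above."""
    l, v = tridiagonal_lu(a, b, c)
    n = len(b)
    # Forward substitution: L y = rhs.
    y = [Fraction(rhs[0])]
    for k in range(1, n):
        y.append(rhs[k] - l[k - 1] * y[k - 1])
    # Back substitution: U x = y.
    x = [Fraction(0)] * n
    x[-1] = y[-1] / v[-1]
    for k in range(n - 2, -1, -1):
        x[k] = (y[k] - c[k] * x[k + 1]) / v[k]
    return x, y, l, v

# The 5 x 5 system from part (b).
sub = [-1, -1, -1, -1]
diag = [4, 4, 4, 4, 4]
sup = [-1, -1, -1, -1]
rhs = [2, 6, 0, 6, 2]

x, y, l, v = solve_tridiagonal(sub, diag, sup, rhs)
print("v =", [str(t) for t in v])   # diagonal of U
print("y =", [str(t) for t in y])   # solution of L y = b
print("x =", [str(t) for t in x])   # solution of U x = y
```

Because only the three diagonals are stored and each loop runs once over the rows, the cost grows linearly with $n$, in contrast to the cubic cost of a dense LU factorization.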
2. (a) Assume that the matrix $A \in \mathbb{R}^{m \times n}$ has full rank, so that we can find the least-squares solution $x^*$ which minimizes the energy
\[ \frac{1}{2} \|Ax - b\|_2^2. \]
Prove that solving for $x^*$ is equivalent to solving the equation
\[ A^T A x = A^T b. \]
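(b) Solve the following linear system in the least-squares sense using the Cholesky factorization:
\[
Ax = b, \qquad
A = \begin{pmatrix} 1 & -1 & 1 \\ -1 & 1 & 1 \\ 0 & 1 & -1 \\ 0 & 1 & -1 \end{pmatrix}, \qquad
b = \begin{pmatrix} 2 \\ 1 \\ 1 \\ -2 \end{pmatrix}.
\]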
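Solution. (a) Let
\[ F(x) = \frac{1}{2} \|Ax - b\|_2^2 = \frac{1}{2} x^T A^T A x - x^T A^T b + \frac{1}{2} b^T b. \]
Then
\[ \nabla F(x) = A^T A x - A^T b. \]
Since $A$ has full (column) rank, the Hessian $A^T A$ is symmetric positive definite, so $F$ is strictly convex and its minimum is attained exactly where the gradient vanishes:
\[ \nabla F(x) = 0 \iff A^T A x = A^T b. \]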
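(b) Note that
\[
A^T A = \begin{pmatrix} 2 & -2 & 0 \\ -2 & 4 & -2 \\ 0 & -2 & 4 \end{pmatrix}.
\]
Then
\[
A^T A =
\begin{pmatrix} \sqrt{2} & -\sqrt{2} & 0 \\ 0 & \sqrt{2} & -\sqrt{2} \\ 0 & 0 & \sqrt{2} \end{pmatrix}^T
\begin{pmatrix} \sqrt{2} & -\sqrt{2} & 0 \\ 0 & \sqrt{2} & -\sqrt{2} \\ 0 & 0 & \sqrt{2} \end{pmatrix}
= R^T R.
\]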
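First we solve $R^T y = A^T b$, where $A^T b = (1, -2, 4)^T$. We have
\[
y = \begin{pmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \\ \frac{3}{\sqrt{2}} \end{pmatrix},
\]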
then we solve $Rx = y$, and we have
\[
x = \begin{pmatrix} \frac{3}{2} \\ 1 \\ \frac{3}{2} \end{pmatrix}.
\]
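The computation in part (b) can be reproduced numerically. The sketch below is a minimal floating-point version under the assumption that $A^T A$ is symmetric positive definite; the helper names `cholesky_upper`, `forward_solve`, and `back_solve` are hypothetical. It forms $A^T A$ and $A^T b$, computes the upper-triangular factor $R$ with $A^T A = R^T R$, and then solves $R^T y = A^T b$ followed by $Rx = y$.

```python
import math

def cholesky_upper(M):
    """Return upper-triangular R with M = R^T R (M symmetric positive definite)."""
    n = len(M)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        s = M[i][i] - sum(R[k][i] ** 2 for k in range(i))
        R[i][i] = math.sqrt(s)
        for j in range(i + 1, n):
            t = M[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = t / R[i][i]
    return R

def forward_solve(R, rhs):
    """Solve R^T y = rhs, where R^T is lower triangular."""
    y = []
    for i in range(len(rhs)):
        y.append((rhs[i] - sum(R[k][i] * y[k] for k in range(i))) / R[i][i])
    return y

def back_solve(R, y):
    """Solve R x = y, where R is upper triangular."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))) / R[i][i]
    return x

A = [[1, -1, 1], [-1, 1, 1], [0, 1, -1], [0, 1, -1]]
b = [2, 1, 1, -2]

cols = list(zip(*A))                                              # columns of A
AtA = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
Atb = [sum(p * q for p, q in zip(ci, b)) for ci in cols]

R = cholesky_upper(AtA)
y = forward_solve(R, Atb)
x = back_solve(R, y)
print(AtA, Atb)
print(x)   # approximately [1.5, 1.0, 1.5]
```

Since the result is computed in floating point, the entries of `x` agree with $(3/2, 1, 3/2)$ only up to rounding.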
3. Let p = 0.54617 and q = 0.54601. Use four-digit arithmetic to approximate r = p − q and determine the
relative errors using
(a) rounding approximation;
(b) chopping approximation.
Also state the number of significant digits of the approximations in (a) and (b), respectively. (Hint: consider the relative error.)
Solution. The exact value of r = p − q is r = 0.00016.
(a) Rounding approximation: $p^* = 0.5462$ and $q^* = 0.5460$, so
$r^* = p^* - q^* = 0.0002$. The relative error is
\[ \frac{|r - r^*|}{|r|} = \frac{|0.00016 - 0.0002|}{|0.00016|} = 0.25, \]
so the result has only one significant digit.
(b) Chopping approximation: $p^* = 0.5461$ and $q^* = 0.5460$, so
$r^* = p^* - q^* = 0.0001$. The relative error is
\[ \frac{|r - r^*|}{|r|} = \frac{|0.00016 - 0.0001|}{|0.00016|} = 0.375, \]
so the result has only one significant digit.
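Both cases can be replayed with Python's standard `decimal` module, which supports a fixed number of significant digits and a selectable rounding mode. This is a sketch under the assumption that `ROUND_HALF_UP` models the rounding rule and `ROUND_DOWN` models chopping; the function name `four_digit_difference` is hypothetical.

```python
from decimal import Decimal, Context, ROUND_HALF_UP, ROUND_DOWN

def four_digit_difference(p, q, rounding):
    """Store p and q with 4 significant digits, then subtract in 4-digit arithmetic."""
    ctx = Context(prec=4, rounding=rounding)
    p_star = ctx.create_decimal(p)   # applies the context precision and rounding
    q_star = ctx.create_decimal(q)
    return ctx.subtract(p_star, q_star)

p, q = "0.54617", "0.54601"
r = Decimal(p) - Decimal(q)          # exact difference, 0.00016

r_round = four_digit_difference(p, q, ROUND_HALF_UP)
r_chop = four_digit_difference(p, q, ROUND_DOWN)

rel_round = abs(r - r_round) / abs(r)
rel_chop = abs(r - r_chop) / abs(r)

print(r_round, rel_round)   # 0.0002 0.25
print(r_chop, rel_chop)     # 0.0001 0.375
```

The inputs are accurate to four digits, yet the difference keeps only one: subtracting nearly equal numbers cancels the leading digits and promotes the input rounding error to the leading digit of the result.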
4. Recall that most computers adopt the binary system. A machine number is a string of bits whose value is decoded as the following normalized floating-point number:
\[ a = (-1)^s \, q \times 2^{(-1)^p \cdot m}, \tag{1} \]
where $s, p \in \{0, 1\}$, $m$ is the 7-bit exponent, and $q = (1.f)_2$ with $f$ being the 23-bit fractional part.
(a) Find the smallest and second smallest positive numbers of the form (1).
(b) Find the largest and second largest numbers of the form (1).
(c) Suppose that $x$ is a real number. If $x$ is rounded to $\tilde{x}$ with $n$ digits after the decimal point, show that
\[ |x - \tilde{x}| \le \frac{1}{2} \times 10^{-n}. \]
Solution. (a) Put $s = 0$, $p = 1$, $f = 000\ldots000$ and $m = 1111111$;
the smallest positive number is $2^{-127}$.
Put $s = 0$, $p = 1$, $f = 000\ldots001$ and $m = 1111111$;
the second smallest positive number is
\[ (1.000\ldots001)_2 \times 2^{-127} = (1 + 2^{-23}) \times 2^{-127}. \]
(b) Put $s = 0$, $p = 0$, $f = 111\ldots111$ and $m = 1111111$; the largest number is
\[ (1.111\ldots111)_2 \times 2^{127} = (2 - 2^{-23}) \times 2^{127}. \]
Put $s = 0$, $p = 0$, $f = 111\ldots110$ and $m = 1111111$; the second largest number is
\[ (1.111\ldots110)_2 \times 2^{127} = (2 - 2^{-22}) \times 2^{127}. \]
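(c) Without loss of generality, take $x \ge 0$. If the $(n+1)$th digit of $x$ after the decimal point is 0, 1, 2, 3, or 4, then $x = \tilde{x} + \epsilon$ with $0 \le \epsilon < 0.5 \times 10^{-n}$.
If the $(n+1)$th digit of $x$ is 5, 6, 7, 8, or 9, then $\tilde{x} = \hat{x} + 10^{-n}$, where $\hat{x}$ is the number with the same first $n$ digits after the decimal point as $x$ and all digits beyond the $n$th equal to 0. So we have
\[ x = \hat{x} + \delta \times 10^{-n} \]
with $0.5 \le \delta < 1$, and
\[ \tilde{x} - x = (1 - \delta) \times 10^{-n} \le \frac{1}{2} \times 10^{-n}. \]
In both cases $|x - \tilde{x}| \le \frac{1}{2} \times 10^{-n}$, and the result follows.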
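The extreme values found in parts (a) and (b) can be confirmed by decoding the corresponding bit fields exactly. The sketch below models only the format (1) of this exercise, not the IEEE 754 layout used by real hardware. The function name `decode` and the integer encoding of $f$ and $m$ are assumptions made for this illustration.

```python
from fractions import Fraction

FRACTION_BITS = 23   # bits of f
EXPONENT_BITS = 7    # bits of m

def decode(s, p, m, f):
    """Value (-1)^s * (1.f)_2 * 2^((-1)^p * m) of format (1), as an exact Fraction.

    s, p are single bits; m and f are given as non-negative integers.
    """
    q = 1 + Fraction(f, 2 ** FRACTION_BITS)   # q = (1.f)_2
    exponent = (-1) ** p * m
    return (-1) ** s * q * Fraction(2) ** exponent

M_MAX = 2 ** EXPONENT_BITS - 1   # m = 1111111 in binary, i.e. 127
F_MAX = 2 ** FRACTION_BITS - 1   # f = 111...111 in binary

smallest = decode(0, 1, M_MAX, 0)
second_smallest = decode(0, 1, M_MAX, 1)
largest = decode(0, 0, M_MAX, F_MAX)
second_largest = decode(0, 0, M_MAX, F_MAX - 1)

print(smallest == Fraction(1, 2 ** 127))
print(second_smallest == (1 + Fraction(1, 2 ** 23)) * Fraction(1, 2 ** 127))
print(largest == (2 - Fraction(1, 2 ** 23)) * 2 ** 127)
print(second_largest == (2 - Fraction(1, 2 ** 22)) * 2 ** 127)
```

Exact rational arithmetic is used because $2^{-127}$ is smaller than the smallest normalized single-precision value, so comparing such quantities in hardware floating point could be misleading.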