
Chapter 13
Block LU Factorization

Block algorithms are advantageous for at least two important reasons.
First, they work with blocks of data having $b^2$ elements,
performing $O(b^3)$ operations.
The $O(b)$ ratio of work to storage means that
processing elements with an $O(b)$ ratio of
computing speed to input/output bandwidth can be tolerated.
Second, these algorithms are usually rich in matrix multiplication.
This is an advantage because
nearly every modern parallel machine is good at matrix multiplication.
— ROBERT S. SCHREIBER, Block Algorithms for Parallel Machines (1988)

It should be realized that, with partial pivoting,
any matrix has a triangular factorization.
DECOMP actually works faster when zero pivots occur because they mean that
the corresponding column is already in triangular form.
— GEORGE E. FORSYTHE, MICHAEL A. MALCOLM, and CLEVE B. MOLER,
Computer Methods for Mathematical Computations (1977)

It was quite usual when dealing with very large matrices to
perform an iterative process as follows:
the original matrix would be read from cards and the reduced matrix punched
without more than a single row of the original matrix
being kept in store at any one time;
then the output hopper of the punch would be
transferred to the card reader and the iteration repeated.
— MARTIN CAMPBELL-KELLY, Programming the Pilot ACE (1981)


13.1. Block Versus Partitioned LU Factorization


As we noted in Chapter 9 (Notes and References), Gaussian elimination (GE) comprises three nested loops that can be ordered in six ways, each yielding a different
algorithmic variant of the method. These variants involve different computational
kernels: inner product and saxpy operations (level-1 BLAS), or outer product and
gaxpy operations (level-2 BLAS). To introduce matrix–matrix operations (level-3
BLAS), which are beneficial for high-performance computing, further manipula-
tion beyond loop reordering is needed. We will use the following terminology,
which emphasises an important distinction.
A partitioned algorithm is a scalar (or point) algorithm in which the operations
have been grouped and reordered into matrix operations.
A block algorithm is a generalization of a scalar algorithm in which the basic
scalar operations become matrix operations ($\alpha + \beta$, $\alpha\beta$, and $\alpha/\beta$ become $A + B$,
$AB$, and $AB^{-1}$), and a matrix property based on the nonzero structure becomes
the corresponding property blockwise (in particular, the scalars 0 and 1 become
the zero matrix and the identity matrix, respectively). A block factorization is
defined in a similar way and is usually what a block algorithm computes.
A partitioned version of the outer product form of LU factorization may be
developed as follows. For $A \in \mathbb{R}^{n\times n}$ and a given block size $r$, write
$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I_{n-r} \end{bmatrix} \begin{bmatrix} I_r & 0 \\ 0 & S \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & I_{n-r} \end{bmatrix}, \qquad (13.1)$$
where $A_{11}$ is $r \times r$.

Algorithm 13.1 (partitioned LU factorization). This algorithm computes an LU factorization $A = LU \in \mathbb{R}^{n\times n}$ using a partitioned outer product implementation, with block size $r$ and the notation (13.1).
1. Factor $A_{11} = L_{11}U_{11}$.
2. Solve $L_{11}U_{12} = A_{12}$ for $U_{12}$.
3. Solve $L_{21}U_{11} = A_{21}$ for $L_{21}$.
4. Form $S = A_{22} - L_{21}U_{12}$.
5. Repeat steps 1–4 on $S$ to obtain $L_{22}$ and $U_{22}$.
Note that in step 4, $S = A_{22} - A_{21}A_{11}^{-1}A_{12}$ is the Schur complement of $A_{11}$ in $A$. Steps 2 and 3 require the solution of multiple right-hand side triangular systems, so steps 2–4 are all level-3 BLAS operations. This partitioned algorithm does precisely the same arithmetic operations as any other variant of GE, but it does them in an order that permits them to be expressed as matrix operations.
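To make the steps concrete, here is a minimal NumPy sketch of Algorithm 13.1. It is illustrative only: there is no pivoting, so it assumes every pivot block factorizes safely, and the helper names (`point_lu`, `partitioned_lu`) are not from the book.

```python
import numpy as np
from scipy.linalg import solve_triangular

def point_lu(A):
    """Unpivoted point LU; assumes nonsingular leading principal minors."""
    n = A.shape[0]
    L, U = np.eye(n), A.copy()
    for j in range(n - 1):
        L[j+1:, j] = U[j+1:, j] / U[j, j]
        U[j+1:, j:] -= np.outer(L[j+1:, j], U[j, j:])
        U[j+1:, j] = 0.0
    return L, U

def partitioned_lu(A, r):
    """Algorithm 13.1: partitioned outer product LU with block size r."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(0, n, r):
        b = min(r, n - k)
        L11, U11 = point_lu(A[k:k+b, k:k+b])              # step 1
        L[k:k+b, k:k+b], U[k:k+b, k:k+b] = L11, U11
        if k + b < n:
            # step 2: solve L11 U12 = A12 (unit lower triangular solve)
            U12 = solve_triangular(L11, A[k:k+b, k+b:],
                                   lower=True, unit_diagonal=True)
            # step 3: solve L21 U11 = A21, i.e. U11^T L21^T = A21^T
            L21 = solve_triangular(U11, A[k+b:, k:k+b].T, trans='T').T
            U[k:k+b, k+b:], L[k+b:, k:k+b] = U12, L21
            # step 4: Schur complement -- the level-3 BLAS update
            A[k+b:, k+b:] -= L21 @ U12
    return L, U
```

For a matrix that is safe to factorize without pivoting, such as `A = rng.standard_normal((8, 8)) + 8*np.eye(8)`, the factors returned by `partitioned_lu(A, 3)` satisfy `np.allclose(L @ U, A)`.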
A genuine block algorithm computes a block LU factorization, which is a factorization $A = LU \in \mathbb{R}^{n\times n}$, where $L$ and $U$ are block triangular and $L$ has identity matrices on the diagonal:
$$L = \begin{bmatrix} I & & & \\ L_{21} & I & & \\ \vdots & \ddots & \ddots & \\ L_{m1} & \dots & L_{m,m-1} & I \end{bmatrix}, \qquad U = \begin{bmatrix} U_{11} & U_{12} & \dots & U_{1m} \\ & U_{22} & & \vdots \\ & & \ddots & U_{m-1,m} \\ & & & U_{mm} \end{bmatrix}.$$

In general, the blocks can be of different dimensions. Note that this factorization is not the same as a standard LU factorization, because $U$ is not triangular. However, the standard and block LU factorizations are related as follows: if $A = LU$ is a block LU factorization and each $U_{ii}$ has an LU factorization $U_{ii} = \bar L_{ii}\bar U_{ii}$, then $A = L\,\mathrm{diag}(\bar L_{ii}) \cdot \mathrm{diag}(\bar U_{ii})\,U$ is an LU factorization. Conditions for the existence of a block LU factorization are easy to state.

Theorem 13.2. The matrix $A = (A_{ij})_{i,j=1}^m \in \mathbb{R}^{n\times n}$ has a unique block LU factorization if and only if the first $m-1$ leading principal block submatrices of $A$ are nonsingular.
Proof. The proof is entirely analogous to the proof of Theorem 9.1.
This theorem makes clear that a block LU factorization may exist when an LU
factorization does not.
If $A_{11} \in \mathbb{R}^{r\times r}$ is nonsingular we can write
$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} I & 0 \\ L_{21} & I \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ 0 & S \end{bmatrix}, \qquad (13.2)$$
which describes one block step of an outer-product-based algorithm for computing a block LU factorization. Here, $S$ is again the Schur complement of $A_{11}$ in $A$. If the (1,1) block of $S$ of appropriate dimension is nonsingular then we can factorize $S$ in a similar manner, and this process can be continued recursively to obtain the complete block LU factorization. The overall algorithm can be expressed as follows.

Algorithm 13.3 (block LU factorization). This algorithm computes a block LU factorization $A = LU \in \mathbb{R}^{n\times n}$, using the notation (13.2).
1. $U_{11} = A_{11}$, $U_{12} = A_{12}$.
2. Solve $L_{21}A_{11} = A_{21}$ for $L_{21}$.
3. $S = A_{22} - L_{21}A_{12}$.
4. Compute the block LU factorization of $S$, recursively.
Given a block LU factorization of $A$, the solution to a system $Ax = b$ can be obtained by solving $Ly = b$ by forward substitution (since $L$ is triangular) and solving $Ux = y$ by block back substitution. There is freedom in how step 2 of Algorithm 13.3 is accomplished, and in how the linear systems with coefficient matrices $U_{ii}$ that arise in the block back substitution are solved. The two main possibilities are as follows.
Implementation 1: $A_{11}$ is factorized by GEPP. Step 2 and the solution of linear systems with $U_{ii}$ are accomplished by substitution with the LU factors of $A_{11}$.
Implementation 2: $A_{11}^{-1}$ is computed explicitly, so that step 2 becomes a matrix multiplication and $Ux = y$ is solved entirely by matrix–vector multiplications. This approach is attractive for parallel machines.
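The following sketch, under the same illustrative conventions, shows Algorithm 13.3 with Implementation 1 (GEPP on each pivot block, triangular solves for step 2) together with the forward substitution and block back substitution just described; `block_lu` and `block_lu_solve` are assumed names.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, solve_triangular

def block_lu(A, r):
    """Algorithm 13.3, Implementation 1; returns block triangular L and U."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(0, n, r):
        b = min(r, n - k)
        U[k:k+b, k:] = A[k:k+b, k:]            # step 1: U11 = A11, U12 = A12
        if k + b < n:
            fac = lu_factor(A[k:k+b, k:k+b])   # GEPP on the pivot block A11
            # step 2: solve L21 A11 = A21, i.e. A11^T L21^T = A21^T
            L[k+b:, k:k+b] = lu_solve(fac, A[k+b:, k:k+b].T, trans=1).T
            # step 3: Schur complement S = A22 - L21 A12
            A[k+b:, k+b:] -= L[k+b:, k:k+b] @ A[k:k+b, k+b:]
    return L, U

def block_lu_solve(L, U, b, r):
    """Solve Ax = b given A = LU: forward substitution with the (unit lower
    triangular) L, then block back substitution; each diagonal block Uii is
    full, so its systems are solved by GEPP (Implementation 1)."""
    y = solve_triangular(L, b, lower=True, unit_diagonal=True)
    n = len(b)
    x = np.zeros(n)
    for k in reversed(range(0, n, r)):
        j = min(k + r, n)
        x[k:j] = np.linalg.solve(U[k:j, k:j], y[k:j] - U[k:j, j:] @ x[j:])
    return x
```

A call such as `block_lu_solve(*block_lu(A, r), b, r)` should agree with `np.linalg.solve(A, b)` whenever the required leading principal block submatrices are nonsingular.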
A particular case of partitioned LU factorization is recursively partitioned LU factorization. Assuming, for simplicity, that $n$ is even, we write
$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I_{n/2} \end{bmatrix} \begin{bmatrix} I_{n/2} & 0 \\ 0 & S \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & I_{n/2} \end{bmatrix}, \qquad (13.3)$$
where each block is $n/2 \times n/2$. The algorithm is as follows.

Algorithm 13.4 (recursively partitioned LU factorization). This algorithm computes an LU factorization $A = LU \in \mathbb{R}^{n\times n}$ using a recursive partitioning, with the notation (13.3).
1. Recursively factorize $\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix} = \begin{bmatrix} L_{11} \\ L_{21} \end{bmatrix} U_{11}$.
2. Solve $L_{11}U_{12} = A_{12}$ for $U_{12}$.
3. Form $S = A_{22} - L_{21}U_{12}$.
4. Recursively factorize $S = L_{22}U_{22}$.
In contrast with Algorithm 13.1, this recursive algorithm does not require a block size to be chosen. Intuitively, the recursive algorithm maximizes the dimensions of the matrices that are multiplied in step 3: at the top level of the recursion two $n/2 \times n/2$ matrices are multiplied, at the next level two $n/4 \times n/4$ matrices, and so on. Toledo [1145, ] shows that Algorithm 13.4 transfers fewer words of data between primary and secondary computer memory than Algorithm 13.1, and that it outperforms Algorithm 13.1 on a range of computers. He also shows that the large matrix multiplications in Algorithm 13.4 enable it to benefit particularly well from the use of Strassen's fast matrix multiplication method (see §23.1).
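A recursive sketch in the spirit of Algorithm 13.4 follows. One liberty is taken: the tall factorization of step 1 is realized here as a recursive LU of $A_{11}$ followed by a triangular solve for $L_{21}$, which is one natural way to implement it; as before there is no pivoting and the names are illustrative.

```python
import numpy as np
from scipy.linalg import solve_triangular

def recursive_lu(A):
    """Recursively partitioned LU (cf. Algorithm 13.4), no pivoting."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.ones((1, 1)), A.copy()
    m = n // 2
    # Step 1: factorize the first block column [A11; A21] = [L11; L21] U11,
    # here via a recursive LU of A11 and a triangular solve for L21.
    L11, U11 = recursive_lu(A[:m, :m])
    L21 = solve_triangular(U11, A[m:, :m].T, trans='T').T
    # Step 2: solve L11 U12 = A12.
    U12 = solve_triangular(L11, A[:m, m:], lower=True, unit_diagonal=True)
    # Step 3: the Schur complement -- the large matrix multiplication.
    S = A[m:, m:] - L21 @ U12
    # Step 4: recurse on S.
    L22, U22 = recursive_lu(S)
    L = np.block([[L11, np.zeros((m, n - m))], [L21, L22]])
    U = np.block([[U11, U12], [np.zeros((n - m, m)), U22]])
    return L, U
```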
What can be said about the numerical stability of partitioned and block LU
factorization? Because the partitioned algorithms are just rearrangements of stan-
dard GE, the standard error analysis applies if the matrix operations are computed
in the conventional way. However, if fast matrix multiplication techniques are used
(for example, Strassen’s method), the standard results are not applicable. Stan-
dard results are, in any case, not applicable to block LU factorization; its stability
can be very different from that of LU factorization. Therefore we need error anal-
ysis for both partitioned and block LU factorization based on general assumptions
that permit the use of fast matrix multiplication.
Unless otherwise stated, in this chapter an unsubscripted norm denotes $\|A\| := \max_{i,j}|a_{ij}|$. We make two assumptions about the underlying level-3 BLAS (matrix–matrix operations).
(1) If $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{n\times p}$ then the computed approximation $\widehat C$ to $C = AB$ satisfies
$$\widehat C = AB + \Delta C, \qquad \|\Delta C\| \le c_1(m,n,p)u\|A\|\,\|B\| + O(u^2), \qquad (13.4)$$
where $c_1(m,n,p)$ denotes a constant depending on $m$, $n$, and $p$.
(2) The computed solution $\widehat X$ to the triangular systems $TX = B$, where $T \in \mathbb{R}^{m\times m}$ and $B \in \mathbb{R}^{m\times p}$, satisfies
$$T\widehat X = B + \Delta B, \qquad \|\Delta B\| \le c_2(m,p)u\|T\|\,\|\widehat X\| + O(u^2). \qquad (13.5)$$
For conventional multiplication and substitution, conditions (13.4) and (13.5) hold with $c_1(m,n,p) = n^2$ and $c_2(m,p) = m^2$. For implementations based on Strassen's method, (13.4) and (13.5) hold with $c_1$ and $c_2$ rather complicated functions of the dimensions $m$, $n$, $p$, and the threshold $n_0$ that determines the level of recursion (see Theorem 23.2 and [592, ]).

13.2. Error Analysis of Partitioned LU Factorization


An error analysis for partitioned LU factorization must answer two questions. The first is whether partitioned LU factorization becomes unstable in some fundamental way when fast matrix multiplication is used. The second is whether the constants in (13.4) and (13.5) are propagated stably into the final error bound (exponential growth of the constants would be disastrous).
We will analyse Algorithm 13.1 and will assume that the block level LU factorization is done in such a way that the computed LU factors of $A_{11} \in \mathbb{R}^{r\times r}$ satisfy
$$\widehat L_{11}\widehat U_{11} = A_{11} + \Delta A_{11}, \qquad \|\Delta A_{11}\| \le c_3(r)u\|\widehat L_{11}\|\,\|\widehat U_{11}\| + O(u^2). \qquad (13.6)$$

Theorem 13.5 (Demmel and Higham). Under the assumptions (13.4)–(13.6), the LU factors of $A \in \mathbb{R}^{n\times n}$ computed using the partitioned outer product form of LU factorization with block size $r$ satisfy $\widehat L\widehat U = A + \Delta A$, where
$$\|\Delta A\| \le u\bigl(\delta(n,r)\|A\| + \theta(n,r)\|\widehat L\|\,\|\widehat U\|\bigr) + O(u^2), \qquad (13.7)$$
and where
$$\delta(n,r) = 1 + \delta(n-r,r), \qquad \delta(r,r) = 0,$$
$$\theta(n,r) = \max\bigl(c_3(r),\; c_2(r,n-r),\; 1 + c_1(n-r,r,n-r) + \delta(n-r,r) + \theta(n-r,r)\bigr), \qquad \theta(r,r) = c_3(r).$$

Proof. The proof is essentially inductive. To save clutter we will omit "$+\,O(u^2)$" from each bound. For $n = r$, the result holds trivially. Consider the first block stage of the factorization, with the partitioning (13.1). The assumptions imply that
$$\widehat L_{11}\widehat U_{12} = A_{12} + \Delta A_{12}, \qquad \|\Delta A_{12}\| \le c_2(r,n-r)u\|\widehat L_{11}\|\,\|\widehat U_{12}\|, \qquad (13.8)$$
$$\widehat L_{21}\widehat U_{11} = A_{21} + \Delta A_{21}, \qquad \|\Delta A_{21}\| \le c_2(r,n-r)u\|\widehat L_{21}\|\,\|\widehat U_{11}\|. \qquad (13.9)$$
To obtain $S = A_{22} - L_{21}U_{12}$ we first compute $C = \widehat L_{21}\widehat U_{12}$, obtaining
$$\widehat C = \widehat L_{21}\widehat U_{12} + \Delta C, \qquad \|\Delta C\| \le c_1(n-r,r,n-r)u\|\widehat L_{21}\|\,\|\widehat U_{12}\|,$$
and then subtract from $A_{22}$, obtaining
$$\widehat S = A_{22} - \widehat C + F, \qquad \|F\| \le u\bigl(\|A_{22}\| + \|\widehat C\|\bigr). \qquad (13.10)$$
It follows that
$$\widehat S = A_{22} - \widehat L_{21}\widehat U_{12} + \Delta S, \qquad (13.11a)$$
$$\|\Delta S\| \le u\bigl(\|A_{22}\| + \|\widehat L_{21}\|\,\|\widehat U_{12}\| + c_1(n-r,r,n-r)\|\widehat L_{21}\|\,\|\widehat U_{12}\|\bigr). \qquad (13.11b)$$
The remainder of the algorithm consists of the computation of the LU factorization of $\widehat S$, and by our inductive assumption (13.7), the computed LU factors satisfy
$$\widehat L_{22}\widehat U_{22} = \widehat S + \Delta\widehat S, \qquad (13.12a)$$
$$\|\Delta\widehat S\| \le \delta(n-r,r)u\|\widehat S\| + \theta(n-r,r)u\|\widehat L_{22}\|\,\|\widehat U_{22}\|. \qquad (13.12b)$$

Combining (13.11) and (13.12), and bounding $\|\widehat S\|$ using (13.10), we obtain
$$\widehat L_{21}\widehat U_{12} + \widehat L_{22}\widehat U_{22} = A_{22} + \Delta A_{22},$$
$$\|\Delta A_{22}\| \le u\bigl([1 + \delta(n-r,r)]\|A_{22}\| + [1 + c_1(n-r,r,n-r) + \delta(n-r,r)]\,\|\widehat L_{21}\|\,\|\widehat U_{12}\| + \theta(n-r,r)\|\widehat L_{22}\|\,\|\widehat U_{22}\|\bigr). \qquad (13.13)$$
Collecting (13.6), (13.8), (13.9), and (13.13) we have $\widehat L\widehat U = A + \Delta A$, where bounds on $\|\Delta A_{ij}\|$ are given in the equations just mentioned. These bounds for the blocks of $\Delta A$ can be weakened slightly and expressed together in the more succinct form (13.7).
These recurrences for $\delta(n,r)$ and $\theta(n,r)$ show that the basic error constants in assumptions (13.4)–(13.6) combine additively at worst. Thus, the backward error analysis for the LU factorization is commensurate with the error analysis for the particular implementation of the BLAS3 employed in the partitioned factorization. In the case of the conventional BLAS3 we obtain a Wilkinson-style result for GE without pivoting, with $\theta(n,r) = O(n^3)$ (the growth factor is hidden in $\widehat L$ and $\widehat U$).
Although the above analysis is phrased in terms of the partitioned outer prod-
uct form of LU factorization, the same result holds for other “ijk” partitioned
forms (with slightly different constants), for example, the gaxpy or sdot forms and
the recursive factorization (Algorithm 13.4). There is no difficulty in extending
the analysis to cover partial pivoting and solution of Ax = b using the computed
LU factorization (see Problem 13.6).

13.3. Error Analysis of Block LU Factorization


Now we turn to block LU factorization. We assume that the computed matrices $\widehat L_{21}$ from step 2 of Algorithm 13.3 satisfy
$$\widehat L_{21}A_{11} = A_{21} + E_{21}, \qquad \|E_{21}\| \le c_4(n,r)u\|\widehat L_{21}\|\,\|A_{11}\| + O(u^2). \qquad (13.14)$$
We also assume that when a system $U_{ii}x_i = d_i$ of order $r$ is solved, the computed solution $\widehat x_i$ satisfies
$$(U_{ii} + \Delta U_{ii})\widehat x_i = d_i, \qquad \|\Delta U_{ii}\| \le c_5(r)u\|U_{ii}\| + O(u^2). \qquad (13.15)$$
The assumptions (13.14) and (13.15) are satisfied for Implementation 1 of Algorithm 13.3 and are sufficient to prove the following result.

Theorem 13.6 (Demmel, Higham, and Schreiber). Let $\widehat L$ and $\widehat U$ be the computed block LU factors of $A \in \mathbb{R}^{n\times n}$ from Algorithm 13.3 (with Implementation 1), and let $\widehat x$ be the computed solution to $Ax = b$. Under the assumptions (13.4), (13.14), and (13.15),
$$\widehat L\widehat U = A + \Delta A_1, \qquad (A + \Delta A_2)\widehat x = b,$$
$$\|\Delta A_i\| \le d_n u\bigl(\|A\| + \|\widehat L\|\,\|\widehat U\|\bigr) + O(u^2), \qquad i = 1\colon 2, \qquad (13.16)$$
where the constant $d_n$ is commensurate with those in the assumptions.


Proof. We omit the proof (see Demmel, Higham, and Schreiber [326, ] for details). It is similar to the proof of Theorem 13.5.
The bounds in Theorem 13.6 are valid also for other versions of block LU factorization obtained by "block loop reordering", such as a block gaxpy based algorithm.
Theorem 13.6 shows that the stability of block LU factorization is determined by the ratio $\|\widehat L\|\,\|\widehat U\|/\|A\|$ (numerical experiments show that the bounds are, in fact, reasonably sharp). If this ratio is bounded by a modest function of $n$, then $\widehat L$ and $\widehat U$ are the true factors of a matrix close to $A$, and $\widehat x$ solves a slightly perturbed system. However, $\|\widehat L\|\,\|\widehat U\|$ can exceed $\|A\|$ by an arbitrary factor, even if $A$ is symmetric positive definite or diagonally dominant by rows. Indeed, $\|L\| \ge \|L_{21}\| = \|A_{21}A_{11}^{-1}\|$, using the partitioning (13.2), and this lower bound for $\|L\|$ can be arbitrarily large. In the following two subsections we investigate this instability more closely and show that $\|L\|\,\|U\|$ can be bounded in a useful way for particular classes of $A$. Without further comment we make the reasonable assumption that $\|L\|\,\|U\| \approx \|\widehat L\|\,\|\widehat U\|$, so that these bounds may be used in Theorem 13.6.
What can be said for Implementation 2? Suppose, for simplicity, that the inverses $A_{11}^{-1}$ (which are used in step 2 of Algorithm 13.3 and in the block back substitution) are computed exactly. Then the best bounds of the forms (13.14) and (13.15) are
$$\widehat L_{21}A_{11} = A_{21} + \Delta A_{21}, \qquad \|\Delta A_{21}\| \le c_4(n,r)u\,\kappa(A_{11})\|A_{21}\| + O(u^2),$$
$$(U_{ii} + \Delta U_{ii})\widehat x_i = d_i, \qquad \|\Delta U_{ii}\| \le c_5(r)u\,\kappa(U_{ii})\|U_{ii}\| + O(u^2).$$
Working from these results, we find that Theorem 13.6 still holds provided the first-order terms in the bounds in (13.16) are multiplied by $\max_i \kappa(\widehat U_{ii})$. This suggests that Implementation 2 of Algorithm 13.3 can be much less stable than Implementation 1 when the diagonal blocks of $U$ are ill conditioned, and this is confirmed by numerical experiments.
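The governing ratio $\|\widehat L\|\,\|\widehat U\|/\|A\|$ is easy to monitor in practice. The sketch below (reusing the hypothetical `block_lu` from §13.1) makes the (1,1) block nearly singular, so that $L_{21} = A_{21}A_{11}^{-1}$, and with it the ratio, blows up.

```python
import numpy as np

def stability_ratio(A, r):
    """Compute ||L|| ||U|| / ||A|| in the max-element norm of the text."""
    L, U = block_lu(A, r)                  # sketch from Section 13.1 above
    norm = lambda M: np.abs(M).max()       # ||A|| := max_{i,j} |a_ij|
    return norm(L) * norm(U) / norm(A)

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A[:3, :3] *= 1e-6                          # nearly singular (1,1) block
print(stability_ratio(A, r=3))             # typically a very large number
```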

13.3.1. Block Diagonal Dominance


One class of matrices for which block LU factorization has long been known to be stable is block tridiagonal matrices that are diagonally dominant in an appropriate block sense. A general matrix $A \in \mathbb{R}^{n\times n}$ is block diagonally dominant by columns with respect to a given partitioning $A = (A_{ij})$ and a given norm if, for all $j$,
$$\|A_{jj}^{-1}\|^{-1} - \sum_{i\ne j}\|A_{ij}\| =: \gamma_j \ge 0. \qquad (13.17)$$
This definition implicitly requires that the diagonal blocks $A_{jj}$ are all nonsingular. $A$ is block diagonally dominant by rows if $A^T$ is block diagonally dominant by columns. For the block size 1, the usual property of point diagonal dominance is obtained. Note that for the 1- and ∞-norms diagonal dominance does not imply block diagonal dominance, nor does the reverse implication hold (see Problem 13.2). Throughout our analysis of block diagonal dominance we take the norm to be an arbitrary subordinate matrix norm.
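For a uniform block size, definition (13.17) can be tested numerically; a sketch follows, with the 2-norm standing in for the chosen subordinate norm (the function name is illustrative).

```python
import numpy as np

def is_block_diag_dominant_by_cols(A, r, norm=lambda M: np.linalg.norm(M, 2)):
    """Check (13.17): 1/||Ajj^{-1}|| - sum_{i != j} ||Aij|| >= 0 for all j."""
    n = A.shape[0]
    blocks = [(k, min(k + r, n)) for k in range(0, n, r)]
    for jl, jr in blocks:
        Ajj_inv = np.linalg.inv(A[jl:jr, jl:jr])   # diagonal blocks must be nonsingular
        off = sum(norm(A[il:ir, jl:jr])
                  for il, ir in blocks if (il, ir) != (jl, jr))
        if 1.0 / norm(Ajj_inv) - off < 0:          # gamma_j < 0
            return False
    return True
```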
First, we show that for block diagonally dominant matrices a block LU factorization exists, using the key property that block diagonal dominance is inherited by the Schur complements obtained in the course of the factorization. In the analysis we assume that $A$ has $m$ block rows and columns.

Theorem 13.7 (Demmel, Higham, and Schreiber). Suppose $A \in \mathbb{R}^{n\times n}$ is nonsingular and block diagonally dominant by rows or columns with respect to a subordinate matrix norm in (13.17). Then $A$ has a block LU factorization, and all the Schur complements arising in Algorithm 13.3 have the same kind of diagonal dominance as $A$.

Proof. This proof is a generalization of Wilkinson's proof of the corresponding result for point diagonally dominant matrices [1229, , pp. 288–289], [509, , Thm. 3.4.3] (as is the proof of Theorem 13.8 below). We consider the case of block diagonal dominance by columns; the proof for row-wise diagonal dominance is analogous.
The first step of Algorithm 13.3 succeeds, since $A_{11}$ is nonsingular, producing a matrix that we can write as
$$A^{(2)} = \begin{bmatrix} U_{11} & U_{12} \\ 0 & S \end{bmatrix}.$$
For $j = 2\colon m$ we have
$$\sum_{\substack{i=2\\ i\ne j}}^{m} \|A_{ij}^{(2)}\| = \sum_{\substack{i=2\\ i\ne j}}^{m} \|A_{ij} - A_{i1}A_{11}^{-1}A_{1j}\|$$
$$\le \sum_{\substack{i=2\\ i\ne j}}^{m} \|A_{ij}\| + \|A_{1j}\|\,\|A_{11}^{-1}\| \sum_{\substack{i=2\\ i\ne j}}^{m} \|A_{i1}\|$$
$$\le \sum_{\substack{i=2\\ i\ne j}}^{m} \|A_{ij}\| + \|A_{1j}\|\,\|A_{11}^{-1}\| \bigl(\|A_{11}^{-1}\|^{-1} - \|A_{j1}\|\bigr), \quad\text{using (13.17)},$$
$$= \sum_{\substack{i=2\\ i\ne j}}^{m} \|A_{ij}\| + \|A_{1j}\| - \|A_{1j}\|\,\|A_{11}^{-1}\|\,\|A_{j1}\|$$
$$\le \|A_{jj}^{-1}\|^{-1} - \|A_{1j}\|\,\|A_{11}^{-1}\|\,\|A_{j1}\|, \quad\text{using (13.17)},$$
$$= \min_{\|x\|=1}\|A_{jj}x\| - \|A_{1j}\|\,\|A_{11}^{-1}\|\,\|A_{j1}\|$$
$$\le \min_{\|x\|=1}\|(A_{jj} - A_{j1}A_{11}^{-1}A_{1j})x\|$$
$$= \min_{\|x\|=1}\|A_{jj}^{(2)}x\|. \qquad (13.18)$$
Now if $A_{jj}^{(2)}$ is singular it follows that $\sum_{i=2,\,i\ne j}^{m}\|A_{ij}^{(2)}\| = 0$; therefore $A^{(2)}$, and hence also $A$, is singular, which is a contradiction. Thus $A_{jj}^{(2)}$ is nonsingular, and (13.18) can be rewritten as
$$\sum_{\substack{i=2\\ i\ne j}}^{m} \|A_{ij}^{(2)}\| \le \|(A_{jj}^{(2)})^{-1}\|^{-1},$$
showing that $A^{(2)}$ is block diagonally dominant by columns. The result follows by induction.
The next result allows us to bound $\|U\|$ for a block diagonally dominant matrix.

Theorem 13.8 (Demmel, Higham, and Schreiber). Let $A$ satisfy the conditions of Theorem 13.7. If $A^{(k)}$ denotes the matrix obtained after $k-1$ steps of Algorithm 13.3, then
$$\max_{k\le i,j\le m}\|A_{ij}^{(k)}\| \le 2\max_{1\le i,j\le m}\|A_{ij}\|.$$

Proof. Let $A$ be block diagonally dominant by columns (the proof for row diagonal dominance is similar). Then
$$\sum_{i=2}^{m}\|A_{ij}^{(2)}\| = \sum_{i=2}^{m}\|A_{ij} - A_{i1}A_{11}^{-1}A_{1j}\| \le \sum_{i=2}^{m}\|A_{ij}\| + \|A_{1j}\|\,\|A_{11}^{-1}\|\sum_{i=2}^{m}\|A_{i1}\| \le \sum_{i=1}^{m}\|A_{ij}\|,$$
using (13.17). By induction, using Theorem 13.7, it follows that $\sum_{i=k}^{m}\|A_{ij}^{(k)}\| \le \sum_{i=1}^{m}\|A_{ij}\|$. This yields
$$\max_{k\le i,j\le m}\|A_{ij}^{(k)}\| \le \max_{k\le j\le m}\sum_{i=k}^{m}\|A_{ij}^{(k)}\| \le \max_{k\le j\le m}\sum_{i=1}^{m}\|A_{ij}\|.$$
From (13.17), $\sum_{i\ne j}\|A_{ij}\| \le \|A_{jj}^{-1}\|^{-1} \le \|A_{jj}\|$, so
$$\max_{k\le i,j\le m}\|A_{ij}^{(k)}\| \le 2\max_{k\le j\le m}\|A_{jj}\| \le 2\max_{1\le j\le m}\|A_{jj}\| = 2\max_{1\le i,j\le m}\|A_{ij}\|.$$

The implications of Theorems 13.7 and 13.8 for stability are as follows. Suppose $A$ is block diagonally dominant by columns. Also, assume for the moment that the (subordinate) norm has the property that
$$\max_{i,j}\|A_{ij}\| \le \|A\| \le \sum_{i,j}\|A_{ij}\|, \qquad (13.19)$$
which holds for any $p$-norm, for example. The subdiagonal blocks in the first block column of $L$ are given by $L_{i1} = A_{i1}A_{11}^{-1}$, and so $\|[L_{21}^T, \dots, L_{m1}^T]^T\| \le 1$, by (13.17) and (13.19). From Theorem 13.7 it follows that $\|[L_{j+1,j}^T, \dots, L_{mj}^T]^T\| \le 1$ for $j = 2\colon m$. Since $U_{ij} = A_{ij}^{(i)}$ for $j \ge i$, Theorem 13.8 shows that $\|U_{ij}\| \le 2\|A\|$ for each block of $U$ (and $\|U_{ii}\| \le \|A\|$). Therefore $\|L\| \le m$ and $\|U\| \le m^2\|A\|$, and so $\|L\|\,\|U\| \le m^3\|A\|$. For particular norms the bounds on the blocks of $L$ and $U$ yield a smaller bound for $\|L\|$ and $\|U\|$. For example, for the 1-norm we have $\|L\|_1\|U\|_1 \le 2m\|A\|_1$ and for the ∞-norm $\|L\|_\infty\|U\|_\infty \le 2m^2\|A\|_\infty$. We conclude that block LU factorization is stable if $A$ is block diagonally dominant by columns with respect to any subordinate matrix norm satisfying (13.19).
Unfortunately, block LU factorization can be unstable when $A$ is block diagonally dominant by rows, for although Theorem 13.8 guarantees that $\|U_{ij}\| \le 2\|A\|$, $\|L\|$ can be arbitrarily large. This can be seen from the example
$$A = \begin{bmatrix} A_{11} & 0 \\ \tfrac12 I & I \end{bmatrix} = \begin{bmatrix} I & 0 \\ \tfrac12 A_{11}^{-1} & I \end{bmatrix}\begin{bmatrix} A_{11} & 0 \\ 0 & I \end{bmatrix} = LU,$$
where $A$ is block diagonally dominant by rows in any subordinate norm for any nonsingular matrix $A_{11}$. It is easy to confirm numerically that block LU factorization can be unstable on matrices of this form.
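A short numerical confirmation, assuming the example above with $A_{11} = \epsilon I$: for small $\epsilon$ the factor $L$ contains the block $\frac{1}{2}A_{11}^{-1}$, whose entries are of order $1/\epsilon$.

```python
import numpy as np

eps = 1e-8
A11 = eps * np.eye(2)                       # nonsingular but tiny
Z = np.zeros((2, 2))
A = np.block([[A11, Z], [0.5 * np.eye(2), np.eye(2)]])
L = np.block([[np.eye(2), Z], [0.5 * np.linalg.inv(A11), np.eye(2)]])
U = np.block([[A11, Z], [Z, np.eye(2)]])
print(np.allclose(L @ U, A))                # True: A = LU as in the text
print(np.abs(L).max())                      # ~ 0.5/eps, arbitrarily large
```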
Next, we bound $\|L\|\,\|U\|$ for a general matrix and then specialize to point diagonal dominance. From this point on we use the norm $\|A\| := \max_{i,j}|a_{ij}|$. We partition $A$ according to
$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \qquad A_{11} \in \mathbb{R}^{r\times r}, \qquad (13.20)$$
and denote by $\rho_n$ the growth factor for GE without pivoting. We assume that GE applied to $A$ succeeds.
To bound $\|L\|$, we note that, under the partitioning (13.20), for the first block stage of Algorithm 13.3 we have $\|L_{21}\| = \|A_{21}A_{11}^{-1}\| \le n\rho_n\kappa(A)$ (see Problem 13.4). Since the algorithm works recursively with the Schur complement $S$, and since every Schur complement satisfies $\kappa(S) \le \rho_n\kappa(A)$ (see Problem 13.4), each subsequently computed subdiagonal block of $L$ has norm at most $n\rho_n^2\kappa(A)$. Since $U$ is composed of elements of $A$ together with elements of Schur complements of $A$,
$$\|U\| \le \rho_n\|A\|. \qquad (13.21)$$
Overall, then, for a general matrix $A \in \mathbb{R}^{n\times n}$,
$$\|L\|\,\|U\| \le n\rho_n^2\kappa(A) \cdot \rho_n\|A\| = n\rho_n^3\kappa(A)\|A\|. \qquad (13.22)$$

Thus, block LU factorization is stable for a general matrix $A$ as long as GE is stable for $A$ (that is, $\rho_n$ is of order 1) and $A$ is well conditioned.
If $A$ is point diagonally dominant by columns then, since every Schur complement enjoys the same property, we have $\|L_{ij}\| \le 1$ for $i > j$, by Problem 13.5. Hence $\|L\| = 1$. Furthermore, $\rho_n \le 2$ (Theorem 9.9 or Theorem 13.8), giving $\|U\| \le 2\|A\|$ by (13.21), and so
$$\|L\|\,\|U\| \le 2\|A\|.$$
Thus block LU factorization is perfectly stable for a matrix point diagonally dominant by columns.
If $A$ is point diagonally dominant by rows then the best we can do is to take $\rho_n \le 2$ in (13.22), obtaining
$$\|L\|\,\|U\| \le 8n\kappa(A)\|A\|. \qquad (13.23)$$
Hence for point row diagonally dominant matrices, stability is guaranteed if $A$ is well conditioned. This in turn is guaranteed if the row diagonal dominance amounts $\gamma_j$ in the analogue of (13.17) for point row diagonal dominance are sufficiently large relative to $\|A\|$, because $\|A^{-1}\|_\infty \le (\min_j \gamma_j)^{-1}$ (see Problem 8.7(a)).

13.3.2. Symmetric Positive Definite Matrices


Further useful results about the stability of block LU factorization can be derived for symmetric positive definite matrices. First, note that the existence of a block LU factorization is immediate for such matrices, since all their leading principal submatrices are nonsingular. Let $A$ be a symmetric positive definite matrix, partitioned as
$$A = \begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix}, \qquad A_{11} \in \mathbb{R}^{r\times r}.$$
The definiteness implies certain relations among the submatrices $A_{ij}$ that can be used to obtain a stronger bound for $\|L\|_2$ than can be deduced for a general matrix (cf. Problem 13.4).

Lemma 13.9. If $A$ is symmetric positive definite then $\|A_{21}A_{11}^{-1}\|_2 \le \kappa_2(A)^{1/2}$.
Proof. This lemma is a corollary of Lemma 10.12, but we give a separate proof. Let $A$ have the Cholesky factorization
$$A = \begin{bmatrix} R_{11}^T & 0 \\ R_{12}^T & R_{22}^T \end{bmatrix}\begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}, \qquad R_{11} \in \mathbb{R}^{r\times r}.$$
Then $A_{21}A_{11}^{-1} = R_{12}^T R_{11} \cdot R_{11}^{-1}R_{11}^{-T} = R_{12}^T R_{11}^{-T}$, so
$$\|A_{21}A_{11}^{-1}\|_2 \le \|R_{12}\|_2\,\|R_{11}^{-1}\|_2 \le \|R\|_2\,\|R^{-1}\|_2 = \kappa_2(R) = \kappa_2(A)^{1/2}.$$

The following lemma is proved in a way similar to the second inequality in Problem 13.4.

Lemma 13.10. If $A$ is symmetric positive definite then the Schur complement $S = A_{22} - A_{21}A_{11}^{-1}A_{21}^T$ satisfies $\kappa_2(S) \le \kappa_2(A)$.

Using the same reasoning as in the last subsection, we deduce from these two lemmas that each subdiagonal block of $L$ is bounded in 2-norm by $\kappa_2(A)^{1/2}$. Therefore $\|L\|_2 \le 1 + m\kappa_2(A)^{1/2}$, where there are $m$ block stages in the algorithm. Also, it can be shown that $\|U\|_2 \le \sqrt{m}\,\|A\|_2$. Hence
$$\|L\|_2\|U\|_2 \le m\bigl(1 + m\kappa_2(A)^{1/2}\bigr)\|A\|_2. \qquad (13.24)$$
Table 13.1. Stability of block and point LU factorization. $\rho_n$ is the growth factor for GE without pivoting.

    Matrix property                      Block LU               Point LU
    Symmetric positive definite          $\kappa(A)^{1/2}$      1
    Block column diagonally dominant     1                      $\rho_n$
    Point column diagonally dominant     1                      1
    Block row diagonally dominant        $\rho_n^3\kappa(A)$    $\rho_n$
    Point row diagonally dominant        $\kappa(A)$            1
    Arbitrary                            $\rho_n^3\kappa(A)$    $\rho_n$

It follows from Theorem 13.6 that when Algorithm 13.3 is applied to a symmetric positive definite matrix $A$, the backward errors for the LU factorization and the subsequent solution of a linear system are both bounded by
$$c_n m u\|A\|_2\bigl(2 + m\kappa_2(A)^{1/2}\bigr) + O(u^2). \qquad (13.25)$$
Any resulting bound for $\|x - \widehat x\|_2/\|x\|_2$ will be proportional to $\kappa_2(A)^{3/2}$, rather than $\kappa_2(A)$ as for a stable method. This suggests that block LU factorization can lose up to 50% more digits of accuracy in $x$ than a stable method for solving symmetric positive definite linear systems. The positive conclusion to be drawn, however, is that block LU factorization is guaranteed to be stable for a symmetric positive definite matrix that is well conditioned.
The stability results for block LU factorization are summarized in Table 13.1, which tabulates a bound for $\|A - \widehat L\widehat U\|/(c_n u\|A\|)$ for block and point LU factorization for the matrix properties considered in this chapter. The constant $c_n$ incorporates any constants in the bound that depend polynomially on the dimension, so a value of 1 in the table indicates unconditional stability.
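Lemma 13.9 is easy to probe numerically. The sketch below draws a random symmetric positive definite matrix and checks $\|A_{21}A_{11}^{-1}\|_2 \le \kappa_2(A)^{1/2}$; it is a sanity check under assumed names, not from the book.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 8, 4
B = rng.standard_normal((n, n))
A = B @ B.T + 0.1 * np.eye(n)               # symmetric positive definite
A11, A21 = A[:r, :r], A[r:, :r]
lhs = np.linalg.norm(A21 @ np.linalg.inv(A11), 2)
rhs = np.sqrt(np.linalg.cond(A, 2))         # kappa_2(A)^{1/2}
print(lhs <= rhs)                           # True, per Lemma 13.9
```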

13.4. Notes and References


The distinction between a partitioned algorithm and a block algorithm is rarely
made in the literature (exceptions include the papers by Schreiber [1021, ] and
Demmel, Higham, and Schreiber [326, ]); the term “block algorithm” is fre-
quently used to describe both types of algorithm. A partitioned algorithm might
also be called a “blocked algorithm” (as is done by Dongarra, Duff, Sorensen,
and van der Vorst [349, ]), but the similarity of this term to “block algo-
rithm” can cause confusion and so we do not recommend this terminology. Note
that in the particular case of matrix multiplication, partitioned and block algo-
rithms are equivalent. Our treatment of partitioned LU factorization has focused
on the stability aspects; for further details, particularly concerning implementa-
tion on high-performance computers, see Dongarra, Duff, Sorensen, and van der
Vorst [349, ] and Golub and Van Loan [509, ].
Recursive LU factorization is now regarded as the most efficient way in which
to implement LU factorization on machines with hierarchical memories [535, ],
[1145, ], but it has not yet been incorporated into LAPACK.

Block LU factorization appears to have first been proposed for block tridiagonal matrices, which frequently arise in the discretization of partial differential equations. References relevant to this application include Isaacson and Keller [667, , p. 59], Varah [1187, ], Bank and Rose [62, ], Mattheij [827, ], [828, ], and Concus, Golub, and Meurant [262, ].
For an application of block LU factorization to linear programming, see Elder-
sveld and Saunders [388, ].
Theorem 13.5 is from Demmel and Higham [324, ]. The results in §13.3 are
from Demmel, Higham, and Schreiber [326, ], which extends earlier analysis
of block LU factorization by Demmel and Higham [324, ].
Block diagonal dominance was introduced by Feingold and Varga [406, ],
and has been used mainly in generalizations of the Gershgorin circle theorem.
Varah [1187, ] obtained bounds on kLk and kU k for block diagonally dominant
block tridiagonal matrices; see Problem 13.1.
Theorem 13.7 is obtained in the case of block diagonal dominance by rows with $\min_j \gamma_j > 0$ by Polman [946, ]; the proof in [946, ] makes use of the corresponding result for point diagonal dominance and thus differs from the proof we have given.
At the cost of a much more difficult proof, Lemma 13.9 can be strengthened to the attainable bound $\|A_{21}A_{11}^{-1}\|_2 \le \bigl(\kappa_2(A)^{1/2} - \kappa_2(A)^{-1/2}\bigr)/2$, as shown by Demmel [307, , Thm. 4], but the weaker bound is sufficient for our purposes.

13.4.1. LAPACK
LAPACK does not implement block LU factorization, but its LU factorization
(and related) routines for full matrices employ partitioned LU factorization in
order to exploit the level-3 BLAS and thereby to be efficient on high-performance
machines.

Problems

13.1. (Varah [1187, ]) Suppose $A$ is block tridiagonal and has the block LU factorization $A = LU$ (so that $L$ and $U$ are block bidiagonal and $U_{i,i+1} = A_{i,i+1}$). Show that if $A$ is block diagonally dominant by columns then
$$\|L_{i,i-1}\| \le 1, \qquad \|U_{ii}\| \le \|A_{ii}\| + \|A_{i-1,i}\|,$$
while if $A$ is block diagonally dominant by rows then
$$\|L_{i,i-1}\| \le \|A_{i,i-1}\|/\|A_{i-1,i}\|, \qquad \|U_{ii}\| \le \|A_{ii}\| + \|A_{i,i-1}\|.$$
What can be deduced about the stability of the factorization for these two classes of matrices?
13.2. Show that for the 1- and ∞-norms diagonal dominance does not imply block diagonal dominance, and vice versa.
13.3. If $A \in \mathbb{R}^{n\times n}$ is symmetric, has positive diagonal elements, and is block diagonally dominant by rows, must it be positive definite?
13.4. Let $A \in \mathbb{R}^{n\times n}$ be partitioned
$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \qquad A_{11} \in \mathbb{R}^{r\times r}, \qquad (13.26)$$
with $A_{11}$ nonsingular. Let $\|A\| := \max_{i,j}|a_{ij}|$. Show that $\|A_{21}A_{11}^{-1}\| \le n\rho_n\kappa(A)$, where $\rho_n$ is the growth factor for GE without pivoting on $A$. Show that the Schur complement $S = A_{22} - A_{21}A_{11}^{-1}A_{12}$ satisfies $\kappa(S) \le \rho_n\kappa(A)$.

13.5. Let $A \in \mathbb{R}^{n\times n}$ be partitioned as in (13.26), with $A_{11}$ nonsingular, and suppose that $A$ is point diagonally dominant by columns. Show that $\|A_{21}A_{11}^{-1}\|_1 \le 1$.
13.6. Show that under the conditions of Theorem 13.5 the computed solution to $Ax = b$ satisfies
$$(A + \Delta A)\widehat x = b, \qquad \|\Delta A\| \le c_n u\bigl(\|A\| + \|\widehat L\|\,\|\widehat U\|\bigr) + O(u^2),$$
and the computed solution to the multiple right-hand side system $AX = B$ (where (13.5) is assumed to hold for the multiple right-hand side triangular solves) satisfies
$$\|A\widehat X - B\| \le c_n u\bigl(\|A\| + \|\widehat L\|\,\|\widehat U\|\bigr)\|\widehat X\| + O(u^2).$$
In both cases, $c_n$ is a constant depending on $n$ and the block size.


13.7. Let $X = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \in \mathbb{R}^{n\times n}$, where $A$ is square and nonsingular. Show that
$$\det(X) = \det(A)\det(D - CA^{-1}B).$$
Assuming $A$, $B$, $C$, $D$ are all $m \times m$, give a condition under which $\det(X) = \det(AD - CB)$.
13.8. By using a block LU factorization show that
$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} A^{-1} + A^{-1}BS^{-1}CA^{-1} & -A^{-1}BS^{-1} \\ -S^{-1}CA^{-1} & S^{-1} \end{bmatrix},$$
where $A$ is assumed to be nonsingular and $S = D - CA^{-1}B$.


13.9. Let $A \in \mathbb{R}^{n\times m}$, $B \in \mathbb{R}^{m\times n}$. Derive the expression
$$(I - AB)^{-1} = I + A(I - BA)^{-1}B$$
by considering block LU and block UL factorizations of $\begin{bmatrix} I & A \\ B & I \end{bmatrix}$. Deduce the Sherman–Morrison–Woodbury formula
$$(T - UW^{-1}V^T)^{-1} = T^{-1} + T^{-1}U(W - V^TT^{-1}U)^{-1}V^TT^{-1},$$
where $T \in \mathbb{R}^{n\times n}$, $U \in \mathbb{R}^{n\times r}$, $W \in \mathbb{R}^{r\times r}$, $V \in \mathbb{R}^{r\times n}$.
