
Chapter 6

Applications Revisited

We now revisit the big applications mentioned in the introduction and Chapters 1 and 3, and prove the statements made there about the solutions of those problems.

6.1. The “Best” Subspace for Given Data


Consider the following situation: we have a large number of data points,
each with a large number of individual entries (variables). In principle,
we may suspect that this data arises from a process that is driven by only
a small number of key quantities. That is, we suspect that the data may in
fact be “low-dimensional.” How can we test this hypothesis? As stated,
this is very general. We want to look at the situation where the data
arises from a process that is linear, and so the data should be close to a
low-dimensional subspace. Notice that because of error in measurement
and/or noise, the data is very unlikely to perfectly line up with a low-
dimensional subspace, and may in fact span a very high-dimensional
space. Mathematically, we can phrase our problem as follows: suppose
we have 𝑚 points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 in ℝ𝑛 . Given 𝑘 ≤ min{𝑚, 𝑛}, what is the
“closest” 𝑘-dimensional subspace to these 𝑚 points? As will hopefully
be no great surprise at this stage, the Singular Value Decomposition can
provide an answer! Before we show how to use the Singular Value De-
composition to solve this problem, the following exercises should ALL
be completed.


[Figure 6.1. 𝒱 = ℝ³, 𝒰 is a two-dimensional subspace, and 𝑑(𝑥, 𝒰) = ‖𝑥 − 𝑃𝒰 𝑥‖.]

Exercise 6.1. (1) Suppose 𝒰 is a finite-dimensional subspace of a vector space 𝒱 that has inner product ⟨⋅, ⋅⟩𝒱, and let 𝑥 ∈ 𝒱. Let 𝑃𝒰 ∶ 𝒱 → 𝒱 be the orthogonal projection onto 𝒰, and show that the distance from 𝑥 to 𝒰, 𝑑(𝑥, 𝒰), is ‖𝑥 − 𝑃𝒰 𝑥‖𝒱. (Here, the norm is that induced by the inner product.) In terms of the notation from Chapter 3, this means that we will use ‖𝑎𝑗 − 𝑃𝒰 𝑎𝑗‖ in place of 𝑑(𝑎𝑗, 𝒰). (See also Figure 6.1.)
(2) Suppose we have 𝑚 points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 in ℝ𝑛 . Let 𝐴 be the ma-
trix whose rows are given by 𝑎𝑇1 , 𝑎𝑇2 , . . . , 𝑎𝑇𝑚 (remember: we consider el-
ements of ℝ𝑛 as column vectors, and so to get row vectors, we need the
transpose). Note that 𝐴 will be an 𝑚 × 𝑛 matrix. Let 𝑢 ∈ ℝ𝑛 be a unit vector. Explain why 𝐴𝑢 will be the (column) vector whose 𝑗th entry is the scalar component 𝑎𝑗 ⋅ 𝑢 of the projection of 𝑎𝑗 onto span{𝑢}.


(3) Suppose 𝒰 is a subspace of ℝ𝑛 and {𝑣 1 , 𝑣 2 , . . . , 𝑣 𝑘 } is an orthonormal


basis for 𝒰. Using the same notation as in the previous problem, if 𝑃𝒰 is
the orthogonal projection onto 𝒰, explain why
\[
\sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|_2^2 = \sum_{i=1}^{k} \|A v_i\|_2^2 .
\]

In other words, the sum of the squared magnitudes of the projections of the 𝑎𝑗 onto 𝒰 is given by the sum of the squared magnitudes of the 𝐴𝑣𝑖. (Note also that for an element 𝑥 ∈ ℝ𝑛, ‖𝑥‖₂² = 𝑥 ⋅ 𝑥, where ⋅ represents the dot product: 𝑥 ⋅ 𝑦 = 𝑥𝑇𝑦 when 𝑥, 𝑦 ∈ ℝ𝑛.)
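
A quick numerical check of the identity in part (3) can be reassuring. The following is a minimal sketch (ours, assuming NumPy is available; the variable names are not from the text): it builds a random data matrix 𝐴, an orthonormal basis of a random 𝑘-dimensional subspace via a QR factorization, and compares the two sums.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 8, 5, 2

A = rng.standard_normal((m, n))                     # rows are the points a_1, ..., a_m
V = np.linalg.qr(rng.standard_normal((n, k)))[0]    # orthonormal basis v_1, ..., v_k of U

# Left-hand side: sum_j ||P_U a_j||^2, using P_U a = V V^T a.
proj = A @ V @ V.T                                  # j-th row is (P_U a_j)^T
lhs = np.sum(proj**2)

# Right-hand side: sum_i ||A v_i||^2.
rhs = np.sum((A @ V)**2)

print(lhs, rhs)                                     # the two numbers agree up to rounding
```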

Suppose now that we have 𝑚 points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 in ℝ𝑛 and an in-


teger 1 ≤ 𝑘 ≤ min{𝑚, 𝑛}. For a 𝑘-dimensional subspace 𝒰 of ℝ𝑛 , notice
that the distance from 𝑎𝑗 to 𝒰 is ‖𝑎𝑗 − 𝑃𝒰 𝑎𝑗 ‖, where 𝑃𝒰 is the orthogonal
projection onto 𝒰. For measuring how close the points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 are
to 𝒰, we use the sum of the squares of their individual distances. That
is, how close the points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 are to 𝒰 is given by
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 .
\]

Thus, finding the “best” approximating 𝑘-dimensional subspace to the


points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 is equivalent to the following minimization prob-
lem: find a subspace 𝒰̂ with dim 𝒰̂ = 𝑘 such that
\[
(6.1)\qquad \sum_{j=1}^{m} \|a_j - P_{\hat{\mathcal{U}}} a_j\|^2 = \inf_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 .
\]

Notice that just as for the Pythagorean Theorem, we have


‖𝑎𝑗 − 𝑃𝒰 𝑎𝑗 ‖2 = ⟨𝑎𝑗 − 𝑃𝒰 𝑎𝑗 , 𝑎𝑗 − 𝑃𝒰 𝑎𝑗 ⟩
= ‖𝑎𝑗 ‖2 − 2 ⟨𝑎𝑗 , 𝑃𝒰 𝑎𝑗 ⟩ + ‖𝑃𝒰 𝑎𝑗 ‖2
= ‖𝑎𝑗 ‖2 − 2 ⟨𝑃𝒰 𝑎𝑗 , 𝑃𝒰 𝑎𝑗 ⟩ + ‖𝑃𝒰 𝑎𝑗 ‖2
= ‖𝑎𝑗 ‖2 − ‖𝑃𝒰 𝑎𝑗 ‖2 ,


since ⟨𝑎𝑗 , 𝑢⟩ = ⟨𝑃𝒰 𝑎𝑗 , 𝑢⟩ for any 𝑢 ∈ 𝒰. (In fact, that is a defining char-
acteristic of the projection onto 𝒰 - see Proposition 3.23.) Thus,
\[
\begin{aligned}
\sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 &= \sum_{j=1}^{m} \left( \|a_j\|^2 - \|P_{\mathcal{U}} a_j\|^2 \right) \\
&= \left( \sum_{j=1}^{m} \|a_j\|^2 \right) - \left( \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \right) \\
&= \|A\|_F^2 - \left( \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \right),
\end{aligned}
\]

where 𝐴 is the matrix whose rows are 𝑎𝑇1 , 𝑎𝑇2 , . . . , 𝑎𝑇𝑚 . Therefore, to min-
imize
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 = \|A\|_F^2 - \left( \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \right),
\]
we need to make the second term as large as possible! (That is, we want
to subtract as much as we possibly can.) Thus, it turns out that (6.1) is
equivalent to finding a subspace 𝒰̂ with dim 𝒰̂ = 𝑘 such that
\[
(6.2)\qquad \sum_{j=1}^{m} \|P_{\hat{\mathcal{U}}} a_j\|^2 = \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 .
\]

Moreover, if 𝒰̂ is a subspace of dimension 𝑘 that maximizes (6.2), then


the corresponding minimum in (6.1) is provided by
\[
\|A\|_F^2 - \sum_{j=1}^{m} \|P_{\hat{\mathcal{U}}} a_j\|^2 .
\]

Notice that to specify a subspace 𝒰, we need only specify a basis of 𝒰.


Suppose now that 𝐴 is the 𝑚 × 𝑛 matrix whose rows are given by
𝑎𝑇1 , 𝑎𝑇2 , . . . , 𝑎𝑇𝑚 , and suppose that 𝐴 = 𝑌 Σ𝑋 𝑇 is the Singular Value De-
composition of 𝐴. (Thus, the 𝑚 columns of 𝑌 form an orthonormal basis
of ℝ𝑚 , and the 𝑛 columns of 𝑋 form an orthonormal basis of ℝ𝑛 .) For a
fixed 𝑘 ∈ {1, 2, . . . , min{𝑚, 𝑛}}, we claim that the best 𝑘-dimensional sub-
space of ℝ𝑛 is given by 𝒲 𝑘 = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑘 }, where 𝑥1 , 𝑥2 , . . . , 𝑥𝑛
are the columns of 𝑋 (or equivalently their transposes are the rows of
𝑋 𝑇 ). Thus, if the singular triples of 𝐴 are (𝜎 𝑖 , 𝑥𝑖 , 𝑦 𝑖 ), then the closest
𝑘-dimensional subspace to 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 is span{𝑥1 , 𝑥2 , . . . , 𝑥𝑘 }.


Theorem 6.2. Suppose 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 are points in ℝ𝑛 , let 𝐴 be the 𝑚 × 𝑛


matrix whose rows are given by 𝑎𝑇1 , 𝑎𝑇2 , . . . , 𝑎𝑇𝑚 , and let 𝑝 = min{𝑚, 𝑛}.
Suppose the singular triples of 𝐴 are (𝜎 𝑖 , 𝑥𝑖 , 𝑦 𝑖 ). For any 𝑘 ∈ {1, 2, . . . , 𝑝},
if 𝒲𝑘 = span{𝑥1, 𝑥2, . . . , 𝑥𝑘}, we will have
\[
\sum_{j=1}^{m} \|P_{\mathcal{W}_k} a_j\|^2 = \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2,
\]
or equivalently
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_k} a_j\|^2 = \inf_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 .
\]

Moreover, we will have


\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_k} a_j\|^2 = \sigma_{k+1}^2(A) + \cdots + \sigma_p^2(A).
\]

Proof. We use induction on 𝑘. When 𝑘 = 1, specifying a one-dimensional subspace of ℝ𝑛 means specifying a non-zero 𝑢. Moreover, for any 𝑢 ≠ 0ℝ𝑛, we have
\[
P_{\operatorname{span}\{u\}} x = \frac{\langle x, u\rangle}{\langle u, u\rangle}\, u = \frac{x \cdot u}{u \cdot u}\, u,
\qquad\text{and thus}\qquad
\|P_{\operatorname{span}\{u\}} x\| = \frac{|x \cdot u|}{\|u\|}.
\]
Therefore,
\[
\begin{aligned}
\sup_{\dim \mathcal{U} = 1} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2
&= \sup_{u \neq 0_{\mathbb{R}^n}} \sum_{j=1}^{m} \frac{(a_j \cdot u)^2}{\|u\|^2} \\
&= \sup_{u \neq 0_{\mathbb{R}^n}} \frac{\|A u\|^2}{\|u\|^2}
= \left( \sup_{u \neq 0_{\mathbb{R}^n}} \frac{\|A u\|}{\|u\|} \right)^2
= \sigma_1(A)^2 .
\end{aligned}
\]

(To go from the first line to the second line, we have used the fact that the
𝑗th entry in 𝐴𝑢 is the dot product of 𝑎𝑗 and 𝑢.) Moreover, a maximizer
above is provided by 𝑥1 , as is shown by Theorem 5.19. Therefore, we
see that a maximizing subspace is given by 𝒲1 = span{𝑥1 }. Next, as we


showed in the discussion between (6.1) and (6.2),


\[
\begin{aligned}
\inf_{\dim \mathcal{U} = 1} \sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2
&= \|A\|_F^2 - \sup_{\dim \mathcal{U} = 1} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \\
&= \left( \sum_{i=1}^{p} \sigma_i^2(A) \right) - \sigma_1^2(A) \\
&= \sigma_2^2(A) + \sigma_3^2(A) + \cdots + \sigma_p^2(A).
\end{aligned}
\]

This finishes the proof of the base case (𝑘 = 1).


Suppose now that we know that 𝒲 𝑘 = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑘 } is a max-
imizing 𝑘-dimensional subspace: i.e.
\[
\sum_{j=1}^{m} \|P_{\mathcal{W}_k} a_j\|^2 = \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2,
\]
and that
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_k} a_j\|^2 = \sigma_{k+1}^2(A) + \cdots + \sigma_p^2(A).
\]

We now show that 𝒲𝑘+1 will maximize ∑𝑗 ‖𝑃𝒰 𝑎𝑗‖² over all possible (𝑘 + 1)-dimensional subspaces 𝒰 of ℝ𝑛, and that
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_{k+1}} a_j\|^2 = \sigma_{k+2}^2(A) + \cdots + \sigma_p^2(A).
\]

Let 𝒰 be an arbitrary (𝑘 + 1)-dimensional subspace. Notice that 𝒲𝑘⊥ has dimension 𝑛 − 𝑘, and therefore we have
\[
n \geq \dim(\mathcal{U} + \mathcal{W}_k^{\perp}) = \dim \mathcal{U} + \dim \mathcal{W}_k^{\perp} - \dim(\mathcal{U} \cap \mathcal{W}_k^{\perp})
= k + 1 + n - k - \dim(\mathcal{U} \cap \mathcal{W}_k^{\perp}).
\]

Therefore, a little rearrangement shows that dim(𝒰 ∩ 𝒲𝑘⊥) ≥ 1. Thus, there is an orthonormal basis 𝑦1, 𝑦2, . . . , 𝑦𝑘, 𝑦𝑘+1 of 𝒰 such that 𝑦𝑘+1 ∈ 𝒰 ∩ 𝒲𝑘⊥. From Theorem 5.19, we then know that
\[
(6.3)\qquad \|A y_{k+1}\| \leq \sup_{u \in \mathcal{W}_k^{\perp} \setminus \{0_{\mathbb{R}^n}\}} \frac{\|A u\|}{\|u\|} = \|A x_{k+1}\| = \sigma_{k+1}(A).
\]


Moreover, since 𝑈̃ = span{𝑦1, 𝑦2, . . . , 𝑦𝑘} is a 𝑘-dimensional subspace, we will have (using the result of Exercise 6.1(3))
\[
(6.4)\qquad \sum_{i=1}^{k} \|A y_i\|^2 = \sum_{j=1}^{m} \|P_{\tilde{U}} a_j\|^2 \leq \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 = \sum_{i=1}^{k} \|A x_i\|^2 .
\]

Combining (6.3) and (6.4) and using the result of Exercise 6.1(3) above, we see
\[
\sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 = \sum_{i=1}^{k+1} \|A y_i\|^2 \leq \sum_{i=1}^{k+1} \|A x_i\|^2 = \sum_{j=1}^{m} \|P_{\mathcal{W}_{k+1}} a_j\|^2 .
\]

Since 𝒰 is an arbitrary subspace of dimension 𝑘 + 1,


𝒲 𝑘+1 = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑘+1 }
is a maximizing subspace of dimension 𝑘 + 1. Thus, ‖𝐴𝑥𝑘+1 ‖ = 𝜎 𝑘+1 (𝐴)
implies
\[
\sup_{\dim \mathcal{U} = k+1} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 = \sum_{i=1}^{k+1} \|A x_i\|^2 = \sum_{i=1}^{k+1} \sigma_i^2(A),
\]
and so we have
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_{k+1}} a_j\|^2 = \|A\|_F^2 - \sum_{i=1}^{k+1} \sigma_i^2(A) = \sigma_{k+2}^2(A) + \cdots + \sigma_p^2(A). \qquad \square
\]

It’s important to make sure that the points are normalized so that
their “center of mass” is at the origin.
Example 6.3. Consider the points [−1 1]𝑇, [0 1]𝑇, and [1 1]𝑇. These are clearly all on the line 𝑦 = 1 in the 𝑥𝑦 plane. Let
\[
A = \begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}.
\]
We have
\[
A^T A = \begin{bmatrix} -1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}.
\]


Therefore, the eigenvalues of 𝐴𝑇𝐴 are 𝜆↓1 = 3 and 𝜆↓2 = 2, with eigenvectors [0 1]𝑇 and [1 0]𝑇. Consequently, the singular values of 𝐴 are √3 and √2. Moreover, the first singular triple is
\[
\left( \sqrt{3},\ \begin{bmatrix} 0 \\ 1 \end{bmatrix},\ \frac{1}{\sqrt{3}} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \right),
\]
and so by Theorem 6.2, the closest one-dimensional subspace to these
points is span{[0 1]𝑇 }, which is the 𝑦-axis in the 𝑥𝑦 plane . . . which is not
really a good approximation of the line 𝑦 = 1. What’s going on here?
The issue is that the best subspace must contain the origin, and the
line that the given points lie on does not contain the origin. Therefore,
we “normalize” the points by computing their center of mass and sub-
tracting it from each point. What is left over is then centered at the ori-
gin. In this example, the center of mass is the point 𝑥̄ = [0 1]𝑇 . When
we subtract this point from each point in the given collection, we have
the new collection [−1 0]𝑇 , [0 0]𝑇 , and [1 0]𝑇 . We now consider
\[
\tilde{A} = \begin{bmatrix} -1 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix}.
\]
We have
\[
\tilde{A}^T \tilde{A} = \begin{bmatrix} -1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} -1 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix},
\]
and hence the first singular triple of 𝐴̃ is
\[
\left( \sqrt{2},\ \begin{bmatrix} 1 \\ 0 \end{bmatrix},\ \frac{1}{\sqrt{2}} \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} \right),
\]
and Theorem 6.2 tells us that the closest subspace to the translated collection [−1 0]𝑇, [0 0]𝑇, and [1 0]𝑇 is span{[1 0]𝑇}, i.e. the 𝑥-axis, which makes sense since the translated collection lies on the 𝑥-axis. When we undo the translation, we get a horizontal line through the point 𝑥̄.

In the general situation, the idea is to translate the given collection


of points to a new collection whose center of mass is the origin. If the
original collection has the points {𝑎1 , 𝑎2 , . . . , 𝑎𝑛 }, the center of mass is


𝑎̄ ≔ (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑎𝑖. (In our example, we have 𝑎1 = [−1 1]𝑇, 𝑎2 = [0 1]𝑇, and 𝑎3 = [1 1]𝑇. The center of mass is then 𝑎̄ = (1/3)[0 3]𝑇 = [0 1]𝑇.) We then consider the translated collection {𝑎1 − 𝑎̄, 𝑎2 − 𝑎̄, . . . , 𝑎𝑛 − 𝑎̄}, and determine the closest subspace to this translated collection. To undo the translation, we simply add 𝑎̄ to the best subspace, creating an “affine subspace.”
By calculating and graphing the singular values, we can get an idea as to whether or not the data set is low-dimensional. One way of doing this is to look at the relative sizes of successive singular values. Some caution: the size of the drop that signals when we can neglect higher dimensions depends on the problem! For some problems, a drop by a factor of 10 may be a good enough sign. For others, perhaps a factor of 100 may be necessary.
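
To make the recipe concrete, here is a small sketch (ours, assuming NumPy; the function name best_subspace is hypothetical) of the full procedure: center the points, take the SVD of the centered data matrix, read off the first 𝑘 right singular vectors, and inspect the singular values for a drop.

```python
import numpy as np

def best_subspace(points, k):
    """Closest k-dimensional affine subspace to the given points (rows),
    via Theorem 6.2 applied to the mean-centered data matrix."""
    A = np.asarray(points, dtype=float)        # m x n, rows are the a_j
    a_bar = A.mean(axis=0)                     # center of mass
    A_tilde = A - a_bar                        # translate the center to the origin
    # Right singular vectors are the columns of X in A_tilde = Y Sigma X^T.
    _, sigma, XT = np.linalg.svd(A_tilde, full_matrices=False)
    basis = XT[:k].T                           # orthonormal basis of W_k (n x k)
    return a_bar, basis, sigma

# The three points of Example 6.3: the best line is { a_bar + t * (+-[1, 0]) }.
a_bar, basis, sigma = best_subspace([[-1, 1], [0, 1], [1, 1]], k=1)
print(a_bar, basis.ravel(), sigma)             # [0. 1.], +-[1. 0.], [1.414..., 0.]
```

Plotting (or simply printing) the returned singular values is the quickest way to judge how low-dimensional the centered data really is.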

6.2. Least Squares and Moore-Penrose Pseudo-Inverse


Suppose 𝐿 ∈ ℒ (𝒱, 𝒲). If dim 𝒱 ≠ dim 𝒲, then 𝐿 will not have an
inverse. Moreover, even if dim 𝒱 = dim 𝒲, 𝐿 may not have an inverse.
However, it will often be the case that given a 𝑦 ∈ 𝒲, we will want to
find an 𝑥 ∈ 𝒱 such that 𝐿𝑥 is close to 𝑦. If 𝐿 has an inverse, then the
“correct” 𝑥 is clearly 𝐿−1 𝑦. Notice that in this case 𝐿−1 𝑦 clearly minimizes
‖𝐿𝑥 − 𝑦‖𝒲 over all possible 𝑥 ∈ 𝒱. (In fact, when 𝐿 has an inverse, the
minimum of ‖𝐿𝑥−𝑦‖𝒲 is zero!) So one way to generalize the inverse is to
consider the following process: given 𝑦 ∈ 𝒲, find 𝑥 ∈ 𝒱 that minimizes
‖𝐿𝑥−𝑦‖𝒲 . However, what if there are lots of minimizing 𝑥 ∈ 𝒱? Which
one should we pick? A common method is to pick the smallest such 𝑥.
Thus, to get a “pseudo”-inverse we use the following procedure: given
𝑦 ∈ 𝒲, find the smallest 𝑥 ∈ 𝒱 that minimizes ‖𝐿𝑥 − 𝑦‖𝒲 . Is there a
formula that we can use for this process?
Let’s look at the first part: given 𝑦 ∈ 𝒲, find an 𝑥 ∈ 𝒱 that mini-
mizes ‖𝐿𝑥 − 𝑦‖𝒲 . Notice that the set of all 𝐿𝑥 for which 𝑥 ∈ 𝒱 is the
range of 𝐿 (ℛ (𝐿)), and so therefore minimizing ‖𝐿𝑥 − 𝑦‖𝒲 means mini-
mizing the distance from 𝑦 to ℛ (𝐿), i.e. finding the projection of 𝑦 onto


ℛ (𝐿). If we have an orthonormal basis of ℛ (𝐿), calculating this projec-


tion is straightforward. Notice that an orthonormal basis of ℛ (𝐿) is one
of the things provided by the Singular Value Decomposition!
Recall that the Singular Value Decomposition says that there are or-
thonormal bases {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } of 𝒱 and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 } of 𝒲, and there
are non-negative numbers 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 such that 𝐿𝑥𝑖 = 𝜎 𝑖 𝑦 𝑖
and 𝐿∗ 𝑦 𝑖 = 𝜎 𝑖 𝑥𝑖 for 𝑖 = 1, 2, . . . , 𝑝. (Here, 𝑝 equals the minimum of
{dim 𝒱, dim 𝒲} = min{𝑛, 𝑚}.) Further, if 𝑟 ≔ max{𝑖 ∶ 𝜎 𝑖 > 0}, then
Theorem 3.51 and the following problem imply that

𝒩 (𝐿) = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑟 }⊥ .
Exercise 6.4. Show that ℛ (𝐿∗ ) = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑟 }.

Thus, given 𝑦 ∈ 𝒲, Proposition 3.26 implies that


𝑃𝑦 = ⟨𝑦, 𝑦1 ⟩𝒲 𝑦1 + ⟨𝑦, 𝑦2 ⟩𝒲 𝑦2 + ⋯ + ⟨𝑦, 𝑦𝑟 ⟩𝒲 𝑦𝑟 .
Since 𝐿𝑥𝑖 = 𝜎 𝑖 𝑦 𝑖 and 𝜎 𝑖 ≠ 0 for 𝑖 = 1, 2, . . . , 𝑟, if we define
\[
\hat{x} = \frac{\langle y, y_1\rangle_{\mathcal{W}}}{\sigma_1} x_1 + \frac{\langle y, y_2\rangle_{\mathcal{W}}}{\sigma_2} x_2 + \cdots + \frac{\langle y, y_r\rangle_{\mathcal{W}}}{\sigma_r} x_r ,
\]
we have 𝐿𝑥̂ = 𝑃𝑦. Notice that 𝐿𝑥̂ minimizes the distance from 𝑦 to ℛ (𝐿),
and so 𝑥̂ is a minimizer of ‖𝐿𝑥 − 𝑦‖𝑊 .
We now claim that 𝑥̂ is the smallest 𝑥 ∈ 𝒱 such that 𝐿𝑥 = 𝑃𝑦. Let
𝑥 be any element of 𝒱 such that 𝐿𝑥 = 𝑃𝑦. Therefore, 𝐿(𝑥 − 𝑥)̂ = 0𝒲
and hence 𝑥 − 𝑥̂ ∈ 𝒩 (𝐿), i.e. 𝑥 = 𝑥̂ + 𝑧 for some 𝑧 ∈ 𝒩 (𝐿). Because

𝑥̂ ∈ span{𝑥1, 𝑥2, . . . , 𝑥𝑟} and 𝑧 ∈ 𝒩(𝐿) = span{𝑥1, 𝑥2, . . . , 𝑥𝑟}⊥, we have ⟨𝑥̂, 𝑧⟩𝒱 = 0. By the Pythagorean Theorem,
\[
\|x\|_{\mathcal{V}}^2 = \|\hat{x}\|_{\mathcal{V}}^2 + \|z\|_{\mathcal{V}}^2 \geq \|\hat{x}\|_{\mathcal{V}}^2 ,
\]
which means that 𝑥̂ has the smallest norm among all 𝑥 with 𝐿𝑥 = 𝑃𝑦.
This gives us a formula for our pseudo-inverse 𝐿† in terms of the or-
thonormal bases and the singular values of 𝐿:
\[
(6.5)\qquad L^{\dagger} y = \frac{\langle y, y_1\rangle_{\mathcal{W}}}{\sigma_1} x_1 + \frac{\langle y, y_2\rangle_{\mathcal{W}}}{\sigma_2} x_2 + \cdots + \frac{\langle y, y_r\rangle_{\mathcal{W}}}{\sigma_r} x_r .
\]
Thus, 𝐿† 𝑦 is the smallest element of 𝒱 that minimizes ‖𝐿𝑥 − 𝑦‖𝒲 .


Exercise 6.5. Explain why formula (6.5) above implies 𝐿† and 𝐿−1 are
the same when 𝐿 is invertible. (Think about what the singular triples of
𝐿−1 are when 𝐿−1 exists.)

What does this mean when 𝒱 = ℝ𝑛 and 𝒲 = ℝ𝑚 (with the dot


product as their inner product) and 𝐴 is an 𝑚 × 𝑛 matrix? Here we recall
the reduced Singular Value Decomposition of 𝐴: let 𝑋 be the 𝑛×𝑛 matrix
whose columns are given by {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and 𝑌 be the 𝑚 × 𝑚 matrix
whose columns are given by {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 }, and finally Σ̃ is the 𝑟 × 𝑟
diagonal matrix whose diagonal entries are 𝜎 𝑖 (for 𝑖 = 1, 2, . . . 𝑟, where 𝑟
is the rank of 𝐴). If 𝑋𝑟 denotes the first 𝑟 columns of 𝑋 and 𝑌𝑟 denotes
the first 𝑟 columns of 𝑌, then we know that 𝐴 = 𝑌𝑟 Σ̃ 𝑋𝑟𝑇. We now get a formula for 𝐴† in terms of these matrices. For 𝑗 = 1, 2, . . . , 𝑟, we have
\[
X_r (\tilde{\Sigma})^{-1} Y_r^T y_j
= \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \frac{1}{\sigma_2} & & \\ & & \ddots & \\ & & & \frac{1}{\sigma_r} \end{bmatrix}
\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_r^T \end{bmatrix} y_j
= \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \frac{1}{\sigma_2} & & \\ & & \ddots & \\ & & & \frac{1}{\sigma_r} \end{bmatrix}
\begin{bmatrix} y_1^T y_j \\ y_2^T y_j \\ \vdots \\ y_r^T y_j \end{bmatrix},
\]
and therefore (letting e𝑗,𝑟 be the first 𝑟 entries from the column vector e𝑗 in ℝ𝑚) we will have
\[
X_r (\tilde{\Sigma})^{-1} Y_r^T y_j
= \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \frac{1}{\sigma_2} & & \\ & & \ddots & \\ & & & \frac{1}{\sigma_r} \end{bmatrix} e_{j,r}
= \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix} \frac{1}{\sigma_j}\, e_{j,r}
= \frac{x_j}{\sigma_j},
\]

while for 𝑗 = 𝑟 + 1, . . . , 𝑚, we will have 𝑋𝑟(Σ̃)⁻¹𝑌𝑟𝑇 𝑦𝑗 = 0. Therefore, we see that 𝑋𝑟(Σ̃)⁻¹𝑌𝑟𝑇 𝑦𝑗 agrees with formula (6.5) applied to 𝑦𝑗 for each 𝑗 = 1, 2, . . . , 𝑚, and so we have 𝐴† = 𝑋𝑟(Σ̃)⁻¹𝑌𝑟𝑇:
\[
A^{\dagger} = \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \frac{1}{\sigma_2} & & \\ & & \ddots & \\ & & & \frac{1}{\sigma_r} \end{bmatrix}
\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_r^T \end{bmatrix}.
\]

Exercise 6.6. Use the formula above to give (yet) another explanation
that 𝐴† and 𝐴−1 are the same when 𝐴−1 exists.
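
For concreteness, the matrix formula can be assembled in a few lines. The sketch below is ours and assumes NumPy (np.linalg.pinv is NumPy's built-in pseudo-inverse, used only as a cross-check); it forms 𝐴† = 𝑋𝑟 Σ̃⁻¹ 𝑌𝑟𝑇 from the SVD and verifies that 𝑥̂ = 𝐴†𝑦 satisfies the normal equations, i.e. is a least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 6, 4, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # a rank-r matrix
y = rng.standard_normal(m)

Y, sigma, XT = np.linalg.svd(A, full_matrices=False)
r = np.sum(sigma > 1e-12 * sigma[0])          # numerical rank
Yr, Xr, sr = Y[:, :r], XT[:r].T, sigma[:r]

A_pinv = Xr @ np.diag(1.0 / sr) @ Yr.T        # A-dagger = X_r Sigma_tilde^{-1} Y_r^T
x_hat = A_pinv @ y

print(np.allclose(A_pinv, np.linalg.pinv(A)))       # matches the built-in pseudo-inverse
print(np.allclose(A.T @ (A @ x_hat - y), 0))        # normal equations hold for x_hat
```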

6.3. Eckart-Young-Mirsky for the Operator Norm


Suppose 𝒱 and 𝒲 are finite-dimensional inner-product spaces, with in-
ner products ⟨⋅, ⋅⟩𝒱 and ⟨⋅, ⋅⟩𝒲 , respectively, and suppose further that
𝐿 ∈ ℒ (𝒱, 𝒲). The Singular Value Decomposition provides a way to
see what the most important parts of 𝐿 are. This is related to the follow-
ing problem: given 𝑘 = 1, 2, . . . , rank 𝐿 − 1, what rank 𝑘 𝑀 ∈ ℒ (𝒱, 𝒲)
is closest to 𝐿? In other words, what is the closest rank 𝑘 operator to 𝐿? A very important ingredient to answering this question is determining what we mean by “closest” — how are we measuring distance? One approach is to use the operator norm: ‖𝐿‖𝑜𝑝 = sup{‖𝐿𝑥‖𝒲 / ‖𝑥‖𝒱 ∶ 𝑥 ≠ 0𝒱}, where ‖ ⋅ ‖𝒱 and ‖ ⋅ ‖𝒲 are the norms induced by the inner products. Thus, our problem is as follows: find 𝑀̃ ∈ ℒ(𝒱, 𝒲) with rank 𝑘 such that
\[
\|L - \tilde{M}\|_{op} = \inf\{ \|L - M\|_{op} : \operatorname{rank} M = k \} .
\]
We can get an upper bound on inf{‖𝐿 − 𝑀‖𝑜𝑝 ∶ rank 𝑀 = 𝑘} by con-
sidering a particular 𝑀. Suppose that dim 𝒱 = 𝑛 and dim 𝒲 = 𝑚,
and let 𝑝 = min{𝑛, 𝑚}. Let {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 } be the
orthonormal bases of 𝒱, 𝒲 and suppose 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 are the
singular values of 𝐿 provided by the SVD. If rank 𝐿 = 𝑟, then we know
that 𝜎 𝑖 = 0 when 𝑖 > 𝑟. Moreover, we know that for any 𝑥 ∈ 𝒱, we have
\[
L x = \sum_{i=1}^{r} \sigma_i \langle x_i, x\rangle_{\mathcal{V}} \, y_i .
\]


We now want to consider 𝑀𝑘 ∈ ℒ (𝒱, 𝒲) defined by


\[
(6.6)\qquad M_k x = \sum_{i=1}^{k} \sigma_i \langle x_i, x\rangle_{\mathcal{V}} \, y_i .
\]

Exercise 6.7. Show that ℛ (𝑀𝑘 ) = span{𝑦1 , 𝑦2 , . . . , 𝑦 𝑘 }, and conclude


that rank 𝑀𝑘 = 𝑘.

We will have
\[
(L - M_k) x = \sum_{i=k+1}^{r} \sigma_i \langle x_i, x\rangle_{\mathcal{V}} \, y_i ,
\]

and we now calculate ‖𝐿 − 𝑀𝑘 ‖𝑜𝑝 . Let 𝑥 = 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ + 𝑎𝑛 𝑥𝑛 . By


the orthonormality of the 𝑥𝑖 , we will have
\[
(L - M_k) x = \sum_{i=k+1}^{r} \sigma_i a_i y_i .
\]

By the Pythagorean Theorem and the fact that the singular values are
decreasing,
\[
\|(L - M_k) x\|_{\mathcal{W}}^2 = \sum_{i=k+1}^{r} \sigma_i^2 a_i^2
\leq \sigma_{k+1}^2 \sum_{i=k+1}^{r} a_i^2
\leq \sigma_{k+1}^2 \sum_{i=1}^{n} a_i^2 = \sigma_{k+1}^2 \|x\|_{\mathcal{V}}^2 ,
\]
𝑖=1

and therefore for any 𝑥 ≠ 0𝒱 ,


\[
\frac{\|(L - M_k) x\|_{\mathcal{W}}^2}{\|x\|_{\mathcal{V}}^2} \leq \sigma_{k+1}^2 .
\]
Taking square roots, we see that ‖(𝐿 − 𝑀𝑘)𝑥‖𝒲 / ‖𝑥‖𝒱 ≤ 𝜎𝑘+1 for any 𝑥 ≠ 0𝒱. Thus, ‖𝐿 − 𝑀𝑘‖𝑜𝑝 ≤ 𝜎𝑘+1. On the other hand, if we consider 𝑥𝑘+1, we have
\[
\frac{\|(L - M_k) x_{k+1}\|_{\mathcal{W}}}{\|x_{k+1}\|_{\mathcal{V}}} = \|\sigma_{k+1} y_{k+1}\|_{\mathcal{W}} = \sigma_{k+1} ,
\]
which implies ‖𝐿 − 𝑀𝑘‖𝑜𝑝 = 𝜎𝑘+1.


Therefore, inf{‖𝐿 − 𝑀‖𝑜𝑝 ∶ rank 𝑀 = 𝑘} ≤ 𝜎 𝑘+1 . (Why?) A harder


question: can we do any better? The answer turns out to be no!
Theorem 6.8 (Eckart-Young-Mirsky Theorem, Operator Norm). Sup-
pose 𝒱 and 𝒲 are two finite-dimensional inner-product spaces, and as-
sume dim 𝒱 = 𝑛 and dim 𝒲 = 𝑚. Let 𝑝 = min{𝑛, 𝑚}, and suppose
𝐿 ∈ ℒ (𝒱, 𝒲) and rank 𝐿 = 𝑟. Let {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 }
and 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 be the orthonormal bases of 𝒱, 𝒲 (respec-
tively) and the singular values of 𝐿 provided by the SVD. If 𝑀𝑘 is defined
by (6.6), we have
inf{‖𝐿 − 𝑀‖𝑜𝑝 ∶ rank 𝑀 = 𝑘} = 𝜎 𝑘+1 = ‖𝐿 − 𝑀𝑘 ‖𝑜𝑝 .
In other words, 𝑀𝑘 is the closest rank 𝑘 linear operator to 𝐿, as measured
by the operator norm.

Proof. Since the proof requires examining the singular values of differ-
ent operators, we use 𝜎 𝑘 (𝐴) to mean the 𝑘th singular value of 𝐴. In the
preceding paragraphs, we have shown that
inf{‖𝐿 − 𝑀‖𝑜𝑝 ∶ rank 𝑀 = 𝑘} ≤ 𝜎 𝑘+1 (𝐿) = ‖𝐿 − 𝑀𝑘 ‖𝑜𝑝 .
To finish the proof, we must show that ‖𝐿 − 𝑀‖𝑜𝑝 ≥ 𝜎 𝑘+1 (𝐿) for any
𝑀 ∈ ℒ (𝒱, 𝒲) with rank 𝑀 = 𝑘.
Suppose that 𝑀 ∈ ℒ (𝒱, 𝒲) has rank 𝑘, and recall that Theorem
5.2(b) tells us ‖𝐿 − 𝑀‖𝑜𝑝 = 𝜎1(𝐿 − 𝑀). We now use Weyl's inequality: 𝜎𝑖+𝑗−1(𝐿1 + 𝐿2) ≤ 𝜎𝑖(𝐿1) + 𝜎𝑗(𝐿2) for any 𝐿1, 𝐿2 ∈ ℒ(𝒱, 𝒲) and for any indices 𝑖, 𝑗 with 𝑖 + 𝑗 − 1 between 1 and min{𝑛, 𝑚}. Replacing 𝐿1 with 𝐿 − 𝑀 and 𝐿2 with 𝑀, and taking 𝑖 = 1 and 𝑗 = 𝑘 + 1, we will have
𝜎 𝑘+1 (𝐿) = 𝜎1+𝑘+1−1 (𝐿) ≤ 𝜎1 (𝐿 − 𝑀) + 𝜎 𝑘+1 (𝑀)
= 𝜎1 (𝐿 − 𝑀) = ‖𝐿 − 𝑀‖𝑜𝑝 ,
since rank 𝑀 = 𝑘 implies that 𝜎 𝑘+1 (𝑀) = 0. □
Corollary 6.9. Let 𝐴 be an 𝑚 × 𝑛 matrix, and let 𝑝 = min{𝑚, 𝑛}. Let
{𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 } be the orthonormal bases of ℝ𝑛 and
ℝ𝑚 and suppose 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 are the singular values provided by
the Singular Value Decomposition of 𝐴. Assume 𝑘 ≤ rank 𝐴, and let
\[
B_k = \sum_{i=1}^{k} \sigma_i(A)\, y_i x_i^T .
\]


Then 𝐵𝑘 has rank 𝑘 and


‖𝐴 − 𝐵𝑘 ‖𝑜𝑝 = inf{‖𝐴 − 𝐵‖𝑜𝑝 ∶ rank 𝐵 = 𝑘} .
Exercise 6.10. Prove the preceding corollary. (A possible approach:
show that 𝐵𝑘 and 𝑀𝑘 are equal.)
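
A quick numerical illustration of the corollary (ours, assuming NumPy): build 𝐵𝑘 from the singular triples of a random matrix and check that ‖𝐴 − 𝐵𝑘‖𝑜𝑝 = 𝜎𝑘+1(𝐴). The operator (spectral) norm of a matrix is its largest singular value, which NumPy exposes as np.linalg.norm(·, 2).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 7, 5, 2
A = rng.standard_normal((m, n))

Y, sigma, XT = np.linalg.svd(A, full_matrices=False)

# B_k = sum_{i <= k} sigma_i y_i x_i^T, the truncated SVD.
B_k = sum(sigma[i] * np.outer(Y[:, i], XT[i]) for i in range(k))

err = np.linalg.norm(A - B_k, 2)      # operator (spectral) norm of the difference
print(err, sigma[k])                  # both equal sigma_{k+1}(A)
```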

6.4. Eckart-Young-Mirsky for the Frobenius Norm and Image Compression

The previous section tells us how to “best” approximate a given linear
operator 𝐿 ∈ ℒ (𝒱, 𝒲) by one of lower rank when we use the operator
norm. In this section we consider a more concrete problem, approximat-
ing a matrix. Recall that if 𝐴 is a gray-scale matrix, a better measure of
distance and size is provided by the Frobenius norm. (In addition, the
Frobenius norm is easier to calculate!) Recall, the Frobenius norm of an
𝑚 × 𝑛 matrix is given by
\[
\|A\|_F := \left( \sum_{i,j} A_{ij}^2 \right)^{1/2},
\]

so ‖𝐴‖𝐹² is the sum of the squares of all of the entries of 𝐴. What is the
closest rank 𝑘 matrix to 𝐴, as measured in the Frobenius norm? Our
problem is: find 𝐵̃ with rank 𝑘 such that
‖𝐴 − 𝐵̃‖𝐹 = inf{‖𝐴 − 𝐵‖𝐹 ∶ rank 𝐵 = 𝑘}.
As in the case of the operator norm, we can get an upper bound on
the infimum by considering a particular 𝐵. We use the same particu-
lar matrix as in the operator norm: suppose 𝐴 = 𝑌 Σ𝑋 𝑇 is the Singular
Value Decomposition of 𝐴. Σ is an 𝑚 × 𝑛 matrix, whose diagonal entries
are Σ𝑖𝑖 = 𝜎 𝑖 (𝐴). Let Σ𝑘 be the 𝑚 × 𝑛 matrix that has the same first 𝑘
diagonal entries as Σ, and the remaining entries are all zeros. Now, let
𝐵 𝑘 = 𝑌 Σ𝑘 𝑋 𝑇 .
Exercise 6.11. Show that
\[
\|A - B_k\|_F^2 = \sigma_{k+1}^2(A) + \sigma_{k+2}^2(A) + \cdots + \sigma_r^2(A).
\]
(Recall the extraordinarily useful fact that the Frobenius norm is invari-
ant under orthogonal transformations, which means that if 𝑊 is any
appropriately sized square matrix such that 𝑊 𝑇 𝑊 = 𝐼, then we have


‖𝐴𝑊 𝑇 ‖𝐹 = ‖𝐴‖𝐹 and ‖𝑊𝐴‖𝐹 = ‖𝐴‖𝐹 ; and the consequence that Frobe-
nius norm is the square root of the sum of the squares of the singular
values.)

Exercise 6.11 implies that


\[
\inf\{ \|A - B\|_F : \operatorname{rank} B = k \} \leq \left( \sum_{j=k+1}^{r} \sigma_j^2(A) \right)^{1/2}.
\]

(Why?) A harder question: can we do any better? The answer turns out
to be no!
Theorem 6.12 (Eckart-Young-Mirsky Theorem, Frobenius Norm). Sup-
pose 𝐴 is an 𝑚 × 𝑛 matrix with rank 𝐴 = 𝑟. For any 𝑘 = 1, 2, . . . , 𝑟, if
𝐵𝑘 = 𝑌 Σ𝑘 𝑋 𝑇 (where 𝐴 = 𝑌 Σ𝑋 𝑇 is the Singular Value Decomposition of
𝐴), then we have
\[
\inf\{ \|A - B\|_F : \operatorname{rank} B = k \} = \left( \sum_{j=k+1}^{r} \sigma_j^2(A) \right)^{1/2} = \|A - B_k\|_F .
\]

In other words, 𝐵𝑘 is the closest (as measured by the Frobenius norm) rank
𝑘 matrix to 𝐴.

Proof. From Exercise 6.11, we know that


\[
\inf\{ \|A - B\|_F : \operatorname{rank} B = k \} \leq \left( \sum_{j=k+1}^{r} \sigma_j^2(A) \right)^{1/2} = \|A - B_k\|_F .
\]
To finish the proof, we show that ‖𝐴 − 𝐵‖𝐹² ≥ ∑_{𝑗=𝑘+1}^{𝑟} 𝜎𝑗²(𝐴) for any 𝐵 with rank 𝑘.
Let 𝐵 be an arbitrary matrix with rank 𝑘, and let 𝑝 = min{𝑛, 𝑚}. We have
\[
\|A - B\|_F^2 = \sum_{j=1}^{p} \sigma_j^2(A - B).
\]


We again use Weyl’s inequality: 𝜎 𝑖+𝑗−1 (𝐿1 + 𝐿2 ) ≤ 𝜎 𝑖 (𝐿1 ) + 𝜎𝑗 (𝐿2 ) for


any linear operators 𝐿1 , 𝐿2 ∈ ℒ (ℝ𝑛 , ℝ𝑚 ) and for any indices 𝑖 and 𝑗 for
which 𝑖 + 𝑗 − 1 is between 1 and min{𝑛, 𝑚}. Replacing 𝐿1 with 𝐴 − 𝐵 and
𝐿2 with 𝐵, we will have
𝜎 𝑖+𝑗−1 (𝐴) ≤ 𝜎 𝑖 (𝐴 − 𝐵) + 𝜎𝑗 (𝐵)
so long as 1 ≤ 𝑖 + 𝑗 − 1 ≤ 𝑝. Taking 𝑗 = 𝑘 + 1, and noting that rank
𝐵 = 𝑘 implies that 𝜎 𝑘+1 (𝐵) = 0, we have
𝜎 𝑖+𝑘 (𝐴) ≤ 𝜎 𝑖 (𝐴 − 𝐵) + 𝜎 𝑘+1 (𝐵) = 𝜎 𝑖 (𝐴 − 𝐵)
so long as 1 ≤ 𝑖 ≤ 𝑝 and 1 ≤ 𝑖 + 𝑘 ≤ 𝑝. Therefore, we have
\[
\begin{aligned}
\|A - B\|_F^2 = \sum_{i=1}^{p} \sigma_i^2(A - B)
&= \sum_{i=1}^{p-k} \sigma_i^2(A - B) + \sum_{i=p-k+1}^{p} \sigma_i^2(A - B) \\
&\geq \sum_{i=1}^{p-k} \sigma_i^2(A - B)
\geq \sum_{i=1}^{p-k} \sigma_{i+k}^2(A) \\
&= \sigma_{k+1}^2(A) + \sigma_{k+2}^2(A) + \cdots + \sigma_p^2(A)
= \|A - B_k\|_F^2 . \qquad \square
\end{aligned}
\]
Corollary 6.13. Let 𝐴 be an 𝑚 × 𝑛 matrix, and let 𝑝 = min{𝑚, 𝑛}. Let
{𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 } be the orthonormal bases of ℝ𝑛 and
ℝ𝑚 and let 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 be the singular values provided by the Singular Value Decomposition of 𝐴. Suppose that 𝑘 ≤ rank 𝐴, and define 𝐵𝑘 ≔ ∑_{𝑖=1}^{𝑘} 𝜎𝑖(𝐴) 𝑦𝑖 𝑥𝑖𝑇. Then 𝐵𝑘 has rank 𝑘 and
\[
\|A - B_k\|_F = \inf\{ \|A - M\|_F : \operatorname{rank} M = k \}.
\]
Exercise 6.14. Prove the preceding corollary.

As it turns out, 𝐵𝑘 will be the closest rank 𝑘 approximation to 𝐴 in


lots of norms! 𝐵𝑘 will in fact be the closest rank 𝑘 approximation to 𝐴
in any norm that is invariant under orthogonal transformations! That
is, if ‖ ⋅ ‖ is a norm on 𝑚 × 𝑛 matrices such that ‖𝑊𝐴‖ = ‖𝐴‖ and


‖𝐴𝑊‖ = ‖𝐴‖ for any appropriately sized orthogonal matrices 𝑊, then


𝐵𝑘 will be the closest rank 𝑘 matrix to 𝐴 in that norm! The proof of this
surprising generalization is due to Mirsky, see [29]. (Eckart and Young
first considered the problem of approximating a given matrix by one of
a lower rank, and provided a solution in 1936: [10].)
Exercise 6.15. Suppose 𝐴 is a given square invertible matrix. What is
the closest singular matrix to 𝐴? Why?
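
Since the section title promises image compression, here is a brief sketch of how Theorem 6.12 is used in that setting. It is our own illustration, assuming NumPy; any 𝑚 × 𝑛 array of gray-scale intensities plays the role of 𝐴, and a random array stands in for an actual image.

```python
import numpy as np

def compress(image, k):
    """Rank-k approximation B_k = Y Sigma_k X^T of a gray-scale image matrix."""
    Y, sigma, XT = np.linalg.svd(image, full_matrices=False)
    B_k = Y[:, :k] @ np.diag(sigma[:k]) @ XT[:k]
    # Storing Y_k, sigma_k, X_k needs k*(m + n + 1) numbers instead of m*n.
    return B_k, sigma

rng = np.random.default_rng(3)
image = rng.random((64, 48))        # stand-in for an actual gray-scale image
k = 10

B_k, sigma = compress(image, k)
err = np.linalg.norm(image - B_k)   # Frobenius norm by default for matrices
print(np.isclose(err, np.sqrt(np.sum(sigma[k:]**2))))   # True, by Theorem 6.12
```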

6.5. The Orthogonal Procrustes Problem


Suppose we have a collection of 𝑚 points in ℝ𝑛 , representing some con-
figuration of points in ℝ𝑛 , and we want to know how close this “test”
configuration is to a given reference configuration. In many situations,
as long as the distances and angles between the points are the same, two
configurations are regarded as the same. Thus, to determine how close
the test configuration is to the reference configuration, we want to trans-
form the test configuration to be as close as possible to the reference
configuration — making sure to preserve lengths and angles in the test
configuration. Notice that lengths and angles (or at least their cosines)
are determined by the dot product, so we want transformations that pre-
serve dot product. We will confine ourselves to linear transformations.
Theorem 6.16. Suppose 𝑈 ∈ ℒ (ℝ𝑛 , ℝ𝑛 ). The following three conditions
are equivalent:
(1) ‖𝑈𝑥‖ = ‖𝑥‖ for all 𝑥 ∈ ℝ𝑛 (where ‖𝑣‖2 = 𝑣 ⋅ 𝑣 for any 𝑣 ∈ ℝ𝑛 ).
(2) 𝑈𝑥 ⋅ 𝑈𝑦 = 𝑥 ⋅ 𝑦 for all 𝑥, 𝑦 ∈ ℝ𝑛 .
(3) 𝑈 𝑇 𝑈 = 𝐼 (or equivalently 𝑈𝑈 𝑇 = 𝐼), i.e. 𝑈 is an orthogonal
matrix.

Proof. Suppose that (1) is true. Let 𝑥, 𝑦 ∈ ℝ𝑛 be arbitrary. We have


‖𝑈𝑥 − 𝑈𝑦‖2 = ‖𝑈𝑥‖2 − 2𝑈𝑥 ⋅ 𝑈𝑦 + ‖𝑈𝑦‖2
= ‖𝑥‖2 + ‖𝑦‖2 − 2𝑈𝑥 ⋅ 𝑈𝑦,
and in addition
‖𝑥 − 𝑦‖2 = ‖𝑥‖2 − 2𝑥 ⋅ 𝑦 + ‖𝑦‖2 = ‖𝑥‖2 + ‖𝑦‖2 − 2𝑥 ⋅ 𝑦.


Since ‖𝑈𝑥 − 𝑈𝑦‖2 = ‖𝑈(𝑥 − 𝑦)‖2 = ‖𝑥 − 𝑦‖2 by assumption, comparing


the previous two lines we must have 𝑈𝑥 ⋅ 𝑈𝑦 = 𝑥 ⋅ 𝑦. Therefore, we have
(1) ⟹ (2).
Suppose now that (2) is true. Let 𝑥 ∈ ℝ𝑛 be arbitrary. We will show
that for any 𝑦 ∈ ℝ𝑛 , 𝑈 𝑇 𝑈𝑥 ⋅ 𝑦 = 𝑥 ⋅ 𝑦. We have
𝑈 𝑇 𝑈𝑥 ⋅ 𝑦 = 𝑈𝑥 ⋅ 𝑈𝑦 = 𝑥 ⋅ 𝑦.
Since 𝑈 𝑇 𝑈𝑥 ⋅ 𝑦 = 𝑥 ⋅ 𝑦 for any 𝑦 ∈ ℝ𝑛 , we must have 𝑈 𝑇 𝑈𝑥 = 𝑥. Since
𝑥 ∈ ℝ𝑛 is arbitrary, we see that 𝑈 𝑇 𝑈 = 𝐼. Thus, (2) ⟹ (3).
Finally, suppose that (3) is true. We will have
‖𝑈𝑥‖2 = 𝑈𝑥 ⋅ 𝑈𝑥 = 𝑈 𝑇 𝑈𝑥 ⋅ 𝑥 = 𝑥 ⋅ 𝑥 = ‖𝑥‖2 ,
since 𝑈 𝑇 𝑈 = 𝐼. Therefore, (1) ⟹ (2) ⟹ (3) ⟹ (1), and so the
three conditions are equivalent. □
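
A quick numerical sanity check of the equivalence (ours, assuming NumPy): generate an orthogonal matrix from a QR factorization and verify all three conditions on random vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
U = np.linalg.qr(rng.standard_normal((n, n)))[0]   # orthogonal: U^T U = I

x, y = rng.standard_normal(n), rng.standard_normal(n)

print(np.allclose(U.T @ U, np.eye(n)))                        # condition (3)
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))   # condition (1)
print(np.isclose((U @ x) @ (U @ y), x @ y))                   # condition (2)
```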

Suppose now 𝐴 and 𝐵 are two fixed 𝑚 × 𝑛 matrices. We view 𝐴 and


𝐵 as made up of 𝑚 rows, each consisting of the transpose of an element
of ℝ𝑛 :
\[
A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix}
\qquad\text{and}\qquad
B = \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_m^T \end{bmatrix}.
\]
Notice that if 𝑈 is an orthogonal matrix, then Theorem 6.16 tells us that
(as a transformation) 𝑈 will preserve dot products and hence lengths
and angles. Next, notice that

𝑈𝐴𝑇 = 𝑈 [ 𝑎1 𝑎2 ... 𝑎𝑚 ] = [ 𝑈𝑎1 𝑈𝑎2 ... 𝑈𝑎𝑚 ] ,

which means that the columns of 𝑈𝐴𝑇 represent a configuration that has
the same lengths and angles as the original configuration represented by
𝐴𝑇 . Next, for the distance to the reference configuration 𝐵, we calculate
the Frobenius norm squared of 𝐴𝑈 𝑇 − 𝐵, i.e. the sum of the squared
distances between the corresponding rows of 𝐴𝑈 𝑇 and 𝐵. Therefore,
finding the closest configuration to the given reference 𝐵 means solving
the following:
minimize ‖𝐴𝑈 𝑇 − 𝐵‖2𝐹 over all 𝑈 such that 𝑈 𝑇 𝑈 = 𝐼.


Since taking the transpose of an orthogonal matrix again yields an or-


thogonal matrix, replacing 𝑈 𝑇 above with 𝑉, we look at the following
problem: find 𝑉̂ with 𝑉̂𝑇𝑉̂ = 𝐼 such that
\[
(6.7)\qquad \|A \hat{V} - B\|_F^2 = \inf_{V^T V = I} \|A V - B\|_F^2 .
\]

As should be no surprise, we can determine a minimizing 𝑉ˆ in terms of


the Singular Value Decomposition.
Theorem 6.17. Suppose 𝐴 and 𝐵 are arbitrary 𝑚 × 𝑛 matrices. Next,
suppose that 𝐴𝑇 𝐵 = 𝑌 Σ𝑋 𝑇 is the Singular Value Decomposition of 𝐴𝑇 𝐵.
Then 𝑉ˆ = 𝑌 𝑋 𝑇 is a minimizer for (6.7).

Proof. Notice that the Frobenius norm is the norm associated with the
Frobenius inner product ⟨𝐴, 𝐵⟩𝐹 = tr 𝐴𝑇 𝐵. Therefore, we will have
‖𝐴𝑉 − 𝐵‖2𝐹 = ⟨𝐴𝑉 − 𝐵, 𝐴𝑉 − 𝐵⟩𝐹
= ⟨𝐴𝑉, 𝐴𝑉⟩𝐹 − 2 ⟨𝐴𝑉, 𝐵⟩𝐹 + ⟨𝐵, 𝐵⟩𝐹
= tr (𝐴𝑉)𝑇 𝐴𝑉 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= tr 𝑉 𝑇 𝐴𝑇 𝐴𝑉 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= tr 𝐴𝑇 𝐴𝑉𝑉 𝑇 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= tr 𝐴𝑇 𝐴 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= ‖𝐴‖2𝐹 + ‖𝐵‖2𝐹 − 2tr 𝑉 𝑇 𝐴𝑇 𝐵.
Since ‖𝐴‖2𝐹 and ‖𝐵‖2𝐹 are fixed, minimizing ‖𝐴𝑉 − 𝐵‖2𝐹 over all 𝑉 with
𝑉 𝑇 𝑉 = 𝐼 is equivalent to maximizing tr 𝑉 𝑇 𝐴𝑇 𝐵 over all 𝑉 with 𝑉 𝑇 𝑉 = 𝐼.
Notice that since 𝐴 and 𝐵 are 𝑚 × 𝑛 matrices, 𝐴𝑇 𝐵 is an 𝑛 × 𝑛 matrix.
Thus, 𝐴𝑇 𝐵 = 𝑌 Σ𝑋 𝑇 means that the matrices 𝑌 , Σ, and 𝑋 are all 𝑛 × 𝑛
matrices. Moreover, since the columns of 𝑌 and 𝑋 are orthonormal, the
matrices 𝑌 and 𝑋 are orthogonal: 𝑌 𝑇 𝑌 = 𝐼 and 𝑋 𝑇 𝑋 = 𝐼. Therefore,
\[
\begin{aligned}
\sup_{V^T V = I} \operatorname{tr} V^T A^T B
&= \sup_{V^T V = I} \operatorname{tr} V^T Y \Sigma X^T
= \sup_{V^T V = I} \operatorname{tr} \Sigma X^T V^T Y \\
&= \sup_{Z^T Z = I} \operatorname{tr} \Sigma Z
= \sup_{Z^T Z = I} \sum_{i=1}^{n} \sigma_i z_{ii} ,
\end{aligned}
\]


where we replaced 𝑉 with 𝑍 = 𝑋 𝑇 𝑉 𝑇 𝑌 . Now, for any orthogonal ma-


trix 𝑍, each column of 𝑍 is a unit vector, and therefore any entry in a
column of 𝑍 must be between -1 and 1. Thus, −1 ≤ 𝑧𝑖𝑖 ≤ 1, and there-
𝑛 𝑛
fore ∑𝑖=1 𝜎 𝑖 𝑧𝑖𝑖 ≤ ∑𝑖=1 𝜎 𝑖 . Thus,
𝑛 𝑛
sup tr Σ𝑍 = sup ∑ 𝜎 𝑖 𝑧𝑖𝑖 ≤ ∑ 𝜎 𝑖 .
𝑍 𝑇 𝑍=𝐼 𝑍 𝑇 𝑍=𝐼 𝑖=1 𝑖=1
Moreover, we will have equality above when 𝑍 = 𝐼. That is, 𝐼 is a
maximizer for tr Σ𝑍. Therefore, a maximizer for tr 𝑉 𝑇 𝐴𝑇 𝐵 occurs when
𝑋 𝑇 𝑉ˆ𝑇 𝑌 = 𝐼, i.e. when 𝑉ˆ𝑇 = 𝑋𝑌 𝑇 , which is exactly when 𝑉ˆ = 𝑌 𝑋 𝑇 . In
addition, we will have
\[
\sup_{V^T V = I} \operatorname{tr} V^T A^T B = \sum_{i=1}^{n} \sigma_i ,
\]

where the 𝜎𝑖 are the singular values of 𝐴𝑇𝐵. Thus, a minimizer for ‖𝐴𝑉 − 𝐵‖𝐹² subject to 𝑉𝑇𝑉 = 𝐼 is provided by 𝑉̂ = 𝑌𝑋𝑇 (where 𝑌Σ𝑋𝑇 is the Singular Value Decomposition of 𝐴𝑇𝐵), and we have
\[
\inf_{V^T V = I} \|A V - B\|_F^2 = \|A \hat{V} - B\|_F^2 = \|A\|_F^2 + \|B\|_F^2 - 2 \sum_{i=1}^{n} \sigma_i . \qquad \square
\]
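
The solution translates directly into code. The sketch below is ours and assumes NumPy: it forms 𝑉̂ = 𝑌𝑋𝑇 from the SVD of 𝐴𝑇𝐵 and checks, against a sample of random orthogonal matrices, that none of them does better.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 10, 3
A = rng.standard_normal((m, n))
V_true = np.linalg.qr(rng.standard_normal((n, n)))[0]
B = A @ V_true + 0.05 * rng.standard_normal((m, n))   # rotated copy of A plus noise

# Orthogonal Procrustes: V_hat = Y X^T, where A^T B = Y Sigma X^T.
Y, _, XT = np.linalg.svd(A.T @ B)
V_hat = Y @ XT

best = np.linalg.norm(A @ V_hat - B)
random_best = min(
    np.linalg.norm(A @ np.linalg.qr(rng.standard_normal((n, n)))[0] - B)
    for _ in range(1000)
)
print(best <= random_best)     # V_hat is at least as good as every sampled orthogonal V
```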

Note that the Orthogonal Procrustes problem above is slightly dif-


ferent from the problem where we additionally require that 𝑈 preserve
orientation, since preserving orientation requires that det 𝑈 > 0, and 𝑈
orthogonal means that 𝑈 𝑇 𝑈 = 𝐼 and hence the orthogonality of 𝑈 tells
us only that det 𝑈 = ±1. Of course, notice that if det 𝑌 𝑋 𝑇 > 0, then the
minimizer of ‖𝐴𝑉 − 𝐵‖2𝐹 subject to 𝑉 𝑇 𝑉 = 𝐼 and det 𝑉 = 1 is provided
by 𝑌 𝑋 𝑇 . Thus, for the constrained Orthogonal Procrustes problem, the
interesting situation is when det 𝑌 𝑋 𝑇 = −1. For this, the following tech-
nical proposition from [23] (see also [2]) is useful.
Proposition 6.18. Suppose Σ is an 𝑛 × 𝑛 diagonal matrix, with entries
𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑛 ≥ 0. Then
(1) For any orthogonal 𝑊, we have tr Σ𝑊 ≤ tr Σ.
(2) For any orthogonal 𝐵, 𝑊, we have tr 𝐵 𝑇 Σ𝐵𝑊 ≤ tr 𝐵 𝑇 Σ𝐵.


(3) For every orthogonal 𝑊 with det 𝑊 < 0, we have


\[
\operatorname{tr} \Sigma W \leq \left( \sum_{j=1}^{n-1} \sigma_j \right) - \sigma_n .
\]

Part (2) of Proposition 6.18 has an interpretation in terms of similar


matrices.
Definition 6.19. Suppose 𝐴 is an 𝑛 × 𝑛 matrix. We say that 𝐶 is similar
to 𝐴 exactly when there is an invertible matrix 𝑃 such that 𝐶 = 𝑃 −1 𝐴𝑃.

With this definition in mind, (2) says that if 𝐴 is orthogonally similar


to Σ, then tr 𝐴𝑊 ≤ tr 𝐴 for any orthogonal 𝑊.

Proof. For (1), if 𝑊 is an orthogonal matrix, then the columns of 𝑊


form an orthonormal set. In particular, if 𝑤 𝑖 is the 𝑖th column of 𝑊,
then ‖𝑤 𝑖 ‖ = 1, which means that any particular entry of 𝑤 𝑖 is between
−1 and 1. Thus, if 𝑤𝑖 = [𝑤1𝑖 𝑤2𝑖 . . . 𝑤𝑛𝑖]𝑇, we have −1 ≤ 𝑤𝑗𝑗 ≤ 1 and hence
\[
\operatorname{tr} \Sigma W = \sum_{j=1}^{n} \sigma_j w_{jj} \leq \sum_{j=1}^{n} \sigma_j = \operatorname{tr} \Sigma .
\]

For (2), let 𝐵, 𝑊 be arbitrary orthogonal matrices. Notice that the


product of orthogonal matrices is again an orthogonal matrix, and so by
(1), we have
tr 𝐵 𝑇 Σ𝐵𝑊 = tr Σ𝐵𝑊𝐵 𝑇 ≤ tr Σ = tr Σ𝐵𝐵 𝑇 = tr 𝐵 𝑇 Σ𝐵,
where we have made use of Lemma 2.12.
(3) is much more technically involved. Suppose now that 𝑊 is an
orthogonal matrix, and assume det 𝑊 < 0. Since 𝑊 is orthogonal, we
must have det 𝑊 = ±1 and so det 𝑊 < 0 means that det 𝑊 = −1. Thus,
det(𝑊 + 𝐼) = det(𝑊 + 𝑊𝑊 𝑇 )
= det (𝑊(𝐼 + 𝑊 𝑇 ))
= det 𝑊 det(𝐼 + 𝑊 𝑇 )
= − det(𝐼 + 𝑊 𝑇 )𝑇
= − det(𝐼 + 𝑊).


Therefore, 2 det(𝑊 + 𝐼) = 0, and so we must have det(𝑊 + 𝐼) = 0. This


means that -1 is an eigenvalue of 𝑊, and so there must be a unit vector
𝑥 such that 𝑊𝑥 = −𝑥. We then also have 𝑊 𝑇 𝑊𝑥 = −𝑊 𝑇 𝑥, and thus
𝑥 = −𝑊 𝑇 𝑥. In particular, this implies that 𝑥 is in fact an eigenvector for
both 𝑊 and 𝑊 𝑇 , with eigenvalue -1. We can find an orthonormal basis
{𝑏1 , 𝑏2 , . . . , 𝑏𝑛−1 , 𝑥} of ℝ𝑛 . Relabeling 𝑥 as 𝑏𝑛 , and letting 𝐵 be the matrix
whose columns are 𝑏1 , 𝑏2 , . . . , 𝑏𝑛−1 , 𝑏𝑛 , we see 𝐵 will be an orthogonal
matrix. Moreover, we have

𝐵 𝑇 𝑊𝐵 = 𝐵 𝑇 𝑊 [𝑏1 𝑏2 ⋯ 𝑏𝑛−1 𝑏𝑛 ]

= 𝐵 𝑇 [𝑊𝑏1 𝑊𝑏2 ⋯ 𝑊𝑏𝑛−1 𝑊𝑏𝑛 ]

𝑏𝑇1
⎡ ⎤
⎢ 𝑏𝑇2 ⎥
=⎢ ⋮ ⎥ [𝑊𝑏1 𝑊𝑏2 ⋯ 𝑊𝑏𝑛−1 −𝑏𝑛 ]
⎢ 𝑇 ⎥
⎢ 𝑏𝑛−1 ⎥
⎣ 𝑏𝑛𝑇 ⎦
Therefore, the entries of 𝐵 𝑇 𝑊𝐵 are given by the products 𝑏𝑇𝑖 𝑊𝑏𝑗 . We
now claim that the last column of 𝐵 𝑇 𝑊𝐵 is −e𝑛 , and the last row of
𝐵 𝑇 𝑊𝐵 is −e𝑇𝑛 . The last column will have entries 𝑏𝑖𝑇(−𝑏𝑛), which is 0
for 𝑖 ≠ 𝑛 and -1 when 𝑖 = 𝑛. In other words, the last column is −e𝑛 .
Similarly, the last row will have entries
𝑏𝑛𝑇 𝑊𝑏𝑗 = 𝑏𝑛 ⋅ 𝑊𝑏𝑗 = 𝑊 𝑇 𝑏𝑛 ⋅ 𝑏𝑗 = −𝑏𝑛 ⋅ 𝑏𝑗 ,
which again is either 0 (if 𝑗 ≠ 𝑛) or −1 (if 𝑗 = 𝑛). Thus, the last row is −e𝑛𝑇. This means that we can write 𝐵𝑇𝑊𝐵 in block form as
\[
(6.8)\qquad B^T W B = \begin{bmatrix} B_0 & 0_{\mathbb{R}^{n-1}} \\ 0_{\mathbb{R}^{n-1}}^T & -1 \end{bmatrix},
\]
where 𝐵0 is an (𝑛 − 1) × (𝑛 − 1) matrix. Moreover, since 𝐵 and 𝑊 are orthogonal matrices, so too is 𝐵𝑇𝑊𝐵, which implies that the columns of 𝐵𝑇𝑊𝐵 form an orthonormal basis of ℝ𝑛. Since the first 𝑛 − 1 entries in the last row of 𝐵𝑇𝑊𝐵 are all zeros, the columns of 𝐵0 are orthonormal


in ℝ𝑛−1 , which means that 𝐵0 is an (𝑛 − 1) × (𝑛 − 1) orthogonal matrix.


Similarly, we can write the product 𝐵𝑇 Σ𝐵 in block form:
\[
(6.9)\qquad B^T \Sigma B = \begin{bmatrix} A_0 & a \\ c^T & \gamma \end{bmatrix}.
\]
Here, 𝐴0 is an (𝑛 − 1) × (𝑛 − 1) matrix, 𝑎, 𝑐 ∈ ℝ𝑛−1 and 𝛾 ∈ ℝ. Consider
now the matrix
\[
U := \begin{bmatrix} B_0 & 0_{\mathbb{R}^{n-1}} \\ 0_{\mathbb{R}^{n-1}}^T & 1 \end{bmatrix}.
\]
Note that 𝑈 is an orthogonal matrix. Using block multiplication, we
have
\[
(6.10)\qquad B^T \Sigma B\, U = \begin{bmatrix} A_0 & a \\ c^T & \gamma \end{bmatrix}
\begin{bmatrix} B_0 & 0_{\mathbb{R}^{n-1}} \\ 0_{\mathbb{R}^{n-1}}^T & 1 \end{bmatrix}
= \begin{bmatrix} A_0 B_0 & a \\ c^T B_0 & \gamma \end{bmatrix}.
\]
Now, since 𝑈 is an orthogonal matrix, part (2) of this proposition tells us that
tr 𝐵 𝑇 Σ𝐵𝑈 ≤ tr 𝐵 𝑇 Σ𝐵.
From (6.9) and (6.10) we then have
tr 𝐴0 𝐵0 + 𝛾 = tr 𝐵 𝑇 Σ𝐵𝑈 ≤ tr 𝐵 𝑇 Σ𝐵 = tr 𝐴0 + 𝛾,
and so we must have
(6.11) tr 𝐴0 𝐵0 ≤ tr 𝐴0 .
Next, using (6.8) and (6.9), we have
\[
\begin{aligned}
\operatorname{tr} \Sigma W &= \operatorname{tr} \Sigma B B^T W
= \operatorname{tr} \Sigma B B^T W B B^T
= \operatorname{tr} B^T \Sigma B\, B^T W B \\
&= \operatorname{tr} \left( \begin{bmatrix} A_0 & a \\ c^T & \gamma \end{bmatrix}
\begin{bmatrix} B_0 & 0_{\mathbb{R}^{n-1}} \\ 0_{\mathbb{R}^{n-1}}^T & -1 \end{bmatrix} \right)
= \operatorname{tr} \begin{bmatrix} A_0 B_0 & -a \\ c^T B_0 & -\gamma \end{bmatrix}
= \operatorname{tr} A_0 B_0 - \gamma .
\end{aligned}
\]


Combining with (6.11) yields


(6.12) tr Σ𝑊 = tr 𝐴0 𝐵0 − 𝛾 ≤ tr 𝐴0 − 𝛾.
Next, we look at the entries that make up 𝐴0 . Looking at the left side of
(6.9), we have

\[
B^T \Sigma B = B^T \Sigma \begin{bmatrix} b_1 & b_2 & \cdots & b_{n-1} & b_n \end{bmatrix}
= B^T \begin{bmatrix} \Sigma b_1 & \Sigma b_2 & \cdots & \Sigma b_{n-1} & \Sigma b_n \end{bmatrix}
= \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_{n-1}^T \\ b_n^T \end{bmatrix}
\begin{bmatrix} \Sigma b_1 & \Sigma b_2 & \cdots & \Sigma b_{n-1} & \Sigma b_n \end{bmatrix}.
\]
Thus, the 𝑖𝑗th entry of 𝐵 𝑇 Σ𝐵 is given by 𝑏𝑇𝑖 Σ𝑏𝑗 , which is the dot product
of 𝑏𝑖 and Σ𝑏𝑗 . Therefore, (6.9) tells us 𝛾 = 𝑏𝑛𝑇 Σ𝑏𝑛 and
\[
\operatorname{tr} A_0 = \sum_{\ell=1}^{n-1} b_\ell^T \Sigma b_\ell = \sum_{\ell=1}^{n} b_\ell^T \Sigma b_\ell - b_n^T \Sigma b_n .
\]
Now, if 𝑏ℓ = [𝑏1ℓ 𝑏2ℓ ⋯ 𝑏𝑛−1,ℓ 𝑏𝑛ℓ]𝑇, then since Σ is a diagonal matrix with diagonal entries 𝜎1, 𝜎2, . . . , 𝜎𝑛, we will have
\[
\Sigma b_\ell = \begin{bmatrix} \sigma_1 b_{1\ell} \\ \sigma_2 b_{2\ell} \\ \vdots \\ \sigma_{n-1} b_{n-1,\ell} \\ \sigma_n b_{n\ell} \end{bmatrix}.
\]

In particular, we will have 𝑏ℓ𝑇 Σ𝑏ℓ = ∑_{𝑗=1}^{𝑛} 𝜎𝑗 𝑏𝑗ℓ², and so
\[
\operatorname{tr} A_0 = \sum_{\ell=1}^{n} \left( \sum_{j=1}^{n} \sigma_j b_{j\ell}^2 \right) - b_n^T \Sigma b_n
= \sum_{j=1}^{n} \sigma_j \left( \sum_{\ell=1}^{n} b_{j\ell}^2 \right) - b_n^T \Sigma b_n .
\]
Now, the sum ∑_{ℓ=1}^{𝑛} 𝑏𝑗ℓ² is the sum of the squares of the entries in row 𝑗 of the matrix 𝐵. Since 𝐵 is orthogonal, so too is 𝐵𝑇. In particular, that means that the (transposes of the) rows of 𝐵 form an orthonormal basis of ℝ𝑛. Thus each row of 𝐵 must have Euclidean norm equal to 1, i.e. ∑_{ℓ=1}^{𝑛} 𝑏𝑗ℓ² = 1 for each 𝑗. Therefore, we have
\[
(6.13)\qquad \operatorname{tr} A_0 = \sum_{j=1}^{n} \sigma_j - b_n^T \Sigma b_n = \operatorname{tr} \Sigma - b_n^T \Sigma b_n = \operatorname{tr} \Sigma - \gamma .
\]

Next, we estimate 𝛾 = 𝑏𝑛𝑇 Σ𝑏𝑛 . Because 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑛 ≥ 0, we have


\[
\begin{aligned}
(6.14)\qquad \gamma = b_n^T \Sigma b_n
&= \begin{bmatrix} b_{1n} & b_{2n} & \cdots & b_{n-1,n} & b_{nn} \end{bmatrix}
\begin{bmatrix} \sigma_1 b_{1n} \\ \sigma_2 b_{2n} \\ \vdots \\ \sigma_{n-1} b_{n-1,n} \\ \sigma_n b_{nn} \end{bmatrix} \\
&= \sum_{\ell=1}^{n} \sigma_\ell b_{\ell n}^2
\geq \sum_{\ell=1}^{n} \sigma_n b_{\ell n}^2 = \sigma_n \|b_n\|^2 = \sigma_n ,
\end{aligned}
\]

since the (Euclidean) norm of 𝑏𝑛 is 1. Thus, combining (6.12), (6.13),


and (6.14) we finally have
\[
\operatorname{tr} \Sigma W \leq \operatorname{tr} A_0 - \gamma = \operatorname{tr} \Sigma - \gamma - \gamma
\leq \operatorname{tr} \Sigma - 2\sigma_n
= \left( \sum_{j=1}^{n-1} \sigma_j \right) - \sigma_n . \qquad \square
\]


We can now solve the problem of finding the orientation preserving


orthogonal matrix that minimizes ‖𝐴𝑉 − 𝐵‖2𝐹 . This problem is relevant
in computational chemistry. There, 𝐵 may represent a configuration of
the atoms in a molecule in its lowest energy configuration and 𝐴 rep-
resents another configuration of the molecule. In this situation, it is
important to require that 𝑈 preserve orientation, since not all chemical
properties are preserved by non-orientation preserving transformations.
The original references are [18], [19], and [41].
Theorem 6.20 (Kabsch-Umeyama Algorithm). Let 𝐴, 𝐵 ∈ ℝ𝑚×𝑛 be given.
Suppose 𝐴𝑇 𝐵 = 𝑌 Σ𝑋 𝑇 is the Singular Value Decomposition of 𝐴𝑇 𝐵. Let
𝐼 ̂ be the 𝑛 × 𝑛 diagonal matrix whose 𝑖𝑖th entries are 1 for 1 ≤ 𝑖 < 𝑛 and
whose 𝑛𝑛th entry is det 𝑌𝑋𝑇. Then 𝑉̂ ≔ 𝑌𝐼̂𝑋𝑇 minimizes ‖𝐴𝑉 − 𝐵‖𝐹² over
all orthogonal 𝑉 with det 𝑉 = 1. (Notice that if det 𝑌 𝑋 𝑇 = 1, then 𝐼 ̂ is
simply the 𝑛 × 𝑛 identity matrix, while if det 𝑌 𝑋 𝑇 = −1, then 𝐼 ̂ is the 𝑛 × 𝑛
identity matrix whose lower right entry has been replaced with -1.)

Proof. As in the proof for the solution of the Orthogonal Procrustes


Problem, we have
‖𝐴𝑉 − 𝐵‖2𝐹 = ‖𝐴‖2𝐹 + ‖𝐵‖2𝐹 − 2tr 𝑉 𝑇 𝐴𝑇 𝐵,
and so to minimize ‖𝐴𝑉 − 𝐵‖2𝐹 over all orthogonal matrices 𝑉 with
det 𝑉 = 1, it suffices to maximize tr 𝑉 𝑇 𝐴𝑇 𝐵 over all orthogonal matrices
𝑉 with det 𝑉 = 1. Suppose 𝐴𝑇 𝐵 = 𝑌 Σ𝑋 𝑇 is the Singular Value Decom-
position of 𝐴𝑇 𝐵. In particular, both 𝑌 and 𝑋 are orthogonal matrices.
Thus, we have
\[
\sup_{\substack{V^T V = I \\ \det V = 1}} \operatorname{tr} V^T A^T B
= \sup_{\substack{V^T V = I \\ \det V = 1}} \operatorname{tr} V^T Y \Sigma X^T
= \sup_{\substack{V^T V = I \\ \det V = 1}} \operatorname{tr} \Sigma X^T V^T Y .
\]

Notice that 𝑋 𝑇 𝑉 𝑇 𝑌 is an orthogonal matrix, and so we would like to


replace 𝑋 𝑇 𝑉 𝑇 𝑌 with 𝑍, and maximize tr Σ𝑍 over all orthogonal 𝑍, just
as we did in the general Procrustes Problem. However, in this situation,
we require that det 𝑉 = 1. Notice that if 𝑍 = 𝑋 𝑇 𝑉 𝑇 𝑌 , then we have
det 𝑍 = det 𝑋 𝑇 det 𝑉 𝑇 det 𝑌 = det 𝑋 𝑇 det 𝑌 = det 𝑌 𝑋 𝑇 .


Therefore, we can look at maximizing tr Σ𝑍 over all orthogonal 𝑍 with


det 𝑍 = det 𝑌 𝑋 𝑇 . That is:
\[
(6.15)\qquad \sup_{\substack{V^T V = I \\ \det V = 1}} \operatorname{tr} V^T A^T B
= \sup_{\substack{Z^T Z = I \\ \det Z = \det Y X^T}} \operatorname{tr} \Sigma Z .
\]

We consider now two cases: (i) det 𝑌 𝑋 𝑇 = 1, and (ii) det 𝑌 𝑋 𝑇 = −1.
Case (i): det 𝑌𝑋𝑇 = 1. We claim that in this case 𝑍̂ = 𝐼̂ maximizes tr Σ𝑍 over all orthogonal 𝑍 with det 𝑍 = 1. By (1) of Proposition 6.18, we know that tr Σ𝑍 ≤ ∑_{𝑖=1}^{𝑛} 𝜎𝑖 for any orthogonal 𝑍. Moreover, equality is attained for 𝑍̂ ≔ 𝐼. By the definition of 𝐼̂ in the statement of the theorem, 𝐼̂ = 𝐼 in this case. Thus, 𝑍̂ = 𝐼̂ is the maximizer. Since we replaced 𝑋𝑇𝑉𝑇𝑌 with 𝑍, a maximizer of tr Σ𝑋𝑇𝑉𝑇𝑌 subject to the constraints 𝑉𝑇𝑉 = 𝐼, det 𝑉 = 1 is provided by 𝑉̂ such that 𝑋𝑇𝑉̂𝑇𝑌 = 𝐼̂. Solving for 𝑉̂, we see 𝑉̂ = 𝑌𝐼̂𝑋𝑇. Thus, the theorem is true in this case.
Case (ii): det 𝑌𝑋𝑇 = −1. In this case, we want to find an orthogonal 𝑍̂ that maximizes tr Σ𝑍 over all orthogonal 𝑍 with det 𝑍 = −1. By (3) of Proposition 6.18, we know that tr Σ𝑍 ≤ (∑_{𝑖=1}^{𝑛−1} 𝜎𝑖) − 𝜎𝑛 for any orthogonal 𝑍 with det 𝑍 = −1, and equality will occur if we take 𝑍̂ to be the diagonal matrix whose entries are all 1, except the lower-right-most entry, which will be −1 = det 𝑌𝑋𝑇. By the definition of 𝐼̂ in the statement of the theorem, this means 𝑍̂ = 𝐼̂. As in the previous case, a maximizer of tr Σ𝑋𝑇𝑉𝑇𝑌 subject to 𝑉𝑇𝑉 = 𝐼, det 𝑉 = 1 is provided by 𝑉̂ such that 𝑋𝑇𝑉̂𝑇𝑌 = 𝐼̂, which means 𝑉̂ = 𝑌𝐼̂𝑋𝑇. □
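
A direct implementation of the Kabsch-Umeyama recipe is short. The sketch below is ours and assumes NumPy (the function name kabsch_umeyama is hypothetical); it computes 𝑉̂ = 𝑌𝐼̂𝑋𝑇 and checks that the result is orthogonal with determinant 1.

```python
import numpy as np

def kabsch_umeyama(A, B):
    """Rotation V_hat = Y I_hat X^T minimizing ||A V - B||_F over det(V) = 1."""
    Y, _, XT = np.linalg.svd(A.T @ B)          # A^T B = Y Sigma X^T
    d = np.sign(np.linalg.det(Y @ XT))         # det(Y X^T) = +1 or -1
    I_hat = np.diag(np.r_[np.ones(A.shape[1] - 1), d])
    return Y @ I_hat @ XT                      # V_hat = Y I_hat X^T

rng = np.random.default_rng(6)
A = rng.standard_normal((20, 3))
B = rng.standard_normal((20, 3))

V_hat = kabsch_umeyama(A, B)
print(np.allclose(V_hat.T @ V_hat, np.eye(3)), np.isclose(np.linalg.det(V_hat), 1.0))
```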

6.6. Summary
We hope that by this stage, the reader has been convinced of the util-
ity of analytic ideas in linear algebra, as well as the importance of the
Singular Value Decomposition. There are many different directions that
an interested reader can go from here. As we saw in this chapter, in-
teresting applications are often optimization problems, where we seek
a particular type of matrix to make some quantity as small as possible.
Thus, one direction is the book [1] that investigates general matrix op-
timization. (The reader should be forewarned: there is a jump from the
level here to the level in that text.) Another direction (related to the
“best” 𝑘-dimensional subspace problem) is: given a collection of points


in some ℝ𝑚 , what is the minimal volume ellipsoid that contains these


points? The book [39] investigates this problem and provides algorithms
for solving it. We mention again [26], which has a wealth of examples
of applications. Another direction is to look at the infinite-dimensional
setting, and the following chapter gives a glimpse in that direction.
