
Chapter 6

Applications Revisited

We now revisit the big applications mentioned in the introduction and Chapters 1 and 3, and prove the statements made there about the solutions of those problems.

6.1. The “Best” Subspace for Given Data


Consider the following situation: we have a large number of data points,
each with a large number of individual entries (variables). In principle,
we may suspect that this data arises from a process that is driven by only
a small number of key quantities. That is, we suspect that the data may in
fact be “low-dimensional.” How can we test this hypothesis? As stated,
this is very general. We want to look at the situation where the data
arises from a process that is linear, and so the data should be close to a
low-dimensional subspace. Notice that because of error in measurement
and/or noise, the data is very unlikely to perfectly line up with a low-
dimensional subspace, and may in fact span a very high-dimensional
space. Mathematically, we can phrase our problem as follows: suppose
we have 𝑚 points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 in ℝ𝑛 . Given 𝑘 ≤ min{𝑚, 𝑛}, what is the
“closest” 𝑘-dimensional subspace to these 𝑚 points? As will hopefully
be no great surprise at this stage, the Singular Value Decomposition can
provide an answer! Before we show how to use the Singular Value De-
composition to solve this problem, the following exercises should ALL
be completed.


[Figure 6.1. 𝒱 = ℝ³, 𝒰 is a two-dimensional subspace, and 𝑑(𝑥, 𝒰) = ‖𝑥 − 𝑃𝒰 𝑥‖.]

Exercise 6.1. (1) Suppose 𝒰 is a finite-dimensional subspace of a vector space 𝒱 that has inner product ⟨⋅, ⋅⟩𝒱, and let 𝑥 ∈ 𝒱. Let 𝑃𝒰 ∶ 𝒱 → 𝒱 be the orthogonal projection onto 𝒰, and show that the distance from 𝑥 to 𝒰, 𝑑(𝑥, 𝒰), is ‖𝑥 − 𝑃𝒰 𝑥‖𝒱. (Here, the norm is that induced by the inner product.) In terms of the notation from Chapter 3, this means that we will use ‖𝑎𝑗 − 𝑃𝒰 𝑎𝑗‖ in place of 𝑑(𝑎𝑗, 𝒰). (See also Figure 6.1.)
(2) Suppose we have 𝑚 points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 in ℝ𝑛 . Let 𝐴 be the ma-
trix whose rows are given by 𝑎𝑇1 , 𝑎𝑇2 , . . . , 𝑎𝑇𝑚 (remember: we consider el-
ements of ℝ𝑛 as column vectors, and so to get row vectors, we need the
transpose). Note that 𝐴 will be an 𝑚 × 𝑛 matrix. Let 𝑢 ∈ ℝ𝑛 be a unit vector. Explain why 𝐴𝑢 will be the (column) vector whose 𝑗th entry is the scalar component 𝑎𝑗 ⋅ 𝑢 of the projection of 𝑎𝑗 onto span{𝑢}.


(3) Suppose 𝒰 is a subspace of ℝ𝑛 and {𝑣 1 , 𝑣 2 , . . . , 𝑣 𝑘 } is an orthonormal


basis for 𝒰. Using the same notation as in the previous problem, if 𝑃𝒰 is
the orthogonal projection onto 𝒰, explain why
\[
\sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|_2^2 = \sum_{i=1}^{k} \|A v_i\|_2^2 .
\]

In other words, the sum of the squared magnitudes of the projections of the 𝑎𝑗 onto 𝒰 is given by the sum of the squared magnitudes of the 𝐴𝑣𝑖. (Note also that for an element 𝑥 ∈ ℝ𝑛, ‖𝑥‖₂² = 𝑥 ⋅ 𝑥, where ⋅ represents the dot product: 𝑥 ⋅ 𝑦 = 𝑥𝑇𝑦 when 𝑥, 𝑦 ∈ ℝ𝑛.)
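
A quick numerical check of the identity in part (3) can be reassuring. The following is a minimal sketch (ours, assuming NumPy is available; the variable names are not from the text): it builds a random data matrix 𝐴, an orthonormal basis of a random 𝑘-dimensional subspace via a QR factorization, and compares the two sums.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 8, 5, 2

A = rng.standard_normal((m, n))                     # rows are the points a_1, ..., a_m
V = np.linalg.qr(rng.standard_normal((n, k)))[0]    # orthonormal basis v_1, ..., v_k of U

# Left-hand side: sum_j ||P_U a_j||^2, using P_U a = V V^T a.
proj = A @ V @ V.T                                  # j-th row is (P_U a_j)^T
lhs = np.sum(proj**2)

# Right-hand side: sum_i ||A v_i||^2.
rhs = np.sum((A @ V)**2)

print(lhs, rhs)                                     # the two numbers agree up to rounding
```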

Suppose now that we have 𝑚 points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 in ℝ𝑛 and an in-


teger 1 ≤ 𝑘 ≤ min{𝑚, 𝑛}. For a 𝑘-dimensional subspace 𝒰 of ℝ𝑛 , notice
that the distance from 𝑎𝑗 to 𝒰 is ‖𝑎𝑗 − 𝑃𝒰 𝑎𝑗 ‖, where 𝑃𝒰 is the orthogonal
projection onto 𝒰. For measuring how close the points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 are
to 𝒰, we use the sum of the squares of their individual distances. That
is, how close the points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 are to 𝒰 is given by
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 .
\]

Thus, finding the “best” approximating 𝑘-dimensional subspace to the


points 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 is equivalent to the following minimization prob-
lem: find a subspace 𝒰̂ with dim 𝒰̂ = 𝑘 such that
\[
(6.1)\qquad \sum_{j=1}^{m} \|a_j - P_{\hat{\mathcal{U}}} a_j\|^2 = \inf_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 .
\]

Notice that just as for the Pythagorean Theorem, we have


‖𝑎𝑗 − 𝑃𝒰 𝑎𝑗 ‖2 = ⟨𝑎𝑗 − 𝑃𝒰 𝑎𝑗 , 𝑎𝑗 − 𝑃𝒰 𝑎𝑗 ⟩
= ‖𝑎𝑗 ‖2 − 2 ⟨𝑎𝑗 , 𝑃𝒰 𝑎𝑗 ⟩ + ‖𝑃𝒰 𝑎𝑗 ‖2
= ‖𝑎𝑗 ‖2 − 2 ⟨𝑃𝒰 𝑎𝑗 , 𝑃𝒰 𝑎𝑗 ⟩ + ‖𝑃𝒰 𝑎𝑗 ‖2
= ‖𝑎𝑗 ‖2 − ‖𝑃𝒰 𝑎𝑗 ‖2 ,


since ⟨𝑎𝑗 , 𝑢⟩ = ⟨𝑃𝒰 𝑎𝑗 , 𝑢⟩ for any 𝑢 ∈ 𝒰. (In fact, that is a defining char-
acteristic of the projection onto 𝒰 - see Proposition 3.23.) Thus,
\[
\begin{aligned}
\sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 &= \sum_{j=1}^{m} \left( \|a_j\|^2 - \|P_{\mathcal{U}} a_j\|^2 \right) \\
&= \left( \sum_{j=1}^{m} \|a_j\|^2 \right) - \left( \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \right) \\
&= \|A\|_F^2 - \left( \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \right),
\end{aligned}
\]

where 𝐴 is the matrix whose rows are 𝑎𝑇1 , 𝑎𝑇2 , . . . , 𝑎𝑇𝑚 . Therefore, to min-
imize
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 = \|A\|_F^2 - \left( \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \right),
\]
we need to make the second term as large as possible! (That is, we want
to subtract as much as we possibly can.) Thus, it turns out that (6.1) is
equivalent to finding a subspace 𝒰̂ with dim 𝒰̂ = 𝑘 such that
\[
(6.2)\qquad \sum_{j=1}^{m} \|P_{\hat{\mathcal{U}}} a_j\|^2 = \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 .
\]

Moreover, if 𝒰̂ is a subspace of dimension 𝑘 that maximizes (6.2), then


the corresponding minimum in (6.1) is provided by
\[
\|A\|_F^2 - \sum_{j=1}^{m} \|P_{\hat{\mathcal{U}}} a_j\|^2 .
\]

Notice that to specify a subspace 𝒰, we need only specify a basis of 𝒰.


Suppose now that 𝐴 is the 𝑚 × 𝑛 matrix whose rows are given by
𝑎𝑇1 , 𝑎𝑇2 , . . . , 𝑎𝑇𝑚 , and suppose that 𝐴 = 𝑌 Σ𝑋 𝑇 is the Singular Value De-
composition of 𝐴. (Thus, the 𝑚 columns of 𝑌 form an orthonormal basis
of ℝ𝑚 , and the 𝑛 columns of 𝑋 form an orthonormal basis of ℝ𝑛 .) For a
fixed 𝑘 ∈ {1, 2, . . . , min{𝑚, 𝑛}}, we claim that the best 𝑘-dimensional sub-
space of ℝ𝑛 is given by 𝒲 𝑘 = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑘 }, where 𝑥1 , 𝑥2 , . . . , 𝑥𝑛
are the columns of 𝑋 (or equivalently their transposes are the rows of
𝑋 𝑇 ). Thus, if the singular triples of 𝐴 are (𝜎 𝑖 , 𝑥𝑖 , 𝑦 𝑖 ), then the closest
𝑘-dimensional subspace to 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 is span{𝑥1 , 𝑥2 , . . . , 𝑥𝑘 }.


Theorem 6.2. Suppose 𝑎1 , 𝑎2 , . . . , 𝑎𝑚 are points in ℝ𝑛 , let 𝐴 be the 𝑚 × 𝑛


matrix whose rows are given by 𝑎𝑇1 , 𝑎𝑇2 , . . . , 𝑎𝑇𝑚 , and let 𝑝 = min{𝑚, 𝑛}.
Suppose the singular triples of 𝐴 are (𝜎 𝑖 , 𝑥𝑖 , 𝑦 𝑖 ). For any 𝑘 ∈ {1, 2, . . . , 𝑝},
if 𝒲𝑘 = span{𝑥1, 𝑥2, . . . , 𝑥𝑘}, we will have
\[
\sum_{j=1}^{m} \|P_{\mathcal{W}_k} a_j\|^2 = \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2,
\]
or equivalently
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_k} a_j\|^2 = \inf_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 .
\]

Moreover, we will have


\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_k} a_j\|^2 = \sigma_{k+1}^2(A) + \cdots + \sigma_p^2(A).
\]

Proof. We use induction on 𝑘. When 𝑘 = 1, specifying a one-dimensional subspace of ℝ𝑛 means specifying a non-zero 𝑢. Moreover, for any 𝑢 ≠ 0ℝ𝑛, we have
\[
P_{\operatorname{span}\{u\}} x = \frac{\langle x, u\rangle}{\langle u, u\rangle}\, u = \frac{x \cdot u}{u \cdot u}\, u,
\qquad\text{and thus}\qquad
\|P_{\operatorname{span}\{u\}} x\| = \frac{|x \cdot u|}{\|u\|}.
\]
Therefore,
\[
\begin{aligned}
\sup_{\dim \mathcal{U} = 1} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2
&= \sup_{u \neq 0_{\mathbb{R}^n}} \sum_{j=1}^{m} \frac{(a_j \cdot u)^2}{\|u\|^2} \\
&= \sup_{u \neq 0_{\mathbb{R}^n}} \frac{\|A u\|^2}{\|u\|^2}
= \left( \sup_{u \neq 0_{\mathbb{R}^n}} \frac{\|A u\|}{\|u\|} \right)^2
= \sigma_1(A)^2 .
\end{aligned}
\]

(To go from the first line to the second line, we have used the fact that the
𝑗th entry in 𝐴𝑢 is the dot product of 𝑎𝑗 and 𝑢.) Moreover, a maximizer
above is provided by 𝑥1 , as is shown by Theorem 5.19. Therefore, we
see that a maximizing subspace is given by 𝒲1 = span{𝑥1 }. Next, as we


showed in the discussion between (6.1) and (6.2),


\[
\begin{aligned}
\inf_{\dim \mathcal{U} = 1} \sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2
&= \|A\|_F^2 - \sup_{\dim \mathcal{U} = 1} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \\
&= \left( \sum_{i=1}^{p} \sigma_i^2(A) \right) - \sigma_1^2(A) \\
&= \sigma_2^2(A) + \sigma_3^2(A) + \cdots + \sigma_p^2(A).
\end{aligned}
\]

This finishes the proof of the base case (𝑘 = 1).


Suppose now that we know that 𝒲 𝑘 = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑘 } is a max-
imizing 𝑘-dimensional subspace: i.e.
\[
\sum_{j=1}^{m} \|P_{\mathcal{W}_k} a_j\|^2 = \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2,
\]
and that
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_k} a_j\|^2 = \sigma_{k+1}^2(A) + \cdots + \sigma_p^2(A).
\]

We now show that 𝒲𝑘+1 will maximize ∑𝑗 ‖𝑃𝒰 𝑎𝑗‖² over all possible (𝑘 + 1)-dimensional subspaces 𝒰 of ℝ𝑛, and that
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_{k+1}} a_j\|^2 = \sigma_{k+2}^2(A) + \cdots + \sigma_p^2(A).
\]

Let 𝒰 be an arbitrary (𝑘 + 1)-dimensional subspace. Notice that 𝒲𝑘⊥ has dimension 𝑛 − 𝑘, and therefore we have
\[
n \geq \dim(\mathcal{U} + \mathcal{W}_k^{\perp}) = \dim \mathcal{U} + \dim \mathcal{W}_k^{\perp} - \dim(\mathcal{U} \cap \mathcal{W}_k^{\perp})
= k + 1 + n - k - \dim(\mathcal{U} \cap \mathcal{W}_k^{\perp}).
\]

Therefore, a little rearrangement shows that dim(𝒰 ∩ 𝒲𝑘⊥) ≥ 1. Thus, there is an orthonormal basis 𝑦1, 𝑦2, . . . , 𝑦𝑘, 𝑦𝑘+1 of 𝒰 such that 𝑦𝑘+1 ∈ 𝒰 ∩ 𝒲𝑘⊥. From Theorem 5.19, we then know that
\[
(6.3)\qquad \|A y_{k+1}\| \leq \sup_{u \in \mathcal{W}_k^{\perp} \setminus \{0_{\mathbb{R}^n}\}} \frac{\|A u\|}{\|u\|} = \|A x_{k+1}\| = \sigma_{k+1}(A).
\]


Moreover, since 𝑈̃ = span{𝑦1, 𝑦2, . . . , 𝑦𝑘} is a 𝑘-dimensional subspace, we will have (using the result of Exercise 6.1(3))
\[
(6.4)\qquad \sum_{i=1}^{k} \|A y_i\|^2 = \sum_{j=1}^{m} \|P_{\tilde{U}} a_j\|^2 \leq \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 = \sum_{i=1}^{k} \|A x_i\|^2 .
\]

Combining (6.3) and (6.4) and using the result of Exercise 6.1(3) above, we see
\[
\sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 = \sum_{i=1}^{k+1} \|A y_i\|^2 \leq \sum_{i=1}^{k+1} \|A x_i\|^2 = \sum_{j=1}^{m} \|P_{\mathcal{W}_{k+1}} a_j\|^2 .
\]

Since 𝒰 is an arbitrary subspace of dimension 𝑘 + 1,


𝒲 𝑘+1 = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑘+1 }
is a maximizing subspace of dimension 𝑘 + 1. Thus, ‖𝐴𝑥𝑘+1 ‖ = 𝜎 𝑘+1 (𝐴)
implies
\[
\sup_{\dim \mathcal{U} = k+1} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 = \sum_{i=1}^{k+1} \|A x_i\|^2 = \sum_{i=1}^{k+1} \sigma_i^2(A),
\]
and so we have
\[
\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_{k+1}} a_j\|^2 = \|A\|_F^2 - \sum_{i=1}^{k+1} \sigma_i^2(A) = \sigma_{k+2}^2(A) + \cdots + \sigma_p^2(A). \qquad \square
\]

It’s important to make sure that the points are normalized so that
their “center of mass” is at the origin.
Example 6.3. Consider the points [−1 1]𝑇, [0 1]𝑇, and [1 1]𝑇. These are clearly all on the line 𝑦 = 1 in the 𝑥𝑦 plane. Let
\[
A = \begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}.
\]
We have
\[
A^T A = \begin{bmatrix} -1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}.
\]


Therefore, the eigenvalues of 𝐴𝑇𝐴 are 𝜆↓1 = 3 and 𝜆↓2 = 2, with eigenvectors [0 1]𝑇 and [1 0]𝑇. Consequently, the singular values of 𝐴 are √3 and √2. Moreover, the first singular triple is
\[
\left( \sqrt{3},\ \begin{bmatrix} 0 \\ 1 \end{bmatrix},\ \frac{1}{\sqrt{3}} \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \right),
\]
and so by Theorem 6.2, the closest one-dimensional subspace to these
points is span{[0 1]𝑇 }, which is the 𝑦-axis in the 𝑥𝑦 plane . . . which is not
really a good approximation of the line 𝑦 = 1. What’s going on here?
The issue is that the best subspace must contain the origin, and the
line that the given points lie on does not contain the origin. Therefore,
we “normalize” the points by computing their center of mass and sub-
tracting it from each point. What is left over is then centered at the ori-
gin. In this example, the center of mass is the point 𝑥̄ = [0 1]𝑇 . When
we subtract this point from each point in the given collection, we have
the new collection [−1 0]𝑇 , [0 0]𝑇 , and [1 0]𝑇 . We now consider
\[
\tilde{A} = \begin{bmatrix} -1 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix}.
\]
We have
\[
\tilde{A}^T \tilde{A} = \begin{bmatrix} -1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} -1 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix},
\]
and hence the first singular triple of 𝐴̃ is
\[
\left( \sqrt{2},\ \begin{bmatrix} 1 \\ 0 \end{bmatrix},\ \frac{1}{\sqrt{2}} \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} \right),
\]
and Theorem 6.2 tells us that the closest subspace to the translated collection [−1 0]𝑇, [0 0]𝑇, and [1 0]𝑇 is span{[1 0]𝑇}, i.e. the 𝑥-axis, which makes sense since the translated collection lies on the 𝑥-axis. When we undo the translation, we get a horizontal line through the point 𝑥̄.

In the general situation, the idea is to translate the given collection


of points to a new collection whose center of mass is the origin. If the
original collection has the points {𝑎1 , 𝑎2 , . . . , 𝑎𝑛 }, the center of mass is


𝑎̄ ≔ (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑎𝑖. (In our example, we have 𝑎1 = [−1 1]𝑇, 𝑎2 = [0 1]𝑇, and 𝑎3 = [1 1]𝑇. The center of mass is then 𝑎̄ = (1/3)[0 3]𝑇 = [0 1]𝑇.) We then consider the translated collection {𝑎1 − 𝑎̄, 𝑎2 − 𝑎̄, . . . , 𝑎𝑛 − 𝑎̄}, and determine the closest subspace to this translated collection. To undo the translation, we simply add 𝑎̄ to the best subspace, creating an “affine subspace.”
By calculating and graphing the singular values, we can get an idea as to whether or not the data set is low-dimensional. One way of doing this is to look at the relative sizes of successive singular values. Some caution: the size of the drop that signals when we can neglect higher dimensions depends on the problem! For some problems, a drop by a factor of 10 may be a good enough sign. For others, perhaps a factor of 100 may be necessary.
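
To make the recipe concrete, here is a small sketch (ours, assuming NumPy; the function name best_subspace is hypothetical) of the full procedure: center the points, take the SVD of the centered data matrix, read off the first 𝑘 right singular vectors, and inspect the singular values for a drop.

```python
import numpy as np

def best_subspace(points, k):
    """Closest k-dimensional affine subspace to the given points (rows),
    via Theorem 6.2 applied to the mean-centered data matrix."""
    A = np.asarray(points, dtype=float)        # m x n, rows are the a_j
    a_bar = A.mean(axis=0)                     # center of mass
    A_tilde = A - a_bar                        # translate the center to the origin
    # Right singular vectors are the columns of X in A_tilde = Y Sigma X^T.
    _, sigma, XT = np.linalg.svd(A_tilde, full_matrices=False)
    basis = XT[:k].T                           # orthonormal basis of W_k (n x k)
    return a_bar, basis, sigma

# The three points of Example 6.3: the best line is { a_bar + t * (+-[1, 0]) }.
a_bar, basis, sigma = best_subspace([[-1, 1], [0, 1], [1, 1]], k=1)
print(a_bar, basis.ravel(), sigma)             # [0. 1.], +-[1. 0.], [1.414..., 0.]
```

Plotting (or simply printing) the returned singular values is the quickest way to judge how low-dimensional the centered data really is.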

6.2. Least Squares and Moore-Penrose Pseudo-Inverse


Suppose 𝐿 ∈ ℒ (𝒱, 𝒲). If dim 𝒱 ≠ dim 𝒲, then 𝐿 will not have an
inverse. Moreover, even if dim 𝒱 = dim 𝒲, 𝐿 may not have an inverse.
However, it will often be the case that given a 𝑦 ∈ 𝒲, we will want to
find an 𝑥 ∈ 𝒱 such that 𝐿𝑥 is close to 𝑦. If 𝐿 has an inverse, then the
“correct” 𝑥 is clearly 𝐿−1 𝑦. Notice that in this case 𝐿−1 𝑦 clearly minimizes
‖𝐿𝑥 − 𝑦‖𝒲 over all possible 𝑥 ∈ 𝒱. (In fact, when 𝐿 has an inverse, the
minimum of ‖𝐿𝑥−𝑦‖𝒲 is zero!) So one way to generalize the inverse is to
consider the following process: given 𝑦 ∈ 𝒲, find 𝑥 ∈ 𝒱 that minimizes
‖𝐿𝑥−𝑦‖𝒲 . However, what if there are lots of minimizing 𝑥 ∈ 𝒱? Which
one should we pick? A common method is to pick the smallest such 𝑥.
Thus, to get a “pseudo”-inverse we use the following procedure: given
𝑦 ∈ 𝒲, find the smallest 𝑥 ∈ 𝒱 that minimizes ‖𝐿𝑥 − 𝑦‖𝒲 . Is there a
formula that we can use for this process?
Let’s look at the first part: given 𝑦 ∈ 𝒲, find an 𝑥 ∈ 𝒱 that mini-
mizes ‖𝐿𝑥 − 𝑦‖𝒲 . Notice that the set of all 𝐿𝑥 for which 𝑥 ∈ 𝒱 is the
range of 𝐿 (ℛ (𝐿)), and so therefore minimizing ‖𝐿𝑥 − 𝑦‖𝒲 means mini-
mizing the distance from 𝑦 to ℛ (𝐿), i.e. finding the projection of 𝑦 onto


ℛ (𝐿). If we have an orthonormal basis of ℛ (𝐿), calculating this projec-


tion is straightforward. Notice that an orthonormal basis of ℛ (𝐿) is one
of the things provided by the Singular Value Decomposition!
Recall that the Singular Value Decomposition says that there are or-
thonormal bases {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } of 𝒱 and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 } of 𝒲, and there
are non-negative numbers 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 such that 𝐿𝑥𝑖 = 𝜎 𝑖 𝑦 𝑖
and 𝐿∗ 𝑦 𝑖 = 𝜎 𝑖 𝑥𝑖 for 𝑖 = 1, 2, . . . , 𝑝. (Here, 𝑝 equals the minimum of
{dim 𝒱, dim 𝒲} = min{𝑛, 𝑚}.) Further, if 𝑟 ≔ max{𝑖 ∶ 𝜎 𝑖 > 0}, then
Theorem 3.51 and the following problem imply that

𝒩 (𝐿) = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑟 }⊥ .
Exercise 6.4. Show that ℛ (𝐿∗ ) = span{𝑥1 , 𝑥2 , . . . , 𝑥𝑟 }.

Thus, given 𝑦 ∈ 𝒲, Proposition 3.26 implies that


𝑃𝑦 = ⟨𝑦, 𝑦1 ⟩𝒲 𝑦1 + ⟨𝑦, 𝑦2 ⟩𝒲 𝑦2 + ⋯ + ⟨𝑦, 𝑦𝑟 ⟩𝒲 𝑦𝑟 .
Since 𝐿𝑥𝑖 = 𝜎 𝑖 𝑦 𝑖 and 𝜎 𝑖 ≠ 0 for 𝑖 = 1, 2, . . . , 𝑟, if we define
\[
\hat{x} = \frac{\langle y, y_1\rangle_{\mathcal{W}}}{\sigma_1} x_1 + \frac{\langle y, y_2\rangle_{\mathcal{W}}}{\sigma_2} x_2 + \cdots + \frac{\langle y, y_r\rangle_{\mathcal{W}}}{\sigma_r} x_r ,
\]
we have 𝐿𝑥̂ = 𝑃𝑦. Notice that 𝐿𝑥̂ minimizes the distance from 𝑦 to ℛ (𝐿),
and so 𝑥̂ is a minimizer of ‖𝐿𝑥 − 𝑦‖𝑊 .
We now claim that 𝑥̂ is the smallest 𝑥 ∈ 𝒱 such that 𝐿𝑥 = 𝑃𝑦. Let
𝑥 be any element of 𝒱 such that 𝐿𝑥 = 𝑃𝑦. Therefore, 𝐿(𝑥 − 𝑥)̂ = 0𝒲
and hence 𝑥 − 𝑥̂ ∈ 𝒩 (𝐿), i.e. 𝑥 = 𝑥̂ + 𝑧 for some 𝑧 ∈ 𝒩 (𝐿). Because

𝑥̂ ∈ span{𝑥1, 𝑥2, . . . , 𝑥𝑟} and 𝑧 ∈ 𝒩(𝐿) = span{𝑥1, 𝑥2, . . . , 𝑥𝑟}⊥, we have ⟨𝑥̂, 𝑧⟩𝒱 = 0. By the Pythagorean Theorem,
\[
\|x\|_{\mathcal{V}}^2 = \|\hat{x}\|_{\mathcal{V}}^2 + \|z\|_{\mathcal{V}}^2 \geq \|\hat{x}\|_{\mathcal{V}}^2 ,
\]
which means that 𝑥̂ has the smallest norm among all 𝑥 with 𝐿𝑥 = 𝑃𝑦.
This gives us a formula for our pseudo-inverse 𝐿† in terms of the or-
thonormal bases and the singular values of 𝐿:
\[
(6.5)\qquad L^{\dagger} y = \frac{\langle y, y_1\rangle_{\mathcal{W}}}{\sigma_1} x_1 + \frac{\langle y, y_2\rangle_{\mathcal{W}}}{\sigma_2} x_2 + \cdots + \frac{\langle y, y_r\rangle_{\mathcal{W}}}{\sigma_r} x_r .
\]
Thus, 𝐿† 𝑦 is the smallest element of 𝒱 that minimizes ‖𝐿𝑥 − 𝑦‖𝒲 .


Exercise 6.5. Explain why formula (6.5) above implies 𝐿† and 𝐿−1 are
the same when 𝐿 is invertible. (Think about what the singular triples of
𝐿−1 are when 𝐿−1 exists.)

What does this mean when 𝒱 = ℝ𝑛 and 𝒲 = ℝ𝑚 (with the dot


product as their inner product) and 𝐴 is an 𝑚 × 𝑛 matrix? Here we recall
the reduced Singular Value Decomposition of 𝐴: let 𝑋 be the 𝑛×𝑛 matrix
whose columns are given by {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and 𝑌 be the 𝑚 × 𝑚 matrix
whose columns are given by {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 }, and finally Σ̃ is the 𝑟 × 𝑟
diagonal matrix whose diagonal entries are 𝜎 𝑖 (for 𝑖 = 1, 2, . . . 𝑟, where 𝑟
is the rank of 𝐴). If 𝑋𝑟 denotes the first 𝑟 columns of 𝑋 and 𝑌𝑟 denotes
the first 𝑟 columns of 𝑌, then we know that 𝐴 = 𝑌𝑟 Σ̃ 𝑋𝑟𝑇. We now get a formula for 𝐴† in terms of these matrices. For 𝑗 = 1, 2, . . . , 𝑟, we have
\[
X_r (\tilde{\Sigma})^{-1} Y_r^T y_j
= \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \frac{1}{\sigma_2} & & \\ & & \ddots & \\ & & & \frac{1}{\sigma_r} \end{bmatrix}
\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_r^T \end{bmatrix} y_j
= \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \frac{1}{\sigma_2} & & \\ & & \ddots & \\ & & & \frac{1}{\sigma_r} \end{bmatrix}
\begin{bmatrix} y_1^T y_j \\ y_2^T y_j \\ \vdots \\ y_r^T y_j \end{bmatrix},
\]
and therefore (letting e𝑗,𝑟 be the first 𝑟 entries from the column vector e𝑗 in ℝ𝑚) we will have
\[
X_r (\tilde{\Sigma})^{-1} Y_r^T y_j
= \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \frac{1}{\sigma_2} & & \\ & & \ddots & \\ & & & \frac{1}{\sigma_r} \end{bmatrix} e_{j,r}
= \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix} \frac{1}{\sigma_j}\, e_{j,r}
= \frac{x_j}{\sigma_j},
\]

while for 𝑗 = 𝑟 + 1, . . . , 𝑚, we will have 𝑋𝑟(Σ̃)⁻¹𝑌𝑟𝑇 𝑦𝑗 = 0. Therefore, we see that 𝑋𝑟(Σ̃)⁻¹𝑌𝑟𝑇 𝑦𝑗 agrees with formula (6.5) applied to 𝑦𝑗 for each 𝑗 = 1, 2, . . . , 𝑚, and so we have 𝐴† = 𝑋𝑟(Σ̃)⁻¹𝑌𝑟𝑇:
\[
A^{\dagger} = \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \frac{1}{\sigma_2} & & \\ & & \ddots & \\ & & & \frac{1}{\sigma_r} \end{bmatrix}
\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_r^T \end{bmatrix}.
\]

Exercise 6.6. Use the formula above to give (yet) another explanation
that 𝐴† and 𝐴−1 are the same when 𝐴−1 exists.
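
For concreteness, the matrix formula can be assembled in a few lines. The sketch below is ours and assumes NumPy (np.linalg.pinv is NumPy's built-in pseudo-inverse, used only as a cross-check); it forms 𝐴† = 𝑋𝑟 Σ̃⁻¹ 𝑌𝑟𝑇 from the SVD and verifies that 𝑥̂ = 𝐴†𝑦 satisfies the normal equations, i.e. is a least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 6, 4, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # a rank-r matrix
y = rng.standard_normal(m)

Y, sigma, XT = np.linalg.svd(A, full_matrices=False)
r = np.sum(sigma > 1e-12 * sigma[0])          # numerical rank
Yr, Xr, sr = Y[:, :r], XT[:r].T, sigma[:r]

A_pinv = Xr @ np.diag(1.0 / sr) @ Yr.T        # A-dagger = X_r Sigma_tilde^{-1} Y_r^T
x_hat = A_pinv @ y

print(np.allclose(A_pinv, np.linalg.pinv(A)))       # matches the built-in pseudo-inverse
print(np.allclose(A.T @ (A @ x_hat - y), 0))        # normal equations hold for x_hat
```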

6.3. Eckart-Young-Mirsky for the Operator Norm


Suppose 𝒱 and 𝒲 are finite-dimensional inner-product spaces, with in-
ner products ⟨⋅, ⋅⟩𝒱 and ⟨⋅, ⋅⟩𝒲 , respectively, and suppose further that
𝐿 ∈ ℒ (𝒱, 𝒲). The Singular Value Decomposition provides a way to
see what the most important parts of 𝐿 are. This is related to the follow-
ing problem: given 𝑘 = 1, 2, . . . , rank 𝐿 − 1, what rank 𝑘 𝑀 ∈ ℒ (𝒱, 𝒲)
is closest to 𝐿? In other words, what is the closest rank 𝑘 operator to 𝐿? A very important ingredient to answering this question is determining what we mean by “closest” — how are we measuring distance? One approach is to use the operator norm: ‖𝐿‖𝑜𝑝 = sup{‖𝐿𝑥‖𝒲 / ‖𝑥‖𝒱 ∶ 𝑥 ≠ 0𝒱}, where ‖ ⋅ ‖𝒱 and ‖ ⋅ ‖𝒲 are the norms induced by the inner products. Thus, our problem is as follows: find 𝑀̃ ∈ ℒ(𝒱, 𝒲) with rank 𝑘 such that
\[
\|L - \tilde{M}\|_{op} = \inf\{ \|L - M\|_{op} : \operatorname{rank} M = k \} .
\]
We can get an upper bound on inf{‖𝐿 − 𝑀‖𝑜𝑝 ∶ rank 𝑀 = 𝑘} by con-
sidering a particular 𝑀. Suppose that dim 𝒱 = 𝑛 and dim 𝒲 = 𝑚,
and let 𝑝 = min{𝑛, 𝑚}. Let {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 } be the
orthonormal bases of 𝒱, 𝒲 and suppose 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 are the
singular values of 𝐿 provided by the SVD. If rank 𝐿 = 𝑟, then we know
that 𝜎 𝑖 = 0 when 𝑖 > 𝑟. Moreover, we know that for any 𝑥 ∈ 𝒱, we have
\[
L x = \sum_{i=1}^{r} \sigma_i \langle x_i, x\rangle_{\mathcal{V}} \, y_i .
\]


We now want to consider 𝑀𝑘 ∈ ℒ (𝒱, 𝒲) defined by


\[
(6.6)\qquad M_k x = \sum_{i=1}^{k} \sigma_i \langle x_i, x\rangle_{\mathcal{V}} \, y_i .
\]

Exercise 6.7. Show that ℛ (𝑀𝑘 ) = span{𝑦1 , 𝑦2 , . . . , 𝑦 𝑘 }, and conclude


that rank 𝑀𝑘 = 𝑘.

We will have
\[
(L - M_k) x = \sum_{i=k+1}^{r} \sigma_i \langle x_i, x\rangle_{\mathcal{V}} \, y_i ,
\]

and we now calculate ‖𝐿 − 𝑀𝑘 ‖𝑜𝑝 . Let 𝑥 = 𝑎1 𝑥1 + 𝑎2 𝑥2 + ⋯ + 𝑎𝑛 𝑥𝑛 . By


the orthonormality of the 𝑥𝑖 , we will have
\[
(L - M_k) x = \sum_{i=k+1}^{r} \sigma_i a_i y_i .
\]

By the Pythagorean Theorem and the fact that the singular values are
decreasing,
\[
\|(L - M_k) x\|_{\mathcal{W}}^2 = \sum_{i=k+1}^{r} \sigma_i^2 a_i^2
\leq \sigma_{k+1}^2 \sum_{i=k+1}^{r} a_i^2
\leq \sigma_{k+1}^2 \sum_{i=1}^{n} a_i^2 = \sigma_{k+1}^2 \|x\|_{\mathcal{V}}^2 ,
\]
𝑖=1

and therefore for any 𝑥 ≠ 0𝒱 ,


\[
\frac{\|(L - M_k) x\|_{\mathcal{W}}^2}{\|x\|_{\mathcal{V}}^2} \leq \sigma_{k+1}^2 .
\]
Taking square roots, we see that ‖(𝐿 − 𝑀𝑘)𝑥‖𝒲 / ‖𝑥‖𝒱 ≤ 𝜎𝑘+1 for any 𝑥 ≠ 0𝒱. Thus, ‖𝐿 − 𝑀𝑘‖𝑜𝑝 ≤ 𝜎𝑘+1. On the other hand, if we consider 𝑥𝑘+1, we have
\[
\frac{\|(L - M_k) x_{k+1}\|_{\mathcal{W}}}{\|x_{k+1}\|_{\mathcal{V}}} = \|\sigma_{k+1} y_{k+1}\|_{\mathcal{W}} = \sigma_{k+1} ,
\]
which implies ‖𝐿 − 𝑀𝑘‖𝑜𝑝 = 𝜎𝑘+1.


Therefore, inf{‖𝐿 − 𝑀‖𝑜𝑝 ∶ rank 𝑀 = 𝑘} ≤ 𝜎 𝑘+1 . (Why?) A harder


question: can we do any better? The answer turns out to be no!
Theorem 6.8 (Eckart-Young-Mirsky Theorem, Operator Norm). Sup-
pose 𝒱 and 𝒲 are two finite-dimensional inner-product spaces, and as-
sume dim 𝒱 = 𝑛 and dim 𝒲 = 𝑚. Let 𝑝 = min{𝑛, 𝑚}, and suppose
𝐿 ∈ ℒ (𝒱, 𝒲) and rank 𝐿 = 𝑟. Let {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 }
and 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 be the orthonormal bases of 𝒱, 𝒲 (respec-
tively) and the singular values of 𝐿 provided by the SVD. If 𝑀𝑘 is defined
by (6.6), we have
inf{‖𝐿 − 𝑀‖𝑜𝑝 ∶ rank 𝑀 = 𝑘} = 𝜎 𝑘+1 = ‖𝐿 − 𝑀𝑘 ‖𝑜𝑝 .
In other words, 𝑀𝑘 is the closest rank 𝑘 linear operator to 𝐿, as measured
by the operator norm.

Proof. Since the proof requires examining the singular values of differ-
ent operators, we use 𝜎 𝑘 (𝐴) to mean the 𝑘th singular value of 𝐴. In the
preceding paragraphs, we have shown that
inf{‖𝐿 − 𝑀‖𝑜𝑝 ∶ rank 𝑀 = 𝑘} ≤ 𝜎 𝑘+1 (𝐿) = ‖𝐿 − 𝑀𝑘 ‖𝑜𝑝 .
To finish the proof, we must show that ‖𝐿 − 𝑀‖𝑜𝑝 ≥ 𝜎 𝑘+1 (𝐿) for any
𝑀 ∈ ℒ (𝒱, 𝒲) with rank 𝑀 = 𝑘.
Suppose that 𝑀 ∈ ℒ (𝒱, 𝒲) has rank 𝑘, and recall that Theorem
5.2(b) tells us ‖𝐿 − 𝑀‖𝑜𝑝 = 𝜎1(𝐿 − 𝑀). We now use Weyl's inequality: 𝜎𝑖+𝑗−1(𝐿1 + 𝐿2) ≤ 𝜎𝑖(𝐿1) + 𝜎𝑗(𝐿2) for any 𝐿1, 𝐿2 ∈ ℒ(𝒱, 𝒲) and for any indices 𝑖, 𝑗 with 𝑖 + 𝑗 − 1 between 1 and min{𝑛, 𝑚}. Replacing 𝐿1 with 𝐿 − 𝑀 and 𝐿2 with 𝑀, and taking 𝑖 = 1 and 𝑗 = 𝑘 + 1, we will have
𝜎 𝑘+1 (𝐿) = 𝜎1+𝑘+1−1 (𝐿) ≤ 𝜎1 (𝐿 − 𝑀) + 𝜎 𝑘+1 (𝑀)
= 𝜎1 (𝐿 − 𝑀) = ‖𝐿 − 𝑀‖𝑜𝑝 ,
since rank 𝑀 = 𝑘 implies that 𝜎 𝑘+1 (𝑀) = 0. □
Corollary 6.9. Let 𝐴 be an 𝑚 × 𝑛 matrix, and let 𝑝 = min{𝑚, 𝑛}. Let
{𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 } be the orthonormal bases of ℝ𝑛 and
ℝ𝑚 and suppose 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 are the singular values provided by
the Singular Value Decomposition of 𝐴. Assume 𝑘 ≤ rank 𝐴, and let
\[
B_k = \sum_{i=1}^{k} \sigma_i(A)\, y_i x_i^T .
\]


Then 𝐵𝑘 has rank 𝑘 and


‖𝐴 − 𝐵𝑘 ‖𝑜𝑝 = inf{‖𝐴 − 𝐵‖𝑜𝑝 ∶ rank 𝐵 = 𝑘} .
Exercise 6.10. Prove the preceding corollary. (A possible approach:
show that 𝐵𝑘 and 𝑀𝑘 are equal.)
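
A quick numerical illustration of the corollary (ours, assuming NumPy): build 𝐵𝑘 from the singular triples of a random matrix and check that ‖𝐴 − 𝐵𝑘‖𝑜𝑝 = 𝜎𝑘+1(𝐴). The operator (spectral) norm of a matrix is its largest singular value, which NumPy exposes as np.linalg.norm(·, 2).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 7, 5, 2
A = rng.standard_normal((m, n))

Y, sigma, XT = np.linalg.svd(A, full_matrices=False)

# B_k = sum_{i <= k} sigma_i y_i x_i^T, the truncated SVD.
B_k = sum(sigma[i] * np.outer(Y[:, i], XT[i]) for i in range(k))

err = np.linalg.norm(A - B_k, 2)      # operator (spectral) norm of the difference
print(err, sigma[k])                  # both equal sigma_{k+1}(A)
```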

6.4. Eckart-Young-Mirsky for the Frobenius Norm and Image Compression

The previous section tells us how to “best” approximate a given linear
operator 𝐿 ∈ ℒ (𝒱, 𝒲) by one of lower rank when we use the operator
norm. In this section we consider a more concrete problem, approximat-
ing a matrix. Recall that if 𝐴 is a gray-scale matrix, a better measure of
distance and size is provided by the Frobenius norm. (In addition, the
Frobenius norm is easier to calculate!) Recall, the Frobenius norm of an
𝑚 × 𝑛 matrix is given by
\[
\|A\|_F := \left( \sum_{i,j} A_{ij}^2 \right)^{1/2},
\]

so ‖𝐴‖𝐹² is the sum of the squares of all of the entries of 𝐴. What is the
closest rank 𝑘 matrix to 𝐴, as measured in the Frobenius norm? Our
problem is: find 𝐵̃ with rank 𝑘 such that
‖𝐴 − 𝐵̃‖𝐹 = inf{‖𝐴 − 𝐵‖𝐹 ∶ rank 𝐵 = 𝑘}.
As in the case of the operator norm, we can get an upper bound on
the infimum by considering a particular 𝐵. We use the same particu-
lar matrix as in the operator norm: suppose 𝐴 = 𝑌 Σ𝑋 𝑇 is the Singular
Value Decomposition of 𝐴. Σ is an 𝑚 × 𝑛 matrix, whose diagonal entries
are Σ𝑖𝑖 = 𝜎 𝑖 (𝐴). Let Σ𝑘 be the 𝑚 × 𝑛 matrix that has the same first 𝑘
diagonal entries as Σ, and the remaining entries are all zeros. Now, let
𝐵 𝑘 = 𝑌 Σ𝑘 𝑋 𝑇 .
Exercise 6.11. Show that
\[
\|A - B_k\|_F^2 = \sigma_{k+1}^2(A) + \sigma_{k+2}^2(A) + \cdots + \sigma_r^2(A).
\]
(Recall the extraordinarily useful fact that the Frobenius norm is invari-
ant under orthogonal transformations, which means that if 𝑊 is any
appropriately sized square matrix such that 𝑊 𝑇 𝑊 = 𝐼, then we have


‖𝐴𝑊 𝑇 ‖𝐹 = ‖𝐴‖𝐹 and ‖𝑊𝐴‖𝐹 = ‖𝐴‖𝐹 ; and the consequence that Frobe-
nius norm is the square root of the sum of the squares of the singular
values.)

Exercise 6.11 implies that


\[
\inf\{ \|A - B\|_F : \operatorname{rank} B = k \} \leq \left( \sum_{j=k+1}^{r} \sigma_j^2(A) \right)^{1/2}.
\]

(Why?) A harder question: can we do any better? The answer turns out
to be no!
Theorem 6.12 (Eckart-Young-Mirsky Theorem, Frobenius Norm). Sup-
pose 𝐴 is an 𝑚 × 𝑛 matrix with rank 𝐴 = 𝑟. For any 𝑘 = 1, 2, . . . , 𝑟, if
𝐵𝑘 = 𝑌 Σ𝑘 𝑋 𝑇 (where 𝐴 = 𝑌 Σ𝑋 𝑇 is the Singular Value Decomposition of
𝐴), then we have
\[
\inf\{ \|A - B\|_F : \operatorname{rank} B = k \} = \left( \sum_{j=k+1}^{r} \sigma_j^2(A) \right)^{1/2} = \|A - B_k\|_F .
\]

In other words, 𝐵𝑘 is the closest (as measured by the Frobenius norm) rank
𝑘 matrix to 𝐴.

Proof. From Exercise 6.11, we know that


\[
\inf\{ \|A - B\|_F : \operatorname{rank} B = k \} \leq \left( \sum_{j=k+1}^{r} \sigma_j^2(A) \right)^{1/2} = \|A - B_k\|_F .
\]
To finish the proof, we show that ‖𝐴 − 𝐵‖𝐹² ≥ ∑_{𝑗=𝑘+1}^{𝑟} 𝜎𝑗²(𝐴) for any 𝐵 with rank 𝑘.
Let 𝐵 be an arbitrary matrix with rank 𝑘, and let 𝑝 = min{𝑛, 𝑚}. We have
\[
\|A - B\|_F^2 = \sum_{j=1}^{p} \sigma_j^2(A - B).
\]


We again use Weyl’s inequality: 𝜎 𝑖+𝑗−1 (𝐿1 + 𝐿2 ) ≤ 𝜎 𝑖 (𝐿1 ) + 𝜎𝑗 (𝐿2 ) for


any linear operators 𝐿1 , 𝐿2 ∈ ℒ (ℝ𝑛 , ℝ𝑚 ) and for any indices 𝑖 and 𝑗 for
which 𝑖 + 𝑗 − 1 is between 1 and min{𝑛, 𝑚}. Replacing 𝐿1 with 𝐴 − 𝐵 and
𝐿2 with 𝐵, we will have
𝜎 𝑖+𝑗−1 (𝐴) ≤ 𝜎 𝑖 (𝐴 − 𝐵) + 𝜎𝑗 (𝐵)
so long as 1 ≤ 𝑖 + 𝑗 − 1 ≤ 𝑝. Taking 𝑗 = 𝑘 + 1, and noting that rank
𝐵 = 𝑘 implies that 𝜎 𝑘+1 (𝐵) = 0, we have
𝜎 𝑖+𝑘 (𝐴) ≤ 𝜎 𝑖 (𝐴 − 𝐵) + 𝜎 𝑘+1 (𝐵) = 𝜎 𝑖 (𝐴 − 𝐵)
so long as 1 ≤ 𝑖 ≤ 𝑝 and 1 ≤ 𝑖 + 𝑘 ≤ 𝑝. Therefore, we have
\[
\begin{aligned}
\|A - B\|_F^2 = \sum_{i=1}^{p} \sigma_i^2(A - B)
&= \sum_{i=1}^{p-k} \sigma_i^2(A - B) + \sum_{i=p-k+1}^{p} \sigma_i^2(A - B) \\
&\geq \sum_{i=1}^{p-k} \sigma_i^2(A - B)
\geq \sum_{i=1}^{p-k} \sigma_{i+k}^2(A) \\
&= \sigma_{k+1}^2(A) + \sigma_{k+2}^2(A) + \cdots + \sigma_p^2(A)
= \|A - B_k\|_F^2 . \qquad \square
\end{aligned}
\]
Corollary 6.13. Let 𝐴 be an 𝑚 × 𝑛 matrix, and let 𝑝 = min{𝑚, 𝑛}. Let
{𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 } be the orthonormal bases of ℝ𝑛 and
ℝ𝑚 and let 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 be the singular values provided by the Singular Value Decomposition of 𝐴. Suppose that 𝑘 ≤ rank 𝐴, and define 𝐵𝑘 ≔ ∑_{𝑖=1}^{𝑘} 𝜎𝑖(𝐴) 𝑦𝑖 𝑥𝑖𝑇. Then 𝐵𝑘 has rank 𝑘 and
\[
\|A - B_k\|_F = \inf\{ \|A - M\|_F : \operatorname{rank} M = k \}.
\]
Exercise 6.14. Prove the preceding corollary.

As it turns out, 𝐵𝑘 will be the closest rank 𝑘 approximation to 𝐴 in


lots of norms! 𝐵𝑘 will in fact be the closest rank 𝑘 approximation to 𝐴
in any norm that is invariant under orthogonal transformations! That
is, if ‖ ⋅ ‖ is a norm on 𝑚 × 𝑛 matrices such that ‖𝑊𝐴‖ = ‖𝐴‖ and


‖𝐴𝑊‖ = ‖𝐴‖ for any appropriately sized orthogonal matrices 𝑊, then


𝐵𝑘 will be the closest rank 𝑘 matrix to 𝐴 in that norm! The proof of this
surprising generalization is due to Mirsky, see [29]. (Eckart and Young
first considered the problem of approximating a given matrix by one of
a lower rank, and provided a solution in 1936: [10].)
Exercise 6.15. Suppose 𝐴 is a given square invertible matrix. What is
the closest singular matrix to 𝐴? Why?
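
Since the section title promises image compression, here is a brief sketch of how Theorem 6.12 is used in that setting. It is our own illustration, assuming NumPy; any 𝑚 × 𝑛 array of gray-scale intensities plays the role of 𝐴, and a random array stands in for an actual image.

```python
import numpy as np

def compress(image, k):
    """Rank-k approximation B_k = Y Sigma_k X^T of a gray-scale image matrix."""
    Y, sigma, XT = np.linalg.svd(image, full_matrices=False)
    B_k = Y[:, :k] @ np.diag(sigma[:k]) @ XT[:k]
    # Storing Y_k, sigma_k, X_k needs k*(m + n + 1) numbers instead of m*n.
    return B_k, sigma

rng = np.random.default_rng(3)
image = rng.random((64, 48))        # stand-in for an actual gray-scale image
k = 10

B_k, sigma = compress(image, k)
err = np.linalg.norm(image - B_k)   # Frobenius norm by default for matrices
print(np.isclose(err, np.sqrt(np.sum(sigma[k:]**2))))   # True, by Theorem 6.12
```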

6.5. The Orthogonal Procrustes Problem


Suppose we have a collection of 𝑚 points in ℝ𝑛 , representing some con-
figuration of points in ℝ𝑛 , and we want to know how close this “test”
configuration is to a given reference configuration. In many situations,
as long as the distances and angles between the points are the same, two
configurations are regarded as the same. Thus, to determine how close
the test configuration is to the reference configuration, we want to trans-
form the test configuration to be as close as possible to the reference
configuration — making sure to preserve lengths and angles in the test
configuration. Notice that lengths and angles (or at least their cosines)
are determined by the dot product, so we want transformations that pre-
serve dot product. We will confine ourselves to linear transformations.
Theorem 6.16. Suppose 𝑈 ∈ ℒ (ℝ𝑛 , ℝ𝑛 ). The following three conditions
are equivalent:
(1) ‖𝑈𝑥‖ = ‖𝑥‖ for all 𝑥 ∈ ℝ𝑛 (where ‖𝑣‖2 = 𝑣 ⋅ 𝑣 for any 𝑣 ∈ ℝ𝑛 ).
(2) 𝑈𝑥 ⋅ 𝑈𝑦 = 𝑥 ⋅ 𝑦 for all 𝑥, 𝑦 ∈ ℝ𝑛 .
(3) 𝑈 𝑇 𝑈 = 𝐼 (or equivalently 𝑈𝑈 𝑇 = 𝐼), i.e. 𝑈 is an orthogonal
matrix.

Proof. Suppose that (1) is true. Let 𝑥, 𝑦 ∈ ℝ𝑛 be arbitrary. We have


‖𝑈𝑥 − 𝑈𝑦‖2 = ‖𝑈𝑥‖2 − 2𝑈𝑥 ⋅ 𝑈𝑦 + ‖𝑈𝑦‖2
= ‖𝑥‖2 + ‖𝑦‖2 − 2𝑈𝑥 ⋅ 𝑈𝑦,
and in addition
‖𝑥 − 𝑦‖2 = ‖𝑥‖2 − 2𝑥 ⋅ 𝑦 + ‖𝑦‖2 = ‖𝑥‖2 + ‖𝑦‖2 − 2𝑥 ⋅ 𝑦.


Since ‖𝑈𝑥 − 𝑈𝑦‖2 = ‖𝑈(𝑥 − 𝑦)‖2 = ‖𝑥 − 𝑦‖2 by assumption, comparing


the previous two lines we must have 𝑈𝑥 ⋅ 𝑈𝑦 = 𝑥 ⋅ 𝑦. Therefore, we have
(1) ⟹ (2).
Suppose now that (2) is true. Let 𝑥 ∈ ℝ𝑛 be arbitrary. We will show
that for any 𝑦 ∈ ℝ𝑛 , 𝑈 𝑇 𝑈𝑥 ⋅ 𝑦 = 𝑥 ⋅ 𝑦. We have
𝑈 𝑇 𝑈𝑥 ⋅ 𝑦 = 𝑈𝑥 ⋅ 𝑈𝑦 = 𝑥 ⋅ 𝑦.
Since 𝑈 𝑇 𝑈𝑥 ⋅ 𝑦 = 𝑥 ⋅ 𝑦 for any 𝑦 ∈ ℝ𝑛 , we must have 𝑈 𝑇 𝑈𝑥 = 𝑥. Since
𝑥 ∈ ℝ𝑛 is arbitrary, we see that 𝑈 𝑇 𝑈 = 𝐼. Thus, (2) ⟹ (3).
Finally, suppose that (3) is true. We will have
‖𝑈𝑥‖2 = 𝑈𝑥 ⋅ 𝑈𝑥 = 𝑈 𝑇 𝑈𝑥 ⋅ 𝑥 = 𝑥 ⋅ 𝑥 = ‖𝑥‖2 ,
since 𝑈 𝑇 𝑈 = 𝐼. Therefore, (1) ⟹ (2) ⟹ (3) ⟹ (1), and so the
three conditions are equivalent. □
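
A quick numerical sanity check of the equivalence (ours, assuming NumPy): generate an orthogonal matrix from a QR factorization and verify all three conditions on random vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
U = np.linalg.qr(rng.standard_normal((n, n)))[0]   # orthogonal: U^T U = I

x, y = rng.standard_normal(n), rng.standard_normal(n)

print(np.allclose(U.T @ U, np.eye(n)))                        # condition (3)
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))   # condition (1)
print(np.isclose((U @ x) @ (U @ y), x @ y))                   # condition (2)
```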

Suppose now 𝐴 and 𝐵 are two fixed 𝑚 × 𝑛 matrices. We view 𝐴 and


𝐵 as made up of 𝑚 rows, each consisting of the transpose of an element
of ℝ𝑛 :
\[
A = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix}
\qquad\text{and}\qquad
B = \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_m^T \end{bmatrix}.
\]
Notice that if 𝑈 is an orthogonal matrix, then Theorem 6.16 tells us that
(as a transformation) 𝑈 will preserve dot products and hence lengths
and angles. Next, notice that

𝑈𝐴𝑇 = 𝑈 [ 𝑎1 𝑎2 ... 𝑎𝑚 ] = [ 𝑈𝑎1 𝑈𝑎2 ... 𝑈𝑎𝑚 ] ,

which means that the columns of 𝑈𝐴𝑇 represent a configuration that has
the same lengths and angles as the original configuration represented by
𝐴𝑇 . Next, for the distance to the reference configuration 𝐵, we calculate
the Frobenius norm squared of 𝐴𝑈 𝑇 − 𝐵, i.e. the sum of the squared
distances between the corresponding rows of 𝐴𝑈 𝑇 and 𝐵. Therefore,
finding the closest configuration to the given reference 𝐵 means solving
the following:
minimize ‖𝐴𝑈 𝑇 − 𝐵‖2𝐹 over all 𝑈 such that 𝑈 𝑇 𝑈 = 𝐼.


Since taking the transpose of an orthogonal matrix again yields an or-


thogonal matrix, replacing 𝑈 𝑇 above with 𝑉, we look at the following
problem: find 𝑉̂ with 𝑉̂𝑇𝑉̂ = 𝐼 such that
\[
(6.7)\qquad \|A \hat{V} - B\|_F^2 = \inf_{V^T V = I} \|A V - B\|_F^2 .
\]

As should be no surprise, we can determine a minimizing 𝑉ˆ in terms of


the Singular Value Decomposition.
Theorem 6.17. Suppose 𝐴 and 𝐵 are arbitrary 𝑚 × 𝑛 matrices. Next,
suppose that 𝐴𝑇 𝐵 = 𝑌 Σ𝑋 𝑇 is the Singular Value Decomposition of 𝐴𝑇 𝐵.
Then 𝑉ˆ = 𝑌 𝑋 𝑇 is a minimizer for (6.7).

Proof. Notice that the Frobenius norm is the norm associated with the
Frobenius inner product ⟨𝐴, 𝐵⟩𝐹 = tr 𝐴𝑇 𝐵. Therefore, we will have
‖𝐴𝑉 − 𝐵‖2𝐹 = ⟨𝐴𝑉 − 𝐵, 𝐴𝑉 − 𝐵⟩𝐹
= ⟨𝐴𝑉, 𝐴𝑉⟩𝐹 − 2 ⟨𝐴𝑉, 𝐵⟩𝐹 + ⟨𝐵, 𝐵⟩𝐹
= tr (𝐴𝑉)𝑇 𝐴𝑉 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= tr 𝑉 𝑇 𝐴𝑇 𝐴𝑉 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= tr 𝐴𝑇 𝐴𝑉𝑉 𝑇 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= tr 𝐴𝑇 𝐴 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= ‖𝐴‖2𝐹 + ‖𝐵‖2𝐹 − 2tr 𝑉 𝑇 𝐴𝑇 𝐵.
Since ‖𝐴‖2𝐹 and ‖𝐵‖2𝐹 are fixed, minimizing ‖𝐴𝑉 − 𝐵‖2𝐹 over all 𝑉 with
𝑉 𝑇 𝑉 = 𝐼 is equivalent to maximizing tr 𝑉 𝑇 𝐴𝑇 𝐵 over all 𝑉 with 𝑉 𝑇 𝑉 = 𝐼.
Notice that since 𝐴 and 𝐵 are 𝑚 × 𝑛 matrices, 𝐴𝑇 𝐵 is an 𝑛 × 𝑛 matrix.
Thus, 𝐴𝑇 𝐵 = 𝑌 Σ𝑋 𝑇 means that the matrices 𝑌 , Σ, and 𝑋 are all 𝑛 × 𝑛
matrices. Moreover, since the columns of 𝑌 and 𝑋 are orthonormal, the
matrices 𝑌 and 𝑋 are orthogonal: 𝑌 𝑇 𝑌 = 𝐼 and 𝑋 𝑇 𝑋 = 𝐼. Therefore,
\[
\begin{aligned}
\sup_{V^T V = I} \operatorname{tr} V^T A^T B
&= \sup_{V^T V = I} \operatorname{tr} V^T Y \Sigma X^T
= \sup_{V^T V = I} \operatorname{tr} \Sigma X^T V^T Y \\
&= \sup_{Z^T Z = I} \operatorname{tr} \Sigma Z
= \sup_{Z^T Z = I} \sum_{i=1}^{n} \sigma_i z_{ii} ,
\end{aligned}
\]


where we replaced 𝑉 with 𝑍 = 𝑋 𝑇 𝑉 𝑇 𝑌 . Now, for any orthogonal ma-


trix 𝑍, each column of 𝑍 is a unit vector, and therefore any entry in a
column of 𝑍 must be between -1 and 1. Thus, −1 ≤ 𝑧𝑖𝑖 ≤ 1, and there-
𝑛 𝑛
fore ∑𝑖=1 𝜎 𝑖 𝑧𝑖𝑖 ≤ ∑𝑖=1 𝜎 𝑖 . Thus,
𝑛 𝑛
sup tr Σ𝑍 = sup ∑ 𝜎 𝑖 𝑧𝑖𝑖 ≤ ∑ 𝜎 𝑖 .
𝑍 𝑇 𝑍=𝐼 𝑍 𝑇 𝑍=𝐼 𝑖=1 𝑖=1
Moreover, we will have equality above when 𝑍 = 𝐼. That is, 𝐼 is a
maximizer for tr Σ𝑍. Therefore, a maximizer for tr 𝑉 𝑇 𝐴𝑇 𝐵 occurs when
𝑋 𝑇 𝑉ˆ𝑇 𝑌 = 𝐼, i.e. when 𝑉ˆ𝑇 = 𝑋𝑌 𝑇 , which is exactly when 𝑉ˆ = 𝑌 𝑋 𝑇 . In
addition, we will have
\[
\sup_{V^T V = I} \operatorname{tr} V^T A^T B = \sum_{i=1}^{n} \sigma_i ,
\]

where the 𝜎𝑖 are the singular values of 𝐴𝑇𝐵. Thus, a minimizer for ‖𝐴𝑉 − 𝐵‖𝐹² subject to 𝑉𝑇𝑉 = 𝐼 is provided by 𝑉̂ = 𝑌𝑋𝑇 (where 𝑌Σ𝑋𝑇 is the Singular Value Decomposition of 𝐴𝑇𝐵), and we have
\[
\inf_{V^T V = I} \|A V - B\|_F^2 = \|A \hat{V} - B\|_F^2 = \|A\|_F^2 + \|B\|_F^2 - 2 \sum_{i=1}^{n} \sigma_i . \qquad \square
\]
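
The solution translates directly into code. The sketch below is ours and assumes NumPy: it forms 𝑉̂ = 𝑌𝑋𝑇 from the SVD of 𝐴𝑇𝐵 and checks, against a sample of random orthogonal matrices, that none of them does better.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 10, 3
A = rng.standard_normal((m, n))
V_true = np.linalg.qr(rng.standard_normal((n, n)))[0]
B = A @ V_true + 0.05 * rng.standard_normal((m, n))   # rotated copy of A plus noise

# Orthogonal Procrustes: V_hat = Y X^T, where A^T B = Y Sigma X^T.
Y, _, XT = np.linalg.svd(A.T @ B)
V_hat = Y @ XT

best = np.linalg.norm(A @ V_hat - B)
random_best = min(
    np.linalg.norm(A @ np.linalg.qr(rng.standard_normal((n, n)))[0] - B)
    for _ in range(1000)
)
print(best <= random_best)     # V_hat is at least as good as every sampled orthogonal V
```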

Note that the Orthogonal Procrustes problem above is slightly dif-


ferent from the problem where we additionally require that 𝑈 preserve
orientation, since preserving orientation requires that det 𝑈 > 0, and 𝑈
orthogonal means that 𝑈 𝑇 𝑈 = 𝐼 and hence the orthogonality of 𝑈 tells
us only that det 𝑈 = ±1. Of course, notice that if det 𝑌 𝑋 𝑇 > 0, then the
minimizer of ‖𝐴𝑉 − 𝐵‖2𝐹 subject to 𝑉 𝑇 𝑉 = 𝐼 and det 𝑉 = 1 is provided
by 𝑌 𝑋 𝑇 . Thus, for the constrained Orthogonal Procrustes problem, the
interesting situation is when det 𝑌 𝑋 𝑇 = −1. For this, the following tech-
nical proposition from [23] (see also [2]) is useful.
Proposition 6.18. Suppose Σ is an 𝑛 × 𝑛 diagonal matrix, with entries
𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑛 ≥ 0. Then
(1) For any orthogonal 𝑊, we have tr Σ𝑊 ≤ tr Σ.
(2) For any orthogonal 𝐵, 𝑊, we have tr 𝐵 𝑇 Σ𝐵𝑊 ≤ tr 𝐵 𝑇 Σ𝐵.


(3) For every orthogonal 𝑊 with det 𝑊 < 0, we have


\[
\operatorname{tr} \Sigma W \leq \left( \sum_{j=1}^{n-1} \sigma_j \right) - \sigma_n .
\]

Part (2) of Proposition 6.18 has an interpretation in terms of similar


matrices.
Definition 6.19. Suppose 𝐴 is an 𝑛 × 𝑛 matrix. We say that 𝐶 is similar
to 𝐴 exactly when there is an invertible matrix 𝑃 such that 𝐶 = 𝑃 −1 𝐴𝑃.

With this definition in mind, (2) says that if 𝐴 is orthogonally similar


to Σ, then tr 𝐴𝑊 ≤ tr 𝐴 for any orthogonal 𝑊.

Proof. For (1), if 𝑊 is an orthogonal matrix, then the columns of 𝑊


form an orthonormal set. In particular, if 𝑤 𝑖 is the 𝑖th column of 𝑊,
then ‖𝑤 𝑖 ‖ = 1, which means that any particular entry of 𝑤 𝑖 is between
−1 and 1. Thus, if 𝑤𝑖 = [𝑤1𝑖 𝑤2𝑖 . . . 𝑤𝑛𝑖]𝑇, we have −1 ≤ 𝑤𝑗𝑗 ≤ 1 and hence
\[
\operatorname{tr} \Sigma W = \sum_{j=1}^{n} \sigma_j w_{jj} \leq \sum_{j=1}^{n} \sigma_j = \operatorname{tr} \Sigma .
\]

For (2), let 𝐵, 𝑊 be arbitrary orthogonal matrices. Notice that the


product of orthogonal matrices is again an orthogonal matrix, and so by
(1), we have
tr 𝐵 𝑇 Σ𝐵𝑊 = tr Σ𝐵𝑊𝐵 𝑇 ≤ tr Σ = tr Σ𝐵𝐵 𝑇 = tr 𝐵 𝑇 Σ𝐵,
where we have made use of Lemma 2.12.
(3) is much more technically involved. Suppose now that 𝑊 is an
orthogonal matrix, and assume det 𝑊 < 0. Since 𝑊 is orthogonal, we
must have det 𝑊 = ±1 and so det 𝑊 < 0 means that det 𝑊 = −1. Thus,
det(𝑊 + 𝐼) = det(𝑊 + 𝑊𝑊 𝑇 )
= det (𝑊(𝐼 + 𝑊 𝑇 ))
= det 𝑊 det(𝐼 + 𝑊 𝑇 )
= − det(𝐼 + 𝑊 𝑇 )𝑇
= − det(𝐼 + 𝑊).


Therefore, 2 det(𝑊 + 𝐼) = 0, and so we must have det(𝑊 + 𝐼) = 0. This


means that -1 is an eigenvalue of 𝑊, and so there must be a unit vector
𝑥 such that 𝑊𝑥 = −𝑥. We then also have 𝑊 𝑇 𝑊𝑥 = −𝑊 𝑇 𝑥, and thus
𝑥 = −𝑊 𝑇 𝑥. In particular, this implies that 𝑥 is in fact an eigenvector for
both 𝑊 and 𝑊 𝑇 , with eigenvalue -1. We can find an orthonormal basis
{𝑏1 , 𝑏2 , . . . , 𝑏𝑛−1 , 𝑥} of ℝ𝑛 . Relabeling 𝑥 as 𝑏𝑛 , and letting 𝐵 be the matrix
whose columns are 𝑏1 , 𝑏2 , . . . , 𝑏𝑛−1 , 𝑏𝑛 , we see 𝐵 will be an orthogonal
matrix. Moreover, we have

𝐵 𝑇 𝑊𝐵 = 𝐵 𝑇 𝑊 [𝑏1 𝑏2 ⋯ 𝑏𝑛−1 𝑏𝑛 ]

= 𝐵 𝑇 [𝑊𝑏1 𝑊𝑏2 ⋯ 𝑊𝑏𝑛−1 𝑊𝑏𝑛 ]

𝑏𝑇1
⎡ ⎤
⎢ 𝑏𝑇2 ⎥
=⎢ ⋮ ⎥ [𝑊𝑏1 𝑊𝑏2 ⋯ 𝑊𝑏𝑛−1 −𝑏𝑛 ]
⎢ 𝑇 ⎥
⎢ 𝑏𝑛−1 ⎥
⎣ 𝑏𝑛𝑇 ⎦
Therefore, the entries of 𝐵 𝑇 𝑊𝐵 are given by the products 𝑏𝑇𝑖 𝑊𝑏𝑗 . We
now claim that the last column of 𝐵 𝑇 𝑊𝐵 is −e𝑛 , and the last row of
𝐵 𝑇 𝑊𝐵 is −e𝑇𝑛 . The last column will have entries 𝑏𝑖𝑇(−𝑏𝑛), which is 0
for 𝑖 ≠ 𝑛 and -1 when 𝑖 = 𝑛. In other words, the last column is −e𝑛 .
Similarly, the last row will have entries
𝑏𝑛𝑇 𝑊𝑏𝑗 = 𝑏𝑛 ⋅ 𝑊𝑏𝑗 = 𝑊 𝑇 𝑏𝑛 ⋅ 𝑏𝑗 = −𝑏𝑛 ⋅ 𝑏𝑗 ,
which again is either 0 (if 𝑗 ≠ 𝑛) or −1 (if 𝑗 = 𝑛). Thus, the last row is −e𝑛𝑇. This means that we can write 𝐵𝑇𝑊𝐵 in block form as
\[
(6.8)\qquad B^T W B = \begin{bmatrix} B_0 & 0_{\mathbb{R}^{n-1}} \\ 0_{\mathbb{R}^{n-1}}^T & -1 \end{bmatrix},
\]
where 𝐵0 is an (𝑛 − 1) × (𝑛 − 1) matrix. Moreover, since 𝐵 and 𝑊 are orthogonal matrices, so too is 𝐵𝑇𝑊𝐵, which implies that the columns of 𝐵𝑇𝑊𝐵 form an orthonormal basis of ℝ𝑛. Since the first 𝑛 − 1 entries in the last row of 𝐵𝑇𝑊𝐵 are all zeros, the columns of 𝐵0 are orthonormal


in ℝ𝑛−1 , which means that 𝐵0 is an (𝑛 − 1) × (𝑛 − 1) orthogonal matrix.


Similarly, we can write the product 𝐵𝑇 Σ𝐵 in block form:
\[
(6.9)\qquad B^T \Sigma B = \begin{bmatrix} A_0 & a \\ c^T & \gamma \end{bmatrix}.
\]
Here, 𝐴0 is an (𝑛 − 1) × (𝑛 − 1) matrix, 𝑎, 𝑐 ∈ ℝ𝑛−1 and 𝛾 ∈ ℝ. Consider
now the matrix
\[
U := \begin{bmatrix} B_0 & 0_{\mathbb{R}^{n-1}} \\ 0_{\mathbb{R}^{n-1}}^T & 1 \end{bmatrix}.
\]
Note that 𝑈 is an orthogonal matrix. Using block multiplication, we
have
\[
(6.10)\qquad B^T \Sigma B\, U = \begin{bmatrix} A_0 & a \\ c^T & \gamma \end{bmatrix}
\begin{bmatrix} B_0 & 0_{\mathbb{R}^{n-1}} \\ 0_{\mathbb{R}^{n-1}}^T & 1 \end{bmatrix}
= \begin{bmatrix} A_0 B_0 & a \\ c^T B_0 & \gamma \end{bmatrix}.
\]
Now, since 𝑈 is an orthogonal matrix, part (2) of this proposition tells us that
tr 𝐵 𝑇 Σ𝐵𝑈 ≤ tr 𝐵 𝑇 Σ𝐵.
From (6.9) and (6.10) we then have
tr 𝐴0 𝐵0 + 𝛾 = tr 𝐵 𝑇 Σ𝐵𝑈 ≤ tr 𝐵 𝑇 Σ𝐵 = tr 𝐴0 + 𝛾,
and so we must have
(6.11) tr 𝐴0 𝐵0 ≤ tr 𝐴0 .
Next, using (6.8) and (6.9), we have
\[
\begin{aligned}
\operatorname{tr} \Sigma W &= \operatorname{tr} \Sigma B B^T W
= \operatorname{tr} \Sigma B B^T W B B^T
= \operatorname{tr} B^T \Sigma B\, B^T W B \\
&= \operatorname{tr} \left( \begin{bmatrix} A_0 & a \\ c^T & \gamma \end{bmatrix}
\begin{bmatrix} B_0 & 0_{\mathbb{R}^{n-1}} \\ 0_{\mathbb{R}^{n-1}}^T & -1 \end{bmatrix} \right)
= \operatorname{tr} \begin{bmatrix} A_0 B_0 & -a \\ c^T B_0 & -\gamma \end{bmatrix}
= \operatorname{tr} A_0 B_0 - \gamma .
\end{aligned}
\]


Combining with (6.11) yields


(6.12) tr Σ𝑊 = tr 𝐴0 𝐵0 − 𝛾 ≤ tr 𝐴0 − 𝛾.
Next, we look at the entries that make up 𝐴0 . Looking at the left side of
(6.9), we have

\[
B^T \Sigma B = B^T \Sigma \begin{bmatrix} b_1 & b_2 & \cdots & b_{n-1} & b_n \end{bmatrix}
= B^T \begin{bmatrix} \Sigma b_1 & \Sigma b_2 & \cdots & \Sigma b_{n-1} & \Sigma b_n \end{bmatrix}
= \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_{n-1}^T \\ b_n^T \end{bmatrix}
\begin{bmatrix} \Sigma b_1 & \Sigma b_2 & \cdots & \Sigma b_{n-1} & \Sigma b_n \end{bmatrix}.
\]
Thus, the 𝑖𝑗th entry of 𝐵 𝑇 Σ𝐵 is given by 𝑏𝑇𝑖 Σ𝑏𝑗 , which is the dot product
of 𝑏𝑖 and Σ𝑏𝑗 . Therefore, (6.9) tells us 𝛾 = 𝑏𝑛𝑇 Σ𝑏𝑛 and
\[
\operatorname{tr} A_0 = \sum_{\ell=1}^{n-1} b_\ell^T \Sigma b_\ell = \sum_{\ell=1}^{n} b_\ell^T \Sigma b_\ell - b_n^T \Sigma b_n .
\]
Now, if 𝑏ℓ = [𝑏1ℓ 𝑏2ℓ ⋯ 𝑏𝑛−1,ℓ 𝑏𝑛ℓ]𝑇, then since Σ is a diagonal matrix with diagonal entries 𝜎1, 𝜎2, . . . , 𝜎𝑛, we will have
\[
\Sigma b_\ell = \begin{bmatrix} \sigma_1 b_{1\ell} \\ \sigma_2 b_{2\ell} \\ \vdots \\ \sigma_{n-1} b_{n-1,\ell} \\ \sigma_n b_{n\ell} \end{bmatrix}.
\]

In particular, we will have 𝑏ℓ𝑇 Σ𝑏ℓ = ∑_{𝑗=1}^{𝑛} 𝜎𝑗 𝑏𝑗ℓ², and so
\[
\operatorname{tr} A_0 = \sum_{\ell=1}^{n} \left( \sum_{j=1}^{n} \sigma_j b_{j\ell}^2 \right) - b_n^T \Sigma b_n
= \sum_{j=1}^{n} \sigma_j \left( \sum_{\ell=1}^{n} b_{j\ell}^2 \right) - b_n^T \Sigma b_n .
\]
Now, the sum ∑_{ℓ=1}^{𝑛} 𝑏𝑗ℓ² is the sum of the squares of the entries in row 𝑗 of the matrix 𝐵. Since 𝐵 is orthogonal, so too is 𝐵𝑇. In particular, that means that the (transposes of the) rows of 𝐵 form an orthonormal basis of ℝ𝑛. Thus each row of 𝐵 must have Euclidean norm equal to 1, i.e. ∑_{ℓ=1}^{𝑛} 𝑏𝑗ℓ² = 1 for each 𝑗. Therefore, we have
\[
(6.13)\qquad \operatorname{tr} A_0 = \sum_{j=1}^{n} \sigma_j - b_n^T \Sigma b_n = \operatorname{tr} \Sigma - b_n^T \Sigma b_n = \operatorname{tr} \Sigma - \gamma .
\]

Next, we estimate 𝛾 = 𝑏𝑛𝑇 Σ𝑏𝑛 . Because 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑛 ≥ 0, we have


\[
\begin{aligned}
(6.14)\qquad \gamma = b_n^T \Sigma b_n
&= \begin{bmatrix} b_{1n} & b_{2n} & \cdots & b_{n-1,n} & b_{nn} \end{bmatrix}
\begin{bmatrix} \sigma_1 b_{1n} \\ \sigma_2 b_{2n} \\ \vdots \\ \sigma_{n-1} b_{n-1,n} \\ \sigma_n b_{nn} \end{bmatrix} \\
&= \sum_{\ell=1}^{n} \sigma_\ell b_{\ell n}^2
\geq \sum_{\ell=1}^{n} \sigma_n b_{\ell n}^2 = \sigma_n \|b_n\|^2 = \sigma_n ,
\end{aligned}
\]

since the (Euclidean) norm of 𝑏𝑛 is 1. Thus, combining (6.12), (6.13),


and (6.14) we finally have
\[
\operatorname{tr} \Sigma W \leq \operatorname{tr} A_0 - \gamma = \operatorname{tr} \Sigma - \gamma - \gamma
\leq \operatorname{tr} \Sigma - 2\sigma_n
= \left( \sum_{j=1}^{n-1} \sigma_j \right) - \sigma_n . \qquad \square
\]


We can now solve the problem of finding the orientation preserving


orthogonal matrix that minimizes ‖𝐴𝑉 − 𝐵‖2𝐹 . This problem is relevant
in computational chemistry. There, 𝐵 may represent a configuration of
the atoms in a molecule in its lowest energy configuration and 𝐴 rep-
resents another configuration of the molecule. In this situation, it is
important to require that 𝑈 preserve orientation, since not all chemical
properties are preserved by non-orientation preserving transformations.
The original references are [18], [19], and [41].
Theorem 6.20 (Kabsch-Umeyama Algorithm). Let 𝐴, 𝐵 ∈ ℝ𝑚×𝑛 be given.
Suppose 𝐴𝑇 𝐵 = 𝑌 Σ𝑋 𝑇 is the Singular Value Decomposition of 𝐴𝑇 𝐵. Let
𝐼 ̂ be the 𝑛 × 𝑛 diagonal matrix whose 𝑖𝑖th entries are 1 for 1 ≤ 𝑖 < 𝑛 and
whose 𝑛𝑛th entry is det 𝑌𝑋𝑇. Then 𝑉̂ ≔ 𝑌𝐼̂𝑋𝑇 minimizes ‖𝐴𝑉 − 𝐵‖𝐹² over
all orthogonal 𝑉 with det 𝑉 = 1. (Notice that if det 𝑌 𝑋 𝑇 = 1, then 𝐼 ̂ is
simply the 𝑛 × 𝑛 identity matrix, while if det 𝑌 𝑋 𝑇 = −1, then 𝐼 ̂ is the 𝑛 × 𝑛
identity matrix whose lower right entry has been replaced with -1.)

Proof. As in the proof for the solution of the Orthogonal Procrustes


Problem, we have
‖𝐴𝑉 − 𝐵‖2𝐹 = ‖𝐴‖2𝐹 + ‖𝐵‖2𝐹 − 2tr 𝑉 𝑇 𝐴𝑇 𝐵,
and so to minimize ‖𝐴𝑉 − 𝐵‖2𝐹 over all orthogonal matrices 𝑉 with
det 𝑉 = 1, it suffices to maximize tr 𝑉 𝑇 𝐴𝑇 𝐵 over all orthogonal matrices
𝑉 with det 𝑉 = 1. Suppose 𝐴𝑇 𝐵 = 𝑌 Σ𝑋 𝑇 is the Singular Value Decom-
position of 𝐴𝑇 𝐵. In particular, both 𝑌 and 𝑋 are orthogonal matrices.
Thus, we have
\[
\sup_{\substack{V^T V = I \\ \det V = 1}} \operatorname{tr} V^T A^T B
= \sup_{\substack{V^T V = I \\ \det V = 1}} \operatorname{tr} V^T Y \Sigma X^T
= \sup_{\substack{V^T V = I \\ \det V = 1}} \operatorname{tr} \Sigma X^T V^T Y .
\]

Notice that 𝑋 𝑇 𝑉 𝑇 𝑌 is an orthogonal matrix, and so we would like to


replace 𝑋 𝑇 𝑉 𝑇 𝑌 with 𝑍, and maximize tr Σ𝑍 over all orthogonal 𝑍, just
as we did in the general Procrustes Problem. However, in this situation,
we require that det 𝑉 = 1. Notice that if 𝑍 = 𝑋 𝑇 𝑉 𝑇 𝑌 , then we have
det 𝑍 = det 𝑋 𝑇 det 𝑉 𝑇 det 𝑌 = det 𝑋 𝑇 det 𝑌 = det 𝑌 𝑋 𝑇 .


Therefore, we can look at maximizing tr Σ𝑍 over all orthogonal 𝑍 with


det 𝑍 = det 𝑌 𝑋 𝑇 . That is:
\[
(6.15)\qquad \sup_{\substack{V^T V = I \\ \det V = 1}} \operatorname{tr} V^T A^T B
= \sup_{\substack{Z^T Z = I \\ \det Z = \det Y X^T}} \operatorname{tr} \Sigma Z .
\]

We consider now two cases: (i) det 𝑌 𝑋 𝑇 = 1, and (ii) det 𝑌 𝑋 𝑇 = −1.
Case (i): det 𝑌𝑋𝑇 = 1. We claim that in this case 𝑍̂ = 𝐼̂ maximizes tr Σ𝑍 over all orthogonal 𝑍 with det 𝑍 = 1. By (1) of Proposition 6.18, we know that tr Σ𝑍 ≤ ∑_{𝑖=1}^{𝑛} 𝜎𝑖 for any orthogonal 𝑍. Moreover, equality is attained for 𝑍̂ ≔ 𝐼. By the definition of 𝐼̂ in the statement of the theorem, 𝐼̂ = 𝐼 in this case. Thus, 𝑍̂ = 𝐼̂ is the maximizer. Since we replaced 𝑋𝑇𝑉𝑇𝑌 with 𝑍, a maximizer of tr Σ𝑋𝑇𝑉𝑇𝑌 subject to the constraints 𝑉𝑇𝑉 = 𝐼, det 𝑉 = 1 is provided by 𝑉̂ such that 𝑋𝑇𝑉̂𝑇𝑌 = 𝐼̂. Solving for 𝑉̂, we see 𝑉̂ = 𝑌𝐼̂𝑋𝑇. Thus, the theorem is true in this case.
Case (ii): det 𝑌𝑋𝑇 = −1. In this case, we want to find an orthogonal 𝑍̂ that maximizes tr Σ𝑍 over all orthogonal 𝑍 with det 𝑍 = −1. By (3) of Proposition 6.18, we know that tr Σ𝑍 ≤ (∑_{𝑖=1}^{𝑛−1} 𝜎𝑖) − 𝜎𝑛 for any orthogonal 𝑍 with det 𝑍 = −1, and equality will occur if we take 𝑍̂ to be the diagonal matrix whose entries are all 1, except the lower-right-most entry, which will be −1 = det 𝑌𝑋𝑇. By the definition of 𝐼̂ in the statement of the theorem, this means 𝑍̂ = 𝐼̂. As in the previous case, a maximizer of tr Σ𝑋𝑇𝑉𝑇𝑌 subject to 𝑉𝑇𝑉 = 𝐼, det 𝑉 = 1 is provided by 𝑉̂ such that 𝑋𝑇𝑉̂𝑇𝑌 = 𝐼̂, which means 𝑉̂ = 𝑌𝐼̂𝑋𝑇. □
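
A direct implementation of the Kabsch-Umeyama recipe is short. The sketch below is ours and assumes NumPy (the function name kabsch_umeyama is hypothetical); it computes 𝑉̂ = 𝑌𝐼̂𝑋𝑇 and checks that the result is orthogonal with determinant 1.

```python
import numpy as np

def kabsch_umeyama(A, B):
    """Rotation V_hat = Y I_hat X^T minimizing ||A V - B||_F over det(V) = 1."""
    Y, _, XT = np.linalg.svd(A.T @ B)          # A^T B = Y Sigma X^T
    d = np.sign(np.linalg.det(Y @ XT))         # det(Y X^T) = +1 or -1
    I_hat = np.diag(np.r_[np.ones(A.shape[1] - 1), d])
    return Y @ I_hat @ XT                      # V_hat = Y I_hat X^T

rng = np.random.default_rng(6)
A = rng.standard_normal((20, 3))
B = rng.standard_normal((20, 3))

V_hat = kabsch_umeyama(A, B)
print(np.allclose(V_hat.T @ V_hat, np.eye(3)), np.isclose(np.linalg.det(V_hat), 1.0))
```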

6.6. Summary
We hope that by this stage, the reader has been convinced of the util-
ity of analytic ideas in linear algebra, as well as the importance of the
Singular Value Decomposition. There are many different directions that
an interested reader can go from here. As we saw in this chapter, in-
teresting applications are often optimization problems, where we seek
a particular type of matrix to make some quantity as small as possible.
Thus, one direction is the book [1] that investigates general matrix op-
timization. (The reader should be forewarned: there is a jump from the
level here to the level in that text.) Another direction (related to the
“best” 𝑘-dimensional subspace problem) is: given a collection of points


in some ℝ𝑚 , what is the minimal volume ellipsoid that contains these


points? The book [39] investigates this problem and provides algorithms
for solving it. We mention again [26], which has a wealth of examples
of applications. Another direction is to look at the infinite-dimensional
setting, and the following chapter gives a glimpse in that direction.
