Chapter 6

Applications Revisited

6.1. The “Best” Subspace for Given Data
[Figure: a point 𝑥, its orthogonal projection 𝑃𝒰 𝑥 onto the subspace 𝒰, and the residual 𝑥 − 𝑃𝒰 𝑥.]
since ⟨𝑎𝑗 , 𝑢⟩ = ⟨𝑃𝒰 𝑎𝑗 , 𝑢⟩ for any 𝑢 ∈ 𝒰. (In fact, that is a defining characteristic of the projection onto 𝒰; see Proposition 3.23.) Thus,
$$\begin{aligned}
\sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 &= \sum_{j=1}^{m} \left( \|a_j\|^2 - \|P_{\mathcal{U}} a_j\|^2 \right) \\
&= \left( \sum_{j=1}^{m} \|a_j\|^2 \right) - \left( \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \right) \\
&= \|A\|_F^2 - \left( \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \right),
\end{aligned}$$
where 𝐴 is the matrix whose rows are 𝑎1𝑇 , 𝑎2𝑇 , . . . , 𝑎𝑚𝑇 . Therefore, to minimize
$$\sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2 = \|A\|_F^2 - \left( \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 \right),$$
we need to make the second term as large as possible! (That is, we want to subtract as much as we possibly can.) Thus, it turns out that (6.1) is equivalent to finding a subspace 𝒰̂ with dim 𝒰̂ = 𝑘 such that
$$(6.2) \qquad \sum_{j=1}^{m} \|P_{\hat{\mathcal{U}}} a_j\|^2 = \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2.$$
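For a concrete check of (6.2), here is a minimal numerical sketch (not from the book; the helper name projection_energy is mine): the span of the top 𝑘 right singular vectors attains projection energy 𝜎1²(𝐴) + ⋯ + 𝜎𝑘²(𝐴), and randomly drawn 𝑘-dimensional subspaces do no better.

```python
import numpy as np

# Sketch: verify (6.2) numerically. The rows of A are the data points a_j.
rng = np.random.default_rng(0)
m, n, k = 50, 8, 3
A = rng.standard_normal((m, n))

_, sigma, Vt = np.linalg.svd(A, full_matrices=False)
W_k = Vt[:k].T                # orthonormal basis for the span of x_1, ..., x_k

def projection_energy(A, U):
    """sum_j ||P_U a_j||^2, where the columns of U are an orthonormal basis."""
    return np.sum((A @ U) ** 2)

best = projection_energy(A, W_k)
assert np.isclose(best, np.sum(sigma[:k] ** 2))   # sigma_1^2 + ... + sigma_k^2

for _ in range(1000):                             # random k-dim competitors
    Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
    assert projection_energy(A, Q) <= best + 1e-9
```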
$$\sum_{j=1}^{m} \|P_{\mathcal{W}_k} a_j\|^2 = \sup_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2,$$
or equivalently
$$\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_k} a_j\|^2 = \inf_{\dim \mathcal{U} = k} \sum_{j=1}^{m} \|a_j - P_{\mathcal{U}} a_j\|^2.$$
$$\begin{aligned}
&= \sup_{u \neq 0_{\mathbb{R}^n}} \frac{\|Au\|^2}{\|u\|^2} \\
&= \left( \sup_{u \neq 0_{\mathbb{R}^n}} \frac{\|Au\|}{\|u\|} \right)^2 \\
&= \sigma_1(A)^2.
\end{aligned}$$
(To go from the first line to the second line, we have used the fact that the 𝑗th entry in 𝐴𝑢 is the dot product of 𝑎𝑗 and 𝑢.) Moreover, a maximizer above is provided by 𝑥1 , as is shown by Theorem 5.19. Therefore, we see that a maximizing subspace is given by 𝒲1 = span{𝑥1 }. Next, proceeding inductively, suppose we have shown that 𝒲𝑘 maximizes ∑_{𝑗=1}^{𝑚} ‖𝑃𝒰 𝑎𝑗 ‖² over all 𝑘-dimensional subspaces 𝒰 of ℝ𝑛 ,
and that
$$\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_k} a_j\|^2 = \sigma_{k+1}^2(A) + \cdots + \sigma_p^2(A).$$
We now show that 𝒲𝑘+1 will maximize ∑_{𝑗=1}^{𝑚} ‖𝑃𝒰 𝑎𝑗 ‖² over all (𝑘 + 1)-dimensional subspaces 𝒰 of ℝ𝑛 , and that
$$\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_{k+1}} a_j\|^2 = \sigma_{k+2}^2(A) + \cdots + \sigma_p^2(A).$$
$$(6.3) \qquad \|A y_{k+1}\| \le \sup_{u \in \mathcal{W}_k^{\perp} \setminus \{0_{\mathbb{R}^n}\}} \frac{\|Au\|}{\|u\|} = \|A x_{k+1}\| = \sigma_{k+1}(A).$$
Combining (6.3) and (6.4) and using the result of exercise (3) above, we see
$$\sum_{j=1}^{m} \|P_{\mathcal{U}} a_j\|^2 = \sum_{i=1}^{k+1} \|A y_i\|^2 \le \sum_{i=1}^{k+1} \|A x_i\|^2 = \sum_{j=1}^{m} \|P_{\mathcal{W}_{k+1}} a_j\|^2,$$
and so we have
$$\sum_{j=1}^{m} \|a_j - P_{\mathcal{W}_{k+1}} a_j\|^2 = \|A\|_F^2 - \sum_{i=1}^{k+1} \sigma_i^2(A) = \sigma_{k+2}^2(A) + \cdots + \sigma_p^2(A). \qquad \square$$
It’s important to make sure that the points are normalized so that
their “center of mass” is at the origin.
Example 6.3. Consider the points [−1 1]𝑇 , [0 1]𝑇 , and [1 1]𝑇 . These are
clearly all on the line 𝑦 = 1 in the 𝑥𝑦 plane. Let
$$A = \begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}.$$
We have
$$A^T A = \begin{bmatrix} -1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} -1 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}.$$
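Here is a quick numerical check of this example (a sketch, not from the book): since 𝐴𝑇𝐴 = diag(2, 3), the singular values are √3 ≥ √2, and the top right singular vector is ±[0 1]𝑇. Without centering, the best one-dimensional subspace through the origin is therefore the 𝑦-axis, even though the points all lie on the horizontal line 𝑦 = 1.

```python
import numpy as np

# Sketch: the SVD of the (uncentered) data matrix from Example 6.3.
A = np.array([[-1.0, 1.0],
              [ 0.0, 1.0],
              [ 1.0, 1.0]])

print(A.T @ A)                 # [[2, 0], [0, 3]]
_, sigma, Vt = np.linalg.svd(A)
print(sigma)                   # [sqrt(3), sqrt(2)]
print(Vt[0])                   # +/- [0, 1]: the y-axis, not the line y = 1
```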
𝑎̄ ≔ (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑎𝑖 . (In our example, we have 𝑎1 = [−1 1]𝑇 , 𝑎2 = [0 1]𝑇 , and 𝑎3 = [1 1]𝑇 . The center of mass is then 𝑎̄ = (1/3)[0 3]𝑇 = [0 1]𝑇 .) We then consider the translated collection {𝑎1 − 𝑎̄, 𝑎2 − 𝑎̄, . . . , 𝑎𝑛 − 𝑎̄}, and determine the closest subspace to this translated collection. To undo the translation, we simply add 𝑎̄ to the best subspace, creating an “affine subspace.”
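As a small sketch of this recipe (the identifiers are mine, and the data is that of Example 6.3): center, fit, and translate back.

```python
import numpy as np

# Sketch: best affine fit = center of mass + best subspace of centered data.
A = np.array([[-1.0, 1.0],
              [ 0.0, 1.0],
              [ 1.0, 1.0]])

a_bar = A.mean(axis=0)                 # center of mass: [0, 1]
_, _, Vt = np.linalg.svd(A - a_bar)    # SVD of the translated collection
direction = Vt[0]                      # +/- [1, 0]
# The best affine subspace is { a_bar + t * direction : t in R },
# which is exactly the line y = 1 containing the original points.
```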
By calculating and graphing the singular values, we can get an idea as to whether or not the data set is (approximately) low-dimensional. One way of doing this is to look at the relative sizes of successive singular values. A word of caution: how large a drop between successive singular values signals that we can neglect the higher dimensions depends on the problem! For some problems, a drop by a factor of 10 may be a good enough sign. For others, perhaps a factor of 100 may be necessary.
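A minimal sketch of this diagnostic (the threshold and function name are illustrative assumptions, not from the book):

```python
import numpy as np

# Sketch: flag indices where successive singular values drop sharply.
def candidate_dimensions(points, drop_factor=10.0):
    X = points - points.mean(axis=0)            # center the data first
    sigma = np.linalg.svd(X, compute_uv=False)
    ratios = sigma[1:] / np.maximum(sigma[:-1], 1e-300)  # guard against 0
    # dimension i is a candidate when sigma_{i+1} << sigma_i
    return sigma, [i + 1 for i, r in enumerate(ratios) if r < 1.0 / drop_factor]
```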
6.2. Least Squares and Moore-Penrose Pseudo-Inverse
Exercise 6.5. Explain why formula (6.5) above implies 𝐿† and 𝐿−1 are
the same when 𝐿 is invertible. (Think about what the singular triples of
𝐿−1 are when 𝐿−1 exists.)
$$= \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix} \frac{1}{\sigma_j} \mathrm{e}_{j,r} = \frac{x_j}{\sigma_j},$$
while for 𝑗 = 𝑟 + 1, . . . , 𝑚, we will have 𝑋𝑟 (Σ̃)⁻¹ 𝑌𝑟𝑇 𝑦𝑗 = 0. Therefore, we see that 𝑋𝑟 (Σ̃)⁻¹ 𝑌𝑟𝑇 𝑦𝑗 agrees with formula (6.5) applied to 𝑦𝑗 for each 𝑗 = 1, 2, . . . , 𝑚, and so we have 𝐴† = 𝑋𝑟 (Σ̃)⁻¹ 𝑌𝑟𝑇 :
$$A^{\dagger} = \begin{bmatrix} x_1 & x_2 & \cdots & x_r \end{bmatrix}
\begin{bmatrix} \frac{1}{\sigma_1} & & & \\ & \frac{1}{\sigma_2} & & \\ & & \ddots & \\ & & & \frac{1}{\sigma_r} \end{bmatrix}
\begin{bmatrix} y_1^T \\ y_2^T \\ \vdots \\ y_r^T \end{bmatrix}.$$
Exercise 6.6. Use the formula above to give (yet) another explanation
that 𝐴† and 𝐴−1 are the same when 𝐴−1 exists.
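Here is a minimal numpy sketch of the formula (variable names are mine): build 𝐴† from the reduced SVD and compare against numpy's built-in pseudo-inverse.

```python
import numpy as np

# Sketch: A-dagger = X_r diag(1/sigma_1, ..., 1/sigma_r) Y_r^T.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))  # rank 2, 5 x 4

Y, sigma, Xt = np.linalg.svd(A)
r = int(np.sum(sigma > 1e-10 * sigma[0]))        # numerical rank
X_r, Y_r = Xt[:r].T, Y[:, :r]
A_dagger = X_r @ np.diag(1.0 / sigma[:r]) @ Y_r.T

assert np.allclose(A_dagger, np.linalg.pinv(A))
```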
6.3. Eckart-Young-Mirsky for the Operator Norm
We will have
$$(L - M_k)x = \sum_{i=k+1}^{r} \sigma_i \langle x_i, x \rangle_{\mathcal{V}} \, y_i,$$
where we abbreviate 𝑎𝑖 ≔ ⟨𝑥𝑖 , 𝑥⟩𝒱 below.
By the Pythagorean Theorem and the fact that the singular values are
decreasing,
$$\begin{aligned}
\|(L - M_k)x\|_{\mathcal{W}}^2 &= \sum_{i=k+1}^{r} \sigma_i^2 a_i^2 \\
&\le \sigma_{k+1}^2 \sum_{i=k+1}^{r} a_i^2 \\
&\le \sigma_{k+1}^2 \sum_{i=1}^{n} a_i^2 = \sigma_{k+1}^2 \|x\|_{\mathcal{V}}^2,
\end{aligned}$$
Proof. Since the proof requires examining the singular values of differ-
ent operators, we use 𝜎 𝑘 (𝐴) to mean the 𝑘th singular value of 𝐴. In the
preceding paragraphs, we have shown that
inf{‖𝐿 − 𝑀‖𝑜𝑝 ∶ rank 𝑀 = 𝑘} ≤ 𝜎 𝑘+1 (𝐿) = ‖𝐿 − 𝑀𝑘 ‖𝑜𝑝 .
To finish the proof, we must show that ‖𝐿 − 𝑀‖𝑜𝑝 ≥ 𝜎 𝑘+1 (𝐿) for any
𝑀 ∈ ℒ (𝒱, 𝒲) with rank 𝑀 = 𝑘.
Suppose that 𝑀 ∈ ℒ (𝒱, 𝒲) has rank 𝑘, and recall that Theorem
5.2(b) tells us ‖𝐿 − 𝑀‖𝑜𝑝 = 𝜎1 (𝐿 − 𝑀). We now use Weyl’s inequality:
𝜎 𝑘+𝑗−1 (𝐿1 + 𝐿2 ) ≤ 𝜎 𝑘 (𝐿1 ) + 𝜎𝑗 (𝐿2 ) for any 𝐿1 , 𝐿2 ∈ ℒ (𝒱, 𝒲) and for
any indices 𝑘, 𝑗, 𝑘 + 𝑗 − 1 between 1 and min{𝑛, 𝑚}. Replacing 𝐿1 with
𝐿 − 𝑀 and 𝐿2 with 𝑀, and taking 𝑘 = 1 and 𝑗 = 𝑘 + 1, we will have
$$\sigma_{k+1}(L) = \sigma_{1+(k+1)-1}(L) \le \sigma_1(L - M) + \sigma_{k+1}(M) = \sigma_1(L - M) = \|L - M\|_{op},$$
since rank 𝑀 = 𝑘 implies that 𝜎 𝑘+1 (𝑀) = 0. □
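A numerical illustration of the theorem (a sketch, not the book's code): the rank-𝑘 SVD truncation 𝑀𝑘 achieves operator-norm error exactly 𝜎𝑘+1, and random rank-𝑘 competitors never do better.

```python
import numpy as np

# Sketch: Eckart-Young-Mirsky in the operator norm.
rng = np.random.default_rng(2)
m, n, k = 6, 5, 2
L = rng.standard_normal((m, n))
Y, sigma, Xt = np.linalg.svd(L)

M_k = Y[:, :k] @ np.diag(sigma[:k]) @ Xt[:k]               # rank-k truncation
assert np.isclose(np.linalg.norm(L - M_k, 2), sigma[k])    # error = sigma_{k+1}

for _ in range(200):
    M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # rank <= k
    assert np.linalg.norm(L - M, 2) >= sigma[k] - 1e-9
```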
Corollary 6.9. Let 𝐴 be an 𝑚 × 𝑛 matrix, and let 𝑝 = min{𝑚, 𝑛}. Let {𝑥1 , 𝑥2 , . . . , 𝑥𝑛 } and {𝑦1 , 𝑦2 , . . . , 𝑦𝑚 } be the orthonormal bases of ℝ𝑛 and ℝ𝑚 , and suppose 𝜎1 ≥ 𝜎2 ≥ ⋯ ≥ 𝜎𝑝 ≥ 0 are the singular values provided by the Singular Value Decomposition of 𝐴. Assume 𝑘 ≤ rank 𝐴, and let
$$B_k = \sum_{i=1}^{k} \sigma_i(A) \, y_i x_i^T.$$
6.4. Eckart-Young-Mirsky for Frobenius Norm
$$\|A\|_F \coloneqq \Bigl( \sum_{i,j} A_{ij}^2 \Bigr)^{1/2},$$
so ‖𝐴‖𝐹² is the sum of the squares of all of the entries of 𝐴. What is the closest rank 𝑘 matrix to 𝐴, as measured in the Frobenius norm? Our problem is: find 𝐵̃ with rank 𝑘 such that
$$\|A - \tilde{B}\|_F = \inf \{ \|A - B\|_F : \operatorname{rank} B = k \}.$$
As in the case of the operator norm, we can get an upper bound on
the infimum by considering a particular 𝐵. We use the same particu-
lar matrix as in the operator norm: suppose 𝐴 = 𝑌 Σ𝑋 𝑇 is the Singular
Value Decomposition of 𝐴. Σ is an 𝑚 × 𝑛 matrix, whose diagonal entries
are Σ𝑖𝑖 = 𝜎 𝑖 (𝐴). Let Σ𝑘 be the 𝑚 × 𝑛 matrix that has the same first 𝑘
diagonal entries as Σ, and the remaining entries are all zeros. Now, let
𝐵 𝑘 = 𝑌 Σ𝑘 𝑋 𝑇 .
Exercise 6.11. Show that
$$\|A - B_k\|_F^2 = \sigma_{k+1}^2(A) + \sigma_{k+2}^2(A) + \cdots + \sigma_r^2(A).$$
(Recall the extraordinarily useful fact that the Frobenius norm is invariant under orthogonal transformations, which means that if 𝑊 is any appropriately sized square matrix such that 𝑊𝑇𝑊 = 𝐼, then we have ‖𝐴𝑊𝑇‖𝐹 = ‖𝐴‖𝐹 and ‖𝑊𝐴‖𝐹 = ‖𝐴‖𝐹 ; and the consequence that the Frobenius norm is the square root of the sum of the squares of the singular values.)
(Why?) A harder question: can we do any better? The answer turns out
to be no!
Theorem 6.12 (Eckart-Young-Mirsky Theorem, Frobenius Norm). Suppose 𝐴 is an 𝑚 × 𝑛 matrix with rank 𝐴 = 𝑟. For any 𝑘 = 1, 2, . . . , 𝑟, if 𝐵𝑘 = 𝑌 Σ𝑘 𝑋𝑇 (where 𝐴 = 𝑌 Σ𝑋𝑇 is the Singular Value Decomposition of 𝐴), then we have
$$\|A - B_k\|_F = \Bigl( \sum_{j=k+1}^{r} \sigma_j^2(A) \Bigr)^{1/2} = \inf \{ \|A - B\|_F : \operatorname{rank} B = k \}.$$
In other words, 𝐵𝑘 is the closest (as measured by the Frobenius norm) rank 𝑘 matrix to 𝐴.

Proof. By Exercise 6.11,
$$\Bigl( \sum_{j=k+1}^{r} \sigma_j^2(A) \Bigr)^{1/2} = \|A - B_k\|_F .$$
To finish the proof, we show that ‖𝐴 − 𝐵‖𝐹² ≥ ∑_{𝑗=𝑘+1}^{𝑟} 𝜎𝑗²(𝐴) for any 𝐵 with rank 𝑘.
Let 𝐵 be an arbitrary matrix with rank 𝑘, and let 𝑝 = min{𝑛, 𝑚}; we have
$$\|A - B\|_F^2 = \sum_{j=1}^{p} \sigma_j^2(A - B).$$
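The theorem can also be checked numerically with a sketch like the following (not the book's code):

```python
import numpy as np

# Sketch: Eckart-Young-Mirsky in the Frobenius norm.
rng = np.random.default_rng(3)
m, n, k = 7, 5, 2
A = rng.standard_normal((m, n))
Y, sigma, Xt = np.linalg.svd(A)

B_k = Y[:, :k] @ np.diag(sigma[:k]) @ Xt[:k]
err = np.linalg.norm(A - B_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(sigma[k:] ** 2)))  # tail of sing. values

for _ in range(200):
    B = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # rank <= k
    assert np.linalg.norm(A - B, "fro") >= err - 1e-9
```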
6.5. The Orthogonal Procrustes Problem
which means that the columns of 𝑈𝐴𝑇 represent a configuration that has
the same lengths and angles as the original configuration represented by
𝐴𝑇 . Next, for the distance to the reference configuration 𝐵, we calculate
the Frobenius norm squared of 𝐴𝑈 𝑇 − 𝐵, i.e. the sum of the squared
distances between the corresponding rows of 𝐴𝑈 𝑇 and 𝐵. Therefore,
finding the closest configuration to the given reference 𝐵 means solving
the following:
minimize ‖𝐴𝑈 𝑇 − 𝐵‖2𝐹 over all 𝑈 such that 𝑈 𝑇 𝑈 = 𝐼.
Proof. Notice that the Frobenius norm is the norm associated with the
Frobenius inner product ⟨𝐴, 𝐵⟩𝐹 = tr 𝐴𝑇 𝐵. Therefore, we will have
‖𝐴𝑉 − 𝐵‖2𝐹 = ⟨𝐴𝑉 − 𝐵, 𝐴𝑉 − 𝐵⟩𝐹
= ⟨𝐴𝑉, 𝐴𝑉⟩𝐹 − 2 ⟨𝐴𝑉, 𝐵⟩𝐹 + ⟨𝐵, 𝐵⟩𝐹
= tr (𝐴𝑉)𝑇 𝐴𝑉 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= tr 𝑉 𝑇 𝐴𝑇 𝐴𝑉 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= tr 𝐴𝑇 𝐴𝑉𝑉 𝑇 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= tr 𝐴𝑇 𝐴 − 2tr (𝐴𝑉)𝑇 𝐵 + ‖𝐵‖2𝐹
= ‖𝐴‖2𝐹 + ‖𝐵‖2𝐹 − 2tr 𝑉 𝑇 𝐴𝑇 𝐵.
Since ‖𝐴‖2𝐹 and ‖𝐵‖2𝐹 are fixed, minimizing ‖𝐴𝑉 − 𝐵‖2𝐹 over all 𝑉 with
𝑉 𝑇 𝑉 = 𝐼 is equivalent to maximizing tr 𝑉 𝑇 𝐴𝑇 𝐵 over all 𝑉 with 𝑉 𝑇 𝑉 = 𝐼.
Notice that since 𝐴 and 𝐵 are 𝑚 × 𝑛 matrices, 𝐴𝑇 𝐵 is an 𝑛 × 𝑛 matrix.
Thus, 𝐴𝑇 𝐵 = 𝑌 Σ𝑋 𝑇 means that the matrices 𝑌 , Σ, and 𝑋 are all 𝑛 × 𝑛
matrices. Moreover, since the columns of 𝑌 and 𝑋 are orthonormal, the
matrices 𝑌 and 𝑋 are orthogonal: 𝑌 𝑇 𝑌 = 𝐼 and 𝑋 𝑇 𝑋 = 𝐼. Therefore,
$$\begin{aligned}
\sup_{V^T V = I} \operatorname{tr} V^T A^T B &= \sup_{V^T V = I} \operatorname{tr} V^T Y \Sigma X^T \\
&= \sup_{V^T V = I} \operatorname{tr} \Sigma X^T V^T Y \\
&= \sup_{Z^T Z = I} \operatorname{tr} \Sigma Z = \sup_{Z^T Z = I} \sum_{i=1}^{n} \sigma_i z_{ii},
\end{aligned}$$
where 𝜎𝑖 are the singular values of 𝐴𝑇𝐵. Thus, a minimizer for ‖𝐴𝑉 − 𝐵‖𝐹² is provided by 𝑉̂ = 𝑌 𝑋𝑇 , for which 𝑍 = 𝑋𝑇𝑉̂𝑇𝑌 = 𝐼 and tr Σ𝑍 = 𝜎1 + ⋯ + 𝜎𝑛 .
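In code, the recipe reads as follows (a sketch with my own identifiers, using numpy): compute the SVD 𝐴𝑇𝐵 = 𝑌 Σ𝑋𝑇 and return 𝑉 = 𝑌 𝑋𝑇.

```python
import numpy as np

# Sketch: unconstrained orthogonal Procrustes, V = Y X^T.
def procrustes(A, B):
    """Orthogonal V (V^T V = I) minimizing ||A V - B||_F."""
    Y, _, Xt = np.linalg.svd(A.T @ B)
    return Y @ Xt

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 3))
V_true, _ = np.linalg.qr(rng.standard_normal((3, 3)))
B = A @ V_true                      # B is an exact orthogonal image of A
assert np.allclose(procrustes(A, B), V_true)
```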
$$B^T W B = B^T W \begin{bmatrix} b_1 & b_2 & \cdots & b_{n-1} & b_n \end{bmatrix}
= \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_{n-1}^T \\ b_n^T \end{bmatrix}
\begin{bmatrix} W b_1 & W b_2 & \cdots & W b_{n-1} & -b_n \end{bmatrix}.$$
Therefore, the entries of 𝐵𝑇𝑊𝐵 are given by the products 𝑏𝑖𝑇𝑊𝑏𝑗 . We now claim that the last column of 𝐵𝑇𝑊𝐵 is −e𝑛 , and the last row of 𝐵𝑇𝑊𝐵 is −e𝑛𝑇 . The last column will have entries 𝑏𝑖𝑇(−𝑏𝑛 ) = −𝑏𝑖 ⋅ 𝑏𝑛 , which is 0 for 𝑖 ≠ 𝑛 and −1 when 𝑖 = 𝑛. In other words, the last column is −e𝑛 . Similarly, the last row will have entries
$$b_n^T W b_j = b_n \cdot W b_j = W^T b_n \cdot b_j = -b_n \cdot b_j,$$
which again is either 0 (if 𝑗 ≠ 𝑛) or −1 (if 𝑗 = 𝑛). Thus, the last row is −e𝑛𝑇 . This means that we can write 𝐵𝑇𝑊𝐵 in block form as
$$(6.8) \qquad B^T W B = \begin{bmatrix} B_0 & 0_{\mathbb{R}^{n-1}} \\ 0_{\mathbb{R}^{n-1}}^T & -1 \end{bmatrix},$$
where 𝐵0 is an (𝑛 − 1) × (𝑛 − 1) matrix. Moreover, since 𝐵 and 𝑊 are orthogonal matrices, so too is 𝐵𝑇𝑊𝐵, which implies that the columns of 𝐵𝑇𝑊𝐵 form an orthonormal basis of ℝ𝑛 . Since the first 𝑛 − 1 entries in the last row of 𝐵𝑇𝑊𝐵 are all zeros, the columns of 𝐵0 are orthonormal
$$B^T \Sigma B = B^T \Sigma \begin{bmatrix} b_1 & b_2 & \cdots & b_{n-1} & b_n \end{bmatrix}
= \begin{bmatrix} b_1^T \\ b_2^T \\ \vdots \\ b_{n-1}^T \\ b_n^T \end{bmatrix}
\begin{bmatrix} \Sigma b_1 & \Sigma b_2 & \cdots & \Sigma b_{n-1} & \Sigma b_n \end{bmatrix}.$$
Thus, the 𝑖𝑗th entry of 𝐵𝑇Σ𝐵 is given by 𝑏𝑖𝑇Σ𝑏𝑗 , which is the dot product of 𝑏𝑖 and Σ𝑏𝑗 . Therefore, (6.9) tells us 𝛾 = 𝑏𝑛𝑇Σ𝑏𝑛 and
$$\operatorname{tr} A_0 = \sum_{\ell=1}^{n-1} \bigl( b_\ell^T \Sigma b_\ell \bigr) = \sum_{\ell=1}^{n} \bigl( b_\ell^T \Sigma b_\ell \bigr) - b_n^T \Sigma b_n .$$
Now, if 𝑏ℓ = [𝑏1ℓ 𝑏2ℓ ⋯ 𝑏𝑛−1,ℓ 𝑏𝑛ℓ ]𝑇 , then since Σ is a diagonal matrix with diagonal entries 𝜎1 , . . . , 𝜎𝑛 , we will have
$$\Sigma b_\ell = \begin{bmatrix} \sigma_1 b_{1\ell} \\ \sigma_2 b_{2\ell} \\ \vdots \\ \sigma_{n-1} b_{n-1,\ell} \\ \sigma_n b_{n\ell} \end{bmatrix}.$$
In particular, we will have 𝑏ℓ𝑇Σ𝑏ℓ = ∑_{𝑗=1}^{𝑛} 𝜎𝑗 𝑏𝑗ℓ² , and so
$$\begin{aligned}
\operatorname{tr} A_0 &= \sum_{\ell=1}^{n} \Bigl( \sum_{j=1}^{n} \sigma_j b_{j\ell}^2 \Bigr) - b_n^T \Sigma b_n \\
&= \sum_{j=1}^{n} \sigma_j \Bigl( \sum_{\ell=1}^{n} b_{j\ell}^2 \Bigr) - b_n^T \Sigma b_n .
\end{aligned}$$
Now, the sum ∑_{ℓ=1}^{𝑛} 𝑏𝑗ℓ² is the sum of the squares of the entries in row 𝑗 of the matrix 𝐵. Since 𝐵 is orthogonal, so too is 𝐵𝑇 . In particular, that means that the (transposes of the) rows of 𝐵 form an orthonormal basis of ℝ𝑛 . Thus each row of 𝐵 must have Euclidean norm equal to 1, i.e. ∑_{ℓ=1}^{𝑛} 𝑏𝑗ℓ² = 1 for each 𝑗. Therefore, we have
$$(6.13) \qquad \operatorname{tr} A_0 = \sum_{j=1}^{n} \sigma_j - b_n^T \Sigma b_n = \operatorname{tr} \Sigma - b_n^T \Sigma b_n = \operatorname{tr} \Sigma - \gamma .$$
$$= \sup_{\substack{V^T V = I \\ \det V = 1}} \operatorname{tr} \Sigma X^T V^T Y .$$
We consider now two cases: (i) det 𝑌𝑋𝑇 = 1, and (ii) det 𝑌𝑋𝑇 = −1.

Case (i): det 𝑌𝑋𝑇 = 1. We claim that in this case 𝑍̂ = 𝐼̂ maximizes tr Σ𝑍 over all orthogonal 𝑍 with det 𝑍 = 1. By (1) of Proposition 6.18, we know that tr Σ𝑍 ≤ ∑_{𝑖=1}^{𝑛} 𝜎𝑖 for any orthogonal 𝑍. Moreover, equality is attained for 𝑍̂ ≔ 𝐼. By the definition of 𝐼̂ in the statement of the theorem, 𝐼̂ = 𝐼 in this case. Thus, 𝑍̂ = 𝐼̂ is the maximizer. Since we replaced 𝑋𝑇𝑉𝑇𝑌 with 𝑍, a maximizer of tr Σ𝑋𝑇𝑉𝑇𝑌 subject to the constraints 𝑉𝑇𝑉 = 𝐼, det 𝑉 = 1 is provided by 𝑉̂ such that 𝑋𝑇𝑉̂𝑇𝑌 = 𝐼̂. Solving for 𝑉̂, we see 𝑉̂ = 𝑌𝐼̂𝑋𝑇 . Thus, the theorem is true in this case.

Case (ii): det 𝑌𝑋𝑇 = −1. In this case, we want to find an orthogonal 𝑍̂ that maximizes tr Σ𝑍 over all orthogonal 𝑍 with det 𝑍 = −1. By (3) of Proposition 6.18, we know that tr Σ𝑍 ≤ (∑_{𝑖=1}^{𝑛−1} 𝜎𝑖 ) − 𝜎𝑛 for any orthogonal 𝑍 with det 𝑍 = −1, and equality will occur if we take 𝑍̂ to be the diagonal matrix whose entries are all 1, except the lower-right-most entry, which will be −1 = det 𝑌𝑋𝑇 . By the definition of 𝐼̂ in the statement of the theorem, this means 𝑍̂ = 𝐼̂. As in the previous case, a maximizer of tr Σ𝑋𝑇𝑉𝑇𝑌 subject to 𝑉𝑇𝑉 = 𝐼, det 𝑉 = 1 is provided by 𝑉̂ such that 𝑋𝑇𝑉̂𝑇𝑌 = 𝐼̂, which means 𝑉̂ = 𝑌𝐼̂𝑋𝑇 . □
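The two cases combine into one small routine (a sketch with my own names, not the book's code): flip the last diagonal entry of 𝐼 exactly when det 𝑌𝑋𝑇 = −1, so that the returned matrix always has determinant +1.

```python
import numpy as np

# Sketch: rotation-only Procrustes, V = Y I_hat X^T with det V = +1.
def procrustes_rotation(A, B):
    """Orthogonal V with det V = +1 minimizing ||A V - B||_F."""
    Y, _, Xt = np.linalg.svd(A.T @ B)
    I_hat = np.eye(A.shape[1])
    I_hat[-1, -1] = np.sign(np.linalg.det(Y @ Xt))   # +/- 1 = det(Y X^T)
    return Y @ I_hat @ Xt

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3))
B = rng.standard_normal((8, 3))
V = procrustes_rotation(A, B)
assert np.isclose(np.linalg.det(V), 1.0)
assert np.allclose(V.T @ V, np.eye(3))
```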
6.6. Summary
We hope that by this stage, the reader has been convinced of the utility of analytic ideas in linear algebra, as well as the importance of the Singular Value Decomposition. There are many different directions that an interested reader can go from here. As we saw in this chapter, interesting applications are often optimization problems, where we seek a particular type of matrix to make some quantity as small as possible. Thus, one direction is the book [1] that investigates general matrix optimization. (The reader should be forewarned: there is a jump from the level here to the level in that text.) Another direction (related to the “best” 𝑘-dimensional subspace problem) is: given a collection of points