Lecture 10: Matrices Review
Isabella Zhu
11 March 2025
§1 Last Lecture Wrapup
We will wrap up the proof from lecture 9.
Theorem 1.1
Assume INC($k$) with $k$ equal to the sparsity of $\theta^*$ (i.e. $k = |\theta^*|_0$). Fix
\[ 2\tau = 8\sigma\sqrt{\log(2d)/n} + 8\sigma\sqrt{\log(1/\delta)/n}. \]
Then the MSE of the lasso estimator is at most
\[ \mathrm{MSE}(X\hat{\theta}^L) \le 32k\tau^2 \lesssim \frac{\sigma^2|\theta^*|_0}{n}\log(d/\delta). \]
Moreover,
\[ |\hat{\theta} - \theta^*|_2^2 \le 2\,\mathrm{MSE}(X\hat{\theta}^L), \]
all happening with probability at least $1 - \delta$.
Proof. For the five hundred millionth time, we start with the good ole basic inequality
\[ |X\hat{\theta} - X\theta^*|_2^2 \le 2\langle \epsilon, X\hat{\theta} - X\theta^* \rangle + 2n\tau|\theta^*|_1 - 2n\tau|\hat{\theta}|_1. \]
We bound the inner product term using Hölder's inequality:
\[ 2\langle \epsilon, X\hat{\theta} - X\theta^* \rangle \le 2|X^T\epsilon|_\infty \cdot |\hat{\theta} - \theta^*|_1. \]
Next we bound the largest column norm of $X$. We have
\[ |X_j|_2^2 = (X^T X)_{jj} \le n + \frac{n}{32k} \le 2n \]
by the incoherence property. Therefore, on the event where $|X^T\epsilon|_\infty \le 2n \cdot \frac{\tau}{4}$ (which holds with probability at least $1 - \delta$ by the choice of $\tau$), we get
\[ 2\langle \epsilon, X\hat{\theta} - X\theta^* \rangle \le 2|X^T\epsilon|_\infty \cdot |\hat{\theta} - \theta^*|_1 \le 2 \cdot 2n \cdot \frac{\tau}{4} \cdot |\hat{\theta} - \theta^*|_1 = n\tau|\hat{\theta} - \theta^*|_1. \]
To summarize, we’ve proved so far that
\[ |X\hat{\theta} - X\theta^*|_2^2 \le n\tau|\hat{\theta} - \theta^*|_1 + 2n\tau|\theta^*|_1 - 2n\tau|\hat{\theta}|_1. \]
We add $n\tau|\hat{\theta} - \theta^*|_1$ to both sides:
\[ |X\hat{\theta} - X\theta^*|_2^2 + n\tau|\hat{\theta} - \theta^*|_1 \le 2n\tau|\hat{\theta} - \theta^*|_1 + 2n\tau|\theta^*|_1 - 2n\tau|\hat{\theta}|_1. \]
Now we take the support $S$ of $\theta^*$ into account. Since $\theta^*_{S^c} = 0$, the off-support terms cancel:
\[ |\hat{\theta}|_1 = |\hat{\theta}_S|_1 + |\hat{\theta}_{S^c}|_1 \implies |\hat{\theta} - \theta^*|_1 - |\hat{\theta}|_1 = |\hat{\theta}_S - \theta^*|_1 - |\hat{\theta}_S|_1. \]
Putting it together, and using the triangle inequality $|\theta^*|_1 - |\hat{\theta}_S|_1 \le |\hat{\theta}_S - \theta^*|_1$ in the last step,
\[ |X\hat{\theta} - X\theta^*|_2^2 + n\tau|\hat{\theta} - \theta^*|_1 \le 2n\tau\Big[ |\hat{\theta}_S - \theta^*|_1 + |\theta^*|_1 - |\hat{\theta}_S|_1 \Big] \le 4n\tau|\hat{\theta}_S - \theta^*|_1. \]
Dropping the nonnegative first term, we have
\[ |\hat{\theta} - \theta^*|_1 \le 4|\hat{\theta}_S - \theta^*|_1 \iff |\hat{\theta}_{S^c} - \theta^*_{S^c}|_1 \le 3|\hat{\theta}_S - \theta^*_S|_1, \]
which is exactly the cone condition! Everything below this is kinda suspicious because I was playing Squardle instead of paying attention. So, combining INC($k$) with the cone condition, we get the lower bound
\[ \frac{2|X(\hat{\theta} - \theta^*)|_2^2}{n} \ge |\hat{\theta} - \theta^*|_2^2. \]
By Cauchy-Schwarz and the lower bound above,
\[ |\hat{\theta}_S - \theta^*|_1 \le \sqrt{k}\,|\hat{\theta}_S - \theta^*|_2 \le \sqrt{k}\,|\hat{\theta} - \theta^*|_2 \le \sqrt{\frac{2k}{n}}\,|X\hat{\theta} - X\theta^*|_2. \]
Therefore, we get
\[ |X\hat{\theta} - X\theta^*|_2^2 \le 4n\tau\sqrt{\frac{2k}{n}}\,|X\hat{\theta} - X\theta^*|_2, \]
from which we divide both sides by $|X\hat{\theta} - X\theta^*|_2$ and square to get $|X\hat{\theta} - X\theta^*|_2^2 \le 32nk\tau^2$, i.e. $\mathrm{MSE}(X\hat{\theta}^L) \le 32k\tau^2$, the desired result.
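As a sanity check on the scaling, here is a minimal simulation sketch (not from lecture) using scikit-learn's Lasso on synthetic data. The problem sizes $n, d, k, \sigma, \delta$ are made up, and setting sklearn's alpha equal to $\tau$ assumes the penalized objective in lecture uses the penalty $2n\tau|\theta|_1$.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Minimal simulation (not from lecture): check that the lasso prediction error
# stays below the 32*k*tau^2 bound of Theorem 1.1 on synthetic data.
rng = np.random.default_rng(0)
n, d, k, sigma, delta = 200, 500, 5, 1.0, 0.1      # made-up problem sizes

X = rng.standard_normal((n, d))
theta_star = np.zeros(d)
theta_star[:k] = 1.0                                # k-sparse truth
y = X @ theta_star + sigma * rng.standard_normal(n)

# tau from the theorem: 2*tau = 8*sigma*(sqrt(log(2d)/n) + sqrt(log(1/delta)/n))
tau = 4 * sigma * (np.sqrt(np.log(2 * d) / n) + np.sqrt(np.log(1 / delta) / n))

# sklearn minimizes (1/2n)|y - Xw|_2^2 + alpha*|w|_1, i.e. |y - Xw|_2^2 + 2n*alpha*|w|_1,
# so alpha = tau matches a penalty of 2*n*tau*|theta|_1 (an assumption about the
# scaling used in lecture).
theta_hat = Lasso(alpha=tau, fit_intercept=False).fit(X, y).coef_

mse = np.mean((X @ (theta_hat - theta_star)) ** 2)
print(f"MSE(X theta_hat) = {mse:.3f}, bound 32*k*tau^2 = {32 * k * tau**2:.3f}")
```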
§2 Matrix Estimation
We will go over some linear algebra "basics" that we will need for later lectures. Apparently this lecture will be "boring to death" (not my words).
§2.1 SubGaussian Sequence Model
Our subGaussian sequence model is of the form $Y = \theta^* + \epsilon \in \mathbb{R}^d$. We can make this a matrix problem by just reshaping each vector into a matrix. If $\theta^*$ is sparse, then we can just use the hard-thresholding estimator $\hat{\theta}^{\mathrm{HARD}}$, so we aren't actually utilizing any matrix structure.
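For concreteness, here is a tiny sketch (my own, with one standard but assumed threshold choice) of what hard thresholding looks like on a reshaped observation:

```python
import numpy as np

# Tiny sketch (not from lecture): hard thresholding in the sequence model
# Y = theta* + eps, with the d = 16 vector reshaped into a 4 x 4 matrix.
rng = np.random.default_rng(0)
sigma, d = 0.5, 16

theta_star = np.zeros(d)
theta_star[:3] = [3.0, -2.0, 1.5]                 # sparse signal
Y = theta_star + sigma * rng.standard_normal(d)

tau = 2 * sigma * np.sqrt(2 * np.log(d))          # one standard threshold choice (assumption)
theta_hard = np.where(np.abs(Y) > tau, Y, 0.0)    # zero out small entries

# Reshaping changes nothing for an entrywise estimator like this one.
print(theta_hard.reshape(4, 4))
```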
§2.2 An Aside: Netflix Prize 2006
Aka how Netflix got half the academic community to work for them for free. The problem is the following: consider a matrix $M$ with $n$ users and $m$ movies, such that $M_{ij}$ is how the $i$-th user rated the $j$-th movie.
Clearly, the matrix is very sparsely observed: in fact, only 1% of the entries were filled. The goal was to fill in the rest of the matrix.
§2.2.1 A Simple Model
Consider the case where $M_{ij}$ only has two effects, a user effect and a movie effect. So
\[ M_{ij} = u_i \cdot v_j + \text{noise}, \qquad \text{i.e.} \qquad M = uv^T + \text{noise}. \]
For this simple model, we reduce the number of parameters from $nm$ to $n + m$. The rank of $uv^T$ is 1. More generally, if the rank of $M$ is $r$, we can write
\[ M = \sum_{j=1}^r u^{(j)} (v^{(j)})^T. \]
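Here is a quick NumPy sketch (synthetic data with made-up sizes, not the actual Netflix matrix) of the rank-1 model: generate $M = uv^T + \text{noise}$ and recover the user/movie effects from the top singular pair.

```python
import numpy as np

# Synthetic rank-1 ratings model: M = u v^T + noise, recovered via the top
# singular pair of M.
rng = np.random.default_rng(1)
n, m = 100, 80                                    # users, movies

u = rng.standard_normal(n)                        # user effects
v = rng.standard_normal(m)                        # movie effects
M = np.outer(u, v) + 0.1 * rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M1 = s[0] * np.outer(U[:, 0], Vt[0, :])           # best rank-1 approximation

rel_err = np.linalg.norm(M1 - np.outer(u, v)) / np.linalg.norm(np.outer(u, v))
print("relative error of the rank-1 fit:", rel_err)
```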
§3 Matrix Redux
§3.1 Eigenvalues and Eigenvectors
Let $A \in \mathbb{R}^{n \times n}$ be a square matrix. An eigenvalue $\lambda$ and eigenvector $u$ of $A$ satisfy $Au = \lambda u$.
Fact 3.1. If A is symmetric, then all eigenvalues are real.
In this class, we will assume that all eigenvectors have norm 1.
Fact 3.2. If $u_1, \dots, u_n$ are eigenvectors of a symmetric matrix $A$, they can be chosen to form an orthogonal basis for the column span of $A$. We will call this the eigenbasis.
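A quick numerical illustration of Facts 3.1 and 3.2 (my own check, on a random symmetric matrix):

```python
import numpy as np

# Check Facts 3.1-3.2 on a random symmetric matrix: the eigenvalues are real and
# the unit-norm eigenvectors returned by eigh form an orthonormal (eigen)basis.
rng = np.random.default_rng(2)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                                  # symmetrize

eigvals, eigvecs = np.linalg.eigh(A)               # eigh is for symmetric matrices
print("eigenvalues:", eigvals)                     # all real
print("orthonormal:", np.allclose(eigvecs.T @ eigvecs, np.eye(5)))
print("A u = lambda u:", np.allclose(A @ eigvecs, eigvecs * eigvals))
```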
§3.2 Singular Value Decomposition
Let $A \in \mathbb{R}^{m \times n}$. The SVD of $A$ is $A$ written as
\[ A = UDV^T, \qquad U \in \mathbb{R}^{m \times r}, \; V \in \mathbb{R}^{n \times r}, \; D \in \mathbb{R}^{r \times r}, \]
where $r$ is the rank of $A$, $U^T U = I_r$, $V^T V = I_r$, and $D$ is diagonal with positive entries. This implies that $u_1, \dots, u_r$ span $\mathrm{colspan}(A)$ and $v_1, \dots, v_r$ span $\mathrm{rowspan}(A)$.
The vector form of this is
\[ A = \sum_{j=1}^r \lambda_j u_j v_j^T. \]
Remark 3.3. We have $AA^T u_j = \lambda_j^2 u_j$ and $A^T A v_j = \lambda_j^2 v_j$.
Consider the special case when $A$ is positive semidefinite. The eigenvalues are nonnegative and equal to the singular values, and $U$ and $V$ become the same matrix. In this case,
\[ \|A\|_{\mathrm{op}} = \max_{x \in B_2} |Ax|_2 = \lambda_{\max}(A). \]
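A small NumPy check of this subsection (my own sketch on a random rectangular matrix): the relations in Remark 3.3, and the fact that the operator norm equals the largest singular value.

```python
import numpy as np

# Check Remark 3.3 and the operator norm on a random 6 x 4 matrix.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U diag(s) Vt
u1, v1, lam1 = U[:, 0], Vt[0, :], s[0]

print(np.allclose(A @ A.T @ u1, lam1**2 * u1))     # A A^T u_1 = lambda_1^2 u_1
print(np.allclose(A.T @ A @ v1, lam1**2 * v1))     # A^T A v_1 = lambda_1^2 v_1
print(np.isclose(np.linalg.norm(A, ord=2), s.max()))  # ||A||_op = largest singular value
```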
§3.3 Vector Norms and Inner Products
Let $A$ and $B$ be matrices. The (entrywise) $q$-norm is defined as
\[ |A|_q = \Big( \sum_{ij} |A_{ij}|^q \Big)^{1/q}. \]
Remark 3.4. Note that $|A|_\infty = \max_{ij} |A_{ij}|$ and $|A|_0$ is the number of nonzero entries. We also have $|A|_2 = \sqrt{\mathrm{Tr}(A^T A)} = \sqrt{\mathrm{Tr}(AA^T)} = \|A\|_F$.
Then we can define the inner product
\[ \langle A, B \rangle = \mathrm{Tr}(A^T B) = \mathrm{Tr}(AB^T). \]
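A quick check of the entrywise norms and the trace inner product (my own sketch):

```python
import numpy as np

# Check the entrywise q-norm, the Frobenius identity in Remark 3.4, and the
# two equivalent forms of the trace inner product.
rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

q = 3
print("|A|_3 =", (np.abs(A) ** q).sum() ** (1 / q))                      # entrywise q-norm
print(np.isclose(np.sqrt(np.trace(A.T @ A)), np.linalg.norm(A, "fro")))  # |A|_2 = ||A||_F
print(np.isclose(np.trace(A.T @ B), np.trace(A @ B.T)))                  # <A, B> both ways
print(np.isclose(np.trace(A.T @ B), (A * B).sum()))                      # = sum_ij A_ij B_ij
```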
§3.4 Spectral Norms
Let $A$ have singular values $\lambda_1, \dots, \lambda_r$, and consider the vector $\lambda = (\lambda_1, \dots, \lambda_r)$. The Schatten $q$-norm is defined as
\[ \|A\|_q = |\lambda|_q. \]
When $q = 2$, we have
\[ \|A\|_2^2 = |\lambda|_2^2 = \|A\|_F^2 = |A|_2^2, \]
which can be derived trivially by plugging the SVD into $\mathrm{Tr}(A^T A)$. When $q = 1$, we call this the nuclear/trace norm:
\[ \|A\|_1 = |\lambda|_1 = \sum_j \lambda_j = \|A\|_*. \]
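And a quick numerical check of the $q = 2$ and $q = 1$ cases (my own sketch, using NumPy's built-in Frobenius and nuclear norms):

```python
import numpy as np

# Check ||A||_2 = ||A||_F and ||A||_1 = sum of singular values (nuclear norm).
rng = np.random.default_rng(5)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)             # singular values of A

print(np.isclose(np.sqrt((s ** 2).sum()), np.linalg.norm(A, "fro")))   # Schatten 2-norm
print(np.isclose(s.sum(), np.linalg.norm(A, "nuc")))                   # Schatten 1-norm
```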
§3.5 Matrix Inequalities
Let A and B be positive semidefinite. Order their eigenvalues in decreasing order.
Theorem 3.5
Weyl. We have
\[ \max_j |\lambda_j(A) - \lambda_j(B)| \le \|A - B\|_{\mathrm{op}}. \]
Theorem 3.6
Hoffman-Wielandt. We have
\[ \sum_j |\lambda_j(A) - \lambda_j(B)|^2 \le \|A - B\|_F^2. \]
Theorem 3.7
Hölder. For $\frac{1}{p} + \frac{1}{q} = 1$, we have
\[ \langle A, B \rangle \le \|A\|_p \|B\|_q. \]
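Here is a small numerical sanity check of all three inequalities (my own sketch; for Hölder it uses the pair $p = 1$, $q = \infty$, i.e. nuclear versus operator norm):

```python
import numpy as np

# Check Weyl, Hoffman-Wielandt, and trace Hölder (p = 1, q = inf) on random PSD matrices.
rng = np.random.default_rng(6)

def random_psd(n):
    G = rng.standard_normal((n, n))
    return G @ G.T                                  # PSD by construction

A, B = random_psd(5), random_psd(5)
lamA = np.sort(np.linalg.eigvalsh(A))[::-1]         # eigenvalues, decreasing
lamB = np.sort(np.linalg.eigvalsh(B))[::-1]

print(np.max(np.abs(lamA - lamB)) <= np.linalg.norm(A - B, 2))              # Weyl
print(np.sum((lamA - lamB) ** 2) <= np.linalg.norm(A - B, "fro") ** 2)      # Hoffman-Wielandt
print(np.trace(A.T @ B) <= np.linalg.norm(A, "nuc") * np.linalg.norm(B, 2)) # Hölder
```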
§3.6 Eckart-Young
Also known as best rank-k approximation.
Lemma 3.8
Let matrix $A$ be of rank $r$. Look at the SVD $A = \sum_{j=1}^r \lambda_j u_j v_j^T$ and assume the singular values are in decreasing order. For any $k \le r$, define the truncated SVD
\[ A_k = \sum_{j=1}^k \lambda_j u_j v_j^T. \]
This matrix has rank $k$. Then, we have
\[ \|A - A_k\|_F^2 = \inf_{\mathrm{rank}(B) \le k} \|A - B\|_F^2 = \sum_{j=k+1}^r \lambda_j^2. \]
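A quick NumPy verification of the lemma (my own sketch): the truncated SVD's Frobenius error equals the sum of the squared discarded singular values.

```python
import numpy as np

# Verify Eckart-Young numerically: ||A - A_k||_F^2 = sum_{j > k} lambda_j^2.
rng = np.random.default_rng(7)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k truncated SVD

err_sq = np.linalg.norm(A - A_k, "fro") ** 2
print(np.isclose(err_sq, np.sum(s[k:] ** 2)))
```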