The document summarizes Yan Xu's upcoming presentation at the Houston Machine Learning Meetup on dimension reduction techniques. Yan will cover linear methods like PCA and nonlinear methods such as ISOMAP, LLE, and t-SNE. She will explain how these methods work, including preserving variance with PCA, using geodesic distances with ISOMAP, and modeling local neighborhoods with LLE and t-SNE. Yan will also demonstrate these methods on a dataset of handwritten digits. The meetup is part of a broader roadmap of machine learning topics that will be covered in future sessions.
Dimensionality Reduction
• Linear method
  • PCA (Principal Component Analysis)
    • Preserves the variance
• Non-linear methods
  • ISOMAP
  • LLE
  • t-SNE
  • Laplacian eigenmaps
  • Diffusion map
  • KNN Diffusion
• A Global Geometric Framework for Nonlinear Dimensionality Reduction, J. B. Tenenbaum, V. de Silva, J. C. Langford (Science, 2000)
• Nonlinear Dimensionality Reduction by Locally Linear Embedding, Sam T. Roweis and Lawrence K. Saul (Science, 2000)
• Visualizing Data using t-SNE, Laurens van der Maaten, Geoffrey Hinton (Journal of Machine Learning Research, 2008)
PCA
• Projecting onto e1 captures the majority of the variance and hence it minimizes the error.
• Choosing the subspace dimension M:
  • Larger M means lower expected error in the subspace approximation of the data (see the PCA sketch below the figure).
[Figure: data points in the (x0, x1) plane with principal directions e1 and e2]
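As a concrete illustration of choosing M, here is a minimal scikit-learn sketch (my own addition, not code from the talk) that keeps the smallest number of principal components retaining roughly 95% of the variance on the handwritten-digits dataset mentioned in the summary; the 95% threshold is an illustrative assumption.

```python
# Minimal sketch: choose the PCA subspace dimension M by keeping enough
# principal components to retain ~95% of the variance (threshold is arbitrary).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)          # 1797 handwritten digits, 64-D
pca = PCA().fit(X)

# Cumulative fraction of variance captured by the top-M components
cumvar = np.cumsum(pca.explained_variance_ratio_)
M = int(np.searchsorted(cumvar, 0.95)) + 1   # smallest M reaching 95%
print(f"M = {M} components keep {cumvar[M - 1]:.1%} of the variance")

X_reduced = PCA(n_components=M).fit_transform(X)   # project onto the subspace
```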
Manifold Recovery Guarantee of ISOMAP
• Isomap is guaranteed, asymptotically, to recover the true dimensionality and geometric structure of nonlinear manifolds.
• As the number of sample data points increases, the graph distances provide increasingly better approximations to the intrinsic geodesic distances.
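A hedged scikit-learn sketch (my illustration, not the talk's code): Isomap builds a k-nearest-neighbor graph, approximates geodesic distances by graph shortest paths, and embeds them with classical MDS; n_neighbors=10 is an arbitrary choice.

```python
# Minimal Isomap sketch: graph shortest-path distances stand in for geodesics.
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)
# k-NN graph -> shortest-path (graph) distances -> classical MDS embedding
Y = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(Y.shape)   # (1797, 2)
```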
LLE: Solution
• Reconstruct linear weights w: for each point x_i, minimize \(\sum_i \| x_i - \sum_j w_{ij} x_j \|^2\) over its neighbors, using the local covariance matrix \(C_{jk} = (x_i - x_j) \cdot (x_i - x_k)\); the optimal weights are \(w_j = \sum_k C^{-1}_{jk} \big/ \sum_{lm} C^{-1}_{lm}\).
• Map to embedded coordinates Y: minimize \(\sum_i \| y_i - \sum_j w_{ij} y_j \|^2\) by taking the bottom d+1 eigenvectors of \(M = (I - W)^\top (I - W)\) and discarding the bottom one, which is the unit (constant) vector.
• https://www.cs.nyu.edu/~roweis/lle/papers/lleintro.pdf
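The two steps above can be written out directly in NumPy; the following is a toy sketch of my own (loosely following the lleintro paper linked above), with the neighborhood size k, target dimension d, and regularization constant chosen arbitrarily.

```python
# Toy LLE sketch: reconstruct weights from local covariances, then embed with
# the bottom eigenvectors of M = (I - W)^T (I - W).
import numpy as np

def lle(X, k=10, d=2, reg=1e-3):
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        # k nearest neighbors of x_i (excluding x_i itself)
        nbrs = np.argsort(np.linalg.norm(X - X[i], axis=1))[1:k + 1]
        # Local covariance matrix C_jk = (x_i - x_j) . (x_i - x_k)
        Z = X[i] - X[nbrs]
        C = Z @ Z.T
        C += reg * np.trace(C) * np.eye(k)      # regularize in case C is singular
        # Reconstruction weights: solve C w = 1 and normalize to sum to one
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()
    # Bottom d+1 eigenvectors of M; discard the bottom (constant unit) one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    _, vecs = np.linalg.eigh(M)                 # eigenvalues in ascending order
    return vecs[:, 1:d + 1]
```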
Designing your own dimension reduction!
• High-dimensional representation of the data
  • Geodesic distance
  • Local linear weighted representation
• Low-dimensional representation of the data
  • Euclidean distance
  • Local linear weighted representation
• Cost function between the high- and low-dimensional representations (see the sketch below)
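As a sketch of this recipe (entirely my own illustration, with arbitrary learning rate, iteration count, and neighborhood size): take geodesic distances as the high-dimensional representation, Euclidean distances as the low-dimensional one, and descend a squared-difference cost between them.

```python
# "Roll your own" dimension reduction: match low-D Euclidean distances to
# high-D geodesic (graph shortest-path) distances by gradient descent.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def my_reduction(X, d=2, k=10, lr=1e-3, n_iter=300, seed=0):
    n = X.shape[0]
    # High-D representation: geodesic distances on a k-NN graph
    # (assumes the graph is connected, otherwise some distances are infinite)
    D_high = shortest_path(kneighbors_graph(X, k, mode='distance'), directed=False)

    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(n, d))     # random low-D initialization
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]    # y_i - y_j
        D_low = np.linalg.norm(diff, axis=-1)   # low-D Euclidean distances
        safe = np.where(D_low == 0, 1.0, D_low) # avoid divide-by-zero on i == j
        # Cost = sum_ij (D_low_ij - D_high_ij)^2; gradient w.r.t. each y_i:
        grad = 4 * (((D_low - D_high) / safe)[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y
```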
Story-telling through data visualization
https://www.youtube.com/watch?v=usdJgEwMinM
Stochastic Neighbor Embedding (SNE)
• It is more important to get local distances right than non-local ones.
• Stochastic neighbor embedding has a probabilistic way of deciding if a
pairwise distance is “local”.
• Convert each high-dimensional similarity into the probability that one
data point will pick the other data point as its neighbor.
Probability of picking j given i in high D:
p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \ne i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

Probability of picking j given i in low D:
q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \ne i} \exp(-\|y_i - y_k\|^2)}
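In code, the two formulas above look like this (a NumPy sketch I am adding for illustration; sigmas holds the per-point radii discussed on the next slide).

```python
# Conditional neighbor probabilities for SNE (sketch, not reference code).
import numpy as np

def high_d_probs(X, sigmas):
    """p_{j|i}: Gaussian similarities in high D, one radius sigma_i per point."""
    D = np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1)   # ||x_i - x_j||^2
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                                    # exclude k = i
    return P / P.sum(axis=1, keepdims=True)

def low_d_probs(Y):
    """q_{j|i}: Gaussian similarities in low D with fixed unit variance."""
    D = np.square(Y[:, None, :] - Y[None, :, :]).sum(axis=-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)
```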
Picking the radius of the Gaussian that is used to compute the p's
• We need to use different radii in different parts of the space so that we
keep the effective number of neighbors about constant.
• A big radius leads to a high entropy for the distribution over neighbors of i.
A small radius leads to a low entropy.
• So decide what entropy you want and then find the radius that produces
that entropy.
• It's easier to specify the perplexity, \(\mathrm{Perp}(P_i) = 2^{H(P_i)}\) with \(H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}\) (roughly, the effective number of neighbors), where
p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \ne i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}
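A hedged sketch of "decide what entropy you want and then find the radius that produces it": bisect sigma_i until the entropy of p_{.|i} matches log2 of the desired perplexity (the search bounds, tolerance constant, and default perplexity are arbitrary choices of mine).

```python
# Bisection for the per-point Gaussian radius sigma_i given a target perplexity.
import numpy as np

def sigma_for_perplexity(dist_sq_i, perplexity=30.0, n_iter=50):
    """dist_sq_i: squared distances from point i to every other point (i excluded)."""
    target = np.log2(perplexity)              # desired entropy in bits
    lo, hi = 1e-10, 1e4
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-dist_sq_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if entropy > target:
            hi = sigma        # too many effective neighbors: shrink the radius
        else:
            lo = sigma        # too few effective neighbors: grow the radius
    return 0.5 * (lo + hi)
```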
The cost function for a low-dimensional representation
Cost = \sum_i \mathrm{KL}(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}

Gradient descent:
\frac{\partial C}{\partial y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)

Gradient update with a momentum term:
Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t) \left( Y^{(t-1)} - Y^{(t-2)} \right)
where \eta is the learning rate and \alpha(t) is the momentum.
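A sketch of one optimization step, using the P and Q matrices from the sketch above (my own illustration; the learning rate and momentum values are placeholders, and the code steps against the gradient so that the KL cost decreases).

```python
# One SNE gradient step with a momentum term (sketch).
import numpy as np

def kl_gradient(P, Q, Y):
    """dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)."""
    coeff = (P - Q) + (P - Q).T
    diff = Y[:, None, :] - Y[None, :, :]          # y_i - y_j
    return 2.0 * (coeff[:, :, None] * diff).sum(axis=1)

def momentum_step(Y, Y_prev, grad, lr=10.0, momentum=0.5):
    """Move against the gradient of the KL cost, plus a momentum term."""
    return Y - lr * grad + momentum * (Y - Y_prev)
```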
Turning conditional probabilities into pairwise probabilities
Conditional probabilities (one Gaussian radius \sigma_i per point):
p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \ne i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

Pairwise (joint) probabilities, with a single Gaussian width \sigma:
p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \ne l} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)}

The cost and gradient become:
Cost = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij}) (y_i - y_j)
From SNE to t-SNE: solving the crowding problem
q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \ne l} (1 + \|y_k - y_l\|^2)^{-1}}
• High dimension: convert distances into probabilities using a Gaussian distribution.
• Low dimension: convert distances into probabilities using a probability distribution that has much heavier tails than a Gaussian.
• Student's t-distribution (V: the number of degrees of freedom); t-SNE uses V = 1.
[Figure: standard normal distribution vs. Student's t-distribution with V = 1]
Optimization method for t-SNE
Low-dimensional pairwise probabilities (Student's t with one degree of freedom):
q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \ne l} (1 + \|y_k - y_l\|^2)^{-1}}

High-dimensional pairwise probabilities (Gaussian):
p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \ne l} \exp(-\|x_k - x_l\|^2 / 2\sigma^2)}
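End to end, the demo on handwritten digits described in the summary can be reproduced with scikit-learn's implementation; this sketch (with illustrative hyperparameters) is my addition, not the speaker's code.

```python
# Running t-SNE on the handwritten-digits dataset with scikit-learn.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
Y = TSNE(n_components=2, perplexity=30, init='pca',
         random_state=0).fit_transform(X)
print(Y.shape)   # (1797, 2): one 2-D point per digit image, colorable by y
```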
Weaknesses
1. It's unclear how t-SNE performs on general dimensionality reduction tasks (d > 3);
2. The relatively local nature of t-SNE makes it sensitive to the curse of the intrinsic dimensionality of the data;
3. It's not guaranteed to converge to a global optimum of its cost function;
4. It tends to form sub-clusters even if the data points are totally random (see the sketch below).
Explore t-SNE on simple cases: http://distill.pub/2016/misread-tsne/
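A quick sketch of weakness 4 (my addition, in the spirit of the distill.pub article linked above): run t-SNE on purely random points and the embedding can still look clustered, especially at small perplexity.

```python
# t-SNE applied to structure-free random data can still show apparent clusters.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_random = rng.uniform(size=(500, 50))     # 500 uniformly random 50-D points
Y = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X_random)
# Scatter-plotting Y often shows "sub-clusters" even though X_random has none.
```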