Houston Machine Learning Meetup
Feb 25, 2017
Yan Xu
Revealing the hidden data structure in high
dimension - Dimension reduction
https://energyconferencenetwork.com/machine-learning-oil-gas-2017/
20% off, PROMO code: HML
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Agglomerative clustering – Kunal
– Spectral clustering – Yan
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– Convolutional neural network – Hengyang Lu
– Recurrent neural networks
– Train deep nets with open-source tools
Earlier slides: on the Meetup page under More | File
Later slides at: http://www.slideshare.net/xuyangela
Dimensionality Reduction
• Simple example
• 3-D data
[Figure: 3-D data plotted on X, Y, Z axes and its 2-D projections]
Motivation
Widely used in:
• Language processing
• Image processing
• Denoising
• Compression
Dimension Reduction Overview
• Linear (e.g., PCA) vs. nonlinear methods
• Parametric (e.g., LDA) vs. non-parametric methods
• Among nonlinear methods: global (e.g., ISOMAP) vs. local (e.g., LLE, SNE)
Dimensionality Reduction
• Linear method
• PCA (Principal Component Analysis)
• Preserves the variance
• Non-linear method
• ISOMAP
• LLE
• tSNE
• Laplacian
• Diffusion map
• KNN Diffusion
• A Global Geometric Framework for Nonlinear Dimensionality Reduction, J. B. Tenenbaum, V. de Silva, J. C. Langford (Science, 2000)
• Nonlinear Dimensionality Reduction by Locally Linear Embedding, Sam T. Roweis and Lawrence K. Saul (Science, 2000)
• Visualizing Data using t-SNE, Laurens van der Maaten, Geoffrey Hinton (Journal of Machine Learning Research, 2008)
PCA
[Figure: 2-D data in the (x_0, x_1) plane with principal directions e_1 and e_2]
$\lambda_k$ is the marginal variance along the principal direction $e_k$.
PCA
• Projecting onto e1 captures the majority of the variance and hence it
minimizes the error.
• Choosing subspace dimension M:
• Large M means lower expected
error in the subspace data
approximation
[Figure: the same data projected onto e_1, reducing the representation from 2-D to 1-D]
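A minimal scikit-learn sketch of this idea (the toy 3-D data and parameter values are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy 3-D data that mostly varies along a single direction (illustrative only)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

pca = PCA(n_components=2)             # keep the top M = 2 principal directions
X_low = pca.fit_transform(X)          # project onto e_1, e_2

print(pca.explained_variance_ratio_)  # fraction of variance captured by each direction
```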
Nonlinear Dimensionality Reduction
• Many data sets contain essential nonlinear structures that are invisible to PCA.
ISOMAP
• ISOMAP (Isometric feature Mapping)
• Preserves the intrinsic geometry of the data.
• Uses the geodesic manifold distances between all pairs.
ISOMAP
[Figure-only slides illustrating the ISOMAP algorithm]
Manifold Recovery Guarantee of ISOMAP
• Isomap is guaranteed asymptotically to recover the true dimensionality
and geometric structure of nonlinear manifolds.
• As the number of sample data points increases, the graph distances provide increasingly better approximations to the intrinsic geodesic distances.
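A hedged sketch using scikit-learn's Isomap (the S-curve toy data and parameter values are assumptions for illustration):

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# A nonlinear 3-D manifold (S-curve) sampled at 1000 points
X, color = make_s_curve(n_samples=1000, random_state=0)

# Neighborhood graph -> shortest-path (geodesic) distances -> low-D embedding
iso = Isomap(n_neighbors=10, n_components=2)
X_iso = iso.fit_transform(X)
```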
LLE: Locally Linear Embedding
LLE: Solution
• Reconstruct linear weights w: for each point, solve for the weights that best reconstruct it from its neighbors, using the local covariance matrix.
• Map to embedded coordinates Y: take the bottom d+1 eigenvectors of $M = (I - W)^\top (I - W)$ and discard the bottom one, which is the unit vector.
https://www.cs.nyu.edu/~roweis/lle/papers/lleintro.pdf
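A minimal sketch with scikit-learn's LocallyLinearEmbedding (dataset and parameters are illustrative assumptions):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1000, random_state=0)

# Step 1 (internally): fit local reconstruction weights over k neighbors
# Step 2 (internally): embed via the bottom eigenvectors of (I - W)^T (I - W)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)

print(lle.reconstruction_error_)
```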
PCA vs. ISOMAP vs. LLE
[Figure-only slides comparing PCA, ISOMAP, and LLE embeddings]
Designing your own dimension reduction!
• High dimensional representation of data
• Geodesic distance
• Local linear weighted representation
• Low dimensional representation of data
• Euclidean distance
• Local linear weighted representation
• Cost function between the high- and low-dimensional representations (see the sketch below)
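As a toy illustration of this recipe (not a method from the slides): a hedged sketch that uses pairwise Euclidean distances as both the high- and low-dimensional representations and minimizes their squared difference by gradient descent, which amounts to a simple metric-MDS-style embedding. All names and parameters are assumptions.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean distance matrix between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)

def simple_dr(X, d=2, lr=0.01, n_iter=500, seed=0):
    """Minimize sum_ij (D_high_ij - D_low_ij)^2 over the low-D coordinates Y."""
    rng = np.random.RandomState(seed)
    D_high = pairwise_dist(X)                       # high-dimensional representation
    Y = rng.normal(scale=1e-2, size=(X.shape[0], d))
    for _ in range(n_iter):
        D_low = pairwise_dist(Y)                    # low-dimensional representation
        diff = Y[:, None, :] - Y[None, :, :]
        # dC/dY_i = 4 * sum_j (D_low_ij - D_high_ij) * (Y_i - Y_j) / D_low_ij
        grad = 4 * ((D_low - D_high) / D_low)[:, :, None] * diff
        Y -= lr * grad.sum(axis=1)
    return Y

# Toy usage: embed 100 random 5-D points into 2-D
Y = simple_dr(np.random.RandomState(1).rand(100, 5))
```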
Story-telling through data visualization
https://www.youtube.com/watch?v=usdJgEwMinM
tSNE (t-distributed Stochastic Neighbor Embedding)
Evolution: MDS → SNE (2002: local + probability) → sym SNE / UNI-SNE (2007: crowding problem) → tSNE (2008: easier implementation, more stable and faster solution) → Barnes-Hut-SNE (2013: O(N²) → O(N log N))
Explore tSNE simple cases: http://distill.pub/2016/misread-tsne/
Stochastic Neighbor Embedding (SNE)
• It is more important to get local distances right than non-local ones.
• Stochastic neighbor embedding has a probabilistic way of deciding if a
pairwise distance is “local”.
• Convert each high-dimensional similarity into the probability that one
data point will pick the other data point as its neighbor.
Probability of picking j given i in high D:
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
Probability of picking j given i in low D:
$$q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}$$
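A hedged NumPy sketch of these two definitions (names are illustrative; the per-point bandwidths sigma are taken as given):

```python
import numpy as np

def sq_dists(Z):
    """Matrix of squared Euclidean distances between the rows of Z."""
    s = (Z ** 2).sum(axis=1)
    return s[:, None] + s[None, :] - 2 * Z @ Z.T

def high_dim_probs(X, sigma):
    """Rows hold p_{j|i}: Gaussian similarities with per-point bandwidth sigma_i, normalized per row."""
    logits = -sq_dists(X) / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point never picks itself
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def low_dim_probs(Y):
    """Rows hold q_{j|i}: Gaussian similarities with unit bandwidth, normalized per row."""
    logits = -sq_dists(Y)
    np.fill_diagonal(logits, -np.inf)
    Q = np.exp(logits)
    return Q / Q.sum(axis=1, keepdims=True)
```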
Picking the radius of the Gaussian that is used
to compute the p’s
• We need to use different radii in different parts of the space so that we
keep the effective number of neighbors about constant.
• A big radius leads to a high entropy for the distribution over neighbors of i.
A small radius leads to a low entropy.
• So decide what entropy you want and then find the radius that produces
that entropy.
• It's easier to specify the perplexity, $\mathrm{Perp}(P_i) = 2^{H(P_i)}$ (roughly, the effective number of neighbors), where $H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$ is the entropy of
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
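A hedged sketch of how this is commonly done in practice: binary-search each sigma_i until the perplexity of the conditional distribution matches a target value (all names and tolerances here are illustrative assumptions).

```python
import numpy as np

def perplexity(p_row):
    """Perp(P_i) = 2^H(P_i), with the entropy H in bits."""
    p = p_row[p_row > 0]
    return 2.0 ** (-(p * np.log2(p)).sum())

def sigma_for_perplexity(sq_dist_row, i, target=30.0, tol=1e-4, n_iter=50):
    """Binary-search sigma_i so that p_{.|i} has (approximately) the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        logits = -sq_dist_row / (2.0 * sigma ** 2)
        logits[i] = -np.inf                      # exclude the point itself
        p = np.exp(logits - logits[np.isfinite(logits)].max())  # stabilize the exponentials
        p /= p.sum()
        if perplexity(p) > target:
            hi = sigma                           # too many effective neighbors: shrink the radius
        else:
            lo = sigma                           # too few: enlarge the radius
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```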
The cost function for a low-dimensional representation
$$Cost = \sum_i \mathrm{KL}(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$
Gradient descent:
$$\frac{\partial C}{\partial y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)(y_i - y_j)$$
Gradient update with a momentum term, where $\eta$ is the learning rate and $\alpha(t)$ the momentum:
$$Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right)$$
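A hedged NumPy sketch of this gradient and of a momentum update (written as a descent step; P holds the conditional p_{j|i}, and q_fn is any function that maps Y to the matching q_{j|i}, e.g. the low_dim_probs sketch above; all names are assumptions):

```python
import numpy as np

def sne_gradient(P, Q, Y):
    """dC/dy_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) * (y_i - y_j)."""
    coeff = (P - Q) + (P - Q).T                # combines the j|i and i|j terms
    diff = Y[:, None, :] - Y[None, :, :]       # diff[i, j] = y_i - y_j
    return 2.0 * (coeff[:, :, None] * diff).sum(axis=1)

def momentum_descent(P, Y0, q_fn, eta=0.1, alpha=0.8, n_iter=1000):
    """Y(t) = Y(t-1) - eta * dC/dY + alpha * (Y(t-1) - Y(t-2))."""
    Y, Y_prev = Y0.copy(), Y0.copy()
    for _ in range(n_iter):
        Q = q_fn(Y)                            # recompute the low-D probabilities
        grad = sne_gradient(P, Q, Y)
        Y, Y_prev = Y - eta * grad + alpha * (Y - Y_prev), Y
    return Y
```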
Turning conditional probabilities into pairwise
probabilities
Pairwise (joint) probabilities in high D, with a single bandwidth $\sigma$:
$$p_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_{k \neq l} \exp\left(-\|x_k - x_l\|^2 / 2\sigma^2\right)}$$
compared with the conditional probabilities
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$
(In practice, the t-SNE paper symmetrizes the conditionals instead: $p_{ij} = (p_{j|i} + p_{i|j}) / 2N$.)
Cost and gradient for the symmetric (pairwise) version:
$$Cost = \mathrm{KL}(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
$$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)$$
MNIST database of handwritten digits (28×28 images)
[Figure: SNE embedding of MNIST digits]
Problem? Too crowded!
From SNE to t-SNE: solve the crowding problem
• High dimension: convert distances into probabilities using a Gaussian distribution.
• Low dimension: convert distances into probabilities using a distribution with much heavier tails than a Gaussian: Student's t-distribution with v degrees of freedom (v = 1 here):
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
[Figure: standard normal distribution vs. the t-distribution with v = 1]
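A hedged NumPy sketch of this heavy-tailed low-dimensional similarity (names are illustrative):

```python
import numpy as np

def t_sne_q(Y):
    """q_ij = (1 + ||y_i - y_j||^2)^-1 / sum_{k != l} (1 + ||y_k - y_l||^2)^-1."""
    s = (Y ** 2).sum(axis=1)
    sq_d = s[:, None] + s[None, :] - 2 * Y @ Y.T   # squared pairwise distances
    num = 1.0 / (1.0 + sq_d)                       # Student-t kernel with one degree of freedom
    np.fill_diagonal(num, 0.0)                     # exclude the k = l terms
    return num / num.sum()
```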
Optimization method for tSNE
High-D pairwise probabilities (Gaussian):
$$p_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_{k \neq l} \exp\left(-\|x_k - x_l\|^2 / 2\sigma^2\right)}$$
Low-D pairwise probabilities (Student-t):
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
Minimize $\mathrm{KL}(P \| Q)$ by gradient descent with momentum; the gradient is
$$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$
6000 MNIST digits
[Figures: 2-D embeddings of 6,000 MNIST digits produced by Isomap, LLE, and t-SNE]
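A hedged way to reproduce a comparison like this on a smaller scale, using scikit-learn's built-in 8×8 digits rather than the full 28×28 MNIST set shown on the slide; the parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, LocallyLinearEmbedding, TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples of 8x8 digit images

embeddings = {
    "Isomap": Isomap(n_neighbors=10, n_components=2).fit_transform(X),
    "LLE": LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X),
}
# Each value is an (n_samples, 2) array that can be scatter-plotted and colored by the label y.
```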
Weaknesses
1. It's unclear how t-SNE performs on general dimensionality reduction tasks (d > 3);
2. The relative local nature of t-SNE makes it sensitive to the curse
of the intrinsic dimensionality of the data;
3. It’s not guaranteed to converge to a global optimum of its cost
function.
4. It tends to form sub-clusters even if the data points are totally
random.
Explore tSNE simple cases: http://distill.pub/2016/misread-tsne/
References:
t-SNE homepage:
http://homepage.tudelft.nl/19j49/t-SNE.html
Advanced Machine Learning, Lecture 11: Non-linear Dimensionality Reduction
http://www.cs.toronto.edu/~hinton/csc2535/lectures.html
Implementation
Manifold method:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.manifold
Good examples:
http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py
Roadmap: Method
• Tour of machine learning algorithms (1 session)
• Feature engineering (1 session)
– Feature selection - Yan
• Supervised learning (4 sessions)
– Regression models - Yan
– SVM and kernel SVM - Yan
– Tree-based models - Dario
– Bayesian method - Xiaoyang
– Ensemble models - Yan
• Unsupervised learning (3 sessions)
– K-means clustering
– DBSCAN - Cheng
– Mean shift
– Agglomerative clustering – Kunal
– Spectral clustering – Yan
– Dimension reduction for data visualization - Yan
• Deep learning (4 sessions)
– Neural network
– Convolutional neural network – Hengyang Lu
– Recurrent neural networks
– Train deep nets with open-source tools
Thank you
Machine learning in Oil and Gas Conference @ Houston, April 19-20:
https://energyconferencenetwork.com/machine-learning-oil-gas-2017/
20% off, PROMO code: HML
Earlier slides: on the Meetup page under More | File
Later slides at: http://www.slideshare.net/xuyangela