
DATA MINING 2

Dimensionality Reduction
Riccardo Guidotti

a.a. 2022/2023
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.
• Approaches can be divided into feature selection and feature projection.
[Illustration: a table with ten original features X1–X10, mixing numeric and categorical values, reduced to two derived features XA and XB.]
Feature Selection
• Select a subset of the features according to different strategies:
• the filter strategy (e.g. information gain),
• the wrapper strategy (e.g. search guided by accuracy),
• the embedded strategy (features are added or removed while building the model, based on prediction errors).

• Classification and/or regression can often be done in the reduced space more accurately than in the original space.
Feature Selection
• Variance Threshold. It removes all features whose variance does not meet some
threshold. By default, it removes all zero-variance features, i.e. features that have the
same value in all samples.
• Univariate Feature Selection. It selects the best features based on univariate statistical tests. For instance, it keeps only the k highest scoring features. An example of such a statistical test is the ANOVA F-value between each feature and the label.

• F-value = [ Σi ni (Ȳi − Ȳ)² / (K − 1) ] / [ Σi Σj (Yij − Ȳi)² / (N − K) ]

• where Ȳi denotes the sample mean in the i-th group, ni is the number of observations in the i-th group, Ȳ denotes the overall mean of the data, Yij is the j-th observation in the i-th out of K groups, K denotes the number of groups, and N the overall sample size.
• The F-value is large if the numerator (between-group variability) is large relative to the denominator (within-group variability), which is unlikely to happen if the population means of the groups all have the same value.
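As an illustration, a minimal scikit-learn sketch of both selectors; the dataset and the parameter values (threshold, k) are arbitrary choices for the example, not prescribed by the slides:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Drop features whose variance is below an (arbitrary) threshold
vt = VarianceThreshold(threshold=0.2)
X_vt = vt.fit_transform(X)

# Keep the k=2 features with the highest ANOVA F-value w.r.t. the label
skb = SelectKBest(score_func=f_classif, k=2)
X_best = skb.fit_transform(X, y)

print(X.shape, X_vt.shape, X_best.shape)
```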
Recursive Feature Elimination (RFE)
• Given an external estimator that assigns weights to features (e.g., the
coefficients of a linear model, or feature importance of decision tree),
RFE selects features by recursively considering smaller and smaller
sets of features.
• First, the estimator is trained on the initial set of features and the
importance of each feature is obtained.
• Then, the least important features are pruned from the current set of features.
• That procedure is recursively repeated on the pruned set until the
desired number of features to select is eventually reached.
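A hedged scikit-learn sketch of the procedure; the choice of estimator, dataset and number of features to keep is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# External estimator whose coefficients provide the feature importances
estimator = LogisticRegression(max_iter=1000)

# Recursively drop the least important feature until 2 remain
rfe = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger ranks were eliminated earlier
```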
Feature Projection (a.k.a Feature Extraction)
• It transforms the data in the high-dimensional space to a space of
fewer dimensions.
• The data transformation may be linear, or nonlinear.
• Approaches:
• Principal Component Analysis (PCA)
• Non-negative matrix factorization (NMF)
• Linear Discriminant Analysis (LDA)
• Multidimensional Scaling
• Sammon
• IsoMap
• t-SNE
• Autoencoder
Principal Component Analysis (PCA)
• The goal of PCA is to find a new set of
dimensions (attributes or features) that
better captures the variability of the data.
• The first dimension is chosen to capture as
much of the variability as possible.
• The second dimension is orthogonal to the
first and, subject to that constraint, captures
as much of the remaining variability as
possible, and so on.
PCA – Conceptual Algorithm
• Find a line such that, when the data is projected onto that line, it has
the maximum variance; minimize the sum-of-squares of the
projection errors.
PCA – Conceptual Algorithm
• Find a second line, orthogonal to the first, that has maximum
projected variance.
PCA – Conceptual Algorithm
• Repeat until we have k orthogonal lines.
• The projected position of a point on these lines gives the coordinates
in the k-dimensional reduced space.
Background: Covariance, Eigenvalues and Eigenvectors
• The covariance of two attributes is a measure of how strongly the attributes vary together.
• Eigenvector of a matrix X: a vector v such that Xv = λv
• λ: eigenvalue of eigenvector v
• A symmetric matrix X of rank r has r orthonormal eigenvectors v1, v2, …, vr with eigenvalues λ1, λ2, …, λr.
• The eigenvectors define an orthonormal basis for the column space of X
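A quick NumPy check of the eigenvector definition above; the matrix values are an arbitrary example:

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # symmetric, so eigenvectors are orthonormal

eigvals, eigvecs = np.linalg.eigh(X)  # eigh handles symmetric matrices
v, lam = eigvecs[:, 0], eigvals[0]

print(np.allclose(X @ v, lam * v))                   # True: X v = lambda v
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))   # eigenvectors are orthonormal
```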
Steps in PCA
• Calculate the mean value of the data in every dimension
• Calculate the covariance matrix of all pairs of attributes:
• Given the data matrix X, remove the mean of each column from the column vectors to get the centered matrix C
• The matrix Σ = CᵀC is, up to a constant factor, the covariance matrix of the row vectors of X
• Calculate the eigenvalues and eigenvectors of Σ
• Methods: power iteration, Singular Value Decomposition
• The eigenvector with the largest eigenvalue λ1 is the 1st PC
• The eigenvector with the k-th largest eigenvalue λk is the k-th PC
• λk / Σi λi is the proportion of variance captured by the k-th PC
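A minimal NumPy sketch of these steps, assuming a data matrix X with one observation per row; the function name and the random data are illustrative:

```python
import numpy as np

def pca_eig(X, k):
    """Project the rows of X onto the top-k principal components."""
    C = X - X.mean(axis=0)                    # center each column
    cov = C.T @ C / (len(X) - 1)              # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: cov is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals[:k] / eigvals.sum()   # proportion of variance per PC
    return C @ eigvecs[:, :k], explained

X = np.random.RandomState(0).randn(100, 5)
Z, ratio = pca_eig(X, k=2)
print(Z.shape, ratio)
```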
Singular Value Decomposition - SVD
Applying the PCA
• The full set of PCs comprises a new orthogonal basis for the feature space, whose axes are aligned with the directions of maximum variance of the original data.
• Projecting the original data onto the first k PCs gives a reduced dimensionality representation of the data.
• Transforming the reduced dimensionality projection back into the original space gives a reduced dimensionality reconstruction of the data.
• The reconstruction will have some error.
Select the dimension k
• Rank the eigenvalues in decreasing order.
• Select enough eigenvectors to retain a fixed percentage of the variance (i.e., at least a chosen minimum threshold).
Example
• Iris Dataset
PCA via SVD
• Create mean-centered data matrix X
• Solve the SVD: X = USVᵀ
• The columns of V are the eigenvectors of Σ, sorted from the largest to the smallest eigenvalue.

• Limits of PCA:
• Limited to linear projections
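A corresponding NumPy sketch using SVD instead of an explicit eigendecomposition; the function name and data are illustrative:

```python
import numpy as np

def pca_svd(X, k):
    """PCA of the rows of X via SVD of the mean-centered matrix."""
    C = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    # Rows of Vt (columns of V) are the PCs, already sorted by decreasing
    # singular value; squared singular values are proportional to the eigenvalues.
    return C @ Vt[:k].T

X = np.random.RandomState(0).randn(100, 5)
print(pca_svd(X, k=2).shape)
```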
Partial Least Squares (PLS)
• Supervised alternative to PCA
• Attempts to find the set of orthogonal directions that explain both
outcome and features.
• First direction:
• Calculate a simple linear regression between each feature and the outcome
• Use the coefficients to define the first direction, giving the greatest weight to predictors that are highly correlated with the outcome (large coefficients)
• Repeat the procedure on the residuals of the predictors
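A hedged scikit-learn sketch of PLS as a supervised projection; the regression dataset and the number of components are arbitrary example choices:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Find 2 directions that explain both the predictors and the outcome
pls = PLSRegression(n_components=2).fit(X, y)
X_reduced = pls.transform(X)   # projection onto the PLS directions
print(X_reduced.shape)
```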
Random Subspace Projection
• High-dimensional data is projected into low-dimensional space using
a random matrix whose columns have unit length.
• No attempt to optimize a criterion.
• Approximately preserves the structure of the data (e.g., pairwise distances).
• Computationally cheap.
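A minimal NumPy sketch of the idea described above; the dimensions and the random data are arbitrary:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 200)                 # high-dimensional data

k = 20
R = rng.randn(X.shape[1], k)             # random projection matrix
R /= np.linalg.norm(R, axis=0)           # normalize columns to unit length

X_low = X @ R                            # low-dimensional projection
print(X_low.shape)                       # (1000, 20)
```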
Multi-Dimensional Scaling (MDS)
• Given a pairwise dissimilarity matrix (it does not need to be a metric), the goal of MDS is to learn a mapping of the data into a lower dimensionality such that the relative distances are preserved.
• If two points are close in the feature space, they should also be close in the latent factor space.
MDS methods
• MDS is a family of different algorithms designed to map data into a very low-dimensional configuration, e.g., k = 2 or k = 3.
• MDS methods include
• Classical MDS
• Metric MDS
• Non-metric MDS

• MDS cannot be inverted
Distance, dissimilarity and similarity
• Distance, dissimilarity and similarity (or proximity) are defined for any pair of objects in any space. In mathematics, a distance function (one that gives a distance between two objects) is also called a metric, and satisfies:
• d(x, y) ≥ 0,
• d(x, y) = 0 iff x = y,
• d(x, y) = d(y, x),
• d(x, z) ≤ d(x, y) + d(y, z).
• If the last condition (the triangle inequality) does not hold, then d is a distance function but it is not a metric.
MDS – Conceptual Algorithm
• Given a pairwise dissimilarity matrix D and the dimensionality k, find
a mapping such that dij = ||xi - xj|| for all points in D.
• Usually, a gradient descent approach is adopted to solve an
optimization problem that aims at minimizing the function
• J(x) = Σi Σj d1(dij, d2(xi, xj))
• Depending on the distances adopted to calculate D and the distance functions used for d1 and d2, the approach returns a different result.
• Classic MDS adopts the Euclidean distance for every calculation.
• Metric MDS adopts arbitrary metrics as distances.
• Non-metric MDS deals with the ranks of the distances instead of their values.
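As an illustration, a hedged scikit-learn sketch that fits metric MDS on a precomputed dissimilarity matrix; the data and parameter values are arbitrary:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
X = rng.randn(50, 10)                    # original feature space
D = pairwise_distances(X)                # pairwise dissimilarity matrix

# Metric MDS: find 2-d coordinates whose distances approximate D
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(D)
print(Y.shape, mds.stress_)              # stress_ = value of the minimized objective
```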
Sammon Mapping
• Sammon mapping is a generalization of the usual metric MDS.
• It introduces a weighting system that normalizes the squared-errors in
pairwise distances by using the distance in the original space.
• J(x) = Σi Σj d1(dij, d2(xi, xj)) / dij
• As a result, Sammon mapping preserves the small dij, giving them a greater degree of importance in the fitting procedure than larger values of dij.
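A minimal NumPy sketch of the Sammon stress for a candidate embedding (in its standard normalized form); this only evaluates the weighted objective described above, it is not an optimizer, and the data are arbitrary:

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_high, Y_low):
    """Sum of squared distance errors, each normalized by the original distance."""
    d_high = pdist(X_high)            # pairwise distances in the original space
    d_low = pdist(Y_low)              # pairwise distances in the embedding
    return np.sum((d_high - d_low) ** 2 / d_high) / np.sum(d_high)

rng = np.random.RandomState(0)
X = rng.randn(30, 10)
Y = rng.randn(30, 2)                  # a (random) candidate 2-d embedding
print(sammon_stress(X, Y))
```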
Classic-MDS vs Sammon Mapping
• Sammon mapping better preserves inter-distances for smaller
dissimilarities, while proportionally squeezes the inter-distances for
larger dissimilarities.
Isometric Feature Mapping (IsoMap)
• Preserves the intrinsic geometry of the data
• Uses the geodesic manifold distances between all pairs.
• It is an MDS method.
• IsoMap handles non-linear manifolds.
IsoMap Algorithm
• Step 1
• Determine the neighboring points within a fixed radius, based on the input space distance (Euclidean)
• These neighborhood relations are represented as a weighted graph G over the data points
• Step 2
• Estimate the geodesic distance between all pairs of points on the manifold by computing their shortest path distances on the graph G
• Step 3
• Construct an embedding of the data in a k-dimensional Euclidean space that best preserves the manifold geometry
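A minimal scikit-learn sketch on a classic non-linear manifold; note that it uses the k-nearest-neighbors variant of Step 1 rather than a fixed radius, and the parameter values are illustrative:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A 2-d sheet rolled up in 3-d: a standard non-linear manifold example
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Step 1 here uses k nearest neighbors instead of a fixed radius
iso = Isomap(n_neighbors=10, n_components=2)
X_unrolled = iso.fit_transform(X)
print(X_unrolled.shape)   # (1000, 2)
```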
t-Distributed Stochastic Neighbor Embedding (t-SNE)

• PCA tries to find a global structure
• Low dimensional subspace
• Can lead to local inconsistencies
• Far away points can become neighbors
• t-SNE tries to preserve local structure
• The low-dimensional neighborhood should be the same as the original neighborhood
• Distance preservation
• Neighbor preservation
• Unlike PCA, it is almost only used for visualization
PCA vs t-SNE
SNE Intuition
• Measure pairwise similarities between high-dimensional and low-
dimensional objects.
Stochastic Neighbor Embedding (SNE)
• Encode high dimensional neighborhood information as a distribution
• Intuition: Random walk between data points.
• High probability to jump to a close point
• Find low dimensional points such that their neighborhood
distribution is similar.
• How do you measure distance between distributions?
• Most common measure: KL divergence
Neighborhood Distributions
• Consider the neighborhood around an input data point xi
• Imagine that we have a Gaussian distribution centered around xi
• Then the probability that xi chooses some other datapoint xj as its neighbor is proportional to the density under this Gaussian
• A point closer to xi will be more likely than one further away
Probabilities
• This pj|i is the probability that point xi chooses xj as its neighbor
• The parameter sigma sets the size of the neighborhood
• Very low sigma: all the probability is on the nearest neighbor
• Very high sigma: uniform weights
• We set sigma differently for each data point
• Results depend heavily on sigma, as it defines the neighborhood we are trying to preserve
• The final distribution over pairs is symmetrized: pij = (pi|j + pj|i) / (2N)
Perplexity
• For each distribution pj|i (which depends on sigma) we define the perplexity
• perp(pj|i) = 2^H(pj|i), where H(p) = − Σ p log2(p) is the entropy
• If p is uniform over k elements, the perplexity is k
• Smooth version of the k in kNN
• Low perplexity corresponds to small sigma
• High perplexity corresponds to large sigma
• Typically, perplexity values between 5 and 50 work well
• Important parameter that can capture different scales in the data
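A small NumPy check of the relation between entropy and perplexity stated above; the example distributions are arbitrary:

```python
import numpy as np

def perplexity(p):
    """Perplexity 2^H(p) with base-2 entropy, ignoring zero entries."""
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))
    return 2.0 ** H

k = 8
uniform = np.full(k, 1.0 / k)
print(perplexity(uniform))      # equals k = 8 for a uniform distribution

peaked = np.array([0.97, 0.01, 0.01, 0.01])
print(perplexity(peaked))       # close to 1: an effective neighborhood of about one point
```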
SNE objective
• Given x1, …, xn ∈ Rm, define the distribution pij
• Goal: find a good embedding y1, …, yn ∈ Rk for k < m
• How do we measure the embedding quality?
• For the points y1, …, yn we can define a distribution q in the same way (but without a per-point sigma)
• The idea is to optimize q to be close to p by minimizing the KL-divergence
• The embeddings y1, …, yn are the parameters we are optimizing
KL-divergence
• Measures the distance between two distributions, P and Q
• It is not a metric function, as it is not symmetric
• Based on the information theory intuition: if we are transmitting information distributed according to p, then the optimal lossless compression will need to send on average H(p) bits
• Thus, KL(P||Q) is the penalty for using the wrong distribution
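A small NumPy sketch of the KL-divergence and its asymmetry; the two distributions are arbitrary examples:

```python
import numpy as np

def kl(p, q):
    """KL(P||Q) = sum_i p_i * log(p_i / q_i), for strictly positive p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # different values: the KL-divergence is not symmetric
```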
Distances to Conditional Probabilities
• Converting the high-dimensional Euclidean distances into conditional probabilities that represent similarities
• Similarity of datapoints in the high dimension
• Similarity of datapoints in the low dimension
• Cost function
• Minimize C using gradient descent
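For reference, the standard definitions of these quantities in the SNE formulation of van der Maaten & Hinton (2008), cited in the references, are:
• High-dimensional similarity: p(j|i) = exp(−||xi − xj||² / 2σi²) / Σ_{k≠i} exp(−||xi − xk||² / 2σi²)
• Low-dimensional similarity (SNE, before the Student-t change): q(j|i) = exp(−||yi − yj||²) / Σ_{k≠i} exp(−||yi − yk||²)
• Cost function: C = Σi KL(Pi || Qi) = Σi Σj p(j|i) log( p(j|i) / q(j|i) )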


SNE problems
• Not a convex problem! No guarantees, can use multiple restarts.
• Crowding problem
• In high dim we have a lot of different neighbors
• In 2 dimensions we have few neighbors at the same distance and far from
each other
• Thus, we do not have space to accommodate all neighbors

• t-SNE solution: change the Gaussian in Q to a heavy tailed distribution

Student-t Probability Density


The basic algorithm …
• Compute probabilities P that xi and xj are neighbors (based on the Euclidean distance in the high-d space).
• Key assumption: the high-d P and the low-d Q probability distributions should be the same.
• Find a low-d map that minimizes the difference between the P (high-d) and Q (low-d) distributions: if xi, xj have a high probability of being neighbors in high-d, then yi, yj should have a high probability in low-d.
• The difference between the high-d and low-d distributions is minimized using gradient descent.

Complete slides available here: https://kawahara.ca/visualizing-data-using-t-sne-slides/
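A hedged scikit-learn sketch of running t-SNE for visualization; the dataset and the parameter values (perplexity, initialization) are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# 2-d embedding for visualization; perplexity is the key neighborhood-size parameter
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Y = tsne.fit_transform(X)
print(Y.shape)   # (1797, 2)
```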
References
• Dimensionality Reduction. Appendix B. Introduction to Data Mining.
• Cox, T.F.; Cox, M.A.A. (2001). Multidimensional Scaling. Chapman and Hall.
• Maaten, L. van der, & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research (JMLR), 9, 2579–2605.
• Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290, 2319–2323.
