
DATA MINING 2

Dimensionality Reduction
Riccardo Guidotti

a.a. 2022/2023
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.
• Approaches can be divided into feature selection and feature projection.
[Illustration: a table with ten original features X1–X10, mixing numeric and categorical values, reduced to two derived features XA and XB.]
Feature Selection
• Select a subset of the features according to different strategies:
• the filter strategy (e.g. information gain),
• the wrapper strategy (e.g. search guided by accuracy),
• the embedded strategy (features are added or removed while building the model, based on prediction errors).

• Classification and/or regression can often be done in the reduced space more accurately than in the original space.
Feature Selection
• Variance Threshold. It removes all features whose variance does not meet some
threshold. By default, it removes all zero-variance features, i.e. features that have the
same value in all samples.
• Univariate Feature Selection. It selects the best features based on univariate statistical tests. For instance, it keeps only the k highest scoring features. An example of such a statistical test is the ANOVA F-value between each feature and the label.

• F-value = [ Σi ni (Ȳi − Ȳ)² / (K − 1) ] / [ Σi Σj (Yij − Ȳi)² / (N − K) ]

• where Ȳi denotes the sample mean in the i-th group, ni is the number of observations in the i-th group, Ȳ denotes the overall mean of the data, Yij is the j-th observation in the i-th out of K groups, K denotes the number of groups, and N the overall sample size.
• The F-value is large if the numerator (between-group variability) is large relative to the denominator (within-group variability), which is unlikely to happen if the population means of the groups all have the same value.
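As an illustration, a minimal scikit-learn sketch of both selectors; the dataset and the parameter values (threshold, k) are arbitrary choices for the example, not prescribed by the slides:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Drop features whose variance is below an (arbitrary) threshold
vt = VarianceThreshold(threshold=0.2)
X_vt = vt.fit_transform(X)

# Keep the k=2 features with the highest ANOVA F-value w.r.t. the label
skb = SelectKBest(score_func=f_classif, k=2)
X_best = skb.fit_transform(X, y)

print(X.shape, X_vt.shape, X_best.shape)
```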
Recursive Feature Elimination (RFE)
• Given an external estimator that assigns weights to features (e.g., the
coefficients of a linear model, or feature importance of decision tree),
RFE selects features by recursively considering smaller and smaller
sets of features.
• First, the estimator is trained on the initial set of features and the
importance of each feature is obtained.
• Then, the least important features are pruned from the current set of features.
• That procedure is recursively repeated on the pruned set until the
desired number of features to select is eventually reached.
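A hedged scikit-learn sketch of the procedure; the choice of estimator, dataset and number of features to keep is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# External estimator whose coefficients provide the feature importances
estimator = LogisticRegression(max_iter=1000)

# Recursively drop the least important feature until 2 remain
rfe = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger ranks were eliminated earlier
```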
Feature Projection (a.k.a Feature Extraction)
• It transforms the data in the high-dimensional space to a space of
fewer dimensions.
• The data transformation may be linear, or nonlinear.
• Approaches:
• Principal Component Analysis (PCA)
• Non-negative matrix factorization (NMF)
• Linear Discriminant Analysis (LDA)
• Multidimensional Scaling
• Sammon
• IsoMap
• t-SNE
• Autoencoder
Principal Component Analysis (PCA)
• The goal of PCA is to find a new set of
dimensions (attributes or features) that
better captures the variability of the data.
• The first dimension is chosen to capture as
much of the variability as possible.
• The second dimension is orthogonal to the
first and, subject to that constraint, captures
as much of the remaining variability as
possible, and so on.
PCA – Conceptual Algorithm
• Find a line such that, when the data is projected onto that line, it has
the maximum variance; minimize the sum-of-squares of the
projection errors.
PCA – Conceptual Algorithm
• Find a second line, orthogonal to the first, that has maximum
projected variance.
PCA – Conceptual Algorithm
• Repeat until we have k orthogonal lines.
• The projected position of a point on these lines gives the coordinates
in the k-dimensional reduced space.
Background: Covariance, Eigenvalues and Eigenvectors
• The covariance of two attributes is a measure of how strongly the attributes vary together.
• Eigenvector of a matrix X: a vector v such that Xv = λv
• λ: eigenvalue of eigenvector v
• A symmetric matrix X of rank r has r orthonormal eigenvectors v1, v2, …, vr with eigenvalues λ1, λ2, …, λr.
• The eigenvectors define an orthonormal basis for the column space of X
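A quick NumPy check of the eigenvector definition above; the matrix values are an arbitrary example:

```python
import numpy as np

X = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # symmetric, so eigenvectors are orthonormal

eigvals, eigvecs = np.linalg.eigh(X)  # eigh handles symmetric matrices
v, lam = eigvecs[:, 0], eigvals[0]

print(np.allclose(X @ v, lam * v))                   # True: X v = lambda v
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))   # eigenvectors are orthonormal
```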
Steps in PCA
• Calculate the mean value of the data in every dimension
• Calculate the covariance matrix of all pairs of attributes:
• Given the data matrix X, remove the mean of each column from the column vectors to get the centered matrix C
• The matrix Σ = CᵀC is, up to a constant factor, the covariance matrix of the row vectors of X
• Calculate the eigenvalues and eigenvectors of Σ
• Methods: power iteration, Singular Value Decomposition
• The eigenvector with the largest eigenvalue λ1 is the 1st PC
• The eigenvector with the k-th largest eigenvalue λk is the k-th PC
• λk / Σi λi is the proportion of variance captured by the k-th PC
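A minimal NumPy sketch of these steps, assuming a data matrix X with one observation per row; the function name and the random data are illustrative:

```python
import numpy as np

def pca_eig(X, k):
    """Project the rows of X onto the top-k principal components."""
    C = X - X.mean(axis=0)                    # center each column
    cov = C.T @ C / (len(X) - 1)              # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: cov is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals[:k] / eigvals.sum()   # proportion of variance per PC
    return C @ eigvecs[:, :k], explained

X = np.random.RandomState(0).randn(100, 5)
Z, ratio = pca_eig(X, k=2)
print(Z.shape, ratio)
```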
Singular Value Decomposition - SVD
Applying the PCA
• The full set of PCs comprises a new orthogonal basis for the feature space, whose axes are aligned with the directions of maximum variance of the original data.
• Projecting the original data onto the first k PCs gives a reduced dimensionality representation of the data.
• Transforming the reduced dimensionality projection back into the original space gives a reduced dimensionality reconstruction of the data.
• The reconstruction will have some error.
Select the dimension k
• Rank the eigenvalues in decreasing order.
• Select enough eigenvectors to retain a fixed percentage of the variance (i.e., at least a chosen minimum threshold).
Example
• Iris Dataset
PCA via SVD
• Create mean-centered data matrix X
• Solve the SVD: X = USVᵀ
• The columns of V are the eigenvectors of Σ, sorted from the largest to the smallest eigenvalue.

• Limits of PCA:
• Limited to linear projections
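A corresponding NumPy sketch using SVD instead of an explicit eigendecomposition; the function name and data are illustrative:

```python
import numpy as np

def pca_svd(X, k):
    """PCA of the rows of X via SVD of the mean-centered matrix."""
    C = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(C, full_matrices=False)
    # Rows of Vt (columns of V) are the PCs, already sorted by decreasing
    # singular value; squared singular values are proportional to the eigenvalues.
    return C @ Vt[:k].T

X = np.random.RandomState(0).randn(100, 5)
print(pca_svd(X, k=2).shape)
```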
Partial Least Squares (PLS)
• Supervised alternative to PCA
• Attempts to find the set of orthogonal directions that explain both
outcome and features.
• First direction:
• Calculate a simple linear regression between each feature and the outcome
• Use the coefficients to define the first direction, giving the greatest weight to predictors that are highly correlated with the outcome (large coefficients)
• Repeat the procedure on the residuals of the predictors
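A hedged scikit-learn sketch of PLS as a supervised projection; the regression dataset and the number of components are arbitrary example choices:

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Find 2 directions that explain both the predictors and the outcome
pls = PLSRegression(n_components=2).fit(X, y)
X_reduced = pls.transform(X)   # projection onto the PLS directions
print(X_reduced.shape)
```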
Random Subspace Projection
• High-dimensional data is projected into low-dimensional space using
a random matrix whose columns have unit length.
• No attempt to optimize a criterion.
• Approximately preserves the structure of the data (e.g., pairwise distances).
• Computationally cheap.
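A minimal NumPy sketch of the idea described above; the dimensions and the random data are arbitrary:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 200)                 # high-dimensional data

k = 20
R = rng.randn(X.shape[1], k)             # random projection matrix
R /= np.linalg.norm(R, axis=0)           # normalize columns to unit length

X_low = X @ R                            # low-dimensional projection
print(X_low.shape)                       # (1000, 20)
```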
Multi-Dimensional Scaling (MDS)
• Given a pairwise dissimilarity matrix (it does not need to be a metric), the goal of MDS is to learn a mapping of the data into a lower dimensionality such that the relative distances are preserved.
• If two points are close in the feature space, they should also be close in the latent factor space.
MDS methods
• MDS is a family of different algorithms designed to map data into a very low-dimensional configuration, e.g., k = 2 or k = 3.
• MDS methods include
• Classical MDS
• Metric MDS
• Non-metric MDS

• MDS cannot be inverted
Distance, dissimilarity and similarity
• Distance, dissimilarity and similarity (or proximity) are defined for any pair of objects in any space. In mathematics, a distance function (one that gives a distance between two objects) is also called a metric, and satisfies:
• d(x, y) ≥ 0,
• d(x, y) = 0 iff x = y,
• d(x, y) = d(y, x),
• d(x, z) ≤ d(x, y) + d(y, z).
• If the last condition (the triangle inequality) does not hold, then d is a distance function but it is not a metric.
MDS – Conceptual Algorithm
• Given a pairwise dissimilarity matrix D and the dimensionality k, find
a mapping such that dij = ||xi - xj|| for all points in D.
• Usually, a gradient descent approach is adopted to solve an
optimization problem that aims at minimizing the function
• J(x) = Σi Σj d1(dij, d2(xi, xj))
• Depending on the distances adopted to calculate D and the distance functions used for d1 and d2, the approach returns a different result.
• Classic MDS adopts the Euclidean distance for every calculation.
• Metric MDS adopts arbitrary metrics as distances.
• Non-metric MDS deals with the ranks of the distances instead of their values.
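As an illustration, a hedged scikit-learn sketch that fits metric MDS on a precomputed dissimilarity matrix; the data and parameter values are arbitrary:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
X = rng.randn(50, 10)                    # original feature space
D = pairwise_distances(X)                # pairwise dissimilarity matrix

# Metric MDS: find 2-d coordinates whose distances approximate D
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Y = mds.fit_transform(D)
print(Y.shape, mds.stress_)              # stress_ = value of the minimized objective
```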
Sammon Mapping
• Sammon mapping is a generalization of the usual metric MDS.
• It introduces a weighting system that normalizes the squared-errors in
pairwise distances by using the distance in the original space.
• J(x) = Σi Σj d1(dij, d2(xi, xj)) / dij
• As a result, Sammon mapping preserves the small dij, giving them a greater degree of importance in the fitting procedure than larger values of dij.
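A minimal NumPy sketch of the Sammon stress for a candidate embedding (in its standard normalized form); this only evaluates the weighted objective described above, it is not an optimizer, and the data are arbitrary:

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_high, Y_low):
    """Sum of squared distance errors, each normalized by the original distance."""
    d_high = pdist(X_high)            # pairwise distances in the original space
    d_low = pdist(Y_low)              # pairwise distances in the embedding
    return np.sum((d_high - d_low) ** 2 / d_high) / np.sum(d_high)

rng = np.random.RandomState(0)
X = rng.randn(30, 10)
Y = rng.randn(30, 2)                  # a (random) candidate 2-d embedding
print(sammon_stress(X, Y))
```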
Classic-MDS vs Sammon Mapping
• Sammon mapping better preserves inter-distances for smaller
dissimilarities, while proportionally squeezes the inter-distances for
larger dissimilarities.
Isometric Feature Mapping (IsoMap)
• Preserves the intrinsic geometry of the data
• Uses the geodesic manifold distances between all pairs.
• It is an MDS method.
• IsoMap handles non-linear manifolds.
IsoMap Algorithm
• Step 1
• Determine the neighboring points within a fixed radius, based on the input space distance (Euclidean)
• These neighborhood relations are represented as a weighted graph G over the data points
• Step 2
• Estimate the geodesic distance between all pairs of points on the manifold by computing their shortest path distances on the graph G
• Step 3
• Construct an embedding of the data in a k-dimensional Euclidean space that best preserves the manifold geometry
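A minimal scikit-learn sketch on a classic non-linear manifold; note that it uses the k-nearest-neighbors variant of Step 1 rather than a fixed radius, and the parameter values are illustrative:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# A 2-d sheet rolled up in 3-d: a standard non-linear manifold example
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Step 1 here uses k nearest neighbors instead of a fixed radius
iso = Isomap(n_neighbors=10, n_components=2)
X_unrolled = iso.fit_transform(X)
print(X_unrolled.shape)   # (1000, 2)
```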
t-Distributed Stochastic Neighbor Embedding (t-SNE)

• PCA tries to find a global structure
• Low dimensional subspace
• Can lead to local inconsistencies
• Far away points can become neighbors
• t-SNE tries to preserve local structure
• The low-dimensional neighborhood should be the same as the original neighborhood
• Distance preservation
• Neighbor preservation
• Unlike PCA, it is almost only used for visualization
PCA vs t-SNE
SNE Intuition
• Measure pairwise similarities between high-dimensional and low-
dimensional objects.
Stochastic Neighbor Embedding (SNE)
• Encode high dimensional neighborhood information as a distribution
• Intuition: Random walk between data points.
• High probability to jump to a close point
• Find low dimensional points such that their neighborhood
distribution is similar.
• How do you measure distance between distributions?
• Most common measure: KL divergence
Neighborhood Distributions
• Consider the neighborhood around an input data point xi
• Imagine that we have a Gaussian distribution centered around xi
• Then the probability that xi chooses some other datapoint xj as its neighbor is proportional to the density under this Gaussian
• A point closer to xi will be more likely than one further away
Probabilities
• This pj|i is the probability that point xi chooses xj as its neighbor
• The parameter sigma sets the size of the neighborhood
• Very low sigma: all the probability is on the nearest neighbor
• Very high sigma: uniform weights
• We set sigma differently for each data point
• Results depend heavily on sigma, as it defines the neighborhood we are trying to preserve
• The final distribution over pairs is symmetrized: pij = (pi|j + pj|i) / (2N)
Perplexity
• For each distribution pj|i (which depends on sigma) we define the perplexity
• perp(pj|i) = 2^H(pj|i), where H(p) = − Σ p log2(p) is the entropy
• If p is uniform over k elements, the perplexity is k
• Smooth version of the k in kNN
• Low perplexity corresponds to small sigma
• High perplexity corresponds to large sigma
• Typically, perplexity values between 5 and 50 work well
• Important parameter that can capture different scales in the data
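A small NumPy check of the relation between entropy and perplexity stated above; the example distributions are arbitrary:

```python
import numpy as np

def perplexity(p):
    """Perplexity 2^H(p) with base-2 entropy, ignoring zero entries."""
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))
    return 2.0 ** H

k = 8
uniform = np.full(k, 1.0 / k)
print(perplexity(uniform))      # equals k = 8 for a uniform distribution

peaked = np.array([0.97, 0.01, 0.01, 0.01])
print(perplexity(peaked))       # close to 1: an effective neighborhood of about one point
```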
SNE objective
• Given x1, …, xn ∈ Rm, define the distribution pij
• Goal: find a good embedding y1, …, yn ∈ Rk for k < m
• How do we measure the embedding quality?
• For the points y1, …, yn we can define a distribution q in the same way (but without a per-point sigma)
• The idea is to optimize q to be close to p by minimizing the KL-divergence
• The embeddings y1, …, yn are the parameters we are optimizing
KL-divergence
• Measures the distance between two distributions, P and Q
• It is not a metric function, as it is not symmetric
• Based on the information theory intuition: if we are transmitting information distributed according to p, then the optimal lossless compression will need to send on average H(p) bits
• Thus, KL(P||Q) is the penalty for using the wrong distribution
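A small NumPy sketch of the KL-divergence and its asymmetry; the two distributions are arbitrary examples:

```python
import numpy as np

def kl(p, q):
    """KL(P||Q) = sum_i p_i * log(p_i / q_i), for strictly positive p and q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p))   # different values: the KL-divergence is not symmetric
```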
Distances to Conditional Probabilities
• Converting the high-dimensional Euclidean distances into conditional probabilities that represent similarities
• Similarity of datapoints in the high dimension
• Similarity of datapoints in the low dimension
• Cost function
• Minimize C using gradient descent
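For reference, the standard definitions of these quantities in the SNE formulation of van der Maaten & Hinton (2008), cited in the references, are:
• High-dimensional similarity: p(j|i) = exp(−||xi − xj||² / 2σi²) / Σ_{k≠i} exp(−||xi − xk||² / 2σi²)
• Low-dimensional similarity (SNE, before the Student-t change): q(j|i) = exp(−||yi − yj||²) / Σ_{k≠i} exp(−||yi − yk||²)
• Cost function: C = Σi KL(Pi || Qi) = Σi Σj p(j|i) log( p(j|i) / q(j|i) )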


SNE problems
• Not a convex problem! No guarantees, can use multiple restarts.
• Crowding problem
• In high dim we have a lot of different neighbors
• In 2 dimensions we have few neighbors at the same distance and far from
each other
• Thus, we do not have space to accommodate all neighbors

• t-SNE solution: change the Gaussian in Q to a heavy tailed distribution

Student-t Probability Density


The basic algorithm …
• Compute probabilities P that xi and xj are neighbors (based on the Euclidean distance in the high-d space).
• Key assumption: the high-d P and the low-d Q probability distributions should be the same.
• Find a low-d map that minimizes the difference between the P (high-d) and Q (low-d) distributions: if xi, xj have a high probability of being neighbors in high-d, then yi, yj should have a high probability in low-d.
• The difference between the high-d and low-d distributions is minimized using gradient descent.

Complete slides available here: https://kawahara.ca/visualizing-data-using-t-sne-slides/
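A hedged scikit-learn sketch of running t-SNE for visualization; the dataset and the parameter values (perplexity, initialization) are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# 2-d embedding for visualization; perplexity is the key neighborhood-size parameter
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Y = tsne.fit_transform(X)
print(Y.shape)   # (1797, 2)
```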
References
• Dimensionality Reduction. Appendix B. Introduction to Data Mining.
• Cox, T.F.; Cox, M.A.A. (2001). Multidimensional Scaling. Chapman and Hall.
• Maaten, L. van der, & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research (JMLR), 9, 2579–2605.
• Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290, 2319–2323.
