Dimensionality Reduction
Riccardo Guidotti
a.a. 2022/2023
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.
• Approaches can be divided into feature selection and feature projection.
[Figure: a dataset with ten features X1, …, X10 is reduced to a dataset with two derived features XA and XB]
Feature Selection
• Select a subset of the features according to different strategies:
• the filter strategy (e.g. information gain),
• the wrapper strategy (e.g. search guided by accuracy),
• the embedded strategy (features are added or removed while building the model, based on prediction errors).
• F-value = $\dfrac{\sum_{i=1}^{K} n_i (\bar{Y}_i - \bar{Y})^2 / (K-1)}{\sum_{i=1}^{K} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 / (N-K)}$
• where $\bar{Y}_i$ denotes the sample mean in the ith group, $n_i$ is the number of observations in the ith group, $\bar{Y}$ denotes the overall mean of the data, $Y_{ij}$ is the jth observation in the ith out of K groups, K denotes the number of groups, and N the overall sample size.
• F-value is large if the numerator is large, which is unlikely to happen if the population
means of the groups all have the same value.
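• A minimal sketch of the filter strategy based on the ANOVA F-value, assuming scikit-learn and using the Iris data purely as a stand-in (neither is part of the slides):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)
    # keep the 2 features with the largest F-value (filter strategy)
    selector = SelectKBest(score_func=f_classif, k=2)
    X_reduced = selector.fit_transform(X, y)
    print(selector.scores_)        # per-feature F-values
    print(selector.get_support())  # boolean mask of the kept features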
Recursive Feature Elimination (RFE)
• Given an external estimator that assigns weights to features (e.g., the
coefficients of a linear model, or feature importance of decision tree),
RFE selects features by recursively considering smaller and smaller
sets of features.
• First, the estimator is trained on the initial set of features and the
importance of each feature is obtained.
• Then, the least important features are pruned from the current set of
features.
• That procedure is recursively repeated on the pruned set until the
desired number of features to select is eventually reached.
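• A minimal sketch of RFE, assuming scikit-learn with a logistic regression as the external estimator and the breast-cancer data as a stand-in (these choices are illustrative, not from the slides):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    # external estimator whose coefficients provide the feature importances
    estimator = LogisticRegression(max_iter=5000)
    # recursively drop the least important feature until 5 remain
    rfe = RFE(estimator, n_features_to_select=5, step=1)
    X_reduced = rfe.fit_transform(X, y)
    print(rfe.ranking_)  # rank 1 marks the selected features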
Feature Projection (a.k.a. Feature Extraction)
• It transforms the data from the high-dimensional space to a space of
fewer dimensions.
• The data transformation may be linear or nonlinear.
• Approaches:
• Principal Component Analysis (PCA)
• Non-negative matrix factorization (NMF)
• Linear Discriminant Analysis (LDA)
• Multidimensional Scaling
• Sammon
• IsoMap
• t-SNE
• Autoencoder
Principal Component Analysis (PCA)
• The goal of PCA is to find a new set of
dimensions (attributes or features) that
better captures the variability of the data.
• The first dimension is chosen to capture as
much of the variability as possible.
• The second dimension is orthogonal to the
first and, subject to that constraint, captures
as much of the remaining variability as
possible, and so on.
PCA – Conceptual Algorithm
• Find a line such that, when the data is projected onto that line, the
projected data has maximum variance; equivalently, the line minimizes
the sum of squares of the projection errors.
PCA – Conceptual Algorithm
• Find a second line, orthogonal to the first, that has maximum
projected variance.
PCA – Conceptual Algorithm
• Repeat until k orthogonal lines have been found.
• The projected position of a point on these lines gives the coordinates
in the k-dimensional reduced space.
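• A minimal sketch of the conceptual algorithm via the covariance matrix and its eigenvectors, assuming NumPy and random toy data (the variable names and sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))     # toy data: 200 points, 10 dimensions

    # center the data and compute the covariance matrix
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)

    # eigenvectors of the covariance matrix are the orthogonal directions;
    # eigenvalues give the variance captured along each of them
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # sort by decreasing variance
    k = 2
    W = eigvecs[:, order[:k]]          # top-k principal directions

    X_reduced = Xc @ W                 # coordinates in the k-dimensional space
    print(X_reduced.shape, eigvals[order[:k]] / eigvals.sum())

• Projecting onto the top-k eigenvectors retains the most variance and, equivalently, minimizes the sum of squared projection errors.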
Background: Covariance, Eigenvalues and Eigenvectors
• Limits of PCA:
• Limited to linear projections
Partial Least Squares (PLS)
• Supervised alternative to PCA
• Attempts to find the set of orthogonal directions that explain both the
outcome and the features.
• First direction:
• Calculate a simple linear regression between each feature and the outcome.
• Use the coefficients to define the first direction, giving the greatest weight to
predictors that are highly correlated with the outcome (large coefficients).
• Repeat the procedure on the residuals of the predictors.
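• A minimal sketch with scikit-learn's PLSRegression on synthetic data where the outcome depends on two of the features (the data and names are illustrative, not from the slides):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 8))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    # two supervised directions chosen to explain both the features and the outcome
    pls = PLSRegression(n_components=2)
    pls.fit(X, y)
    X_reduced = pls.transform(X)  # scores on the PLS directions
    print(X_reduced.shape)        # (100, 2)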
Random Subspace Projection
• High-dimensional data is projected into a low-dimensional space using
a random matrix whose columns have unit length.
• No attempt is made to optimize a criterion.
• Preserves the structure of the data (e.g., distances).
• Computationally cheap.
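• A minimal sketch with NumPy, projecting onto a random matrix with unit-length columns and checking how well pairwise distances are preserved (the sizes are arbitrary):

    import numpy as np
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1000))   # high-dimensional data

    k = 50
    R = rng.normal(size=(1000, k))     # random matrix ...
    R /= np.linalg.norm(R, axis=0)     # ... with unit-length columns

    X_reduced = X @ R                  # computationally cheap linear projection

    # relative pairwise distances are roughly preserved (up to a constant factor)
    ratio = pdist(X_reduced) / pdist(X)
    print(ratio.mean(), ratio.std())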
Multi-Dimensional Scaling (MDS)
• Given a pairwise dissimilarity matrix (it does not need to be a metric),
the goal of MDS is to learn a mapping of the data into a lower
dimensionality such that the relative distances are preserved.
• If two points are close in the feature space, they should be close in the
latent factor space.
MDS methods
• MDS is a family of different algorithms designed to map data into a
very low-dimensional configuration, e.g., k=2 or k=3.
• MDS methods include
• Classical MDS
• Metric MDS
• Non-metric MDS
• Cost function
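• A minimal sketch of metric MDS with scikit-learn, starting from a precomputed dissimilarity matrix; the optimizer minimizes the stress cost function (the data and parameters are illustrative):

    import numpy as np
    from sklearn.manifold import MDS
    from sklearn.metrics import pairwise_distances

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))

    # any pairwise dissimilarity matrix can be used; Euclidean distances here
    D = pairwise_distances(X)

    # map the points into k=2 dimensions while preserving relative distances
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    X_embedded = mds.fit_transform(D)
    print(X_embedded.shape, mds.stress_)  # stress = value of the cost function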