Dimensionality Reduction (Unit-5)
Dr. H C Vijayalakshmi

Reference
1. https://www.gatevidyalay.com/tag/principal-component-analysis-numerical-example/
Eigenvalues and Eigenvectors
2 × 2 Example

A = \begin{pmatrix} 1 & -2 \\ 3 & -4 \end{pmatrix}, so A - \lambda I = \begin{pmatrix} 1-\lambda & -2 \\ 3 & -4-\lambda \end{pmatrix}

Setting the characteristic polynomial λ² + 3λ + 2 to 0 gives the eigenvalues λ = −1 and λ = −2.
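As a quick numerical check of the 2 × 2 example above, here is a minimal NumPy sketch; np.linalg.eig returns the eigenvalues together with unit-norm eigenvectors.

```python
import numpy as np

# Matrix from the 2 x 2 example above
A = np.array([[1.0, -2.0],
              [3.0, -4.0]])

# Characteristic polynomial: lambda^2 + 3*lambda + 2, so we expect -1 and -2
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # approximately [-1. -2.] (order may vary)
print(eigenvectors)   # columns are the corresponding unit-norm eigenvectors
```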
Example 1: Eigenvalues and Eigenvectors
Solution
Let us first derive the characteristic polynomial of A and solve |A − λI2| = 0; the eigenvalues are λ = 2 and λ = −1.
For λ = 2, we solve (A − 2I2)x = 0 and get x1 = −x2. The solutions to this system of equations are x1 = −r, x2 = r, where r is a scalar. Thus the eigenvectors of A corresponding to λ = 2 are the nonzero vectors of the form r(−1, 1)ᵀ.
For λ = −1
We solve the equation (A + 1I2)x = 0 for x. The matrix A + 1I2 is obtained by adding 1 to the diagonal elements of A. We get x1 = −2x2. The solutions to this system of equations are x1 = −2s and x2 = s, where s is a scalar. Thus the eigenvectors of A corresponding to λ = −1 are the nonzero vectors of the form s(−2, 1)ᵀ.
Example 2: Eigenvalues and Eigenvectors
Solution
The matrix A − λI3 is obtained by subtracting λ from the diagonal elements of A. The characteristic polynomial of A is |A − λI3|. Using row and column operations to simplify the determinant, we obtain and then solve the characteristic equation of A; its roots are λ1 = 10 and λ2 = 1.
• λ1 = 10
Let λ = 10 in (A − λI3)x = 0. The solutions to this system of equations are x1 = 2r, x2 = 2r, and x3 = r, where r is a scalar. Thus the eigenspace of λ1 = 10 is the one-dimensional space of vectors of the form r(2, 2, 1)ᵀ.
• λ2 = 1
Let λ = 1 in (A − λI3)x = 0. The solutions to this system of equations can be shown to be x1 = −s − t, x2 = s, and x3 = 2t, where s and t are scalars. Thus the eigenspace of λ2 = 1 is the two-dimensional space of vectors of the form (−s − t, s, 2t)ᵀ. Separating the parameters s and t, we can write this as s(−1, 1, 0)ᵀ + t(−1, 0, 2)ᵀ.
If an eigenvalue occurs as a root of the characteristic equation repeated k times, we say that it is of multiplicity k. Thus λ = 10 has multiplicity 1, while λ = 1 has multiplicity 2 in this example.
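The computations of Example 2 can be reproduced numerically. The matrix itself is not reproduced in these notes, so the sketch below uses an illustrative symmetric matrix that matches the eigendata stated above (eigenvalue 10 with eigenvector (2, 2, 1)ᵀ and eigenvalue 1 with a two-dimensional eigenspace); treat the specific entries as an assumption.

```python
import numpy as np

# Illustrative matrix (assumption): it has eigenvalue 10 with eigenvector (2, 2, 1)^T
# and eigenvalue 1 with a two-dimensional eigenspace, matching Example 2's eigendata.
A = np.array([[5.0, 4.0, 2.0],
              [4.0, 5.0, 2.0],
              [2.0, 2.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A)   # eigh: A is symmetric, ascending order
print(np.round(eigenvalues, 6))                 # [ 1.  1. 10.] -> eigenvalue 1 has multiplicity 2

# Verify that (2, 2, 1)^T is an eigenvector for eigenvalue 10
v = np.array([2.0, 2.0, 1.0])
print(A @ v)                                    # [20. 20. 10.] = 10 * v
```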
Eigenvalues associated with AᵀA: λ = 0, 1 and 3
What is singular value decomposition? Explain with an example.
• The singular value decomposition of a matrix A is the factorization of A into the product of three matrices, A = UDVᵀ, where the columns of U and V are orthonormal and the matrix D is diagonal with positive real entries. The SVD is useful in many tasks.
• Calculating the SVD consists of finding the eigenvalues and eigenvectors of AAᵀ and AᵀA.
• The eigenvectors of AᵀA make up the columns of V, and the eigenvectors of AAᵀ make up the columns of U.
• Also, the singular values in D are the square roots of the eigenvalues of AAᵀ or AᵀA.
• The singular values are the diagonal entries of the D matrix and are arranged in
descending order. The singular values are always real numbers.
• If the matrix A is a real matrix, then U and V are also real.
Writing the decomposition as A = UWVᵀ, where:
• U: an m × r matrix whose columns are the orthonormal eigenvectors of AAᵀ.
• Vᵀ: the transpose of an n × r matrix V whose columns are the orthonormal eigenvectors of AᵀA.
• W: an r × r diagonal matrix of the singular values, which are the square roots of the eigenvalues of AAᵀ and AᵀA.
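A minimal NumPy sketch of this factorization; np.linalg.svd returns U, the singular values (in descending order) and Vᵀ, and the small matrix used here is just an illustrative assumption.

```python
import numpy as np

# Illustrative 3 x 2 matrix (assumption)
A = np.array([[2.0, 0.0],
              [1.0, 3.0],
              [0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s holds the singular values, descending

# Singular values are the square roots of the eigenvalues of A^T A
print(np.round(s**2, 6))
print(np.round(np.linalg.eigvalsh(A.T @ A)[::-1], 6))   # same values, reordered descending

# Reconstruct A from U * diag(s) * V^T
print(np.allclose(A, U @ np.diag(s) @ Vt))              # True
```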
Find the singular values of the matrix.
To calculate the SVD, we first compute the singular values by finding the eigenvalues of AAᵀ.
Now we find the right singular vectors, i.e. an orthonormal set of eigenvectors of AᵀA. The eigenvalues of AᵀA are 25, 9, and 0, and since AᵀA is symmetric we know that the eigenvectors will be orthogonal.
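These numbers can be checked with a short sketch, assuming a hypothetical 2 × 3 matrix whose AᵀA has eigenvalues 25, 9 and 0 (the matrix on the original slide is not reproduced here, so this choice is an assumption).

```python
import numpy as np

# Hypothetical matrix (assumption): its A^T A has eigenvalues 25, 9 and 0
A = np.array([[3.0, 2.0,  2.0],
              [2.0, 3.0, -2.0]])

AtA = A.T @ A
print(np.round(np.linalg.eigvalsh(AtA), 6))               # [ 0.  9. 25.] (ascending)

# The singular values are the square roots of the nonzero eigenvalues: 5 and 3
print(np.round(np.linalg.svd(A, compute_uv=False), 6))    # [5. 3.]
```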
In general, suppose Ax = λx for some scalar λ. Then the scalar λ is called an eigenvalue of A, and x is said to be an eigenvector of A corresponding to λ. So, to find the required eigenvalues, we compute the matrices AAᵀ and AᵀA. As previously stated, the eigenvectors of AAᵀ make up the columns of U, so we can do the following analysis to find U.
Now that we have the n × n matrix W = AAᵀ, we can determine its eigenvalues.
Since Wx = λx, we have (W − λI)x = 0.
For a non-trivial solution x, the determinant of the matrix (W − λI) must be equal to zero. Thus, from the solution of the characteristic equation |W − λI| = 0, we obtain:
λ = 0, λ = 0, λ = 15 + √221 ≈ 29.87, and λ = 15 − √221 ≈ 0.13 (four eigenvalues, since the characteristic polynomial is of degree four).
Each eigenvalue determines an eigenvector that is placed in a column of U. Substituting the smaller nonzero eigenvalue, λ ≈ 0.13, we obtain the following equations:
• 19.87 x1 + 14 x2 = 0
• 14 x1 + 9.87 x2 = 0
• x3 = 0
• x4 = 0
Upon simplifying the first two equations we obtain a ratio relating x1 to x2; the values of x1 and x2 are then scaled to give a unit-length vector (the diagonal entries of S are the square roots of the eigenvalues). Thus a solution that satisfies the above equations is x1 = −0.58, x2 = 0.82, and x3 = x4 = 0 (this is the second column of the U matrix).
Substituting the other nonzero eigenvalue, λ ≈ 29.87, we obtain:
−9.87 x1 + 14 x2 = 0
14 x1 − 19.87 x2 = 0
x3 = 0
x4 = 0
Thus a solution that satisfies this set of equations is x1 = 0.82, x2 = 0.58, and x3 = x4 = 0 (this is the first column of the U matrix). Combining these columns we obtain U.
Similarly, the eigenvectors of AᵀA make up the columns of V, so we can do a similar analysis to find V. Finally, as mentioned previously, S holds the square roots of the eigenvalues of AAᵀ (or AᵀA) on its diagonal and can be obtained directly.
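The whole procedure can be sketched in NumPy. The 4 × 2 matrix below is a hypothetical choice whose AAᵀ contains the 2 × 2 block [[20, 14], [14, 10]] implied by the equations above; it is an assumption, not necessarily the matrix on the original slide.

```python
import numpy as np

# Hypothetical 4 x 2 matrix (assumption): AA^T contains the block [[20, 14], [14, 10]]
A = np.array([[2.0, 4.0],
              [1.0, 3.0],
              [0.0, 0.0],
              [0.0, 0.0]])

# Eigen-decompose W = AA^T; its eigenvectors form the columns of U.
W = A @ A.T
vals, U = np.linalg.eigh(W)                 # eigh: W is symmetric, eigenvalues ascending
order = np.argsort(vals)[::-1]              # reorder to descending
vals, U = vals[order], U[:, order]
print(np.round(vals, 3))                    # approx [29.87, 0.13, 0, 0]

# Singular values are the square roots of the nonzero eigenvalues.
s = np.sqrt(vals[:2])

# Columns of V, chosen consistently with U: v_i = A^T u_i / sigma_i.
V = (A.T @ U[:, :2]) / s
print(np.allclose(A, U[:, :2] @ np.diag(s) @ V.T))        # True: A = U S V^T

# Cross-check with NumPy's built-in SVD (signs of columns may differ).
print(np.round(np.linalg.svd(A, compute_uv=False), 3))    # same singular values
```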
Feature extraction is about extracting/deriving information from the original feature set to create a new feature subspace. The primary idea behind feature extraction is to compress the data while retaining most of the relevant information. As with feature selection techniques, these techniques are used to reduce the number of features from the original feature set in order to reduce model complexity and overfitting, enhance computational efficiency, and reduce generalization error.
The following are different types of feature extraction techniques:
PCA- Principal Component Analysis
LDA - Linear Discriminant Analysis
The key difference between feature selection and feature extraction techniques used for dimensionality
reduction is that while the original features are maintained in the case of feature selection algorithms, the
feature extraction algorithms transform the data onto a new feature space.
Feature selection techniques can be used if the requirement is to maintain the original features, unlike the
feature extraction techniques which derive useful information from data to construct a new feature subspace.
Feature selection techniques are used when model explainability is a key requirement.
● Feature extraction and feature engineering: transformation of raw data into features
suitable for modeling;
● Feature transformation: transformation of data to improve the accuracy of the
algorithm;
● Feature selection: removing unnecessary features. Feature selection is applied either to remove redundancy and/or irrelevancy in the features, or simply to limit the number of features in order to prevent overfitting.
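To make the selection-versus-extraction distinction concrete, here is a minimal scikit-learn sketch; the dataset and the choices of k and n_components are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)            # 4 original features

# Feature selection: keep 2 of the original features (they stay interpretable)
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)                      # (150, 2) -> a subset of the original columns

# Feature extraction: build 2 new features as linear combinations of all 4
pca = PCA(n_components=2)
X_extracted = pca.fit_transform(X)
print(X_extracted.shape)                     # (150, 2) -> a new, transformed feature space
```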
Dimensionality Reduction
The number of input variables or features for a dataset is referred to as its dimensionality.
Dimensionality reduction refers to techniques that reduce the number of input variables in a training
dataset.
More input features often make a predictive modeling task more challenging; this difficulty is generally referred to as the curse of dimensionality.
• Having a large number of dimensions in the feature space can mean that the volume of that space is
very large, and in turn, the points that we have in that space (rows of data) often represent a small
and non-representative sample.
• This can dramatically impact the performance of machine learning algorithms fit on data with many
input features, generally referred to as the curse of dimensionality
• Therefore, it is often desirable to reduce the number of input features.
• This reduces the number of dimensions of the feature space, hence the name “dimensionality
reduction.”
Dimensionality Reduction
• The fundamental reason for the curse of dimensionality is that high-dimensional
functions have the potential to be much more complicated than low-dimensional
ones, and that those complications are harder to discern.
• The only way to beat the curse is to incorporate knowledge about the data that is
correct.
• Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce
the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still
contains most of the information in the large set.
• Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the
trick in dimensionality reduction is to trade a little accuracy for simplicity.
• This is because smaller data sets are easier to explore and visualize, and they make analyzing the data much easier and faster for machine learning algorithms, since there are no extraneous variables to process.
• So to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while preserving
as much information as possible.
• The steps followed in principal components analysis are the following:
• Subtract mean.
• Calculate the covariance matrix.
• Calculate eigenvectors and eigenvalues.
• Select principal components.
• Reduce the data dimension.
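A compact NumPy sketch of these five steps; the small 2-D data matrix is an illustrative assumption.

```python
import numpy as np

# Illustrative data: 5 samples (rows) and 2 variables (columns)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# 1. Subtract the mean of each variable
X_centered = X - X.mean(axis=0)

# 2. Calculate the covariance matrix
cov = np.cov(X_centered, rowvar=False)

# 3. Calculate eigenvectors and eigenvalues of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)        # ascending order

# 4. Select principal components: sort by eigenvalue, keep the top k
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
k = 1
components = eigenvectors[:, :k]

# 5. Reduce the data dimension by projecting onto the kept components
X_reduced = X_centered @ components
print(X_reduced.shape)                                 # (5, 1)
```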
What do the covariances that we have as entries of the matrix tell us about the
correlations between the variables?
• if positive then: the two variables increase or decrease together (directly correlated)
• if negative then: one increases when the other decreases (inversely correlated)
• Now that we know that the covariance matrix is no more than a table that summarizes the correlations between all the possible pairs of variables, let's move to the next step.
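A quick illustration of the sign interpretation with two toy variables (an assumption for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1             # moves with x
y_down = -3 * x + 10         # moves against x

print(np.cov(x, y_up)[0, 1])     # positive covariance: directly correlated
print(np.cov(x, y_down)[0, 1])   # negative covariance: inversely correlated
```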
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute
from the covariance matrix in order to determine the principal components of the data.
Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables.
These combinations are done in such a way that the new variables (i.e., principal
components) are uncorrelated and most of the information within the initial variables is
squeezed or compressed into the first components.
So, the idea is 10-dimensional data gives you 10 principal components, but PCA tries to put
maximum possible information in the first component.
Then maximum remaining information in the second, and so on; the decreasing contributions of the successive components are what a scree plot visualizes.
• There are as many principal components as there are variables in the data; the principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set.
If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the eigenvector
that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the
second component (PC2) is v2.
After having the principal components, to compute the percentage of variance (information)
accounted for by each component, we divide the eigenvalue of each component by the sum
of eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry
respectively 96% and 4% of the variance of the data.
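In code, this percentage-of-variance computation is a one-liner; the two eigenvalues below are assumed values chosen to give roughly 96% and 4%.

```python
import numpy as np

# Assumed eigenvalues of the covariance matrix, with lambda1 > lambda2
eigenvalues = np.array([1.284, 0.049])

explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(np.round(explained_variance_ratio * 100, 1))   # approx [96.3  3.7]
```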
• As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. In this step, we choose whether to keep all of these components or discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.
• So, the feature vector is simply a matrix that has as columns the eigenvectors of the
components that we decide to keep. This makes it the first step towards
dimensionality reduction, because if we choose to keep only p eigenvectors
(components) out of n, the final data set will have only p dimensions.
Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only, as sketched below.
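A short sketch of the two choices; the eigenvector values and the mean-centred data are illustrative assumptions.

```python
import numpy as np

# Assumed eigenvectors of the 2 x 2 covariance matrix, ordered by decreasing eigenvalue
v1 = np.array([0.678, 0.735])
v2 = np.array([-0.735, 0.678])

feature_vector_full = np.column_stack([v1, v2])   # keep both components (no reduction)
feature_vector_reduced = v1.reshape(-1, 1)        # keep v1 only (2 dimensions -> 1)

# Illustrative mean-centred data: 4 samples, 2 variables (assumption)
X_centered = np.array([[0.69, 0.49], [-1.31, -1.21], [0.39, 0.99], [0.09, 0.29]])

X_new = X_centered @ feature_vector_reduced
print(X_new.shape)     # (4, 1): each sample is now described by its PC1 score alone
```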
The goal of LDA is to project the features in a higher-dimensional space onto a lower-dimensional space in order to avoid the curse of dimensionality and also to reduce resource and computational costs.
The original technique was developed in the year 1936 by Ronald A. Fisher and was
named Linear Discriminant or Fisher's Discriminant Analysis. The original Linear
Discriminant was described as a two-class technique. The multi-class version was later generalized by C. R. Rao as Multiple Discriminant Analysis. These are all commonly referred to simply as Linear Discriminant Analysis.
A Brief Introduction to Linear Discriminant Analysis
• Scatter matrix: used to make estimates of the covariance matrix. It is an m × m positive semi-definite matrix, given by: sample variance × number of samples.
• Note: scatter and variance measure the same thing but on different scales, so the two words are sometimes used interchangeably; do not get confused.
Here we will be dealing with two types of scatter matrices:
• Between-class scatter, Sb: measures the distance between the class means.
• Within-class scatter, Sw: measures the spread around the mean of each class.
• Step 1 - Computing the within-class and between-class scatter matrices.
• Step 2 - Computing the eigenvectors and their corresponding eigenvalues
for the scatter matrices.
• Step 3 - Sorting the eigenvalues and selecting the top k.
• Step 4 - Creating a new matrix that will contain the eigenvectors mapped
to the k eigenvalues.
• Step 5 - Obtaining new features by taking the dot product of the data and
the matrix from Step 4.
Within-class scatter matrix
To calculate the within-class scatter matrix, you can use the following expression:

S_W = \sum_{i=1}^{c} S_i

where

S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T

and m_i is the mean vector of class i:

m_i = \frac{1}{n_i} \sum_{x \in D_i} x
Then solve the generalized eigenvalue problem for the matrix S_W^{-1} S_B to obtain the linear discriminants.
We will sort the eigenvalues from the highest to the lowest, since the eigenvalues with the highest values carry the most information about the distribution of the data. Next, we will keep the first k eigenvectors. Finally, we will place the eigenvalues in a temporary array to make sure the eigenvalues map to the same eigenvectors after the sorting is done.
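A minimal NumPy sketch of Steps 1 to 5; the toy two-class data set is an illustrative assumption.

```python
import numpy as np

# Illustrative two-class data (assumption): rows are samples, columns are features
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0],
              [9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

overall_mean = X.mean(axis=0)
n_features = X.shape[1]

# Step 1: within-class (Sw) and between-class (Sb) scatter matrices
Sw = np.zeros((n_features, n_features))
Sb = np.zeros((n_features, n_features))
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    diff = (mc - overall_mean).reshape(-1, 1)
    Sb += Xc.shape[0] * (diff @ diff.T)

# Step 2: eigenvalues and eigenvectors of Sw^-1 Sb
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)

# Step 3: sort the eigenvalues (descending) and select the top k
order = np.argsort(eigvals.real)[::-1]
k = 1

# Step 4: matrix containing the eigenvectors mapped to the k largest eigenvalues
W = eigvecs.real[:, order[:k]]

# Step 5: new features = dot product of the data and the matrix from Step 4
X_lda = X @ W
print(X_lda.shape)     # (10, 1)
```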
LDA vs. PCA: Linear Discriminant Analysis is very similar to PCA in that both look for linear combinations of the features which best explain the data.
The main difference is that Linear Discriminant Analysis is a supervised dimensionality reduction technique that also achieves classification of the data simultaneously.
LDA focuses on finding a feature subspace that maximizes the separability between the
groups.
PCA focuses on capturing the direction of maximum variation in the data set.