Dimensionality Reduction Unit-5

Dr. H C Vijayalakshmi

Reference
1. https://www.gatevidyalay.com/tag/principal-component-analysis-numerical-example/
Eigen Values and Eigen Vectors

The eigenvectors x and eigenvalues λ of a matrix A satisfy

Ax = λx

If A is an n x n matrix, then x is an n x 1 vector, and λ is a constant.

The equation can be rewritten as (A - λI) x = 0, where I is the n x n identity matrix.

2 x 2 Example

A = | 1  -2 |      so      A - λI = | 1-λ    -2  |
    | 3  -4 |                       |  3   -4-λ  |

det(A - λI) = (1 - λ)(-4 - λ) – (3)(-2)
            = λ² + 3λ + 2

Setting λ² + 3λ + 2 = 0 gives

λ = (-3 ± sqrt(9 - 8)) / 2

so the two eigenvalues are λ = -1 and λ = -2.
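As a quick cross-check (not part of the original slides), the same result can be reproduced with NumPy; this is a minimal sketch using the 2 x 2 matrix A from the example above.

import numpy as np

# The 2 x 2 matrix from the example above
A = np.array([[1.0, -2.0],
              [3.0, -4.0]])

# np.linalg.eig returns the eigenvalues and the (column) eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)       # expected: -1 and -2 (order may differ)
print(eigenvectors)      # each column is a unit-length eigenvector

# Verify A x = λ x for each eigenvalue/eigenvector pair
for lam, x in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ x, lam * x)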


4
Find the eigenvalues and eigenvectors of the matrix

Solution
Let us first derive the characteristic polynomial of A.
We get

We now solve the characteristic equation of A.

The eigenvalues of A are 2 and –1.


The corresponding eigenvectors are found by using these values of λ in the equation (A – λI2)x = 0.
There are many eigenvectors corresponding to each eigenvalue.
For λ = 2
We solve the equation (A – 2I2)x = 0 for x.
The matrix (A – 2I2) is obtained by subtracting 2 from the diagonal elements of A.
We get

This leads to the system of equations

giving x1 = –x2. The solutions to this system of equations are x1 = –r, x2 = r, where r is a scalar.
Thus the eigenvectors of A corresponding to λ = 2 are nonzero vectors of the form
For λ = –1
We solve the equation (A + 1I2)x = 0 for x.
The matrix (A + 1I2) is obtained by adding 1 to the diagonal elements of A. We get

This leads to the system of equations

Thus x1 = –2x2. The solutions to this system of equations are x1 = –2s and x2 = s, where s is a
scalar. Thus the eigenvectors of A corresponding to λ = –1 are nonzero vectors of the form
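The matrix of this example appears only as an image in the original slides. As a hedged illustration, a matrix consistent with the results above (eigenvalues 2 and -1, with eigenvectors proportional to (-1, 1) and (-2, 1)) is A = [[-4, -6], [3, 5]]; a short NumPy check with this assumed matrix reproduces them.

import numpy as np

# Hypothetical matrix, chosen only to be consistent with the worked results above
A = np.array([[-4.0, -6.0],
              [ 3.0,  5.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # approximately 2 and -1
print(eigenvectors)   # columns proportional to (-1, 1) and (-2, 1)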

Example 2: Eigenvalues and Eigenvectors

Find the eigenvalues of the given matrix.

Solution:

What is the eigenvector of the matrix at λ = 1?

Multiply the 3rd equation by -5 and add it to the 1st equation to eliminate one unknown.

Divide the 2nd equation through and simplify using the known result.

Story so far:
We can obtain a normalized eigenvector using:
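The formula itself appears as an image in the original; assuming the usual Euclidean normalization, the sketch below divides an eigenvector by its length.

import numpy as np

x = np.array([1.0, 2.0, 2.0])          # an illustrative (unnormalized) eigenvector
x_unit = x / np.linalg.norm(x)         # divide by its Euclidean length
print(x_unit, np.linalg.norm(x_unit))  # unit-length vector, norm = 1.0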


Example 3: Eigenvalues and Eigenvectors
Find the eigenvalues and eigenvectors of the matrix

Solution
The matrix A – λI3 is obtained by subtracting λ from the diagonal elements of A. Thus

The characteristic polynomial of A is |A – λI3|. Using row and column operations to simplify
determinants, we get

We now solve the characteristic equation of A:

The eigenvalues of A are 10 and 1.


The corresponding eigenvectors are found by using these values of λ in the equation (A – λI3)x = 0.
• λ1 = 10
We get

The solutions to this system of equations are x1 = 2r, x2 = 2r, and x3 = r, where r is a scalar.
Thus the eigenspace of λ1 = 10 is the one-dimensional space of vectors of the form.
• λ2 = 1
Let λ = 1 in (A – λI3)x = 0. We get

The solution to this system of equations can be shown to be x1 = – s – t, x2 = s, and x3 = 2t, where s and
t are scalars. Thus the eigenspace of λ2 = 1 is the space of vectors of the form.
Separating the parameters s and t, we can write

Thus the eigenspace of λ = 1 is a two-dimensional subspace of R3 with basis

If an eigenvalue occurs as a k times repeated root of the characteristic equation, we say that it is of
multiplicity k. Thus λ=10 has multiplicity 1, while λ=1 has multiplicity 2 in this example.
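The matrix A of this example is shown only as an image. As a hedged check, a matrix consistent with the results above (eigenvalue 10 with eigenvector (2, 2, 1), and a two-dimensional eigenspace for the repeated eigenvalue 1) is A = [[5, 4, 2], [4, 5, 2], [2, 2, 2]]; the multiplicities can then be confirmed numerically.

import numpy as np

# Hypothetical matrix consistent with the worked results above
A = np.array([[5.0, 4.0, 2.0],
              [4.0, 5.0, 2.0],
              [2.0, 2.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(np.round(eigenvalues, 6))        # 10 appears once, 1 appears twice

# The eigenspace of λ = 1 is two-dimensional: rank(A - I) = 1, so the
# null space of (A - I) has dimension 3 - 1 = 2.
print(3 - np.linalg.matrix_rank(A - np.eye(3)))   # 2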
Eigenvalues associated with AᵀA: λ = 0, 1 and 3
What is singular value decomposition? Explain with an example.
• The singular value decomposition of a matrix A is the factorization of A into
the product of three matrices, A = UDVᵀ, where the columns of U and V are
orthonormal and the matrix D is diagonal with positive real entries. The SVD
is useful in many tasks.
• Calculating the SVD consists of finding the eigenvalues and eigenvectors
of AAᵀ and AᵀA.
• The eigenvectors of AᵀA make up the columns of V, and the eigenvectors
of AAᵀ make up the columns of U.
• Also, the singular values in D are the square roots of the eigenvalues of AAᵀ or AᵀA.
• The singular values are the diagonal entries of the D matrix and are arranged in
descending order. The singular values are always real numbers.
• If the matrix A is a real matrix, then U and V are also real.
where:
• U: an m x r matrix whose columns are the orthonormal eigenvectors of AAᵀ.
• Vᵀ: the transpose of an n x r matrix V whose columns are the orthonormal eigenvectors of AᵀA.
• D: an r x r diagonal matrix of the singular values, which are the square roots of the
eigenvalues of AAᵀ and AᵀA (this diagonal matrix is also denoted S later in these notes).
Find the singular values of the matrix

To calculate the SVD, we first compute the singular values by finding the eigenvalues of AAᵀ.

The characteristic equation for the above matrix is:


So the singular values are


Now we find the right singular vectors, i.e., an orthonormal set of eigenvectors of AᵀA.
The eigenvalues of AᵀA are 25, 9, and 0, and since AᵀA is symmetric we know that the eigenvectors will be
orthogonal.
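The matrix of this example also appears only as an image. As a hedged illustration, a matrix consistent with the quoted values (eigenvalues of AᵀA equal to 25, 9 and 0, hence singular values 5 and 3) is A = [[3, 2, 2], [2, 3, -2]]; NumPy computes the decomposition directly and confirms the relationship between singular values and eigenvalues.

import numpy as np

# Hypothetical matrix consistent with the values quoted above
A = np.array([[3.0, 2.0,  2.0],
              [2.0, 3.0, -2.0]])

U, s, Vt = np.linalg.svd(A)
print(s)                                        # [5. 3.]

# Singular values are the square roots of the eigenvalues of AᵀA
eigvals_AtA = np.linalg.eigvalsh(A.T @ A)       # symmetric, so eigvalsh
eigvals_AtA = np.clip(eigvals_AtA, 0.0, None)   # clip tiny negative round-off
print(np.sqrt(np.sort(eigvals_AtA)[::-1]))      # [5. 3. 0.]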
Now for U: consider W = AAᵀ. We know that for an n x n matrix W, a nonzero vector x is
an eigenvector of W if:

W x = λx

for some scalar λ. The scalar λ is then called an eigenvalue of W, and x is said to be an eigenvector of W
corresponding to λ. So, to find the eigenvalues we compute the matrices AAᵀ and AᵀA. As
previously stated, the eigenvectors of AAᵀ make up the columns of U, so we can do the following analysis
to find U.
Now that we have an n x n matrix W, we can determine its eigenvalues.
Since W x = λx, we have (W - λI) x = 0.

For a nonzero solution x to exist, the determinant of the matrix (W - λI) must be equal to zero. Thus, from
the solution of the characteristic equation |W - λI| = 0, we obtain:

λ = 0, λ = 0, λ = 15 + √221.5 ≈ 29.883, λ = 15 - √221.5 ≈ 0.117

(four eigenvalues, since the characteristic polynomial is of fourth degree).
Substituting the eigenvalue λ ≈ 0.117 gives an eigenvector that forms one column of U. Thus we
obtain the following equations:
• 19.883 x1 + 14 x2 = 0
• 14 x1 + 9.883 x2 = 0
• x3 = 0
• x4 = 0
Upon simplifying the first two equations we obtain a ratio which relates the value of x1 to x2.
The values of x1 and x2 are then scaled so that the resulting eigenvector has unit length.
Thus a solution that satisfies the above equations is x1 = -0.58, x2 = 0.82 and x3 = x4 = 0 (this is the
second column of the U matrix).
Substituting the other eigenvalue (λ ≈ 29.883) we obtain:
-9.883 x1 + 14 x2 = 0
14 x1 - 19.883 x2 = 0
x3 = 0
x4 = 0
Thus a solution that satisfies this set of equations is x1 = 0.82, x2 = 0.58 and x3 = x4 = 0 (this is the
first column of the U matrix). Combining these we obtain:
Similarly, the eigenvectors of AᵀA make up the columns of V, so we can do a similar analysis to find V.

and similarly we obtain the expression:

Finally, as mentioned previously, the diagonal entries of S are the square roots of the eigenvalues of AAᵀ
(or AᵀA), and they can be obtained directly, giving us:
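As a hedged sketch of the construction just described, assume for illustration the 4 x 2 matrix A = [[2, 4], [1, 3], [0, 0], [0, 0]], whose AAᵀ has the entries 20, 14, 14, 10 used in the equations above. V is built from the eigenvectors of AᵀA, the singular values from the square roots of its eigenvalues, and U from u_i = A v_i / σ_i; the product then reconstructs A. (NumPy's sign and ordering conventions may differ from the hand calculation.)

import numpy as np

# Hypothetical matrix consistent with the AAᵀ entries used above
A = np.array([[2.0, 4.0],
              [1.0, 3.0],
              [0.0, 0.0],
              [0.0, 0.0]])

# Right singular vectors: eigenvectors of the symmetric matrix AᵀA
eigvals, V = np.linalg.eigh(A.T @ A)    # returned in ascending order
order = np.argsort(eigvals)[::-1]       # re-sort in descending order
eigvals, V = eigvals[order], V[:, order]

# Singular values = square roots of the eigenvalues of AᵀA
sigma = np.sqrt(np.clip(eigvals, 0.0, None))

# Left singular vectors for the nonzero singular values: u_i = A v_i / σ_i
U = A @ V / sigma

# The (reduced) SVD reconstructs A
print(np.allclose(U @ np.diag(sigma) @ V.T, A))   # True
print(sigma)   # approx [5.47, 0.37]; their squares are close to the eigenvalues above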
Feature extraction is about extracting/deriving information from the original feature set to create a new
feature subspace. The primary idea behind feature extraction is to compress the data while
maintaining most of the relevant information. As with feature selection techniques, these techniques are
used to reduce the number of features from the original feature set in order to reduce model complexity
and overfitting, enhance computational efficiency, and reduce generalization error.
The following are different types of feature extraction techniques:
PCA- Principal Component Analysis
LDA - Linear Discriminant Analysis
The key difference between feature selection and feature extraction techniques used for dimensionality
reduction is that while the original features are maintained in the case of feature selection algorithms, the
feature extraction algorithms transform the data onto a new feature space.

Feature selection techniques can be used if the requirement is to maintain the original features, unlike the
feature extraction techniques which derive useful information from data to construct a new feature subspace.
Feature selection techniques are used when model explainability is a key requirement.
● Feature extraction and feature engineering: transformation of raw data into features
suitable for modeling;
● Feature transformation: transformation of data to improve the accuracy of the
algorithm;
● Feature selection: removing unnecessary features. Feature selection is applied either to
remove redundancy and/or irrelevancy in the features, or simply to limit the
number of features in order to prevent overfitting.
Dimensionality Reduction
The number of input variables or features for a dataset is referred to as its dimensionality.
Dimensionality reduction refers to techniques that reduce the number of input variables in a training
dataset.
More input features often make a predictive modeling task more challenging to model, a difficulty
generally referred to as the curse of dimensionality.
• Having a large number of dimensions in the feature space can mean that the volume of that space is
very large, and in turn, the points that we have in that space (rows of data) often represent a small
and non-representative sample.
• This can dramatically impact the performance of machine learning algorithms fit on data with many
input features, generally referred to as the curse of dimensionality
• Therefore, it is often desirable to reduce the number of input features.
• This reduces the number of dimensions of the feature space, hence the name “dimensionality
reduction.”
Dimensionality Reduction
• The fundamental reason for the curse of dimensionality is that high-dimensional
functions have the potential to be much more complicated than low-dimensional
ones, and that those complications are harder to discern.

• The only way to beat the curse is to incorporate knowledge about the data that is
correct.

• Dimensionality reduction is a data preparation technique performed on data prior to modeling.
It might be performed after data cleaning and data scaling and before training a predictive model.
Principal Component Analysis
• Principal components analysis is a form of multivariate statistical analysis and is one method of studying the
correlation or covariance structure in a set of measurements on m variables for n observations.

• Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce
the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still
contains most of the information in the large set.

• Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the
trick in dimensionality reduction is to trade a little accuracy for simplicity.

• Smaller data sets are easier to explore and visualize, and machine learning algorithms can
analyze the data much more easily and quickly without extraneous variables to process.
• So to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while preserving
as much information as possible.
• The steps followed in principal components analysis are the following:
• Subtract mean.
• Calculate the covariance matrix.
• Calculate eigenvectors and eigenvalues.
• Select principal components.
• Reduce the data dimension.
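As an illustrative sketch (not from the original slides), the five steps above can be written directly in NumPy. The function below assumes the data matrix X has one instance per row and one feature per column; the small X used to call it is arbitrary and only stands in for a real dataset such as the 20-instance table shown later.

import numpy as np

def pca(X, n_components):
    # Step 1: subtract the mean of each feature (column)
    X_centered = X - X.mean(axis=0)

    # Step 2: calculate the covariance matrix (features are columns)
    cov = np.cov(X_centered, rowvar=False)

    # Step 3: calculate eigenvalues and eigenvectors (cov is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Step 4: select principal components = eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order][:, :n_components]   # the "feature vector"

    # Step 5: reduce the data dimension by projecting onto the kept components
    X_reduced = X_centered @ components
    explained = eigenvalues[:n_components] / eigenvalues.sum()
    return X_reduced, explained

# Arbitrary illustrative data: 6 instances, 3 features
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

X_reduced, explained = pca(X, n_components=2)
print(X_reduced.shape)   # (6, 2)
print(explained)         # fraction of the variance carried by each kept component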
What do the covariances that we have as entries of the matrix tell us about the
correlations between the variables?

• It’s actually the sign of the covariance that matters

• if positive then : the two variables increase or decrease together (correlated)

• if negative then : One increases when the other decreases (Inversely correlated)

• Now that we know that the covariance matrix is no more than a table that
summarizes the correlations between all the possible pairs of variables, let's move to
the next step.
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute
from the covariance matrix in order to determine the principal components of the data.
Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables.
These combinations are done in such a way that the new variables (i.e., principal
components) are uncorrelated and most of the information within the initial variables is
squeezed or compressed into the first components.
So, the idea is 10-dimensional data gives you 10 principal components, but PCA tries to put
maximum possible information in the first component.
Then maximum remaining information in the second and so on, until having something
like shown in the scree plot below.
• There are as many principal components as there are variables in the data, principal
components are constructed in such a manner that the first principal component accounts
for the largest possible variance in the data set.

• Organizing information in principal components this way will allow us to reduce dimensionality
without losing much information, by discarding the components with low information and
considering the remaining components as our new variables.
• An important thing to realize here is that the principal components are less interpretable
and don't have any real meaning, since they are constructed as linear combinations of the
initial variables.
Instance x1 x2 Instance x1 x2
1 0.3 0.5 11 0.6 0.8
2 0.4 0.3 12 0.4 0.6
3 0.7 0.4 13 0.3 0.4
4 0.5 0.7 14 0.6 0.5
5 0.3 0.2 15 0.8 0.5
6 0.9 0.8 16 0.8 0.9
7 0.1 0.2 17 0.2 0.3
8 0.2 0.5 18 0.7 0.7
9 0.6 0.9 19 0.5 0.5
10 0.2 0.2 20 0.6 0.4
After subtracting the mean
The covariance of two random variables measures the degree of variation
from their means for each other.
The sign of the covariance provides us with information about the relation
between them:
If the covariance is positive, then the two variables increase and decrease
together.
If the covariance is negative, then when one variable increases, the other
decreases, and vice versa.
These values determine the linear dependencies between the variables,
which will be used to reduce the data set's dimension.
x1 x2
x1 0.33 0.25
x2 0.25 0.41
Let’s suppose that our data set is 2-dimensional with 2 variables x,y and that the eigenvectors
and eigenvalues of the covariance matrix are as follows:

If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the eigenvector
that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the
second component (PC2) is v2.
After having the principal components, to compute the percentage of variance (information)
accounted for by each component, we divide the eigenvalue of each component by the sum
of eigenvalues. If we apply this on the example above, we find that PC1 and PC2 carry
respectively 96% and 4% of the variance of the data.
• As we saw in the previous step, computing the eigenvectors and ordering them by
their eigenvalues in descending order allows us to find the principal components in
order of significance. In this step, what we do is choose whether to keep all these
components or discard those of lesser significance (those with low eigenvalues), and form
with the remaining ones a matrix of vectors that we call the feature vector.

• So, the feature vector is simply a matrix that has as columns the eigenvectors of the
components that we decide to keep. This makes it the first step towards
dimensionality reduction, because if we choose to keep only p eigenvectors
(components) out of n, the final data set will have only p dimensions.
Continuing with the example from the previous step, we can either form a
feature vector with both of the eigenvectors v1 and v2:

Or discard the eigenvector v2, which is the one of lesser significance, and form a
feature vector with v1 only:

Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a loss of
information in the final data set. But given that v2 was carrying only 4% of the information, the loss
will therefore not be important, and we will still have the 96% of the information that is carried by v1.
• Principal Components in PCA
• As described above, the transformed new features, or the output of PCA,
are the principal components. The number of these PCs is either equal
to or less than the number of original features present in the dataset. Some
properties of these principal components are given below:
• Each principal component is a linear combination of the original
features.
• These components are orthogonal, i.e., the correlation between any pair of
components is zero.
• The importance of each component decreases when going from 1 to n; that is,
the 1st PC has the most importance and the nth PC will have the least
importance.
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality
reduction techniques in machine learning; it can also be used to solve two-class and
multi-class classification problems. It is used as a pre-processing step in machine
learning and in applications of pattern classification.

The goal of LDA is to project the features from a higher dimensional space onto a lower
dimensional space in order to avoid the curse of dimensionality and also to reduce resource
and computational costs.

The original technique was developed in the year 1936 by Ronald A. Fisher and was
named Linear Discriminant or Fisher's Discriminant Analysis. The original Linear
Discriminant was described as a two-class technique. The multi-class version was later
generalized by C. R. Rao as Multiple Discriminant Analysis. All of these are now simply referred to
as Linear Discriminant Analysis.
A Brief Introduction to Linear Discriminant Analysis
• Scatter matrix: Used to make estimates of the covariance matrix. It is an m x m positive
semi-definite matrix, given by: sample variance * number of samples.
• Note: Scatter and variance measure the same thing but on different scales, so we may use
the two words interchangeably. Do not get confused.
Here we will be dealing with two types of scatter matrices:
• Between class scatter = Sb = measures the distance between class means
• Within class scatter = Sw = measures the spread around means of each class

• Step 1 - Computing the within-class and between-class scatter matrices.
• Step 2 - Computing the eigenvectors and their corresponding eigenvalues
for the scatter matrices.
• Step 3 - Sorting the eigenvalues and selecting the top k.
• Step 4 - Creating a new matrix that will contain the eigenvectors mapped
to the k eigenvalues.
• Step 5 - Obtaining new features by taking the dot product of the data and
the matrix from Step 4.
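A minimal NumPy sketch of these five steps follows (not from the original slides). Since the slide formulas appear only as images, the standard scatter-matrix definitions are assumed: Sw sums the scatter of each class around its own mean, and Sb sums n_c (mean_c - overall mean)(mean_c - overall mean)ᵀ over the classes. The data used to call it is the small two-class set from the LDA example later in these notes.

import numpy as np

def lda(X, y, n_components):
    classes = np.unique(y)
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)

    # Step 1: within-class (Sw) and between-class (Sb) scatter matrices
    Sw = np.zeros((n_features, n_features))
    Sb = np.zeros((n_features, n_features))
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        Sb += Xc.shape[0] * (diff @ diff.T)

    # Step 2: eigenvalues and eigenvectors of inv(Sw) @ Sb
    eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(Sw) @ Sb)

    # Step 3: sort the eigenvalues and select the top k
    order = np.argsort(eigenvalues.real)[::-1]

    # Step 4: matrix of the eigenvectors mapped to the top k eigenvalues
    W = eigenvectors[:, order[:n_components]].real

    # Step 5: new features = dot product of the data and that matrix
    return X @ W

# Two-class data from the LDA example later in these notes (C1 then C2)
X = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4],
              [9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(lda(X, y, n_components=1).ravel())   # 1-D discriminant scores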
Within-class scatter matrix
To calculate the within-class scatter matrix, you can use the following mathematical
expression:

• where c = total number of distinct classes,
• x = a sample (i.e., a row), and
• n = total number of samples within a given class.
• Now we create a vector with the mean values of each feature:
Between-class scatter matrix
• We can calculate the between-class scatter matrix using the following mathematical expression:

Where

and
Then solve the generalized eigenvalue problem to obtain the linear discriminants for:

We will sort the eigenvalues from the highest to the lowest, since the eigenvalues with the highest
values carry the most information about the distribution of the data. Next, we will keep the first k
eigenvectors.

Finally, we will place the eigenvalues in a temporary array to make sure the eigenvalues map to the
same eigenvectors after the sorting is done:
LDA vs. PCA: Linear discriminant analysis is very similar to PCA; both look for linear
combinations of the features which best explain the data.
The main difference is that the Linear discriminant analysis is a supervised
dimensionality reduction technique that also achieves classification of the data
simultaneously.

LDA focuses on finding a feature subspace that maximizes the separability between the
groups.

Principal component analysis, on the other hand, is an unsupervised dimensionality reduction
technique: it ignores the class labels.

PCA focuses on capturing the direction of maximum variation in the data set.

LDA and PCA both form a new set of components.


Drawbacks of Linear Discriminant Analysis (LDA)
Although LDA is specifically used to solve supervised classification problems for two or more
classes, which is not possible using standard logistic regression in machine learning, LDA fails in
some cases, such as when the means of the class distributions are shared. In such a case, LDA cannot
create a new axis that makes the classes linearly separable.
Example problem on PCA
• Consider the given Dataset

• Compute Mean vector

• Subtract the mean vector from the feature vectors

• Calculate the covariance matrix


• Covariance Matrix is
Now,
Covariance matrix = (m1 + m2 + m3 + m4 + m5 + m6) / 6




Calculate the eigen values and eigen vectors of the covariance matrix.
λ is an eigen value for a matrix M if it is a solution of the characteristic equation |M – λI|
= 0. So, we have-


From here,
(2.92 – λ)(5.67 – λ) – (3.67 x 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0
λ² – 8.59λ + 3.09 = 0

Solving this quadratic equation, we get λ = 8.22, 0.38

Thus, the two eigen values are λ1 = 8.22 and λ2 = 0.38.


Clearly, the second eigen value is very small compared to the first eigen value.
So, the second eigen vector can be left out.
Eigen vector corresponding to the greatest eigen value is the principal component for the
given data set. So. we find the eigen vector corresponding to eigen value λ1.

We use the following equation to find the eigen vector-


MX = λX
where-
M = Covariance Matrix
X = Eigen vector
λ = Eigen value
Substituting the Eigen value

The eigen vector is
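The resulting eigenvector appears as an image in the original slides. As a quick numeric check (not part of the slides), the covariance matrix read off from the characteristic equation above is [[2.92, 3.67], [3.67, 5.67]]; NumPy confirms the eigenvalues and returns the corresponding principal eigenvector.

import numpy as np

# Covariance matrix implied by the characteristic equation above
M = np.array([[2.92, 3.67],
              [3.67, 5.67]])

eigenvalues, eigenvectors = np.linalg.eigh(M)   # symmetric matrix, ascending order
print(eigenvalues)          # approximately [0.38, 8.22]

# Principal component = eigenvector for the largest eigenvalue (last column)
print(eigenvectors[:, -1])  # the unit-length principal eigenvector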


LDA example:
• Consider the following 2-D dataset:
• C1 = X1 = (x1, x2) = {(4,1), (2,4), (2,3), (3,6), (4,4)}
• C2 = X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}
Step 1: Compute the within-class scatter matrix; the computed values of s1, s2 and Sw are:
Step 2: Compute the between-class scatter matrix (Sb)
• Mean 1 (M1) = (3, 3.6)
• Mean 2 (M2) = (8.4, 7.6)

• (M1 - M2) = (3 - 8.4, 3.6 - 7.6) = (-5.4, -4.0)


Step 3: Find the best LDA projection vector
• To do this, compute the eigenvalues, and the eigenvector corresponding to the
largest eigenvalue, of the matrix given by the product Sw⁻¹Sb (the inverse of Sw
times Sb).

• In this example, the highest eigenvalue is 15.65.


Eigenvector computed for the eigenvalue 15.65:

Compute the inverse of Sw:
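As a numeric cross-check (not part of the original slides), the value 15.65 can be reproduced with NumPy if, as the slide's numbers suggest, s1 and s2 are taken as the per-class covariance matrices normalized by the class size (5) and Sb = (M1 - M2)(M1 - M2)ᵀ; this normalization is an assumption, since the intermediate matrices appear only as images.

import numpy as np

C1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
C2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

# Assumed normalization: per-class covariance with denominator N (bias=True)
s1 = np.cov(C1, rowvar=False, bias=True)
s2 = np.cov(C2, rowvar=False, bias=True)
Sw = s1 + s2

diff = (C1.mean(axis=0) - C2.mean(axis=0)).reshape(-1, 1)   # (M1 - M2)
Sb = diff @ diff.T

eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
print(eigenvalues.real)                                   # largest is about 15.65
print(eigenvectors[:, np.argmax(eigenvalues.real)].real)  # best projection vector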
