Principal Component Analysis

PCA is a technique used to reduce the dimensionality of large datasets by transforming the data to a new coordinate system. It works by finding the principal components - linear combinations of variables that maximize the variance in the data. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. PCA transforms the data such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. This has the effect of identifying variables that are highly correlated and replacing them with a smaller number of artificial variables.


Introduction

Sometimes data are collected on a large number of variables from a single population. With a large
number of variables, there are too many pairwise correlations between the variables to consider.
Graphical displays of the data may also not be of much help when the data set is very large. To interpret
the data in a more meaningful form, it is therefore necessary to reduce the number of variables to a
few, interpretable linear combinations of the data. Each linear combination will correspond to a
principal component. Principal Component Analysis is a linear transformation method: PCA yields the
directions (principal components) that maximize the variance of the data. This chapter gives a general
view of PCA.

Principal Component Analysis:

PCA is a linear transformation method. PCA yields the directions (principal components) that maximize
the variance of the data. In other words, PCA projects the entire dataset onto a different feature
(sub)space. Often, the desired goal is to reduce the dimensions of a d-dimensional dataset by projecting it
onto a k-dimensional subspace (where k < d) in order to increase the computational efficiency while
retaining most of the information.
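
As a minimal sketch of this idea, the snippet below projects a toy d = 5 dimensional dataset onto a k = 2 dimensional subspace using scikit-learn's PCA (the dataset and the choice of k are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples in a d = 5 dimensional space

pca = PCA(n_components=2)              # project onto a k = 2 dimensional subspace
X_reduced = pca.fit_transform(X)       # shape: (100, 2)

print(pca.explained_variance_ratio_)   # fraction of the variance retained by each component
```

The explained_variance_ratio_ attribute reports how much of the original variance each retained component accounts for, which is what "retaining most of the information" means in practice.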

PCA Approach:

I. Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to
represent the data.
II. The original data set is reduced to one consisting of N data vectors on c principal components
(reduced dimensions).
III. Each data vector is a linear combination of the c principal component vectors.

IV. Works for numeric data only

V. Used when the number of dimensions is large

yk = ak1x1 + ak2x2 + ... + akkxk

such that:
the yk's are uncorrelated (orthogonal) Principal Components,
y1 explains as much as possible of the original variance in the data set,
y2 explains as much as possible of the remaining variance, and so on, where

{a11, a12, ..., a1k} is the 1st eigenvector of the correlation/covariance matrix and gives the
coefficients of the first principal component,

{a21, a22, ..., a2k} is the 2nd eigenvector of the correlation/covariance matrix and gives the
coefficients of the 2nd principal component,

{ak1, ak2, ..., akk} is the kth eigenvector of the correlation/covariance matrix and gives the
coefficients of the kth principal component.

In summary, PCA rotates a multivariate dataset into a new configuration which is easier to
interpret, for two major reasons: it simplifies the data, letting us

– look at relationships between variables

– look at patterns of units

Fundamentals

To get to PCA, we’re going to quickly define some basic statistical ideas – mean, standard deviation,
variance and covariance – so we can weave them together later. Their equations are closely related.

Mean is simply the average value of all x’s in the set X, which is found by dividing the sum of all data
points by the number of data points, n.

Standard deviation is simply the square root of the average squared distance of data points from the mean:

s = sqrt( Σ (xi − x̄)² / (n − 1) )

The numerator inside the square root contains the sum of the squared differences between each data point and
the mean, and the denominator is simply the number of data points (minus one), producing the average
squared distance.

Variance is the spread, or the amount of difference, that the data express; it is simply the square of the standard deviation.

Covariance (cov(X,Y)) is the joint variability between two random variables X and Y, and covariance is
always measured between 2 or more dimensions. If you calculate the covariance between one
dimension and itself, you get the variance.

For both variance and standard deviation, squaring the differences between data points and the mean
makes them positive, so that values above and below the mean don’t cancel each other out.

Input to various regression techniques can be in the form of a correlation or covariance matrix. A covariance
matrix is a matrix whose element in the (i, j) position is the covariance between the ith and jth elements
of a random vector. A random vector is a random variable with multiple dimensions.

Correlation Matrix is a table showing correlation coefficients between sets of variables.

Covariance Matrix vs Correlation Matrix:

Covariance Matrix
• Variables must be in the same units
• Emphasizes variables with the most variance
• Mean eigenvalue ≠ 1.0

Correlation Matrix
• Variables are standardized (mean 0.0, SD 1.0)
• Variables can be in different units
• All variables have the same impact on the analysis
• Mean eigenvalue = 1.0

PCA steps

1. Consider a data set

2. Subtract the mean from each of the data dimensions

3. Calculate the covariance matrix

4. Calculate the eigenvalues and eigenvectors of the covariance matrix

5. Reduce dimensionality and form feature vector


1. Order the eigenvectors by eigenvalue, highest to lowest. This gives the components in order of
significance; the components of lesser significance can be ignored.

2. Feature Vector = (eig1 eig2 eig3 … eign)

6. Deriving the new data:

FinalData = RowFeatureVector x RowZeroMeanData

1. RowFeatureVector is the matrix with the eigenvectors in the rows, with the most significant
eigenvector at the top

2. RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in the columns, with
each row holding a separate dimension.
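
The six steps above can be written out in a few lines of numpy; the sketch below is a minimal illustration (the data values are made up for demonstration and are not the ones shown in Figure 2):

```python
import numpy as np

# Step 1: a small 2-dimensional data set (illustrative values only)
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Step 2: subtract the mean from each dimension
mean = data.mean(axis=0)
zero_mean = data - mean

# Step 3: covariance matrix (rowvar=False -> columns are the dimensions)
cov = np.cov(zero_mean, rowvar=False)

# Step 4: eigenvalues and eigenvectors of the symmetric covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)       # columns of eig_vecs are unit eigenvectors

# Step 5: order eigenvectors by eigenvalue (highest to lowest) and keep the top p
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
p = 1                                          # keep only the most significant component
row_feature_vector = eig_vecs[:, :p].T         # eigenvectors as rows, most significant first

# Step 6: FinalData = RowFeatureVector x RowZeroMeanData
row_zero_mean_data = zero_mean.T               # data items in columns, one row per dimension
final_data = row_feature_vector @ row_zero_mean_data
print(final_data.shape)                        # (1, 10): ten points expressed in one dimension
```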

PCA steps explained with an example

Step 1: Get some data

Consider a data set with just 2 dimensions, along with 2D plots of the data to show what the PCA analysis is
doing at each step. The data used is given in Figure 2 (a), along with a plot of that data in Figure 2 (b).

Step 2: Subtract the mean

For PCA to work properly, you have to subtract the mean from each of the data dimensions. The mean
subtracted is the average across each dimension. So, all the x values have x̄ (the mean of the x values of
all the data points) subtracted, and all the y values have ȳ subtracted from them. This produces a data set
whose mean is zero, as shown in Figure 2 (c).

Step 3: Calculate the covariance matrix. This is done in exactly the same way as was discussed in section
2.1.4. Since the data is 2-dimensional, the covariance matrix will be 2 × 2, as given below.
Since the non-diagonal elements in this covariance matrix are positive, we should expect that
the x and y variables increase together.

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix. Since the covariance matrix
is square, we can calculate the eigenvectors and eigenvalues for this matrix. These are rather important,
as they tell us useful information about our data. The eigenvectors and eigenvalues are given below:

It is important to notice that these eigenvectors are both unit eigenvectors, i.e. their lengths are both 1.
So, by this process of taking the eigenvectors of the covariance matrix, we have been able to extract
lines that characterise the data. The rest of the steps involve transforming the data so that it is
expressed in terms of these lines.

Step 5: Choosing components and forming a feature vector:

Here is where the notion of data compression and reduced dimensionality comes into it. If you look at
the eigenvectors and eigenvalues from the previous step, you will notice that the eigenvalues are quite
different. In fact, it turns out that the eigenvector with the highest eigenvalue is the principal
component of the data set.

In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle
of the data. It is the most significant relationship between the data dimensions. Once the eigenvectors are
found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This
gives you the components in order of significance. To be precise, if you originally have n dimensions in
your data, you calculate n eigenvectors and eigenvalues; if you then choose only the first p
eigenvectors, the final data set has only p dimensions. What needs to be done now is to
form a feature vector:

FeatureVector = (eig1, eig2, eig3…eign)

Given our example set of data, and the fact that we have 2 eigenvectors, we have two choices. We can
either form a feature vector with both of the eigenvectors, or we can choose to leave out the smaller,
less significant component and only have a single column:
Step 6: Deriving the new data set. This is the final step in PCA, and is also the easiest. Once we have chosen
the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply
take the transpose of the vector and multiply it on the left of the original data set, transposed.

where RowFeatureVector is the matrix with the eigenvectors in the columns transposed so that the
eigenvectors are now in the rows, with the most significant eigenvector at the top, and RowDataAdjust
is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a
separate dimension. FinalData is the final data set, with data items in columns and dimensions along
rows.

Our original data set had two axes, x and y, so our data was in terms of them. It is possible to express
data in terms of any two axes that you like. If these axes are perpendicular, then the expression is the
most efficient. This was why it was important that eigenvectors are always perpendicular to each other.
We have changed our data from being in terms of the axes x and y, and now they are in terms of our 2
eigenvectors.

In the case of keeping both eigenvectors for the transformation, we get the data and the plot found in
Figure 3. This plot is basically the original data, rotated so that the eigenvectors are the axes. This is
understandable since we have lost no information in this decomposition.
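
Because no information is lost when every eigenvector is kept, the transformation can be inverted. Below is a brief sketch, reusing the variable names from the numpy example given after the PCA steps above (the reconstruction is exact only if all eigenvectors were kept; otherwise it is the closest approximation):

```python
# Invert the transformation: back to the original axes, then add the mean back in.
row_original_approx = row_feature_vector.T @ final_data
reconstructed = row_original_approx.T + mean
```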

Basically, we have transformed our data so that it is expressed in terms of the patterns between the dimensions,
where the patterns are the lines that most closely describe the relationships between the data. This is
helpful because we have now classified each data point as a combination of the contributions from each
of those lines.

Figure 3: Transformed Data and New data

Interpretation of PCs
Consider the Places Rated Almanac data (Boyer and Savageau), which rates 329 communities according to
nine criteria:

• Climate and Terrain (C1)
• Housing (C2)
• Health Care & Environment (C3)
• Crime (C4)
• Transportation (C5)
• Education (C6)
• The Arts (C7)
• Recreation (C8)
• Economics (C9)

Step 1: Examine the eigenvalues to determine how many principal components should be considered.

If you take all of these eigenvalues and add them up, you get a total variance of 0.5223.

The proportion of variation explained by each eigenvalue is given in the third column. For example,
0.3775 divided by 0.5223 equals 0.7227, so about 72% of the variation is explained by this first
eigenvalue.

The cumulative percentage explained is obtained by adding the successive proportions of variation
explained to obtain the running total. For instance, 0.7227 plus 0.0977 equals 0.8204, and so forth.
Therefore, about 82% of the variation is explained by the first two eigenvalues together.

Next we need to look at successive differences between the eigenvalues. Subtracting the second
eigenvalue, 0.051, from the first eigenvalue, 0.377, we get a difference of 0.326. The difference between
the second and third eigenvalues is 0.0232; the next difference is 0.0049.

A sharp drop from one eigenvalue to the next may serve as another indicator of how many eigenvalues
to consider.

The first three principal components explain 87% of the variation. This is an acceptably large
percentage.

Table 1. Eigenvalues, and the proportion of variation explained by the principal components
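
As a quick check of the Step 1 arithmetic, the snippet below reproduces the proportions and the running total using only the eigenvalues explicitly quoted above (0.3775 and 0.051, out of a total variance of 0.5223); the remaining eigenvalues are omitted, so the numbers are only illustrative:

```python
import numpy as np

eigenvalues = np.array([0.3775, 0.051])   # first two eigenvalues quoted in the text
total = 0.5223                            # total variance

proportion = eigenvalues / total          # ~0.7227 and ~0.0977, as in Table 1
cumulative = np.cumsum(proportion)        # ~0.7227, ~0.8204: about 82% for the first two PCs
drop = eigenvalues[0] - eigenvalues[1]    # ~0.326, the first successive difference
print(proportion, cumulative, drop)
```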
Step 2: Next, compute the principal component scores:

• For example, the first principal component can be computed using the elements of the first
eigenvector:

Step 3: To interpret each component, compute the correlations between the original data for each
variable and each principal component.
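
Steps 2 and 3 amount to a matrix product followed by a set of correlation coefficients. Below is a minimal sketch, where X stands for the 329 × 9 data matrix and eig_vecs for the eigenvectors from Step 1 (both are placeholders, not the actual Places Rated values):

```python
import numpy as np

def pc_scores_and_correlations(X, eig_vecs):
    """X: (n_samples, n_vars) data; eig_vecs: (n_vars, n_components) eigenvectors."""
    X_centered = X - X.mean(axis=0)
    # Step 2: each score is a linear combination of the variables,
    # weighted by the elements of an eigenvector.
    scores = X_centered @ eig_vecs
    # Step 3: correlation between each original variable and each component.
    n_vars, n_comp = X.shape[1], eig_vecs.shape[1]
    corr = np.empty((n_vars, n_comp))
    for i in range(n_vars):
        for j in range(n_comp):
            corr[i, j] = np.corrcoef(X[:, i], scores[:, j])[0, 1]
    return scores, corr
```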

First Principal Component Analysis - PCA1

The first principal component is a measure of the quality of Health and the Arts, and to some extent
Housing, Transportation and Recreation. Health increases with increasing values in the Arts. If any of
these variables goes up, so do the remaining ones. They are all positively related as they all have positive
signs.

Second Principal Component Analysis - PCA2

The second principal component is a measure of the severity of crime, the quality of the economy, and the
lack of quality in education. Crime and Economy increase with decreasing Education. Here we can see that
cities with high levels of crime and good economies also tend to have poor educational systems.
Third Principal Component Analysis - PCA3

The third principal component is a measure of the quality of the climate and poorness of the economy.
Climate increases with decreasing Economy. The inclusion of economy within this component will add a
bit of redundancy within our results. This component is primarily a measure of climate, and to a lesser
extent the economy.
