
GENERAL ARTICLE

Principal Component Analysis: Most Favourite Tool in Chemometrics

Keshav Kumar

Principal component analysis (PCA) is the most commonly used chemometric technique. It is an unsupervised pattern recognition technique. PCA has found applications in chemistry, biology, medicine and economics. The present work attempts to explain how PCA works and how its results can be interpreted.

Keshav Kumar did his PhD from the Department of Chemistry, Indian Institute of Technology-Madras, India, under the guidance of Professor A K Mishra. Currently he is working as a postdoc at the Institute for Wine Analysis and Beverage Research, Hochschule Geisenheim University, Germany. His research mainly focuses on chemometrics and its applications in various fields.

1. Introduction

Chemometrics is a discipline that combines mathematics, statistics, and logic to design or select optimal measurement procedures and experiments. It allows the extraction of maximum relevant chemical information by analysing chemical data and helps in understanding chemical systems [1]. In recent years, chemometrics has emerged as an important part of analytical chemistry. Chemometric techniques have enabled the efficient analysis of large volumes of data obtained from various instruments (single or hyphenated). The analysis of such large data sets is otherwise a time-consuming process and might end up with no meaningful interpretation or conclusions.
Among the various chemometric techniques, principal component analysis (PCA) [1, 2] is considered the 'most favourite'. PCA has found applications in various fields. For example, Singh et al. have successfully used PCA for stellar spectral classification [3]. Kumar et al. have applied PCA for (i) classifying aqueous herbal drugs [4] and (ii) diagnosis and therapeutic prognosis of oral submucous fibrosis [5]. Kowalski et al. have used PCA for the classification of archaeological artefacts [6]. Kowalkowski has applied PCA for river water classification [7], while Ragot and co-workers have used PCA for air quality monitoring [8].

Keywords: Chemometrics, principal component analysis, classification, pattern recognition, chromatography.


It can be realised that PCA is capable of providing a fast and effective way of analysing data sets from various disciplines, viz., physics, biology, chemistry, archaeology, etc. PCA essentially reduces the dimensions of the data set while retaining most of the variation [1, 2]. Data compression by PCA involves finding a new space, spanned by a smaller number of dimensions, onto which the original data set is projected. The dimensions of the new space are orthogonal to each other, simplifying the data sets for further analysis. Theoretical and various technical aspects of PCA are discussed below.

2. Theory

2.1 Geometrical Representation of PCA

In order to understand PCA geometrically, let us consider a two-dimensional data set of size I × J, where I is the number of samples and J is the number of variables. In the present case, for convenience, we have set the number of variables (i.e., J) to two – J1 and J2. As shown in Figure 1, these samples can be presented in a two-dimensional space spanned by J1 and J2. The two axes J1 and J2 are orthogonal to each other. The data set acquired for the samples has a considerable amount of variation along both the J1 and J2 axes. In other words, both dimensions are significantly important for having the complete information about the sample set.

An anti-clockwise rotation of the J1 and J2 axes by an angle θ (= 45° in the present case) generates another pair of orthogonal axes, T1 and T2. Mathematically, this can be shown using (1):
$$\begin{pmatrix} T_1 \\ T_2 \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} J_1 \\ J_2 \end{pmatrix} \qquad (1)$$

The new variables (or dimensions) T1 and T2 are linear combinations of the J1 and J2 variables, with sine and cosine terms as coefficients:

T1 = J1 cos θ + J2 sin θ   (2)


T2 = −J1 sin θ + J2 cos θ   (3)

The projection of the data set onto the space spanned by the new variables T1 and T2 is shown in Figure 2. The data set has most of its variation along the T1 axis and is practically invariant along the T2 axis.
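As a small numerical illustration of this rotation (a sketch in Python/NumPy with made-up data, not taken from the article):

```python
import numpy as np

theta = np.deg2rad(45)
R = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])   # rotation matrix of equation (1)

# toy samples that vary mainly along the J1 = J2 direction (made-up data)
J = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.05], [4.0, 3.95]])
T = J @ R.T          # coordinates (T1, T2) of each sample in the rotated space

print(T.var(axis=0))  # nearly all variance lies along T1; T2 is almost constant
```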
Figure 1. Representation of a data set in the space spanned by J1 and J2. The data has significant variation along both the axes.

Figure 2. (a) Rotation of the axes J1 and J2 by 45° to generate another pair of orthogonal axes, T1 and T2. (b) Representation of the data set in the new space spanned by the T1 and T2 axes. The data set has variation along T1 and no variation along T2. (c) Reduction of dimensions: T2 is unimportant and hence can be removed, and T1 can be taken as the approximation of the data spanned in the two-dimensional space spanned by J1 and J2.


In principle, the variation along the T1 axis can be taken as a good approximation of the original two-dimensional data set, and one can easily ignore the T2 axis. Thus, by projecting the data set onto a suitable space, it is possible to reduce the dimensions of the data set while retaining all the information.
2.2 Commonly Used Terminologies in PCA

Before proceeding further, it is necessary that we briefly describe some commonly used terminologies.

(i) Principal Components: The set of new variables (i.e., T1 and T2) obtained from the linear combinations of the old variables (J1 and J2) are called principal components. The variable that explains the maximum variation is called the first principal component. The second principal component explains the second highest variation from the unexplained variance of the data set, and so on. In the above example, T1 is the first principal component and explains all the variation of the data set, and T2 is the second principal component that explains the remaining variance of the data set.

(ii) Loading Vectors: They essentially form the basis for projecting the original data set to obtain the principal components. In the above example, [cos θ  sin θ]ᵀ and [−sin θ  cos θ]ᵀ, the transposes of the first and second rows of the matrix given in (1), represent the first and second loading vectors corresponding to the first (T1) and second (T2) principal components, respectively. The loading vectors, in a more generic sense, are known as the eigenvectors of the data set.

(iii) Score and Loading Values: The numerical values associated with the principal components of each sample are called the score values. The numerical values associated with the elements of the loading vectors are called the loading values. The score values explain how the samples are related to each other, and the loading values explain how the variables are related to each other.


2.3 Fitting PCA Model

A PCA model can be fitted using the eigenvalue decomposition method, as summarized below.

(1) In the first step, the data set is mean centred: X = X − mean(X).

(2) In the second step, the covariance matrix of the mean-centred X is calculated: Cov(X) = XᵀX / (I − 1).

(3) In the next step, the covariance matrix is diagonalized to obtain Λ (the eigenvalues) and P (the matrix containing the eigenvectors as its columns): Cov(X)P = PΛ.

(4) In this step, the diagonal elements of the matrix Λ are arranged in decreasing order (i.e., Λ1 > Λ2 > Λ3 > ... > Λk), and the corresponding rearrangement is made in the loading matrix P. For example, if the positions of Λ1 and Λ3 are interchanged in the matrix Λ, then the first and third columns of P are interchanged.

(5) The score matrix T can be calculated by projecting the data set X onto the space spanned by the eigenvectors in P: T = XP. The score matrix T and the loading matrix P are orthogonal and orthonormal, respectively: TᵀT is a diagonal matrix, and PᵀP is an identity matrix.

(6) The approximation of the data set X by the PCA model can be represented as X = TPᵀ + E, where E is the residual matrix of dimension I × J, T is the score matrix of dimension I × K, and P is the loading matrix of dimension J × K. Here K is the number of significant factors of the PCA model that explain the majority of the variation in the data set, and it is always ≤ min(I, J).

(7) The score values of any new sample can be calculated by projecting the new data X_new on P: T_new = X_new P.

We can also use autoscaled data in step 1 to perform the PCA analysis. Autoscaling is very useful when the variables are on different scales or of different magnitudes. The data set is autoscaled

by subtracting the mean from each column, followed by division by the standard deviation:

X = [X − mean(X)] / standard deviation(X).

Steps 2–7 can then be performed on the autoscaled X to create the PCA model. It is to be noted that the square matrix obtained in step 2 is then defined as the correlation matrix.
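The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration of the eigenvalue decomposition route (the article itself used MATLAB); the function and variable names are our own:

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit a PCA model by eigenvalue decomposition of the covariance matrix.

    X : (I, J) data matrix (rows = samples, columns = variables).
    Returns the scores T (I x K), loadings P (J x K) and all eigenvalues.
    """
    Xc = X - X.mean(axis=0)                 # step 1: mean centring
    cov = (Xc.T @ Xc) / (Xc.shape[0] - 1)   # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # step 3: diagonalization (eigh suits symmetric matrices)
    order = np.argsort(eigvals)[::-1]       # step 4: sort eigenvalues in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P = eigvecs[:, :n_components]           # loading matrix (columns = eigenvectors)
    T = Xc @ P                              # step 5: score matrix T = XP
    return T, P, eigvals

# steps 6 and 7: reconstruction and projection of a new sample
# X_hat = T @ P.T + X.mean(axis=0);  t_new = (x_new - X.mean(axis=0)) @ P
```

For autoscaled data, one would simply divide each centred column by its standard deviation before the covariance step.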
2.4 Finding Optimum Number of Factors for PCA Model
One of the significant advantages of PCA is that it is essentially sequential in nature. In other words, a K-factor PCA model is always a subset of a (K + 1)-factor PCA model. The choice of the number of factors can only affect the extent to which one can retrieve different pieces of orthogonal information. Thus, a PCA model can be created with any number of factors, and each model would provide a true piece of the information available in the data set. In order to ensure that we capture the complete information of the data set without overfitting the model, a general rule of thumb is to first select k factors from the available K factors that capture at least 80% of the variance of the data set:

$$\text{Amount of variance} = \frac{\sum_{i=1}^{k} \Lambda_i}{\sum_{i=1}^{K} \Lambda_i} > 80\%, \qquad k \leq K.$$

One has to keep adding factors as long as the amount of variance captured by the model increases by more than 3–4% with each additional factor.
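As a sketch of this criterion (assuming the eigenvalues arrive sorted in decreasing order, as in the fit_pca example above; the 80% and 3% limits are simply the thresholds quoted in the text):

```python
import numpy as np

def choose_n_components(eigvals, threshold=0.80, min_gain=0.03):
    """Smallest k whose cumulative explained variance exceeds `threshold`,
    then keep adding factors while each extra factor still contributes more
    than `min_gain` (about 3-4%) of the total variance."""
    ratios = eigvals / eigvals.sum()                      # variance fraction per PC
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold)) + 1   # first k reaching the threshold
    while k < len(ratios) and ratios[k] > min_gain:
        k += 1
    return k, cumulative
```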

2.5 Some Statistical Parameters Involved in PCA

Lack of Fit Parameter (Q): It is a measure of the difference between the actual data set and the approximation made by the PCA model. The lack of fit parameter (Q) of the PCA model can be calculated by taking the outer product of the residual matrix E:

Q = EEᵀ = X(I − PPᵀ)Xᵀ.

In an ideal case, the diagonal elements of Q (a matrix of dimension I × I) should be zero.


Hotelling's T² Statistic: This parameter measures the variation of each sample within the PCA model. It indicates the spread of the samples from the origin of the model. It can be calculated as

T² = T Λ⁻¹ Tᵀ.

Leverage: It measures the influence of a sample on the PCA model. A sample with high leverage reinforces the PCA model and may cause the rotation of the principal components. The leverage of a sample can be calculated using the formula

Leverage = T (TᵀT)⁻¹ Tᵀ.
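A minimal sketch of these per-sample diagnostics, reusing the scores and loadings from the fit_pca example (our own function name; only the diagonals of the Q, T² and leverage matrices are kept, since those are the per-sample values):

```python
import numpy as np

def pca_diagnostics(X, T, P, eigvals):
    """Per-sample residual (Q), Hotelling's T^2 and leverage for a PCA model.

    X       : mean-centred data (I x J)
    T, P    : scores (I x K) and loadings (J x K)
    eigvals : the K retained eigenvalues (variances of the score columns)
    """
    E = X - T @ P.T                         # residual matrix E
    Q = np.sum(E**2, axis=1)                # diagonal of E E^T: lack of fit per sample
    T2 = np.sum(T**2 / eigvals, axis=1)     # diagonal of T Lambda^-1 T^T
    H = T @ np.linalg.inv(T.T @ T) @ T.T    # hat matrix T (T^T T)^-1 T^T
    leverage = np.diag(H)                   # leverage of each sample
    return Q, T2, leverage
```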

2.6 Detection of Outliers

Principal component analysis can be used to find the outliers in a data set, provided we study the leverage and residual values of the samples. A sample that is not well described by the model will have an unusually high residual, and a sample that has a high influence on the model will have an unusually high leverage [9]. In an ideal case, all the samples of a data set should have low leverage and low residuals. The samples having high residuals are classified as outliers and need to be analysed carefully. The samples having high leverage may have either high or low residuals; the former are called bad leverage samples, and the latter are called good leverage samples. The bad leverage samples need to be analysed very carefully because they tend to bias the model significantly. The good leverage samples also have to be analysed carefully as they indicate unusual variation of the variables, even though the data of those samples are well fitted by the PCA model.
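This decision logic can be written down compactly. The cut-off rule below (mean plus three standard deviations) is purely an illustrative assumption; the article does not prescribe specific limits:

```python
import numpy as np

def flag_samples(Q, leverage, q_cut=None, lev_cut=None):
    """Classify samples from their residual (Q) and leverage values."""
    q_cut = q_cut if q_cut is not None else Q.mean() + 3 * Q.std()              # assumed cut-off
    lev_cut = lev_cut if lev_cut is not None else leverage.mean() + 3 * leverage.std()
    high_q, high_lev = Q > q_cut, leverage > lev_cut
    labels = np.full(Q.shape, "regular", dtype=object)
    labels[high_q & ~high_lev] = "outlier (high residual)"
    labels[high_q & high_lev] = "bad leverage"
    labels[~high_q & high_lev] = "good leverage"
    return labels
```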

3. Performing PCA: An Example

3.1 Data Used

The chromatographic data set reported in the literature [10, 11] has been used to carry out the present work. The chromatographic data set consists of 120 oil samples. Of these, 68 belong to the class of olive oils and 52 belong to the class of non-olive vegetable oils and oil blends (i.e., vegetable and olive oil blends).


Figure 3. The chromatograms of 120 oil samples. Of these, 68 belong to the class of olive oils and the remaining belong to the class of non-olive oils and blended oils.

3.2 Software Used

All the analysis and data plotting was carried out on the MATLAB 2014 platform. However, other platforms, such as R, Python, etc., can also be used for the analysis.
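As a rough illustration of the workflow described in the next subsection, here is a Python/NumPy sketch (the article itself used MATLAB; the file name and the use of NumPy/Matplotlib are our own assumptions, and fit_pca is the function sketched in Section 2.3):

```python
import numpy as np
import matplotlib.pyplot as plt

# X: samples x retention-time points chromatographic matrix
# (hypothetical file name; load from wherever the data are stored)
X = np.loadtxt("chromatograms.csv", delimiter=",")

X = X / X.sum(axis=1, keepdims=True)          # normalize each chromatogram to unit area
T, P, eigvals = fit_pca(X, n_components=2)    # fit_pca mean-centres internally (Section 2.3)

print("Variance explained by PC1, PC2:", eigvals[:2] / eigvals.sum())

plt.scatter(T[:, 0], T[:, 1])                 # PC1 versus PC2 score plot
plt.xlabel("PC1 score")
plt.ylabel("PC2 score")
plt.show()
```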

3.3 Results and Discussion

The chromatographic data sets acquired for the olive and non-olive oils are shown in Figure 3. It can be seen that, based on a visual analysis of the chromatographic profiles, it is difficult to differentiate olive oils from non-olive vegetable oils. Moreover, manual analysis of such a large volume of data is laborious and time consuming, and may not provide any meaningful interpretations.

Figure 4. Amount of variance captured by different principal components (PCs). The plot indicates that the first two PCs are sufficient to explain most of the variance (more than 85%) of the data set without overfitting the model.


However, PCA essentially simplifies and reduces the dimensions of the data set, and provides a fast and efficient way of analysing the complex chromatographic data of the selected oil samples. The chromatographic data sets are arranged in a matrix of dimensions 126 × 4001, where 126 is the number of samples and 4001 is the number of variables (i.e., retention time points). The chromatographic data sets are normalized to unit area and mean-centred prior to PCA. The optimum number of principal components required for fitting the PCA model is obtained from the variance captured by the different principal components plotted against the principal component numbers, as shown in Figure 4. It can be seen that with the addition of principal components, there is a substantial improvement in the cumulative percentage of variance captured by the PCA model. However, beyond 2 principal components, the addition of extra factors does not bring any substantial improvement in the cumulative variance captured by the PCA model. Thus, one can conclude that a PCA model with two principal components, which explains more than 80% of the variance, is optimum to capture all the important information buried in the data set. PC1 and PC2 individually explain 78% and 5% of the variance of the data set. The PC1 versus PC2 score plot is shown in Figure 5; the PCA model clearly separates the samples into two groups. It is found that all the samples belonging to the class of olive oils have negative PC1 score values and the samples belonging to the non-olive oil class have positive PC1 score values.

Figure 5. PC1 versus PC2 score plot classifying the olive oil samples from the non-olive oil and blended oil samples.


Figure 6. (a) The loading vector corresponding to PC1, which mainly contains negatively correlated major peaks, can be used to rationalize the classification of the olive oil, non-olive oil, and blended oil samples in the PC1 versus PC2 score plot. (b) The loading vector corresponding to PC2 mainly explains the minor peaks and can be used for the classification of the samples within the groups.

The blended oils in the score plot appear near the edges of the ellipse. The blended oils containing more olive oil have negative PC1 score values, whereas the blended oils containing less olive oil have positive PC1 score values. The loading vectors, which explain how the variables are related to each other, are shown in Figure 6.
The analysis of the loading vector plot can be really helpful in finding the set of variables that are most useful in characterizing the samples. The loading vector corresponding to PC1 mainly explains the variations of the major peaks that can be used to characterize the classes of olive oils and non-olive oils. The loading vector corresponding to PC2 mainly explains the variation of the minor peaks and can be used to differentiate the samples within the olive oil and non-olive oil groups. Based on the loading vector profiles, the appearance of the blended oil samples at the extreme edges of the PC2 axis can be attributed to the fact that blended oils contain different types of olive and non-olive oils.

Figure 7. The outlier diagnostic plot explaining the leverage and residual values in the two-component PCA model. Nine samples (16, 20, 40, 48, 51, 52, 72, 95 and 120) are found to have high leverage values. Of these, 5 samples (16, 20, 40, 51 and 95) are the blended oils. Five samples (41, 84, 85, 86 and 119) are found to have high residual values, indicating that their composition is very different from the others or that something went wrong at the sample preparation or data acquisition stages; they need further attention.

The outlier diagnostic plot, created by plotting the leverage versus the residual values, can be used to find the really unusual samples called the outliers. The outlier diagnostic plot is shown in Figure 7. All 5 blended samples (16, 20, 40, 51, and 95) are found to have unusually high leverage values, which correlates well with the fact that they contain constituents of both olive and non-olive oils. There are some samples with high residual values, indicating that these samples need careful analysis. These samples might have unusual compositions, or something might have gone wrong at the sample preparation or data acquisition stages. In summary, the obtained PCA model is found to be highly specific and sensitive in classifying the oil samples.
In most cases, PCA is well capable of classifying the samples. However, in some cases, owing to the complexity of the data sets, the output of PCA, such as the score matrix, needs to be further processed with other chemometric techniques, such as linear discriminant analysis (LDA) [12], soft independent modelling of class analogy (SIMCA) [13, 14], neural network analysis (NNA) [3, 14], etc., for achieving a meaningful interpretation of the data sets.


3.4 Conclusions

PCA is the most favourite tool in chemometrics. It reduces the dimensions of the data sets and simplifies the data for easy and meaningful interpretation. Using the chromatographic data set of olive and non-olive oil samples, it has been clearly shown that PCA can be used as an unsupervised pattern recognition technique. PCA successfully differentiated the olive oil samples from the non-olive oil samples. It has also been shown that PCA can be used for detecting outlier samples in a data set.

Suggested Reading

[1] D L Massart, B G M Vandeginste, L M C Buydens, S de Jong, P J Lewi and J Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Elsevier, New York, 1997.
[2] R Kramer, Chemometric Techniques for Quantitative Analysis, Marcel Dekker, New York, 1998.
[3] H P Singh, R K Gulati and Ranjan Gupta, Stellar Spectral Classification Using Principal Component Analysis and Artificial Neural Networks, Monthly Notices of the Royal Astronomical Society, Vol.295, pp.312–318, 1998.
[4] K Kumar, P Bairi, K Ghosh, K K Mishra and A K Mishra, Classification of Aqueous-based Ayurvedic Preparations Using Synchronous Fluorescence Spectroscopy and Chemometric Techniques, Current Science, Vol.107, No.3, pp.470–477, 2014.
[5] K Kumar, S Sivabalan, S Ganesan and A K Mishra, Discrimination of Oral Submucous Fibrosis (OSF) Affected Oral Tissues from Healthy Oral Tissues Using Multivariate Analysis of In-vivo Fluorescence Spectroscopic Data: A Simple and Fast Procedure for OSF Diagnosis, Analytical Methods, Vol.5, pp.3482–3489, 2013.
[6] B R Kowalski, T F Schatzki and F H Stross, Classification of Archaeological Artifacts by Applying Pattern Recognition to Trace Element Data, Analytical Chemistry, Vol.44, pp.2176–2180, 1972.
[7] T Kowalkowski, R Zbytniewski, J Szpejna and B Buszewski, Application of Chemometrics in River Water Classification, Water Research, Vol.40, pp.744–752, 2006.
[8] M F Harkat, G Mourot and J Ragot, An Improved PCA Scheme for Sensor FDI: Application to an Air Quality Monitoring Network, Journal of Process Control, Vol.16, pp.625–634, 2006.
[9] S Wold, K Esbensen and P Geladi, Principal Component Analysis, Chemometrics and Intelligent Laboratory Systems, Vol.2, pp.37–52, 1987.
[10] P de la Mata-Espinosa, J M Bosque-Sendra, R Bro and L Cuadros-Rodriguez, Olive Oil Quantification of Edible Vegetable Oil Blends Using Triacylglycerols Chromatographic Fingerprints and Chemometric Tools, Talanta, Vol.85, pp.177–182, 2011.
[11] http://www.models.life.ku.dk/oliveoil
[12] S Balakrishnama and A Ganapathiraju, Linear Discriminant Analysis – A Brief Tutorial, Institute for Signal and Information Processing, March 2, 1998. https://www.isip.piconepress.com/publications/reports/1998/isip/lda/lda_theory.pdf
[13] S Wold, Pattern Recognition by Means of Disjoint Principal Component Models, Pattern Recognition, Vol.8, pp.127–139, 1976.
[14] R Brereton, Chemometrics for Pattern Recognition, John Wiley & Sons, Ltd, U.K., 2009.

Address for Correspondence
Keshav Kumar
Institute for Wine Analysis and Beverage Research,
Hochschule Geisenheim University, Geisenheim 65366, Germany
Email: [email protected]
