Kumar 2017
PCA provides a fast and effective way of analysing data sets from
various disciplines, viz., physics, biology, chemistry, archaeology,
etc. PCA essentially reduces the dimensions of the data set while
retaining most of the variation [1, 2]. Data compression by PCA
involves finding a new space, spanned by fewer dimensions, onto which
the original data set is projected. The dimensions of the new space
are orthogonal to each other, which simplifies the data sets for
further analysis. The theoretical and various technical aspects of
PCA are discussed below.
2. Theory
The new variables (or dimensions) T1 and T2 are linear combinations
of the variables J1 and J2, with sine and cosine as coefficients:
\[ T_1 = J_1 \cos\theta + J_2 \sin\theta \tag{1} \]
In principle, the variation along the T1 axis can be taken as a good
approximation of the original two-dimensional data set, and one can
safely ignore the T2 axis. Thus, by projecting the data set onto a
suitable space, it is possible to reduce the dimensions of the data
set while retaining most of the information.

While the score value explains how the samples are related to each
other, the loading value explains how the variables are related to
each other.
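The rotation in equation (1) can be illustrated numerically. The following Python sketch is not part of the original analysis; the data points, the angle of 45 degrees, and the expression used for the second axis T2 (taken here as the standard orthogonal counterpart of equation (1)) are assumptions made purely for illustration.

```python
import math

def rotate(points, theta):
    """Project 2-D points (J1, J2) onto axes rotated by theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    t1 = [j1 * c + j2 * s for j1, j2 in points]    # equation (1)
    t2 = [-j1 * s + j2 * c for j1, j2 in points]   # assumed orthogonal axis
    return t1, t2

def variance(values):
    """Sample variance (denominator n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

# Hypothetical data lying close to the line J2 = J1.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]

t1, t2 = rotate(points, math.pi / 4)
share_t1 = variance(t1) / (variance(t1) + variance(t2))
print(f"Fraction of total variance along T1: {share_t1:.4f}")
```

For data of this kind almost all of the variance falls along T1, so the T2 coordinate can be dropped with little loss. In this picture the T1 values play the role of scores, and the coefficients (cos θ, sin θ) play the role of loadings.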
2.2 Commonly Used Terminologies in PCA
\[ \text{Amount of variance} = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{K} \lambda_i} > 80\%; \quad k \le K. \]
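The criterion above can be applied directly once the eigenvalues are known. The sketch below is a minimal Python illustration; the function name `components_needed` and the eigenvalues are hypothetical, and the 80% threshold is simply the figure quoted in the expression above.

```python
def components_needed(eigenvalues, threshold=0.80):
    """Return (k, fraction): the smallest k whose leading eigenvalues
    capture more than `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total > threshold:
            return k, running / total
    # Threshold never exceeded: all K components are needed.
    return len(ordered), 1.0

# Hypothetical eigenvalues, not taken from the olive oil data set.
eigenvalues = [7.8, 0.5, 0.4, 0.3, 0.25, 0.2, 0.2, 0.15, 0.1, 0.1]

k, captured = components_needed(eigenvalues)
print(f"k = {k}, variance captured = {100 * captured:.1f}%")
```

The comparison is strict (`>`), matching the inequality in the expression; a different cut-off can be passed through `threshold`.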
All the analysis and data plotting were carried out on the
MATLAB-2014 platform. However, other platforms such as R and Python
can also be used for the analysis.
The chromatographic data sets acquired for the olive and non-olive
oils are shown in Figure 3. Visual analysis of the chromatographic
profiles alone makes it difficult to differentiate olive oils from
non-olive vegetable oils. Moreover, manual analysis of such a large
volume of data is laborious and time consuming, and may not provide
any meaningful interpretation. PCA, however, simplifies and reduces
the dimensions of the data set, and provides a fast and efficient way
of analysing the complex chromatographic data of the selected oil
samples. The chromatographic data sets are arranged in a matrix of
dimensions 126 × 4001, where 126 is the number of samples and 4001 is
the number of variables (i.e., retention time points). The
chromatographic data sets are normalized to unit area and
mean-centered prior to PCA. The optimum number of principal
components required for fitting the PCA model is obtained by plotting
the variance captured by different principal components against the
principal component numbers, as shown in Figure 4. It can be seen
that the addition of principal components initially brings a
substantial improvement in the cumulative percentage of variance
captured by the PCA model. However, beyond 2 principal components,
the addition of extra factors does not bring any substantial
improvement in the cumulative variance captured by the PCA model.
Thus, one can conclude that a PCA model of two principal components,
which explains more than 80% of the variance, is optimum to capture
the important information buried in the data set.
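The preprocessing and the variance-versus-component analysis described above can be sketched in Python with NumPy. This is an illustrative reconstruction under stated assumptions, not the MATLAB procedure used for the olive oil data: the synthetic two-peak "chromatograms" and the function names are hypothetical, the matrix is much smaller than 126 × 4001, and the area of each profile is approximated by the sum of its points (which assumes uniformly spaced retention times).

```python
import numpy as np

def preprocess(X):
    """Normalize each row (sample) to unit area, then mean-center each column."""
    area = X.sum(axis=1, keepdims=True)   # sum stands in for area (uniform spacing assumed)
    X_unit = X / area
    return X_unit - X_unit.mean(axis=0)

def variance_captured(Xc):
    """Percentage of variance captured by each principal component."""
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
    eig = s ** 2                              # proportional to the eigenvalues
    return 100.0 * eig / eig.sum()

# Hypothetical data: 12 samples x 50 retention time points, two peaks
# mixed in varying proportions on a constant baseline.
rng = np.random.default_rng(0)
time = np.linspace(0, 1, 50)
peak_a = np.exp(-((time - 0.3) / 0.05) ** 2)
peak_b = np.exp(-((time - 0.7) / 0.05) ** 2)
weights = rng.uniform(0.2, 1.0, size=(12, 1))
X = weights * peak_a + (1 - weights) * peak_b + 0.1
X += rng.normal(scale=0.001, size=X.shape)

Xc = preprocess(X)
percent = variance_captured(Xc)
cumulative = np.cumsum(percent)
for pc, (p, c) in enumerate(zip(percent[:4], cumulative[:4]), start=1):
    print(f"PC{pc}: {p:6.2f}% (cumulative {c:6.2f}%)")
```

Plotting `cumulative` against the component number gives the kind of curve from which the number of components is read off: the point beyond which extra components add little.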
PC1 and PC2 individually explain 78% and 5% of the variance of the
data sets, respectively. The PC1 versus PC2 score plot is shown in
Figure 5; the PCA model clearly separates the samples into two
groups. All the samples belonging to the class of olive oils have
negative PC1 score values, and the samples belonging to the non-olive
oil class have positive PC1 score values. The blended oils in the
score plot appear near the edges of the ellipse. The blended oils
containing more olive oil have negative PC1 score values, whereas the
blended oils containing less olive oil have positive PC1 score
values. The loading vectors, which explain how the variables are
related to each other, are shown in Figure 6.
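How scores and loadings carry these two kinds of information can be seen on a small synthetic example. The sketch below assumes two hypothetical classes that differ only in the height of one major peak; it is not the olive oil data, and the decomposition via NumPy's `np.linalg.svd` is one common way of computing PCA rather than necessarily the routine used in the original work.

```python
import numpy as np

rng = np.random.default_rng(1)
points = np.linspace(0, 1, 40)
major = np.exp(-((points - 0.5) / 0.08) ** 2)   # hypothetical major peak

# Two hypothetical classes of six samples each, differing in peak height.
class_a = 1.0 * major + rng.normal(scale=0.02, size=(6, 40))
class_b = 2.0 * major + rng.normal(scale=0.02, size=(6, 40))
X = np.vstack([class_a, class_b])
Xc = X - X.mean(axis=0)                          # mean-centering

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s       # rows: samples, columns: principal components
loadings = Vt.T      # rows: variables, columns: principal components

pc1_a, pc1_b = scores[:6, 0], scores[6:, 0]
print("PC1 scores, class A:", np.round(pc1_a, 2))
print("PC1 scores, class B:", np.round(pc1_b, 2))
print("Variable with largest |PC1 loading|:", int(np.argmax(np.abs(loadings[:, 0]))))
```

The two classes fall on opposite sides of zero along PC1, and the PC1 loading is largest at the peak that distinguishes them. Which class comes out negative is not meaningful in itself: the sign of a principal component is arbitrary, and flipping the signs of a score column and the matching loading column together gives an equally valid model.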
The analysis of the loading vector plot can be really helpful in
finding the set of variables that characterize the samples. The
loading vector corresponding to PC1 mainly explains the variations of
the major peaks that can be used to characterize the classes of olive
oils and non-olive oils. The loading vector corresponding to PC2
mainly explains the variation of the minor peaks and can be used to
differentiate the samples within the olive oil and non-olive oil
groups. Based on the loading vector profiles, the appearance of
blended oil samples on the extreme edges of the PC2 axis can be
attributed to the fact that blended oils
3.4 Conclusions
Suggested Reading
[1] D L Massart, B G M Vandeginste, L M C Buydens, S de Jong, P J Lewi and V
J S Verbeke, Handbook of Chemometrics and Qualimetrics, Elsevier, New York,
1997.
[2] R Kramer, Chemometric Techniques for Quantitative Analysis, Marcel Dekker,
New York, 1998.
[3] H P Singh, R K Gulati and Ranjan Gupta, Stellar Spectral Classification
Using Principal Component Analysis and Artificial Neural Networks, Monthly
Notices of the Royal Astronomical Society, Vol.295, pp.312–318, 1998.
[4] K Kumar, P Bairi, K Ghosh, K K Mishra and A K Mishra, Classification
of Aqueous-based Ayurvedic Preparations Using Synchronous Fluorescence
Spectroscopy and Chemometric Techniques, Current Science, Vol.107, No.3,
pp.470–477, 2014.
[5] K Kumar, S Sivabalan, S Ganesan and A K Mishra, Discrimination of Oral
Submucous Fibrosis (OSF) Affected Oral Tissues From Healthy Oral Tissues
Using Multivariate Analysis of In-vivo Fluorescence Spectroscopic Data:
A Simple and Fast Procedure for OSF Diagnosis, Analytical Methods, Vol.5,
pp.3482–3489, 2013.
[6] B R Kowalski, T F Schatzki and F H Stross, Classification of Archaeological
Artifacts by Applying Pattern Recognition to Trace Element Data, Analytical
Chemistry, Vol.44, pp.2176–2180, 1972.
[7] T Kowalkowski, R Zbytniewski, J Szpejna and B Buszewski, Application of
Chemometrics in River Water Classification, Water Research, Vol.40,
pp.744–752, 2006.
[8] M F Harkat, G Mourot and J Ragot, An Improved PCA Scheme for Sensor
FDI: Application to An Air Quality Monitoring Network, Journal of Process
Control, Vol.16, pp.625–634, 2006.
[9] S Wold, K Esbensen and P Geladi, Principal Component Analysis,
Chemometrics and Intelligent Laboratory Systems, Vol.2, pp.37–52, 1987.
[10] P de la Mata-Espinosa, J M Bosque-Sendra, R Bro and L Cuadros-Rodriguez,
Olive Oil Quantification of Edible Vegetable Oil Blends Using Triacylglyc-