Advanced Statistics Project
REPORT-13 June 2021
Shefali Kaushik
PGP-DSBA (ONLINE) MARCH’21
Contents
PROBLEM-1A
1.1. State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.
1.2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are significantly different. Interpret the result.
PROBLEM-1B
1.5. What is the interaction between two treatments? Analyse the effects of one variable on the other (Education and Occupation) with the help of an interaction plot. [hint: use the ‘pointplot’ function from the ‘seaborn’ library]
1.6. Perform a two-way ANOVA based on Salary with respect to both Education and Occupation (along with their interaction Education*Occupation). State the null and alternative hypotheses and state your results. How will you interpret this result?
1.7. Explain the business implications of performing ANOVA for this particular case study.
PROBLEM-2
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2.2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
2.5 Extract the eigenvalues and eigenvectors. [print both]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
PROBLEM-1A
Salary is hypothesized to depend on educational qualification and occupation. To understand
the dependency, the salaries of 40 individuals are collected and each person’s educational
qualification and occupation are noted. Educational qualification is at three levels, High school
graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and clerical,
Sales, Professional or specialty, and Executive or managerial. A different number of
observations are in each level of education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption may
not always hold if the sample size is small.]
Exploratory Data Analysis
Sample of dataset:
The dataset has 3 variables – Education, Occupation and Salary
Check for missing value in the dataset:
From the above results, we can say that there are no missing values present in the data.
Education and Occupation are two categorical (Independent) variables and Salary is the
response (Dependent) variable.
Summary of dataset:
For the numerical variable Salary, we get the minimum, maximum, mean, standard deviation and percentile values; for the categorical variables Education and Occupation, we get the total count, number of unique values, top value and its frequency.
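A minimal sketch of these initial checks, assuming the salary data is read from a CSV file (the file name 'SalaryData.csv' and the dataframe name df are assumptions for illustration):

import pandas as pd

# Load the salary dataset (file name assumed for illustration)
df = pd.read_csv('SalaryData.csv')

print(df.head())                    # sample of the dataset
print(df.isnull().sum())            # count of missing values per column
print(df.describe(include='all'))   # numeric and categorical summary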
1.1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
Postulate the null and alternate hypotheses for conducting a one-way ANOVA for Education and Occupation individually -
For Educational qualification -
Null hypothesis (H0): The mean salary is the same across all 3 educational levels.
Alternate hypothesis (Ha): The mean salary is different for at least one educational level.
For Occupation –
Null hypothesis (H0): The mean salary is the same across all 4 occupation levels.
Alternate hypothesis (Ha): The mean salary is different for at least one occupation level.
1.2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
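A sketch of how this one-way ANOVA could be run with statsmodels, reusing the dataframe df from the sketch above (replacing Education with Occupation in the formula gives the test for question 1.3):

import statsmodels.api as sm
from statsmodels.formula.api import ols

# One-way ANOVA of Salary with respect to Education
model_edu = ols('Salary ~ C(Education)', data=df).fit()
print(sm.stats.anova_lm(model_edu, typ=1))   # ANOVA table with F statistic and p-value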
Conclusion:
Based on the results above, the corresponding p-value is much smaller than the level of significance 'α' (0.05). Thus, we reject the null hypothesis (H0).
Therefore, it can be concluded that the mean salary is not the same across all three levels of education.
1.3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
Conclusion:
Based on the results above, the corresponding p-value (0.45) is greater than the level of significance 'α' (0.05). Thus, we fail to reject the null hypothesis (H0).
Therefore, there is not enough evidence to conclude that the mean salary differs across the four levels of occupation.
1.4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result.
(Optional)
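Since the null hypothesis was rejected for Education, one common way to find which class means differ is a post-hoc Tukey HSD test; a sketch using the statsmodels multiple-comparison module (df as in the earlier sketches):

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Pairwise comparison of mean Salary across the three Education levels
tukey = pairwise_tukeyhsd(endog=df['Salary'], groups=df['Education'], alpha=0.05)
print(tukey.summary())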
PROBLEM-1B
1.5. What is the interaction between two treatments? Analyse the effects of one variable on the
other (Education and Occupation) with the help of an interaction plot. [hint: use the ‘pointplot’
function from the ‘seaborn’ library]
The interaction effect tells us the combined effect of independent variables (Education and
Occupation) on the dependent variable (Salary).
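A sketch of the interaction plot using seaborn's pointplot, as hinted in the question (df as in the earlier sketches):

import matplotlib.pyplot as plt
import seaborn as sns

# Mean Salary for each Occupation, with one line per Education level
sns.pointplot(x='Occupation', y='Salary', hue='Education', data=df)
plt.xticks(rotation=45)
plt.show()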
Interaction plot
The interaction plot above shows an interaction between the two treatments with respect to the response variable 'Salary'. The insights drawn are:
Professional or specialty –
The 'Doctorate' degree holders have the highest salary, whereas the salary of 'Bachelors' and 'High school graduates' is comparatively low.
Sales –
Individuals who are 'High school graduates' have the lowest salary, whereas the salary of 'Bachelors' and 'Doctorate' holders is above average.
Executive or managerial –
The salary of 'Bachelors' and 'Doctorate' holders is significantly high, whereas 'High school graduates' do not appear in this occupation in the data.
Administrative and clerical –
'High school graduates' have a significantly low salary, whereas both 'Doctorate' and 'Bachelors' holders have a moderately high salary.
1.6. Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education*Occupation). State the null and alternative hypotheses
and state your results. How will you interpret this result?
Let us first have a look at the ANOVA results without the interaction between the variables.
Based on the results above, the corresponding p-value is much smaller than the significance level (0.05) for both variables. Therefore, we reject the null hypothesis (H0) in both cases.
Now, let us check the ANOVA results with interaction between the variables.
For the interaction effect, postulate the null and alternative hypotheses (respectively):
H0: There is no interaction effect between Education and Occupation on the response variable Salary, at the 5% level of significance.
Ha: There is an interaction effect between Education and Occupation on the response variable Salary, at the 5% level of significance.
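A sketch of the two-way ANOVA in statsmodels formula syntax (dropping the C(Education):C(Occupation) term gives the model without interaction; df as in the earlier sketches):

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Two-way ANOVA of Salary on Education, Occupation and their interaction
model_int = ols('Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)', data=df).fit()
print(sm.stats.anova_lm(model_int, typ=2))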
Conclusion:
Based on the ANOVA results above, the p-value for the interaction term is less than the significance level (0.05). Thus, we reject the null hypothesis (H0).
Hence, it can be concluded that there is an interaction effect between Education and Occupation on the response variable Salary, at the 5% level of significance.
1.7. Explain the business implications of performing ANOVA for this particular case study.
Performing One-way ANOVA:
o Since we reject the null hypothesis for Education, the mean salary differs across the 3 educational levels (High school graduate, Bachelor, and Doctorate); an individual’s salary does depend on educational qualification.
o Since we fail to reject the null hypothesis for Occupation, there is no evidence that the mean salary differs across the 4 occupation levels (Administrative and clerical, Sales, Professional or specialty, and Executive or managerial) on its own.
o From the one-way results, it is educational qualification, rather than occupation alone, that plays the primary role in determining salary.
Performing Two-way ANOVA:
o The interaction plot shows an interaction of the two treatments with respect to salary, and the interaction term in the two-way ANOVA is significant.
o Individuals with the educational qualification ‘Doctorate’ working as ‘Professional or specialty’ are paid the HIGHEST salary.
o ‘High school graduates’ working in ‘Sales’ have the LOWEST salary.
o Hence, salary is best assessed on the combination of education and occupation rather than on either factor separately.
PROBLEM-2
The dataset contains information on various colleges. You are expected to do a Principal Component
Analysis for this case study according to the instructions given. The data dictionary of the 'Education -
Post 12th Standard.csv' can be found in the following file: Data Dictionary.xlsx.
Sample of dataset:
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
We start by performing some basic checks – the data type of each variable, a summary of the data (which shows how the numerical values are spread), and the data dimensions.
Checking information of dataset
The dataset consists of 777 observations and 18 variables, of which 16 are of type ‘int’, one (‘Names’) is of type ‘object’ and one (‘S.F.Ratio’) is of type ‘float’. No null values are present in the data.
Checking summary of the data:
This gives us the minimum value, mean values, Standard deviation values, different
percentile values and maximum values for each of the numeric variable.
Checking data dimension:
(777, 18) – the dataset has 777 rows and 18 columns.
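A sketch of these initial checks, using the file name given in the problem statement (the dataframe name edu is chosen here for illustration):

import pandas as pd

edu = pd.read_csv('Education - Post 12th Standard.csv')
edu.info()             # data types and non-null counts per column
print(edu.describe())  # summary statistics for the numeric variables
print(edu.shape)       # (rows, columns) -> (777, 18)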
Univariate Analysis:
1) Using distplot to check the normality of each variable.
The distplot visually represents the univariate distribution of the data, i.e. it plots the distribution of a variable against its density. The plots above show the density distribution of the 17 numeric variables, which helps us check whether each is normally distributed. The results show that the variables are approximately normally distributed, with some left or right skewness.
2) Using boxplot to check the outliers present in each variable.
The boxplot gives us the five-number summary along with the presence of outliers in the data. From the plot above, we can see that outliers are present in the majority of the variables.
Bivariate Analysis:
1. Using a heatmap to check the correlation between the variables.
The heatmap gives us a visual insight into how the variables are correlated with one another. Variables that are highly correlated with each other have values closer to 1.
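A sketch of how these univariate and bivariate plots could be produced on the numeric columns (edu as in the sketch above; 'Names' is dropped since it only identifies rows):

import matplotlib.pyplot as plt
import seaborn as sns

num = edu.drop(columns=['Names'])   # keep only the numeric variables

# Univariate: density distribution and boxplot of each variable
for col in num.columns:
    sns.distplot(num[col])
    plt.show()
num.boxplot(figsize=(15, 6), rot=90)
plt.show()

# Bivariate: correlation heatmap
sns.heatmap(num.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()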
Insights from EDA
The dataset has 777 observations and 18 variables.
The dataset has no missing values, special characters or duplicate rows.
All variables are of dtype ‘int’ except ‘Names’, which is of dtype 'object', and ‘S.F.Ratio’, which is of dtype ‘float'.
The variable 'Names' can be dropped for the PCA analysis as it only identifies the rows in the dataset.
The majority of variables have outliers; ‘Expend’ shows a higher presence of outliers.
As per the univariate analysis, each variable is approximately normally distributed.
'Apps', 'Accept', 'Enroll' and 'F.Undergrad' show high correlation with each other, whereas 'S.F.Ratio' shows negligible correlation with all other variables.
2.2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
In scaling, we convert variables with different scales of measurement onto a single scale, which sometimes also helps speed up the calculations. Scaling/standardization is applied to the independent variables to normalize the data within a particular range.
Yes, scaling is necessary in this case.
Reason –
o The variables present in the dataset are on different measurement scales and vary widely from one another. Since the ranges of the values vary widely, scaling becomes a mandatory pre-processing step.
o For example, in our dataset Apps, Outstate and Expend have values in the thousands, Top10perc, Top25perc and Grad.Rate have values in just two digits, and S.F.Ratio has values in decimals. Since these variables are on different scales, it is hard to compare them.
o So, in order to normalize the data and bring the values of the numeric columns onto a common scale, scaling is necessary.
Let us now perform scaling on this dataset.
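A sketch of the scaling step using z-score standardization (StandardScaler is one common choice; num is the numeric dataframe from the EDA sketch above):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize every numeric variable to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(num), columns=num.columns)
print(scaled.head())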
If we look at the results above, all numeric variables are now standardized onto one common scale. Therefore, the data is now suitable for performing PCA.
2.3 Comment on the comparison between the covariance and the correlation matrices from this
data. [on scaled data]
Covariance and correlation are two mathematical concepts that determine the relationship and measure the dependency between random variables. Despite some similarities between the two terms, they also have a few differences.
Covariance matrix – a matrix that captures how each pair of variables varies together; it is affected by a change in scale.
Correlation matrix – the scaled (standardized) form of the covariance matrix; it is not influenced by a change in scale.
Now that we have the standardized data with us, let us proceed with generating the covariance and correlation matrices. Since the scaled data has unit variance, the covariance matrix of the scaled data is essentially identical to the correlation matrix, which is exactly why the comparison is made on scaled data.
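A sketch of generating both matrices on the scaled data (scaled as in the sketch above); apart from the n−1 versus n denominator, the two matrices should come out essentially identical:

import numpy as np

cov_matrix = np.cov(scaled.T)        # covariance matrix of the scaled data
corr_matrix = scaled.corr().values   # correlation matrix
print(np.round(cov_matrix, 2))
print(np.round(corr_matrix, 2))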
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
Box plot (Before scaling)
Box plot (After scaling)
Scaling transforms the variables onto a common scale, but it does not necessarily remove outliers from the data.
However, if we compare the boxplots before and after scaling, a few insights can be drawn –
The outliers are much more visible and can be observed clearly after scaling.
The information is much easier to read off the plot after scaling, since all variables share one axis.
'Top25perc' is the only feature that does not have any outliers after scaling.
Before scaling the boxplots mostly show right-skewed data, whereas after scaling both left- and right-skewed variables can be observed.
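A rough way to quantify the outliers per variable, sketched here with the 1.5 × IQR rule on the scaled data (scaled as above):

# Count outliers per column using the 1.5 * IQR rule
q1, q3 = scaled.quantile(0.25), scaled.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((scaled < q1 - 1.5 * iqr) | (scaled > q3 + 1.5 * iqr)).sum()
print(outlier_counts.sort_values(ascending=False))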
2.5 Extract the eigenvalues and eigenvectors. [print both]
o Eigenvectors determine the directions (principal axes) of maximum variance in the data.
o Eigenvalues determine the magnitude of the variance along each of those directions.
o Each eigenvector has a corresponding eigenvalue.
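A sketch of extracting both from the covariance matrix of the scaled data with numpy (scaled as in the earlier sketch):

import numpy as np

# Eigen decomposition of the covariance matrix of the scaled data
cov_matrix = np.cov(scaled.T)
eig_values, eig_vectors = np.linalg.eig(cov_matrix)
print('Eigenvalues:\n', eig_values)
print('Eigenvectors:\n', eig_vectors)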
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data
frame with the original features
1. Using scikit-learn PCA – it performs all the steps and maps the data onto the PCA dimensions in one shot.
Below are the principal component scores -
2. Looking at the loading of each feature on the components, we get our eigenvectors –
3. Exporting the principal components (eigenvectors) into a data frame –
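A sketch of steps 1–3 with scikit-learn (scaled as above has 17 numeric columns after dropping 'Names'; the PC1…PC17 column names are chosen here for illustration):

import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=17)
pc_scores = pca.fit_transform(scaled)   # principal component scores

# Loadings (eigenvectors) as a data frame indexed by the original features
loadings = pd.DataFrame(pca.components_.T,
                        index=scaled.columns,
                        columns=['PC' + str(i + 1) for i in range(17)])
print(loadings.round(2))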
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with
two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and
corresponding features]
These are all the variable names present in the dataset -
The first principal component in terms of the eigenvectors is -
From both results above, we can write the explicit form/linear equation of the first principal component (using values with 2 decimal places) –
The linear equation of the 1st principal component:
PC1 = 0.25 * Apps + 0.21 * Accept + 0.18 * Enroll + 0.35 * Top10perc + 0.34 * Top25perc + 0.15 * F.Undergrad + 0.03 * P.Undergrad + 0.29 * Outstate + 0.25 * Room.Board + 0.06 * Books - 0.04 * Personal + 0.32 * PhD + 0.32 * Terminal - 0.18 * S.F.Ratio + 0.21 * perc.alumni + 0.32 * Expend + 0.25 * Grad.Rate
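The same equation can also be built directly from the first row of the loadings; a sketch reusing pca and scaled from the earlier sketches:

# Build the linear equation of the first principal component from its loadings
terms = [f'{w:+.2f} * {name}' for w, name in zip(pca.components_[0], scaled.columns)]
print('PC1 = ' + ' '.join(terms))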
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on
the optimum number of principal components? What do the eigenvectors indicate?
1) The cumulative explained variance is the running sum of the eigenvalues expressed as a proportion of their total, given below -
2) We can also view a scree plot to identify the number of PCs to retain.
Figure 2.8.1
Figure 2.8.2
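A sketch of how the cumulative explained variance and the scree plot (figures 2.8.1 and 2.8.2) could be produced, reusing pca from the earlier sketch:

import numpy as np
import matplotlib.pyplot as plt

# Cumulative explained variance across the principal components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cum_var, 3))

# Scree plot: explained variance ratio per component
plt.plot(range(1, len(cum_var) + 1), pca.explained_variance_ratio_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.title('Scree plot')
plt.show()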
Compared to the scree plot, the cumulative explained variance is a more precise measure for deciding the optimum number of PCs.
Insights drawn:
Deciding the optimum no. of PCs –
The eigenvalues are sorted in descending order, and their running sum (as a proportion of the total) gives the cumulative explained variance. The cumulative variance helps us decide the optimum number of principal components that capture most of the information from the original data (taking 80% as the threshold). Looking at the cumulative variance explained, the first 6 principal components capture about 80% of the significant information in the data. Therefore, we can proceed with 6 components and drop the remaining ones.
What do the eigenvectors indicate -
The eigenvectors are the principal component directions of the features in the dataset. Sorting the eigenvectors in descending order of their eigenvalues, the first eigenvector indicates the direction of largest spread in the data, the second eigenvector the second largest spread, and so forth.
2.9 Explain the business implication of using the Principal Component Analysis for this case
study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]
PCA helps us reduce the dimensionality of the data and also captures the correlations between the original variables.
Thus, by following the mechanics of PCA, we have concluded that the first 6 PCs carry the significant information about the data.
In figure 2.8.1, there is a distinct break at 3. However, 3 cannot be taken as the optimum number of PCs since the first three PCs explain only about 65% of the total variance. The PCs should be chosen so as to explain between 70% and 90% of the total variance (we are using 80% as the threshold value in this case).
Therefore, we settle on the first 6 principal components, so that the explained variance is above 80%.
First 6 PCs