Advanced Statistics Project
REPORT-13 June 2021
Shefali Kaushik
PGP-DSBA (ONLINE) MARCH’21
Contents
PROBLEM-1A
1.1. State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.
1.2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are significantly different. Interpret the result.
PROBLEM-1B
1.5. What is the interaction between two treatments? Analyse the effects of one variable on the other (Education and Occupation) with the help of an interaction plot. [hint: use the ‘pointplot’ function from the ‘seaborn’ library]
1.6. Perform a two-way ANOVA based on Salary with respect to both Education and Occupation (along with their interaction Education*Occupation). State the null and alternative hypotheses and state your results. How will you interpret this result?
1.7. Explain the business implications of performing ANOVA for this particular case study.
PROBLEM-2
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2.2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
2.5 Extract the eigenvalues and eigenvectors. [print both]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
PROBLEM-1A
Salary is hypothesized to depend on educational qualification and occupation. To understand
the dependency, the salaries of 40 individuals are collected and each person’s educational
qualification and occupation are noted. Educational qualification is at three levels, High school
graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and clerical,
Sales, Professional or specialty, and Executive or managerial. A different number of
observations are in each level of education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption may
not always hold if the sample size is small.]
Exploratory Data Analysis
Sample of dataset:
The dataset has 3 variables – Education, Occupation and Salary
Check for missing value in the dataset:
From the above results, we can say that there are no missing values present in the data.
Education and Occupation are two categorical (Independent) variables and Salary is the
response (Dependent) variable.
Summary of dataset:
For the numerical variable Salary, we get the minimum, maximum, mean, standard deviation and percentile values; for the categorical variables Education and Occupation, we get the total count, number of unique values, top value and its frequency.
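A minimal sketch of these initial checks, assuming the salary data is read from a CSV file (the file name 'SalaryData.csv' and the dataframe name df are assumptions for illustration):

import pandas as pd

# Load the salary dataset (file name assumed for illustration)
df = pd.read_csv('SalaryData.csv')

print(df.head())                    # sample of the dataset
print(df.isnull().sum())            # count of missing values per column
print(df.describe(include='all'))   # numeric and categorical summary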
1.1. State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
Postulate the null and alternate hypotheses for conducting a one-way ANOVA for Education and Occupation individually -
For Educational qualification -
Null hypothesis (H0): The mean salary is the same across all 3 educational levels.
Alternate hypothesis (Ha): The mean salary is different for at least one educational level.
For Occupation –
Null hypothesis (H0): The mean salary is the same across all 4 occupation levels.
Alternate hypothesis (Ha): The mean salary is different for at least one occupation level.
1.2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
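A sketch of how this one-way ANOVA could be run with statsmodels, reusing the dataframe df from the sketch above (replacing Education with Occupation in the formula gives the test for question 1.3):

import statsmodels.api as sm
from statsmodels.formula.api import ols

# One-way ANOVA of Salary with respect to Education
model_edu = ols('Salary ~ C(Education)', data=df).fit()
print(sm.stats.anova_lm(model_edu, typ=1))   # ANOVA table with F statistic and p-value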
Conclusion:
Based on the results above, the corresponding p-value is much smaller than the level of significance 'α' (0.05). Thus, we reject the null hypothesis (H0).
Therefore, it can be concluded that the mean salary is not the same across all three levels of education.
1.3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
Conclusion:
Based on the results above, the corresponding p-value (0.45) is greater than the level of significance 'α' (0.05). Thus, we fail to reject the null hypothesis (H0).
Therefore, there is not enough evidence to conclude that the mean salary differs across the four levels of occupation.
1.4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result.
(Optional)
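Since the null hypothesis was rejected for Education, one common way to find which class means differ is a post-hoc Tukey HSD test; a sketch using the statsmodels multiple-comparison module (df as in the earlier sketches):

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Pairwise comparison of mean Salary across the three Education levels
tukey = pairwise_tukeyhsd(endog=df['Salary'], groups=df['Education'], alpha=0.05)
print(tukey.summary())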
PROBLEM-1B
1.5. What is the interaction between two treatments? Analyse the effects of one variable on the
other (Education and Occupation) with the help of an interaction plot. [hint: use the ‘pointplot’
function from the ‘seaborn’ library]
The interaction effect tells us the combined effect of independent variables (Education and
Occupation) on the dependent variable (Salary).
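A sketch of the interaction plot using seaborn's pointplot, as hinted in the question (df as in the earlier sketches):

import matplotlib.pyplot as plt
import seaborn as sns

# Mean Salary for each Occupation, with one line per Education level
sns.pointplot(x='Occupation', y='Salary', hue='Education', data=df)
plt.xticks(rotation=45)
plt.show()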
Interaction plot
The interaction plot above shows an interaction between the two treatments with respect to the response variable 'Salary'. The insights drawn are:
Professional or specialty –
The 'Doctorate' degree holders have the highest salary, whereas the salary of 'Bachelors' and 'High school graduates' is comparatively low.
Sales –
Individuals who are 'High school graduates' have the lowest salary, whereas the salary of 'Bachelors' and 'Doctorate' holders is above average.
Executive or managerial –
The salary of 'Bachelors' and 'Doctorate' holders is significantly high, whereas 'High school graduates' do not appear in this occupation in the data.
Administrative and clerical –
'High school graduates' have a significantly low salary, whereas both 'Doctorate' and 'Bachelors' holders have a moderately high salary.
1.6. Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education*Occupation). State the null and alternative hypotheses
and state your results. How will you interpret this result?
Let us first have a look at the ANOVA results without the interaction between the variables.
Based on the results above, the corresponding p-value is much smaller than the significance level (0.05) for both variables. Therefore, we reject the null hypothesis (H0) in both cases.
Now, let us check the ANOVA results with interaction between the variables.
For the interaction effect, postulate the null and alternative hypotheses (respectively):
H0: There is no interaction effect between Education and Occupation on the response variable Salary, at the 5% level of significance.
Ha: There is an interaction effect between Education and Occupation on the response variable Salary, at the 5% level of significance.
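A sketch of the two-way ANOVA in statsmodels formula syntax (dropping the C(Education):C(Occupation) term gives the model without interaction; df as in the earlier sketches):

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Two-way ANOVA of Salary on Education, Occupation and their interaction
model_int = ols('Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)', data=df).fit()
print(sm.stats.anova_lm(model_int, typ=2))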
Conclusion:
Based on the ANOVA results above, the p-value for the interaction term is less than the significance level (0.05). Thus, we reject the null hypothesis (H0).
Hence, it can be concluded that there is an interaction effect between Education and Occupation on the response variable Salary, at the 5% level of significance.
1.7. Explain the business implications of performing ANOVA for this particular case study.
Performing One-way ANOVA:
o Since we reject the null hypothesis for Education, the mean salary differs across the 3 educational levels (High school graduate, Bachelor, and Doctorate); an individual’s salary does depend on educational qualification.
o Since we fail to reject the null hypothesis for Occupation, there is no evidence that the mean salary differs across the 4 occupation levels (Administrative and clerical, Sales, Professional or specialty, and Executive or managerial) on its own.
o From the one-way results, it is educational qualification, rather than occupation alone, that plays the primary role in determining salary.
Performing Two-way ANOVA:
o The interaction plot shows an interaction of the two treatments with respect to salary, and the interaction term in the two-way ANOVA is significant.
o Individuals with the educational qualification ‘Doctorate’ working as ‘Professional or specialty’ are paid the HIGHEST salary.
o ‘High school graduates’ working in ‘Sales’ have the LOWEST salary.
o Hence, salary is best assessed on the combination of education and occupation rather than on either factor separately.
PROBLEM-2
The dataset contains information on various colleges. You are expected to do a Principal Component
Analysis for this case study according to the instructions given. The data dictionary of the 'Education -
Post 12th Standard.csv' can be found in the following file: Data Dictionary.xlsx.
Sample of dataset:
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
We start by performing some basic checks – the data type of each variable, a summary of the data (which shows how the numerical values are spread), and the data dimensions.
Checking information of dataset
The dataset consists of 777 observations and 18 variables, of which 16 are of type ‘int’, one (‘Names’) is of type ‘object’ and one (‘S.F.Ratio’) is of type ‘float’. No null values are present in the data.
Checking summary of the data:
This gives us the minimum value, mean values, Standard deviation values, different
percentile values and maximum values for each of the numeric variable.
Checking data dimension:
(777, 18) – the dataset has 777 rows and 18 columns.
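A sketch of these initial checks, using the file name given in the problem statement (the dataframe name edu is chosen here for illustration):

import pandas as pd

edu = pd.read_csv('Education - Post 12th Standard.csv')
edu.info()             # data types and non-null counts per column
print(edu.describe())  # summary statistics for the numeric variables
print(edu.shape)       # (rows, columns) -> (777, 18)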
Univariate Analysis:
1) Using distplot to check the normality of each variable.
The distplot visually represents the univariate distribution of the data, i.e. it plots the distribution of a variable against its density. The plots above show the density distribution of the 17 numeric variables, which helps us check whether each is normally distributed. The results show that the variables are approximately normally distributed, with some left or right skewness.
2) Using boxplot to check the outliers present in each variable.
The boxplot gives us the five-number summary along with the presence of outliers in the data. From the plot above, we can see that outliers are present in the majority of the variables.
Bivariate Analysis:
1. Using a heatmap to check the correlation between the variables.
The heatmap gives us a visual insight into how the variables are correlated with one another. Variables that are highly correlated with each other have values closer to 1.
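A sketch of how these univariate and bivariate plots could be produced on the numeric columns (edu as in the sketch above; 'Names' is dropped since it only identifies rows):

import matplotlib.pyplot as plt
import seaborn as sns

num = edu.drop(columns=['Names'])   # keep only the numeric variables

# Univariate: density distribution and boxplot of each variable
for col in num.columns:
    sns.distplot(num[col])
    plt.show()
num.boxplot(figsize=(15, 6), rot=90)
plt.show()

# Bivariate: correlation heatmap
sns.heatmap(num.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()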
Insights from EDA
The dataset has 777 observations and 18 variables.
The dataset has no missing values, special characters or duplicate rows.
All variables are of dtype ‘int’ except ‘Names’, which is of dtype 'object', and ‘S.F.Ratio’, which is of dtype ‘float'.
The variable 'Names' can be dropped for the PCA analysis as it only identifies the rows in the dataset.
The majority of variables have outliers; ‘Expend’ shows a higher presence of outliers.
As per the univariate analysis, each variable is approximately normally distributed.
'Apps', 'Accept', 'Enroll' and 'F.Undergrad' show high correlation with each other, whereas 'S.F.Ratio' shows negligible correlation with all other variables.
2.2. Is scaling necessary for PCA in this case? Give justification and perform scaling.
In scaling, we convert variables with different scales of measurement onto a single scale, which sometimes also helps speed up the calculations. Scaling/standardization is applied to the independent variables to normalize the data within a particular range.
Yes, scaling is necessary in this case.
Reason –
o The variables present in the dataset are on different measurement scales and vary widely from one another. Since the ranges of the values vary widely, scaling becomes a mandatory pre-processing step.
o For example, in our dataset Apps, Outstate and Expend have values in the thousands, Top10perc, Top25perc and Grad.Rate have values in just two digits, and S.F.Ratio has values in decimals. Since these variables are on different scales, it is hard to compare them.
o So, in order to normalize the data and bring the values of the numeric columns onto a common scale, scaling is necessary.
Let us now perform scaling on this dataset.
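A sketch of the scaling step using z-score standardization (StandardScaler is one common choice; num is the numeric dataframe from the EDA sketch above):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize every numeric variable to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(num), columns=num.columns)
print(scaled.head())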
If we look at the results above, all numeric variables are now standardized onto one common scale. Therefore, the data is now suitable for performing PCA.
2.3 Comment on the comparison between the covariance and the correlation matrices from this
data. [on scaled data]
Covariance and correlation are two mathematical concepts that determine the relationship and measure the dependency between random variables. Despite some similarities between the two terms, they also have a few differences.
Covariance matrix – a matrix that captures how each pair of variables varies together; it is affected by a change in scale.
Correlation matrix – the scaled (standardized) form of the covariance matrix; it is not influenced by a change in scale.
Now that we have the standardized data with us, let us proceed with generating the covariance and correlation matrices. Since the scaled data has unit variance, the covariance matrix of the scaled data is essentially identical to the correlation matrix, which is exactly why the comparison is made on scaled data.
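A sketch of generating both matrices on the scaled data (scaled as in the sketch above); apart from the n−1 versus n denominator, the two matrices should come out essentially identical:

import numpy as np

cov_matrix = np.cov(scaled.T)        # covariance matrix of the scaled data
corr_matrix = scaled.corr().values   # correlation matrix
print(np.round(cov_matrix, 2))
print(np.round(corr_matrix, 2))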
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
Box plot (Before scaling)
Box plot (After scaling)
Scaling transforms the variables onto a common scale, but it does not necessarily remove outliers from the data.
However, if we compare the boxplots before and after scaling, a few insights can be drawn –
The outliers are much more visible and can be observed clearly after scaling.
The information is much easier to read off the plot after scaling, since all variables share one axis.
'Top25perc' is the only feature that does not have any outliers after scaling.
Before scaling the boxplots mostly show right-skewed data, whereas after scaling both left- and right-skewed variables can be observed.
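A rough way to quantify the outliers per variable, sketched here with the 1.5 × IQR rule on the scaled data (scaled as above):

# Count outliers per column using the 1.5 * IQR rule
q1, q3 = scaled.quantile(0.25), scaled.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((scaled < q1 - 1.5 * iqr) | (scaled > q3 + 1.5 * iqr)).sum()
print(outlier_counts.sort_values(ascending=False))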
2.5 Extract the eigenvalues and eigenvectors. [print both]
o Eigenvectors determine the directions (principal axes) of maximum variance in the data.
o Eigenvalues determine the magnitude of the variance along each of those directions.
o Each eigenvector has a corresponding eigenvalue.
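A sketch of extracting both from the covariance matrix of the scaled data with numpy (scaled as in the earlier sketch):

import numpy as np

# Eigen decomposition of the covariance matrix of the scaled data
cov_matrix = np.cov(scaled.T)
eig_values, eig_vectors = np.linalg.eig(cov_matrix)
print('Eigenvalues:\n', eig_values)
print('Eigenvectors:\n', eig_vectors)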
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data
frame with the original features
1. Using scikit-learn PCA – it performs all the steps and maps the data onto the PCA dimensions in one shot.
Below are the principal component scores -
2. Looking at the loading of each feature on the components, we get our eigenvectors –
3. Exporting the principal components (eigenvectors) into a data frame –
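A sketch of steps 1–3 with scikit-learn (scaled as above has 17 numeric columns after dropping 'Names'; the PC1…PC17 column names are chosen here for illustration):

import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=17)
pc_scores = pca.fit_transform(scaled)   # principal component scores

# Loadings (eigenvectors) as a data frame indexed by the original features
loadings = pd.DataFrame(pca.components_.T,
                        index=scaled.columns,
                        columns=['PC' + str(i + 1) for i in range(17)])
print(loadings.round(2))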
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with
two places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and
corresponding features]
These are all the variable names present in the dataset -
The first principal component in terms of the eigenvectors is -
From both results above, we can write the explicit form/linear equation of the first principal component (using values with 2 decimal places) –
The linear equation of the 1st principal component:
PC1 = 0.25 * Apps + 0.21 * Accept + 0.18 * Enroll + 0.35 * Top10perc + 0.34 * Top25perc + 0.15 * F.Undergrad + 0.03 * P.Undergrad + 0.29 * Outstate + 0.25 * Room.Board + 0.06 * Books - 0.04 * Personal + 0.32 * PhD + 0.32 * Terminal - 0.18 * S.F.Ratio + 0.21 * perc.alumni + 0.32 * Expend + 0.25 * Grad.Rate
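The same equation can also be built directly from the first row of the loadings; a sketch reusing pca and scaled from the earlier sketches:

# Build the linear equation of the first principal component from its loadings
terms = [f'{w:+.2f} * {name}' for w, name in zip(pca.components_[0], scaled.columns)]
print('PC1 = ' + ' '.join(terms))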
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on
the optimum number of principal components? What do the eigenvectors indicate?
1) The cumulative explained variance is the running sum of the eigenvalues expressed as a proportion of their total, given below -
2) We can also view a scree plot to identify the number of PCs to retain.
Figure 2.8.1
Figure 2.8.2
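A sketch of how the cumulative explained variance and the scree plot (figures 2.8.1 and 2.8.2) could be produced, reusing pca from the earlier sketch:

import numpy as np
import matplotlib.pyplot as plt

# Cumulative explained variance across the principal components
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cum_var, 3))

# Scree plot: explained variance ratio per component
plt.plot(range(1, len(cum_var) + 1), pca.explained_variance_ratio_, marker='o')
plt.xlabel('Principal component')
plt.ylabel('Explained variance ratio')
plt.title('Scree plot')
plt.show()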
Compared to the scree plot, the cumulative explained variance is a more precise measure for deciding the optimum number of PCs.
Insights drawn:
Deciding the optimum no. of PCs –
The eigenvalues are sorted in descending order, and their running sum (as a proportion of the total) gives the cumulative explained variance. The cumulative variance helps us decide the optimum number of principal components that capture most of the information from the original data (taking 80% as the threshold). Looking at the cumulative variance explained, the first 6 principal components capture about 80% of the significant information in the data. Therefore, we can proceed with 6 components and drop the remaining ones.
What do the eigenvectors indicate -
The eigenvectors are the principal component directions of the features in the dataset. Sorting the eigenvectors in descending order of their eigenvalues, the first eigenvector indicates the direction of largest spread in the data, the second eigenvector the second largest spread, and so forth.
2.9 Explain the business implication of using the Principal Component Analysis for this case
study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]
PCA helps us reduce the dimensionality of the data and also captures the correlations between the original variables.
Thus, by following the mechanics of PCA, we have concluded that the first 6 PCs carry the significant information about the data.
In figure 2.8.1, there is a distinct break at 3. However, 3 cannot be taken as the optimum number of PCs since the first three PCs explain only about 65% of the total variance. The PCs should be chosen so as to explain between 70% and 90% of the total variance (we are using 80% as the threshold value in this case).
Therefore, we settle on the first 6 principal components, so that the explained variance is above 80%.
First 6 PCs