Lecture 1. Part 1-Regression Analysis. Correlation and SLRM
[Figure: scatter plots of Y against X]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
Linear Correlation
[Figure: scatter plots of Y against X, with panels labeled "Strong relationships" and "Weak relationships"]
Linear Correlation
[Figure: scatter plot against X labeled "No relationship"]
Moral of the story: make a Scatter Plot, and look at it!
You may see a relationship that the calculation does not.
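A small illustration of this point, as a sketch in plain Python. The data are hypothetical (y is exactly x squared over a symmetric range of x), and `pearson_r` is a helper name introduced here, not something from the slides. The relationship is perfect but curved, so the linear correlation coefficient comes out as essentially zero even though a scatter plot would show a clear pattern.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y depends on x exactly, but through a curve, not a line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(f"r = {r_curve:.4f}")  # the linear measure misses the curved pattern
```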
"Correlation Is Not Causation"
A common saying is "Correlation Is Not Causation".
• What it really means is that a correlation does not prove one thing
causes the other:
• One thing might cause the other
• The other might cause the first to happen
• They may be linked by a different thing
• Or it could be random chance!
• There can be many reasons the data has a good correlation.
Example: Poor suburbs are more likely to have high pollution
Why?
• Do poor people make pollution?
• Are polluted suburbs the only place poor people can afford?
• Is it a common link, such as factories with low-paying jobs and lots of pollution?
Pearson Product-Moment Correlation
Data for Temperature °C (x) and Ice Cream Sales (y), with working columns x², y², and xy:
22 22
74 27
34 29
50 29
42 27
64 28
53 29
43 24
21 19
12 17
Coefficient (r): 0.761713
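The x², y², and xy columns are the ingredients of the usual computational form of the Pearson coefficient. The slide gives only the result, so the formula below is stated as the standard textbook form, with n assumed to be the number of (x, y) pairs:

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```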
Correlation
• Quantification of the relationship between two QUANTITATIVE variables
• -1 ≤ r ≤ 1
• (+) direct linear relationship
• (-) inverse linear relationship
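The sign convention can be checked with a minimal sketch. The two small data sets are hypothetical and perfectly linear, chosen so that r lands on the two boundary values, and `pearson_r` is an illustrative helper that uses the sums of x, y, x², y², and xy.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from the column totals of x, y, x^2, y^2, and xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

xs = [1, 2, 3, 4]
rising = [2, 4, 6, 8]   # hypothetical: y increases as x increases
falling = [8, 6, 4, 2]  # hypothetical: y decreases as x increases

r_direct = pearson_r(xs, rising)
r_inverse = pearson_r(xs, falling)
print(r_direct, r_inverse)  # positive for direct, negative for inverse
```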
Conclusion
There is a strong inverse linear relationship between water temperature and the decrease in pulse rate of children.
How to Perform Pearson Correlation (r) in Statistica
Open data in Statistica:
https://www.youtube.com/watch?v=uc_67xVZK8s
Reference: https://www.kaggle.com/datasets/grosvenpaul/family-income-and-expenditure/discussion
Note: This data has been trimmed for use as a tool for class discussion. The complete data is available upon request from PSA with the recommendation of a thesis adviser.
Observations on the Correlation of Total Income and Other Variables
What observations can you get from the data?
Which variable has the highest correlation with Total Income?
Which variables are not significantly related to Total Income?
Which variables have a direct linear relationship with Total Income?
Which variables have an inverse linear relationship with Total Income?
Simple Linear Regression Model (SLRM)
Introduction to Simple Linear Model
http://www.sthda.com/english/articles/40-regression-analysis/167-simple-linear-regression-in-r/
https://www.youtube.com/watch?v=owI7zxCqNY0
Simple linear regression is used to predict a quantitative outcome y on the basis of a single predictor variable x. The goal is to build a mathematical model (or formula) that defines y as a function of x.
Once we have built a statistically significant model, we can use it to predict future outcomes on the basis of new x values.
From the scatter plot, it can be seen that not all the data points fall exactly on the fitted regression line. Some of the points are above the blue line and some are below it. The vertical distance between a point and the line is its residual, and the residual sum of squares (RSS) is the sum of the squared residuals.
Mathematically, the beta coefficients (b0 and b1) are determined so that the RSS is as small as possible. This method of determining the beta coefficients is called least squares regression or ordinary least squares (OLS) regression.
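Written out, the quantity being minimized and the resulting coefficients take the following standard form. The notation is an assumption made here, since the slides do not define it: n observations (x_i, y_i), with sample means x-bar and y-bar.

```latex
\begin{align*}
\mathrm{RSS}(b_0, b_1) &= \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^{2} \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}} \\
b_0 &= \bar{y} - b_1 \bar{x}
\end{align*}
```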
Least Squares Method using Excel
https://www.youtube.com/watch?v=P8hT5nDai6A
Least Squares Method using Statistica
https://www.youtube.com/watch?v=VInW7mmxzOU&list=PLsY7hM6ZLBNOBT9RPYo0oeIuFezrDNYXu&index=7
Example 1.
Encode the table below, where the dependent variable is y and the
independent variable (predictor) is x.
x y
1 1.5
2 3.8
3 6.7
4 9.0
5 11.2
6 13.6
7 16
Based on the least squares method, the line of best fit to the data is
ŷ = −0.828571 + 2.414286x
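The fitted line for Example 1 can be cross-checked outside Statistica or Excel. The sketch below is plain Python with no libraries; `fit_line` is a helper name introduced here, and it applies the usual closed-form least squares formulas to the seven (x, y) pairs from the table.

```python
# Data from Example 1.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [1.5, 3.8, 6.7, 9.0, 11.2, 13.6, 16.0]

def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1

b0, b1 = fit_line(xs, ys)
print(f"y-hat = {b0:.6f} + {b1:.6f}x")
```

The printed coefficients agree with the intercept and slope reported for Example 1 to six decimal places.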
Prediction: Example
ŷ = −0.828571 + 2.414286x
−0.828571:
The EXPECTED value of y when the predictor, x, is 0 is −0.828571.
2.414286:
The EXPECTED increase in the value of y for every one-unit increase in x is 2.414286.
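Both interpretations can be read directly off the fitted line. In this short sketch, `predict` is an illustrative helper and the x values 4 and 5 are arbitrary choices: the prediction at x = 0 is the intercept, and moving x up by one unit changes the prediction by the slope.

```python
# Coefficients of the fitted line from Example 1.
B0 = -0.828571
B1 = 2.414286

def predict(x):
    """Expected value of y at a given x, according to the fitted line."""
    return B0 + B1 * x

at_zero = predict(0)             # equals the intercept
step = predict(5) - predict(4)   # equals the slope, up to rounding error
print(at_zero, round(step, 6))
```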
Model Validation using Coefficient of Determination (R-squared)
https://www.youtube.com/watch?v=TCtDXmvXDUc
https://www.youtube.com/watch?v=igIT6xzAH8s
R-squared is a measure of goodness of fit.
An R² closer to 1 indicates a better-fitting model.
An R² closer to 0 indicates a poorly fitting model.
Adjusted R² = 0.99876183
Interpretation: About 99.88% of the variation in y can be explained by x.
The model is significant if the p-value is less than 0.05 (p < 0.05).
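The adjusted R-squared reported for Example 1 can be reproduced with a short sketch. It assumes the standard definitions, which the slides do not spell out: R² = 1 − RSS/TSS, and adjusted R² = 1 − (1 − R²)(n − 1)/(n − 2) for a model with one predictor. The function name is illustrative.

```python
# Data from Example 1.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [1.5, 3.8, 6.7, 9.0, 11.2, 13.6, 16.0]

def r_squared_summary(xs, ys):
    """Return (R^2, adjusted R^2) for a simple linear regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    tss = sum((y - mean_y) ** 2 for y in ys)                     # total
    r2 = 1 - rss / tss
    adjusted = 1 - (1 - r2) * (n - 1) / (n - 2)  # one predictor
    return r2, adjusted

r2, adj_r2 = r_squared_summary(xs, ys)
print(f"R^2 = {r2:.6f}, adjusted R^2 = {adj_r2:.8f}")
```

The p-value is not computed here; it would need an F or t distribution, which is what the statistical software provides.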
Example 2. Least Squares Method using Statistica
Use the FIES data, where the dependent variable is Total Income and the independent variable (predictor) is Communication Expenditure.
Based on the least squares method, the line of best fit to the data is
ŷ = 133241.9 + 27.9x
Model Validation using Coefficient of Determination (R-squared)
Adjusted R² = 0.50428708
Interpretation: About 50.43% of the variation in total household income can be explained by communication expenditure.
ŷ = 133241.9 + 27.9x
Total Household Income = 133241.9 + 27.9 (Communication Expenditure)
The symbol ŷ means the value is an estimate only. The symbol y means the actual population measurement.
Prediction:
1. The EXPECTED total household income when the communication expenditure is 0 is Php 133,241.90.
2. The EXPECTED increase in total household income for every one-unit increase in communication expenditure is Php 27.90.
3. A household with an annual communication expenditure of 10,000 has an EXPECTED total household income of
Total Household Income = 133241.9 + 27.9(10000) = Php 412,241.90
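The three predictions follow from substituting into the fitted equation. As a minimal sketch, with `expected_income` as a hypothetical helper name and the coefficients taken from the Example 2 line of best fit:

```python
# Coefficients from the Example 2 line of best fit.
INTERCEPT = 133241.9   # expected income when communication expenditure is 0
SLOPE = 27.9           # expected change in income per unit of expenditure

def expected_income(communication_expenditure):
    """Expected total household income for a given communication expenditure."""
    return INTERCEPT + SLOPE * communication_expenditure

for spend in (0, 1, 10000):
    print(f"expenditure {spend:>6}: Php {expected_income(spend):,.2f}")
```

These are estimates from the fitted line, not observed incomes, and an adjusted R-squared of about 0.50 means individual households can differ substantially from the predicted value.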