
Lecture - Hoi Qui Don - DT - New - 8.5

This document discusses linear regression analysis using Python. It covers topics such as types of regression models, determining the simple linear regression equation, measures of variation, assumptions of regression, and performing linear regression on sample data to find the equation of the straight line that fits the data best.


DATA ANALYSIS WITH PYTHON
Lecture 8: Linear Regression

Chapter Topics

• Types of Regression Models
• Determining the Simple Linear Regression Equation
• Measures of Variation
• Assumptions of Regression and Correlation
• Residual Analysis
• Measuring Autocorrelation
• Inferences about the Slope

Purpose of Regression Analysis

• Regression Analysis is Used Primarily to Model Causality and Provide Prediction
  • Predict the values of a dependent (response) variable based on values of at least one independent (explanatory) variable
  • Explain the effect of the independent variables on the dependent variable
Types of Regression Models

[Figure: four scatter-plot panels: Positive Linear Relationship, Negative Linear Relationship, Relationship NOT Linear, No Relationship]

Simple Linear Regression Model

• Relationship between Variables is Described by a Linear Function
• The Change of One Variable Causes the Other Variable to Change
• A Dependency of One Variable on the Other

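Simple Linear Regression Model (continued)

The population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other.

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]

• \(Y_i\): dependent (response) variable
• \(X_i\): independent (explanatory) variable
• \(\beta_0\): population Y intercept
• \(\beta_1\): population slope coefficient
• \(\varepsilon_i\): random error

Population regression line (conditional mean):

\[ \mu_{Y|X} = \beta_0 + \beta_1 X_i \]

Simple Linear Regression Model (continued)

[Figure: Y plotted against X, showing the line \(\mu_{Y|X} = \beta_0 + \beta_1 X_i\) (conditional mean) with intercept \(\beta_0\) and slope \(\beta_1\), an observed value of Y, \(Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\), and its random error \(\varepsilon_i\)]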
Linear Regression Equation

The sample regression line provides an estimate of the population regression line as well as a predicted value of Y.

\[ Y_i = b_0 + b_1 X_i + e_i \]

• \(b_0\): sample Y intercept
• \(b_1\): sample slope coefficient
• \(e_i\): residual

Simple regression equation (fitted regression line, predicted value):

\[ \hat{Y} = b_0 + b_1 X \]

Linear Regression Equation (continued)

• \(b_0\) and \(b_1\) are obtained by finding the values of \(b_0\) and \(b_1\) that minimize the sum of the squared residuals

\[ \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2 \]

• \(b_0\) provides an estimate of \(\beta_0\)
• \(b_1\) provides an estimate of \(\beta_1\)
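
The minimization described above has a well-known closed-form solution for one explanatory variable: b1 = Sxy / Sxx and b0 = Ȳ − b1·X̄. The slides do not show these formulas, so the following is a minimal sketch under that assumption; the function names and the toy data are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx            # sample slope coefficient
    b0 = y_mean - b1 * x_mean   # sample Y intercept
    return b0, b1


def sum_squared_residuals(x, y, b0, b1):
    """Sum of e_i^2, where e_i = Y_i - (b0 + b1 * X_i)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data lying close to the line Y = 1 + 2X
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

b0, b1 = fit_simple_linear_regression(x, y)
best = sum_squared_residuals(x, y, b0, b1)
# Moving the coefficients away from the fitted values cannot lower the sum
worse = sum_squared_residuals(x, y, b0 + 0.1, b1 - 0.05)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, SSE at fit = {best:.4f}, SSE nearby = {worse:.4f}")
```

Comparing the fitted sum of squared residuals with the value at a nearby pair of coefficients is a quick sanity check that the fit really is the minimizer.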

Simple Linear Regression: Example

You wish to examine the linear dependency of the annual sales of produce stores on their sizes in square footage. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($1000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
© 2003 Prentice-Hall, Inc. Chap 13-12
Scatter Diagram: Example

[Figure: scatter diagram of Annual Sales ($000) against Square Feet for the 7 stores]

Simple Linear Regression Equation: Example

\[ \hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i \]

From Excel Printout:

              Coefficients
Intercept     1636.414726
X Variable    1.486633657
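
The Excel coefficients can be reproduced from the seven data points. The sketch below assumes NumPy is available and uses `numpy.polyfit` with degree 1, which performs an ordinary least-squares straight-line fit; the variable names are illustrative, not the lecture's.

```python
import numpy as np

# Produce store data from the example: size in square feet, annual sales in $1000
square_feet = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
annual_sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)

# np.polyfit returns coefficients from the highest power down: [slope, intercept]
b1, b0 = np.polyfit(square_feet, annual_sales, deg=1)

print(f"Intercept   b0 = {b0:.6f}")
print(f"X Variable  b1 = {b1:.9f}")
```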

Graph of the Simple Linear Regression Equation: Example

[Figure: the fitted line drawn through the scatter diagram of Annual Sales ($000) against Square Feet]

Interpretation of Results: Example

\[ \hat{Y}_i = 1636.415 + 1.487 X_i \]

The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units.

Because sales are recorded in thousands of dollars, the equation estimates that for each increase of 1 square foot in the size of the store, the expected annual sales are predicted to increase by $1,487.
Measures of Variation: The Sum of Squares

\[ SST = SSR + SSE \]

Total Sample Variability = Explained Variability + Unexplained Variability

Measures of Variation: The Sum of Squares (continued)

• SST = Total Sum of Squares
  • Measures the variation of the \(Y_i\) values around their mean, \(\bar{Y}\)
• SSR = Regression Sum of Squares
  • Explained variation attributable to the relationship between X and Y
• SSE = Error Sum of Squares
  • Variation attributable to factors other than the relationship between X and Y
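
As a rough illustration of the decomposition SST = SSR + SSE, the sketch below computes the three sums of squares for the produce store data. It assumes the closed-form least-squares coefficients (b1 = Sxy / Sxx, b0 = Ȳ − b1·X̄), which the slides do not state explicitly; variable names are illustrative.

```python
# Produce store data from the example (annual sales in $1000)
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(square_feet)
x_mean = sum(square_feet) / n
y_mean = sum(annual_sales) / n

# Least-squares coefficients (closed form, assumed here)
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(square_feet, annual_sales))
s_xx = sum((x - x_mean) ** 2 for x in square_feet)
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean
fitted = [b0 + b1 * x for x in square_feet]

# Total, explained (regression) and unexplained (error) variation
sst = sum((y - y_mean) ** 2 for y in annual_sales)
ssr = sum((y_hat - y_mean) ** 2 for y_hat in fitted)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(annual_sales, fitted))

print(f"SST = {sst:.2f}")
print(f"SSR = {ssr:.2f}")
print(f"SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
```

The printed values can be compared with the SS column of the lecture's Excel ANOVA output for the same data.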

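Measures of Variation: The Sum of Squares (continued)

[Figure: for an observation at \(X_i\), the vertical distances between the observed value \(Y_i\), the fitted value \(\hat{Y}_i\) and the mean \(\bar{Y}\)]

\[ SST = \sum (Y_i - \bar{Y})^2 \]
\[ SSR = \sum (\hat{Y}_i - \bar{Y})^2 \]
\[ SSE = \sum (Y_i - \hat{Y}_i)^2 \]

The ANOVA Table

ANOVA
              df        SS     MS                    F          Significance F
Regression    k         SSR    MSR = SSR/k           MSR/MSE    P-value of the F Test
Residuals     n-k-1     SSE    MSE = SSE/(n-k-1)
Total         n-1       SST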
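
The ANOVA table layout translates directly into a few lines of arithmetic. The following is a small sketch, not the lecture's code: `anova_table` is a hypothetical helper, and the sums of squares passed to it are the SSR and SSE values reported in the lecture's Excel output for the produce stores (n = 7 observations, k = 1 explanatory variable).

```python
def anova_table(ssr, sse, n, k=1):
    """Build the ANOVA rows from the regression and error sums of squares."""
    df_regression = k
    df_residuals = n - k - 1
    msr = ssr / df_regression   # MSR = SSR / k
    mse = sse / df_residuals    # MSE = SSE / (n - k - 1)
    return {
        "Regression": {"df": df_regression, "SS": ssr, "MS": msr, "F": msr / mse},
        "Residuals": {"df": df_residuals, "SS": sse, "MS": mse},
        "Total": {"df": n - 1, "SS": ssr + sse},
    }


# SSR and SSE as reported in the Excel printout for the produce stores
table = anova_table(ssr=30380456.12, sse=1871199.595, n=7, k=1)
for source, row in table.items():
    print(source, row)
```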
The Coefficient of Determination

\[ r^2 = \frac{SSR}{SST} = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}} \]

• Measures the proportion of variation in Y that is explained by the independent variable X in the regression model

Coefficients of Determination (\(r^2\)) and Correlation (\(r\))

[Figure: four panels, each showing data around the fitted line \(\hat{Y}_i = b_0 + b_1 X_i\): \(r^2 = 1, r = +1\); \(r^2 = 1, r = -1\); \(r^2 = .81, r = +0.9\); \(r^2 = 0, r = 0\)]
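
To connect the two quantities, the sketch below computes the correlation coefficient r for the produce store data with `numpy.corrcoef` and checks that its square agrees with SSR/SST. NumPy is an assumption here, and the variable names are illustrative.

```python
import numpy as np

# Produce store data from the example (annual sales in $1000)
square_feet = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
annual_sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)

# Correlation coefficient r is the off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(square_feet, annual_sales)[0, 1]
r_squared = r ** 2

# Cross-check: r^2 should equal SSR / SST for the fitted line
b1, b0 = np.polyfit(square_feet, annual_sales, deg=1)
fitted = b0 + b1 * square_feet
ssr = np.sum((fitted - annual_sales.mean()) ** 2)
sst = np.sum((annual_sales - annual_sales.mean()) ** 2)

print(f"r = {r:.7f}")
print(f"r^2 = {r_squared:.8f}")
print(f"SSR / SST = {ssr / sst:.8f}")
```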

Measures of Variation: Standard Error of Estimate

\[ S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}} \]

• Measures the standard deviation (variation) of the Y values around the regression equation

Measures of Variation: Produce Store Example

Excel Output for Produce Stores

Regression Statistics
Multiple R           0.9705572
R Square             0.94198129
Adjusted R Square    0.93037754
Standard Error       611.751517
Observations         7

In this output, R Square is \(r^2 = .94\), Observations is \(n\), and Standard Error is \(S_{YX}\).

94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage.
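
The Standard Error line of the Excel output can be checked against the formula for the standard error of estimate. This is a minimal sketch with a hypothetical helper name, assuming the closed-form least-squares coefficients for the fitted line.

```python
import math


def standard_error_of_estimate(x, y):
    """S_YX = sqrt(SSE / (n - 2)) for a simple linear regression of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    # Error sum of squares: squared distances from the fitted line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))


# Produce store data from the example (annual sales in $1000)
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

s_yx = standard_error_of_estimate(square_feet, annual_sales)
print(f"S_YX = {s_yx:.6f}")
```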
Linear Regression Assumptions

• Normality
  • Y values are normally distributed for each X
  • Probability distribution of error is normal
• Homoscedasticity (Constant Variance)
• Independence of Errors

Inference about the Slope: t Test

• t Test for a Population Slope
  • Is there a linear dependency of Y on X?
• Null and Alternative Hypotheses
  • H0: β1 = 0 (no linear dependency)
  • H1: β1 ≠ 0 (linear dependency)
• Test Statistic/P-value

Example: Produce Store

Data for 7 Stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Estimated Regression Equation:

\[ \hat{Y}_i = 1636.415 + 1.487 X_i \]

The slope of this model is 1.487. Does square footage affect annual sales?

Inferences about the Slope: t Test Example

H0: β1 = 0
H1: β1 ≠ 0
α = .05
df = 7 - 2 = 5
Critical Value(s): -2.5706 and 2.5706 (rejection region of .025 in each tail)

Test Statistic:

             Coefficients   Standard Error   t Stat    P-value
Intercept    1636.4147      451.4953         3.6244    0.01515
Footage      1.4866         0.1650           9.0099    0.00028

In the Footage row, the coefficient is \(b_1\), the standard error is \(S_{b_1}\), and the t Stat is \(t\).

Decision: Reject H0.
Conclusion: There is evidence that square footage affects annual sales.
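
The Footage row of the printout can be reproduced step by step. The sketch below assumes the standard formulas S_b1 = S_YX / sqrt(Sxx) and t = b1 / S_b1, which the slides label but do not spell out, and uses SciPy's t distribution for the critical value and two-tail p-value; variable names are illustrative.

```python
import math
from scipy import stats

# Produce store data from the example (annual sales in $1000)
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

n = len(square_feet)
x_mean = sum(square_feet) / n
y_mean = sum(annual_sales) / n
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(square_feet, annual_sales))
s_xx = sum((x - x_mean) ** 2 for x in square_feet)
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean

# Standard error of estimate, then standard error of the slope (assumed formula)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / (n - 2))
s_b1 = s_yx / math.sqrt(s_xx)

# t statistic for H0: beta1 = 0, with n - 2 degrees of freedom
df = n - 2
alpha = 0.05
t_stat = b1 / s_b1
t_crit = stats.t.ppf(1 - alpha / 2, df)      # upper two-tail critical value
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-tail p-value
reject_h0 = abs(t_stat) > t_crit

print(f"S_b1 = {s_b1:.4f}, t = {t_stat:.4f}")
print(f"critical value = {t_crit:.4f}, p-value = {p_value:.5f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```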
Inferences about the Slope: F Test

• F Test for a Population Slope
  • Is there a linear dependency of Y on X?
• Null and Alternative Hypotheses
  • H0: β1 = 0 (no linear dependency)
  • H1: β1 ≠ 0 (linear dependency)
• Test Statistic / P-value

Inferences about the Slope: Confidence Interval Example

Confidence Interval Estimate of the Slope:

\[ b_1 \pm t_{n-2} S_{b_1} \]

Excel Printout for Produce Stores

             Lower 95%      Upper 95%
Intercept    475.810926     2797.01853
Footage      1.06249037     1.91077694

At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911). It does not include 0.

Conclusion: There is a significant linear dependency of annual sales on the size of the store.
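
The confidence interval estimate b1 ± t(n−2)·S_b1 is simple arithmetic once the pieces are known. The sketch below plugs in the rounded values from the lecture's printouts (coefficients, standard errors, and the critical value 2.5706 for 5 degrees of freedom), so the results match the Excel limits only up to rounding; `confidence_interval` is a hypothetical helper.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) limits of estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin


# Two-tail critical value for alpha = .05 and df = 7 - 2 = 5, as given in the lecture
T_CRITICAL = 2.5706

# Rounded coefficients and standard errors from the Excel printout
slope_ci = confidence_interval(1.4866, 0.1650, T_CRITICAL)
intercept_ci = confidence_interval(1636.4147, 451.4953, T_CRITICAL)

# The slope is significant at the 5% level if its interval excludes 0
slope_significant = not (slope_ci[0] <= 0 <= slope_ci[1])

print(f"Slope 95% CI:     ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
print(f"Intercept 95% CI: ({intercept_ci[0]:.2f}, {intercept_ci[1]:.2f})")
print(f"Slope interval excludes 0: {slope_significant}")
```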

Relationship between a t Test and an F Test

• Null and Alternative Hypotheses
  • H0: β1 = 0 (no linear dependency)
  • H1: β1 ≠ 0 (linear dependency)

\[ (t_{n-2})^2 = F_{1,n-2} \]

• The p-value of a t Test and the p-value of an F Test are Exactly the Same
• The Rejection Region of an F Test is Always in the Upper Tail

Inferences about the Slope: F Test Example

H0: β1 = 0
H1: β1 ≠ 0
α = .05
numerator df = 1
denominator df = 7 - 2 = 5
Critical Value: \(F_{1,n-2} = 6.61\) (rejection region of α = .05 in the upper tail)

Test Statistic:

ANOVA
              df    SS             MS             F         Significance F
Regression    1     30380456.12    30380456.12    81.179    0.000281
Residual      5     1871199.595    374239.919
Total         6     32251655.71

Significance F is the p-value.

Decision: Reject H0.
Conclusion: There is evidence that square footage affects annual sales.
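
The relationship between the two tests can be checked numerically. The sketch below assumes SciPy and starts from the t statistic reported in the lecture (9.0099, with 5 degrees of freedom): squaring it gives the F statistic up to rounding, the squared t critical value equals the F critical value, and the two p-values coincide.

```python
from scipy import stats

alpha = 0.05
df_num, df_den = 1, 5      # numerator df = 1, denominator df = n - 2 = 5
t_stat = 9.0099            # t Stat for the slope, from the Excel printout

# Squaring the t statistic gives the F statistic (up to rounding of the inputs)
f_from_t = t_stat ** 2

# Critical values: two-tail for t, upper tail only for F
t_crit = stats.t.ppf(1 - alpha / 2, df_den)
f_crit = stats.f.ppf(1 - alpha, df_num, df_den)

# p-values: two tails for the t test, upper tail for the F test
p_t = 2 * stats.t.sf(t_stat, df_den)
p_f = stats.f.sf(f_from_t, df_num, df_den)

print(f"t^2 = {f_from_t:.3f}")
print(f"t critical squared = {t_crit ** 2:.4f}, F critical = {f_crit:.4f}")
print(f"p-value (t test) = {p_t:.6f}, p-value (F test) = {p_f:.6f}")
print("Reject H0" if f_from_t > f_crit else "Do not reject H0")
```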
Implementation in Python

[Code listings: five slides shown as images]
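
As one possible way to carry out the produce store analysis in Python, the sketch below uses `scipy.stats.linregress`. This is an assumption about tooling, not the lecture's own code, and the 4,000 square foot store used for the prediction is a hypothetical value chosen for illustration.

```python
from scipy import stats

# Produce store data from the example (annual sales in $1000)
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

# Ordinary least-squares fit of annual sales on square footage
result = stats.linregress(square_feet, annual_sales)

print(f"Intercept b0        = {result.intercept:.4f}")
print(f"Slope b1            = {result.slope:.6f}")
print(f"R Square            = {result.rvalue ** 2:.8f}")
print(f"Std. error of slope = {result.stderr:.4f}")
print(f"p-value for slope   = {result.pvalue:.6f}")

# Predicted annual sales (in $1000) for a hypothetical 4,000 square foot store
new_size = 4000
predicted_sales = result.intercept + result.slope * new_size
print(f"Predicted sales for {new_size} sq ft: {predicted_sales:.2f}")
```

`linregress` reports the two-sided p-value for the null hypothesis that the slope is zero, which corresponds to the t test on the slope discussed in the lecture.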
Chapter Summary

• Introduced Types of Regression Models
• Discussed Determining the Simple Linear Regression Equation
• Described Measures of Variation
• Next: Multiple regression model
