DATA ANALYSIS WITH PYTHON
Lecture 8: Linear Regression
Chapter Topics
- Types of Regression Models
- Determining the Simple Linear Regression Equation
- Measures of Variation
- Assumptions of Regression and Correlation
- Residual Analysis
- Measuring Autocorrelation
- Inferences about the Slope

Purpose of Regression Analysis
- Regression analysis is used primarily to model causality and provide prediction
- Predict the values of a dependent (response) variable based on the values of at least one independent (explanatory) variable
- Explain the effect of the independent variables on the dependent variable
Types of Regression Models
(Scatter-plot panels: Positive Linear Relationship, Negative Linear Relationship, Relationship NOT Linear, No Relationship)

Simple Linear Regression Model
- The relationship between the variables is described by a linear function
- The change of one variable causes the other variable to change
- A dependency of one variable on the other
Simple Linear Regression Model (continued)
- The population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other
- Population model: Yi = β0 + β1·Xi + εi
  - β0 = population Y intercept
  - β1 = population slope coefficient
  - εi = random error
  - Yi = dependent (response) variable
  - Xi = independent (explanatory) variable

Simple Linear Regression Model (continued)
- Observed value of Y: Yi = β0 + β1·Xi + εi
- Conditional mean: μY|X = β0 + β1·Xi
(Figure: observed Y values scattered around the population regression line, each differing from the conditional mean by the random error εi)
Linear Regression Equation
- The sample regression line provides an estimate of the population regression line as well as a predicted value of Y
- Sample model: Yi = b0 + b1·Xi + ei
  - b0 = sample Y intercept
  - b1 = sample slope coefficient
  - ei = residual
- Simple regression equation (fitted regression line, predicted value): Ŷ = b0 + b1·X

Linear Regression Equation (continued)
- b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of squared residuals:
  Σ(Yi − Ŷi)² = Σei² , summed over i = 1, …, n
- b0 provides an estimate of β0
- b1 provides an estimate of β1
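A minimal Python sketch of this least-squares computation (assuming numpy is available; the function name fit_simple_linear_regression is illustrative, not from the original slides):

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (b0, b1) that minimize the sum of squared residuals Σ(Yi - Ŷi)²."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
    b0 = y.mean() - b1 * x.mean()                                               # intercept
    return b0, b1
```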
Simple Linear Regression: Example
- You wish to examine the linear dependency of the annual sales of produce stores on their size in square footage.
- Sample data for 7 stores were obtained.
- Find the equation of the straight line that fits the data best.

  Store   Square Feet   Annual Sales ($1000)
  1       1,726         3,681
  2       1,542         3,395
  3       2,816         6,653
  4       5,555         9,543
  5       1,292         3,318
  6       2,208         5,563
  7       1,313         3,760
Scatter Diagram: Example
(Scatter plot of annual sales ($000) against square feet for the 7 stores)

Simple Linear Regression Equation: Example
- Ŷi = b0 + b1·Xi = 1636.415 + 1.487·Xi
- From the Excel printout:

               Coefficients
  Intercept    1636.414726
  X Variable   1.486633657

Graph of the Simple Linear Regression Equation: Example
(Scatter plot of annual sales ($000) against square feet with the fitted line Ŷi = 1636.415 + 1.487·Xi overlaid)

Interpretation of Results: Example
- The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units.
- The equation estimates that for each increase of 1 square foot in the size of the store, expected annual sales are predicted to increase by $1,487.
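A hedged sketch of how the coefficients above could be reproduced in Python, assuming scipy is available (scipy.stats.linregress is one of several ways to fit a simple linear regression):

```python
from scipy import stats

square_feet  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000

fit = stats.linregress(square_feet, annual_sales)
print(f"Intercept b0 = {fit.intercept:.3f}")   # about 1636.415
print(f"Slope     b1 = {fit.slope:.3f}")       # about 1.487
```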
Measures of Variation: The Sum of Squares
- SST = SSR + SSE
  Total sample variability = Explained variability + Unexplained variability

Measures of Variation: The Sum of Squares (continued)
- SST = total sum of squares
  Measures the variation of the Yi values around their mean, Ȳ
- SSR = regression sum of squares
  Explained variation attributable to the relationship between X and Y
- SSE = error sum of squares
  Variation attributable to factors other than the relationship between X and Y

Measures of Variation: The Sum of Squares (continued)
- SST = Σ(Yi − Ȳ)²
- SSR = Σ(Ŷi − Ȳ)²
- SSE = Σ(Yi − Ŷi)²
(Figure: each Yi's deviation from Ȳ split into the explained part Ŷi − Ȳ and the residual Yi − Ŷi)

The ANOVA Table

  Source       df          SS    MS                    F         Significance F
  Regression   k           SSR   MSR = SSR/k           MSR/MSE   p-value of the F test
  Residuals    n − k − 1   SSE   MSE = SSE/(n−k−1)
  Total        n − 1       SST
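A minimal sketch, assuming numpy, of the SST = SSR + SSE decomposition for the produce-store data (np.polyfit is used here only to obtain the least-squares coefficients):

```python
import numpy as np

square_feet  = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
annual_sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)

b1, b0 = np.polyfit(square_feet, annual_sales, 1)   # least-squares slope, intercept
y_hat = b0 + b1 * square_feet                        # fitted values Ŷi

sst = np.sum((annual_sales - annual_sales.mean()) ** 2)   # total variation
ssr = np.sum((y_hat - annual_sales.mean()) ** 2)          # explained variation
sse = np.sum((annual_sales - y_hat) ** 2)                 # unexplained variation
print(sst, ssr + sse)                                     # SST = SSR + SSE (up to rounding)
```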
The Coefficient of Determination
- r² = SSR / SST = regression sum of squares / total sum of squares
- Measures the proportion of variation in Y that is explained by the independent variable X in the regression model

Coefficients of Determination (r²) and Correlation (r)
(Four scatter-plot panels showing the fitted line Ŷi = b0 + b1·Xi for: r² = 1, r = +1; r² = 1, r = −1; r² = .81, r = +0.9; r² = 0, r = 0)
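For simple linear regression, r² = SSR/SST equals the squared sample correlation between X and Y, so a quick check in Python (assuming numpy is available) looks like:

```python
import numpy as np

square_feet  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]

r = np.corrcoef(square_feet, annual_sales)[0, 1]   # sample correlation coefficient
print(f"r = {r:.4f}, r^2 = {r**2:.4f}")            # r ≈ 0.97, r^2 ≈ 0.94
```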
Measures of Variation: Standard Error of Estimate
- S_YX = √( SSE / (n − 2) ) = √( Σ(Yi − Ŷi)² / (n − 2) ) , summed over i = 1, …, n
- Measures the standard deviation (variation) of the Y values around the regression equation

Produce Store Example
- Excel output for the produce stores:

  Regression Statistics
  Multiple R           0.9705572
  R Square             0.94198129
  Adjusted R Square    0.93037754
  Standard Error       611.751517
  Observations         7

- r² = .94: 94% of the variation in annual sales can be explained by the variability in the size of the store, as measured by square footage.
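A sketch, assuming numpy, of computing the standard error of the estimate S_YX for the produce-store data; it should roughly match the Standard Error of 611.75 in the Excel output above:

```python
import numpy as np

square_feet  = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
annual_sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n = len(square_feet)

b1, b0 = np.polyfit(square_feet, annual_sales, 1)
residuals = annual_sales - (b0 + b1 * square_feet)          # Yi - Ŷi

s_yx = np.sqrt(np.sum(residuals ** 2) / (n - 2))            # standard error of the estimate
print(f"S_YX = {s_yx:.2f}")                                  # about 611.75
```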
Linear Regression Assumptions
- Normality
  - Y values are normally distributed for each X
  - The probability distribution of the errors is normal
- Homoscedasticity (constant variance)
- Independence of errors

Inference about the Slope: t Test
- t test for a population slope: is there a linear dependency of Y on X?
- Null and alternative hypotheses:
  H0: β1 = 0 (no linear dependency)
  H1: β1 ≠ 0 (linear dependency)
- Test statistic and p-value: t = b1 / S_b1 with n − 2 degrees of freedom
Inferences about the Slope: t Test Example
- Data for 7 stores; estimated regression equation: Ŷi = 1636.415 + 1.487·Xi
- The slope of this model is 1.487. Does square footage affect annual sales?
- H0: β1 = 0, H1: β1 ≠ 0, α = .05, df = 7 − 2 = 5, critical values ±2.5706
- Excel output:

               Coefficients   Standard Error   t Stat    P-value
  Intercept    1636.4147      451.4953         3.6244    0.01515
  Footage      1.4866         0.1650           9.0099    0.00028

- Decision: reject H0 (t = 9.0099 lies in the rejection region; p-value = 0.00028 < α).
- Conclusion: there is evidence that square footage affects annual sales.
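A sketch of the slope t test in Python, assuming numpy and scipy; the values should roughly match the Excel output above:

```python
import numpy as np
from scipy import stats

square_feet  = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
annual_sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n = len(square_feet)

b1, b0 = np.polyfit(square_feet, annual_sales, 1)
residuals = annual_sales - (b0 + b1 * square_feet)
s_yx = np.sqrt(np.sum(residuals ** 2) / (n - 2))

s_b1 = s_yx / np.sqrt(np.sum((square_feet - square_feet.mean()) ** 2))  # std error of b1
t_stat = b1 / s_b1                                                      # tests H0: β1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.4f}, p-value = {p_value:.5f}")   # about 9.01 and 0.00028
```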
Inferences about the Slope: Confidence Interval Example
- Confidence interval estimate of the slope: b1 ± t(n−2) · S_b1
- Excel printout for produce stores:

               Lower 95%      Upper 95%
  Intercept    475.810926     2797.01853
  Footage      1.06249037     1.91077694

- At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911), which does not include 0 (a Python sketch of this interval appears after the F-test slide below).
- Conclusion: there is a significant linear dependency of annual sales on the size of the store.

Inferences about the Slope: F Test
- F test for a population slope: is there a linear dependency of Y on X?
- Null and alternative hypotheses:
  H0: β1 = 0 (no linear dependency)
  H1: β1 ≠ 0 (linear dependency)
- Test statistic and p-value: F = MSR / MSE with 1 and n − 2 degrees of freedom
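A sketch of the 95% confidence interval for the slope from the confidence-interval slide above, assuming scipy (linregress reports the slope's standard error as stderr):

```python
import numpy as np
from scipy import stats

square_feet  = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(square_feet)

fit = stats.linregress(square_feet, annual_sales)    # slope, intercept, stderr, ...
t_crit = stats.t.ppf(0.975, df=n - 2)                # 2.5706 for df = 5
lower = fit.slope - t_crit * fit.stderr
upper = fit.slope + t_crit * fit.stderr
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")   # about (1.062, 1.911)
```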
Relationship between a t Test and an F Test
- Null and alternative hypotheses:
  H0: β1 = 0 (no linear dependency)
  H1: β1 ≠ 0 (linear dependency)
- (t(n−2))² = F(1, n−2)
- The p-value of a t test and the p-value of an F test are exactly the same
- The rejection region of an F test is always in the upper tail

Inferences about the Slope: F Test Example
- H0: β1 = 0, H1: β1 ≠ 0, α = .05
- Numerator df = 1, denominator df = 7 − 2 = 5; critical value F(1,5) = 6.61
- ANOVA output:

  Source       df   SS            MS            F        Significance F
  Regression   1    30380456.12   30380456.12   81.179   0.000281
  Residual     5    1871199.595   374239.919
  Total        6    32251655.71

- Decision: reject H0 (F = 81.179 > 6.61; p-value = 0.000281 < α).
- Conclusion: there is evidence that square footage affects annual sales.
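A sketch of the F test in Python, assuming numpy and scipy; for simple regression the F statistic equals the square of the slope's t statistic:

```python
import numpy as np
from scipy import stats

square_feet  = np.array([1726, 1542, 2816, 5555, 1292, 2208, 1313], dtype=float)
annual_sales = np.array([3681, 3395, 6653, 9543, 3318, 5563, 3760], dtype=float)
n, k = len(square_feet), 1                               # one explanatory variable

b1, b0 = np.polyfit(square_feet, annual_sales, 1)
y_hat = b0 + b1 * square_feet
ssr = np.sum((y_hat - annual_sales.mean()) ** 2)
sse = np.sum((annual_sales - y_hat) ** 2)

f_stat = (ssr / k) / (sse / (n - k - 1))                 # MSR / MSE, about 81.18
p_value = stats.f.sf(f_stat, dfn=k, dfd=n - k - 1)       # same p-value as the t test
print(f"F = {f_stat:.3f}, p-value = {p_value:.6f}")
```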
Implementation by Python
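As a hedged sketch of what an implementation might look like (the choice of pandas and statsmodels is an assumption, not necessarily what the original slides used), a single OLS fit reproduces the coefficients, r², standard errors, t and F tests, and confidence intervals discussed above:

```python
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "square_feet":  [1726, 1542, 2816, 5555, 1292, 2208, 1313],
    "annual_sales": [3681, 3395, 6653, 9543, 3318, 5563, 3760],   # in $1000
})

X = sm.add_constant(data["square_feet"])           # adds the intercept column
model = sm.OLS(data["annual_sales"], X).fit()      # ordinary least squares
print(model.summary())                             # full regression report
```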
Chapter Summary
- Introduced Types of Regression Models
- Discussed Determining the Simple Linear Regression Equation
- Described Measures of Variation
- Next: Multiple Regression Model