Vertical distance between a data point and the best-fitted line = residual (residual = y − ŷ)
When the correlation is not zero, we can use x to estimate the value of y
ANOVA table → breaks down the total variability of the data around the regression model
Deviation of each observation from the mean of the data: (yi − ȳ)
SST (total sum of squares) → Σ(yi − ȳ)^2
SSR (regression sum of squares, explained) → Σ(ŷi − ȳ)^2
SSE (sum of squared errors, unexplained) → Σ(yi − ŷi)^2
SST = SSR + SSE (total sum of squares = regression sum of squares + sum of squared errors)
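A minimal numeric sketch of this decomposition (Python with numpy is my assumption; the toy x/y data are made up for illustration): fit a least-squares line and check that SST = SSR + SSE.

```python
# Minimal sketch (assumed toy data): verify SST = SSR + SSE for a fitted line.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)         # least-squares slope and intercept
y_hat = b0 + b1 * x                  # fitted values
y_bar = y.mean()

SST = np.sum((y - y_bar) ** 2)       # total sum of squares
SSR = np.sum((y_hat - y_bar) ** 2)   # regression (explained) sum of squares
SSE = np.sum((y - y_hat) ** 2)       # error (unexplained) sum of squares

print(SST, SSR + SSE)                # the two should match up to rounding
```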
The least squares method finds the line that minimises SSE
ANOVA table → how well the regression model fits our observed data
Measures of fit of our model:
Approach 1: standard error of estimate (SEE; also called the SD of errors, s)
s = SEE = sqrt(MSE) = sqrt(SSE/(n − 2))
Approach 2: coefficient of determination (R^2)
R^2=1-SSE/Syy=(Syy-SSE)/Syy
= (reduction in the sum of squared errors due to using x) / (sum of squared errors when using ŷ = ȳ)
F stat = MSR/MSE (in SLR, MSR = SSR/1 and MSE = SSE/(n − 2))
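A minimal sketch computing these three measures of fit on the same assumed toy data as above (numpy assumed):

```python
# Minimal sketch (assumed toy data): compute SEE, R^2 and the F statistic.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SSE = np.sum((y - y_hat) ** 2)
Syy = np.sum((y - y.mean()) ** 2)   # = SST
SSR = Syy - SSE

MSE = SSE / (n - 2)                 # error mean square
SEE = np.sqrt(MSE)                  # standard error of estimate, s
R2 = 1 - SSE / Syy                  # coefficient of determination
F = (SSR / 1) / MSE                 # MSR/MSE; MSR has 1 df in SLR

print(SEE, R2, F)
```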
Indications
If the observed data are close to the regression line → SSE is low → SEE is also low
R^2 is high → if R^2 is high and SEE is low, that is a good indication the model is a good fit (high confidence in the estimates)
If the observed data are far from the regression line → SEE is high while R^2 is low → the regression is a poor fit
When R^2 = 1, SSE must equal 0, i.e. all the points fall exactly on a straight line
Simple linear regression model and its properties
→ estimate of slope = b1 = r·(sy/sx) = Sxy/Sxx, where r = Cov(x,y)/(sx·sy) and Cov(x,y) = Σ(xi − x̄)(yi − ȳ)/(n − 1)
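A minimal sketch (assumed toy data, numpy assumed) showing that the three routes to the slope estimate agree:

```python
# Minimal sketch: b1 = r * (sy/sx) = Sxy/Sxx, with r = Cov(x,y)/(sx*sy).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

sx, sy = x.std(ddof=1), y.std(ddof=1)                         # sample SDs
r = np.corrcoef(x, y)[0, 1]                                   # sample correlation
cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)  # Cov(x, y)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)

# All three expressions give the same slope estimate b1:
print(r * sy / sx, Sxy / Sxx, (cov / (sx * sy)) * sy / sx)
```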
Test for the population coefficient of correlation: F stat = MSR/MSE (in SLR, the F test of H0: ρ = 0 is equivalent to the two-sided t-test on the slope)
Residuals
The statistical model for SLR assumes that, for each value of x, the value of y is normally distributed with some mean that depends linearly on x, and an SD that does not depend on x: the SD is constant and the same for all values of x.
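A minimal simulation sketch of these assumptions (the parameter values b0, b1, sigma are my made-up choices): y given x is Normal with a linear mean and one common SD.

```python
# Minimal sketch: simulate data satisfying the SLR model assumptions,
# y | x ~ Normal(b0 + b1*x, sigma) with the same sigma for every x.
import numpy as np

rng = np.random.default_rng(0)
b0, b1, sigma = 1.0, 2.0, 0.5                          # assumed true parameters
x = np.linspace(0, 10, 50)
y = b0 + b1 * x + rng.normal(0.0, sigma, size=x.size)  # constant-SD noise

resid = y - (b0 + b1 * x)                  # errors around the true line
print(resid.mean(), resid.std(ddof=1))     # mean ≈ 0, SD ≈ sigma
```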
Inference on the slope coefficient (hypothesis testing)
SE(b1) = s/sqrt(Sxx) (estimated standard error for β1; given in the summary output as "SE Coef")
se = s = standard error of estimate = sqrt(MSE) = sqrt(SSE/(n − 2)) (standard error of regression)
ii) Hypothesis test for the slope 𝑏1 and the intercept 𝑏0
- 2-sided test: H0: bi = b* vs Ha: bi ≠ b*, for any hypothesized value b*
→ Observed test statistic (t-stat): t = (bi − b*)/SE(bi) (~ T(n−2) under H0)
→ Reject 𝐻0 if |t| > 𝑡α/2,𝑛−2 or p-value= 2P(𝑇𝑛−2≥|t|) < α
- 1-sided test:
→ Reject H0 if t > tα,n−2 or P(T(n−2) ≥ t) < α for Ha: bi > b*; reject H0 if t < −tα,n−2 or P(T(n−2) ≤ t) < α for Ha: bi < b*
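A minimal sketch of the two-sided slope test (assumed toy data; numpy/scipy assumed), including the SE(b1) = s/sqrt(Sxx) step from above:

```python
# Minimal sketch (assumed toy data): t-test of H0: b1 = 0 vs Ha: b1 != 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
s = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))   # standard error of regression
Sxx = np.sum((x - x.mean()) ** 2)
SE_b1 = s / np.sqrt(Sxx)                          # SE of the slope estimate

t_stat = (b1 - 0) / SE_b1                         # hypothesized value b* = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
print(t_stat, p_value, p_value < 0.05)            # reject H0 if p < alpha
```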
Inferences in SLR: reject the claim (i.e., H0 below) that a parameter (b0 or b1) in SLR equals some value b*, with a 5% chance of committing a Type I error:
H0: bi = b* vs Ha: bi ≠ b*
Reject H0 (the claimed value) if any of the following holds:
1) The absolute value of the t-statistic is larger than 𝑡0.025,𝑛−2≈2;
2) The p-value computed from the t-statistic is less than 0.05; or
3) b* lies outside the 95% CI for the parameter bi
Even if the errors are not normally distributed, we can still apply these hypothesis tests when n > 30 (by the central limit theorem)
Inference with confidence intervals
sx = sample SD of x; s = se = sqrt(MSE) = sqrt(SSE/(n − 2)) (standard error of regression)
Coefficient of determination (R^2)
R^2 = [corr(X,Y)]^2 = SSR/(SSR + SSE) = (Syy − SSE)/Syy = 1 − SSE/SST
(Syy = Σ(yi − ȳ)^2 = SST; note Syy = sy^2 · (n − 1))
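A minimal check (assumed toy data, numpy assumed) that R^2 computed from the sums of squares equals the squared sample correlation:

```python
# Minimal sketch: R^2 = 1 - SSE/Syy equals [corr(X, Y)]^2 in SLR.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)
SSE = np.sum((y - (b0 + b1 * x)) ** 2)
Syy = np.sum((y - y.mean()) ** 2)

print(1 - SSE / Syy, np.corrcoef(x, y)[0, 1] ** 2)  # same value both ways
```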
95% C.I. for the parameter bi in SLR: bi ± t0.025,n−2 · SE(bi)
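A minimal sketch of this interval for the slope (assumed toy data; numpy/scipy assumed):

```python
# Minimal sketch (assumed toy data): 95% CI for the slope, b1 ± t_{0.025,n-2}*SE(b1).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
SE_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

t_crit = stats.t.ppf(0.975, df=n - 2)            # t_{0.025, n-2}
print(b1 - t_crit * SE_b1, b1 + t_crit * SE_b1)  # lower and upper bounds
```

Checking whether b* falls inside this interval is the third rejection criterion listed above.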
How to explain whether a prediction is reliable
1. If R^2 is 30%, then 70% of the variation in y remains unexplained.
2. Check the standard error of estimate (SEE, the SD of errors, s): s = SEE = sqrt(MSE) = sqrt(SSE/(n − 2)); a large SEE means the individual predictions are imprecise.
Prediction interval → an interval for a new observation of y at a given x (wider than the CI for the mean response)
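A minimal sketch of a 95% prediction interval (assumed toy data, and x0 = 3.5 is my made-up new point; numpy/scipy assumed), using the standard form y_hat0 ± t0.025,n−2 · s · sqrt(1 + 1/n + (x0 − x̄)^2/Sxx):

```python
# Minimal sketch: 95% prediction interval for a new observation at x0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, x0 = len(x), 3.5                               # x0: assumed new x value

b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
Sxx = np.sum((x - x.mean()) ** 2)

y_hat0 = b0 + b1 * x0                             # point prediction at x0
half = stats.t.ppf(0.975, n - 2) * s * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / Sxx)
print(y_hat0 - half, y_hat0 + half)   # wider than the CI for the mean response
```

The extra "1" under the square root is what makes the prediction interval wider than the CI for the mean response.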
CH 5 Confidence interval
T distribution: mound-shaped, symmetric about 0, fatter tails than the standard Normal; larger df → closer to the standard Normal
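A minimal sketch (scipy assumed) showing the t critical value approaching the standard Normal's 1.96 as df grows:

```python
# Minimal sketch: t quantiles converge to the Normal quantile as df increases.
from scipy import stats

for df in (3, 10, 30, 100, 1000):
    print(df, stats.t.ppf(0.975, df))   # shrinks toward 1.96 as df grows
print("Normal:", stats.norm.ppf(0.975))
```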
A point estimator → a single value that estimates an unknown population parameter
Empirical rule: the data must match the stated percentages (≈68%, 95%, 99.7% within 1, 2, 3 SDs of the mean) as well, not only look bell-shaped
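A minimal sketch (numpy assumed, simulated Normal data) checking those percentages directly:

```python
# Minimal sketch: verify the empirical-rule percentages on simulated Normal data.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)
m, sd = z.mean(), z.std(ddof=1)

for k in (1, 2, 3):   # expect ≈ 0.68, 0.95, 0.997 within k SDs of the mean
    print(k, np.mean(np.abs(z - m) <= k * sd))
```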