
CHAPTER TWO

SIMPLE LINEAR REGRESSION


2.1 The Concept of Regression Analysis
2.2 The Simple Linear Regression Model
2.3 The Method of Least Squares
2.4 Properties of Least-Squares Estimators and the
Gauss-Markov Theorem
2.5 Residuals and Goodness of Fit
2.6 Confidence Intervals and Hypothesis Testing in
Regression Analysis
2.7 Prediction with the Simple Linear Regression
2.1 The Concept of Regression Analysis
• Compare regression and correlation: regression is about dependence, correlation about association.
• Objective of regression analysis: to study how the average value of the dependent variable (regressand) varies with the values of the explanatory variables (regressors):
E[Y | X] = f(X)
This is the conditional expectation function (CEF), or population regression function (PRF).
2.1 The Concept of Regression Analysis
• Stochastic PRF (for empirical purposes):
Yi = E[Y | Xi] + εi
• The stochastic disturbance term εi plays a critical role in estimating the PRF.
• The PRF is an idealized concept. Hence, we use the stochastic sample regression function (SRF) to estimate the PRF, i.e.,
we use Yi = Ŷi + ei to estimate Yi = E[Y | Xi] + εi,
where Ŷi = f(Xi).
2.2 The Simple Linear Regression Model
Linear: we assume that the PRF is linear in the parameters (α and β); it may or may not be linear in the variables (Y or X).
E[Y | Xi] = α + βXi  ⟹  Yi = α + βXi + εi
Simple: because we have only one regressor (X).
Accordingly, we use Ŷi = α̂ + β̂Xi to estimate E[Y | Xi] = α + βXi.
2.2 The Simple Linear Regression Model
• Using the theoretical relationship between X and Y, Yi is decomposed into a non-stochastic (systematic) component α + βXi and a random component εi.
• This is a theoretical decomposition because we do not know the values of α and β, or the values of ε.
• The operational decomposition of Yi is made with reference to the fitted line: the actual value Yi equals the fitted value Ŷi = α̂ + β̂Xi plus the residual ei.
• The residuals ei serve a similar purpose as the stochastic term εi, but the two are not identical.
2.2 The Simple Linear Regression Model
• From the PRF:
Yi = E[Y | Xi] + εi  ⟹  εi = Yi − E[Y | Xi]
but E[Y | Xi] = α + βXi, so εi = Yi − α − βXi.
• From the SRF:
Yi = Ŷi + ei  ⟹  ei = Yi − Ŷi
but Ŷi = α̂ + β̂Xi, so ei = Yi − α̂ − β̂Xi.
2.2 The Simple Linear Regression Model
[Figure: scatter of observations O1–O4 at X1–X4 around the population regression line E[Y|Xi] = α + βXi, with intercept α and the vertical distances ε1–ε4 from each point to the line representing the disturbances.]
2.2 The Simple Linear Regression Model
[Figure: the same scatter with both the PRF E[Y|Xi] = α + βXi and the fitted SRF Ŷ = α̂ + β̂X drawn. Vertical distances from the points to the PRF are the disturbances εi; distances to the SRF are the residuals ei. The two are not identical: in the figure, ε1 < e1, ε2 = e2, ε3 < e3, and ε4 > e4.]
2.3 The Method of Least Squares
• Our sample is only one of a large number of possible samples.
• Implication: the SRF is just one of many possible SRFs. Each SRF has its own α̂ and β̂ values.
• Then, which of these lines should we choose?
• Generally, we look for the SRF that is as close as possible to the (unknown) PRF.
• We need a rule that makes the SRF as close as possible to the observed data points.
• But how can we devise such a rule? Equivalently, how can we choose the best technique to estimate the parameters of interest (α and β)?
2.3 The Method of Least Squares
Generally, there are three methods of estimation:
• method of least squares,
• method of moments, and
• maximum likelihood estimation.
The most common method for fitting a regression line is the method of least squares. We will use least-squares estimation, specifically Ordinary Least Squares (OLS).
What does OLS do?
A line fits a dataset well if the observations are close to it, i.e., if the predicted values obtained from the line are close to the values actually observed.
2.3 The Method of Least Squares
This means the residuals should be small.
Therefore, when assessing the fit of a line, the vertical distances of the points from the line are the only distances that matter.
The OLS method finds the best-fitting line for a dataset by minimizing the sum of the squares of the vertical deviations from each data point to the line (the residual sum of squares, RSS):
Minimize RSS = Σ_{i=1}^{n} ei²
We could think of minimizing RSS by successively trying pairs of values for α̂ and β̂ until RSS is made as small as possible.
2.3 The Method of Least Squares
• But we will use differential calculus (which turns out to be a lot easier).
• Why the sum of the squared residuals? Why not just minimize the sum of the residuals?
• To prevent negative residuals from cancelling positive ones.
• If we used Σei, all the residuals ei would receive equal importance no matter how closely or widely scattered the individual observations are about the SRF.
• If so, the algebraic sum of the ei's may be small (even zero) even though the ei's are widely scattered about the SRF.
• Besides, the OLS estimators possess desirable properties under certain assumptions.
2.3 The Method of Least Squares
OLS: minimize over α̂, β̂:  Σ_{i=1}^{n} ei² = Σ_{i=1}^{n} (Yi − Ŷi)² = Σ_{i=1}^{n} (Yi − α̂ − β̂Xi)²

F.O.C. (1): ∂[Σ(Yi − α̂ − β̂Xi)²]/∂α̂ = 0
⟹ 2·[Σ(Yi − α̂ − β̂Xi)]·(−1) = 0  ⟹  Σ(Yi − α̂ − β̂Xi) = 0
⟹ ΣYi − nα̂ − β̂ΣXi = 0
⟹ Ȳ − α̂ − β̂X̄ = 0  ⟹  α̂ = Ȳ − β̂X̄
2.3 The Method of Least Squares
F.O.C. (2): ∂[Σ(Yi − α̂ − β̂Xi)²]/∂β̂ = 0
⟹ 2·[Σ(Yi − α̂ − β̂Xi)]·(−Xi) = 0
⟹ Σ[(Yi − α̂ − β̂Xi)Xi] = 0
⟹ ΣYiXi − α̂ΣXi − β̂ΣXi² = 0
⟹ ΣYiXi = α̂ΣXi + β̂ΣXi²
2.3 The Method of Least Squares
Solve α̂ = Ȳ − β̂X̄ and ΣYiXi = α̂ΣXi + β̂ΣXi² (the normal equations) simultaneously:
ΣYiXi = α̂ΣXi + β̂ΣXi²  ⟹  ΣYiXi = (Ȳ − β̂X̄)(ΣXi) + β̂ΣXi²
⟹ ΣYiXi = ȲΣXi − β̂X̄ΣXi + β̂ΣXi²
⟹ ΣYiXi − ȲΣXi = β̂ΣXi² − β̂X̄ΣXi
⟹ ΣYiXi − ȲΣXi = β̂(ΣXi² − X̄ΣXi)
⟹ ΣYiXi − nX̄Ȳ = β̂(ΣXi² − nX̄²)
because X̄ = ΣXi/n  ⟹  ΣXi = nX̄.
2.3 The Method of Least Squares
Thus,
1.  β̂ = (ΣYiXi − nX̄Ȳ) / (ΣXi² − nX̄²)
Alternative expressions for β̂:
2.  β̂ = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²  =  Σxy / Σx²,  where xi = Xi − X̄ and yi = Yi − Ȳ.
3.  β̂ = cov(X, Y) / var(X)
4.  β̂ = [nΣYiXi − (ΣXi)(ΣYi)] / [nΣXi² − (ΣXi)²]
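Since the four expressions above are algebraically equivalent, they must return the same number on any sample. As a quick check, here is a minimal numpy sketch (using the sales/advertising data from the numerical example later in this section); all four values should print as 0.75.

    import numpy as np

    # Sales/advertising data from the numerical example later in this section.
    X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
    Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)
    n = len(X)
    x, y = X - X.mean(), Y - Y.mean()          # deviations from the means

    b1 = (np.sum(Y * X) - n * X.mean() * Y.mean()) / (np.sum(X**2) - n * X.mean()**2)
    b2 = np.sum(x * y) / np.sum(x**2)
    b3 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)     # cov(X, Y) / var(X)
    b4 = (n * np.sum(Y * X) - X.sum() * Y.sum()) / (n * np.sum(X**2) - X.sum()**2)

    print(b1, b2, b3, b4)                      # each prints 0.75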
2.3 The Method of Least Squares
For α̂, just use: α̂ = Ȳ − β̂X̄
Or,  α̂ = Ȳ − X̄·[(ΣYiXi − nX̄Ȳ) / (ΣXi² − nX̄²)]
⟹ α̂ = [(ΣYi)(ΣXi²) − (ΣXi)(ΣYiXi)] / [n(ΣXi² − nX̄²)]
2.3 The Method of Least Squares
Previously, we came across two normal equations:
1.  Σ_{i=1}^{n} (Yi − α̂ − β̂Xi) = 0, which is equivalent to Σei = 0.
2.  Σ_{i=1}^{n} [(Yi − α̂ − β̂Xi)Xi] = 0, which is equivalent to ΣeiXi = 0.
Note also the following property: Ȳ equals the mean of the Ŷi.
Yi = Ŷi + ei  ⟹  ΣYi = ΣŶi + Σei
⟹ ΣYi/n = ΣŶi/n + Σei/n  ⟹  Ȳ = mean of Ŷi, since Σei = 0.
2.3 The Method of Least Squares
Ȳ equals the mean of the Ŷi, and Ȳ = α̂ + β̂X̄.
These facts imply that the sample regression line Ŷ = α̂ + β̂X passes through the sample mean point (X̄, Ȳ).
[Figure: the fitted line Ŷ = α̂ + β̂X passing through the point (X̄, Ȳ).]
2.3 The Method of Least Squares
Numerical example: explaining sales = f(advertising). Sales are in thousands of Birr and advertising expenses are in hundreds of Birr.

Firm (i)   Sales (Yi)   Advertising Expense (Xi)
   1          11              10
   2          10               7
   3          12              10
   4           6               5
   5          10               8
   6           7               8
   7           9               6
   8          10               7
   9          11               9
  10          10              10
2.3 The Method of Least Squares
 i    Yi    Xi    yi = Yi − Ȳ    xi = Xi − X̄    xi·yi
 1    11    10        1.4             2           2.8
 2    10     7        0.4            -1          -0.4
 3    12    10        2.4             2           4.8
 4     6     5       -3.6            -3          10.8
 5    10     8        0.4             0           0
 6     7     8       -2.6             0           0
 7     9     6       -0.6            -2           1.2
 8    10     7        0.4            -1          -0.4
 9    11     9        1.4             1           1.4
10    10    10        0.4             2           0.8
 Σ    96    80        0               0          21

Ȳ = ΣYi/n = 96/10 = 9.6        X̄ = ΣXi/n = 80/10 = 8
2.3 The Method of Least Squares
 i     yi     xi     yi²     xi²
 1    1.4     2     1.96     4
 2    0.4    -1     0.16     1
 3    2.4     2     5.76     4
 4   -3.6    -3    12.96     9
 5    0.4     0     0.16     0
 6   -2.6     0     6.76     0
 7   -0.6    -2     0.36     4
 8    0.4    -1     0.16     1
 9    1.4     1     1.96     1
10    0.4     2     0.16     4
 Σ    0       0    30.4     28

β̂ = Σxiyi / Σxi² = 21/28 = 0.75
α̂ = Ȳ − β̂X̄ = 9.6 − 0.75(8) = 3.6
2.3 The Method of Least Squares
Ŷi = 3.6 + 0.75Xi,   ei = Yi − Ŷi

 i     Ŷi      ei       ei²
 1    11.10   -0.10    0.01
 2     8.85    1.15    1.3225
 3    11.10    0.90    0.81
 4     7.35   -1.35    1.8225
 5     9.60    0.40    0.16
 6     9.60   -2.60    6.76
 7     8.10    0.90    0.81
 8     8.85    1.15    1.3225
 9    10.35    0.65    0.4225
10    11.10   -1.10    1.21
 Σ    96       0      14.65

Σei² = 14.65,  Σŷi² = 15.75,  Σyi² = 30.4
Σyi = Σxi = Σŷi = Σei = 0  (where ŷi = Ŷi − Ȳ)
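The whole computation above can be reproduced in a few lines. Below is a minimal Python/numpy sketch (not part of the original slides) that recovers β̂ = 0.75, α̂ = 3.6, RSS = 14.65, and verifies the normal equations Σei = 0 and ΣeiXi = 0.

    import numpy as np

    # Sales (Y, in thousands of Birr) and advertising expense (X, in hundreds of Birr)
    X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
    Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)

    x = X - X.mean()                           # deviations from the mean
    y = Y - Y.mean()

    beta_hat = np.sum(x * y) / np.sum(x**2)    # 21 / 28 = 0.75
    alpha_hat = Y.mean() - beta_hat * X.mean() # 9.6 - 0.75*8 = 3.6

    Y_fit = alpha_hat + beta_hat * X
    e = Y - Y_fit

    print(beta_hat, alpha_hat)                 # 0.75  3.6
    print(np.sum(e**2))                        # RSS = 14.65
    print(np.sum(e), np.sum(e * X))            # both ~0 (the normal equations)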
2.3 The Method of Least Squares
Assumptions Underlying the Method of Least Squares
• To obtain α̂ and β̂ in the model Yi = α̂ + β̂Xi + ei, the only assumption we need is that X takes at least two distinct values (and that the number of observations is at least the number of parameters).
• But the objective in regression analysis is not only to obtain α̂ and β̂ but also to draw inferences about the true parameters α and β.
• For example, we would like to know how close α̂ and β̂ are to α and β, or how close Ŷi is to E[Y | Xi].
• To that end, we must also make certain assumptions about the manner in which the Yi's are generated.
2.3 The Method of Least Squares
• The PRF Yi = α + βXi + εi shows that Yi depends on both Xi and εi.
• Therefore, unless we are specific about how Xi and εi are generated, there is no way we can make any statistical inference about Yi, or about α and β.
• The assumptions made about the X variable and the error term are thus critical to the valid interpretation of the regression estimates.
2.3 The Method of Least Squares
THE ASSUMPTIONS:
1. Zero mean value of the error term: E(ε|Xi) = 0, or equivalently, E[Y|Xi] = α + βXi.
2. Homoskedasticity (equal variance of εi): the variance of ε is the same finite positive constant σ² for all observations, i.e.,
var(ε|Xi) = E{[ε − E(ε|Xi)]² | Xi} = E(ε²|Xi) = σ².
By implication, var(Y|Xi) = σ²:
var(Y|Xi) = E[α + βXi + εi − (α + βXi)]² = E(ε²|Xi) = σ² for all i.
2.3 The Method of Least Squares
3. No autocorrelation between the disturbance terms: each error term εi is uncorrelated with every other error term εs (for s ≠ i).
cov(εi, εs | Xi, Xs) = E{[εi − E(εi)][εs − E(εs)] | Xi, Xs} = E(εiεs | Xi, Xs) = 0.
Equivalently, cov(Yi, Ys | Xi, Xs) = 0 for all s ≠ i.
4. The disturbance term ε and the explanatory variable X are uncorrelated: cov(εi, Xi) = 0.
cov(εi, Xi) = E{[εi − E(εi)][Xi − E(Xi)]} = E[εi(Xi − E(Xi))] = E(εiXi) − E(Xi)E(εi) = E(εiXi) = 0.
2.3 The Method of Least Squares
5. The error terms are normally and independently distributed, i.e., εi ~ NID(0, σ²).
• Assumptions 1 to 3 together imply that εi ~ IID(0, σ²).
• The normality assumption enables us to derive the sampling distributions of the OLS estimators (α̂ and β̂). This simplifies the task of establishing confidence intervals and testing hypotheses.
6. X is usually assumed to be fixed (non-stochastic), but this is needed for simplicity only.
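To make the assumptions concrete, the following sketch simulates one sample from a data-generating process that satisfies them: a fixed X and NID(0, σ²) errors added to α + βXi. The parameter values (α = 3.6, β = 0.75, σ² ≈ 1.83) are hypothetical, chosen to match the chapter's estimates purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical population parameters, set equal to the chapter's estimates
    # purely for illustration: alpha = 3.6, beta = 0.75, sigma^2 = 1.83.
    alpha, beta, sigma = 3.6, 0.75, np.sqrt(1.83)

    X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)  # fixed regressor (assumption 6)
    eps = rng.normal(loc=0.0, scale=sigma, size=X.size)           # NID(0, sigma^2): assumptions 1, 2, 3, 5
    Y = alpha + beta * X + eps                                    # the PRF: Y_i = alpha + beta*X_i + eps_i

    print(Y.round(2))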
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
☞ Given the assumptions of the classical linear regression model, the least-squares estimators possess some ideal or optimum properties.
These statistical properties are extremely important because they provide criteria for choosing among alternative estimators.
These properties are contained in the well-known Gauss–Markov Theorem.
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
Gauss–Markov Theorem:
Under the above assumptions of the linear regression model, the estimators α̂ and β̂ have the smallest variance among all linear unbiased estimators of α and β; i.e., the OLS estimators are the Best Linear Unbiased Estimators (BLUE) of α and β.
• The Gauss–Markov Theorem does not depend on the assumption of normality (of the error terms).
• Let us prove that β̂ is the BLUE of β!
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
Linearity of β̂ (in a stochastic variable, Yi or εi):
β̂ = Σxiyi/Σxi² = Σxi(Yi − Ȳ)/Σxi² = ΣxiYi/Σxi² − ȲΣxi/Σxi²
  = ΣxiYi/Σxi²   (since Σxi = 0)
⟹ β̂ = Σ(xi/Σxi²)Yi  ⟹  β̂ = ΣkiYi,  where ki = xi/Σxi²
⟹ β̂ = k1Y1 + k2Y2 + ... + knYn
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
Note that:
(1) Σxi² is a constant.
(2) Since xi is non-stochastic, ki is also non-stochastic.
(3) Σki = Σ(xi/Σxi²) = Σxi/Σxi² = 0.
(4) Σkixi = Σ(xi/Σxi²)xi = Σxi²/Σxi² = 1.
(5) Σki² = Σ(xi/Σxi²)² = Σxi²/(Σxi²)² = 1/Σxi².
(6) ΣkiXi = Σ(xi/Σxi²)Xi = Σ(xi/Σxi²)(xi + X̄) = Σxi²/Σxi² + X̄Σxi/Σxi² = 1.
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
Unbiasedness:  β̂ = ΣkiYi
β̂ = Σki(α + βXi + εi)
β̂ = αΣki + βΣkiXi + Σkiεi
β̂ = β + Σkiεi   [because Σki = 0 and ΣkiXi = 1]
E(β̂) = E(β) + E(k1ε1 + k2ε2 + ... + knεn)
E(β̂) = β + (Σki)·E(εi) = β + (Σki)·(0)  ⟹  E(β̂) = β
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
Efficiency:
Suppose β̃ is another unbiased linear estimator of β. Then var(β̂) ≤ var(β̃).
Proof:  var(β̂) = var(ΣkiYi)
var(β̂) = var(k1Y1 + k2Y2 + ... + knYn)
var(β̂) = var(k1Y1) + var(k2Y2) + ... + var(knYn)
   {since the covariance between Yi and Ys (for i ≠ s) is zero}
var(β̂) = k1²var(Y1) + k2²var(Y2) + ... + kn²var(Yn)
var(β̂) = k1²σ² + k2²σ² + ... + kn²σ²
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
2
var( ˆ ) =  2  k i2 or , var(ˆ ) = x 2
~
 i

Suppose :  =  wi Yi where wi s are coefficients.


~
 =  wiYi
~
 =  wi ( + X i +  i )
~
 =   wi +   wi X i +  wi i
~
E(β ) = (  wi ).E(α ) + (  wi X i ).E( β) + (  wi ).E( εi )
~
E(β ) = (  wi ).α + (  wi X i ).β
~
for  to be an unbiased estimator of  ,  wi = 0 and w X i i = 1.
35
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
var(β̃) = var(ΣwiYi)
var(β̃) = var(w1Y1 + w2Y2 + ... + wnYn)
var(β̃) = var(w1Y1) + var(w2Y2) + ... + var(wnYn)
   {since the covariance between Yi and Ys (for i ≠ s) is zero}
var(β̃) = w1²var(Y1) + w2²var(Y2) + ... + wn²var(Yn)
var(β̃) = w1²σ² + w2²σ² + ... + wn²σ²
var(β̃) = σ²Σwi²
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
• Let us now compare var(β̂) and var(β̃)!
• Suppose wi ≠ ki, and let the relationship between them be given by di = wi − ki.
• Because both Σwi and Σki equal zero: Σdi = Σwi − Σki = 0.
• Because both Σwixi and Σkixi equal one: Σdixi = Σwixi − Σkixi = 1 − 1 = 0.
• (wi)² = (ki + di)²  ⟹  wi² = ki² + di² + 2kidi
  ⟹ Σwi² = Σki² + Σdi² + 2Σdi(xi/Σxi²)
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
⟹ Σwi² = Σki² + Σdi² + 2(1/Σxi²)(Σdixi)
⟹ Σwi² = Σki² + Σdi² + 2(1/Σxi²)(0)
⟹ Σwi² = Σki² + Σdi²
Given wi ≠ ki, not all the di are zero, and thus Σdi² > 0.
⟹ Σwi² > Σki²
⟹ σ²Σwi² > σ²Σki²  ⟹  var(β̃) > var(β̂).
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
Linearity of α̂:  α̂ = Ȳ − β̂X̄
⟹ α̂ = Ȳ − X̄·ΣkiYi
⟹ α̂ = Ȳ − X̄{k1Y1 + k2Y2 + ... + knYn}
⟹ α̂ = (1/n)(Y1 + Y2 + ... + Yn) − {X̄k1Y1 + X̄k2Y2 + ... + X̄knYn}
⟹ α̂ = (1/n − X̄k1)Y1 + (1/n − X̄k2)Y2 + ... + (1/n − X̄kn)Yn
⟹ α̂ = f1Y1 + f2Y2 + ... + fnYn,  where fi = 1/n − X̄ki
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
Unbiasedness:  α̂ = Ȳ − β̂X̄
⟹ α̂ = (α + βX̄ + ε̄) − X̄{Σki(α + βXi + εi)}
⟹ α̂ = (α + βX̄ + ε̄) − X̄{αΣki + βΣkiXi + Σkiεi}
⟹ α̂ = (α + βX̄ + ε̄) − X̄{β + Σkiεi}
⟹ α̂ = α + ε̄ − X̄Σkiεi
⟹ E(α̂) = α + E(ε̄) − X̄(Σki)·E(εi)
⟹ E(α̂) = α
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
Efficiency:
Suppose α̃ is another unbiased linear estimator of α. Then var(α̂) ≤ var(α̃).
Proof:  var(α̂) = var(ΣfiYi)
var(α̂) = var(f1Y1 + f2Y2 + ... + fnYn)
var(α̂) = var(f1Y1) + var(f2Y2) + ... + var(fnYn)
   {since cov(Yi, Ys) = 0 for i ≠ s}
var(α̂) = f1²var(Y1) + f2²var(Y2) + ... + fn²var(Yn)
var(α̂) = f1²σ² + f2²σ² + ... + fn²σ² = σ²Σfi²
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
var(α̂) = σ²Σfi² = σ²Σ(1/n − X̄ki)²
var(α̂) = σ²Σ(1/n² + X̄²ki² − 2X̄ki/n)
var(α̂) = σ²{n(1/n²) + X̄²Σki² − (2X̄/n)Σki}
var(α̂) = σ²{1/n + X̄²Σki²} = σ²{1/n + X̄²/Σxi²}
or,  var(α̂) = σ²·ΣXi²/(nΣxi²)
Note that:
Σfi = Σ(1/n − X̄ki) = 1 − X̄Σki = 1,  and  Σfi² = 1/n + X̄²/Σxi².
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
Suppose α̃ = ΣziYi, where the zi are coefficients. Then:
α̃ = Σzi(α + βXi + εi)
α̃ = αΣzi + βΣziXi + Σziεi
E(α̃) = (Σzi)·α + (ΣziXi)·β + (Σzi)·E(εi) = (Σzi)·α + (ΣziXi)·β
For α̃ to be an unbiased estimator of α, we need Σzi = 1 and ΣziXi = 0.
var(α̃) = var(ΣziYi)
var(α̃) = var(z1Y1 + z2Y2 + ... + znYn)
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
var(~ ) = var( z1Y1 ) + var( z 2Y2 ) + ... + var( z n Yn )
since cov (Y , Y ) = 0 for i  s.
i s
~
var( ) = z1 var(Y1 ) + z 2 var(Y2 ) + ... + z n var(Yn )
2 2 2

~
var( ) = z ( ) + z ( ) + ... + z ( )
2 2 2 2 2 2
1 2 n
~
var( ) =  2
z 2
i
 Let us now compare var(ˆ ) and var(~ )!
 Suppose zi  f i , and the relatioshi p
b/n them be given by : d i = zi − f i . 44
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
• Because ΣziXi = 0 and Σzi = 1:
  Σzixi = Σzi(Xi − X̄) = ΣziXi − X̄Σzi = 0 − X̄(1) = −X̄
• Σdi² = Σzi² + Σfi² − 2Σzifi,  where fi = 1/n − X̄(xi/Σxi²)
• Σdi² = Σzi² + Σfi² − 2Σ[zi(1/n − X̄xi/Σxi²)]
⟹ Σdi² = Σzi² + Σfi² − 2{(1/n)Σzi − (X̄/Σxi²)(Σzixi)}
⟹ Σdi² = Σzi² + Σfi² − 2{1/n − (X̄/Σxi²)(−X̄)}
2.4 Properties of OLS Estimators and the Gauss-Markov Theorem
⟹ Σdi² = Σzi² + Σfi² − 2{1/n + X̄²/Σxi²}
⟹ Σdi² = Σzi² + Σfi² − 2Σfi²
⟹ Σdi² = Σzi² − Σfi²
⟹ Σzi² = Σdi² + Σfi²
⟹ Σzi² ≥ Σfi²
⟹ σ²Σzi² ≥ σ²Σfi²
⟹ var(α̃) ≥ var(α̂).
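A small Monte Carlo experiment can illustrate the Gauss–Markov result numerically. The sketch below (with hypothetical parameter values) compares the OLS slope with another linear unbiased estimator, the slope through two of the data points: both average out to β, but the OLS estimator has the smaller sampling variance.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical setup, purely for illustration: fixed X, alpha = 3.6, beta = 0.75, sigma^2 = 1.83.
    X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
    alpha, beta, sigma = 3.6, 0.75, np.sqrt(1.83)
    x = X - X.mean()

    b_ols, b_alt = [], []
    for _ in range(20_000):
        Y = alpha + beta * X + rng.normal(0.0, sigma, X.size)
        b_ols.append(np.sum(x * (Y - Y.mean())) / np.sum(x**2))   # OLS: beta_hat = sum(k_i * Y_i)
        # Alternative linear unbiased estimator: slope through the first and fourth points
        # (its weights satisfy sum(w_i) = 0 and sum(w_i X_i) = 1, so it is unbiased).
        b_alt.append((Y[0] - Y[3]) / (X[0] - X[3]))

    print(np.mean(b_ols), np.mean(b_alt))   # both approximately 0.75 (unbiased)
    print(np.var(b_ols), np.var(b_alt))     # OLS variance is the smaller of the two (Gauss-Markov)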
2.5 Residuals and Goodness of Fit
Decomposing the variation in Y:
2.5 Residuals and Goodness of Fit
One measure of the variation in Y is the sum of its squared deviations around its sample mean, often called the Total Sum of Squares (TSS).
TSS can be decomposed into two parts:
• ESS, the 'explained' sum of squares, and
• RSS, the residual ('unexplained') sum of squares.
TSS = ESS + RSS
Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ)² + Σei²
2.5 Residuals and Goodness of Fit
Yi = Ŷi + ei  ⟹  Yi − Ȳ = Ŷi − Ȳ + ei
(Yi − Ȳ)² = (Ŷi − Ȳ + ei)²
Σ(Yi − Ȳ)² = Σ(Ŷi − Ȳ + ei)²
Σyi² = Σ(ŷi + ei)²
Σyi² = Σŷi² + Σei² + 2Σŷiei
The last term equals zero:
Σŷiei = Σ(Ŷi − Ȳ)ei = ΣŶiei − ȲΣei
⟹ Σŷiei = Σ(α̂ + β̂Xi)ei − ȲΣei
2.5 Residuals and Goodness of Fit
⟹ Σŷiei = α̂Σei + β̂ΣXiei − ȲΣei
⟹ Σŷiei = 0   (since Σei = 0 and ΣXiei = 0)
Hence:  Σyi² = Σŷi² + Σei²
TSS = ESS + RSS
In our example: 30.4 = 15.75 + 14.65
Coefficient of Determination (R²): the proportion of the variation in the dependent variable that is explained by the model.
2.5 Residuals and Goodness of Fit
1.  R² = ESS/TSS = Σŷi²/Σyi²
2.  R² = ESS/TSS = Σ(β̂xi)²/Σyi² = β̂²Σxi²/Σyi²
• The OLS regression coefficients are chosen so as to minimize the sum of squared residuals. It automatically follows that they maximize R².
TSS = ESS + RSS  ⟹  1 = ESS/TSS + RSS/TSS
3.  R² = 1 − RSS/TSS = 1 − Σei²/Σyi²
2.5 Residuals and Goodness of Fit
Coefficient of Determination (R²), continued:
R² = ESS/TSS = β̂²Σxi²/Σyi² = β̂(Σxiyi)/Σyi²   (since β̂ = Σxiyi/Σxi²)
4.  R² = ESS/TSS = β̂Σxiyi/Σyi² = 15.75/30.4 = 0.5181
5.  R² = (Σxiyi)² / (Σxi²·Σyi²)
6.  R² = [cov(X, Y)]² / [var(X)·var(Y)]
2.5 Residuals and Goodness of Fit
• A natural criterion of goodness of fit is the correlation between the actual and fitted values of Y. The least-squares principle also maximizes this.
• In fact,  R² = (r_ŷ,y)² = (r_x,y)²
where r_ŷ,y and r_x,y are the coefficients of correlation between Ŷ and Y, and between X and Y, defined as:
r_ŷ,y = cov(Ŷ, Y)/(σ_Ŷ·σ_Y)  and  r_x,y = cov(X, Y)/(σ_X·σ_Y), respectively.
Note:  RSS = (1 − R²)Σyi²
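A short numpy sketch of this decomposition and of R² for the chapter's example; the squared correlation between X and Y gives the same 0.5181.

    import numpy as np

    # Sales/advertising data from the chapter's numerical example.
    X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
    Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)

    x, y = X - X.mean(), Y - Y.mean()
    beta_hat = np.sum(x * y) / np.sum(x**2)
    alpha_hat = Y.mean() - beta_hat * X.mean()
    e = Y - (alpha_hat + beta_hat * X)

    TSS = np.sum(y**2)                    # 30.4
    RSS = np.sum(e**2)                    # 14.65
    ESS = TSS - RSS                       # 15.75
    R2 = ESS / TSS                        # 0.5181...

    # An alternative formula gives the same number.
    R2_corr = np.corrcoef(X, Y)[0, 1]**2  # squared correlation between X and Y
    print(R2, R2_corr)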
To sum up:
Use Ŷi = α̂ + β̂Xi to estimate E[Y | Xi] = α + βXi.
OLS: minimize over α̂, β̂:  Σ_{i=1}^{n} ei² = Σ(Yi − Ŷi)² = Σ(Yi − α̂ − β̂Xi)²
β̂ = Σxy/Σx²,   α̂ = Ȳ − β̂X̄
Given the assumptions of the linear regression model, the estimators α̂ and β̂ have the smallest variance of all linear and unbiased estimators of α and β.
var(β̂) = σ²/Σxi²
var(α̂) = σ²(1/n + X̄²/Σxi²) = σ²ΣXi²/(nΣxi²)
2
To sum up …
2 2
 i  i i
2
y = ˆ
y 2
+ e 2
var(ˆ ) = =
x 2
i 28
TSS = ESS + RSS  0.0357 2

R =
2 ESS
=
y
ˆ 2
1 X 2

y 2 var(ˆ ) =  ( +
2
)
n  xi
TSS 2

 yˆ 2
= ̂  xy
var(ˆ ) =  2 (
1
+
64
)
10 28
 yˆ 2 ˆ
= 2
x 2
 2.3857 2

RSS = (1 − R ) y 2 2
But ,  = ? 2
55
An unbiased estimator for σ²:
E(RSS) = E(Σei²) = (n − 2)σ²
Thus, if we define σ̂² = Σei²/(n − 2), then:
E(σ̂²) = [1/(n − 2)]·E(Σei²) = [1/(n − 2)]·(n − 2)σ² = σ²
⟹ σ̂² = Σei²/(n − 2) is an unbiased estimator of σ².
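Continuing the example in code: the sketch below computes σ̂² = RSS/(n − 2) and the resulting estimated standard errors of α̂ and β̂ (≈2.09 and ≈0.256), which are used in the next section.

    import numpy as np

    # Continuing with the chapter's sales/advertising example.
    X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
    Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)
    n = len(X)

    x = X - X.mean()
    beta_hat = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
    alpha_hat = Y.mean() - beta_hat * X.mean()
    e = Y - (alpha_hat + beta_hat * X)

    sigma2_hat = np.sum(e**2) / (n - 2)                              # 14.65 / 8 = 1.83125
    var_beta_hat = sigma2_hat / np.sum(x**2)                         # ~0.0654
    var_alpha_hat = sigma2_hat * (1/n + X.mean()**2 / np.sum(x**2))  # ~4.3688

    print(sigma2_hat, np.sqrt(var_beta_hat), np.sqrt(var_alpha_hat)) # 1.83125  ~0.256  ~2.09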
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
Why is the Error Normality Assumption Important?
The normality assumption permits us to derive the functional form of the sampling distributions of α̂, β̂ and σ̂².
Knowing the form of the sampling distributions enables us to derive feasible test statistics for the OLS coefficient estimators.
These feasible test statistics enable us to conduct statistical inference, i.e.,
1) to construct confidence intervals for α, β and σ², and
2) to test hypotheses about the values of α, β and σ².
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
εi ~ N(0, σ²)  ⟹  Yi ~ N(α + βXi, σ²)
β̂ ~ N(β, σ²/Σxi²)        α̂ ~ N(α, σ²ΣXi²/(nΣxi²))
[(β̂ − β)/σ]·√(Σxi²) ~ N(0, 1)
(β̂ − β)/sê(β̂) ~ t(n−2),  where σ̂² = Σei²/(n − 2)
sê(β̂) = σ̂/√(Σxi²)        sê(α̂) = σ̂·√(ΣXi²/(nΣxi²))
Similarly, (α̂ − α)/sê(α̂) ~ t(n−2).
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
Confidence Intervals for α and β:
P{−t(α/2, n−2) ≤ (α̂ − α)/sê(α̂) ≤ t(α/2, n−2)} = 1 − α
100(1 − α)% two-sided CI for α:  α̂ ± t(α/2, n−2)·sê(α̂)
Similarly, the 100(1 − α)% two-sided CI for β:  β̂ ± t(α/2, n−2)·sê(β̂)
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
Let us continue with our earlier example.
We have: n = 10, α̂ = 3.6, β̂ = 0.75, R² = 0.5181,
var(α̂) ≈ 2.3857σ², var(β̂) ≈ 0.0357σ², and Σei² = 14.65.
σ² is estimated by σ̂² = Σei²/(n − 2) = 14.65/8 = 1.83125
⟹ σ̂ = √1.83125 ≈ 1.3532
Thus, vâr(α̂) ≈ 2.3857(1.83125) ≈ 4.3688  ⟹  sê(α̂) ≈ √4.3688 ≈ 2.09
vâr(β̂) ≈ 0.0357(1.83125) ≈ 0.0654  ⟹  sê(β̂) ≈ √0.0654 ≈ 0.256
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
• 95% CIs for α and β: 1 − α = 0.95 ⟹ α = 0.05 ⟹ α/2 = 0.025, and t(0.025, 8) = 2.306.
• 95% CI for α: 3.6 ± (2.306)(2.09) = 3.6 ± 4.8195  ⟹  [−1.2195, 8.4195]
• 95% CI for β: 0.75 ± (2.306)(0.256) = 0.75 ± 0.5903  ⟹  [0.1597, 1.3403]
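A minimal scipy sketch of these interval calculations, taking the estimates and standard errors from the example as given.

    import numpy as np
    from scipy import stats

    # Estimates and standard errors from the chapter's example.
    alpha_hat, se_alpha = 3.6, 2.09
    beta_hat, se_beta = 0.75, 0.256
    df = 10 - 2

    t_crit = stats.t.ppf(0.975, df)          # two-sided 5% critical value, ~2.306

    ci_alpha = (alpha_hat - t_crit * se_alpha, alpha_hat + t_crit * se_alpha)
    ci_beta = (beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta)
    print(ci_alpha)   # approximately (-1.22, 8.42)
    print(ci_beta)    # approximately (0.16, 1.34)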
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
The confidence intervals we have constructed for α and β are two-sided intervals.
Sometimes we want either the upper or the lower limit only, in which case we construct one-sided intervals.
For instance, let us construct a one-sided (upper-limit) 95% confidence interval for β.
From the t-table, t(0.05, 8) = 1.86.
Hence, β̂ + t(0.05, 8)·sê(β̂) = 0.75 + 1.86(0.256) = 0.75 + 0.48 = 1.23
The confidence interval is (−∞, 1.23].
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
➢ Similarly, the 95% lower limit is:
β̂ − t(0.05, 8)·sê(β̂) = 0.75 − 1.86(0.256) = 0.75 − 0.48 = 0.27
➢ Hence, the 95% CI is [0.27, ∞).
Hypothesis Testing:
• Let us use our example to test the following hypotheses.
• Result:  Ŷi = 3.6 + 0.75Xi
           (2.09)  (0.256)
1. Test the claim that sales do not depend on advertising expense (at the 5% level of significance).
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
• H0: β = 0 against Ha: β ≠ 0.
• Test statistic: tc = (β̂ − β)/sê(β̂)  ⟹  tc = (0.75 − 0)/0.256 = 2.93
• Critical value (tt = tabulated t): α = 0.05 ⟹ α/2 = 0.025,  tt = t(0.025, 8) = 2.306
• Since |tc| > tt, we reject the null (the alternative is supported). That is, the slope coefficient is statistically significantly different from zero: advertising has a significant influence on sales.
2. Test whether the intercept is greater than 3.5.
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
• H0: α = 3.5 against Ha: α > 3.5.
• Test statistic: tc = (α̂ − α)/sê(α̂)  ⟹  tc = (3.6 − 3.5)/2.09 = 0.1/2.09 ≈ 0.05
• Critical value (tt = tabulated t): at the 5% level of significance (α = 0.05), tt = t(0.05, 8) = 1.86
• Since tc < tt, we do not reject the null. That is, the intercept is not statistically significantly greater than 3.5.
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
3. Can you reject the claim that a unit increase in advertising expense raises sales by one unit? If so, at what level of significance?
• H0: β = 1 against Ha: β ≠ 1.
• Test statistic: tc = (β̂ − β)/sê(β̂)  ⟹  tc = (0.75 − 1)/0.256 = −0.25/0.256 ≈ −0.98
• At α = 0.05, t(0.025, 8) = 2.306, and thus H0 can't be rejected.
• Similarly, at α = 0.10, t(0.05, 8) = 1.86, and H0 can't be rejected.
• At α = 0.20, t(0.10, 8) = 1.397, and thus H0 can't be rejected.
• At α = 0.50, t(0.25, 8) = 0.706, and thus H0 is rejected.
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
• For what level of significance (probability) is the tabulated t value for 8 df at least as extreme as |tc| = 0.98?
• i.e., find P for which P{|t| ≥ 0.98} = P{t ≥ 0.98 or t ≤ −0.98}.
• P{t ≥ 0.706} = 0.25 and P{t ≥ 1.397} = 0.10.
• 0.98 is between the two numbers (0.706 and 1.397).
• So, P{t ≥ 0.98} is somewhere between 0.25 and 0.10.
• Using software, P{t ≥ 0.98} ≈ 0.19.
• P{|t| ≥ 0.98} = 2·P{t ≥ 0.98} ≈ 0.38
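The same p-value calculation can be done directly with scipy; the exact figure depends on how tc is rounded, so treat the printed values as approximate.

    from scipy import stats

    t_c = (0.75 - 1.0) / 0.256               # test statistic for H0: beta = 1
    df = 8

    p_one_sided = stats.t.sf(abs(t_c), df)   # P(t >= |t_c|)
    p_two_sided = 2 * p_one_sided            # two-sided p-value
    print(t_c, p_one_sided, p_two_sided)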
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
• For our H0 to be rejected, the minimum level of significance (the probability of a Type I error) would have to be as high as about 38%. To conclude, H0 is retained!
• The p-value associated with the calculated sample value of the test statistic is defined as the lowest significance level at which H0 can be rejected.
• Small p-values constitute strong evidence against H0.
2.6 Confidence Intervals and Hypothesis Testing in Regression Analysis
There is a correspondence between the confidence intervals derived earlier and tests of hypotheses.
For instance, the 95% CI we derived earlier for β is (0.16 < β < 1.34).
Any hypothesis that says β = c, where c is in this interval, will not be rejected at the 5% level in a two-sided test.
For instance, the hypothesis β = 1 was not rejected, but the hypothesis β = 0 was.
For one-sided tests we consider one-sided confidence intervals.
2.7 Prediction with the Simple Linear Regression
The estimated regression equation Ŷi = α̂ + β̂Xi is used for predicting the value (or the average value) of Y for given values of X.
Let X0 be the given value of X. Then we predict the corresponding value YP of Y by: ŶP = α̂ + β̂X0
The true value YP is given by: YP = α + βX0 + εP
Hence the prediction error is: ŶP − YP = (α̂ − α) + (β̂ − β)X0 − εP
E(ŶP − YP) = E(α̂ − α) + E(β̂ − β)X0 − E(εP)  ⟹  E(ŶP − YP) = 0
⟹ ŶP = α̂ + β̂X0 is an unbiased predictor of YP. (BLUP!)
2.7 Prediction with the Simple Linear Regression
The variance of the prediction error is:
var(ŶP − YP) = var(α̂ − α) + X0²·var(β̂ − β) + 2X0·cov(α̂ − α, β̂ − β) + var(εP)
var(ŶP − YP) = σ²ΣXi²/(nΣxi²) + σ²X0²/Σxi² − 2X0σ²X̄/Σxi² + σ²
var(ŶP − YP) = σ²[1 + 1/n + (X0 − X̄)²/Σxi²]
Thus, the variance increases the farther away the value of X0 is from X̄, the mean of the observations from which α̂ and β̂ have been computed.
2.7 Prediction with the Simple Linear Regression
That is, prediction is more precise for values of X0 nearer to the mean (as compared with extreme values).
Within-sample prediction (interpolation): X0 lies within the range of the sample observations on X.
Out-of-sample prediction (extrapolation): X0 lies outside the range of the sample observations. Not recommended!
Sometimes we are interested in predicting the mean of Y given X0. We use ŶP = α̂ + β̂X0 to predict E[YP] = α + βX0. (The same predictor as before!)
The prediction error is: ŶP − E[YP] = (α̂ − α) + (β̂ − β)X0
2.7 Prediction with the Simple Linear Regression
The variance of the prediction error is:
var(ŶP − E[YP]) = var(α̂ − α) + X0²·var(β̂ − β) + 2X0·cov(α̂ − α, β̂ − β)
⟹ var(ŶP − E[YP]) = σ²[1/n + (X0 − X̄)²/Σxi²]
Again, the variance increases the farther away the value of X0 is from X̄.
The variance (standard error) of the prediction error is smaller in this case (predicting E(Y|X)) than when predicting an individual value of Y given X.
2.7 Prediction with the Simple Linear Regression
Predict (a) the value of sales for a firm, and (b) the average value of sales for firms, with an advertising expense of six hundred Birr.
a. From Ŷi = 3.6 + 0.75Xi, at X0 = 6:
Point prediction: Ŷ = 3.6 + 0.75(6) = 8.1
[Sales value | advertising of 600 Birr] = 8,100 Birr.
Interval prediction (95% CI), with t(0.025, 8) = 2.306:
sê(ŶP*) = √{σ̂²[1 + 1/n + (X0 − X̄)²/Σxi²]} = 1.35·√(1 + 1/10 + (6 − 8)²/28) = 1.35(1.115) ≈ 1.508
2.7 Prediction with the Simple Linear Regression
Hence, 95% CI: 8.1 ± (2.306)(1.508)  ⟹  [4.62, 11.58]
b. From Ŷi = 3.6 + 0.75Xi, at X0 = 6: Ŷ = 3.6 + 0.75(6) = 8.1
Point prediction: [Average sales | advertising of 600 Birr] = 8,100 Birr.
Interval prediction (95% CI):
sê(ŶP*) = √{σ̂²[1/n + (X0 − X̄)²/Σxi²]} = 1.35·√(1/10 + (6 − 8)²/28) = 1.35(0.493) ≈ 0.667
95% CI: 8.1 ± (2.306)(0.667)  ⟹  [6.56, 9.64]
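Both interval predictions can be reproduced as follows; a minimal sketch for the chapter's data at X0 = 6, using the two standard-error formulas above.

    import numpy as np
    from scipy import stats

    # Chapter example: prediction at X0 = 6 (advertising of 600 Birr).
    X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
    Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)
    n, X0 = len(X), 6.0

    x = X - X.mean()
    beta_hat = np.sum(x * (Y - Y.mean())) / np.sum(x**2)
    alpha_hat = Y.mean() - beta_hat * X.mean()
    e = Y - (alpha_hat + beta_hat * X)
    sigma2_hat = np.sum(e**2) / (n - 2)

    y_pred = alpha_hat + beta_hat * X0                      # point prediction: 8.1
    t_crit = stats.t.ppf(0.975, n - 2)                      # ~2.306

    # (a) predicting an individual value of Y at X0
    se_indiv = np.sqrt(sigma2_hat * (1 + 1/n + (X0 - X.mean())**2 / np.sum(x**2)))
    # (b) predicting the mean of Y at X0
    se_mean = np.sqrt(sigma2_hat * (1/n + (X0 - X.mean())**2 / np.sum(x**2)))

    print(y_pred - t_crit * se_indiv, y_pred + t_crit * se_indiv)   # approx [4.62, 11.58]
    print(y_pred - t_crit * se_mean, y_pred + t_crit * se_mean)     # approx [6.56, 9.64]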
Notes on interpreting the coefficient of X in simple linear regression

1. Y = α + βX + ε  ⟹  dY = β·dX  ⟹  β = dY/dX = slope.
β is the (average) change in Y resulting from a unit change in X.
2. Y = e^(α + βX + ε)  ⟹  ln Y = α + βX + ε
⟹ d(ln Y) = β·dX  ⟹  (1/Y)·dY = β·dX  ⟹  β = (dY/Y)/dX = relative Δ in Y / absolute Δ in X
⟹ β·(100) = [(dY/Y)·100]/dX = %age Δ in Y / dX  ⟹  %age Δ in Y = β·(100)·dX
β·(100) is the (average) percentage change in Y resulting from a unit change in X.
Notes on interpreting the coefficient of X in simple linear regression

3. e^Y = A·X^β·E  ⟹  Y = α + β·ln X + ε,  where α = ln(A) and ε = ln(E).
β = dY/d(ln X) = dY/[(1/X)·dX] = absolute Δ in Y / relative Δ in X
⟹ dY = β·[(dX/X)·100]/100  ⟹  dY = (0.01·β)·(%age Δ in X)
β·(0.01) is the (average) change in Y resulting from a one-percent change in X.
4. Y = A·X^β·e^ε  ⟹  ln Y = α + β·(ln X) + ε,  where α = ln A.
β = d(ln Y)/d(ln X) = (dY/Y)/(dX/X) = %age Δ in Y / %age Δ in X = elasticity
β is the (average) percentage change in Y resulting from a one-percent change in X.
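As an illustration of case 4 only (the chapter does not actually fit this model), here is a hedged sketch that regresses ln Y on ln X with the example data, so the estimated slope can be read as an elasticity.

    import numpy as np

    # Example data, used here purely to illustrate the log-log (elasticity) interpretation.
    X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
    Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)

    lx, ly = np.log(X), np.log(Y)
    lxd = lx - lx.mean()
    beta_hat = np.sum(lxd * (ly - ly.mean())) / np.sum(lxd**2)   # elasticity estimate
    alpha_hat = ly.mean() - beta_hat * lx.mean()

    print(beta_hat)   # average % change in Y associated with a 1% change in X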