
Properties of the OLS Estimators

Ity Shurtz

Ben-Gurion University of the Negev


Introduction
I b is a k × 1 vector of random variables
I It is useful to study its properties in order to discuss
whether it is a “good” estimator of β or not
I We start by asking whether the OLS estimator is unbiased: whether its
expected value equals its true value
Unbiasedness
I Let A = (X'X)^{-1} X' (a k × n matrix); then b = Ay. Note also
that AX = I_k
I To establish unbiasedness, it would be useful to first derive
the conditional expectation and then use the law of iterated
expectation in order to derive the unconditional expectation -

E[b|X] = E[Ay|X] = A E[y|X] = AXβ = I_k β = β

where the second equality holds because A is a function of X alone,
and the third follows from the assumption E[y|X] = Xβ
I Thus, E[b] = E_X[E[b|X]] = E_X[β] = β
I For any sample of observations X, the OLS estimator has
expectation β. Averaging over the possible values of X, we
find that the unconditional mean of b is β
Unbiasedness
I An alternative derivation of unbiasedness is -

b = (X'X)^{-1} X'y = (X'X)^{-1} X'(Xβ + ε) = β + Aε ⇒
E[b|X] = β + E[Aε|X] = β + A E[ε|X] = β

where the last equality follows from the assumption that
E[ε|X] = 0
I Thus, unbiasedness requires assuming either E[y|X] = Xβ or
E[ε|X] = 0
Variance of b
I Using the assumption V(y|X) = σ² I_n -

V(b|X) = E[(b − β)(b − β)'|X]
= E[Aεε'A'|X] = A E[εε'|X] A'
= (X'X)^{-1} X' σ² I_n X (X'X)^{-1}
= σ² (X'X)^{-1} X'X (X'X)^{-1} = σ² (X'X)^{-1}
Variance of b
I Next, we rely on the Law of Total Variance, which is also
known as the ANOVA (Analysis of Variance) Theorem -
V(Y) = V(E[Y|X]) + E[V(Y|X)]
Intuitively, this result says that the variance of Y can be
decomposed into two parts: (1) the variance of the CEF and (2) the
variance of the residual

V(b) = E_X[V(b|X)] + V_X(E[b|X]) = E_X[σ²(X'X)^{-1}] + V_X[β]
= σ² E[(X'X)^{-1}] + 0 = σ² E[(X'X)^{-1}]

I Note that V(b) is a k × k symmetric matrix, with the j-th
diagonal element being V(b_j) and the (j1, j2) off-diagonal
element being Cov(b_{j1}, b_{j2})
Gauss-Markov Theorem
I Next, we discuss the important Gauss-Markov Theorem, which
justifies the use of OLS over a variety of competing estimators
I We have already partially justified OLS: under our assumptions,
OLS is unbiased
I However, there are many unbiased estimators of β under these
assumptions
I We will show that: under assumptions A1 through A4, the
OLS estimator is the best linear unbiased estimator
(BLUE)
Gauss-Markov Theorem
I We already know what unbiased means...
I What is linear? An estimator is linear iff it can be expressed as
a linear function of the data on the dependent variable:
an estimator of β_j of the form Σ_{i=1}^n a_{ij} y_i
where the a_{ij} can be functions of the sample values of all the
independent variables (the OLS estimators, for example, are linear)
In matrix notation this is A*y, where A* is a k × n function of the
x's
I What is best? Here, best means having the smallest variance
Gauss-Markov Theorem
Theorem
b is the best conditional-on-X linear unbiased estimator (BLUE).
That is, b has the lowest variance among all
conditional-on-X linear unbiased estimators of β.
An alternative way to state this theorem is that if there exists
another conditional-on-X linear unbiased estimator, denoted by b*,
then V(b*|X) − V(b|X) is a positive semidefinite matrix
(An n × n matrix M is said to be positive semidefinite if for any
non-zero n × 1 vector z, z'Mz ≥ 0)
Gauss-Markov Theorem Proof
Proof: Let b* = A*y be a different linear unbiased estimator of β,
where A* is a k × n matrix. E[A*y|X] = E[A*Xβ + A*ε|X] = β.
This holds iff A*X = I, so b* = β + A*ε and
V(b*|X) = σ² A*A*'.
Define D = A* − A; therefore Dy = b* − b.
A*X = AX + DX = I + DX. This implies DX = 0 and therefore
DA' = 0 (verify). Therefore, V(b*|X) = σ² (D + A)(D + A)' =
σ² (X'X)^{-1} + σ² DD' = V(b|X) + σ² DD'. The matrix DD' is
positive semidefinite (a matrix times its own transpose is psd), and this
holds for every X; therefore V(b*|X) ≥ V(b|X)
Corollaries of the Gauss-Markov Theorem
Claim: Let b* ≠ b be a conditional-on-X linear unbiased estimator of β.
Then V(b*_j|X) ≥ V(b_j|X), ∀j = 1, . . . , k
Proof: Pick z = (0, . . . , 0, 1, 0, . . . , 0) with its j-th entry being the
only non-zero entry and use the last statement of the
Gauss-Markov Theorem
I Intuitively, b is preferred over b* element by element: each
element has a (weakly) smaller variance.
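I As a concrete numerical illustration of the theorem (a minimal Mata sketch;
the 4 × 2 design matrix, the weight matrix W, and σ² = 1 below are made-up
assumptions), compare OLS with another linear unbiased estimator, the weighted
estimator (X'WX)^{-1}X'Wy, and check that V(b*|X) − V(b|X) has only
non-negative eigenvalues:

mata:
// illustrative 4x2 design matrix with a constant column (made-up values)
X  = (1, 2 \ 1, 3 \ 1, 5 \ 1, 7)
A  = invsym(X' * X) * X'                 // OLS weights: b = A y
W  = diag((1, 2, 3, 4))                  // arbitrary weights, not the true error variance
As = invsym(X' * W * X) * X' * W         // another linear estimator; As * X = I, so unbiased
sigma2 = 1                               // any positive value works for the comparison
Vb  = sigma2 * A  * A'                   // equals sigma2 * invsym(X'X)
Vbs = sigma2 * As * As'
// eigenvalues of V(b*|X) - V(b|X): all should be >= 0 (up to rounding)
symeigenvalues(makesymmetric(Vbs - Vb))
end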


Fitted Values and Residuals
I The estimated CEF is referred to as the predicted or fitted
value of y:

ŷ = Xb = X(X'X)^{-1}X'y = XAy = Hy

where H = X(X'X)^{-1}X' (an n × n matrix) is referred to as the
predictor matrix or the projection matrix
I The residual is defined as the deviation of y from ŷ -

e = y − ŷ = y − Xb = (I − X(X'X)^{-1}X')y = (I − H)y = My

where M = I − X(X'X)^{-1}X' is the residual-maker matrix:
multiplying it by a vector y gives the OLS residuals of y when
regressed on X, the matrix defining M
Properties of M and H
1. The OLS residuals are orthogonal to all explanatory variables,
since X'My = X'e = 0 (a k × 1 vector - this follows from the normal equations)
2. Furthermore, if X includes a constant term, then the residuals
sum to 0 [just take a look at how the first
element of X'e was derived] Intuition
3. M is symmetric (M = M')
4. M and H are n × n idempotent matrices and are orthogonal
to each other. I.e., HH = H, MM = M and MH = 0 Intuition
5. M and X are orthogonal (MX = 0) - regressing any column of
X on X gives a perfect fit
6. HX = X - the idea is that the fitted values from regressing any
column of X on all regressors are exactly that column
7. (H + M)y = y - the fitted value (Hy) and the residual
(My) sum to the observed value (y)
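I A quick numerical sanity check of properties 3-6 (a minimal Mata sketch; the
small design matrix is a made-up example):

mata:
// small illustrative design matrix with a constant column
X = (1, 2 \ 1, 3 \ 1, 5 \ 1, 7)
H = X * invsym(X' * X) * X'      // projection ("hat") matrix
M = I(rows(X)) - H               // residual-maker matrix
mreldif(M, M')                   // ~0: M is symmetric
mreldif(H * H, H)                // ~0: H is idempotent
mreldif(M * M, M)                // ~0: M is idempotent
max(abs(M * H))                  // ~0: M and H are orthogonal
max(abs(M * X))                  // ~0: MX = 0
mreldif(H * X, X)                // ~0: HX = X
end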
Estimation of V (b|X )
I We already know that V(b|X) = σ²(X'X)^{-1}, but how can
this expression be estimated?
I Recall from the previous lecture that σ² can be estimated using
σ̂² = e'e/n (MOM) or by s² = (n/(n−k)) σ̂² = e'e/(n−k) (OLS).
Is either of them an unbiased estimator of σ²?
I It turns out that s² = (n/(n−k)) σ̂² = e'e/(n−k) is an unbiased estimator
I Proof below (not required)
Estimation of V (b|X )
I Note first (the trace of a square matrix is the sum of the elements on its
main diagonal) that

trace(M) = trace(I_n − X(X'X)^{-1}X')
= trace(I_n) − trace(X(X'X)^{-1}X')
= n − trace(X'X(X'X)^{-1})
= n − trace(I_k) = n − k

where the second equality comes from
trace(A − B) = trace(A) − trace(B)
and the third comes from trace(AB) = trace(BA)
Estimation of V (b|X )
Using M to express the moments of e
I The residual vector, e, is a random vector since it is a
function of y
I The residual expectation -
E[e|X] = E[My|X] = M E[y|X] = M(Xβ) = 0
I The residual variance - V(e|X) = V(My|X) =
M V(y|X) M' = M σ²I_n M' = σ² MM' = σ² M
Estimation of V (b|X )
I Let RSS = e'e, a univariate random variable. Then,

E[e'e|X] = E[trace(e'e)|X] = E[trace(ee')|X]
= trace(E[ee'|X]) = trace(V(e|X)) = trace(σ²M)
= σ² trace(M)

I Altogether, E[e'e|X] = σ²(n − k) ⇒ E[e'e] = σ²(n − k)
I Thus, the MOM estimator satisfies E[σ̂²] = ((n−k)/n) σ², and an
unbiased estimator of σ² is s² = e'e/(n − k)
I Since s² is an unbiased estimator of σ², an unbiased
estimator of the variance matrix of b is V̂(b|X) = s²(X'X)^{-1}
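I A minimal Stata sketch of this estimator, using the built-in auto dataset
purely for illustration: after regress, e(rss) stores e'e and e(df_r) stores
n − k, so s² and s²(X'X)^{-1} can be compared with what regress itself reports
(Root MSE is s, and e(V) is the estimated covariance matrix of b).

sysuse auto, clear
quietly regress price mpg weight
* s^2 = e'e/(n-k): residual sum of squares over residual degrees of freedom
display e(rss) / e(df_r)
* Root MSE reported by regress is s, so its square should match
display e(rmse)^2
* and the reported covariance matrix of b is s^2 (X'X)^{-1}
matrix list e(V)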
Goodness of fit
I Recall that the purpose of a regression is to estimate the
population parameter β and not to “fit the data” or “to
explain the variation in the dependent variable”. Thus, a
regression with a low R² should not be dismissed
I What is the proportion of the total variation of y that can be
accounted for by the regression?
I y = ŷ + e, and by construction ŷ'e = 0.
Recall: ŷ'e = (Hy)'My = y'H'My = y'HMy = 0,
since HM = H(I − H) = H − HH = H − H = 0
Goodness of fit
I Thus, y'y = (ŷ + e)'(ŷ + e) = ŷ'ŷ + e'e = Σ_{i=1}^n ŷ_i² + Σ_{i=1}^n e_i²
I This is known as the sum of squares decomposition - the sum
of squares of the observed values equals the sum of squares
of the fitted values plus the sum of squares of the residuals
I Further, Σ_{i=1}^n y_i = Σ_{i=1}^n ŷ_i + Σ_{i=1}^n e_i, so ȳ equals the
sample mean of the ŷ_i plus ē. Since ē = 0 (when X includes a constant),
Σ_{i=1}^n y_i = Σ_{i=1}^n ŷ_i and ȳ equals the sample mean of the ŷ_i
Goodness of fit
Σ_{i=1}^n (y_i − ȳ)² = Σ_{i=1}^n (ŷ_i − ȳ)² + Σ_{i=1}^n e_i²
(TSS = ESS + RSS)

I This is the analysis of variance decomposition, where the
variation is defined as the sum of squared deviations from the
sample mean
I R² captures the proportion of TSS that is attributed to the
regression - R² = ESS/TSS = 1 − RSS/TSS
Goodness of fit
I R² can be thought of as a measure of the explanatory power
of the regressors in excess of the sample mean of y.
Thus, it will later come in handy for testing hypotheses
about the joint significance of all explanatory
variables
I However, since the point is not fitting the data, but rather
estimating β, nothing in the CRM requires a high R²
Properties of R 2
I In general, 0 ≤ R² ≤ 1
I To account for the fact that R² cannot decrease with k, the
adjusted R², denoted R̄², penalizes models in which k is big relative
to n, allowing the measure either to fall or to rise with k

R̄² = 1 − [Σ_{i=1}^n e_i² / (n − k)] / [Σ_{i=1}^n (y_i − ȳ)² / (n − 1)]
    = 1 − ((n − 1)/(n − k)) (1 − R²)
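I The same stored results from regress let us reproduce R² and R̄² by hand
(again a sketch using the auto dataset as an illustrative choice):

sysuse auto, clear
quietly regress price mpg weight
* R^2 = ESS/TSS = 1 - RSS/TSS
display e(mss) / (e(mss) + e(rss))
display e(r2)                      // should match
* adjusted R^2 = 1 - [RSS/(n-k)] / [TSS/(n-1)]
display 1 - (e(rss)/e(df_r)) / ((e(mss) + e(rss))/(e(N) - 1))
display e(r2_a)                    // should match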
Consistency
I We would like better criteria for judging estimators beyond
unbiasedness
I It is a weak criterion - many estimators are biased
I It is also weak - insensitive to the amount of information available
I The property of consistency offers some help. Intuitively, b is
a consistent estimator of β if the probability that b differs
from β goes to zero as the sample size increases
Intuitively, as the sample size grows the distribution of the
estimator becomes concentrated at the parameter
More intuition - if having more and more data does not
generally get us closer to the parameter value of interest, then
we are using a poor estimation procedure
I We are in the (abstract) world of asymptotic theory
Consistency

Definition
An estimator of β is consistent if its probability limit is β
I In general, the limit of an infinite sequence of scalars is the
number that the sequence is very close to as we add
additional elements to it
I For example, lim_{n→∞} 1/n = 0
I More formally,
Definition
The limit of the sequence {X_n}_{n=1}^∞ is c, denoted lim_{n→∞} X_n = c, if
∀ε > 0, ∃N such that ∀n > N, |X_n − c| < ε

I In words, X_n gets arbitrarily close to c for all n above some N


Probability Limit
I Note however that OLS estimators are random variables
(rather than scalars), so they can, in general, take any value
on the real line, and we do not know for sure that they will be
within any certain distance from some number
I Instead, we can talk about the corresponding sequence of
probabilities (all between 0 and 1). That is, we ask what
happens to the probability that the OLS estimator is within a
δ neighborhood of c as the number of observations increases -
Definition
The probability limit of b_n is c, denoted plim b_n = c, if for any
δ > 0, lim_{n→∞} P(|b_n − c| < δ) = 1
Equivalently, we say that b_n converges in probability to c,
written b_n →p c
I This means that the probability that b_n is arbitrarily close to c
approaches 1 as the sample size grows to infinity
Consistency
I Consistency is a large sample property. Roughly speaking, as
n increases, it is increasingly likely that the values taken by
the random variable bn are close to β
I Consistency means that the mass of the distribution of bn
concentrates around the parameter β
I Next, we will use simulation to examine this property of the OLS
estimator
Simulation
I One useful way to understand the statistical properties of
any estimator, and specifically of the OLS estimator, is
through simulations
I That is -
1. Pick a data generating process (DGP) for the data (i.e.
the (multidimensional) distribution that data are coming
from) and values for the model parameters
2. Simulate the data based on the DGP and the parameter
values
3. Use the simulated data values to estimate the parameters
4. Repeat steps 2-3 many times and explore the distribution
of the parameter estimators
Simulation
I For example
1. Assume that the true population model is
Y_i = 10 + 5 · X_i + ε_i
2. ε_i is normal with mean 0 and standard deviation 9
3. X_i is normal with mean 10 and standard deviation 5

set obs 100


gen e = rnormal(0, 9)
gen x = rnormal(10, 5)
gen y = 10 + 5*x + e
Simulation

tw(hist x) (kdensity x), graphregion(color(white)) legend(off)

[Figure: histogram of x with a kernel density overlay]
Simulation

. reg y x

      Source |       SS           df       MS      Number of obs   =       100
-------------+----------------------------------   F(1, 98)        =    695.78
       Model |  60559.5012         1  60559.5012   Prob > F        =    0.0000
    Residual |  8529.76433        98  87.0384115   R-squared       =    0.8765
-------------+----------------------------------   Adj R-squared   =    0.8753
       Total |  69089.2655        99  697.871369   Root MSE        =    9.3294

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   5.060004   .1918294    26.38   0.000     4.679325    5.440684
       _cons |   8.842219     2.1514     4.11   0.000     4.572836     13.1116
------------------------------------------------------------------------------
Simulation
I Let’s create 1,000 such data sets and estimate the regression
(Monte Carlo)

. capture program drop omitted

. program define omitted, rclass


1. drop _all
2. set obs 100
3. gen e = rnormal(0, 9)
4. gen x = rnormal(10, 5)
5. gen y = 10 + 5*x + e
6. reg y x
  7. end
.
. simulate _b , reps(1000) seed(12345) nodots saving("$lec_5_path/b_100", repla

command: omitted
Simulation
I Let’s create 1,000 such data sets and estimate the regression
(Monte Carlo)

[Figure: histogram of the slope estimates from 1000 datasets with n=100; x-axis: Model estimates of b1]
Simulation
I Let’s create 1,000 such data sets and estimate the regression
(Monte Carlo)

[Figure: histogram of the slope estimates from 1000 datasets with n=1000; x-axis: Model estimates of b1]
Simulation
I Let’s create 1,000 such data sets and estimate the regression
(Monte Carlo)

[Figure: histogram of the slope estimates from 1000 datasets with n=10000; x-axis: Model estimates of b1]
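I The saved simulation results can also be summarized directly. A minimal
sketch, assuming simulate stored the slope estimates under its default name
_b_x in the file saved above: the mean across the 1,000 replications should be
close to the true slope of 5 (unbiasedness), and rerunning the exercise with
larger n shrinks the standard deviation (consistency).

use "$lec_5_path/b_100", clear
* _b_x holds the 1,000 slope estimates from the n=100 datasets
summarize _b_x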
Proving Consistency
I To show consistency, we need to be able to calculate the
sequence of probabilities, which is a difficult task
I However, the figures above demonstrate that as n
increases, the distribution of the estimator becomes
concentrated around its center, and so the probability of being
far from β becomes negligible
I While this is encouraging, it is definitely not a proof...
I The following lemma takes us from a probability limit to the
regular limit world
Lemma
Any unbiased estimator of β whose covariance matrix tends to zero
with the sample size is consistent. That is, let c_n be an estimator
of β s.t. ∀n, E[c_n] = β and also lim_{n→∞} V(c_n) = 0. Then
plim c_n = β
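I A sketch of why the lemma holds, using Chebyshev's inequality, for a scalar
estimator c_n (the vector case is handled element by element) -

P(|c_n − β| ≥ δ) ≤ E[(c_n − β)²]/δ² = V(c_n)/δ² → 0 as n → ∞

where the equality uses unbiasedness (E[c_n] = β), so that
lim_{n→∞} P(|c_n − β| < δ) = 1 for every δ > 0, i.e. plim c_n = β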
Proving Consistency
I Since we know that ∀n, E[b_n] = β, for the lemma to be
used to prove consistency, we need to show
that lim_{n→∞} σ²(X'X)^{-1} = 0
I Note that X'X is a k × k matrix, with its (j, l)-th element
being Σ_{i=1}^n x_{ji} x_{li}
I So, for example, if the model includes an intercept,
then the (1, 1) element of X'X tends to infinity
with n
I In order to use the lemma, we will assume the following
assumption -
Assumption
A5. plim_{n→∞} (X'X/n) = Q, a positive definite k × k matrix

I So Q can be thought of as the matrix of the expected


cross-products between all the regressors
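I To see what Q looks like for the simulated DGP above, a minimal Stata sketch
(illustrative only; the seed and sample sizes are arbitrary choices) computes
X'X/n for two sample sizes with matrix accum, which appends the constant term.
As n grows, the entries settle near E[x²] = 10² + 5² = 125, E[x] = 10, and 1:

clear
set seed 12345
set obs 100000
gen x = rnormal(10, 5)
* X'X/n for n = 100 (matrix accum forms X'X, appending a constant column)
matrix accum XX100 = x in 1/100
matrix XX100 = XX100 / 100
matrix list XX100
* X'X/n for n = 100,000
matrix accum XXall = x
matrix XXall = XXall / 100000
matrix list XXall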
Proving Consistency
I Assumption A5 implies that Q^{-1} exists and is nonzero
I For the case of non-stochastic X (i.e. when the p lim and the
regular limit coincide) -

lim_{n→∞} V(b_n) = lim_{n→∞} σ²(X'X)^{-1}
= lim_{n→∞} (σ²/n) · lim_{n→∞} (X'X/n)^{-1}
= 0 · ( lim_{n→∞} X'X/n )^{-1}
= 0 · Q^{-1} = 0

I Note that b_n is a vector of random variables. The consistency
of a vector is an element-by-element concept. So, consistency
of b_n implies consistency of each element in it. That is,
plim b_{jn} = β_j , j = 1, . . . , k
Recap
I OLS estimators are unbiased
I OLS is BLUE
I OLS estimators are consistent
Intuition
I This is how the data looks -

  y1   x11  x12  . . .  x1j  . . .  x1k
  y2   x21  x22  . . .  x2j  . . .  x2k
  ..   ..   ..          ..          ..
  yi   xi1  xi2  . . .  xij  . . .  xik
  ..   ..   ..          ..          ..
  yn   xn1  xn2  . . .  xnj  . . .  xnk
I When we multiply X' by e, each element of the resulting k × 1 vector
looks like x11 · e1 + x21 · e2 + ... + xn1 · en = 0 (here, for the first column of X)
I If the x's are all ones, then the sum is:
1 · e1 + 1 · e2 + ... + 1 · en = 0
back
Intuition
Note: (A + B)' = A' + B', (AB)' = B'A'
I (I − H)' = I' − (X(X'X)^{-1}X')' = I − ((X'X)^{-1}X')'X' =
I − X(X'X)^{-1}X' = (I − H)
I H is idempotent (when multiplied by itself, it yields itself):
HH = X(X'X)^{-1}X' · X(X'X)^{-1}X' = X(X'X)^{-1}X' = H
I MH = (I − H)H = 0
I MX = (I − X(X'X)^{-1}X')X = IX − X(X'X)^{-1}X'X = 0 back
