Announcements:
Exam 1 has been returned to you. Mean: 41, Median: 40.
Today’s Objectives
Part 2: b is random!
So what are the statistical properties of the LS estimator?
Properties of b rely on our “starting point” assumptions about the
randomness of $\varepsilon$
MVLUE (Gauss-Markov Theorem)
Estimating the variance of b, the LS estimator
Assumptions of the Classical Linear Model
A.1. Linearity
$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \dots + x_{iK}\beta_K + \varepsilon_i$
A.2. Full Rank
No exact linear relationship among any of the explanatory variables.
A.3. Exogeneity of the independent variables.
$E[\varepsilon_i \mid x_{j1}, x_{j2}, \dots, x_{jK}] = 0$ for all $j$. Don’t focus on the word “exogeneity”;
instead know that the conditional mean is zero.
A.4. Homoscedasticity & Nonautocorrelation.
The variance of each $\varepsilon_i$ is $\sigma^2$; $\varepsilon_i$ is not correlated with $\varepsilon_j$ for $i \neq j$.
A.5. Exogenously generated data
X could include random elements, but the uncorrelatedness of X and $\varepsilon$ is
crucial.
A.6. Normal distribution
Each disturbance term is normally distributed: $\varepsilon_i \sim N(0, \sigma^2)$.
NOTE: Highlighted assumptions deal with the error term, $\varepsilon$.
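As a concrete illustration, here is a minimal simulation sketch that generates data satisfying A.1–A.6. All numbers (`n`, `K`, `beta`, `sigma`) are hypothetical choices for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3                      # sample size and number of regressors (illustrative)
beta = np.array([1.0, 2.0, -0.5])  # hypothetical true coefficients
sigma = 1.5                        # hypothetical disturbance standard deviation

# A.2: X has full column rank (a constant plus two non-collinear regressors)
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])

# A.3-A.6: disturbances are iid N(0, sigma^2), generated independently of X
eps = rng.normal(0.0, sigma, size=n)

# A.1: the model is linear in the parameters
y = X @ beta + eps
```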
Properties of b, the LS estimator: Unbiasedness
Let’s re-write b to put it in terms of the disturbance, $\varepsilon$:
$b = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon$
Why? Because we want to know about b’s randomness, and the only information
we have is on $\varepsilon$.
Taking expectations, conditional on X:
Assuming X is nonrandom: $E[b] = \beta + (X'X)^{-1}X'E[\varepsilon]$
Conditional on (random) X: $E[b \mid X] = \beta + (X'X)^{-1}X'E[\varepsilon \mid X]$
By Assumption A.3, the second term is 0.
So $E[b \mid X] = \beta$. Good news: b is unbiased!
The interpretation of this result is that for any particular set of observations, X,
the least squares estimator b has expectation $\beta$.
(If X were random:) If we were to average this result over other possible values
of X, i.e., iterating over X, then the unconditional mean is also $\beta$: $E[b] = E_X[E[b \mid X]] = \beta$.
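A quick Monte Carlo sketch (same hypothetical setup as before) illustrates the result: holding X fixed and averaging b over many simulated disturbance draws recovers $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, reps = 100, 3, 5000
beta = np.array([1.0, 2.0, -0.5])   # hypothetical true coefficients
sigma = 1.5

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
A = np.linalg.solve(X.T @ X, X.T)   # A = (X'X)^{-1} X', so b = Ay

# Hold X fixed, redraw the disturbances, and average the LS estimates
b_draws = np.array([A @ (X @ beta + rng.normal(0, sigma, n))
                    for _ in range(reps)])
print(b_draws.mean(axis=0))         # close to beta = [1.0, 2.0, -0.5]
```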
Properties of b, the LS estimator: linearity
Note: $b = \beta + (X'X)^{-1}X'\varepsilon$
can be rewritten as: $b = \beta + A\varepsilon$, where $A = (X'X)^{-1}X'$.
Therefore, b is a linear function of the disturbance term.
It is therefore a Linear estimator and an Unbiased estimator.
Properties of b, the LS estimator: Variance-Covariance of b
Remember from Part 0 the definition of the variance of a random variable: $\mathrm{Var}[x] = E[(x - \mu)^2]$
For a vector x, $\mathrm{Var}[x] = E[(x - \mu_x)(x - \mu_x)']$
So the variance of b is similar:
$\mathrm{Var}[b \mid X] = E[(b - \beta)(b - \beta)' \mid X]$
Remember, $b - \beta = (X'X)^{-1}X'\varepsilon$,
and $E[\varepsilon\varepsilon' \mid X] = \sigma^2 I$.
Properties of b, the LS estimator: Variance-Covariance
So the variance of b is:
$\mathrm{Var}[b \mid X] = (X'X)^{-1}X'\,E[\varepsilon\varepsilon' \mid X]\,X(X'X)^{-1}$
With $E[\varepsilon\varepsilon' \mid X] = \sigma^2 I$:
$\mathrm{Var}[b \mid X] = \sigma^2 (X'X)^{-1}$
Remarks:
1. $\sigma^2 (X'X)^{-1}$ is read as the “variance-covariance” matrix for b.
2. In general, the greater the variation in X, the smaller $(X'X)^{-1}$ becomes, and b will have
less variance (i.e., be more precise).
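To see this numerically, a small sketch (hypothetical setup as above) compares the analytic covariance $\sigma^2(X'X)^{-1}$ with the empirical covariance of b across simulated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 5000
beta = np.array([1.0, 2.0, -0.5])   # hypothetical true coefficients
sigma = 1.5

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
A = np.linalg.solve(X.T @ X, X.T)   # A = (X'X)^{-1} X'

analytic = sigma**2 * np.linalg.inv(X.T @ X)
b_draws = np.array([A @ (X @ beta + rng.normal(0, sigma, n)) for _ in range(reps)])
empirical = np.cov(b_draws, rowvar=False)

print(np.round(analytic, 5))
print(np.round(empirical, 5))       # the two matrices should be close
```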
Other Linear Unbiased Estimators?
To include the LS estimator and all other linear estimators…
Let $A = (X'X)^{-1}X'$, so that the LS estimator is $b = Ay$.
Let $C = A + D$ be a $K \times n$ matrix that does not depend on y.
Let $b_0 = Cy$ be our new linear estimator.
Note: We trivially recover the LS estimator for the special case $D = 0$.
Taking expectations (for now, let’s ignore conditional expectations)…
$E[b_0] = E[(A + D)(X\beta + \varepsilon)] = \beta + DX\beta$ (using $AX = I$ and $E[\varepsilon] = 0$)
$b_0$ will be biased except in cases when $DX\beta$ is a null vector.
For $b_0$ to be unbiased, $DX = 0$ must hold for all $\beta$.
Implication: Another linear and unbiased estimator is possible, but important
restrictions hold.
With restrictions: $b_0 = Cy = \beta + (A + D)\varepsilon$
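As an illustration, one way to build such a D numerically (a hypothetical construction, not from the slides) is to post-multiply any random matrix by the annihilator $M = I - X(X'X)^{-1}X'$, which forces $DX = 0$; the resulting $b_0 = Cy$ is still unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, reps = 100, 3, 5000
beta = np.array([1.0, 2.0, -0.5])   # hypothetical true coefficients
sigma = 1.5

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
A = np.linalg.solve(X.T @ X, X.T)            # LS weights: A = (X'X)^{-1} X'

# Build D with DX = 0 via the annihilator M = I - X(X'X)^{-1}X'
M = np.eye(n) - X @ A
D = rng.normal(size=(K, n)) @ M              # rows of D are orthogonal to the columns of X
print(np.allclose(D @ X, 0))                 # True: the unbiasedness restriction holds

C = A + D
b0_draws = np.array([C @ (X @ beta + rng.normal(0, sigma, n)) for _ in range(reps)])
print(b0_draws.mean(axis=0))                 # close to beta: b0 is unbiased
```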
Variance of other Linear Unbiased Estimators?
Derive the Cov matrix of $b_0$, a member of the class of linear, unbiased estimators:
i) Unbiasedness requires $DX = 0$, so $b_0 = \beta + (A + D)\varepsilon$
ii) $b_0 - \beta = (A + D)\varepsilon$, so
$\mathrm{Var}[b_0 \mid X] = (A + D)\,E[\varepsilon\varepsilon' \mid X]\,(A + D)' = \sigma^2 (AA' + AD' + DA' + DD')$
But with our unbiasedness restriction, $DX = 0$ (and hence $AD' = (X'X)^{-1}(DX)' = 0$ and $DA' = 0$), so
$\mathrm{Var}[b_0 \mid X] = \sigma^2 (X'X)^{-1} + \sigma^2 DD'$
Variance of other Linear Unbiased Estimators?
Compare the Var-Cov matrix of $b_0$, a linear, unbiased estimator, to the Var-Cov
matrix of b, our unbiased, least squares estimator:
$\mathrm{Var}[b_0 \mid X] = \mathrm{Var}[b \mid X] + \sigma^2 DD'$
Or, $\mathrm{Var}[b_0 \mid X] - \mathrm{Var}[b \mid X] = \sigma^2 DD'$
Variance of other Linear Unbiased Estimators?
What do we know about $DD'$?
Remarks:
1. The diagonal elements of $DD'$ are sums of squares (i.e., $\geq 0$).
2. The variance of each element of b0 is greater than or equal to the variance of
each element of b.
3. $DD'$ is positive semidefinite (or nonnegative definite), so $\mathrm{Var}(b_0)$ exceeds $\mathrm{Var}(b)$ in the matrix sense.
(For any $q \neq 0$, $q'DD'q \geq 0$: letting $z = D'q$, $q'DD'q = z'z \geq 0$.)
4. If the diagonal elements of $\mathrm{Var}(b_0)$ are equal to the diagonal elements of $\mathrm{Var}(b)$, then
the diagonal of $DD'$ must be zero
=> D is a null matrix
=> $b_0 = b$
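A short numerical check (continuing the hypothetical construction above) confirms that $DD'$ is positive semidefinite, so every diagonal element of $\mathrm{Var}(b_0)$ weakly exceeds the corresponding element of $\mathrm{Var}(b)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3
sigma = 1.5

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
A = np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - X @ A
D = rng.normal(size=(K, n)) @ M          # satisfies DX = 0, as before

var_b  = sigma**2 * np.linalg.inv(X.T @ X)
var_b0 = var_b + sigma**2 * (D @ D.T)    # Var(b0) = Var(b) + sigma^2 DD'

print(np.linalg.eigvalsh(D @ D.T).min() >= -1e-10)   # DD' is PSD (up to rounding)
print(np.all(np.diag(var_b0) >= np.diag(var_b)))     # elementwise variance comparison
```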
Gauss-Markov Theorem
For a regressor matrix X,
the LS estimator b is the minimum variance linear unbiased estimator of $\beta$.
So, when dealing with the class of Linear Unbiased Estimators,
the LS estimator is best (best = minimum variance).
MVLUE: Minimum Variance Linear Unbiased Estimator
BLUE: Best Linear Unbiased Estimator (Old terminology)
Estimating the Variance of the LS Estimator
Recall, $\mathrm{Var}[b \mid X] = \sigma^2 (X'X)^{-1}$,
where $\sigma^2$ = unobserved population parameter
$= \mathrm{Var}[\varepsilon_i] = E[\varepsilon_i^2]$
We also know that $e_i$ is an estimate of $\varepsilon_i$.
So by analogy, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n e_i^2$ could be a natural choice! (Let’s investigate.)
In other words, can we find that $E[\hat{\sigma}^2] = \sigma^2$?
LS residuals are $e = My = M\varepsilon$ (since $MX = 0$). Recall $M = I - X(X'X)^{-1}X'$.
An estimator of $\sigma^2$ will be based on the sum of squared residuals (which is a scalar):
$e'e = \varepsilon'M'M\varepsilon = \varepsilon'M\varepsilon$ (since M is symmetric and idempotent)
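A minimal sketch (hypothetical data as before) verifies the residual-maker identities $MX = 0$ and $e = My = M\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3
beta = np.array([1.0, 2.0, -0.5])    # hypothetical true coefficients

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
eps = rng.normal(0, 1.5, n)
y = X @ beta + eps

M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # residual maker
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b                                       # LS residuals

print(np.allclose(M @ X, 0))        # MX = 0
print(np.allclose(e, M @ y))        # e = My
print(np.allclose(e, M @ eps))      # e = M eps
```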
Estimating the Variance of the LS Estimator
Expectations: $E[e'e \mid X] = E[\varepsilon'M\varepsilon \mid X]$
Remarks:
1. $e'e$ and $\varepsilon'M\varepsilon$ are scalars.
2. The trace of a matrix is equal to the sum of its diagonal elements
3. A scalar (a 1×1 matrix) is equal to its own trace! (And the
expected value of a trace is the trace of the expected value.)
4. Short cut. These expectations should actually be conditional on X, but it
doesn’t affect the outcome.
5. tr(ABCD) = tr(BCDA) = tr(CDAB) = tr(DABC)
6. We’ll use these trace results to find $E[e'e \mid X]$.
7. We are hoping to find $E\left[\frac{e'e}{n}\right] = \sigma^2$, or $E[\hat{\sigma}^2] = \sigma^2$. (See the numerical check below.)
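A numerical check of the trace logic (hypothetical setup as before): the cyclic property lets us swap $\varepsilon'M\varepsilon$ for $\mathrm{tr}(M\varepsilon\varepsilon')$, and averaging over many disturbance draws approximates $\sigma^2\,\mathrm{tr}(M) = \sigma^2(n - K)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, reps = 100, 3, 20000
sigma = 1.5

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# tr(M) = n - K
print(np.isclose(np.trace(M), n - K))

# Monte Carlo: E[eps' M eps] = sigma^2 * (n - K)
draws = rng.normal(0, sigma, size=(reps, n))
quad_forms = np.einsum('ri,ij,rj->r', draws, M, draws)   # eps' M eps for each draw
print(quad_forms.mean())    # approx sigma^2 * (n - K) = 2.25 * 97 = 218.25
```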
Estimating the Variance of the LS Estimator
$E[e'e \mid X] = E[\varepsilon'M\varepsilon \mid X] = E[\mathrm{tr}(\varepsilon'M\varepsilon) \mid X] = E[\mathrm{tr}(M\varepsilon\varepsilon') \mid X] = \mathrm{tr}(M\,E[\varepsilon\varepsilon' \mid X]) = \mathrm{tr}(M\sigma^2 I) = \sigma^2\,\mathrm{tr}(M) = \sigma^2 (n - K)$ !!
where $\mathrm{tr}(M) = \mathrm{tr}(I_n) - \mathrm{tr}(X(X'X)^{-1}X') = n - \mathrm{tr}((X'X)^{-1}X'X) = n - \mathrm{tr}(I_K) = n - K$
Estimating the Variance of the LS Estimator
So $\hat{\sigma}^2 = e'e/n$, the natural estimator of $\sigma^2$, is biased: $E[\hat{\sigma}^2 \mid X] = \frac{n-K}{n}\sigma^2$.
Note that the bias becomes smaller as $n \to \infty$.
Bias $= E[\hat{\sigma}^2 \mid X] - \sigma^2 = -\frac{K}{n}\sigma^2$
Consequently, define $s^2 = \frac{e'e}{n - K}$.
So, $s^2$ is an unbiased estimator of $\sigma^2$: $E[s^2 \mid X] = \frac{\sigma^2 (n - K)}{n - K} = \sigma^2$.
With this estimator in hand, we can compute the estimated variance of the LS
estimator: $\widehat{\mathrm{Var}}[b \mid X] = s^2 (X'X)^{-1}$
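Putting it together, a sketch (hypothetical data as before) computes $s^2$ and the estimated standard errors of b:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 3
beta = np.array([1.0, 2.0, -0.5])    # hypothetical true coefficients

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
y = X @ beta + rng.normal(0, 1.5, n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n - K)                       # unbiased estimator of sigma^2
est_var_b = s2 * np.linalg.inv(X.T @ X)      # estimated Var-Cov matrix of b
se_b = np.sqrt(np.diag(est_var_b))           # standard errors, as software reports

print(s2, se_b)
```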
Summary: Estimating the Variance of the LS Estimator
Recall the following logic “chain”:
$\sigma^2$ is an unobserved population parameter, $\sigma^2 = \mathrm{Var}[\varepsilon_i] = E[\varepsilon_i^2]$.
However, our LS residual, $e_i$, is an estimate of $\varepsilon_i$.
So by analogy, $\hat{\sigma}^2 = \frac{1}{n}\sum_i e_i^2$ could be a natural choice (i.e., $e'e/n$).
But $\hat{\sigma}^2$ is biased: $E[\hat{\sigma}^2 \mid X] = \frac{n-K}{n}\sigma^2$, not $\sigma^2$.
We can show that $s^2 = \frac{e'e}{n - K}$ is an unbiased estimator of $\sigma^2$.
Statistical properties (distribution) of $s^2$
If $\sigma^2$ were known, then statistical inference on $\beta_k$ could be based on $z_k = \frac{b_k - \beta_k}{\sqrt{\sigma^2 S^{kk}}}$, where $S^{kk}$ is the kth diagonal element of $(X'X)^{-1}$.
But it’s generally not known. So we use $s^2$ instead of $\sigma^2$.
While we know that $s^2$ is an unbiased estimator for $\sigma^2$, we don’t know its
statistical properties (i.e., we don’t know its distribution).
That means we have another big step to do. If we can rewrite $s^2$ in terms
of $\varepsilon$, then we might be able to find its distribution.
$\frac{(n-K)s^2}{\sigma^2} = \frac{e'e}{\sigma^2} = \left(\frac{\varepsilon}{\sigma}\right)' M \left(\frac{\varepsilon}{\sigma}\right)$
This is a “quadratic form” of a standard normal vector ($\varepsilon/\sigma$).
Therefore, it is Chi-Square with rank(M) degrees of freedom, where rank(M) = trace(M) = $n - K$.
So, $\frac{(n-K)s^2}{\sigma^2} \sim \chi^2(n - K)$.
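A simulation sketch (hypothetical setup; a smaller n makes the chi-square shape visible) checks that draws of $(n-K)s^2/\sigma^2$ match the first two moments of $\chi^2(n-K)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, reps = 30, 3, 20000
sigma = 1.5

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# (n-K) s^2 / sigma^2 = e'e / sigma^2, with e = M eps
draws = rng.normal(0, sigma, size=(reps, n))
stat = np.einsum('ri,ij,rj->r', draws, M, draws) / sigma**2

print(stat.mean(), n - K)          # chi2(n-K) has mean n - K = 27
print(stat.var(), 2 * (n - K))     # and variance 2(n-K) = 54
```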
Independence of b and e (and s2)
Independence of b and e:
Based on our original assumptions of the Classical Linear Regression
Model…
The distribution of $\varepsilon$ conditional on X is $N(0, \sigma^2 I_n)$.
Thus, the distribution of $\varepsilon$ conditional on X does not depend on X.
Both b and e are linear functions of $\varepsilon$: they are jointly normal
conditional on X, but they are also uncorrelated, conditional on X,
since $\mathrm{Cov}[b, e \mid X] = \sigma^2 (X'X)^{-1}X'M = 0$.
In other words, b and e are independently distributed, conditional on X.
Therefore all functions of e (including $s^2$) are also independent of b.
We need b to be independent of $s^2$.
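The uncorrelatedness can be verified directly (hypothetical X as before): the matrix $(X'X)^{-1}X'M$ that drives $\mathrm{Cov}[b, e \mid X]$ is identically zero because $X'M = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
A = np.linalg.solve(X.T @ X, X.T)     # b = Ay
M = np.eye(n) - X @ A                 # e = My

# Cov[b, e | X] = sigma^2 * A M, and A M = 0 because X'M = 0
print(np.allclose(A @ M, 0))          # True
```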
Recap: Distributions
We showed that $b \mid X \sim N(\beta, \sigma^2 (X'X)^{-1})$.
Therefore: $z_k = \frac{b_k - \beta_k}{\sqrt{\sigma^2 S^{kk}}} \sim N(0, 1)$
And $\frac{(n-K)s^2}{\sigma^2} \sim \chi^2(n - K)$
We also noted that b and $s^2$ are independent.
From Part 0 – If we have a ratio of a Standard Normal variable to the square root of
a Chi-squared variable divided by its d.o.f., what is the distribution of the ratio?
A Student’s t:
$t_k = \frac{(b_k - \beta_k)/\sqrt{\sigma^2 S^{kk}}}{\sqrt{\frac{(n-K)s^2/\sigma^2}{n-K}}} = \frac{b_k - \beta_k}{\sqrt{s^2 S^{kk}}} = \frac{b_k - \beta_k}{SE(b_k)}$
$t_k$ follows a t distribution with $n - K$ degrees of freedom.
Hypothesis testing: a t-Test
Step 1: Construct a null hypothesis: $H_0: \beta_k = \beta_k^0$, and then use the t-ratio $t_k = \frac{b_k - \beta_k^0}{SE(b_k)}$
Note 1: A “large” deviation of $t_k$ from 0 is a sign of a failure of the null
hypothesis. The next step specifies how large is too large.
Note 2: If $\beta_k^0 = 0$, then the test is a probability statement on the estimate being statistically
different from zero.
Step 2: Go to a t-table (or a computer program) and look up the entry for $n - K$ degrees
of freedom. Find the critical value, $t_{(1-\alpha)/2}(n - K)$, such that the area in the
t-distribution to the right of $t_{(1-\alpha)/2}$ is $(1-\alpha)/2$.
Prob{$-t_{(1-\alpha)/2}(n - K) < t_k < t_{(1-\alpha)/2}(n - K)$} $= 1 - (1-\alpha) = \alpha$ (Often, $\alpha$ is picked to be 0.95)
[Figure: t distribution with critical values $-t_{(1-\alpha)/2}(n - K)$ and $t_{(1-\alpha)/2}(n - K)$ marked on the axis, a tail area of $(1-\alpha)/2$ in each tail, and $t_k$ shown on the axis.]
Hypothesis testing: a t-Test
Step 3: Fail to reject if $-t_{(1-\alpha)/2}(n - K) < t_k < t_{(1-\alpha)/2}(n - K)$. Reject otherwise.
A convenient feature of the t-test is that the critical value does not depend on X.
Therefore, there is no need to calculate critical values for each sample.
A decision rule based on p-Value:
Step 1: Same as above.
Step 2: Calculate $p = \mathrm{Prob}(t(n - K) > |t_k|)$
Step 3: Accept if $p > 1 - \alpha$. Reject otherwise (if $p < 1 - \alpha$).
I.e., reject if p < 0.05.
Finally, note that a very common test is whether $\beta_k$ is significantly different from
zero.
Thus, $H_0: \beta_k = 0$,
and $t_k = \frac{b_k}{SE(b_k)}$. This is standard output from computer programs.
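A sketch of the full procedure on hypothetical data (scipy is used only for the t critical value; the code reports the usual two-sided p-value, $2\,\mathrm{Prob}(t(n-K) > |t_k|)$, as software does):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, K = 100, 3
beta = np.array([1.0, 2.0, -0.5])       # hypothetical true coefficients

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
y = X @ beta + rng.normal(0, 1.5, n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = (e @ e) / (n - K)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

t_stats = b / se                                   # H0: beta_k = 0 for each k
crit = stats.t.ppf(0.975, df=n - K)                # two-tailed, 95% confidence
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n - K)

print(t_stats, crit, p_vals)
# Reject H0 for coefficient k when |t_k| > crit, i.e., when p_k < 0.05
```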
An Example from our Lab
$gpercap_i = \beta_1\, constant_i + \beta_2\, pg_i + \beta_3\, y_i + \beta_4\, pnc_i + \beta_5\, puc_i + \beta_6\, ppt_i + \beta_7\, pd_i + \beta_8\, pn_i + \beta_9\, ps_i + \beta_{10}\, year_i + \varepsilon_i$
$t_{pg} = ?$
$t_{pg} = -4.61$. Does it also mean $\beta_{pg} \neq 0$?
YES
An Example from our Lab
$t_{pg} = -4.61$
What’s the null hypothesis associated with this?
$H_0: \beta_{pg} = 0$.
$t_{pg} = \frac{b_{pg} - \beta_{pg}^0}{SE(b_{pg})} = \frac{-0.1206771 - 0}{0.0261592} = -4.61$
What’s the conclusion?
$|t_{pg}| = 4.61 > t_{(1-\alpha)/2}(n - K)$ => Reject => $\beta_{pg} \neq 0$ with 95% confidence.
Or, “t-stat” > “Critical value”
$p = \mathrm{Prob}(t(n - K) > |t_{pg}|) = 0.000 < (1 - \alpha) = (1 - .95) = 0.05$ => Reject
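Reproducing the arithmetic (the coefficient and standard error are from the lab output; n = 36 and K = 10 are the values implied by the F(9, 26) statistic on the next slide):

```python
from scipy import stats

b_pg, se_pg = -0.1206771, 0.0261592     # from the lab regression output
n, K = 36, 10                           # implied by the F(9, 26) statistic

t_pg = (b_pg - 0) / se_pg
crit = stats.t.ppf(0.975, df=n - K)
p = 2 * stats.t.sf(abs(t_pg), df=n - K)

print(round(t_pg, 2))    # -4.61
print(abs(t_pg) > crit)  # True: reject H0 at the 95% confidence level
print(round(p, 4))       # approx 0.0001, reported as 0.000
```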
An Example from our Lab
What’s the meaning of “F( 9, 26)” and “Prob > F” ?
Another hypothesis test that we’ll explore:
$H_0: \beta_2 = \beta_3 = \dots = \beta_{10} = 0$.
How many “joint” hypotheses?
Nine. (That is, nine “equal signs”.)
We’ll use our Restricted Least Squares estimator with nine restrictions.
An Example from our Lab
Can you use the output to find s2?
Two ways:
1. $s^2 = \frac{e'e}{n - K} = 0.000451$
2. $(\text{Root MSE})^2 \approx s^2$: $(0.02133)^2 = 0.000455$
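A quick check of the arithmetic (Root MSE is taken from the lab output; the small gap between 0.000451 and 0.000455 is rounding in the reported Root MSE):

```python
root_mse = 0.02133             # from the lab regression output
print(round(root_mse**2, 6))   # 0.000455, close to s^2 = 0.000451
```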