Omitted Variable Bias

1 Part I: The Baseline: SLR.1-4 Hold, and Our Estimates Are Unbiased
Population Model:
y = β0 + β1 x + u
Sample Regression:
ŷi = β̂0 + β̂1 xi
We can use what we know about the population model, plug y = β0 + β1 x + u into our formula for β̂1 and
simplify:
\begin{align*}
\hat{\beta}_1 &= \frac{\sum_i (x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{\sum_i (x_i - \bar{x})x_i} \\
&= \frac{\beta_0 \sum_i (x_i - \bar{x}) + \beta_1 \sum_i (x_i - \bar{x})x_i + \sum_i (x_i - \bar{x})u_i}{\sum_i (x_i - \bar{x})x_i} \\
&= \beta_1 + \frac{\sum_i (x_i - \bar{x})u_i}{\sum_i (x_i - \bar{x})x_i}
\end{align*}
Now, remember that β̂1 is a random variable, so that it has an expected value:
\[
E\left[\hat{\beta}_1\right] = E\left[\beta_1 + \frac{\sum_i (x_i - \bar{x})u_i}{\sum_i (x_i - \bar{x})x_i}\right] = \beta_1 + E\left[\frac{\sum_i (x_i - \bar{x})u_i}{\sum_i (x_i - \bar{x})x_i}\right] = \beta_1
\]
where the last equality follows from SLR.4: E[u|x] = 0 forces the expectation of the second term to zero.
Aha! So under assumptions SLR.1-4, our estimate β̂1 is, on average, equal to the true population
parameter β1 that we were after the whole time.
2 Reality Check: SLR.4 fails, E[u|X] ≠ 0, and our estimates are biased
Population Model:
\[ y = \beta_0 + \beta_1 x + \beta_2 z + u \]
Sample Regression (omitting z):
\[ \hat{y}_i = \hat{\alpha}_0 + \hat{\alpha}_1 x_i \]
What's the OLS formula for α̂1?
\[
\hat{\alpha}_1 = \frac{Cov(x_i, y_i)}{Var(x_i)} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i (x_i - \bar{x})y_i}{\sum_i (x_i - \bar{x})x_i}
\]
(The last step uses the two identities in the Appendix.)
We can use what we know about the population model, plug y = β0 + β1 x + β2 z + u into our formula for α̂1, and simplify:
\begin{align*}
\hat{\alpha}_1 &= \frac{\sum_i (x_i - \bar{x})(\beta_0 + \beta_1 x_i + \beta_2 z_i + u_i)}{\sum_i (x_i - \bar{x})x_i} \\
&= \frac{\beta_0 \sum_i (x_i - \bar{x}) + \beta_1 \sum_i (x_i - \bar{x})x_i + \beta_2 \sum_i (x_i - \bar{x})z_i + \sum_i (x_i - \bar{x})u_i}{\sum_i (x_i - \bar{x})x_i} \\
&= \beta_1 + \beta_2 \frac{\sum_i (x_i - \bar{x})z_i}{\sum_i (x_i - \bar{x})x_i} + \frac{\sum_i (x_i - \bar{x})u_i}{\sum_i (x_i - \bar{x})x_i}
\end{align*}
There's an extra term! The second term, $\beta_2 \frac{\sum_i (x_i - \bar{x})z_i}{\sum_i (x_i - \bar{x})x_i}$, is a result of our omission of the variable z that affects
y. When SLR.1-4 hold, on average our regression estimates will be close to the true parameters. But here,
SLR.1-4 do not hold! If we take the expectation of α̂1:
\begin{align*}
E[\hat{\alpha}_1] &= E\left[\beta_1 + \beta_2 \frac{\sum_i (x_i - \bar{x})z_i}{\sum_i (x_i - \bar{x})x_i} + \frac{\sum_i (x_i - \bar{x})u_i}{\sum_i (x_i - \bar{x})x_i}\right] \\
&= \beta_1 + \beta_2 E\left[\frac{\sum_i (x_i - \bar{x})z_i}{\sum_i (x_i - \bar{x})x_i}\right] + E\left[\frac{\sum_i (x_i - \bar{x})u_i}{\sum_i (x_i - \bar{x})x_i}\right] \\
&= \beta_1 + \beta_2 \rho_1
\end{align*}
where ρ1 is the slope coefficient from a regression of z on x: the ratio in the middle term is exactly the OLS formula for that slope.
If E[α̂1] ≠ β1, then we say α̂1 is biased. What this means is that, on average, our regression estimate is going
to miss the true population parameter by β2 ρ1.
3 An Example: Wages, Gender, and Tenure

Suppose that the group of 526 people in WAGE1.dta is our whole population of interest, so that when we run our regressions, we are actually revealing the
true parameters instead of just estimates. We’re interested in the relationship between wages and gender,
and our “omitted” variable will be tenure (how long the person has been at his/her job). Suppose our
population model is:
\[ \log(wage)_i = \beta_0 + \beta_1 female_i + \beta_2 tenure_i + u_i \tag{1} \]
First let's look at the correlations between our variables and see if we can predict how omitting tenure will
bias our estimate of β1:
. corr lwage female tenure
| lwage female tenure
-------------+---------------------------
lwage | 1.0000
female | -0.3737 1.0000
tenure | 0.3255 -0.1979 1.0000
If we ran the regression:
\[ \log(wage)_i = \alpha_0 + \alpha_1 female_i + e_i \tag{2} \]
...then the information above tells us that α1 < β1: tenure is positively correlated with lwage (0.3255), so β2 > 0, and it is negatively correlated with female (−0.1979), so ρ1 < 0, making the bias β2 ρ1 negative. Let's see if we were right. Below is the Stata output
from running regressions (1) and (2):
. reg lwage female tenure
Source | SS df MS Number of obs = 526
-------------+------------------------------ F( 2, 523) = 67.64
Model | 30.4831298 2 15.2415649 Prob > F = 0.0000
Residual | 117.846622 523 .225328148 R-squared = 0.2055
-------------+------------------------------ Adj R-squared = 0.2025
Total | 148.329751 525 .28253286 Root MSE = .47469
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -.3421323 .042267 -8.09 0.000 -.4251663 -.2590984
tenure | .0192648 .0029255 6.59 0.000 .0135176 .0250119
_cons | 1.688842 .0343675 49.14 0.000 1.621326 1.756357
------------------------------------------------------------------------------
. reg lwage female
Source | SS df MS Number of obs = 526
-------------+------------------------------ F( 1, 524) = 85.04
Model | 20.7120004 1 20.7120004 Prob > F = 0.0000
Residual | 127.617751 524 .243545326 R-squared = 0.1396
-------------+------------------------------ Adj R-squared = 0.1380
Total | 148.329751 525 .28253286 Root MSE = .4935
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -.3972175 .0430732 -9.22 0.000 -.4818349 -.3126001
_cons | 1.81357 .0298136 60.83 0.000 1.755001 1.872139
------------------------------------------------------------------------------
Just to clarify, we “know” that β1 = −.3421323 and α1 = −.3972175.
This means that our BIAS is equal to α1 − β1 = −.3972175 − (−.3421323) = −.0550852.
There’s one more parameter missing from our OVB formula. What regression do we have to run to find its
value?
\[ tenure = \rho_0 + \rho_1 female + v \tag{3} \]
The Stata output from this regression is below:
. reg tenure female
Source | SS df MS Number of obs = 526
-------------+------------------------------ F( 1, 524) = 21.36
Model | 1073.26518 1 1073.26518 Prob > F = 0.0000
Residual | 26327.9839 524 50.244244 R-squared = 0.0392
-------------+------------------------------ Adj R-squared = 0.0373
Total | 27401.249 525 52.1928553 Root MSE = 7.0883
------------------------------------------------------------------------------
tenure | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -2.859373 .618672 -4.62 0.000 -4.074755 -1.643991
_cons | 6.474453 .4282209 15.12 0.000 5.633212 7.315693
------------------------------------------------------------------------------
Just to clarify, our ρ1 = −2.859373.
Now we can plug all of our parameters into the bias formula to check that it in fact gives us the bias from
leaving out tenure from our wage regression:
\begin{align*}
\alpha_1 = E[\hat{\alpha}_1] &= \beta_1 + \beta_2 \rho_1 \\
&= -.3421323 + (.0192648)(-2.859373) \\
&= -.397217549
\end{align*}
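As a quick sanity check, you can hand this arithmetic to Stata's display command, which evaluates to −.39721755, matching the α̂1 from the short regression up to rounding:

. display -.3421323 + .0192648*(-2.859373)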
4 OVB Intuition
For further intuition on omitted variable bias, I like to think of an archer. When MLR.1-4 hold, the archer
is aiming the arrow directly at the center of the target—if he/she misses, it’s due to random fluctuations
in the air that push the arrow around, or maybe imperfections in the arrow that send it a little off course.
When MLR.1-4 do not all hold, like when we have an omitted variable, the archer is no longer aiming at the
center of the target. There are still puffs of air and feather imperfections that send the arrow off course, but
the course wasn’t even the right one to begin with! The arrow (which you should think of as our β̂) misses
the center of the target (which you should think of as our true β) systematically.
To demonstrate this, I did the following (sketched in Stata below):
1. Take a random sample of 150 people out of the 526 that are in WAGE1.dta.
2. Estimate β̂1 using OLS, controlling for tenure, with these 150 people.
3. Estimate α̂1 using OLS (NOT controlling for tenure) with these 150 people.
4. Repeat 6,000 times.
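Here is a minimal Stata sketch of that exercise. The program name ovbsim and the stored-scalar names are mine, and it assumes WAGE1.dta is already loaded in memory:

program define ovbsim, rclass
    preserve
    sample 150, count            // draw 150 of the 526 people at random
    reg lwage female tenure      // long regression: controls for tenure
    return scalar betahat_1 = _b[female]
    reg lwage female             // short regression: omits tenure
    return scalar alphahat_1 = _b[female]
    restore
end

simulate betahat_1=r(betahat_1) alphahat_1=r(alphahat_1), reps(6000): ovbsim
twoway (kdensity alphahat_1) (kdensity betahat_1)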
At the end of all of the above, I end up with 6,000 biased estimates (the α̂1 s) and 6,000 unbiased estimates
(the β̂1 s) of the same parameter β1. I plotted the kernel density of the biased estimates alongside that of the
unbiased estimates. You can see how the biased distribution is shifted to the left, indicating a downward bias!
Figure 1. Kernel densities for the biased (alphahat_1) and unbiased (betahat_1) estimates.
Traffic fatalities and primary seatbelt laws. Using data from Anderson (2008) for 49 US states, we can
examine how primary seatbelt laws (an officer can pull you over just for not wearing your seatbelt) impact
annual traffic fatalities. From the paper, I have data on the number of traffic fatalities in 2000, whether or
not the state had a primary seatbelt law in place, and the total population of the state. In 2000, just 35% of
the 49 states had primary seatbelt laws (the rest had what's called a secondary seatbelt law). Suppose we
run the following regression:
\[ \widehat{fatalities} = \hat{\beta}_0 + \hat{\beta}_1\, pop + \hat{\beta}_2\, primary \]
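Assuming the variables in the dataset are named fatalities, pop, and primary (the names are mine), the corresponding Stata command would be:

. reg fatalities pop primary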
1. Think of another variable or factor that you think affects traffic fatalities:
(Clearly, even this specification with controls for weather has some issues: does an additional inch of snow
per year really decrease predicted fatalities by 579.91 lives?)
5 Confidence Intervals
The simulation shown in Section 4 demonstrates something pretty profound: even after designing a
random sample, collecting the data, figuring out the population model, and running regressions, there's
still a chance your estimates are very far from the true population values. Each random sample
yields a different estimate; if you have 100 random samples, you have 100 different values of β̂1. What can
you do with them? Confidence intervals use the randomness of our sample estimates to say something useful
about where the true population parameter actually is.
You can think of confidence intervals in two different ways:
1. We can think of a confidence interval as a bound for how wrong our sample estimate is. For example,
if a political poll finds that a proposition will receive 53.2% of the vote, we come to very different
conclusions if the “margin of error” is .5% or 5%.
2. Alternatively, we can think of a confidence interval as a measure of where the true, population value
is likely to be. (The wording here is a little misleading, as you’ll see in a bit.) For example, if the
true average wage for US laborers is $7, then it’s unlikely that we’d find a confidence interval from
our sample like [10,14].
The basics
We can think of a sample mean x̄ the same way we think about our β̂s: these are both random variables.
We know even more about x̄ from the Central Limit Theorem: for a random sample of a variable {x1, ..., xN},
the Central Limit Theorem tells us that for very large samples (large N), the sample average $\bar{x} \sim N(\mu, \sigma_{\bar{x}}^2)$.
What this means: if I took 10,000 different random samples of laborers in the US and recorded their wage, I
would end up with 10,000 different sample means {x̄1 , x̄2 , ..., x̄10,000 }. If I plotted a histogram of all of these
sample means, it would look very much like a normal distribution and the center of the distribution would
be very close to the true average wage, µ, in the US. Because it’s easy to get confused when we’re talking
about a random variable X and another random variable x̄, which is the sample mean of X, here’s a table
to keep things straight:
Population                              Sample
Mean of x: µX                           Sample mean: x̄ = (1/n) Σi xi, and E[x̄] = µX
Variance of x: Var(x) or σ²x            Sample variance of x: s² = (1/(n−1)) Σi (xi − x̄)², and E[s²] = σ²x
Variance of x̄: Var(x̄) or σ²x̄          Sample variance of x̄: s²x̄ = s²/n, and E[s²x̄] = σ²x̄
Normal distributions are tricky to work with, and it's easier to standardize normally distributed variables so
that they have a mean of 0 and a variance of 1. Remember our formulas for the expected value and variance
of a transformed variable... If v is normally distributed with expected value E[v] and Var(v) = σ²v:
\[
E\left[\frac{v - E[v]}{\sigma_v}\right] = 0, \qquad Var\left(\frac{v - E[v]}{\sigma_v}\right) = 1
\]
Since we're interested in the distribution of x̄ (which is normal), we can standardize it just like above so
that:
\[ \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \sim N(0, 1) \]
Now we can use what we know about the distribution of standard normal variables to help us say something
meaningful about what the true population mean, µX, might be:
- We know that for any standard normal variable v, Pr(−1.96 < v < 1.96) = 95%.
- We know that (x̄ − µX)/σx̄ is standard normal.
- If we draw a number z from a standard normal distribution, then we know Pr(−1.96 < z < 1.96) = 95%.
- Above we argued that (x̄ − µ)/σx̄ ∼ N(0, 1), which means that Pr(−1.96 < (x̄ − µ)/σx̄ < 1.96) = 95%.
- Just like in lecture, we can rearrange terms (worked out below) to see that Pr(x̄ − 1.96 σx̄ < µ < x̄ + 1.96 σx̄) = 95%.
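Spelled out, the rearrangement in the last bullet is just algebra inside the probability statement:
\begin{align*}
&Pr\left(-1.96 < \frac{\bar{x}-\mu}{\sigma_{\bar{x}}} < 1.96\right) = 0.95 \\
\Longleftrightarrow\ &Pr\left(-1.96\,\sigma_{\bar{x}} < \bar{x}-\mu < 1.96\,\sigma_{\bar{x}}\right) = 0.95 && \text{(multiply through by } \sigma_{\bar{x}}\text{)} \\
\Longleftrightarrow\ &Pr\left(\bar{x}-1.96\,\sigma_{\bar{x}} < \mu < \bar{x}+1.96\,\sigma_{\bar{x}}\right) = 0.95 && \text{(subtract } \bar{x}\text{, flip the signs)}
\end{align*}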
The most important thing to remember about a confidence interval is that the interval is
what's random, not the population parameter.
To make another metaphor out of an archaic sport, I like to think of confidence intervals in the context of a
game of horseshoes: the stake (the true parameter) stays fixed in the ground, and the horseshoes (the intervals
we build from each sample) are what fly around and sometimes miss it.
As you could guess from the table, in practice we do not know what σx̄ is, and we have to estimate it
using sample data with sx̄ = s/√n, which we call the standard error. Using the standard error (which is itself a
random variable) changes the distribution of the sample mean a little bit, and we have to use a Student's t
distribution instead of a normal distribution:
\[ \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1} \]
Step 1. Determine the confidence level.
If we want to be 95% confident that our interval covers the true population parameter, then our confidence
level is 0.95. Pretty straightforward.
Step 2. Compute your estimates of x̄ and s.
Step 3. Find c from the t-table.
The value of c will depend on both the sample size (n) and the confidence level (always use the 2-Tailed
row for confidence intervals).
Step 4. Construct the interval: x̄ ± c · sx̄, i.e., [x̄ − c · sx̄, x̄ + c · sx̄].
Example. I took a random sample of 121 UCB students' heights in inches, and found that x̄ = 65 and
s² = 4. Following the 4 steps above, I can find the 95% confidence interval for the true average height µ:
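Worked out (the critical value c ≈ 1.98 is the 2-tailed, 95% entry of the t-table with n − 1 = 120 degrees of freedom):
\[
s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \sqrt{\frac{4}{121}} = \frac{2}{11} \approx 0.18, \qquad
\bar{x} \pm c\,s_{\bar{x}} = 65 \pm 1.98 \times 0.18 \approx [64.64,\ 65.36]
\]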
Practice
You have a random sample of housing prices in the Bay Area. After loading the data into Stata, you look
at summary statistics for the prices you observed:
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
price | 88 293.546 102.7134 111 725
Find a 99% confidence interval for the true average housing price:
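If you want to check your by-hand answer, Stata will compute this interval directly (the command below is for newer versions of Stata; older versions use ci price, level(99)):

. ci means price, level(99)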
6 Appendix
Two facts used in the discussion of omitted variable bias:
\begin{align*}
\sum_i (x_i - \bar{x})(y_i - \bar{y}) &= \sum_i (x_i - \bar{x})y_i - \sum_i (x_i - \bar{x})\bar{y} \\
&= \sum_i (x_i - \bar{x})y_i - \bar{y}\sum_i (x_i - \bar{x}) \\
&= \sum_i (x_i - \bar{x})y_i - \bar{y}\cdot 0 = \sum_i (x_i - \bar{x})y_i
\end{align*}
Now replace every yi in what's above with an xi, and every ȳ with an x̄, and you can see that the same steps show
that $\sum_i (x_i - \bar{x})^2 = \sum_i (x_i - \bar{x})x_i$.