Econ 535, Applied Econometrics SOLUTIONS - Problem set 2 Due
February 26, 7pm
Part 1: Political views
Below is a subset of the data from the in-class experiment. There are two variables
– Politics_Economic (the answer to the question of how
economically liberal (1) or conservative (10) a person is), and Politics_Social
(corresponding data from the question on how socially liberal or conservative a
person is). This sample is, obviously, very small. The purpose of this exercise is to
let you investigate the mechanics of OLS regression. First you will calculate the
regression “by hand” using the formulas from class, and then you will use Stata to
confirm your calculations.
The hypothesis is that the dependent variable, Politics_Economic (Y), is a function
of the independent variable, Politics_Social (X).
ID Politics_Social Politics_Economic
1 4 3
2 8 8
3 2 4
4 3 3
5 9 7
6 4 8
7 5 5
1. Using the appropriate formulas (given in the Appendix at the end of this
assignment), show how to calculate each of the following. (Note: “show how to
calculate” means (1) write the appropriate formula; (2) plug in the appropriate
values; and (3) show the computed answer. You do not need to show the
intermediate calculations between steps 2 and 3.)
a) (1 point) $\hat{\beta}_1$, the estimated slope coefficient from the regression $Y = \beta_0 + \beta_1 X + u$.
Answer: $\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = 0.575$
b) (2 points) $\hat{\beta}_0$, the estimated intercept term from the same regression. Can this
estimated intercept be meaningfully interpreted? Why or why not?
Answer: $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 2.554$
This estimated intercept cannot be meaningfully interpreted. If we tried to
interpret it, the intercept would be the estimated value of Politics_Economic when
Politics_Social is 0. However, Politics_Social cannot have a value of zero, so the
intercept cannot be meaningfully interpreted.
c) (1 point) $\hat{Y}_i$, the predicted values of Politics_Economic for each of the seven
individuals.
The predicted values are given by $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, where Politics_Economic is Y and
Politics_Social is X. See results in the table below.
ID Predicted value of Politics_Economic
1 4.85
2 7.15
3 3.70
4 4.28
5 7.73
6 4.85
7 5.43
d) (1 point) $\hat{u}_i$, the OLS residual for each of the seven individuals.
The OLS residuals are given by $\hat{u}_i = Y_i - \hat{Y}_i$. See results in the table below.
ID OLS Residual
1 -1.85
2 0.85
3 0.30
4 -1.28
5 -0.73
6 3.15
7 -0.43
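The hand calculations in question 1 can be cross-checked with a short script. This is a minimal sketch in Python (not part of the assignment, which uses Stata); the variable names are illustrative, and the data are the seven observations from the table above.

```python
# Data from the table in Part 1: X = Politics_Social, Y = Politics_Economic.
x = [4, 8, 2, 3, 9, 4, 5]
y = [3, 8, 4, 3, 7, 8, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope: sum of cross-deviations divided by sum of squared deviations of X.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = sxy / sxx

# Intercept: mean of Y minus slope times mean of X.
beta0_hat = y_bar - beta1_hat * x_bar

# Predicted values and OLS residuals for each individual.
y_hat = [beta0_hat + beta1_hat * xi for xi in x]
u_hat = [yi - yhi for yi, yhi in zip(y, y_hat)]

print(f"beta1_hat = {beta1_hat:.3f}, beta0_hat = {beta0_hat:.3f}")
for i, (yh, uh) in enumerate(zip(y_hat, u_hat), start=1):
    print(f"ID {i}: predicted = {yh:.2f}, residual = {uh:.2f}")
```

Because the regression includes an intercept, the residuals should sum to zero up to floating-point error, which is a quick way to catch arithmetic slips.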
2. (1 point) In a concise paragraph, drawing on the numbers you calculated
above, describe the relationship between Politics_Economic and Politics_Social as
precisely as you can. Indicate the direction and magnitude of the relationship
based on this sample of seven individuals.
In this sample, an increase in Politics_Social by 1 is associated with an increase in
Politics_Economic of 0.575. However, the magnitude of this increase is relatively
low considering that the average value of Politics_Economic is 5.43. We have no
evidence that changes in Politics_Social cause changes in Politics_Economic.
3. Now you will reproduce your results using Stata. Open Stata and type “edit”,
which brings up something that looks like a spreadsheet. Enter the values for
Politics_Economic and Politics_Social in the first two columns. Double-click the
column headers and enter variable names (“Politics_Economic”, “Politics_Social”).
Close the editor window when you are done. Use the command “list” to make sure
that you have typed in the numbers correctly, and use the command “sum” to
inspect the variable means.
a) (1 point) Run a regression with “Politics_Economic” as the dependent (left-
hand-side) variable and “Politics_Social” as the independent (right-hand-side)
variable. Use robust standard errors. Remember that the first variable you list
after the regress command is the dependent variable “Y”, and the second variable
is the independent variable “X.” Report the output and find and label $\hat{\beta}_1$ and $\hat{\beta}_0$.
These should match what you calculated above.
. reg politics_economic politics_social, r
Linear regression Number of obs = 7
F(1, 5) = 10.61
Prob > F = 0.0225
R-squared = 0.4451
Root MSE = 1.816
---------------------------------------------------------------------------------
| Robust
politics_econ~c | Coefficient std. err. t P>|t| [95% conf. interval]
----------------+----------------------------------------------------------------
politics_social | .575 .1765099 3.26 0.023 .1212668 1.028733
_cons | 2.553571 1.371069 1.86 0.122 -.9708724 6.078015
---------------------------------------------------------------------------------
In the output above, $\hat{\beta}_1$ is the coefficient on politics_social (0.575), and $\hat{\beta}_0$ is the coefficient on _cons (2.553571).
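The robust standard error, t-statistic, R-squared and Root MSE in the output can also be reproduced by hand. The sketch below is in Python and assumes that Stata's robust option applies the heteroskedasticity-robust variance with the n/(n-k) small-sample correction (often called HC1); that assumption is consistent with the reported standard error of 0.1765. Variable names are illustrative.

```python
import math

# Data from Part 1: X = Politics_Social, Y = Politics_Economic.
x = [4, 8, 2, 3, 9, 4, 5]
y = [3, 8, 4, 3, 7, 8, 5]
n = len(x)
k = 2  # number of estimated parameters (intercept and slope)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar
u_hat = [yi - beta0_hat - beta1_hat * xi for xi, yi in zip(x, y)]

# Robust variance of the slope (assumed HC1 form):
# n/(n-k) * sum((X_i - X_bar)^2 * u_i^2) / (sum((X_i - X_bar)^2))^2
meat = sum(((xi - x_bar) ** 2) * (ui ** 2) for xi, ui in zip(x, u_hat))
se_beta1 = math.sqrt(n / (n - k) * meat / sxx ** 2)
t_stat = beta1_hat / se_beta1

# Goodness of fit.
ssr = sum(ui ** 2 for ui in u_hat)
tss = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ssr / tss
root_mse = math.sqrt(ssr / (n - k))

print(f"robust se = {se_beta1:.4f}, t = {t_stat:.2f}")
print(f"R-squared = {r_squared:.4f}, Root MSE = {root_mse:.3f}")
```

With only 5 degrees of freedom, the p-value comes from a t distribution rather than the normal, so the usual 1.96 cutoff does not apply directly.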
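b) In the regression that you just ran, how come the p-value for $\hat{\beta}_1$ is larger than
0.05 even though the absolute value of the t-statistic is larger than 1.96?
Not applicable: in the output above, the p-value for $\hat{\beta}_1$ is 0.023, which is smaller than 0.05.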
c) (1 point) Write a policy conclusion based on the results that you found. Does
this provide evidence that having more/less conservative views on social issues
leads to having more/less conservative views on economic issues?
The most important thing to note here is that our findings here cannot be
interpreted as showing a causal relationship (or a lack of a causal relationship).
The policy conclusions that can be drawn are hence very limited.
Part 2: Happiness and Socioeconomic Status
The General Social Survey (GSS) was launched in 1972 by the National Opinion
Research Center (NORC) and completed its 26th round in 2006. The GSS is one of
the most frequently analyzed sources of information in the social sciences. Here you
will use a subset of GSS data from the 2000 survey to explore the relationship
between some simple demographic variables, socioeconomic status and self-
reported happiness. For this question you will use the data set for Problem Set 2.
The main variables you will be using in this assignment are listed below.
Variable    Description
divorce     Dummy, =1 if respondent is divorced
married     Dummy, =1 if respondent is married
female      Dummy, =1 if respondent is female
cohort      Year of birth of respondent
educ        Years of education
sei         “Socioeconomic indicator” (a value from 0-100; a greater value indicates higher socioeconomic status)
working     Dummy, =1 if respondent is currently employed
retired     Dummy, =1 if respondent is currently retired
unemp       Dummy, =1 if respondent is currently unemployed
vhappy      Dummy, =1 if respondent reported that he/she was “very happy”
phappy      Dummy, =1 if respondent reported that he/she was “pretty happy”
nothappy    Dummy, =1 if respondent reported that he/she was “not happy”
1. (1 point) Create summary statistics for the above variables, and present a table
summarizing the main findings. The goal of this table is to provide the background
you would like the reader to have before you present any additional analysis.
. sum
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
educ | 374 14.06417 2.875644 3 20
sei | 374 53.2139 19.46692 17.1 97.2
cohort | 374 1958.914 12.8886 1922 1982
female | 374 .5294118 .4998028 0 1
working | 374 .6336898 .482441 0 1
-------------+---------------------------------------------------------
retired | 374 .0534759 .225282 0 1
unemp | 374 .0213904 .1448756 0 1
divorce | 374 .131016 .3378699 0 1
married | 374 .4705882 .4998028 0 1
vhappy | 374 .2834225 .4512634 0 1
-------------+---------------------------------------------------------
phappy | 374 .5534759 .4977981 0 1
nothappy | 374 .1256684 .3319194 0 1
2. (2 points) Create a variable called “age” that is equal to the respondent’s age
in 2000. Report the average, maximum and minimum ages in the sample.
. gen age=2000-cohort
. sum age
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
age | 374 41.08556 12.8886 18 78
The average age is 41.09 years. The maximum age is 78 years, and the minimum
age is 18 years.
3. (1 point) In this sample, who are more likely to report being “very happy”:
working people, or unemployed people? Report averages for both populations.
. sum vhappy if working==1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
vhappy | 237 .2995781 .4590427 0 1
. sum vhappy if unemp==1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
vhappy | 8 .125 .3535534 0 1
In this sample, 29.96% of working people report being very happy, while 12.5% of
unemployed people report being very happy. Therefore, in this sample, working
people are more likely to report being very happy.
4. (1 point) What is the employment status of the youngest person in the sample?
Report their age and employment status.
To do this, you can type the following commands:
sort age
list age retired working unemp
The youngest person is 18 years old. S/he is not retired, not working and not
unemployed. (This person is probably a student.)
5. (2 points) Say we are interested in the relationship between happiness and
whether or not a person is working. What sign would you expect the correlation
between the two dummy variables “working” and “vhappy” to have? What about
the correlation between happiness and socioeconomic status – i.e. between
“vhappy” and “sei”? Report these correlation coefficients. State your expectations
with 1-2 sentences of explanation and then report the actual values.
One would expect the correlation between “working” and “vhappy” to be positive
and the correlation between “sei” and “vhappy” to be positive as well. This is also
what the Stata output shows us:
. corr working vhappy
(obs=374)
| working vhappy
-------------+------------------
working | 1.0000
vhappy | 0.0472 1.0000
. corr sei vhappy
(obs=374)
| sei vhappy
-------------+------------------
sei | 1.0000
vhappy | 0.1235 1.0000
6. (2 points) Assume that the variable “sei,” which provides a rough measure of a
respondent’s socioeconomic status (SES), is a good approximation of that
respondent’s percentile SES ranking in US society. For example, a respondent with
sei = 76 is in the top 25th percentile (or top quarter) of American society in terms
of SES. Say we are interested in the relationship between socioeconomic status
and marital status. Regress “sei” on “married” and report the results (use robust
standard errors). What is the interpretation and level of significance of the
coefficient you estimate?
. reg sei married, r
Linear regression Number of obs = 374
F(1, 372) = 5.98
Prob > F = 0.0150
R-squared = 0.0158
Root MSE = 19.339
------------------------------------------------------------------------------
| Robust
sei | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
married | 4.892361 2.001008 2.44 0.015 .957655 8.827067
_cons | 50.91162 1.387547 36.69 0.000 48.1832 53.64004
The coefficient on married is about 4.89 and is significant at the 5% level (p = 0.015). This
indicates that in this sample, the average married person has a socioeconomic
status (sei) that is 4.89 points higher than the average unmarried person. Under the
assumption that sei approximates a percentile ranking, this is a gap of about 4.89 percentile points.
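When the only regressor is a dummy, the intercept is the mean of the outcome for the group with the dummy equal to 0, and the intercept plus the coefficient is the mean for the group with the dummy equal to 1. The sketch below (Python, illustrative variable names) uses only numbers copied from the Stata output above and checks that the two implied group means average back to the overall sample mean of sei.

```python
# Values copied from the Stata output in Part 2.
cons = 50.91162            # _cons: mean sei among respondents with married == 0
coef_married = 4.892361    # coefficient on married
share_married = 0.4705882  # sample mean of the married dummy
mean_sei = 53.2139         # overall sample mean of sei

mean_unmarried = cons
mean_married = cons + coef_married

# Weighted average of the two group means, weighted by group shares.
implied_overall = share_married * mean_married + (1 - share_married) * mean_unmarried

print(f"mean sei, unmarried: {mean_unmarried:.2f}")
print(f"mean sei, married:   {mean_married:.2f}")
print(f"implied overall mean: {implied_overall:.4f} (reported: {mean_sei})")
```

The robust option changes only the standard errors, not the coefficients, so this check does not depend on it.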
7. (3 points) Now regress “sei” on “married” and on “educ” - What happens to the
coefficient on married? What is the sign of the omitted variable bias implied by the
second regression? What does this change tell you about the correlation between
“married” and “educ”?
. reg sei married educ, r
Linear regression Number of obs = 374
F(2, 371) = 71.52
Prob > F = 0.0000
R-squared = 0.3561
Root MSE = 15.662
------------------------------------------------------------------------------
| Robust
sei | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
married | 2.776367 1.695161 1.64 0.102 -.5569614 6.109695
educ | 3.96655 .3854941 10.29 0.000 3.208523 4.724577
_cons | -3.87886 5.397741 -0.72 0.473 -14.49286 6.735144
As discussed in class, the sign of the bias of $\hat{\beta}_1$ when omitting $X_2$ in the estimation
of $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$ can be summarized in the table below.
                 Corr($X_1$, $X_2$) > 0    Corr($X_1$, $X_2$) < 0
$\beta_2 > 0$              +                         -
$\beta_2 < 0$              -                         +
In this particular example, the coefficient on “married” decreases from 4.89 to
2.78, so the omitted variable bias is positive. The sign of the omitted variable bias
in the first regression is positive because “married” and “educ” are positively
correlated, and “educ” is positively related to “sei”. Looking at the correlation
between married and educ confirms this:
| married educ
-------------+------------------
married | 1.0000
educ | 0.0927 1.0000
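The size of the change in the coefficient on married can be checked against the regression output. In-sample, the coefficient from the short regression equals the coefficient from the long regression plus the coefficient on educ times the slope from regressing educ on married. The sketch below (Python, illustrative names) builds that slope from the reported correlation and standard deviations; because the correlation is rounded to four decimals, the match is approximate.

```python
# Values copied from the Stata output in Part 2.
short_married = 4.892361    # coefficient on married, sei regressed on married only
long_married = 2.776367     # coefficient on married, controlling for educ
long_educ = 3.96655         # coefficient on educ in the long regression
corr_married_educ = 0.0927  # reported correlation (rounded)
sd_married = 0.4998028      # std. dev. of married from the summary table
sd_educ = 2.875644          # std. dev. of educ from the summary table

# Slope from a regression of educ on married: corr * sd(educ) / sd(married).
delta = corr_married_educ * sd_educ / sd_married

# Implied omitted variable bias and implied short-regression coefficient.
implied_bias = long_educ * delta
implied_short = long_married + implied_bias

print(f"implied bias = {implied_bias:.3f}")
print(f"implied short coefficient = {implied_short:.3f} (reported: {short_married:.3f})")
```

The implied bias is positive because both the coefficient on educ and the correlation between married and educ are positive, matching the top-left cell of the table.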
Part 3 – Omitted Variable Bias
As discussed in class, the sign of the bias of $\hat{\beta}_1$ when omitting $X_2$ in the estimation
of
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$
can be summarized in the table below.
                 Corr($X_1$, $X_2$) > 0    Corr($X_1$, $X_2$) < 0
$\beta_2 > 0$              +                         -
$\beta_2 < 0$              -                         +
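The sign pattern in the table follows from the standard omitted variable bias result. As a sketch, assume $u$ is uncorrelated with $X_1$ and $X_2$, and let $\hat{\alpha}_1$ denote the slope from the regression of $Y$ on $X_1$ alone (the $\alpha$ notation is an assumption borrowed from the suggested solutions). Then, in large samples:

```latex
\hat{\alpha}_1 \;\xrightarrow{p}\; \beta_1 + \beta_2 \,
  \frac{\operatorname{Cov}(X_1, X_2)}{\operatorname{Var}(X_1)}
\qquad\Longrightarrow\qquad
\operatorname{sign}(\text{bias})
  = \operatorname{sign}(\beta_2) \times \operatorname{sign}\bigl(\operatorname{Corr}(X_1, X_2)\bigr)
```

Since $\operatorname{Var}(X_1) > 0$, the bias is positive when $\beta_2$ and the correlation have the same sign and negative when they have opposite signs.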
This exercise will help you work through the cases above by illustrating them with
examples of your own. For each case, follow the steps below (i.e. you need to
work through four different cases!). Assume the regression equation you would
like to estimate is $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$, but for some reason (lack of data or
knowledge, for example), you end up omitting $X_2$ from the regression. Feel free to
draw on examples from class but do not copy directly.
1. (1 point) Describe the hypothetical variables $Y$, $X_1$, and $X_2$.
2. (1 point) Indicate the sign of the correlation between $X_1$ and $X_2$, and explain
why you would expect such a sign.
3. (1 point) Indicate the sign of $\beta_2$, and explain why you would expect such a sign.
4. (1 point) Indicate the sign of the bias if you were to omit $X_2$ from the
regression.
5. (1 point) Indicate how your estimate of $\beta_1$ is likely to change when omitting $X_2$
from the regression, i.e. will it get larger or smaller relative to when you estimate
the full regression (with both $X_1$ and $X_2$ as explanatory variables)? Explain your
reasoning both in technical terms (using your answers to the previous
sub-questions) and in terms a policymaker can understand.
Suggested solutions:
Case 1: $\beta_2 > 0$, Corr($X_1$, $X_2$) > 0
Suppose we would like to estimate the following equation:
$\text{wage} = \beta_0 + \beta_1 \text{educ} + \beta_2 \text{faminc} + u$
(1) So:
Y = wage = wage earned (measured in dollars per hour)
X1 = educ = completed education (measured in years of schooling)
X2 = faminc = family income (measured in thousands of dollars)
(2) Corr(educ, faminc) is probably positive since individuals with high family
income are more likely to be able to afford more education than individuals with
low family income.
(3) $\beta_2$ is probably positive because individuals with high family income may have
certain characteristics (such as good family connections) that may allow them to
get better-paying jobs more easily than individuals with low family income.
(4) Since $\beta_2 > 0$ and Corr(educ, faminc) > 0, then bias > 0 (positive bias). This means
that if we were to estimate:
$\text{wage} = \alpha_0 + \alpha_1 \text{educ} + v$
then the expected value of $\hat{\alpha}_1$ will be more positive than $\beta_1$, i.e. $E[\hat{\alpha}_1] > \beta_1$.
(5) Given that $\beta_1$ is likely to be positive (since individuals with high levels of
schooling are more likely to get high wages than individuals with low levels of
schooling), by omitting faminc from the regression we are likely to overestimate
the effect of education on wages. We would be attributing entirely to education
what is partly due to family income.
Case 2: $\beta_2 < 0$, Corr($X_1$, $X_2$) > 0
Suppose we would like to estimate the following equation:
$\text{birthweight} = \beta_0 + \beta_1 \text{cigs} + \beta_2 \text{alcohol} + u$
(1) So:
Y = birthweight = weight at birth of a newborn baby (often thought of
as a measure of a baby’s health/development)
X1 = cigs = number of cigarettes mother smoked during pregnancy
X2 = alcohol = number of alcoholic drinks mother drank during
pregnancy
(2) Corr(cigs, alcohol) is probably positive since mothers who smoked during
pregnancy were probably more likely to engage in other behavior that is
potentially dangerous to the child, such as drinking.
(3) $\beta_2$ is probably negative given that doctors advise women not to drink frequently
during pregnancy because of the possible harm this may cause to the baby’s
development.
(4) Since $\beta_2 < 0$ and Corr(cigs, alcohol) > 0, then bias < 0 (negative bias). This means
that if we were to estimate:
$\text{birthweight} = \alpha_0 + \alpha_1 \text{cigs} + v$
then the expected value of $\hat{\alpha}_1$ will be more negative than $\beta_1$, i.e. $E[\hat{\alpha}_1] < \beta_1$.
(5) Given that $\beta_1$ is likely to be negative (since mothers who smoke during
pregnancy are more likely to have low birthweight babies than mothers who don’t
smoke), by omitting alcohol from the regression we are likely to overestimate the
magnitude of the negative effect of smoking on birthweight. We would be attributing
entirely to smoking what is partly due to alcohol drinking.
Case 3: $\beta_2 > 0$, Corr($X_1$, $X_2$) < 0
Suppose we would like to estimate the following equation:
$\text{life expectancy} = \beta_0 + \beta_1 \text{cigs} + \beta_2 \text{exercise} + u$
(1) So:
Y = life expectancy
X1 = cigs = annual average cigarette consumption
X2 = exercise = annual average hours of exercise
(2) Corr(cigs, exercise) is probably negative since people who exercise are less
likely to smoke.
(3) $\beta_2$ is probably positive given that frequent exercise may increase life
expectancy.
(4) Since $\beta_2 > 0$ and Corr(cigs, exercise) < 0, then bias < 0 (negative bias). This
means that if we were to estimate:
$\text{life expectancy} = \alpha_0 + \alpha_1 \text{cigs} + v$
then the expected value of $\hat{\alpha}_1$ will be more negative than $\beta_1$, i.e. $E[\hat{\alpha}_1] < \beta_1$.
(5) Given that $\beta_1$ is likely to be negative (since smokers on average live fewer years
than non-smokers), by omitting exercise from the regression we are likely to
overestimate the magnitude of the negative effect of smoking on life expectancy. We
would be attributing entirely to smoking what is partly due to lack of exercise.
Case 4: $\beta_2 < 0$, Corr($X_1$, $X_2$) < 0
Suppose we would like to estimate the following equation:
$\text{testscore}_i = \beta_0 + \beta_1 \text{schoolspending}_i + \beta_2 \text{povrate}_i + u_i$
(1) So:
Y = testscore_i = average test score of students in school district i
X1 = schoolspending_i = spending of school district i
X2 = povrate_i = poverty rate in school district i
(2) Corr(schoolspending, povrate) is probably negative since communities that
have high poverty rates are likely to have school districts that have low levels of
resources.
(3) $\beta_2$ is probably negative since poor communities are usually associated with low
academic performance.
(4) Since $\beta_2 < 0$ and Corr(schoolspending, povrate) < 0, then bias > 0 (positive bias).
This means that if we were to estimate:
$\text{testscore}_i = \alpha_0 + \alpha_1 \text{schoolspending}_i + v_i$
then the expected value of $\hat{\alpha}_1$ will be more positive than $\beta_1$, i.e. $E[\hat{\alpha}_1] > \beta_1$.
(5) Given that $\beta_1$ is likely to be positive (since we would think that higher spending
is associated with higher, or at least not lower, test scores), by omitting povrate
from the regression we are likely to overestimate the effect of school spending on
test scores. We would be attributing entirely to school spending what is partly due
to poor socio-economic conditions (as measured by povrate).
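The four cases can also be illustrated with simulated data. The sketch below is in Python and rests entirely on assumptions chosen for illustration: normally distributed variables, a true $\beta_1$ of 2, and a hypothetical helper short_regression_bias that regresses Y on X1 alone and returns the estimated slope minus the true $\beta_1$. None of the numbers come from the problem set.

```python
import random


def short_regression_bias(beta2, rho, n=20000, seed=0):
    """Simulate Y = 1 + 2*X1 + beta2*X2 + u, regress Y on X1 only,
    and return the estimated slope minus the true beta1 (the bias)."""
    rng = random.Random(seed)
    beta0, beta1 = 1.0, 2.0
    x1, y = [], []
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = rho * a + rng.gauss(0, 1)  # X2; sign of Corr(X1, X2) is the sign of rho
        u = rng.gauss(0, 1)
        x1.append(a)
        y.append(beta0 + beta1 * a + beta2 * b + u)
    x_bar = sum(x1) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (c - y_bar) for a, c in zip(x1, y))
    sxx = sum((a - x_bar) ** 2 for a in x1)
    return sxy / sxx - beta1


# One (beta2, rho) pair per cell of the omitted variable bias table.
cases = [(1.5, 0.5), (-1.5, 0.5), (1.5, -0.5), (-1.5, -0.5)]
for beta2, rho in cases:
    bias = short_regression_bias(beta2, rho)
    sign = "+" if bias > 0 else "-"
    print(f"beta2 = {beta2:+.1f}, rho = {rho:+.1f}: bias = {bias:+.3f} ({sign})")
```

In this setup the bias should be close to beta2 times rho, because X1 has variance 1 and its covariance with X2 is rho.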
Appendix: Selected Regression Formulas
Slope coefficient in bivariate OLS:
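$\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$
Intercept coefficient in bivariate OLS:
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
Predicted values:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
OLS residual:
$\hat{u}_i = Y_i - \hat{Y}_i$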