Introduction to Statistics:
Political Science (Class 9)
Review
Probability of having
cardiovascular disease
• Purpose of statistics:
– Inferences about populations using samples
• We draw a random sample of 1,000 adults
and 405 have some form of CVD
• Based on our sample, if we randomly
select one adult from the population: what
is the probability that they have
cardiovascular disease?
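The sample proportion serves as the estimate of this probability. A minimal sketch of the arithmetic, using the counts from the slide (the variable names are illustrative):

```python
# Counts from the slide: 405 of 1,000 randomly sampled adults have some form of CVD.
n_sampled = 1000
n_with_cvd = 405

# The sample proportion is our estimate of the probability that a
# randomly selected adult from the population has CVD.
p_cvd = n_with_cvd / n_sampled
print(f"Estimated P(CVD) = {p_cvd:.3f}")
```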
Conditional Probability
                                          No CVD    CVD
Exercise less than 3 days/week (N=602)    30.3%     28.9%
Exercise 3 or more days/week (N=398)      30.2%     10.6%
• Probability of exercising <3 days/week?
• Probability of CVD among those who
exercise <3 days/week?
• Probability of CVD among those who exercise 3
or more days/week?
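These three questions can be answered directly from the table. The sketch below works from the cell percentages (treated as shares of the whole sample); the group labels and function names are illustrative, not part of the slides.

```python
# Cell percentages from the table, as shares of the whole sample.
# Keys are (exercise group, CVD status); the labels are illustrative.
joint = {
    ("lt3", "no_cvd"): 0.303,
    ("lt3", "cvd"): 0.289,
    ("ge3", "no_cvd"): 0.302,
    ("ge3", "cvd"): 0.106,
}

def p_group(group):
    """Marginal probability of being in an exercise group (sum across the row)."""
    return sum(p for (g, _), p in joint.items() if g == group)

def p_cvd_given(group):
    """Conditional probability of CVD within an exercise group."""
    return joint[(group, "cvd")] / p_group(group)

print(f"P(exercise <3 days/week)       = {p_group('lt3'):.3f}")
print(f"P(CVD | exercise <3 days/week) = {p_cvd_given('lt3'):.3f}")
print(f"P(CVD | exercise 3+ days/week) = {p_cvd_given('ge3'):.3f}")
```

The conditional probability divides the joint share by the row total, so it answers "among those in this group, what fraction has CVD?"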
Association between
exercise and CVD?
                                          No CVD    CVD
Exercise less than 3 days/week (N=602)    30.3%     28.9%
Exercise 3 or more days/week (N=398)      30.2%     10.6%
p1 = 28.9/(30.3+28.9) = 0.488
p2 = 10.6/(30.2+10.6) = 0.260
Difference = 0.488 - 0.260 = .228
Those who exercise less than 3 days/week are .228 (22.8 percentage points)
more likely to have CVD
Specifying and testing hypotheses
• Difference of proportions = .228
• What’s our null hypothesis?
• Why a “null hypothesis”? Why not test
whether the difference is .228?
• Central limit theorem
– In repeated sampling, the distribution of our
estimates of the mean (or difference of means
or slope) will be normally distributed and
centered over the true population value
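The repeated-sampling idea can be illustrated with a small simulation. This is only a sketch: the "true" population proportion, the number of repeated samples, and the random seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_P = 0.405       # assumed "true" population proportion (hypothetical)
SAMPLE_SIZE = 1000   # adults per sample
N_SAMPLES = 1000     # number of repeated samples

def sample_proportion(p, n):
    """Draw one random sample of size n and return the share with the trait."""
    return sum(random.random() < p for _ in range(n)) / n

estimates = [sample_proportion(TRUE_P, SAMPLE_SIZE) for _ in range(N_SAMPLES)]

center = statistics.mean(estimates)    # should sit close to TRUE_P
spread = statistics.stdev(estimates)   # empirical standard error
theory_se = (TRUE_P * (1 - TRUE_P) / SAMPLE_SIZE) ** 0.5

# Under a normal distribution, about 95% of estimates fall within 1.96 SEs.
share_within = sum(
    abs(e - TRUE_P) <= 1.96 * theory_se for e in estimates
) / N_SAMPLES

print(f"Mean of estimates: {center:.4f} (true value {TRUE_P})")
print(f"SD of estimates:   {spread:.4f} (theoretical SE {theory_se:.4f})")
print(f"Share within 1.96 SE of true value: {share_within:.3f}")
```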
Central limit theorem
[Figure: sampling distribution centered on the proposed true value (0), with 1 standard error marked]
Comparing proportions
• Difference of proportions = .228
p1 = 28.9/(30.3+28.9) = 0.488
(N=602)
p2 = 10.6/(30.2+10.6) = 0.260
(N=398)
• Standard error of this difference:
square root of (p1*(1-p1)/N1) + (p2*(1-p2)/N2)
Comparing proportions
• So, standard error of difference is the
square root of:
(.488*(1-.488)/602)+(.260*(1-.260)/398)
– Which is .0299
• Difference of proportions = .228
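The standard error calculation on this slide can be sketched as a small function. The function and argument names are illustrative; the inputs are the rounded proportions and group sizes from the slides, so the result should come out at roughly .03.

```python
import math

def se_diff_proportions(p1, n1, p2, n2):
    """Standard error of the difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Rounded values from the slides: p1 = .488 (N=602), p2 = .260 (N=398).
se = se_diff_proportions(0.488, 602, 0.260, 398)
print(f"SE of difference = {se:.4f}")
```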
Hypotheses
• Null hypothesis:
– There is no difference in the rate of CVD
between those who exercise less than 3
days/week and those who exercise 3 or more
• Alternate hypothesis:
– There is a difference in the rate of CVD
between those who exercise less than 3
days/week and those who exercise 3 or more
• (i.e., the difference is not 0)
If 0 were the true difference, it would be very unlikely
that we would find a difference 7.63 (.228/.0299)
standard errors from that value by chance
[Figure: sampling distribution centered on the proposed true value (0), with 1 standard error marked]
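The test can be sketched numerically: divide the observed difference by its standard error, then ask how likely a value that extreme would be if the true difference were 0. The sketch below assumes the normal approximation and a two-sided test, and uses the rounded p1 and p2 computed earlier in the slides.

```python
import math

# Rounded proportions and group sizes from the slides.
p1, n1 = 0.488, 602   # exercise less than 3 days/week
p2, n2 = 0.260, 398   # exercise 3 or more days/week

diff = p1 - p2
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Number of standard errors between the observed difference and the null value (0).
z = (diff - 0) / se

# Two-sided p-value under the normal approximation:
# P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"difference = {diff:.3f}, SE = {se:.4f}, z = {z:.2f}")
print(f"two-sided p-value = {p_value:.2e}")
```

A p-value this small means a difference this far from 0 would almost never arise by chance alone, so the null hypothesis is rejected.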
Does exercise cause lower CVD?
• Reverse causation? Might CVD cause
exercise?
• Failure to account for confounds
– Typically leads to over-estimating the strength
of a relationship (not always… but usually)
[Figure: Obama FT (y-axis, 0-100) against Bush FT (x-axis, 0-100), with Democrats and Republicans shown separately]
Specification and
Interpretation
Multivariate Regression
Does exercise make
CVD less likely?
• Regression (predict CVD)
                        Coef.    SE      T     P-value
Days Exercise (0-7)     -0.06    .001    ?     0.000
Constant                 0.56    .002    ?     0.000
• Estimated likelihood of CVD if exercise
4 days/week?
• What might confound our estimate of the
relationship between exercise and CVD?
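For the first question above (the estimated likelihood of CVD at 4 days/week of exercise), the prediction is the constant plus the slope times the number of days. A minimal sketch using the coefficients from the table; the names are illustrative.

```python
# Coefficients from the bivariate regression table.
CONSTANT = 0.56
COEF_DAYS_EXERCISE = -0.06

def predicted_cvd(days_exercise):
    """Predicted probability of CVD for a given number of exercise days (0-7)."""
    return CONSTANT + COEF_DAYS_EXERCISE * days_exercise

for days in range(8):
    print(f"{days} days/week: {predicted_cvd(days):.2f}")
```

Each additional day of exercise lowers the predicted probability by 0.06, which is what the slope means.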
Controlling for confounds
                        Coef.    SE      T       P-value
Days Exercise (0-7)     -0.03    .001    -3.0    0.002
Days Fast Food (0-7)     0.04    .002     2.0    0.048
Constant                 0.42    .002    21.0    0.000
[Figure: % Chance CVD (y-axis, 0-100) by Days per Week Exercise (x-axis), with High Fast Food and Low Fast Food labeled]
Controlling for dichotomous
confounds
                        Coef.    SE      T       P-value
Days Exercise (0-7)     -0.03    .001    -3.0    0.002
Days Fast Food (0-7)     0.04    .002     2.0    0.048
Smoker (1=yes)           0.11    .001    11.0    0.000
Constant                 0.38    .002    19.0    0.000
• Predicted probability of CVD for
– 2 days exercise, 2 days Fast food, smoker
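The predicted probability for this profile is the constant plus each coefficient times the corresponding value. A minimal sketch using the coefficients from the table above; the dictionary keys and function name are illustrative.

```python
# Coefficients from the regression that controls for fast food and smoking.
CONSTANT = 0.38
coefs = {
    "days_exercise": -0.03,
    "days_fast_food": 0.04,
    "smoker": 0.11,   # dichotomous: 1 = yes, 0 = no
}

def predict(values):
    """Constant plus the sum of coefficient * value for each predictor."""
    return CONSTANT + sum(coefs[name] * value for name, value in values.items())

profile = {"days_exercise": 2, "days_fast_food": 2, "smoker": 1}
print(f"Predicted probability of CVD: {predict(profile):.2f}")

# The same person as a non-smoker differs by exactly the smoker coefficient.
non_smoker = dict(profile, smoker=0)
print(f"Same profile, non-smoker:     {predict(non_smoker):.2f}")
```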
Nominal Variables
• Variable that does not have an “order” to it
– Nothing is “higher” or “lower”
• Create set of dichotomous variables
• Always interpret coefficients with respect
to the reference category
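Creating the set of dichotomous variables can be sketched as follows. The four-category region variable and the choice of Midwest as the reference category are assumptions made for illustration.

```python
# Hypothetical nominal variable with four unordered categories.
REGIONS = ["Midwest", "South", "West", "Northeast"]
REFERENCE = "Midwest"  # assumed reference (excluded) category

def make_indicators(region):
    """Return one 0/1 indicator per category, omitting the reference category."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return {name: int(region == name) for name in REGIONS if name != REFERENCE}

print(make_indicators("South"))    # one indicator is 1, the rest are 0
print(make_indicators("Midwest"))  # reference category: all indicators are 0
```

Because the reference category is the case where every indicator is 0, each indicator's coefficient is the difference between that category and the reference category.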
Controlling for nominal confounds
                        Coef.    SE      T       P-value
Days Exercise (0-7)     -0.03    .001    -3.0    0.002
Days Fast Food (0-7)     0.03    .002     1.5    0.135
Smoker (1=yes)           0.09    .001     9.0    0.000
South (1=yes)            0.03    .002     1.5    0.137
West (1=yes)            -0.01    .002    -0.5    0.642
Northeast (1=yes)        0.02    .002     1.0    0.410
Constant                 0.34    .002    17.0    0.000
(Midwest is excluded category)
What if we wanted to test whether including
region indicators improves fit of the model?
Non-linear relationships
Logarithms
Why use a logarithmic transformation?
You think the relationship looks like this…
[Figure: Home Value ($s), 0 to 1,800,000, against Yearly Income ($s), 60,000 to 6,660,000]
Logarithms
[Figure: Home Value, 0 to 1,800,000, against Logged Yearly Income, 10 to 16]
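A short sketch of what logging does to the income scale. It assumes the natural logarithm, which the slides do not state; that choice fits a logged axis running from about 10 to 16 for incomes in this range. The income values are taken from the axis of the unlogged chart.

```python
import math

# Yearly incomes ($) taken from the axis of the unlogged chart.
incomes = [60_000, 660_000, 1_260_000, 3_060_000, 6_660_000]

# Natural log of each income (the base is an assumption).
logged = [math.log(income) for income in incomes]

for income, value in zip(incomes, logged):
    print(f"{income:>10,} -> {value:.2f}")

# Equal ratios become equal distances: multiplying income by 10
# always adds the same amount on the logged scale.
step_low = math.log(600_000) - math.log(60_000)
step_high = math.log(6_000_000) - math.log(600_000)
print(f"10x step at low incomes:  {step_low:.3f}")
print(f"10x step at high incomes: {step_high:.3f}")
```

On the logged scale, the difference between $60,000 and $600,000 counts the same as the difference between $600,000 and $6,000,000, which suits relationships where each extra dollar matters less at higher incomes.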
Squared term – U(or ∩)-shaped
relationship
Age and political ideology (-2=very conservative, 2=very liberal)
               Coef.     SE       T         P
Age            -0.007    0.004    -1.740    0.082
Constant        0.122    0.209     0.580    0.561

               Coef.     SE       T         P
Age            -0.065    0.025    -2.630    0.009
Age-squared     0.001    0.000     2.390    0.017
Constant        1.554    0.635     2.450    0.015
Age and Political Ideology
               Coef.     SE       T         P
Age            -0.065    0.025    -2.630    0.009
Age-squared     0.001    0.000     2.390    0.017
Constant        1.554    0.635     2.450    0.015
Age    Age²    -0.065*Age    .0005574*Age²    Constant    Predicted Value
18      324    -1.178        0.181            1.554        0.557
28      784    -1.832        0.437            1.554        0.159
38     1444    -2.487        0.805            1.554       -0.128
48     2304    -3.141        1.284            1.554       -0.303
58     3364    -3.795        1.875            1.554       -0.366
68     4624    -4.450        2.577            1.554       -0.319
78     6084    -5.104        3.391            1.554       -0.159
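The predicted values in this table can be approximated with a short function. This sketch uses the rounded age coefficient (-0.065) with the unrounded age-squared coefficient from the table header (.0005574), so its results differ slightly from the table, which appears to use an unrounded age coefficient. The turning-point formula is standard for a quadratic; the names are illustrative.

```python
# Coefficients from the squared-term regression.
CONSTANT = 1.554
COEF_AGE = -0.065          # rounded, as reported in the regression table
COEF_AGE_SQ = 0.0005574    # unrounded, from the predicted-values table header

def predicted_ideology(age):
    """Predicted ideology (-2 = very conservative, 2 = very liberal) at a given age."""
    return CONSTANT + COEF_AGE * age + COEF_AGE_SQ * age ** 2

for age in range(18, 89, 10):
    print(f"Age {age}: {predicted_ideology(age):+.3f}")

# A quadratic a + b*x + c*x^2 turns at x = -b / (2c).
# With c > 0 the curve is U-shaped, so this is the minimum.
turning_point = -COEF_AGE / (2 * COEF_AGE_SQ)
print(f"Predicted ideology bottoms out near age {turning_point:.1f}")
```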
[Figure: Ideology (-2=very conservative, 2=very liberal) by Age, 18 to 88]
Create indicators from an
ordered variable
Party Identification (-3 to 3)
Seven Variables:
Strong Republican (1=yes)
Weak Republican (1=yes)
Lean Republican (1=yes)
Pure Independent (1=yes)
Lean Democrat (1=yes)
Weak Democrat (1=yes)
Strong Democrat (1=yes)
Predict Obama Favorability (1-4)
                     Coef.     SE       T          P
Strong Republican    -1.632    0.161    -10.160    0.000
Weak Republican      -0.707    0.198     -3.580    0.000
Lean Republican      -1.235    0.181     -6.810    0.000
Lean Democrat         0.674    0.197      3.430    0.001
Weak Democrat         0.494    0.187      2.640    0.009
Strong Democrat       0.595    0.159      3.750    0.000
Constant              2.940    0.134     21.870    0.000
Excluded category: Pure Independents
[Figure: Obama Favorability (1-4) by party identification, Strong Republican through Strong Democrat]
Predict Obama Favorability (1-4)
                     Coef.     SE       T          P
Strong Republican    -0.397    0.150     -2.650    0.008
Weak Republican       0.528    0.189      2.790    0.006
Pure Independent      1.235    0.181      6.810    0.000
Lean Democrat         1.909    0.188     10.150    0.000
Weak Democrat         1.729    0.179      9.680    0.000
Strong Democrat       1.831    0.148     12.360    0.000
Constant              1.705    0.122     14.010    0.000
New excluded category: Leaning Republicans
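Changing the excluded category changes the coefficients but not the predictions. The sketch below computes each group's predicted favorability from both tables; the dictionary and function names are illustrative, and small gaps between the two sets of predictions reflect rounding in the reported coefficients.

```python
GROUPS = [
    "Strong Republican", "Weak Republican", "Lean Republican",
    "Pure Independent", "Lean Democrat", "Weak Democrat", "Strong Democrat",
]

# Model 1: Pure Independents are the excluded category.
CONSTANT_IND_REF = 2.940
coefs_ind_ref = {
    "Strong Republican": -1.632, "Weak Republican": -0.707,
    "Lean Republican": -1.235, "Lean Democrat": 0.674,
    "Weak Democrat": 0.494, "Strong Democrat": 0.595,
}

# Model 2: Leaning Republicans are the excluded category.
CONSTANT_LEANR_REF = 1.705
coefs_leanr_ref = {
    "Strong Republican": -0.397, "Weak Republican": 0.528,
    "Pure Independent": 1.235, "Lean Democrat": 1.909,
    "Weak Democrat": 1.729, "Strong Democrat": 1.831,
}

def predicted(group, constant, coefs):
    """Constant plus the group's coefficient; the excluded group gets only the constant."""
    return constant + coefs.get(group, 0.0)

predictions = {
    g: (predicted(g, CONSTANT_IND_REF, coefs_ind_ref),
        predicted(g, CONSTANT_LEANR_REF, coefs_leanr_ref))
    for g in GROUPS
}

for group, (model_1, model_2) in predictions.items():
    print(f"{group:<18} model 1: {model_1:.3f}   model 2: {model_2:.3f}")
```

In each model the constant is the predicted value for the excluded category, and every coefficient is a difference from that category.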
Interactions
• One variable moderates the effect of
another – i.e., the relationship between
one variable and an outcome depends on
the value of another variable
Regression estimates an equation…
                                               Coef.     SE       T         P
Party Affiliation (-3=strong R; 3=strong D)     1.286    0.878     1.460    0.143
Voted in 2008                                  -1.138    1.484    -0.770    0.443
Party Affiliation x Voted in 2008               3.575    0.918     3.900    0.000
Constant                                       61.100    1.358    44.980    0.000
61.100 + 1.286*Party – 1.138*Voted + 3.575*Party*Voted + u
61.100 + Party*1.286 + Party*Voted*3.575 – 1.138*Voted + u
61.100 + Party(1.286 + Voted*3.575) – 1.138*Voted + u
OR
61.100 + Party*1.286 + Voted*Party*3.575 – Voted*1.138 + u
61.100 + Party*1.286 + Voted(Party*3.575 –1.138) + u
                Party Aff.    Voted    Party Aff.    Voted     Party x Voted    Constant    Predicted Value
Coefficients                           1.286         -1.138    3.575            61.100
                -3            0        -3.858        0         0                61.100      57.242
                -2            0        -2.572        0         0                61.100      58.528
                -1            0        -1.286        0         0                61.100      59.814
                 0            0         0.000        0         0                61.100      61.100
                 1            0         1.286        0         0                61.100      62.386
                 2            0         2.572        0         0                61.100      63.672
                 3            0         3.858        0         0                61.100      64.959
                Party Aff.    Voted    Party Aff.    Voted     Party x Voted    Constant    Predicted Value
Coefficients                           1.286         -1.138    3.575            61.100
                -3            1        -3.858        -1.138    -10.726          61.100      45.378
                -2            1        -2.572        -1.138     -7.151          61.100      50.240
                -1            1        -1.286        -1.138     -3.575          61.100      55.101
                 0            1         0.000        -1.138      0.000          61.100      59.962
                 1            1         1.286        -1.138      3.575          61.100      64.824
                 2            1         2.572        -1.138      7.151          61.100      69.685
                 3            1         3.858        -1.138     10.726          61.100      74.547
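The two tables of predicted values follow from a single equation. A minimal sketch using the rounded coefficients from the regression table (so results can differ from the tables in the third decimal); the function names are illustrative.

```python
# Coefficients from the interaction model.
CONSTANT = 61.100
B_PARTY = 1.286           # Party Affiliation (-3 = strong R; 3 = strong D)
B_VOTED = -1.138          # Voted in 2008 (1 = yes, 0 = no)
B_PARTY_X_VOTED = 3.575   # interaction term

def predicted_value(party, voted):
    """Predicted outcome for a given party affiliation and voting status."""
    return (CONSTANT + B_PARTY * party + B_VOTED * voted
            + B_PARTY_X_VOTED * party * voted)

def party_slope(voted):
    """Effect of a one-unit change in party affiliation, which depends on voting status."""
    return B_PARTY + B_PARTY_X_VOTED * voted

for voted in (0, 1):
    label = "Voted" if voted else "Did not vote"
    print(f"{label}: slope on party = {party_slope(voted):.3f}")
    for party in range(-3, 4):
        print(f"  party = {party:+d}: {predicted_value(party, voted):.3f}")
```

The slope on party affiliation is 1.286 among non-voters and 1.286 + 3.575 among voters, which is what it means for voting to moderate the relationship.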
[Figure: Support for Comparative Effectiveness Research (y-axis, 40-80) by party affiliation (Strong Republican through Strong Democrat), with Did not Vote and Voted shown separately]
Establishing causality
Dealing with confounds
• Theory + multivariate regression
• Experiments
Dealing with reverse causation
• Theory
• Experiments
Experiments
• What is the key characteristic of an
experiment?
• How does this address reverse causality?
• How does it address confounds?
• Weaknesses/limitations of experiments?
Exam Expectations
• Describe probabilities / conditional probabilities
• Write hypotheses
– Demonstrate understanding of how null hypotheses relate to the
central limit theorem
• Test difference of proportions (formula for SE will be provided)
• Interpreting multivariate regression
– Relationships (slopes)
– Predicted values
– Sketch graphs of relationships
• Discuss strengths and limitations of analyses
– Why an estimated slope might be biased
– Benefits and limitations of experiments
Notes
• Homework 3 graded
• Homework 4 due Thursday 12/9
• Office hours next week – email to come
• Exam December 14 at 2pm