Bookdown Demo
Analysis Using R
Thomas M. Holbrook
2023-08-20
Contents
Preface
  Origin Story
  How to use this book
  What’s in this Book?
  Keys to Student Success
  Data Sets and Codebooks
3.6 Exercises
7 Probability
  7.1 Get Started
  7.2 Probability
  7.3 Theoretical Probabilities
  7.4 Empirical Probabilities
  7.5 The Normal Curve and Probability
  7.6 Next Steps
  7.7 Exercises
Origin Story
This book started as a collection of lecture notes that I put online for my
undergraduates in the wake of the COVID-19 outbreak in March of 2020. I had
always posted a rough set of online notes for my classes, but the pandemic pivot
meant that the notes had to be much more detailed and thorough than before.
This, coupled with my inability to find a textbook that was right for my class,2
led me to somewhat hastily cobble together my topic notes into something that
started to look like a “book” during the summer of 2021. I used this new set
of notes for the fall 2021 hybrid section of my undergraduate course, Political
Data Analysis, and it worked out pretty well. I then spent the better part of
spring 2022 (a sabbatical semester) expanding and revising (and revising, and
revising, and revising) the content, as well as trying to master Bookdown, so I
could stitch together the multiple R Markdown files to look more like a book
than just a collection of lecture notes. I won’t claim to have mastered either
Bookdown or Markdown, but I did learn a lot as I used them to put this thing
together.

1 If the link doesn’t work, copy and paste this to your browser: https://www.dropbox.com/s/tezwantj8n4emjt/bookdown-demo_.pdf?dl=0
2 There are a lot of really good textbooks out there. I suppose I was a bit like Goldilocks: most books had many things to recommend them, but none were quite right. Some books had a social science orientation but not much to say about political science; some books were overly technical, while others were not technical enough; some books were essentially programming manuals with little to no research design or statistics instruction, while others were very good on research design and statistics but offered very little on the computing side.
Right now, the book is available free online. At some point, it will be available
as a physical book, probably via a commercial press, and hopefully with low
cost options for students. Feel free to use it in your classes if you think it will
work for you. If you do use it, let me know how it works out for you!
This work is licensed under a Creative Commons Attribution-NonCommercial-
ShareAlike 4.0 International License.
calculate and interpret statistics, apply key concepts to problems and exam-
ples, or interpret R output. Most of these problems lean much more to the
“Concepts” than to “Calculations” side, and the calculations tend to be pretty
simple. Second, there are R Problems that require students to analyze data and
interpret the findings using R commands shown earlier in the chapter. In some
cases, there is a cumulative aspect to these problems, meaning that students
need to use R code they learned in previous chapters. Students who follow
along and run the R code as they read the chapters will have a relatively easy
time with the R problems. Most chapters include both Concepts and Calculations
problems and R Problems, though some chapters include only one type of problem
or the other.
Chapter Topics
1. Introduction to Research and Data
2. Using R to Do Data Analysis
3. Frequencies and Bar Graphs
4. Transforming Variables
5. Measures of Central Tendency
6. Measures of Dispersion
7. Probability
8. Sampling and Inference
9. Hypothesis Testing
10. Hypothesis Testing with Two Groups
11. Hypothesis Testing with Multiple Groups
12. Hypothesis Testing with Non-Numeric Variables (Crosstabs)
13. Measures of Association
14. Correlation and Scatterplots
15. Simple Regression
16. Multiple Regression
17. Advanced Regression Topics
18. Regression Assumptions
get their work done in the first part of the book, where they will learn some data
analysis basics and get their first exposure to using R. Also, for students, ask
for help! Your instructors want you to succeed and one of the keys to success is
letting the experts (your instructors) help you!
These and other sets can be downloaded at this link (you might have to right-
click on this link to open the directory in a new tab or window)3 , and the
codebooks can be found in the appendix of this book. Only a handful of vari-
ables are used from each of these data sets, leaving many others for homework,
research projects, or paper assignments.
3 If the link doesn’t work, copy and paste this to your browser: https://www.dropbox.com/sh/le8u4ha8veihuio/AAD6p6RQ7uFvMNXNKEcU__I7a?dl=0
Chapter 1
Introduction to Research
and Data
you want to know? There are countless interesting topics that could be pursued
in either of these general categories. The key to really kicking things off is to
narrow the focus to a more useful research question. Maybe a student interested
in elections has observed that some presidents are re-elected more easily than
others and settles on a more manageable goal, explaining the determinants of
incumbent success in presidential elections. Maybe the student interested in
LGBTQ rights has observed that some states offer several legal protections for
LGBTQ residents, while other states do not. In this case, the student might
limit their research interest to explaining variation in LGBTQ rights across the
fifty states. The key here is to move from a broad subject area to a narrower
research question that gives some direction to the rest of the process.
Still, even if a researcher has narrowed their research interest to a more man-
ageable topic, they need to do a bit more thinking before they can really get
started; they need a set of expectations to guide their research. In the case of
studying sources of incumbent success, for instance, it is still not clear where to
begin. Students need to think about their expectations. What are some ideas
about the things that might be related to incumbent success? Do these ideas
make sense? Are they reasonable expectations? What theory is guiding your
research?
“Theory” is one of those terms whose meaning we all understand at some
level and perhaps even use in everyday conversations (e.g., “My theory is
that…”), but it has a fairly specific meaning in the research process. In
this context, theory refers to a set of logically connected propositions (or
ideas/statements/assumptions) that we take to be true and that, together,
can be used to explain a given phenomenon (outcome) or set of phenomena
(outcomes). Think of a theory as a rationale for testing ideas about the things
that you think explain the outcome that interests you. Another way to think of
a theory in this context is as a model that identifies how things are connected
to produce certain outcomes.
A good theory has a number of characteristics, the most important of which are
that it must be testable, plausible, and accessible. To be testable, it must be
possible to subject the theory to empirical evidence. Most importantly, it must
be possible to show that the theory does not provide a good account of reality,
if that is the case; in other words, the theory must be falsifiable. Plausibility
comes down to a simple question: on its face, given what we know about
the subject at hand, does the theory make sense? If you know something about
the subject matter and find yourself thinking “Really? This sounds a bit far-
fetched,” that could be a sign of a not-very-useful theory. Finally, a good theory
needs to be understandable and easy to communicate. This is best accomplished
by being parsimonious (concise, to the point, with as few moving parts as possible)
and by using as little specialized, technical jargon as possible.
As an example, a theory of retrospective voting can be used to explain support
and opposition to incumbent presidents. The retrospective model was devel-
oped in part in reaction to findings from political science research that showed
that U.S. voters did not know or care very much about ideological issues and,
hence, could not be considered “issue voters.” Political scientist Morris Fior-
ina’s work on retrospective voting countered that voters don’t have to know a
lot about issues or candidate positions on issues to be issue voters.1 Instead, he
argued that the standard view of issue voting is too narrow and that a theory based
on retrospective issues does a better job of describing the American voter. Some of
the key tenets of the retrospective model are:
• Elections are referendums on the performance of the incumbent president
and their party;
• Voters don’t need to understand or care about the nuances of foreign and
domestic policies of the incumbent president to hold their administration
accountable;
• Voters only need to be aware of the results of those policies, i.e., have
a sense of whether things have gone well on the international (war, trade,
crises, etc.) and domestic (economy, crimes, scandals, etc.) fronts;
• When times are good, voters are inclined to support the incumbent party;
when times are bad, they are less likely to support the incumbent party.
This is an explicitly reward-punishment model. It is referred to as retrospective
voting because the emphasis is on looking back on how things have turned out
under the incumbent administration rather than comparing details of policy
platforms to decide if the incumbent or challenging party has the best plans for
the future.
The next step in this part of the research process is developing hypotheses that
logically flow from the theory. A hypothesis is speculation about the state of the
world. Research hypotheses are based on theories and usually assert that varia-
tions in one variable are associated with, result in, or cause variation in another
variable. Typically, hypotheses specify an independent variable and a depen-
dent variable. Independent (explanatory) variables, often represented as
X, are best thought of as the variables that influence or shape outcomes in
other variables. They are referred to as independent because we are not assum-
ing that their outcomes depend on the values of other variables. Dependent
(response) variables, often represented as Y, measure the thing we want to
explain. These are the variables that we think are affected by the independent
variables. One short-cut to recalling this is to remember that the outcome of
the dependent variable depends upon the outcome of the independent variable.
Based on the theory of retrospective voting, for instance, it is reasonable to
hypothesize that economic prosperity is positively related to the level of popular
support for the incumbent president and their party. Support for the president
should be higher when the economy is doing well than when it is not doing well.
1 This might be the only bibliographic reference in this book: Fiorina, Morris P. 1981. Retrospective Voting in American National Elections. New Haven: Yale University Press.
In social science research, hypotheses sometimes are set off and highlighted
separately from the text, just so it is clear what they are:
H1 : Economic prosperity is positively related to the level of popular
support for the incumbent president and their party. Support for
the president should be higher when the economy is doing well than
when it is not doing well.
In this hypothesis, the independent and dependent variables are represented
by two important concepts, economic prosperity and support for the incumbent
president, respectively. Concepts are abstract ideas that help to summarize
and organize reality; they define theoretically relevant phenomena and help us
understand the meaning of the theory a bit more clearly. But while concepts
such as these help us understand the expectations embedded in the hypothesis,
they are sufficiently broad and abstract that we are not quite ready to analyze
the data.
approval, you would expect that the outcomes of the polls used do not vary widely from
day to day and that most polls taken at a given point in time would produce
similar results.
Data Gathering. Once a researcher has determined how they intend to mea-
sure the key concepts, they must find the data. Sometimes, a researcher might
find that someone else has already gathered the relevant data they can use for
their project. For instance, researchers frequently rely upon regularly occur-
ring, large-scale surveys of public opinion that have been gathered for extended
periods of time, such as the American National Election Study (ANES), the
General Social Survey (GSS), or the Cooperative Election Study (CES). These
surveys are based on large, scientifically drawn samples and include hundreds
of questions on topics of interest to social scientists. Using data sources such
as these is referred to as secondary data analysis. Similarly, even when re-
searchers are putting together their own data set, they frequently use secondary
data. For instance, to test the hypotheses discussed above, a researcher may
want to track election results and some measure of economic activity, such as economic
growth. These data do not magically appear. Instead, the researcher has to
put on their thinking cap and figure out where they can find sources for these
data. As it happens, election results can be found at David Leip’s Election
Atlas (https://uselectionatlas.org), and the economic data can be found at the
Federal Reserve Economic Data website (https://fred.stlouisfed.org) and other
government sites, though it takes a bit of poking around to actually find the
right information.
Even after figuring out where to get their data, researchers still have several
important decisions to make. Sticking with the retrospective voting hypothesis,
if the focus is on national outcomes of U.S. presidential elections, there are a
number of questions that need to be answered. In what time period are we
interested? All elections? Post-WWII elections? How shall incumbent support
be measured? Incumbent party percent of the total vote or percent of the
two-party vote? If using the growth rate in GDP, over what period of time?
Researchers need to think about these types of questions before gathering data.
In this book, we will rely on several data sources: a 50-state data set, a county-
level political and demographic data set, a cross-national political and socioe-
conomic data set, and the American National Election Study (ANES), a large-
scale public opinion survey conducted before and after the 2020 U.S. presidential
election.
[Figure 1.2: Two scatterplots showing the relationship between the dependent variable and two different independent variables (A and B); both relationships are positive, but the pattern for Independent Variable B is noticeably stronger.]
At the same time, while the information in Figure 1.2 gives you a clear intuitive
impression of the differences in the two relationships, you can’t be very specific
about how much stronger the relationship is for independent variable B without
more precise information such as the correlation coefficients in Table 1.3. Most
often, the winning combination for communicating research results is some mix
of statistical findings and data visualization.
In addition to measuring the strength of relationships, researchers also focus
on their level of confidence in the findings. This is a key part of hypothesis
testing and will be covered in much greater detail later in the book. The basic
idea is that we want to know if the evidence of a relationship is strong enough
that we can rule out the possibility that it occurred due to chance, or perhaps to
measurement issues. Consider the relationship between Independent Variable
A and the dependent variable in Figure 1.2. On its face, this looks like a weak
relationship. In fact, without the line of prediction or the correlation coefficient,
it is possible to look at this scatterplot and come away with the impression that the
pattern is random and there is no relationship between the two variables. The
correlation (.35) and line of prediction tell us there is a positive relationship,
but it doesn’t look very different from a completely random pattern. From this
perspective, the question becomes, “How confident can we be that this pattern is
really different from what you would expect if there was no relationship between
the two variables?” Usually, especially with large samples, researchers can have
a high level of confidence in strong relationships. However, weak relationships,
especially those based on a small number of cases, do not inspire confidence.
This might be a bit confusing at this point, but good research will distinguish
between confidence and strength when communicating results. This point is
emphasized later, beginning in Chapter 10.
One of the most important parts of this stage of the research process is the
interpretation of the results. The key point to get here is that the statistics
and visualizations do not speak for themselves. It is important to understand
that knowing how to type in computer commands and get statistical results is
not very helpful if you can’t also provide a coherent, substantive explanation of
the results. Bottom line: Use words!
Typically, interpretations of statistical results focus on how well the findings
comport with the expectations laid out in the hypotheses, paying special atten-
tion to both the strength of the relationships and the level of confidence in the
findings. A good discussion of research findings will also acknowledge potential
limitations to the research, whatever those may be.
[Figure 1.3: Scatterplot of the incumbent party’s share of the presidential vote (vertical axis, roughly 45 to 65 percent) against GDP growth (horizontal axis, −3 to 4), with presidential election years from 1948 to 2020 labeled.]
The results of the analysis provide some support for the retrospective voting
hypothesis. The scatter plot shows that there is a general tendency for the
incumbent party to struggle at the polls when the economy is relatively weak and
to have success at the polls when the economy is strong. However, while there is
a positive relationship between GDP growth and incumbent vote share, it is not
a strong relationship. This can be seen in the variation in outcomes around the
line of prediction, where we see a number of outcomes (1952, 1956, 1972, 1984,
and 1992) that deviate quite a bit from the anticipated pattern. The correlation
between these two variables (.49), confirms that there is a moderate, positive
relationship between GDP growth and vote share for the incumbent presidential
party. Clearly, there are other factors that help explain incumbent party electoral
success, but this evidence shows that the state of the economy does play a role.
1.3.4 Feedback
Although it is generally accepted that theories should not be driven by what the
data say (after all, the data are supposed to test the theory!), it would be foolish
to ignore the results of the analysis and not allow for some feedback into the
research process and reformulation of expectations. In other words, it is possible
that you will discover something in the analysis that leads you to modify your
theory, or at least change the way you think about things. In the real world
of social science data analysis, there is a lot of back-and-forth between theory,
hypothesis formation, and research findings. Typically, researchers have an idea
of what they want to test, perhaps grounded in some form of theory, or maybe
something closer to a solid rationale; they then gather data and conduct some
analyses, sometimes finding interesting patterns that influence how they think
about their research topic, even if they had not considered those things at the
outset.
Let’s consider the somewhat modest relationship between change in GDP and
votes for the incumbent party, as reported in Figure 1.3. Based on these findings,
you could conclude that there is a tendency for the electorate to punish the
incumbent party for economic downturns and reward it for economic upturns,
but the trend is not strong. Alternatively, you could think about these results
and ask yourself whether you are missing something. For instance, you might consider
the sort of conditions in which you should expect retrospective voting to be
easier for voters. In particular, if the point of retrospective voting is to reward or
punish the incumbent president for outcomes that occur during their presidency,
then it should be easier to assign responsibility in years in which the president
is running for another term. Several elections in the post-WWII era were open-
seat contests, meaning that the incumbent president was not running, mostly
due to term limits (1952, 1960, 1968, 1988, 2000, 2008, and 2016). It makes
sense that the relationship between economic conditions and election outcomes
should be weaker during these years, since the incumbent president can only
be held responsible indirectly. So, maybe you need to examine the two sets of
elections (incumbent running vs. open seat) separately before you conclude that
the retrospective model is only somewhat supported by the data.
Figure 1.4 illustrates how important it can be to allow the results of the initial
data analysis to provide feedback into the research process. On the left side,
there is a fairly strong, positive relationship between changes in GDP in the
first three quarters of the year and the incumbent party’s share of the two-party
vote when the incumbent is running. There are a couple of years that deviate
from the trend, but the overall pattern is much stronger here than it was in
Figure 1.3, which included data from all elections. In addition, the scatterplot
for open-seat contests (right side) shows that when the incumbent president is
not running, there is virtually no relationship between the state of the economy
and the incumbent party share of the two-party vote. These interpretations of
the scatter plot patterns are further supported by the correlation coefficients,
.68 for incumbent races and a meager .16 for open-seat contests.
[Figure 1.4 panels: two scatterplots of the incumbent party’s share of the two-party vote against GDP growth, one for elections in which the incumbent is running and one for open-seat contests, with election years labeled.]
Figure 1.4: Testing the Retrospective Voting Hypothesis in Two Different Contexts
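For readers curious about how such a split-sample check might eventually be carried out in R (the software introduced in Chapter 2), here is a minimal sketch; the data frame and column names (elections, gdp_growth, inc_vote, inc_running) are hypothetical stand-ins, not the data used in the figures.
#Correlation between GDP growth and incumbent vote share when the incumbent is running
cor(elections$gdp_growth[elections$inc_running == 1],
    elections$inc_vote[elections$inc_running == 1])
#The same correlation for open-seat contests
cor(elections$gdp_growth[elections$inc_running == 0],
    elections$inc_vote[elections$inc_running == 0])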
The lesson here is that it can be very useful to allow for some fluidity between
the different components of the research process. When theoretically interesting
possibilities present themselves during the data analysis stage, they should be
given due consideration. Still, with this newfound insight, it is necessary to
exercise caution and not be overconfident in the results, in large part because
they are based on only 19 elections. With such a small number of cases, the next
two or three elections could alter the relationship in important ways if they do
not fit the pattern of outcomes in Figure 1.4. This would not be as concerning
if the findings were based on a larger sample of elections.
the researcher is able to manipulate the values of the independent variable com-
pletely independent of other potential influences. Suppose, for instance, that we
wanted to do an experimental study of retrospective voting in mayoral elections.
We could structure the experiment in such a way that all participants are given
the same information (background characteristics, policy positions, etc.) about
both candidates (the incumbent seeking reelection and their challenger). One-third
of the participants (Group A) would be randomly assigned to receive information
about positive outcomes during the mayor’s term (reduced crime rate, increased
property values, etc.); another third (Group B) would receive information about
negative outcomes (increased crime rate, decreased property values, etc.) during
the mayor’s term; and the remaining third of the respondents (Group C) would not
receive any information about local conditions during the mayor’s term. Suppose
that, after receiving the information treatments, participants in Group A (positive
information) give the mayor higher marks and are generally more supportive than
participants in Groups B (negative information) and C (no information), and that
members of Group C are more supportive of the mayor than members of Group B.
In that case, we could conclude that information about city conditions during the
mayor’s term caused these differences, because the only difference between the
three groups is whether they got positive, negative, or no information on local
conditions. In this example, the researcher
is able to manipulate the information about conditions in the city independent
of all other possible influences on the dependent variable.
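To make the random-assignment step concrete, here is a minimal R sketch; the number of participants and the object name are made up for illustration, and the study itself is hypothetical.
#Randomly assign 300 hypothetical participants to Groups A, B, and C
set.seed(123) #makes the random assignment reproducible
group <- sample(rep(c("A", "B", "C"), each = 100))
#Confirm that each group contains 100 participants
table(group)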
This is not to say that experimental data do not have serious drawbacks, es-
pecially when it comes to connecting the experimental evidence to real-world
politics. Consider, for instance, that the experimental scenario described above
bears very little resemblance to the way voters encounter candidates and campaigns
in real-world elections. However, within the confines of the experiment
itself, any differences in outcomes between Group A and Group B can be at-
tributed to the difference in how city-level conditions were presented to the two
groups.
The most important thing to remember when discussing linkages between vari-
ables is to be careful about the language you use to describe those relationships.
What this means is to understand the limits of what you can say regarding the
causal mechanisms, while at the same time speaking confidently about what
you think is going on in the data.
that we are not manipulating the independent variable. Instead, we are measuring evaluations
of the economy as they exist.
(as is usually the case for challenging party partisans) and most Republicans
reported positive evaluations of the economy (as is usually the case for in-party
partisans), and at the same time, Democrats voted overwhelmingly for Biden,
as Republicans did for Trump, on the basis of their partisan ties.
So, the issue is that the observed relationship between economic evaluations
and vote choice might be reflecting the influence of party identification on both
variables rather than the direct effect of the economic evaluations. This prob-
lem is endemic to much of social science research and needs to be addressed
head-on, usually by incorporating potentially confounding variables into the
analysis. Methods for addressing this issue are addressed at greater length in
later chapters.
Theoretical grounding. Are there strong theoretical reasons for believing
that X causes Y? This takes us back to the earlier discussion of the importance
of having clear expectations and a sound rationale for pursuing your line of
research. This is important because variables are sometimes related to each
other coincidentally: they may satisfy the time-order criterion, and the relationship
may even persist when controlling for other variables. But if the relationship
is nonsensical, or at least seems like a real theoretical stretch, then it should
not be assigned any causal significance. In the case of economic evaluations
influencing vote choice, the hypothesis is on a strong theoretical footing.
Even if all of these conditions are met, it is important to remember that these are
only necessary conditions. Satisfying these conditions is not a sufficient basis
for making causal claims. Causal inferences must be made very cautiously,
especially in a non-experimental setting. It is best to demonstrate an awareness
of this by being cautious in the language you use.
Protestant
Catholic
Other Christian
Jewish
Other Religion
No Religion
Of course, we are interested in more than these six categories, but we’ll leave
it like this for now. The key thing is that as you move from one category to
the next, you find different types of religion but not any sort of quantifiable
difference in the labels used. For instance, “Protestant” is the first category
and “Catholic” is the second, but we wouldn’t say “Catholic” is twice as much
as “Protestant”, or one more unit of religion than “Protestant.” Nor would we
say that “Other Religion” (the fifth category listed) is one unit more of religion
than “Jewish”, or one unit less than “No Religion.” These sorts of statements
just don’t make sense, given the nature of this variable. One way to appreciate
the non-quantitative essence of these types of variables is to note that the infor-
mation conveyed in this variable would not change, and would be just as easy
to understand, if the categories were listed in a different order. Suppose “Catholic”
switched places with “Protestant”, and “Jewish” with “Other Christian,” as
shown below. Doing so does not really affect how we react to the information
we get from this variable.
Catholic
Protestant
Jewish
Other Christian
Other Religion
No Religion
fication, you can think of the categories as growing more Republican (and less
Democratic) as you move from “Democrat” to “Independent” to “Republican.”
Likewise, for ideology, categories grow more conservative (and less liberal) as
you move from “Liberal” to “Moderate” to “Conservative”.
Party ID Ideology
Democrat Liberal
Independent Moderate
Republican Conservative
Both nominal and ordinal variables are also referred to as categorical vari-
ables, emphasizing the role of labeled categories rather than numeric outcomes.
3. Interval and ratio level variables are the most quantitative in nature and
have numeric values rather than category labels. This means that the outcomes
can be treated as representing objective quantitative values, and equal numeric
differences between categories have equal quantitative meaning. A true interval
scale has an arbitrary zero point; in other words, zero does not mean “none” of
whatever is being measured. The Fahrenheit thermometer is an example of this
(zero degrees does not mean there is no temperature).3 Due to the arbitrary
zero point, interval variables cannot be used to make ratio statements. For
instance, it doesn’t make sense to say that a temperature of 40 degrees is twice
as warm as that of 20 degrees! But it is 20 degrees warmer, and that 20 degree
difference has the same quantitative meaning as the difference between 40 and
60 degrees. Ratio-level variables differ from interval-level variables in that they have a
genuine zero point. Because zero means none, ratio statements can be made
about ratio-level variables. For instance, 20% of the vote is half the size of 40%
of the vote. For all practical purposes, other than making ratio statements, we
can lump ratio and interval data together. Interval and ratio variables are also
referred to as numeric variables.
3 Other examples of interval-level measures with arbitrary zero points include SAT scores, credit scores, and IQ scores.
To continue with the example of measuring religiosity, you might opt to ask
survey respondents how many days a week they usually say at least one prayer.
In this case, the response would range from 0 to 7 days, giving us a ratio-
level measure of religiosity. Notice the type of language you can use when
talking about this variable that you couldn’t use when talking about nominal
and ordinal variables. People who pray three days a week pray two more days
a week than those who pray one day a week and half as many days a week as
someone who prays six days a week.
Divisibility of Data. It is also possible to distinguish between variables based
on their level of divisibility. A variable whose values are finite and cannot be
subdivided is a discrete variable. Nominal and ordinal variables are always dis-
crete, and some interval/ratio variables are discrete (number of siblings, number
of political science courses, etc.). A variable whose values can be infinitely sub-
divided is a continuous variable (time, weight, height, temperature, % voter
turnout, % Republican vote, etc). Only interval/ratio variables can be contin-
uous, though they are not always. The table below helps organize information
on levels of measurement and divisibility.
Table 1.4. Levels of Measurement and Divisibility of Data
Divisibility
Level of Measurement Discrete Continuous
Nominal Yes No
Ordinal Yes No
Interval Yes Yes
Ratio Yes Yes
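Although R is not introduced until Chapter 2, a brief sketch shows how these distinctions typically look in practice: nominal and ordinal variables are usually stored as unordered and ordered factors, while interval/ratio variables are stored as numbers. The variable names below are made up for illustration.
#Nominal: religious affiliation (unordered categories)
religion <- factor(c("Protestant", "Catholic", "Jewish", "No Religion"))
#Ordinal: ideology (categories with a meaningful order)
ideology <- factor(c("Liberal", "Moderate", "Conservative"),
                   levels = c("Liberal", "Moderate", "Conservative"),
                   ordered = TRUE)
#Ratio: days per week a respondent prays (a genuine zero point)
pray_days <- c(0, 3, 7, 1)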
65% of Latino voters, 61% of Asian-American voters, and 41% of white voters).
Based on this strong pattern among individuals, one might expect to find a
similar pattern between the size of the black population and support for Biden
among the states. However, this inference is completely at odds with the state-
level evidence: there is no relationship between the percent African-American
and the Biden percent of the two-party vote among the states (the correlation is
.009), largely because the greatest concentration of black voters is in conserva-
tive southern states. It is also possible that you could start with the state-level
finding and erroneously conclude that African-Americans were no more or less
likely than others to have voted for Biden in 2020, even though the individual-
level data show high levels of support for Biden among African-American voters.
This type of error is usually referred to as an error resulting from the ecological
fallacy, which can occur when making inferences about behavior at one level of
analysis based on findings from another level. Although the ecological fallacy is
most often thought of in the context of making individual-level inferences from
aggregate patterns, it is generally unwise to make cross-level inferences in either
direction. The key point here is to be careful of the language you use when
interpreting the findings of your research.
1.8 Exercises
1.8.1 Concepts and Calculations
1. Identify the level of measurement (nominal, ordinal, interval/ratio) and
divisibility (discrete or continuous) for each of the following variables.
• Course letter grade
• Voter turnout rate (%)
• Marital status (Married, divorced, single, etc)
• Occupation (Professor, cook, mechanic, etc.)
• Body weight
• Total number of votes cast
• #Years of education
• Subjective social class (Poor, working class, middle class, etc.)
• % below poverty level income
• Racial or ethnic group identification
2. For each of the pairs of variables listed below, designate which one you
think should be the dependent variable and which is the independent
variable. Give a brief explanation.
• Poverty rate/Voter turnout
• Annual Income/Years of education
• Racial group/Vote choice
• Study habits/Course grade
• Average Life expectancy/Gross Domestic Product
• Social class/happiness
3. Assume that the topics listed below represent different research topics and
classify each of them as either a “political” or “social” topic.
Justify your classification. If you think a topic could be classified either
way, explain why.
• Marital Satisfaction
• Racial inequality
• Campaign spending
• Welfare policy
• Democratization
• Attitudes toward abortion
• Teen pregnancy
4. For each of the pairs of terms listed below, identify which one represents
a concept, and which one is an operational variable.
• Political participation/Voter turnout
• Annual income/Wealth
• Restrictive COVID-19 Rules/Mandatory masking policy
• Economic development/Per capita GDP
5. What is a broad social or political subject area that interests you? Within
that area, what is a narrower topic of interest? Given that topic of interest,
what is a research question you think would be interesting to pursue?
Finally, share a hypothesis related to the research question that you think
would be interesting to test.
6. A researcher asked people taking a survey if they were generally happy
or unhappy with the way life was going for them. They repeated this
question in a followup survey two weeks later and found that 85% of re-
spondents provided the same response. Is this a demonstration of validity
or reliability? Why?
7. In an introductory political science class, there is a very strong relationship
between accessing online course materials and course grade at the end of
the semester: Generally, students who access course material frequently
tend to do well, while those who don’t access course material regularly tend
to do poorly. This leads to the conclusion that accessing course material
has a causal impact on course performance. What do you think? Does
this relationship satisfy the necessary conditions for establishing causality?
Address this question using all four conditions.
8. A scholar is interested in examining the impact of electoral systems on
levels of voter turnout using data from a sample of 75 countries. The
primary hypothesis is that voter turnout (% of eligible voters who vote)
is higher in electoral systems that use proportional representation than in
majoritarian/plurality systems.
• Is this an experimental study or an observational study?
• What is the dependent variable, and what is its level of measurement?
• What is the independent variable, and what is its level of measure-
ment?
• What is the level of analysis for this study?
Chapter 2
Using R to Do Data
Analysis
R is used throughout this book to generate virtually all statistical results and
graphics. In addition, as an instructional aid, in most cases, the code used
to generate graphics and statistical results is provided in the text associated
with those results. This sample code can be used by students to solve the R-
based problems at the end of the chapters, or as a guide to other problems and
homework that might be assigned.
2.1 Accessing R
One important tool that is particularly useful for new users of R is RStudio, a
graphical user interface (GUI) that facilitates user→R interactions. If you have
reliable access to computer labs on your college or university campus, there’s a
good chance that R and RStudio are installed and available for you to use. If
so, and if you don’t mind having access only in the labs, then great, there’s no
need to download anything. That said, one of the nice things about using R
is that you can download it (and RStudio) for free and have instant, anytime
access on your own computer.
Instructions for downloading and installing R and RStudio are given below,
but I strongly recommend using the cloud-based version of RStudio, found at
Posit.cloud. This is a terrific online option for doing your work in the RStudio
environment without having to download R and RStudio to your personal de-
vice. This is an especially attractive alternative if your personal computer is a
bit out of date or under powered. Posit.cloud can be used by individuals for
free or at a very low cost, and institutions can choose from a number of low cost
options to set up free access for their students. In addition, instructors who set
up a class account on RStudio Cloud can take advantage of a number of options
that facilitate working with their students on assignments.
Downloading and installing R is at once pretty straightforward and at the same
time a point in the process where new users encounter a few problems. The
first step (#1 in Figure 2.1) is to go to https://www.r-project.org and click on
the CRAN (Comprehensive R Archive Network) link on the left side of the web
page. This takes you to a collection of local links to CRAN mirrors, arranged
by country, from which users can download R. You should select a CRAN that
is geographically close to you. (Step 2 in Figure 2.1).
On each local CRAN there are separate links for the Linux, macOS, and Win-
dows operating systems. Click on the link that matches your operating system
(Step 3 in Figure 2.1). If you are using a Windows operating system, you then
click on Base and you will be sent to a download link; click the download link
and install the program. If you are using macOS, you will be presented with
download choices based on your specific version of macOS; choose the download
option whose system requirements match your operating system, and download
as you would any other program from the web. For Linux, choose the distribu-
tion that matches your system and follow the installation instructions.
While you can run R on its own, it is strongly recommended that you use RStu-
dio as the working environment. If you opt not to use the cloud-based ver-
sion of RStudio, go to the RStudio Download Page web page (https://posit.
co/download/rstudio-desktop/), which should look like Figure 2.2. If you have
not already installed R, you can do so by clicking the DOWNLOAD AND
INSTALL R link to install it and go through the R installation steps described
above. If you have already installed R, click the DOWNLOAD RSTUDIO
DESKTOP button to install RStudio.
slight differences, but not enough to matter for learning about the RStudio
environment.
You will use the Source window a lot. In fact, almost everything you do will
originate in the Source window. There is also a lot going on in the other windows
but, generally, everything that happens (well, okay, most things) in the other
windows happens because of what you do in the Source window. This is the
window where you write your commands (the code that tells R what to do).
Some people refer to this as the Script window because this is where you create
the “script” file that contains your commands. Script files are very useful. They
help you stay organized and can save you a lot of time and effort over the course
of a semester or while you are working on a given research project.
Here’s how the script file works. You start by writing a command at the cursor.
You then leave the cursor in the command line and click on the “Run” icon
(upper-right part of the window). Alternatively, you can leave the cursor in the
command line and press Control (CTRL) + Return (Enter) on your keyboard.
When you do this, you will see that the command is copied and processed at
the > prompt in the Console window. As indicated above, what you do in
the Source window controls what happens in the other windows (mostly). You
control what is executed in the Console window by writing code in the Source
window and telling R to run the code. You can also just write the command at
the prompt in the Console window, but this tends to get messy and it is easier
to write, edit, and save your commands in the Source window.
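For example, a simple first command to type in the Source window is a basic calculation:
#Use R as a calculator
2 + 2
When you run this line, the command is echoed at the > prompt in the Console window, followed by the result, [1] 4.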
Once you execute the command, the results usually appear in one of two places,
either the Console window (lower left) or the “Plots” tab in the lower right
window. If you are running a statistical function, the results will appear in
the Console window. If you are creating a graph, it will appear in the Plots
window. If your command has an error in it, the error message will appear
in the Console window. Anytime you get an error message, you should edit
the command in the source window until the error is corrected, and make sure
the part of the command that created the error message is deleted. Once you
are finished running your commands, you can save the contents of the Source
window as a script file, using the “save” icon in the upper-left corner of the
window. When you want to open the script file again, you can use the open
folder icon in the toolbar just above the Source window to see a list of your files.
It is a good idea to give the script file a substantively meaningful name when
saving it, so you know which file to access for later uses.
Creating and saving script files is one of the most important things you can do to
keep your work organized. Doing so allows you to start an assignment, save your
work, and come back to it some other time without having to start over. It’s
also a good way to have access to any previously used commands or techniques
that might be useful to you in the future. For instance, you might be working
on an assignment from this book several weeks down the road and realize that
you could use something from the first couple of weeks of the semester to help
you with the assignment. You can then open your earlier script files to find an
example of the technique you need to use. It is probably the case that most
people who use R rely on this type of recovery process–trying to remember how
to do something, then looking back through earlier script files to find a useful
example–at least until they have acquired a high level of expertise. In fact, the
writing of this textbook relied on this approach a lot!
The upper-right window in RStudio includes a number of different tabs, the most
important of which for our purposes is the Environment tab, which provides a
list of all objects you are using in your current session. For instance, if you
import a new data set it will be listed here. Similarly, if you store the results
of some analysis in a new object (this will make more sense later), that object
will be listed here. There are a couple of options here that can save you from
having to type in commands. First, the open-folder icon opens your working
directory (the place where R saves and retrieves files) so you can quickly add
any pre-existing R-formatted data sets to your environment. You can also use
the Import Dataset icon to import data sets from other formats (SPSS, Stata,
SAS, Excel) and convert them to data sets that can be used in R. The History
tab keeps a record of all commands you have used in your R session. This is
similar to what you will find in the script file you create, except that it tends
to be a lot messier and includes lines of code that you don’t need to keep, such
as those that contain errors, or multiple copies of commands that you end up
using several times.
The lower-right window includes a lot of useful options. The Plots tab probably
is the most often used tab in the lower-right window. It is where graphs appear
when you run commands to create them. This tab also provides options for
exporting the graphs. The Files tab includes a list of all files in your working
directory. This is where your script file can be found after you’ve saved it, as
well as any R-generated files (e.g., data sets, graphs) you’ve saved. If
you click on any of the files, they will open in the appropriate window. The
Packages tab lists all installed packages (bundles of functions and objects you
can use–more on this later) and can be used to install new packages and activate
existing packages. The Help tab is used to search for help with R commands
and packages. It is worth noting that there is a lot of additional online help to
be had with simple web searches. The Viewer tab is unlikely to be of much use
to new R users.
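For example, installing and then loading a package, such as readxl (the package that provides the read_excel command mentioned later in this chapter), takes just two commands. You only need to install a package once, but it must be loaded with library() in each new R session.
#Install the readxl package (a one-time step)
install.packages("readxl")
#Load the package so its functions are available in the current session
library(readxl)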
This overview of the RStudio setup will be most useful if you jump
right in and get started. As with learning anything new, the best way to learn
is to practice, make mistakes, and practice some more. There will be errors,
and you can learn a lot by figuring out how to correct them.
process–they’re given some data and told what commands to use and how the
results should be interpreted, but without really understanding the process that
ties things together. Understanding how things fit together is important because
it helps reduce user anxiety and facilitates somewhat deeper learning.
The five panes of the R and Amazing College Student comic strip (below)
summarize how the researcher is connected to both R and the data, and
how these connections produce results. The first thing to realize is that R
does things (performs operations) because you tell it to. Without your ideas
and input R does nothing, so you are in control. It all starts with you, the
researcher. Keep this in mind as you progress through this book.
Pane 1: You are in Charge
You, the college student, are sitting at your desk with your laptop, thinking
about things that interest you (first pane). Maybe you have an idea about
how two variables are connected to each other, or maybe you are just curious
about some sort of statistical pattern. In this scenario, let’s suppose that you
are interested in how the states differ from each other with respect to presiden-
tial approval in the first few months of the Biden administration. No doubt,
President Biden is more popular in some states than in others, and it might
be interesting to look at how states differ from each other. In actuality, you
are probably interested in discovering patterns of support and testing some the-
ory about why some states have high approval levels and some states have low
approval levels. For now, though, let’s just focus on the differences in approval.
Odds are that you don’t have this information at your fingertips, so you need
to go find it somewhere. You’re still sitting with your laptop, so you decide to
search the internet for this information (second comic pane).
Pane 2: Get Some Data
You find the data at civiqs.com, an online public opinion polling firm. After
digging around a bit on the civiqs.com site, you find a page that has information
that looks something like Figure 2.5.
For each state, this site provides an estimate of the percent who approve, dis-
approve, and who neither approve nor disapprove of the way President Biden is
handling his job as president, based on surveys taken from January to June of
2021. This is a really nice table but it’s still a little bit hard to make sense of
the data. Suppose you want to know which states have relatively high or low
levels of approval. You could hunt around through this alphabetical listing and
try to keep track of the highest and lowest states, or you could use a program
like R to organize and process data for you. To do this, you need to download
the data to your laptop and put it into a format that R can use. (Second comic
pane). In this case, that means copying the data into a spreadsheet that can be
imported to R (this has been done for you, saved as “Approve21.xlsx”), or using
more advanced methods to have R go right to the website and grab the relevant
data (beyond the scope of what we are doing). Once the data are downloaded,
you are ready to use R.
In highlighted segments like the one shown above, the italicized lines that start
with # are comments or instructions to help you understand what’s going on,
and the other lines are commands that tell R what to do. You should use this
information to help you better understand how to use R.
If you use the Import Dataset tab, the commands will use the read_excel
command to convert the Excel file into a data set that R can use. The first
thing you will see after importing the data is a spreadsheet view of the data
set. Exit this window by clicking on the x on the right side of the spreadsheet
view tab. Some of the information we get with R commands below can also be
obtained by inspecting the spreadsheet view of the data set, but that option
doesn’t work very well for larger data sets.
1 If clicking the link doesn’t work, copy and paste the following link to a browser: https://www.dropbox.com/sh/le8u4ha8veihuio/AAD6p6RQ7uFvMNXNKEcU__I7a?dl=0
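If you prefer to type the import commands yourself, they look something like the sketch below, assuming the readxl package is installed and Approve21.xlsx is saved in your working directory; the exact file path generated by the Import Dataset tool may differ.
#Load the readxl package, which provides the read_excel() function
library(readxl)
#Import the Excel file and store it as a data set named Approve21
Approve21 <- read_excel("Approve21.xlsx")
#Open the spreadsheet view of the new data set
View(Approve21)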
At this point, in RStudio you should see that the import commands have been
executed in the Source window. Also, if you look at the Environment/History
window, you should see that a data set named Approve21 (the civiqs.com data)
has been added. This means that this data set is now available for you to use.
Usually, before jumping into producing graphs or statistics, we want to examine
the data set to become more familiar with it. This is also an opportunity to
make sure there are no obvious problems with the data. First, let’s check the
dimensions of the data set (how many rows and columns):
#Remember, `Approve21` is the name of the data set
#Get dimensions of the data set
dim(Approve21)
[1] 50 5
Ignoring “[1]”, which just tells you that this is the first line of output, the first
number produced by dim() is the number of rows in the data set, and the
second number is the number of columns. The number of rows is the number of
cases, or observations, and the number of columns is the number of variables.
According to these results, this data set has 50 cases and five variables. (If you
look in the Environment/History window, you should see this same information
listed with the data set as “50 obs. of 5 variables.”)
Let’s take a second to make sure you understand what’s going on here. As
illustrated in the comic pane below, you sent requests to R in RStudio (“Hey
R, use Approve21 as the data set, and tell me its dimensions”) and R sent
information back to you. Check in with your instructor if this doesn’t make
sense to you.
Does 50 rows and five columns make sense, given what we know about the data?
It makes sense that there are 50 rows, since we are working with data from the
states, but if you compare this to the original civiqs.com data in Figure 2.5,
it looks like there is an extra column in the new data set. To verify that the
variables and cases are what we think they are, we can tell R to show us some
more information about the data set. First, we can have R show the names of
the variables (columns) in the data set:
#Get the names of all variables in "Approve21"
names(Approve21)
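[1] "state"      "stateab"    "Approve"    "Disapprove" "Neither"
#List the first several rows of data
head(Approve21)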
# A tibble: 6 x 5
state stateab Approve Disapprove Neither
<chr> <chr> <dbl> <dbl> <dbl>
1 Alabama AL 31 63 6
2 Alaska AK 41 53 6
3 Arizona AZ 44 51 5
4 Arkansas AR 31 63 6
5 California CA 57 35 8
6 Colorado CO 50 44 7
Here, you see the first six rows and learn that state uses full state names,
stateab is the state abbreviation, and the other variables are expressed as
whole numbers. Note that this command also gives you information on the
level of measurement for each variable: “chr” means the variable is a character
variable, and “dbl” means that the variable is a numeric variable that could
have a decimal point (though none of these variables do).
We can also look at the bottom six rows:
#List the last several rows of data
tail(Approve21)
# A tibble: 6 x 5
state stateab Approve Disapprove Neither
<chr> <chr> <dbl> <dbl> <dbl>
1 Vermont VT 58 36 7
2 Virginia VA 48 46 6
3 Washington WA 53 39 7
4 West Virginia WV 23 72 5
5 Wisconsin WI 47 48 5
6 Wyoming WY 26 68 5
If we pay attention to the values of Approve and Disapprove in these two short
lists of states, we see quite a bit of variation in the levels of support for President
Biden.
We still use the head() and tail() commands but add the order() command to get a
listing of the lowest and highest approval states.
#Sort by "Approve" and copy over original data set
Approve21<-Approve21[order(Approve21$Approve),]
#list the first several rows of data, now ordered by "Approve"
head(Approve21)
Here, we tell R to sort the data set from lowest to highest levels of approval
(Approve21[order(Approve21$Approve),]) and replace the original data set
with the sorted data set.
# A tibble: 6 x 5
state stateab Approve Disapprove Neither
<chr> <chr> <dbl> <dbl> <dbl>
1 West Virginia WV 23 72 5
2 Wyoming WY 26 68 5
3 Oklahoma OK 29 65 6
4 Idaho ID 30 65 5
5 Alabama AL 31 63 6
6 Arkansas AR 31 63 6
This shows us the first six states in the data set, those with the lowest levels
of presidential approval. Our initial impression from the alphabetical list was
verified: President Biden is given his lowest marks in states that are usually
considered very conservative. There is also a bit of a regional pattern, as these
are all southern or mountain west states.
Now, let’s look at the states with the highest approval levels by looking at the
last six states in the data set. Notice that you don’t have to sort the data set
again because the earlier sort changed the order of the observations, and that
order persists until the data set is sorted again.
#list the last several rows of data, now ordered by "Approve"
tail(Approve21)
# A tibble: 6 x 5
state stateab Approve Disapprove Neither
<chr> <chr> <dbl> <dbl> <dbl>
1 California CA 57 35 8
2 Rhode Island RI 57 37 7
3 Vermont VT 58 36 7
4 Maryland MD 59 33 8
5 Hawaii HI 61 33 6
6 Massachusetts MA 63 30 7
Again, our initial impression is confirmed. President Biden gets his highest
marks in states that are typically considered fairly liberal states. There is also
an east coast and Pacific west flavor to this list of high approval states.
Usually, data sets have many more than five variables, so it makes sense to just
show the values of variables of interest, in this case state (or stateab) and
Approve, instead of the entire data set. To do this, you can modify the head
and tail commands to designate that certain columns be displayed (shown
below with head).
#list the first several rows of data for "state" and "Approve"
head(Approve21[c('state', 'Approve')])
# A tibble: 6 x 2
state Approve
<chr> <dbl>
1 West Virginia 23
2 Wyoming 26
3 Oklahoma 29
4 Idaho 30
5 Alabama 31
6 Arkansas 31
Here, the c('state', 'Approve') portion is used to tell R to combine the
columns headed by state and Approve from the data set Approve21.
If you only want to know the names of the highest and lowest states listed in
order of approval, and don't need to see the values of Approve, you can just use
Approve21$state in the head and tail commands, again assuming the data
set is ordered by levels of approval (shown below with tail).
#List the state names for last several rows of data
tail(Approve21$state)
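Based on the sorted listing above, the output should look something like this:
[1] "California"    "Rhode Island"  "Vermont"       "Maryland"      "Hawaii"        "Massachusetts"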
The listing of states looks a bit different, but they are still listed in order, with
Massachusetts giving Biden his highest approval rating.
Get a Graph. These few commands have helped you get more familiar with
the shape of the data set, the nature of the variables, and the types of states
at the highest and lowest end of the distribution, but it would be nice to have
a more general sense of the variation in presidential approval among the states.
For instance, how are states spread out between the two ends of the distribution?
What about states in between the extremes listed above? You will learn about
a number of statistical techniques that can give you information about the
general tendency of the data, but graphing the data as a first step can also be
very effective. Histograms are a particularly helpful type of graph for this
purpose. To get a histogram, you need to tell R to get the data set, use a
particular variable, and summarize its values in the form of a graph (as in Pane
4).
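In its most basic form, the command simply names the data set and variable, and R supplies a default title and axis labels; something like this:
#Histogram of state-level approval, with default labels
hist(Approve21$Approve)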
[Figure: "Histogram of Approve21$Approve", with Frequency on the vertical axis and Approve21$Approve (roughly 20 to 65) on the horizontal axis]
Here, the numbers on the horizontal axis represent approval levels, and the
values on the vertical axis represent the number of states in each category.
Each vertical bar represents a range of values on the variable of interest. You’ll
learn more about histograms in the next few chapters, but you can see in this
graph that there is quite a bit of variation in presidential approval, ranging from
20-25% at the low end to 60-65% at the high end (as we saw in the sorted lists),
with the most common outcomes falling between 35% and 55%.
One of the things you can do in R is modify the appearance of graphs to make
them look a bit nicer and easier to understand. This is especially important if
the information is being shared with people who are not as familiar with the
data as the researcher is. In this case, you modify the graph title (main), and
the horizontal (xlab) and vertical (ylab) labels:
# Axis labels in quotes and commands separated by commas
hist(Approve21$Approve,
main="State-level Biden Approval January-June 2021", #Graph title
ylab="Number of States", #vertical axis label
xlab="% Approval") #Horizontal axis label
[Figure: "State-level Biden Approval January-June 2021" histogram, x-axis "% Approval" (20 to 60), y-axis "Number of States"]
Note that all of the new labels are in quotes and the different parts of the
command are separated by commas. You will get an error message if you do
not use quotes and commas appropriately.
Making these slight changes to the graph improved its appearance and provided
additional information to facilitate interpretation. Going back to the illustrated
panes, the cartoon student asked R for a histogram of a specific variable from a
specific data set that they had previously told R to load, and R gave them this
nice looking graph. That, in a nutshell, is how this works. The cartoon student
is happy and ready to learn more!
Pane 5: Making Progress
Before moving on, let’s check in with RStudio to see what the various windows
look like after running all of the commands listed above (Figure 2.6). Here, you
can see the script file (named “chpt2.R”) that includes all of the commands run
above. This file is now saved in the working directory and can be used later.
The Environment window lists the Approve21 data set, along with a little bit
of information about it. The Console window shows the commands that were
executed, along with the results for the various data tables we produced. Finally,
the Plot window in the lower-right corner shows the histogram produced by the
histogram command.
A data frame is essentially a collection of vectors (columns) of equal length. One
way to get a feel for this is to create a few vectors of data and then join all of
those vectors using the data.frame function to create a data set.
Let's use the first four rows of data from Figure 2.7 to demonstrate this. The
first variable is "state", so we can begin by creating a vector named state that
includes the first four state names in order.
#create vector of state names
state<-c("Alabama", "Alaska", "Arizona", "Arkansas")
#Print (show) state names
state
[1] "Alabama"  "Alaska"   "Arizona"  "Arkansas"
#Create vector of state abbreviations
stateab<-c("AL","AK","AZ","AR")
#Print (show) state abbreviations
stateab
[1] "AL" "AK" "AZ" "AR"
#Create vector of approval values
Approve<-c(31,41,44,31)
#Print (show) approval values
Approve
[1] 31 41 44 31
Disapprove<-c(63,53,51,63)
#Print (show) disapproval values
Disapprove
[1] 63 53 51 63
Neither<-c(6,6,5,6)
#Print (show) neither values
Neither
[1] 6 6 5 6
Now, to create a data frame, we just use the data.frame function to stitch these
columns together:
#Combine separate arrays into a single data frame
approval<-data.frame(state, stateab, Approve,Disapprove, Neither)
#Print the data frame
approval
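The printed data frame should look something like this:
     state stateab Approve Disapprove Neither
1  Alabama      AL      31         63       6
2   Alaska      AK      41         53       6
3  Arizona      AZ      44         51       5
4 Arkansas      AR      31         63       6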
Now you have a small data set with the same information found in the first few
rows of data in Figure 2.7. Hopefully, this illustration helps you connect with the
concept of a data frame. Fortunately, researchers don't usually have to input
data in this manner, but doing so occasionally is a useful way to reinforce what
a data frame is.
Objects and Functions When you use R, it typically is using some function
to do something with some object. Everything that exists in R is an object.
Usually, the objects of interest are data frames (data sets), packages, functions,
or variables (objects within the data frame). We have already seen many ref-
erences to objects in the preceding pages of this chapter. When we imported
the presidential approval data, we created a new object. Even the tables we
created when getting information about the data set could be thought of as
objects. Functions are a particular type of object. You can think of functions
as commands that tell R what to do. Anything R does, it does with a function.
Those functions tell R to do something with an object.
The code we used to create the histogram also includes both a function and an
object:
hist(Approve21$Approve,
main="State-level Biden Approval January-June 2021",
xlab="% Approval", col="gray")
In this example, hist is a function call (telling R to use a function called “hist”)
and Approve21$Approve is an object. The function hist is performing opera-
tions on the object Approve21$Approve. Other “action” parts of the function
are main, xlab, and col, which set the main title, the horizontal axis label, and
the color of the bars. Finally, the graph created by this command is a
new object that appears in the “Plots” window in RStudio.
We also just used two functions (c and data.frame) to create six objects:
the five vectors of data (state, stateab, Approve, Disapprove, and Neither)
and the data frame where they were joined (approval).
Storing Results in New Objects One of the really useful things you can do
in R is create new objects, either to store the results of some operation or to
add new information to existing data sets. For instance, at the very beginning
of the data example in this chapter, we imported data from an Excel file and
stored it in a new object named Approve21 using this command:
Approve21 <- read_excel("Approve21.xlsx")
This new object was used in all of the remaining analyses. You will see objects
used a lot for this purpose in the upcoming chapters. We also updated an object
when we sorted Approve21 and replaced the original data with sorted data.
We can also modify existing data sets by creating new variables based on trans-
formations of existing variables (more on this in Chapter 4). For instance,
64 CHAPTER 2. USING R TO DO DATA ANALYSIS
You might have noticed that I used <- to tell R to put the results of the cal-
culation into a new object (variable). Pay close attention to this, as you will
use the <- a lot. This is a convention in R that I think is helpful because it is
literally pointing at the new object and saying to take the result of the right side
action and put it in the object on the left. That said, you could write this using
= instead of <- (Approve21$Approve_prop=Approve21$Approve/100) and you
would get the same result. I favor using <- rather than =, but you should use
what you are most comfortable with.
You can check with the names function to see if the new variable is now part of
the data set.
names(Approve21)
[1] "state"        "stateab"      "Approve"      "Disapprove"   "Neither"
[6] "Approve_prop"
We can also list the values of the new variable (still sorted from low to high,
since the data set was sorted by Approve earlier):
Approve21$Approve_prop
[1] 0.23 0.26 0.29 0.30 0.31 0.31 0.31 0.32 0.33 0.34 0.35 0.36 0.36 0.36 0.37
[16] 0.37 0.37 0.39 0.39 0.39 0.41 0.41 0.42 0.43 0.44 0.44 0.45 0.46 0.47 0.47
[31] 0.47 0.48 0.48 0.50 0.50 0.50 0.51 0.52 0.52 0.52 0.53 0.53 0.53 0.54 0.57
[46] 0.57 0.58 0.59 0.61 0.63
It looks like all of the values of Approve21$Approve_prop are indeed expressed
as proportions. The bracketed numbers do not represent values of the object.
Instead, they tell the order of the outcomes in the data set. For instance, [1] is
followed by .23, .26, and .29, indicating the value of the first observation is .23,
the second is .26, and the third is .29. Likewise, [16] is followed by .37, .37, and
.39, indicating the value for the sixteenth observation is .37, the seventeenth is
.37, and the eighteenth is .39, and so on.
Packages and Libraries A couple of other terms you should understand are
package and library. Packages are collections of functions, objects (usually data
sources), and code that enhance the base functions of R. For instance, the
function hist (used to create a histogram) is part of a package called graphics
that is automatically installed as part of the R base, and the function read_excel
(used to import the Excel spreadsheet) is part of a package called readxl. Once
you have installed a package, you should not have to reinstall it unless you start
an R session on a different device.
If you need to use a function from a package that is not already installed, you
use the following command:
install.packages("package_name")
You can also install packages using the “Packages” tab in the lower right window
in RStudio. Just go to the “Install” icon, search for the package name, and click
on “Install”. When you install a package, you will see a lot of stuff happening in
the Console window. Once the > prompt returns without any error messages,
the new package is installed. In addition to the core set of packages that most
people use, we will rely on a number of other packages in this book.
Each chapter in this book begins with a list of the packages that will be used
in the chapter, and you can install them as you work through the chapters, if
you wish. Alternatively, you can copy, paste, and run the code below to install
all of the packages used in this book in one fell swoop:
install.packages(c("dplyr","desc","descr","DescTools","effectsize",
"gplots","Hmisc","lmtest","plotrix","ppcor","readxl",
"sandwich","stargazer"))
If your instructor is using RStudio Cloud, one of the benefits of that platform is
that they can pre-load the packages, saving students the sometimes frustrating
process of package installation.
Libraries are where the packages are stored for active use. Even if you have
installed all of the required packages, you generally can’t use them unless you
have loaded the appropriate library.
#Note that you do not use quotation marks for package names here.
library(package_name)
You can also load a library by going to the “Packages” tab in the lower-right
window and clicking on the box next to the already installed package. This
command makes the package available for you to use. I used the command
library(readxl) before using read_excel() to import the presidential approval
data. Doing this told R to make the functions in readxl available for me to
use. If I had not done this, I would have gotten an error message something
like "Error: could not find function read_excel." You will also get this error
if you misspell the function name (happens a lot). Installing the packages and
attaching the libraries allows us to access and use any functions or objects
included in these packages.
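Putting the two steps together for a single package looks like this (using descr, one of the packages relied on later in the book):
#One-time installation (per device)
install.packages("descr")
#Load the package at the start of each session in which you use it
library(descr)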
Working Directory When you create files, ask R to use files, or import files
to use in R, they are stored in your working directory. Organizationally, it is
important to know where this working directory is, or to change its location
so you can find your work when you want to pick it up again. By default, the
directory where you installed RStudio is designated the working directory. In
R, you can check on the location of the working directory with the getwd()
command. I get something like the following when I use this command:
getwd()
[1] "/Users/username/Dropbox/foldername"
In this case, the working directory is in a Dropbox folder that I use for putting
together this book (I modified the path here to keep it simple). If I wanted to
change this to something more meaningful and easier to remember, I could use
the setwd() function. For instance, as a student, you might find it useful to
create a special folder on your desktop for each class.
# Here, the working directory is set to a folder on my desktop,
# named for the data analysis course I teach.
setwd("/Users/username/Desktop/PolSci390")
I’m fine with the current location of the working directory, so I won’t change
it, but you should make sure that yours is set appropriately. The nice thing
about keeping all of your work (data files, script files, etc.) in the working
directory is that you do not have to include the sometimes cumbersome file
path names when accessing the files. So, instead of having to write something
like read_excel("/Users/username/Desktop/PolSci390/Approve21.xlsx"), you
just have to write read_excel("Approve21.xlsx"). If you need to use files that are
not stored in your working directory and you don't remember their location, you
can use the file.choose() command to locate them on your computer. The
output from this command will be the correct path and file name, which you
can then copy and paste into R.
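For example, running it with no arguments opens a file browser and prints the full path of whatever file you select:
#Opens an interactive file browser; returns the selected file's full path
file.choose()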
You can also save the Approve21 data set as an R data file so you won't need
to import the Excel version of this data set when you want to use it again.
The command for saving an object as an R data file is save(objectname,
file="filename.rda"). Here, the object is Approve21 and I want to keep that
name as the file name, so I use the following command:
save(Approve21, file="Approve21.rda")
This will save the file to your working directory. If you want to save it to another
directory, you need to add the file path, something like this:
save(Approve21, file="~/directory/Approve21.rda")
Now, when you want to use this file in the future, you can use the load()
command to access the data set.
load("Approve21.rda")
Copy and Paste Results Finally, when you are doing R homework, you
generally should show all of the final results you get, including the R code you
used, unless instructed to do otherwise. This means you need to copy and paste
the findings into another document, probably using Microsoft Word or some
similar product. When you do this, you should make sure you format the pasted
output using a fixed-width font (“Courier”, “Courier New”, and “Monaco” work
well). Otherwise, the columns in the R output will wander all over the place,
and it will be hard to evaluate your work. You should always check to make
sure the columns are lining up on paper the same way they did on screen. You
might also have to adjust the font size to accomplish this.
To copy graphs, you can click “Export” in the graph window and save your
graph as an image (make sure you keep track of the file location). Then, you
can insert the graph into your homework document. Alternatively, you can also
click “copy to clipboard” after clicking on “Export” and copy and paste the
graph directly into your document. You might have to resize the graphs after
importing, following the conventions used by your document program.
You may have to deal with other formatting issues, depending on what doc-
ument software you are using. The main thing is to always take time to make
sure your results are presented as neatly as possible. In most cases, it is a good
idea to include only your final results, unless instructed otherwise. It is usually
best to include error messages only if you were unable to get something to work
and you want to receive feedback and perhaps some credit for your efforts.
2.6 Exercises
2.6.1 Concepts and Calculations
1. The following lines of R code were used earlier in the chapter. For each
one, identify the object(s) and function(s). Explain your answer.
• sapply(Approve21, class)
• save(Approve21, file="Approve21.rda")
• Approve21$Approve_prop<-Approve21$Approve/100
• Approve21 <- read_excel("Approve21.xlsx")
2. In your own words, describe what a script file is and what its benefits are.
3. The following command saves a copy of the new states20 data set:
save(states20, file="states20.rda"). Is this file being saved to the
working directory or to another directory? How can you tell?
4. After using head(Approve21) to look at the first few rows of the Ap-
prove21 data set, I pasted the screen output to my MSWord document
and it looked like this:
state stateab Approve Disapprove Neither
Alabama AL 31 63 6
Alaska AK 41 53 6
Arizona AZ 44 51 5
Arkansas AR 31 63 6
California CA 57 35 8
Colorado CO 50 44 7
Why did I end up with these crooked columns that don’t align with the variable
names? How can I fix it?
5. I tried to load the anes20.rda data file using the following command:
load(Anes20.rda). It didn’t work, and I got the following message. Er-
ror in load(Anes20.rda) : object ‘Anes20.rda’ not found. What
did I do wrong?
2.6.2 R Problems
1. Load the countries2 data set and get the names of all of the variables
included in it. Based just on what you can tell from the variable names,
what sorts of variables are in this data set? Identify one variable that
looks like it might represent something interesting to study (a potential
dependent variable), and then identify another variable that you think
might be related to the first variable you chose.
2. Use the dim function to tell how many variables and how many countries
are in the data set.
3. Use the Approve21 data set and create a new object, Approve21$net_approve,
which is calculated as the percent in the state who approve of Biden's
performance MINUS the percent in the state who disapprove of Biden's
performance. Sort the data set by Approve21$net_approve and list the
six highest and lowest states. Say a few words about the types of states
in these two lists.
4. Produce a histogram of Approve21$net_approve and describe what you
see. Be sure to provide substantively meaningful labels for the histogram.
Chapter 3
Frequencies and Basic Graphs
If you get errors at this point, check to make sure the files are in your working
directory or that you used the correct file path; also make sure you spelled the
file names correctly and enclosed the file path and file name in quotes. Note
that <FilePath> is the place where the data files are stored. If the files are
stored in your working directory, you do not need to include the file path.
#If files are in your working directory, just:
load("anes20.rda")
load("states20.rda")
In addition, you should also load the libraries for descr and DescTools, two
packages that provide many of the functions we will be using. You may have to
install the packages (see Chapter 2) if you have not done so already.
library(descr)
library(DescTools)
3.2 Introduction
Sometimes, very simple statistics or graphs can convey a lot of information and
play an important role in the presentation of data analysis findings. Advanced
statistical and graphing techniques are normally required to speak with confi-
dence about data-based findings, but it is almost always important to start your
analysis with some basic tools, for two reasons. First, these tools can be used to
alert you to potential problems with the data. If there are issues with the way
the data categories are coded, or perhaps with missing data, those problems
are relatively easy to spot with some of the simple methods discussed in this
chapter. Second, the distribution of values (how spread out they are, or where
they tend to cluster) on key variables can be an important part of the story
told by the data, and some of this information can be hard to grasp when using
more advanced statistics.
This chapter focuses on using simple frequency tables, bar charts, and his-
tograms to tell the story of how a variable’s values are distributed. Two data
sources are used to provide examples in this chapter: a data set comprised of
selected variables from the 2020 American National Election Study, a large-
scale academic survey of public opinion in the months just before and after the
2020 U.S. presidential election (saved as anes20.rda), and a state-level data set
containing information on dozens of political, social, and demographic variables
from the fifty states (saved as states20.rda). In the anes20 data set, most vari-
able names follow the format “V20####”, while in the states20 data set the
variables are given descriptive names that reflect the content of the variables.
Codebooks for these variables are included in the appendix to this book.
[1] "1. Increased a lot" "2. Increased a little" "3. Kept the same"
[4] "4. Decreased a little" "5. Decreasaed a lot"
#If you get an error message, make sure you have loaded the data set.
3.3. COUNTING OUTCOMES 73
Here, you can see the category labels, which are ordered from “Increased a lot”
at one end, to “Decreased a lot” at the other. This is useful information but
what we really want to know is how many survey respondents chose each of
these categories. We can’t do this without using some function to organize and
count the outcomes for us. This is readily apparent when you look at the way
the data are organized, as illustrated below using just the first 50 out of over
8000 cases:
#Show the values of V201320x for the first 50 cases
anes20$V201320x[1:50]
In this form, it is difficult to make sense out of these responses. Does one
outcome seem like it occurs a lot more than all of the others? Are there some
outcomes that hardly ever occur? Do the outcomes generally lean toward the
“Increase” or “Decrease” side of the scale? You really can’t tell from the data
as they are listed, and these are only the first fifty cases. Having R organize
and tabulate the data provides much more meaningful information.
What you need to do is create a table that summarizes the distribution of re-
sponses. This table is usually known as a frequency distribution, or a frequency
table. The base R package includes a couple of commands that can be used for
this purpose. First, you can use table() to get simple frequencies:
#Create a table showing the how often each outcome occurs
table(anes20$V201320x)
#Store the total number of valid responses in a new object
sample_size<-sum(table(anes20$V201320x))
sample_size
[1] 8225
So now we know that there were 8225 total valid responses to the question on
aid to the poor, and 2560 of them favored increasing spending a lot. Now we
can start thinking about whether that seems like a lot of support, relative to
the sample size. So what we need to do is express the frequency of the category
outcomes relative to the total number of outcomes. These relative frequencies
are usually expressed as percentages or proportions.
Percentages express the relative occurrence of each value of x. For any given
category, this is calculated as the number of observations in the category, divided
by the total number of valid observations across all categories, multiplied by 100:

$$\text{Category Percent} = \frac{f_k}{n} \times 100$$

Where:

$f_k$ = frequency, or number of cases in any given category

$n$ = the number of valid cases from all categories
This simple statistic is very important for making relative comparisons. Percent
literally means per one-hundred, so regardless of overall sample size, we can look
at the category percentages and get a quick, standardized sense of the relative
number of outcomes in each category. This is why percentages are also referred
to as relative frequencies. In frequency tables, the percentages can range from
0 to 100 and should always sum to 100.
#Calculate Percent in "Increased a lot" category
2560/sample_size*100
[1] 31.12462
Proportions are calculated pretty much the same way, except without multiplying
by 100:
#Calculate Proportion in "Increased a lot" category
2560/sample_size
[1] 0.3112462
So we see that about 31% of all responses are in this category. What’s nice
about percentages and proportions is that, for all practical purposes, the values
have the same meaning from one sample to another. In this case, 31.1% (or
.311) means that slightly less than one-third of all responses are in this cate-
gory, regardless of the sample size. That said, their substantive importance can
depend on the number of categories in the variable. In the present case, there
are five response categories, so if responses were randomly distributed across
categories, you would expect to find about 20% in each category. Knowing this,
the outcome of 31% suggests that this is a pretty popular response category. Of
course, we can also just look at the percentages for the other response categories
to gain a more complete understanding of the relative popularity of the response
choices.
Fortunately, we do not have to make this calculation manually for every category. In-
stead, we can use the prop.table function to get the proportions for all five
categories. In order to do this, we need to store the results of the raw frequency
table in a new object and then have prop.table use that object to calculate
the proportions. Note that I use the extension “.tbl” when naming the new
object. This serves as a reminder that this particular object is a table. When
you execute commands such as this, you should see the new object appear in
the Global Environment window.
#Store the frequency table in a new object called "poorAid.tbl"
poorAid.tbl<-table(anes20$V201320x)
#Create a proportion table using the contents of the frequency table
prop.table(poorAid.tbl)
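If you prefer percentages to proportions, you can scale the same table by 100 and round it; a quick sketch:
#Percentages (rounded to one decimal place) instead of proportions
round(prop.table(poorAid.tbl)*100, 1)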
The first alternative is the freq command, which is provided in the descr package,
a package that provides several tools for descriptive analysis. Here's what
you need to do:
#Provide a frequency table, but not a graph
freq(anes20$V201320x, plot=F)
As you can see, we get all of the information provided in the earlier tables, plus
some additional information, and the information is somewhat better organized.
The first column of data shows the raw frequencies, the second shows the total
percentages, and the final column is the valid percentages. The valid percentages
match up with the proportions reported earlier, while the “Percent” column
reports slightly different percentages based on 8280 responses (the 8225 valid
responses and 55 survey respondents who did not provide a valid response).
When conducting surveys, some respondents refuse to answer some questions,
or may not have an opinion, or might be skipped for some reason. These 55
responses in the table above are considered missing data and are denoted as NA
in R. It is important to be aware of the level of missing data and usually a good
idea to have a sense of why they are missing. Sometimes, this requires going
back to the original codebooks or questionnaires (if using survey data) for more
information about the variable. Generally, researchers present the valid percent
when reporting results.
One statistic missing from this table is the cumulative percent, which can be
useful for getting a sense of how a variable is distributed. The cumulative %
is the percent of observations in or below (in a numeric or ranking sense) a
given category. You calculate the cumulative percent for a given ordered or
numeric value by summing the percent with that value and the percent in all
lower ranked values. We’ve actually already discussed this statistic without
actually calling it by name. In part of the discussion of the results from the
table command, it was noted that just over 50% favored increasing spending.
This is the cumulative percent for the second category (31.1% from the first
and 19.7% from the second category). Of course, it’s easier if you don’t have
to do the math in your head on the fly every time you want the cumulative
percent. Fortunately, there is an alternative command, Freq, that will give you
a frequency table that includes cumulative percentages (note the upper-case
"F", which distinguishes it from the freq command used above).
Freq(anes20$V201320x)
Here, you get the raw frequencies, the valid percent (note that there are no
NAs listed), the cumulative frequencies, and the cumulative percent. The key
addition is the cumulative percent, which I consider to be a sometimes useful
piece of information. As pointed out above, you can easily see that just over half
the respondents (50.8%) favored an increase in spending on aid to the poor, and
almost 90% opposed cutting spending (percent in favor of increasing spending
or keeping spending the same, the cumulative percentage in the third category).
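If you ever want cumulative percentages without reaching for Freq, you can also compute them directly from the frequency table stored earlier; a quick sketch using base R's cumsum function:
#Running (cumulative) percentage across the ordered categories
cumsum(prop.table(poorAid.tbl))*100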
So what about table, the original function we used to get frequencies? Is it
still of any use? You bet it is! In fact, many other functions make use of the
information from the table command to create graphics and other statistics,
as you will see shortly.
Besides providing information on the distribution of single variables, it can also
be useful to compare the distributions of multiple variables if there are sound
theoretical reasons for doing so. For instance, in the example used above, the
data showed widespread support for federal spending on aid to the poor. It
is interesting to ask, though, about how supportive people are when we refer
to spending not as “aid to the poor” but as “welfare programs,” which tech-
nically are programs to aid the poor. The term “welfare” is viewed by many
as a “race-coded” term, one that people associate with programs that primarily
benefit racial minorities (mostly African-Americans), which leads to lower levels
of support, especially among whites. As it happens, the 2020 ANES asked the
identical spending question but substituted “welfare programs” for “aid to the
poor.” Let’s see if the difference in labeling makes a difference in outcomes. Of
course, we don’t expect to see the exact same percentages because we are using
a different survey question, but based on previous research in this area, there is
good reason to expect lower levels of support for spending on welfare programs
than on “aid to the poor.”
freq(anes20$V201314x, plot=FALSE)
freq(states20$d2pty20, plot=FALSE)
#If you get an error, check to be sure the "states20" data set loaded.
states20$d2pty20
Frequency Percent
27.52 1 2
30.2 1 2
32.78 1 2
33.06 1 2
34.12 1 2
35.79 1 2
36.57 1 2
36.8 1 2
37.09 1 2
38.17 1 2
39.31 1 2
40.22 1 2
40.54 1 2
41.6 1 2
41.62 1 2
41.8 1 2
42.17 1 2
42.51 1 2
44.07 1 2
44.74 1 2
45.82 1 2
45.92 1 2
47.17 1 2
48.31 1 2
49.32 1 2
50.13 1 2
50.16 1 2
50.32 1 2
50.6 1 2
51.22 1 2
51.41 1 2
53.64 1 2
53.75 1 2
54.67 1 2
55.15 1 2
55.52 1 2
56.94 1 2
58.07 1 2
58.31 1 2
58.67 1 2
59.63 1 2
59.93 1 2
60.17 1 2
60.6 1 2
61.72 1 2
64.91 1 2
65.03 1 2
67.03 1 2
67.12 1 2
68.3 1 2
Total 50 100
The most useful information conveyed here is that vote share ranges from 27.5
to 68.3. Other than that, this frequency table includes too much information to
absorb in a meaningful way. There are fifty different values and it is really hard
to get a sense of the general pattern in the data. Do the values cluster at the
high or low end? In the middle? Are they evenly spread out? In cases like this,
it is useful to collapse the data into fewer categories that represent ranges of
outcomes. Fortunately, the Freq command does this automatically for numeric
variables.
Freq(states20$d2pty20, plot=FALSE)
We’ll explore regrouping data like this in greater detail in the next chapter.
3.4 Graphing Outcomes
3.4.1 Bar Charts
The code listed below is used to generate the bar chart for V201320x, the variable
measuring spending preferences on programs for the poor. Note here that the
barplot command uses the initial frequency table as input, saved earlier as
poorAid.tbl, rather than the name of the variable. This illustrates what I
mentioned earlier, that even though the table command does not provide a lot
of information, it can be used to help with other R commands. It also reinforces
an important point: bar charts are the graphic representation of the raw data
from a frequency table.
#Plot the frequencies of anes20$V201320x
barplot(poorAid.tbl)
[Figure: bar chart of poorAid.tbl with default settings; the y-axis shows frequencies (0 to about 2,500) and some of the long category labels are dropped]
Sometimes you have to tinker a bit with graphs to get them to look as good as
they should. For instance, you might have noticed that not all of the category
labels are printed above. This is because the labels themselves are a little
bit long and clunky, and with five bars to print, some of them were dropped
due to lack of room. We could add a command to reduce the size of
the labels, but that can lead to labels that are too small to read (still, we
will look at that command later). Instead, we can replace the original labels
with shorter ones that still represent the meaning of the categories, using the
names.arg command. Make sure to notice the quotation marks and commas in
the command. We also need to add axis labels and a main title for the graph,
the same as we did in Chapter 2. Adding this information makes it much easier
for your target audience to understand what’s being presented. You, as the
researcher, are familiar with the data and may be able to understand the graph
without this information, but the others need a bit more help.
#Same as above but with labels altered for clarity in "names.arg"
barplot(poorAid.tbl,
names.arg=c("Increase/Lot", "Increase",
"Same", "Decrease","Decrease/Lot"),
xlab="Increase or Decrease Spending on Aid to the Poor?",
ylab="Number of Cases",
main="Spending Preference")
[Figure: bar chart "Spending Preference" with shortened category labels; y-axis "Number of Cases" (0 to about 3,000)]
I think you’ll agree that this looks a lot better than the first graph. By way
of interpretation, you don’t need to know the exact values of the frequencies
or percentages to tell that there is very little support for decreasing spending
and substantial support for keeping spending the same or increasing it. This is
made clear simply from the visual impact of the differences in the height of the
bars. Images like this often make quick and clear impressions on people.
Now let’s compare this bar chart to one for the question that asked about
spending on welfare programs. Since we did not save the contents of the
original frequency table for this variable to a new object, we can insert
table(anes20$V201314x) into the barplot command:
#Tell R to use the contents of "table(anes20$V201314x)" for graph
barplot(table(anes20$V201314x),
names.arg=c("Increase/Lot", "Increase",
"Same", "Decrease", "Decrease/Lot"),
xlab="Increase or Decrease Spending on Welfare?",
ylab="Number of Cases",
main="Spending Preference")
[Figure: bar chart "Spending Preference" for the welfare spending question; y-axis "Number of Cases" (0 to about 3,500)]
As was the case when we compared the frequency tables for these two variables,
the biggest difference that jumps out is the lower level of support for increas-
ing spending, and the higher level of support for decreasing welfare spending,
compared to preferences of spending on aid to the poor. You can flip back and
forth between the two graphs to see the differences, but sometimes it is better
to have the graphs side by side, as below.1
#Set output to one row, two columns
par(mfrow=c(1,2))
barplot(poorAid.tbl,
ylab="Number of Cases",
#Adjust the y-axis to match the other plot
ylim=c(0,3500),
xlab="Spending Preference",
main="Aid to the Poor",
#Reduce the size of the labels to 60% of original
cex.names=.6,
#Use labels for end categories, others are blank
names.arg=c("Increase/Lot", "", "", "","Decrease/Lot"))
#Use "table(anes20$V201314x)" since a table object was not created
barplot(table(anes20$V201314x),
xlab="Spending Preference",
ylab="Number of Cases",
main="Welfare",
cex.names=.6, #Reduce the size of the category labels
names.arg=c("Increase/Lot", "", "", "","Decrease/Lot"))
1 The code is provided here but might be a bit confusing to new R users. Look at it and try to follow what each part is doing.
[Figure: side-by-side bar charts, "Aid to the Poor" and "Welfare", both with y-axis "Number of Cases" (0 to 3,500)]
This does make the differences between the two questions more apparent. Note
that I had to delete a few category labels and reduce the size of the remaining
labels to make everything fit in this side-by-side comparison.
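When you are finished with a side-by-side comparison like this, it is a good idea to reset the plotting layout to a single panel so that later graphs fill the whole window:
#Return to one row, one column of plots
par(mfrow=c(1,1))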
Finally, if you prefer to plot the relative rather than the raw frequencies, you
just have to specify that R should use the proportions table as input, and change
the y-axis label accordingly:
#Use "prop.table" as input
barplot(prop.table(poorAid.tbl),
names.arg=c("Increase/Lot", "Increase",
"Same", "Decrease","Decrease/Lot"),
xlab="Increase or Decrease Spending on Aid to the Poor?",
ylab="Proportion of Cases",
main="Spending Preference")
[Figure: bar chart "Spending Preference" with "Proportion of Cases" on the y-axis]
Bar Chart Limitations. Bar charts work really well for most categorical vari-
ables because they do not assume any particular quantitative distance between
categories on the x-axis, and because categorical variables tend to have rela-
tively few, discrete categories. Numeric data generally do not work well with
bar charts, for reasons to be explored in just a bit. That said, there are some
instances where this general rule doesn’t hold up. Let’s look at one such excep-
tion, using a state policy variable from the states20 data set. Below is a bar
chart for abortion_laws, a variable measuring the number of legal restrictions
on abortion in the states in 2020.2
barplot(table(states20$abortion_laws),
xlab="Number of laws Restricting Abortion Access",
ylab="Number of States",
cex.axis=.8)
2 Note that the data for this variable do not reflect the sweeping changes to abortion laws that followed the U.S. Supreme Court's June 2022 decision overturning Roe v. Wade.
[Figure: bar chart with "Number of laws Restricting Abortion Access" (1 to 13) on the x-axis and "Number of States" (0 to 12) on the y-axis]
Okay, this is actually kind of a nice looking graph. It’s easy to get a sense of
how this variable is distributed: most states have several restrictions and the
most common outcomes are states with 9 or 10 restrictions. It is also easier
to comprehend than if we got the same information in a frequency table (go
ahead and get a frequency table to see if you agree). The bar chart works
in this instance, because there are relatively few, discrete categories, and the
categories are consecutive, with no gaps between values. So far, so good, but in
most cases, numeric variables do not share these characteristics, and bar charts
don’t work well. This point is illustrated quite nicely in this graph of Joe Biden’s
percent of the two-party vote in the states in the 2020 election.
par(las=2) #This tells R to plot the labels vertically
barplot(table(states20$d2pty20),
xlab="Biden % of Two-party Vote",
ylab="Number of States",
#This tells R to shrink the labels to 70% of normal size
cex.names = .7)
[Figure: bar chart of table(states20$d2pty20), with one bar per state; x-axis "Biden % of Two-party Vote" (50 distinct values from 27.52 to 68.3), y-axis "Number of States" (every bar has a height of 1)]
Not to put too fine a point on it, but this is a terrible graph, for many of the same
reasons the initial frequency table for this variable was of little value. Other
than telling us that the outcomes range from 27.52 to 68.3, there is nothing
useful conveyed in this graph. What’s worse, it gives the misleading impression
that votes were uniformly distributed between the lowest and highest values.
There are a couple of reasons for this. First, no two states had exactly the
same outcome, so there are as many distinct outcomes and vertical bars as
there are states, leading to a flat distribution. This is likely to be the case with
many numeric variables, especially when the outcomes are continuous. Second,
the proximity of the bars to each other reflects the rank order of outcomes,
not the quantitative distance between categories. For instance, the two lowest
values are 27.52 and 30.20, a difference of 2.68, and the third and fourth lowest
values are 32.78 and 33.06, a difference of .28. Despite these differences in
the quantitative distance between the first and second, and third and fourth
outcomes, the spacing between the bars in the bar chart makes it look like the
distances are the same. The bar chart is only plotting outcomes by order of the
labels, not by the quantitative values of the outcomes. Bottom line, bar charts
are great for most categorical data but usually are not the preferred method for
graphing numeric data.
Plots with Frequencies. We have been using the barplot command to get
bar charts, but these charts can also be created by modifying the freq com-
mand. You probably noticed that throughout the discussion of frequencies, I
used commands that look like this:
freq(anes20$V201320x, plot=FALSE)
The plot=FALSE part of the command instructs R to not create a bar chart for
the variable. If it is dropped, or if it is changed to plot=TRUE, R will produce
a bar chart along with the frequency table. You still need to add commands to
create labels and main titles, and to make other adjustments, but you can do
all of this within the frequency command. Go ahead, give it a try.
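For example, something along these lines should produce the frequency table and a labeled bar chart in one step (the labeling arguments are passed along to the underlying bar chart):
#Frequency table plus bar chart in a single command
freq(anes20$V201320x, plot=TRUE,
     main="Spending Preference",
     ylab="Number of Cases")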
So why not just do this from the beginning? Why go through the extra steps?
Two reasons, really. First, you don’t need a frequency table every time you
produce or modify a bar chart. The truth is, you probably have to make several
different modifications to a bar chart before you are happy with how it looks, and
if you get a frequency table with every iteration of chart building, your screen
will get to be a bit messy. More importantly, however, using the barplot
command pushes you a bit more to understand “what’s going on under the
hood.” For instance, telling R to use the results of the table command as input
for a bar chart helps you understand a bit better what is happening when R
creates the graph. If the graph just magically appears when you use the freq
command, you are another step removed from the process and don’t even need
to think about it. You may recall from Chapter 2 that I discussed how some
parts of the data analysis process seem a bit like a black box to students—
something goes in, results come out, and we have no idea what’s going on inside
the box. Learning about bar charts via the barplot command gives you a little
peek inside the black box. However, now that you know this, you can decide
for yourself how you want to create bar charts.
3.4.2 Histograms
Let’s take another look at Joe Biden’s percent of the two-party vote in the states
but this time using a histogram.
#Histogram for Biden's % of the two-party vote in the states
hist(states20$d2pty20,
xlab="Biden % of Two-party Vote",
ylab="Number of States",
main="Histogram of Biden Support in the States")
3 There are other graphing methods that provide this type of information, but they require
a bit more knowledge of measures of central tendency and variation, so they will be presented
in later chapters.
[Figure: "Histogram of Biden Support in the States", x-axis "Biden % of Two-party Vote" (30 to 70), y-axis "Number of States"]
The width of the bars represents a range of values and the height represents the
number of outcomes that fall within that range. At the low end there is just
one state in the 25 to 30 range, and at the high end, there are four states in
the 65 to 70 range. More importantly, there is no clustering at one end or the
other, and the distribution is somewhat bell-shaped but with a dip in the middle.
It would be very hard to glean this information from the frequency table and
bar chart for this variable presented earlier.
Recall that the bar charts we looked at earlier were the graphic representation of
the raw frequencies for each outcome. We can think of histograms in the same
way, except that they are the graphic representation of the binned frequencies,
similar to those produced by the Freq command. In fact, R uses the same rules
for grouping observations for histograms as it does for the binned frequencies
used in the Freq command used earlier.
The histogram for the abortion law variable we looked at earlier (abortion_laws)
is presented below.
#Histogram for the number of abortion restrictions in the states
hist(states20$abortion_laws,
xlab="Number of Laws Restricting Access to Abortions",
ylab="Number of States",
main="Histogram of Abortion Laws in the States")
[Figure: "Histogram of Abortion Laws in the States", x-axis "Number of Laws Restricting Access to Abortions" (0 to 14), y-axis "Number of States"]
This information looks very similar to that provided in the bar chart, except the
values are somewhat less finely grained in the histogram. What the histogram
does that is useful is mask the potentially distracting influence of idiosyncratic
bumps or dips in the data that might catch the eye in a bar chart and divert
attention from the general trend in the data. In this case, the difference between
the two modes of presentation is not very stark, but as a general rule, histograms
are vastly superior to bar charts when using numeric data.
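Another way to show the shape of a numeric variable is to overlay a smoothed density line on the histogram. A minimal sketch of how this is typically done in base R (the exact code behind the figure below may differ):
#Histogram on the density scale, with a smoothed density line on top
hist(states20$d2pty20, prob=TRUE,
     xlab="Biden % of Two-party Vote",
     main="Histogram of Biden Support in the States")
lines(density(states20$d2pty20), lwd=3)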
[Figure: histogram of Biden's two-party vote share with a smoothed density line overlaid; the y-axis shows density]
The smoothed density line reinforces the impression from the histogram that
there are relatively few states with extremely low or high values, and the vast
majority of states are clustered in the 40-60% range. It is not quite a bell-shaped
curve: somewhat symmetric, but a bit flatter than a true bell curve.
The density values on the vertical axis are difficult to interpret on their own,
so it is best to focus on the shape of the distribution and the fact that higher
values mean more frequently occurring outcomes.
You can also view the density plot separately from the histogram, using the
plot function:
#Generate a density plot with no histogram
plot(density(states20$d2pty20),
xlab="Biden % of Two-Party Vote",
main="Biden Votes in the States",
lwd=3)
[Figure: "Biden Votes in the States" density plot; x-axis "Biden % of Two-Party Vote" (20 to 80), y-axis "Density"]
The main difference here is that the density plot is not limited to the same x-
axis limits as in the histogram and the solid line can extend beyond those limits
as if there were data points out there. Let’s take a quick look at a density plot
for the other numeric variable used in this chapter, abortion laws in the states.
#Density plot for Number of abortion laws in the states
plot(density(states20$abortion_laws),
xlab="Number of Laws Restricting Access to Abortions",
main="Abortion Laws in the States",
lwd=3)
[Figure: "Abortion Laws in the States" density plot; x-axis "Number of Laws Restricting Access to Abortions" (0 to 15), y-axis "Density"]
This plot shows that the vast majority of the states have more than five abortion
restrictions on the books, and the distribution is sort of bimodal (two primary
groupings) at around five and ten restrictions.
As you progress through the chapters in this book, you will learn a lot more
about how to use graphs to illuminate interesting things about your data. Before
moving on to the next chapter, I want to show you a few things you can do to
change the appearance of the simple bar charts and histograms we have been
working with so far.
• col=" " is used to designate the color of the bars. Gray is the default color,
but you can choose to use some other color if it makes sense to you. You
can get a list of all colors available in R by typing colors() at the prompt
in the console window.
• horiz=T is used if you want to flip a bar chart so the bars run horizontally
from the vertical axis.
• breaks= is used in a histogram to change the number of bars (bins) used
to display the data. We used this command earlier in the discussion of
setting specific bin ranges in frequency tables, but for right now we will
just specify a single number that determines how many bars will be used.
The examples below add some of this information to graphs we examined earlier
in this chapter.
hist(states20$d2pty20,
xlab="Biden % of Two-party Vote",
ylab="Number of States",
main="Histogram of Biden Support in the States",
col="white", #Use white to color the the bars
breaks=5) #Use just five categories
[Figure: "Histogram of Biden Support in the States" with five bins and white bars]
As you can see, things turned out well for the histogram with just five bins,
though I think five is probably too few and obscures some of the important
variation in the variable. If you decided you prefer a certain color but not the
default gray, you just change col="white" to something else.
The graph below is a first attempt at flipping the bar chart for attitudes toward
spending on aid to the poor, using white horizontal bars.
barplot(poorAid.tbl,
names.arg=c("Increase/Lot", "Increase", "Same", "Decrease","Decrease/Lot"),
xlab="Number of Cases",
ylab="Increase or Decrease Spending on Aid to the Poor?",
main="Spending Preference",
horiz=T, #Plot bars horizontally
col="white") #Use white to color the the bars
[Figure: horizontal bar chart "Spending Preference"; x-axis "Number of Cases"; several of the category labels do not fit and are dropped]
As you can see, the horizontal bar chart for spending preferences turns out
to have a familiar problem: the value labels are too large and don’t all print.
Frequently, when turning a chart on its side, you need to modify some of the
elements a bit, as I’ve done below 4 .
First, par(las=2) instructs R to print the value labels sideways. Anytime you
see par followed by other terms in parentheses, it is likely to be a command
to alter the graphic parameters. The labels were still a bit too long to fit, so
I increased the margin on the left with par(mar=c(5,8,4,2)). This command
sets the margin size for the graph, where the order of numbers is c(bottom,
left, top, right). Normally this is set to mar=c(5,4,4,2), so increasing
the second number to eight expanded the left margin area and provided enough
room for value labels. However, the horizontal category labels overlapped with
the y-axis title, so I dropped the axis title and modified the main title to help
clarify what the labels represent.
#Change direction of the value labels
par(las=2)
#Change the left border to make room for the labels
par(mar=c(5,8,4,2))
barplot(poorAid.tbl,
names.arg=c("Increase/Lot", "Increase", "Same", "Decrease","Decrease/Lot"),
xlab="Number of Cases",
main="Spending Preference: Aid to the Poor",
horiz=T,
col="white",)
4 One option not shown here is to reduce the size of the labels using cex.names. Unfortu-
nately, you have to cut the size in half to get them to fit, rendering them hard to read.
[Figure: horizontal bar chart "Spending Preference: Aid to the Poor" with rotated category labels (Increase/Lot through Decrease/Lot) and x-axis "Number of Cases" (0 to 3,000)]
As you can see, changing the orientation of your bar chart could entail changing
any number of other characteristics. If you think the horizontal orientation
works best for a particular variable, then go for it. Just take a close look when
you are done to make sure everything looks the way it should look.
Whenever you change the graphing parameters as we have done here, you need
to change them back to what they were originally. Otherwise, those changes
will affect all of your subsequent work.
#Return graph settings to their original values
par(las=1)
par(mar=c(5,4,4,2))
so that you have the data you need to create the best, most useful graphs and
statistics for your research. This task is taken up in the next chapter.
3.6 Exercises
1. You might recognize this list of variables from the exercises at the end
of Chapter 1. Identify whether a histogram or bar chart would be most
appropriate for summarizing the distribution of each variable. Explain
your choice.
2. This histogram shows the distribution of medical doctors per 100,000 pop-
ulation across the states.
• Given that the intervals in the histogram are right-closed, what range
of values are included in the 250 to 300 interval? How would this be
different if the intervals were left-closed?
[Figure: histogram of medical doctors per 100,000 population across the states; y-axis "Number of States" (0 to 15)]
3.6.2 R Problems
1. I’ve tried to get a bar chart for anes20$V201119, a variable that measures
how happy people are with the way things are going in the U.S., but I
keep getting an error message. Can you figure out what I’ve done wrong?
Diagnose the problem and present the correct barplot.
barplot(anes20$V201119,
xlab="How Happy with the Way Things Are Going?",
ylab="Number of Respondents")
2. Choose the most appropriate type of frequency table (freq or Freq) and
graph (bar chart or histogram) to summarize the distribution of values for
the three variables listed below. Make sure to look at the codebooks so
you know what these variables represent. Present the tables and graphs
and provide a brief summary of their contents. Also, explain your choice
of tables and graphs, and be sure to include appropriate axis titles with
your graphs.
• Variables: anes20$V202178, anes20$V202384, and states20$union.
3. Create density plots for the numeric variables listed in problem 2, making
sure to create appropriate axis titles. In your opinion, are the density plots
or the histograms easier to read and understand? Explain your response.
Chapter 4
Transforming Variables
4.2 Introduction
Before moving on to other graphing and statistical techniques, we need to take
a bit of time to explore ways in which we can use R to work with the data to
make it more appropriate for the tasks at hand. Some of the things we will do,
such as assigning simpler, more intuitive names to variables and data sets, are
very basic, while other things are a bit more complex, such as reordering the
categories of a variable and combining several variables into a single variable.
Everything presented in this chapter is likely to be of use to you in your course
assignments and, if you continue on with data analysis, at some point in the
future.
Sometimes a single variable needs to be changed in some way, or there could be several variables that need some attention, or
perhaps you need to modify the entire data set. Not to fear, though, as these
modifications are usually straightforward, especially with practice, and they
should make for better research.
Don’t Forget Script Files. As suggested before, one very important part
of the data transformation process is record keeping. It is essential that you
keep track of the transformations you make. This is where script files come in
handy. Save a copy of all of the commands you use by creating and saving a
script file in the Source window of RStudio (see Chapter 2 if this is unfamiliar
to you). At some point, you are going to have to remember and possibly report
the transformations you make. It could be when you turn in your assignment,
or maybe when you want to use something you created previously, or perhaps
when you are writing a final paper. You cannot and should not just work from
memory. Keep track of all changes you make to the data.
In these commands, we are literally telling R to take the contents of the two
spending preference variables and put them into two new objects with dif-
ferent names (note the use of <-). One important thing to notice here is
that when we created the new variables they were added to the anes20 data
set. This is important because if we had created stand alone objects (e.g.,
parts.
Let’s do this with a couple of other variables from anes20, the party feeling
thermometers. These variables are based on questions that asked respondents
to rate various individuals and groups on a 0-to-100 scale, where a rating of 0
means you have very cold, negative feelings toward the individual or group, and
100 means you have very warm, positive feelings. In some ways, these sound like
ordinal variables, but, owing to the 101-point scale, the feeling thermometers
are frequently treated as numeric variables, as we will treat them here. The
variable names for the Democratic and Republican feeling thermometer ratings
are anes20$V201156 and anes20$V201157, respectively. Again, these variable
names do not exactly trip off the tongue, and you are probably going to have to
look them up in the codebook every time you want to use one of them. So, we
can copy them into new objects and give them more substantively meaningful
names.
First, let’s take a quick look at the distributions of the original variables:
#This first bit tells R to show the graphs in one row and two columns
par(mfrow=c(1,2))
hist(anes20$V201156,
xlab="Rating",
ylab="Frequency",
main="V201156")
hist(anes20$V201157,
xlab="Rating",
ylab="Frequency",
main="V201157")
[Histograms: V201156 and V201157, Frequency by Rating (0 to 100)]
#This last bit tells R to return to showing graphs in one row and one column
par(mfrow=c(1,1))
The first thing to note is how remarkably similar these two distributions are. For
both the Democratic (left side) and Republican Parties, there is a large group of
respondents that registers very low ratings (0-10 on the scale), and a somewhat
even spread of responses along the rest of the horizontal axis. What’s also of
note here is that the variable names used (by default) as titles for the graphs
give you absolutely no information about what the variables are, as "V201156"
and "V201157" have no inherent meaning.
As you now know, we can copy these variables into new variables with more
meaningful names.
#Copy feeling thermometers into new variables
anes20$dempty_ft<-anes20$V201156
anes20$reppty_ft<-anes20$V201157
Now, let’s take a quick look at histograms for these new variables, just to make
sure they appear the way they should.
#Histograms of feeling thermometers with new variable names
par(mfrow=c(1,2))
hist(anes20$dempty_ft,
xlab="Rating",
ylab="Frequency",
main="dempty_ft")
hist(anes20$reppty_ft,
xlab="Rating",
ylab="Frequency",
main="reppty_ft")
[Histograms: dempty_ft and reppty_ft, Frequency by Rating (0 to 100)]
par(mfrow=c(1,1))
Everything looks as it should, and the new variable names help us figure out which
graph represents which party without having to refer to the codebook to check
the variable names.
#Check the class of the new variables
class(anes20$poor_spnd)
[1] "factor"
class(anes20$welfare_spnd)
[1] "factor"
This may not make much difference if the alphabetical order of the categories
matches their substantive ordering, since the default is to sort the
categories alphabetically. For instance, the alphabetical ordering of the cat-
egories for the spending preference variables is in sync with their substantive
ordering because the categories begin with numbers. You can check this using
frequencies or by using the levels function.
#Check the levels of the new variables
levels(anes20$poor_spnd)
[1] "1. Increased a lot"    "2. Increased a little" "3. Kept the same"
[4] "4. Decreased a little" "5. Decreased a lot"
levels(anes20$welfare_spnd)
[1] "1. Increased a lot"    "2. Increased a little" "3. Kept the same"
[4] "4. Decreased a little" "5. Decreased a lot"
However, this is not always the case, and there could be some circumstances in
which R needs to formally recognize that a variable is “ordered” for the purpose
of performing some function. It’s easy enough to change the variable class, as
shown below, where the class of the two spending preference variables is changed
to ordered:
#Change class of spending variables to "ordered"
anes20$poor_spnd<-ordered(anes20$poor_spnd)
anes20$welfare_spnd<-ordered(anes20$welfare_spnd)
#Check the class of the feeling thermometer variables
class(anes20$dempty_ft)
[1] "numeric"
class(anes20$reppty_ft)
[1] "numeric"
All good with the feeling thermometers. But I knew this already because R
would not have produced histograms for variables that were classified as fac-
tor variables. Occasionally, you may get an error message because the func-
tion you are using requires a certain class of data. For instance, if I try
to get a histogram of anes20$poor_spnd, I get the following error: Error
in hist.default(anes20$poor_spnd) : 'x' must be numeric. When this
happens, check to make sure the variable is properly classified, and change the class if necessary.
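A bar chart command of the sort being discussed here looks something like the sketch below, with names.arg supplying shorter labels (the labels are taken from Table 4.1 below; treat this as a sketch rather than the only way to do it).
#Bar chart with shortened labels supplied via 'names.arg'
barplot(table(anes20$poor_spnd),
        xlab="Spending Preference",
        ylab="Number of Cases",
        names.arg=c("Increase/Lot","Increase","Same",
                    "Decrease","Decrease/Lot"))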
In this case, we used names.arg to temporarily replace the existing value labels
with shorter labels that fit on the graph. We can use the levels function to
change these labels permanently, so we don’t have to add names.arg every time
we want to create a bar chart. In addition to using the levels function to view
value labels, we can also use it to change those labels. We’ll do this for both
spending variables, applying the labels used in the names.arg function in the
barplot command. The table below shows the original value labels and the
replacement labels for the spending preference variables.
Table 4.1: Original and Replacement Value labels for Spending Preference Vari-
ables
Original Replacement
1. Increased a lot Increase/Lot
2. Increased a little Increase
3. Kept the same Same
4. Decreased a little Decrease
5. Decreased a lot Decrease/Lot
In the command below, we tell R to replace the value labels for anes20$poor_spnd
with a defined set of alternative labels taken from Table 4.1.
#Assign levels to 'poor_spnd' categories
levels(anes20$poor_spnd)<-c("Increase/Lot","Increase","Same",
"Decrease","Decrease/Lot")
4.4. RENAMING AND RELABELING 109
It is important that each of the labels is enclosed in quotation marks and that
they are separated by commas. We can check our handiwork to make sure that
the transformation worked:
#Check levels
levels(anes20$poor_spnd)
[1] "Increase/Lot" "Increase"     "Same"         "Decrease"     "Decrease/Lot"
[Bar chart: Spending Preference, Number of Cases]
Note that the original variables still have the original category labels. Instead, we created new variables and replaced
their value labels. One of the first rules of data transformation is to never write
over (replace) original data. Once you write over the original data and save
the changes, those data are gone forever. If you make a mistake in creating the
new variables, you can always create them again. But if you make a mistake
with the original data and save your changes, you can’t go back and undo those
changes.1
Besides changing value labels, you might also need to alter the number of cate-
gories, or reorder the existing categories of a variable. Let’s start with the two
spending preference variables, anes20$poor_spnd and anes20$welfare_spnd.
These variables have five categories, ranging from preferring that spending be
“increased a lot” to preferring that it be “decreased a lot.” Now, suppose we are
only interested in three discrete outcomes, whether people want to see spending
increased, decreased, or kept the same. We can collapse the “Increase/Lot”
and “Increase” into a single category, and do the same for “Decrease/Lot”
and “Decrease”, resulting in a variable with three categories, “Increase”, “Keep
Same”, and “Decrease”. Since we know from the work above how the cate-
gories are ordered, we can tell R to label the first two categories “Increase”,
the third category “Same”, and the last two categories “Decrease”. Let’s start
by converting the contents of anes20$poor_spnd to a three-category variable,
anes20$poor_spnd.3.2
#Create a copy of the variable with a new name
anes20$poor_spnd.3<-anes20$poor_spnd
#Then, write over existing five labels with three labels
levels(anes20$poor_spnd.3)<- c("Increase", "Increase", "Keep Same",
"Decrease", "Decrease")
#Check to see if labels are correct
barplot(table(anes20$poor_spnd.3),
xlab="Spending Preference",
ylab="Number of Respondents",
main="Spending Preference: Aid to the Poor")
1 One alternative if this does happen is to go back and download the data from the original source.
[Bar charts of the collapsed spending preference variables: Number of Respondents by Spending Preference]
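The welfare spending variable can be collapsed in exactly the same way. A minimal sketch, assuming anes20$welfare_spnd was copied from the original ANES item earlier, is:
#Copy the welfare variable and collapse it to three categories
anes20$welfare_spnd.3<-anes20$welfare_spnd
levels(anes20$welfare_spnd.3)<-c("Increase", "Increase", "Keep Same",
                                 "Decrease", "Decrease")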
Using three categories clarifies things a bit, but you may have noticed that while
the categories are ordered, there is a bit of disconnect between the magnitude of
the labels and their placement on the horizontal axis. The first listed category
(Increase) implies a higher level of spending support than the last listed category
(Decrease), so moving from left to right along the axis means moving from the
substantively highest outcome to the lowest. Technically, this is okay, as long as you can keep it
straight in your head, but it can get a bit confusing, especially if you are talking
about how outcomes on this variable are related to outcomes on other variables.
Fortunately, we can reorder the categories to run from the intuitively “low”
value (Decrease) to the intuitively “high” value (Increase).
Let’s do this with poor_spnd.3 and welfare_spnd.3, and then check the levels
afterward. Here, we use ordered to instruct R to use the order of labels
specified in the levels argument. Note that we must use the already assigned labels,
but can reorder them.
#Use 'ordered' and 'levels' to reorder the categories
anes20$poor_spnd.3<-ordered(anes20$poor_spnd.3,
       levels=c("Decrease", "Keep Same", "Increase"))
anes20$welfare_spnd.3<-ordered(anes20$welfare_spnd.3,
       levels=c("Decrease", "Keep Same", "Increase"))
#Check the levels
levels(anes20$poor_spnd.3)
levels(anes20$welfare_spnd.3)
The transformations worked according to plan and there is now a more mean-
ingful match between the label names and their order on graphs and in tables
when using these variables.
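Reordering is also useful when a factor's categories are not listed in a sensible substantive order. As a quick check (a sketch using the ANES vote-by-mail item discussed below), we can inspect the current ordering with the levels function:
#Check the current ordering of the vote-by-mail item's categories
levels(anes20$V201354)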
Here, you can see there are three categories, but the order does not make sense.
In particular, “Neither favor nor oppose” should be treated as a middle category,
rather than as the highest category. As currently constructed, the order of cat-
egories does not consistently increase or decrease in the level of some underlying
scale. It would make more sense for this to be scaled as either “favor-neither-
oppose” or “oppose-neither-favor.” Given that “oppose” is a negative term and
“favor” a positive one, it probably makes the most sense for this variable to be
reordered to “oppose-neither-favor.”
First, let’s create a new variable name, anes20$mail, and replace the original
labels to get rid of the numbers and shorten things a bit before we reorder the
categories.
#Create new variable
anes20$mail<-(anes20$V201354)
#Create new labels
levels(anes20$mail)<-c("Favor", "Oppose", "Neither")
#Check levels
levels(anes20$mail)
#Reorder categories
anes20$mail<-ordered(anes20$mail,
levels=c("Oppose", "Neither", "Favor"))
#Check Levels
levels(anes20$mail)
Now we have an ordered version of the same variable that will be easier to use
and interpret in the future.
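These tools can also be used with numeric variables. For instance, a net party feeling thermometer (the Democratic rating minus the Republican rating) can be grouped into three categories with explicit cut points. The commands below are a minimal sketch, assuming the dempty_ft and reppty_ft copies created earlier and the cut2 function from the Hmisc package; treat this as one way to do it rather than the only way.
#Net party rating: Democratic thermometer minus Republican thermometer
anes20$netpty_ft<-anes20$dempty_ft - anes20$reppty_ft
#Group into negative ratings, ties (0), and positive ratings
library(Hmisc)
anes20$netpty_ft.3<-cut2(anes20$netpty_ft, cuts=c(0, 1))
#Frequency table for the new variable
freq(anes20$netpty_ft.3, plot=F)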
anes20$netpty_ft.3
Frequency Percent Valid Percent Cum Percent
[-100, 0) 3308 39.952 40.76 40.76
0 900 10.870 11.09 51.85
[ 1, 100] 3908 47.198 48.15 100.00
NA's 164 1.981
Total 8280 100.000 100.00
First, it is reassuring to note that the result shows three categories with limits
that reflect the desired groupings. Substantively, this shows that the Demo-
cratic Party holds a slight edge in feeling thermometer ratings, preferred by
48% compared to 41% for the Republican Party, and that about 11% rated
both parties exactly the same. Looking at this table, however, it is clear that
we could do with some better value labels. We know that the [-100, 0) category
represents respondents who preferred the Republican Party, and respondents in
the [1, 100] category preferred the Democratic Party, but this is not self-evident
by the labels. So let’s replace the numeric ranges with some more descriptive
labels.
#Assign meaningful level names
levels(anes20$netpty_ft.3)<-c("Favor Reps", "Same", "Favor Dems")
freq(anes20$netpty_ft.3, plot=F)
anes20$netpty_ft.3
Frequency Percent Valid Percent Cum Percent
Favor Reps 3308 39.952 40.76 40.76
Same 900 10.870 11.09 51.85
Favor Dems 3908 47.198 48.15 100.00
NA's 164 1.981
Total 8280 100.000 100.00
In the example used above, we had a clear idea of exactly which cut points to
use. But sometimes, you don’t care so much about the specific values but are
more interested in grouping data into thirds, quarters, or some other quantile.
This is easy to do using the cut2 function. For instance, suppose we are using
the states20 data set and want to classify the states according to state policy
liberalism.
#Use cut2 to divide the states into three groups of roughly equal size
states20$policy_lib.3<-cut2(states20$policy_lib, g=3)
#Frequency table for the new variable
freq(states20$policy_lib.3, plot=F)
states20$policy_lib.3
Frequency Percent Cum Percent
[-2.525,-0.927) 17 34 34
[-0.927, 0.916) 17 34 68
[ 0.916, 2.515] 16 32 100
Total 50 100
You should notice a couple of things about this transformation. First, although
not exactly equal in size, each category has roughly one-third of the observations
in it. It’s not always possible to make the groups exactly the same size, especially
if values are rounded to whole numbers, but using this method will get you close.
Second, as in the first example, we need better value labels. The lowest category
represents states with relatively conservative policies, the highest category states
with liberal policies, and the middle group represents states with policies that
are not consistently conservative or liberal:
#Create meaningful level names
levels(states20$policy_lib.3)<-c("Conservative", "Mixed", "Liberal")
#Check levels
freq(states20$policy_lib.3, plot=F)
states20$policy_lib.3
Frequency Percent Cum Percent
Conservative 17 34 34
Mixed 17 34 68
Liberal 16 32 100
Total 50 100
While these labels are easier to understand, it is important not to lose sight of
what they represent, the bottom, middle, and top thirds of the distribution of
states20$policy_lib.
Table 4.2. Liberal Outcomes on LGBTQ Rights Variables in the 2020 ANES
The process for combining these five variables into a single index of support for
LGBTQ rights involves two steps. First, for each question, we need to create
an indicator variable with a value of 1 for all respondents who gave the liberal
response, and a value of 0 for all respondents who gave some other response.
Let’s do this first for anes20$V201406 for demonstration purposes.
#Create indicator (0,1) for category 2 of "equal services" variable
anes20$lgbtq1<-as.numeric(anes20$V201406 ==
"2. Should be required to provide services")
#Check the distribution of the new indicator variable
table(anes20$lgbtq1)
   0    1
4083 4085
Here, we see an even split in public opinion on this topic and that the raw
frequencies in categories 0 and 1 on anes20$lgbtq1 are exactly what we should
expect based on the frequencies in anes20$V201406.
Just a couple of quick side notes before creating the other indicator variables
for the index. First, these types of variables are also commonly referred to
as indicator, dichotomous, or dummy variables. I tend to use all three terms
interchangeably. Second, we have taken information from a factor variable and
created a numeric variable. This is important to understand because it can open
up the number and type of statistics we can use with data that are originally
coded as factor variables.
The code below is used to create the other indicators:
#Create indicator for "bathroom" variable
anes20$lgbtq2<-as.numeric(anes20$V201409 ==
"2. Bathrooms of their identified gender")
#Create indicator for "job discrimination" variable
anes20$lgbtq3<-as.numeric(anes20$V201412 == "1. Favor")
#Create indicator for "adoption" variable
anes20$lgbtq4<-as.numeric(anes20$V201415 == "1. Yes")
#Create indicator for "marriage" variable
anes20$lgbtq5<-as.numeric(anes20$V201416 == "1. Allowed to legally marry")
It might be a good exercise for you to copy all of this code and check on your
own to see that the transformations have been done correctly.
Now, the second step of the process is to combine all of these variables into
a single index. Since these are numeric variables and they are all coded the
same way, we can simply add them together. Just to make sure you understand
how this works, let’s suppose a respondent gave liberal responses (1) to the first,
third, and fifth of the five variables, and conservative responses (0) to the second
and fourth ones. The index score for that person would be 1 + 0 + 1 + 0 + 1
= 3.
We can add the dichotomous variables together and look at the frequency table
for the new index, anes20$lgbtq_rights.
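The addition step looks something like the following sketch (one way to do it):
#Add the five indicators together to create the index
anes20$lgbtq_rights<-anes20$lgbtq1 + anes20$lgbtq2 + anes20$lgbtq3 +
  anes20$lgbtq4 + anes20$lgbtq5
#Frequency table for the new index
freq(anes20$lgbtq_rights, plot=F)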
anes20$lgbtq_rights
Frequency Percent Valid Percent
0 458 5.531 5.86
1 766 9.251 9.80
2 935 11.292 11.96
3 1272 15.362 16.27
4 1768 21.353 22.62
5 2617 31.606 33.48
NA's 464 5.604
Total 8280 100.000 100.00
The picture that emerges in the frequency table is that there is fairly widespread
support for LGBTQ rights, with about 56% of respondents supporting liberal
outcomes in at least four of the five survey questions, and very few respondents
at the low end of the scale. This point is, I think, made even more apparent by
looking at a bar chart for this variable.3
barplot(table(anes20$lgbtq_rights),
xlab="Number of Rights Supported",
ylab="Number of Respondents",
main="Support for LGBTQ Rights")
3 That’s right, this is another example of a numeric variable for which a bar chart works
fairly well.
[Bar chart: Support for LGBTQ Rights, Number of Respondents by Number of Rights Supported (0 to 5)]
What’s most important here is that we now have a more useful measure of
support for LGBTQ rights than if we had used just a single variable out of the
list of five reported above. This variable is based on responses to five separate
questions and provides a more comprehensive estimate of where respondents
stand on LGBTQ rights. In terms of some of the things we covered in the
discussion of measurement in Chapter 1, this index is strong in both validity
and reliability.
Remember to use getwd to see what your current working directory is (this is
where the file will be saved if you don't specify a directory in the
command), and setwd to change the working directory to the place where you want
to save the results.
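For example (a quick sketch; "<FilePath>" is a placeholder for your own directory):
#Check the current working directory
getwd()
#Change the working directory if needed
setwd("<FilePath>")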
To save the anes20 data set over itself:
save(anes20, file="<FilePath>/anes20.rda")
Saving the original file over itself is okay if you are only adding newly created
variables. Remember, though, that if you transformed variables and replaced
the contents of original variables (using the same variable names), you cannot
retrieve the original data if you save now using the same file name. None of the
transformations you’ve done in this chapter are replacing original variables, so
you can go ahead and save this file as anes20.
4.9 Exercises
4.9.1 Concepts and Calculations
1. The ANES survey includes a variable measuring marital status with six
categories, as shown below.
Create a table similar to Table 4.1 in which you show how you would map these
six categories on to a new variable with three categories, “Married”, “Never
Married”, and “Other.” Explain your decision rule for the way you mapped
categories from the old to the new variable.
[1] "1. Married: spouse present"
[2] "2. Married: spouse absent {VOL - video/phone only}"
[3] "3. Widowed"
[4] "4. Divorced"
[5] "5. Separated"
[6] "6. Never married"
4.9.2 R Problems
1. Rats! I’ve done it again. I was trying to combine responses to two gun con-
trol questions into a single, three-category variable (anes20$gun_cntrl)
measuring support for restrictive gun control measures. The original vari-
ables are anes20$V202337 (should the federal government make it more
difficult or easier to buy a gun?) and anes20$V202342 (Favor or oppose
banning ‘assault-style’ rifles). Make sure you take a close look at these
variables before proceeding.
I used the code shown below to create the new variable, but, as you can see when
you try to run it, something went wrong (the resulting variable should range
from 0 to 2). What happened? Once you fix the code, produce a frequency
table for the new index and report how you fixed it.
anes20$buy_gun<-as.numeric(anes20$V202337=="1. Favor")
anes20$ARguns<-as.numeric(anes20$V202342=="1. More difficult")
anes20$gun_cntrl<-anes20$buy_gun + anes20$ARguns
freq(anes20$gun_cntrl, plot=F)
anes20$gun_cntrl
Frequency Percent Valid Percent
0 7372 89.03 100
NA's 908 10.97
Total 8280 100.00 100
2. Use the mapping plan you produced for Question 1 of the Concepts
and Calculations problems to collapse the current six-category vari-
able measuring marital status (anes20$V201508) into a new variable,
anes20$marital, with three categories, "Married", "Never Married", and
"Other."
• Create a frequency table for both anes20$V201508 and the new
variable, anes20$marital. Do the frequencies for the categories
of the new variable match your expectations, given the category
frequencies for the original variable?
• Do you prefer the frequency table or bar chart as a method for looking
at these variables? Why?
4. The table below summarizes information about four variables from the
anes20 data set that measure attitudes toward different immigration poli-
cies. Take a closer look at each of these variables, so you are comfortable
with them.
• Use the information in the table to create four numeric indicator
(dichotomous) variables (one for each) and combine those variables
into a new index of immigration attitudes named anes20$immig_pol
(show all steps along the way).
• Create a frequency table OR bar chart for anes20$immig_pol and
describe its distribution.
Chapter 5
Measures of Central Tendency
5.2.1 Mode
The mode is the category or value that occurs most often, and it is most
appropriate for nominal data because it does not require that the underlying
variable be quantitative in nature. That said, the mode can sometimes provide
useful information for ordinal and numeric data, especially if those variables
have a limited number of categories.
#Get the mode of the religious denomination variable
Mode(anes20$denom)
[1] NA
attr(,"freq")
[1] NA
Oops, this result doesn’t look quite right. That’s because many R functions
don’t know what to do with missing data and will report NA instead of the
information of interest. We get this error message because there are 83 missing
cases for this variable (see the frequency). This is fixed in most cases by adding
na.rm=T to the command line, telling R to remove the NAs from the analysis.
#Add "na.rm=T" to account for missing data
Mode(anes20$denom, na.rm=T)
[1] Protestant
attr(,"freq")
[1] 2113
8 Levels: Protestant Catholic OtherChristian Jewish OtherRel ... Nothing
This confirms that “Protestant” is the modal category, with 2113 respondents,
and also lists all of the levels. In many cases, I prefer to look at the frequency
table for the mode because it provides a more complete picture of the variable,
showing, for instance, that while Protestant is the modal category, “Catholic”
is a very close second.
While the mode is the most suitable measure of central tendency for nominal-
level data, it can be used with ordinal and interval-level data. Let’s look
at two variables we’ve used in earlier chapters, spending preferences on
programs for the poor (anes20$V201320x), and state abortion restrictions
(states20$abortion_laws):
#Mode for spending on aid to the poor
Mode(anes20$V201320x, na.rm=T)
#Mode for state abortion laws
Mode(states20$abortion_laws)
[1] 10
attr(,"freq")
[1] 12
Here, we see that the modal outcome for spending preferences is “Kept the
same”, with 3213 respondents, and the mode for abortion regulations is 10,
which occurred twelve times. While the mode does provide another piece of
information for these variables, there are better measures of central tendency
when working with ordinal or interval/ratio data. However, the mode is the
preferred measure of central tendency for nominal data.
5.3 Median
The median cuts the sample in half: it is the value of the outcome associated with
the observation at the middle of the distribution when cases are listed in order
of magnitude. Because the median is found by ordering observations from the
lowest to the highest value, the median is not an appropriate measure of central
tendency for nominal variables. If the cases for an ordinal or numeric variable
are listed in order of magnitude, we can look for the point that cuts the sample
exactly in half. The value associated with that observation is the median. For
instance, if we had 5 observations ranked from lowest to highest, the middle
observation would be the third one (two above it and two below it), and the
median would be the value of the outcome associated with that observation.
Just to be clear, the median in this example would not be 3, but the value of
the outcome associated with the third observation. The median is well-suited
for ordinal variables but can also provide useful information regarding numeric
variables.
Here is a useful formula for finding the middle observation:
$$\text{Middle Observation} = \frac{n+1}{2}$$
where n = number of cases
If n is an odd number, then the middle is a single data point. If n is an even
number, then the middle is between two data points and we use the mid-point
between those two values as the median. Figure 5.1 illustrates how to find the
median, using hypothetical data for a small sample of cases.
In the first row of data, the sixth observation perfectly splits the sample, with
five observations above it and five below. The value of the sixth observation, 15,
is the median. In the second row of data, there are an even number of cases (10),
so the middle of the distribution is between the fifth and sixth observations, with
values of 14 and 15, respectively. The median is the mid-point between these
values, 14.5.
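A quick check of these rules in R, using made-up values chosen to reproduce the medians just described (these are not the values shown in Figure 5.1):
#Eleven cases: the sixth value (15) is the median
median(c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20))
#Ten cases: the median is the midpoint of the 5th (14) and 6th (15) values
median(c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19))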
Now, let’s look at this with some real-world data, using the abortion laws vari-
able from the states20 data set. There are 50 observations, so the mid-point
is between the 25th and 26th observations ((50+1)/2). Obviously, we can get
this ordering by having R sort the values from lowest to highest:
Figure 5.1: Finding the Median with Odd and Even Numbers of Cases
#Sort the abortion laws variable from lowest to highest
sort(states20$abortion_laws)
 [1] 1 2 3 3 4 4 4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 8 8 8 9
[26] 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 11 11 12 12 12 13 13
We could also get this information more easily:
#Get the median value
median(states20$abortion_laws)
[1] 9
To illustrate the importance of listing the outcomes in order of magnitude, check
the list of outcomes for states20$abortion_law when the data are listed in
alphabetical order, by state name:
#list abortion_laws without sorting from lowest to highest
states20$abortion_laws
[1] 8 8 11 10 5 2 4 6 10 9 5 11 5 12 9 13 10 10 4 6 5 10 9 9 12
[26] 7 10 7 3 7 6 4 9 9 10 13 3 9 6 10 10 10 12 10 1 8 5 6 10 6
If you took the mid-point between the 25th (12) and 26th (7) observations, you
would report a median of 9.5. While this is close to 9, by coincidence, it is
incorrect.
Let’s turn now to finding the median value for spending preferences on programs
for the poor (anes20$V201320x). Since there are over 8000 observations for this
variable, it is not practical to list them all in order, but we can do the same sort of
thing using a frequency table. Here, I use the Freq command to get a frequency
table that includes cumulative percentages:
#Frequency table with cumulative percentages
Freq(anes20$V201320x)
What we want to do now is use the cumulative percent to identify the category
associated with the 50th percentile. In this case, you can see that the cumulative
frequency for the category “Increased a little” is 50.8%, meaning the middle
observation (50th percentile) is in this category, so this is the median outcome.
We can check this with the median command in R. One little quirk here is that
R requires numeric data to calculate the median. That’s fine. All we have to
do is tell R to treat the values as if they were numeric, replacing the five levels
with values 1,2,3,4 and 5.
#Get the median, treating the variable as numeric
median(as.numeric(anes20$V201320x), na.rm=T)
[1] 2
We get confirmation here that the median outcome is the second category, “In-
creased a little”.
5.4 The Mean
The formula for the mean of a variable x is:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
This reads: the sum of the values of all observations (the numerical outcomes)
of x, divided by the total number of valid observations (observations with real
values). This formula illustrates an important way in which the mean is different
from both the median and the mode: it is based on information from all of the
values of x, not just the middle value (the median) or the value that occurs
most often (the mode). This makes the mean a more encompassing statistic
than either the median or the mode.
Using this formula, we could calculate the mean number of state abortion re-
strictions something like this:
$$\bar{x} = \frac{1 + 2 + 3 + 3 + 4 + 4 + \cdots + 11 + 12 + 12 + 12 + 13 + 13}{50}$$
We don’t actually have to add up all fifty outcomes manually to get the numer-
ator. Instead, we can tell R to sum up all of the values of x:
#Sum all of the values of 'abortion_laws'
sum(states20$abortion_laws)
[1] 394
So the numerator is 394. We divide through by the number of cases (50) to get
the mean:
#Divide the sum of all outcomes by the number of cases
394/50
[1] 7.88
$$\bar{x} = \frac{394}{50} = 7.88$$
Or, more directly, use the mean function:
#Use the 'mean' function
mean(states20$abortion_laws)
[1] 7.88
A very important characteristic of the mean is that it is the point at which the
weight of the values is perfectly balanced on each side. It is helpful to think
of the outcomes of a numeric variable as being distributed according to their weight (value)
along a plank resting on a fulcrum, with the fulcrum placed at the mean of
the distribution, the point at which both sides are perfectly balanced. If the
fulcrum is placed at the mean, the plank will not tip to either side, but if it is
placed at some other point, then the weight will not be evenly distributed and
the plank will tip to one side or the other.
[Figure: balance-beam illustration of the mean. Source: https://stats.stackexchange.com/questions/200282/explaining-mean-median-mode-in-laymans-terms]
A really important concept here is the deviation of observations of x from the
mean of x. Mathematically, this is represented as $x_i - \bar{x}$. So, for instance, the
deviation from the mean number of abortion restrictions (7.88) for a state like
Alabama, with 8 abortion restrictions on the books is .12 units (8-7.88), and for
a state like Colorado, with 2 restrictions on the books, the deviation is -5.88 (2-
7.88). For any numeric variable, the distance between the mean and all values
greater than the mean is perfectly offset by the distances between the mean and
all values lower than the mean. The distribution is perfectly balanced around
the mean.
This means that summing up all of the deviations from the mean, $\sum_{i=1}^{n}(x_i - \bar{x})$,
is always equal to zero. This point is important for understanding what the
mean represents, but it is also important for many other statistics that are
based on the mean.
We can check this balance property in R. First, we express each observation as
a deviation from the mean.
#Subtract the mean of x from each value of x
dev_mean<-states20$abortion_laws - mean(states20$abortion_laws)
#Print the mean deviations
dev_mean
[1] 0.12 0.12 3.12 2.12 -2.88 -5.88 -3.88 -1.88 2.12 1.12 -2.88 3.12
[13] -2.88 4.12 1.12 5.12 2.12 2.12 -3.88 -1.88 -2.88 2.12 1.12 1.12
[25] 4.12 -0.88 2.12 -0.88 -4.88 -0.88 -1.88 -3.88 1.12 1.12 2.12 5.12
[37] -4.88 1.12 -1.88 2.12 2.12 2.12 4.12 2.12 -6.88 0.12 -2.88 -1.88
[49] 2.12 -1.88
As expected, some deviations from the mean are positive, and some are negative.
Now, if we take the sum of all deviations, we get:
#Sum the deviations of x from the mean of x
sum(dev_mean)
[1] 5.329071e-15
That’s a funny looking number. Because of the length of the resulting number,
the sum of the deviations from the mean is reported using scientific notation. In
this case, the scientific notation is telling us we need to move the decimal point
15 places to the left, resulting in .00000000000000532907. Not quite exactly 0,
due to rounding, but essentially 0.1
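If you would rather see the full decimal form, R can be told not to use scientific notation; for example:
#Display the sum in full decimal form rather than scientific notation
format(sum(dev_mean), scientific=FALSE)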
Note that the median does not share this "balance" property; if we placed a
fulcrum at the median, the distribution would tip over because the positive
deviations are not perfectly offset by the negative deviations, as shown here:
#Sum the deviations of x from the median of x
sum(states20$abortion_laws- median(states20$abortion_laws))
[1] -56
The negative deviations from the median outweigh the positive by a value of
56.
1 If you are interested in how the mean came to be used as a measure of “representative-
freq(anes20$mil_serv, plot=F)
Adding a suffix like ".n" to the new variable name is not required, but these types of extensions are helpful when trying to remember
which similarly named variable is which. Now, let’s get a frequency for the new
variable:
#check the new indicator variable
freq(anes20$service.n, plot=F)
anes20$service.n
Frequency Percent Valid Percent
0 7311 88.2971 88.59
1 942 11.3768 11.41
NA's 27 0.3261
Total 8280 100.0000 100.00
#The mean of an indicator variable is the proportion of cases coded 1
mean(anes20$service.n, na.rm=T)
[1] 0.1141403
The mean of .114 tells us that about 11.4 percent of respondents have served in the military, matching the valid percent in the frequency table above.
One thing you might be questioning at this point is how we can treat what
seems like a nominal variable (whether people have or have not served in the
military) as a numeric variable.
variable measures the presence (1) or absence (0) of a characteristic. In cases
like this, you can think of the 0 value as a genuine zero point, an important
characteristic of most numeric variables. In other words, 0 means that the
respondent has none of the characteristic (military service) being measured.
Since the value 1 indicates having a unit of the characteristic, we can treat this
variable as numeric. However, it is important to always bear in mind what
the variable represents and that there are only two outcomes, 0 and 1. This
becomes especially important in later chapters when we discuss using these types
of variables in regression analysis.
5.5 Mean, Median, and the Distribution of Variables
Comparing the mean to the median tells us something about the shape of a
distribution. When the mean is greater than the median, it usually signals a
positively (right) skewed distribution, due to some extreme values at the high
end of the scale; when the mean is less than the median, it usually signals a
left (negatively) skewed distribution, due to some extreme values at the low end of
the scale. When the mean and median are the same, or nearly the same, this
could indicate a bell-shaped distribution, but there are other possibilities as
well. When the mean and median are the same, there is no skew.
Figure 5.3 provides caricatures of what these patterns might look like. The first
graph shows that most data are concentrated at the low (left) end of the x-axis,
with a few very extreme observations at the high (right) end of the axis pulling
the mean out from the middle of the distribution. This is a positive skew. The
second graph shows just the opposite pattern: most data at the high end of the
x-axis, with a few extreme values at the low (left) end dragging the mean to the
left. This is a negatively skewed distribution.
[Figure 5.3: Density plots of a positively skewed distribution (mean pulled above the median), a negatively skewed distribution (mean pulled below the median), and a distribution with no skew (mean = median).]
Let’s see what this looks like when using real-world data, starting with abortion
laws in the states. First, we take another look at the mean and median, which
we produced in an earlier section of this chapter, and then we can see a density
plot for the number of restrictive abortion laws in the states.
#Mean and median number of abortion restrictions
mean(states20$abortion_laws)
[1] 7.88
median(states20$abortion_laws)
[1] 9
The mean is a bit less than the median, so we might expect to see signs of
negative (left) skewness in the density plot.
plot(density(states20$abortion_laws),
xlab="# Abortion Restrictions in State Law",
main="")
#Insert vertical lines for mean and median
abline(v=mean(states20$abortion_laws))
abline(v=median(states20$abortion_laws), lty=2) #use dashed line
[Density plot: # Abortion Restrictions in State Law, with a solid vertical line at the mean and a dashed vertical line at the median.]
Here we see the difference between the mean (solid line) and the median (dashed
line) in the context of the full distribution of the variable, and the graph shows
a distribution with a bit of negative skew. The skewness is not severe, but it is
visible to the naked eye.
An R code digression. The density plot above includes the addition of two
vertical lines, one for the mean and one for the median. To add these lines, I used
the abline command. This command allows you to add lines to existing graphs.
In this case, I want to add two vertical lines, so I use v= to designate where to
put the lines. I could have put in the numeric values of the mean and median
(v=7.88 and v=9), but I chose to have R calculate the mean and median
and insert the results in the graph (v=mean(states20$abortion_laws) and
v=median(states20$abortion_laws)). Either way would get the same result.
Also, note that for the median, I added lty=2 to get R to use “line type 2”,
which is a dashed line. The default line type is a solid line, which is used for
the mean.
Now, let’s take a look at another distribution, this time for a variable we have
not looked at before, the percent of the state population who are foreign-born
(states20$fb). This is an increasingly important population characteristic,
with implications for a number of political and social outcomes.
mean(states20$fb)
[1] 7.042
median(states20$fb)
[1] 4.95
Here, we see a bit more evidence of a skewed distribution. In absolute terms, the
difference between the mean and the median (2.09) is not much greater than in
the first example (1.12), but the density plot (below) looks somewhat more like
a skewed distribution than in the first example. In this case, the distribution is
positively skewed, with a few relatively high values pulling the mean out from
the middle of the distribution.
#Density plot for % foreign-born
plot(density(states20$fb),
xlab="Percent foreign-born",
main="")
#Add lines for mean and median
abline(v=mean(states20$fb))
abline(v=median(states20$fb), lty=2) #Use dashed line
[Density plot: Percent foreign-born, with a solid vertical line at the mean and a dashed vertical line at the median.]
Finally, let's look at one more distribution, the percent of the two-party vote for Joe Biden in the 2020 election:
mean(states20$d2pty20)
[1] 48.8044
median(states20$d2pty20)
[1] 49.725
Here, there is very little difference between the mean and the median, and there
appears to be almost no skewness in the shape of the density plot (below).
plot(density(states20$d2pty20),
xlab="% of Two-Party Vote for Biden",
main="")
abline(v=mean(states20$d2pty20))
abline(v=median(states20$d2pty20), lty=2)
[Density plot: % of Two-Party Vote for Biden, with a solid vertical line at the mean and a dashed vertical line at the median.]
What’s interesting here is that the distance between the mean and median for
Biden’s vote share (.92) is not terribly different than the distance between the
mean and the median for the number of abortion laws (1.12). Yet there is not a
hint of skewness in the distribution of votes, while there is clearly some negative
skew to abortion laws. This, of course, is due to the difference in scale between
the two variables. Abortion laws range from 1 to 13, while Biden’s vote share
ranges from 27.5 to 68.3, so relative to the scale of the variable, the distance
between the mean and median is much greater for abortion laws in the states
than it is for Biden's share of the two-party vote in the states. This illustrates
the real value of examining a numeric variable’s distribution with a histogram
or density plot alongside statistics like the mean and median. Graphs like these
provide context for those statistics.
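One rough way to make this point is to scale the mean-median gap by each variable's range (a quick sketch, using the values reported above):
#Mean-median gap relative to each variable's range
(median(states20$abortion_laws) - mean(states20$abortion_laws))/
  diff(range(states20$abortion_laws))   #about .09
(median(states20$d2pty20) - mean(states20$d2pty20))/
  diff(range(states20$d2pty20))         #about .02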
5.6 Skewness Statistic
The Skew function (from the DescTools package) provides a single numeric summary of skewness. Applying it to the three variables examined above:
Skew(states20$abortion_laws)
[1] -0.3401058
Skew(states20$fb)
[1] 1.202639
Skew(states20$d2pty20)
[1] 0.006792623
These skewness statistics make a lot of sense, given the earlier discussion of these
three variables. There is a little bit of negative skew (-.34) to the distribution of
abortion restrictions in the states, a more pronounced level of (positive) skewness
to the distribution of the foreign-born population, and no real skewness in the
distribution of Biden support in the states (.007). None of these results are
anywhere near the -2,+2 cut-off points discussed earlier, so these distributions
should not pose any problems for any analysis that uses these variables.
Quite often, you will be using variables whose distributions look a lot like the
three shown above, maybe a bit skewed, but nothing too outrageous. Occa-
sionally, though, you come across variables with severe skewness that can be
detected visually and by using the skewness statistic. As a point of reference,
consider the three distributions shown earlier in Figure 5.3, one with what ap-
pears to be severe positive skewness (top left), one with severe negative skewness
(top right), and one with no apparent skewness (bottom left). The value of the
skewness statistics for these three distributions are 4.25, -4.25, and 0, respec-
tively.
3 Skewness $= \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{n \cdot S^3}$, where n = number of cases and S = standard deviation
The solid vertical lines for the mean and dashed vertical lines for the median used
in the graphs shown above are useful visual aids for understanding how those
variables are distributed. These lines were explained in the text, but there was
no identifying information provided in the graphs themselves. When presenting
information like this in a formal setting (e.g., on the job, for an assignment, or
for a term paper), it is usually expected that you provide a legend with your
graphs that identifies what the lines represent (see figure 5.3 as an example).
This can be a bit complex, especially if there are multiple lines in your graph.
For our current purposes, however, it is not terribly difficult to add a legend.
We’re going to add the following line of code to the commands used to generate
the plot for states20$fb to add a legend to the plot:
#Create a legend to be put in the top-right corner, identifying
#mean and median outcomes with solid and dashed lines, respectively
legend("topright", legend=c("Mean", "Median"), lty=1:2)
The first bit ("topright") is telling R where to place the legend in the graph.
In this case, I specified topright because I know that is where there is empty
space in the graph. You can specify top, bottom, left, right, topright, topleft,
bottomright, or bottomleft, depending on where the legend fits the best. The
second piece of information, legend=c("Mean", "Median"), provides names for
the two objects being identified, and the last part, lty=1:2, tells R which line
types to use in the legend (the same as in the graph). Let’s add this to the
command lines for the density plot for states20$fb and see what we get.
plot(density(states20$fb),
xlab="Percent foreign-born",
main="")
abline(v=mean(states20$fb))
abline(v=median(states20$fb), lty=2)
#Add the legend
legend("topright", legend=c("Mean", "Median"), lty=1:2)
[Density plot: Percent foreign-born, with a legend identifying the mean (solid line) and median (dashed line).]
This extra bit of information fits nicely in the graph and aids in interpreting
the pattern in the data.
Sometimes, it is hard to get the legend to fit well in a graph space. When this
happens, you need to tinker a bit to get a better fit. There are some fairly
complicated ways to achieve this, but I favor trying a couple of simple things
first: try different locations for the legend, reduce the number of words you use
to name the lines, or add the cex argument to the legend command to reduce the overall size of the
legend. When using cex, you might start with cex=.8, which will reduce the
legend to 80% of its original size, and then change the value as needed to make
the legend fit.
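For instance, the legend command used above can be shrunk to 80% of its original size like this:
#Same legend, reduced in size with 'cex'
legend("topright", legend=c("Mean", "Median"), lty=1:2, cex=.8)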
5.9 Exercises
5.9.1 Concepts and Calculations
As usual, when making calculations, show the process you used.
1. Let’s return to the list of variables used for exercises in Chapters 1 &
3. Identify what you think is the most appropriate measure of central
tendency for each of these variables. Choose just one measure for each
variable.
• Course letter grade
• Voter turnout rate (votes cast/eligible voters)
• Marital status (Married, divorced, single, etc)
• Occupation (Professor, cook, mechanic, etc.)
• Body weight
• Total number of votes cast in an election
• #Years of education
• Subjective social class (Poor, working class, middle class, etc.)
• Poverty rate
• Racial or ethnic group identification
2. Below is a list of voter turnout rates in twelve Wisconsin Counties during
the 2020 presidential election. Calculate the mean, median, and mode for
the level of voter turnout in these counties. Which of these is the most
appropriate measure of central tendency for this variable? Why? Based on
the information you have here, is this variable skewed in either direction?
3. The following table provides the means and medians for four different
variables from the states20 data set. Use this information to offer your
best guess as to the level and direction of skewness in these variables.
Explain your answer. Make sure to look at the states20 codebook first,
so you know what these variables are measuring.
5.9.2 R Problems
As usual, show the R commands and output you used to answer these questions.
1. Use R to report all measures of central tendency that are appropriate
for each of the following variables: The feeling thermometer rating
for the National Rifle Association (anes20$V202178), Latinos as a
percent of state populations (states20$latino), party identification
(anes20$V201231x), and region of the country where ANES survey
respondents live (anes20$V203003). Where appropriate, also discuss
skewness.
2. Using one of the numeric variables listed in the previous question, create
a density plot that includes vertical lines showing the mean and median
outcomes, and add a legend. Describe what you see in the plot.
Chapter 6
Measures of Dispersion
6.2 Introduction
While measures of central tendency give us some sense of the typical outcome
of a given variable, it is possible for variables with the same mean, median, and
mode to have very different looking distributions. Take, for instance, the three
graphs below; all three are perfectly symmetric (no skew), with the same means
and medians, but they vary considerably in the level of dispersion around the
mean. On average, the data are most tightly clustered around the mean in the
first graph, spread out the most in the third graph, and somewhere in between
in the second graph. These graphs vary not in central tendency but in how
concentrated the distributions are around the central tendency.
The concentration of observations around the central tendency is an important
concept and we are able to measure it in a number of different ways using
measures of dispersion. There are two different, related types of measures of
[Three density plots over the same x range (4 to 16), each with Mean=10 and Median=10, differing only in how spread out the values are.]
Figure 6.1: Distributions with Identical Central Tendencies but Different Levels
of Dispersion
dispersion, those that summarize the overall spread of the outcomes, and those
that summarize how tightly clustered the observations are around the mean.
6.3.1 Range
The range is a measure of dispersion that does not use information about the
central tendency of the data. It is simply the difference between the lowest
and highest values of a variable. To be honest, this is not always a useful or
interesting statistic. For instance, all three of the graphs shown above have the
same range (4 to 16) despite the differences in the shape of the distributions.
Still, the range does provide some information and could be helpful in alerting
you to the presence of outliers, or helping you spot coding problems with the
data. For example, if you know the realistic range for a variable measuring age
in a survey of adults is roughly 18-100(ish), but the data show a range of 18 to
950, then you should look at the data to figure out what went wrong. In a case
like this, it could be that a value of 950 was recorded rather than the intended
value of 95.
Below, we examine the range for the age of respondents to the cces20 survey.
R provides the minimum and maximum values but does not show the width of
the range.
#First, create cces20$age, using cces20$birthyr
cces20$age<-2020-cces20$birthyr
#Then get the range for `age` from the cces20 data set.
range(cces20$age, na.rm=T)
[1] 18 95
Here, we see that the range in age in the cces20 sample is from 18 to 95, a
perfectly plausible age range (only adults were interviewed), a span of 77 years.
Other than this, there is not much to say about the range of this variable.
6.3.2 Interquartile Range
Another measure of spread is the interquartile range (IQR), the distance between the 25th and 75th percentiles, which captures the middle 50% of observations:
#Get the interquartile range of age
IQR(cces20$age, na.rm=T)
[1] 30
#Get quartiles for better understanding of IQR
summary(cces20$age)
Using the cces20$age variable, the 25th percentile (“1st Qu”) is associated with
the value 33 and the 75th percentile (“3rd Qu”) with the value 63, so the IQR
is from 33 to 63 (a difference of 30).
The summary command also reports the mean and median, and the minimum and maximum
values, which describe the range of the variable.
the mean and the median are very similar in value, indicating very little skewness
to the data, an observation supported by the skewness statistic.
#Get skewness statistic
Skew(cces20$age)
[1] 0.06791901
One interesting aspect of the inter-quartile range is that you can think of it as
both a measure of dispersion and a measure of central tendency. The upper and
lower limits of the IQR define the middle of the distribution, the very essence of
central tendency, while the difference between the upper and lower limits is an
indicator of how spread out the middle of the distribution is, a central concept
to measures of dispersion. One important note, though, is that the width of the
IQR reflects not just the spread of the data, but also the scale of the variable.
An IQR width of 30 means one thing on a variable with a range from 18 to 95,
like cces20$age, but quite another on a scale with a more restricted range, say
18 to 65, in which case, an IQR equal to 30 would indicate a lot more spread in
the data, relative to the scale of the variable.
We can use tools learned earlier to visualize the interquartile range of age, using
a histogram with vertical lines marking its upper and lower limits.
#Age Histogram
hist(cces20$age, xlab="Age",
main="Histogram of Age with Interquartile Range")
#Add lines for 25th and 75th percentiles
abline(v=33, lty=2,lwd=2)
abline(v=63, lwd=2)
#Add legend
legend("topright", legend=c("1st Qu.","3rd Qu."), lty=c(2,1))
[Histogram of Age with Interquartile Range: a dashed vertical line at the 1st quartile (33) and a solid vertical line at the 3rd quartile (63), with a legend in the top-right corner.]
6.3.3 Boxplots
A nice graphing method that focuses explicitly on the range and IQR is the
boxplot, shown below. The boxplot is a popular and useful tool, one that is
used extensively in subsequent chapters.
#Boxplot Command
boxplot(cces20$age, main="Boxplot for Respondent Age", ylab="Age")
I’ve added annotation to the output in the figure below to make it easier for you
to understand the contents of a boxplot. The dark horizontal line in the plot
is the median, the box itself represents the middle fifty percent, and the two
end-caps usually represent the lowest and highest values. In cases where there
are extreme outliers, they will be represented with dots outside the upper and
lower limits.1 Similar to what we saw in the histogram, the boxplot shows that
the middle 50% of outcomes is situated fairly close to the middle of the range,
indicating a low level of skewness.
Now, let’s look at boxplots and associated statistics using some of the variables
from the states20 data set that we looked at in Chapter 5: percent of the
1 Outliers are defined as values below $Q1 - 1.5 \times IQR$ or above $Q3 + 1.5 \times IQR$.
state population who are foreign-born, and the number of abortion restrictions
in state law. First, you can use the summary command to get the relevant
statistics, starting with states20$fb
#Get summary statistics
summary(states20$fb)
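The boxplot itself can be produced with a command along these lines (a sketch following the same pattern used for cces20$age above):
#Boxplot for % foreign-born
boxplot(states20$fb,
        ylab="% Foreign-born",
        main="Boxplot of % Foreign-born")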
[Boxplot of % Foreign-born]
Here, you can see that the “Box” is located at the low end of the range and that
there are a couple of extreme outliers (the circles) at the high end. This is what
a positively skewed variable looks like in a box plot. Bear in mind that the
level of skewness for this variable is not extreme (1.2), but you can still detect
it fairly easily in the box plot. Contrast this to the boxplot for cces20$age
(above), which offers no strong hint of skewness.
The range for the entire variable is from 1 to 13, and the interquartile range is
at the high end of this scale at 6 to 10. This means there is a concentration of
observations toward the high end of the scale, with fewer observations at the low
end. Again, the same information is provided in the horizontal boxplot below.
#Boxplot for abortion restriction laws
boxplot(states20$abortion_laws,
xlab="Number of Abortion Restrictions",
main="Boxplot of State Abortion Restrictions",
horizontal = T)
[Horizontal boxplot of State Abortion Restrictions (Number of Abortion Restrictions, 1 to 13)]
The pattern of skewness is not quite as clear as it was in the example using
the foreign-born population, but we wouldn’t expect it to be since we know the
distribution is not as skewed. Still, you can pick up hints of a modest level
of skewness based on the position of the “Box” along with the location of the
median.
The inter-quartile range can also be used for ordinal variables, with limits.
For instance, for the ANES question on preferences for spending on the poor
(anes20$V201320x), we can determine the IQR from the cumulative relative
frequency:
Freq(anes20$V201320x)
For this variable, the 25th percentile is in the first category (“Increased a lot”)
and the 75th percentile is in the third category ("Kept the same"). The language
used for ordinal variables is a bit different and not as quantitative as when using numeric
data. For instance, in this case, it is not appropriate to say the inter-quartile
range is 2 (from 1 to 3), as the concept of numeric difference doesn't work
well here. Instead, it is more appropriate to say the inter-quartile range is from
'increase a lot' to 'kept the same', or that the middle 50% hold opinions ranging
from wanting spending increased a lot to wanting it kept the same.
6.4 Dispersion Around the Mean
The most intuitive idea for measuring dispersion around the mean would be to calculate the average deviation from the mean. The problem is that this quantity is always zero:
$$\text{Average Deviation} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})}{n} = 0$$
The sum of the positive deviations from the mean will always be equal to the
sum of the negative deviations from the mean. No matter what the distribution
looks like, the average deviation is zero. So, in practice, this is not a useful
statistic, even though, conceptually, the typical deviation from the mean is
what we want to measure. What we need to do, then, is somehow treat the
negative values as if they were positive, so the sum of the deviations does not
always equal 0.
Mean absolute deviation. One solution is to express deviations from the
mean as absolute values. So, a deviation of –2 (meaning two units less than the
mean) would be treated the same as a deviation of +2 (meaning two more than
the mean). Again, conceptually, this is what we need, the typical deviation from
the mean but without offsetting positive and negative deviations.
$$\text{M.A.D.} = \frac{\sum_{i=1}^{n}|x_i - \bar{x}|}{n}$$
Let’s calculate this using the percent foreign-born (states20$fb):
#Calculate absolute deviations
absdev=abs(states20$fb-mean(states20$fb))
#Sum deviations and divide by n (50)
M.A.D.<-sum(absdev)/50
#Display M.A.D
M.A.D.
[1] 4.26744
This result shows that, on average, the observations for this variable are within
4.27 units (percentage points, in this case) of the mean. We can get the same
result using the MeanAD function from the DescTools package:
#Mean absolute deviation from R
MeanAD(states20$fb, center=mean)
[1] 4.26744
Variance. Another common solution is to square the deviations from the mean before summing them:
$$\text{Variance } (S^2) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
This is a very useful measure of dispersion and forms the foundation for many
other statistics, including inferential statistics and measures of association. Let’s
review how to calculate it, using the percent foreign-born (states20$fb):
First, square the deviations from the mean; then sum the squared deviations and divide by n-1 (49):
#Square the deviations from the mean
fb_dev.sq<-(states20$fb - mean(states20$fb))^2
#Calculate variance
sum(fb_dev.sq)/49
[1] 29.18004
Of course, you don’t have to do the calculations on your own. You could just
ask R to give you the variance of states20$fb:
#Use the 'var' function to get the variance
var(states20$fb)
[1] 29.18004
One difficulty with the variance, at least from the perspective of interpretation,
is that because it is expressed in terms of squared deviations, it can sometimes be
hard to connect the number back to the original scale. On its face, the variance
can create the impression that there is more dispersion in the data than there is.
For instance, the resulting number (29.18) is greater than the range of outcomes
(20.3) for this variable. This makes it difficult to relate the variance to the actual
data, especially if it is supposed to represent the “typical” deviation from the
mean. As important as the variance is as a statistic, it comes up a bit short
as an intuitive descriptive device for conveying to people how much dispersion
there is around the mean. This brings us to the most commonly used measure
of dispersion for numeric variables.
Standard Deviation. The standard deviation is the square root of the vari-
ance. Its important contribution to consumers of descriptive statistical infor-
mation is that it returns the variance to the original scale of the variable.
$$S = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$$

For the percent foreign-born, this is the square root of 29.18, or about 5.40.
Or get it from R:
# Use 'sd' function to get the standard deviation
sd(states20$fb)
[1] 5.401855
Note that the number 5.40 makes a lot more sense as a measure of typical
deviation from the mean for this variable than 29.18 does. Although this is not
exactly the same as the average deviation from the mean, it should be thought
of as the “typical” deviation from the mean. You may have noticed that the
standard deviation is a bit larger than the mean absolute deviation. This will
generally be the case because squaring the deviations assigns a bit more weight
to relatively high or low values.
Although taking the square root of the variance returns the value to the scale of the
variable, there is still a wee problem with interpretation. Is a standard deviation
of 5.40 a lot of dispersion, or is it a relatively small amount of dispersion? In
isolation, the number 5.40 does not have a lot of meaning. Instead, we need to
have some standard of comparison. The question should be, “5.40, compared to
what?” It is hard to interpret the magnitude of the standard deviation on its
own, and it’s always risky to make a comparison of standard deviations across
different variables. This is because the standard deviation reflects two things:
the amount of dispersion around the mean and the scale of the variable.
The standard deviation reported above comes from a variable that ranges from
1.4 to 21.7, and that scale profoundly affects its value. To illustrate the impor-
tance of scale, let's suppose that instead of measuring the foreign-born popula-
tion as a percent of the state population, we measure it as a proportion of the
state population:
#Express foreign-born as a proportion
states20$fb.prop=states20$fb/100
#Get the standard deviation of the new variable
sd(states20$fb.prop)

[1] 0.05401855
Does that mean that there is less dispersion around the mean for fb.prop than
for fb? No. Relative to their scales, both variables have the same amount of
dispersion around their means. The difference is in the scales, nothing else.
Coefficient of variation. One way to take the scale of a variable into account is the coefficient of variation (CV):

$$\text{CV} = \frac{S}{\bar{x}}$$
Here, we express the standard deviation relative to the mean of the variable.
Values less than 1 indicate relatively low levels of variation; values greater than
1 indicate high levels of variation.
#Get the coefficient of variation (DescTools)
CoefVar(states20$fb)

[1] 0.767091
This tells us that the outcomes for the percent foreign-born in the states are
relatively concentrated around the mean. One important caveat regarding the
coefficient of variation is that it should only be used with variables that have
all positive values.
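Since the coefficient of variation is just the standard deviation divided by the mean, you can also verify the result reported above by hand; a minimal check:

#Coefficient of variation by hand: standard deviation divided by the mean
sd(states20$fb)/mean(states20$fb)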
Which of these would you say has the least amount of variation? Just based on
“eyeballing” the plot, it looks like there is hardly any variation in abortion laws,
a bit more variation in the foreign-born percentage of the population, and the
most variation in Biden’s percent of the two-party vote. The most important
point to take away from this graph is that it is a terrible graph! You should
never put three such disparate variables, measuring such different outcomes,
and using different scales in the same distribution graph. It’s just not a fair
comparison, due primarily to differences in scale. Of course, that’s the point of
the graph.
You run into the same problem using the standard deviation as the basis of
comparison:
#get standard deviations for all three variables
sd(states20$fb)
[1] 5.401855
sd(states20$d2pty20)
[1] 10.62935
sd(states20$abortion_laws)
[1] 2.946045
Here we get the same impression regarding relative levels of dispersion across
these variables, but this is still not a fair comparison because one of the pri-
mary determinants of the standard deviation is the scale of the variable. The
larger the scale, the larger the variance and standard deviation are likely to be.
The abortion law variable is limited to a range of 1-13 (abortion regulations),
the foreign-born percent in the states ranges from 1.4% to 21.7%, and Biden’s
percent of the two-party vote ranges from 24% to 67%. So, making comparisons
of standard deviations, or other measures of dispersion, is generally not a good
idea if you do not account for differences in scale.
The coefficient of variation is a useful statistic, precisely for this reason. Below,
we see that when considering variation relative to the scale of the variables, using
the coefficient of variation, the impression of these three variables changes.
#Get coefficient of variation for all three variables
CoefVar(states20$fb)
[1] 0.767091
CoefVar(states20$d2pty20)
[1] 0.2177949
CoefVar(states20$abortion_laws)
[1] 0.3738636
These results paint a different picture of the level of dispersion in these three
variables: the percent foreign-born exhibits the most variation, relative to its
scale, followed by abortion laws, and then by Biden’s vote share. The fact
that the coefficient of variation for Biden’s vote share is so low might come as
a surprise, given that it “looked” like it had as much or more variation than
the other two variables in the boxplot. But the differences in the boxplot were
due largely to differences in the scale of the variables. Once that scale is taken
into account by the coefficient of variation, the impression of dispersion levels
changes a lot.
6.6 Dispersion in Categorical Variables?

As an example, consider anes20$service.n, a dichotomous (0/1) indicator of whether respondents previously served in the military:

#Get frequency distribution for previous military service
freq(anes20$service.n, plot=F)

anes20$service.n
Frequency Percent Valid Percent
0 7311 88.2971 88.59
1 942 11.3768 11.41
NA's 27 0.3261
Total 8280 100.0000 100.00
About 88.6% in 0 (no previous service) and 11.4% in 1 (served). Does that seem
like a lot of variation? It’s really hard to tell without thinking about what a
dichotomous variable with a lot of variation would look like.
First, what would this variable look like if there were no variation in its out-
comes? All of the observations (100%) would be in one category. Okay, so what
would it look like if there was maximum variation? Half the observations would
be in 0 and half in 1; it would be 50/50. The distribution for anes20$service.n
seems closer to no variation than to maximum variation.
We use a different formula to calculate the variance in dichotomous variables:
$$S^2 = p(1-p)$$

$$S = \sqrt{p(1-p)} = \sqrt{.1011} = .318$$
[1] 0.3180009
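The code that produced this number is not shown here; a minimal sketch, assuming anes20$service.n is stored as a 0/1 numeric variable, might be:

#Proportion of respondents in category 1 (ignoring missing values)
p<-mean(anes20$service.n, na.rm=T)
#Standard deviation of a dichotomous variable
sqrt(p*(1-p))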
What does this mean? How should it be interpreted? With dichotomous vari-
ables like this, the interpretation is a bit harder to grasp than with a continuous
numeric variable. One way to think about it is that if this variable exhibited
maximum variation (50% in one category, 50% in the other), the value of the
standard deviation would be $\sqrt{p(1-p)} = \sqrt{.5 \times .5} = .50$. To be honest, though, the real
value in calculating the standard deviation for a dichotomous variable is not in
the interpretation of the value, but in using it to calculate other statistics that
form the basis of statistical inference (Chapter 8).
freq(anes20$raceth.5, plot=F)
spread out evenly across categories, as diverse as possible. The responses for
this variable are concentrated in “White(NH)” (73%), indicating not a lot of
variation.
The Index of Qualitative Variation (IQV) can be used to calculate how
close to the maximum level of variation a particular distribution is.
The formula for the IQV is:
$$IQV = \frac{K}{K-1} \times \left(1 - \sum_{k=1}^{K} p_k^2\right)$$

Where:

K = the number of categories
k = a specific category
$p_k$ = the proportion of observations in category k
This formula is saying to sum up all of the squared category proportions, sub-
tract that sum from 1, and multiply the result by the number of categories
divided by the number of categories minus 1. This last part adjusts the main
part of the formula to take into account the fact that it is harder to get to max-
imum diversity with fewer categories. There is not an easy-to-use R function
for calculating the IQV, so let’s do it the old-fashioned way:
$$IQV = \frac{5}{4} \times \left(1 - (.72915^2 + .08877^2 + .09318^2 + .03473^2 + .05417^2)\right) = .56$$
Note: the proportions are taken from the valid percentages in the frequency
table.
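If you would rather let R do the arithmetic, a minimal sketch using the same proportions (the object names p and K are just illustrative):

#Valid proportions for the five categories
p<-c(.72915, .08877, .09318, .03473, .05417)
#Number of categories
K<-length(p)
#Index of Qualitative Variation
(K/(K-1))*(1-sum(p^2))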
We should interpret this as meaning that this variable is about 56% as diverse
as it could be, compared to maximum diversity. In other words, not terribly
diverse.
Suppose we want to know how a 66-year-old respondent from the cces20 survey compares to the rest of the sample, where the mean age is 48.39. One approach is to assume that age is normally distributed and then apply what we know about the
standard deviation and the normal curve to this variable. To do this we would
need to know how many standard deviations above the mean the 66-year-old is.
First, we take the raw difference between the mean and the respondent’s age,
66:
#Calculate deviation from the mean
(66-48.39)
[1] 17.61
So, the 66-year-old respondent is just about 18 years older than the average
respondent. This seems like a lot, but thinking about this relative to the rest of
the sample, it depends on how much variation there is in the age of respondents.
For this particular sample, we can evaluate this with the standard deviation,
which is 17.66.
#Get std dev of age
sd(cces20$age, na.rm=T)
[1] 17.65902
So, now we just need to figure out how many standard deviations above the
mean age this 66-year-old respondent is. This is easy to calculate. They are
17.61 years older than the average respondent, and the standard deviation for
this variable is 17.66, so we know that the 66-year-old is close to one standard
deviation above the mean (actually, .997 standard deviations):
#Express deviation from mean, relative to standard deviation
(66-48.39)/17.66
[1] 0.9971687
Let’s think back to the normal distribution in Figure 6.4 and assume that age
is a normally distributed variable. We know that the 66-year-old respondent is
about one standard deviation above the mean, and that approximately 84% of
the observations of a normally distributed variable are less than one standard
deviation above the mean. Therefore, we might expect that the 66-year-old
respondent is in the 84th percentile for age in this sample.
Of course, empirical variables such as age are unlikely to be normally distributed.
Still, for most variables, if you know that an outcome is one standard deviation
or more above the mean, you can be confident that, relative to that variable’s
distribution, the outcome is a fairly high value. Likewise, an outcome that is one
standard deviation or more below the mean is relatively low. And, of course, an
outcome that is two standard deviations (below) above the mean is very high
(low), relative to the rest of the distribution.
The statistic we calculated above to express the respondent’s age relative to
the empirical distribution is known as a z-score. Z-scores transform the original
(raw) values of a numeric variable into the number of standard deviations above
or below the mean that those values are. Z-scores are calculated as:
$$Z_i = \frac{x_i - \bar{x}}{S}$$
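As a quick illustration of the formula, here is a minimal sketch of how you could convert the age variable used above into z-scores (the new variable name age_z is just illustrative):

#Z-scores for age: deviation from the mean, divided by the standard deviation
cces20$age_z<-(cces20$age-mean(cces20$age, na.rm=T))/sd(cces20$age, na.rm=T)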
cces20$age_66
Frequency Percent
18-66 50050 82.05
67 and older 10950 17.95
Total 61000 100.00
One reason the estimate from an empirical sample is so close to the expectations
based on the theoretical normal distribution is that the distribution for age does
not deviate drastically from normal (See histogram below). The solid curved
line in Figure 6.5 is the normal distribution, and the vertical line identifies the
cutoff point for 66-years-old.
Figure 6.5: Comparing the Empirical Histogram for Age with the Normal Curve
2 Code for creating this table was adapted from a post by Arthur Charpentier–https://
www.r-bloggers.com/2013/10/generating-your-own-normal-distribution-table/.
Table 6.1. Areas Under the Normal Curve to the left of Positive Z-scores
So, for instance, if you look at the intersection of 1.0 on the side and 0.00 at
the top (z=1.0), you see the value of .8413. This means, as we already know,
that approximately 84% of the area under the curve is below a z-score of 1.0.
Of course, this also means that approximately 16% of the area lies above z=1.0,
and approximately 34% lies between the mean and a z-score of 1.0. Likewise,
we can look at the intersection of 1.9 (row) and .06 (column) to find the area
under the curve for a z-score of 1.96. The area to the left of z=1.96 is 97.5%, and
the area to the right of z=1.96 is 2.5% of the total area under the curve. Let’s
walk through this once more, using z=1.44. What is the area to the left, to the
right, and between the mean and z=1.44:
• To the left: .9251 (found at the intersection of 1.4 (row) and .04 (column))
• To the right: (1-.9251) = .0749
• Between the mean and 1.44: (.9251-.5) = .4251
If these results confuse you, make sure to check in with your professor.
So, that’s the “old-school” way to find areas under the normal distribution, but
there is an easier and more precise way to do this using R. To find the area to
the left of any given z-score, we use the pnorm function. This function displays
the area under the curve to the left of any specified z-score. Let’s check our work
above for z=1.44.
#Get area under the curve to the left of z=1.44
pnorm(1.44)
[1] 0.9250663
So far, so good. R can also give you the area to the right of a z-score. Simply
add lower.tail = F to the pnorm command:
#Get area under the curve to the right of z=1.44
#Add "lower.tail = F"
pnorm(1.44, lower.tail = F)
[1] 0.0749337
And the area between the mean and z=1.44:
pnorm(1.44)-.5
[1] 0.4250663
6.11 Exercises
6.11.1 Concepts and Calculations
As usual, when making calculations, show the process you used.
1. Use the information provided below about three hypothetical variables
and determine which of the variables appears to be skewed in one direction or
the other. Explain your conclusions.
2. The list of voter turnout rates in Wisconsin counties that you used for
an exercise in Chapter 5 is reproduced below, with two empty columns
added: the deviation of each observation from the mean, and another
for the square of that deviation. Fill in this information and calculate
the mean absolute deviation and the standard deviation. Interpret these
statistics. Which one do you find easiest to understand? Why? Next,
calculate the coefficient of variation for this variable. How do you interpret
this statistic?
Wisconsin County     % Turnout     $X_i - \bar{X}$     $(X_i - \bar{X})^2$
Clark 63
Dane 87
Forest 71
Grant 63
Iowa 78
Iron 82
Jackson 65
Kenosha 71
Marinette 71
Milwaukee 68
Portage 74
Taylor 70
3. Across the fifty states, the average cumulative number of COVID-19 cases
per 10,000 population in August of 2021 was 1161, and the standard de-
viation was 274. The cases per 10,000 were 888 in Virginia and 1427 in
South Carolina. Using what you know about the normal distribution,
and assuming this variable follows a normal distribution, what percent of
states do you estimate have values equal to or less than Virginia’s, and
what percent do you expect to have values equal to or higher than South
Carolina’s? Explain how you reached your conclusions.
4. The average voter turnout rate across the states in the 2020 presidential
election was 67.4% of eligible voters, and the standard deviation was 5.8.
Calculate the z-scores for the following list of states and identify which
state is most extreme and which state is least extreme.
6.11.2 R Problems
1. Using the pnorm function, estimate the area under the normal curve for
each of the following:
• Above Z=1.8
• Below Z= -1.3
• Between Z= -1.3 and Z=1.8
2. For the remaining problems, use the countries2 data set. One important
variable in the countries2 data set is lifexp, which measures life ex-
pectancy. Create a histogram, a boxplot, and a density plot, and describe
the distribution of life expectancy across countries. Which of these graph-
ing methods do you think is most useful for getting a sense of how much
variation there is in this variable? Why? What about skewness? What
can you tell from these graphs? Be specific.
3. Use the results of the Desc command to describe the amount of variation in
life expectancy, focusing on the range, inter-quartile range, and standard
deviation. Make sure to provide interpretations of these statistics.
Chapter 7

Probability
7.2 Probability
We all use the language of probability in our everyday lives. Whether we are
talking or thinking about the likelihood, odds, or chance that something will
occur, we are using probabilistic language. At its core, probability is about
whether something is likely or unlikely to happen.
Probabilities can be thought of as relative frequencies that express how often a
given outcome (X) occurs, relative to the number of times it could occur:
$$P(X) = \frac{\text{Number of X outcomes}}{\text{Number of possible X outcomes}}$$
Probabilities range from 0 to 1, with zero meaning an event never happens and
1 meaning the event always happens. As you move from 0 to 1, the probability of
an event occurring increases. When the probability value is equal to .50, there is
an even chance that the event will occur. This connection between probabilities
and everyday language is summarized below in Figure 7.1. It is also common
for people to use the language of percentages when discussing probabilities, e.g.,
0% to 100% range of possible outcomes. For instance, a .75 probability that
something will occur might be referred to as a 75% probability that the event
will occur.
Consider this example. Suppose we want to know the probability that basketball
phenom and NBA Champion Giannis Antetokounmpo, of the Milwaukee Bucks,
will make any given free-throw attempt. We can use his performance from the
2020-2021 regular season as a guide. Giannis (if I may) had a total of 581 free-
throw attempts (# of possible outcomes) and made 398 (number of X outcomes).
Using the formula from above, we get:

$$P(\text{success}) = \frac{398}{581} = .685$$
The probability of Giannis making any given free-throw during the season was
.685. This outcome is usually converted to a percentage and known as his
free-throw percentage (68.5%). Is this a high probability or a low probability?
Chances are better than not that he will make any given free-throw, but not
by a lot. Sometimes, the best way to evaluate probabilities is comparatively; in
this case, that means comparing Giannis’ outcome to others. If you compared
Giannis to me, he would seem like a free-throw superstar, but that is probably
not the best standard for evaluating a professional athlete. Instead, we can
also compare him to the league average for the 2020-2021 season (.778), or his
own average in the previous year (.633). By making these comparisons, we
are contextualizing the probability estimate, so we can better understand its
meaning in the setting of professional basketball.
It is also important to realize that this probability estimate is based on hundreds
of “trials” and does not rule out the possibility of “hot” or “cold” shooting
streaks. For instance, Giannis made 4 of his 11 free throw attempts in
Game 5 against the Phoenix Suns (.364) and was 17 of 19 (.895) in Game 6.
In those last two games, he was 21 of 30 (.70), just a wee bit higher than his
regular season average.
Figure 7.2: Simulated Results from Large and Small Coin Toss Samples
We can work through the same process for rolling a fair six-sided die, where
the probability of each of the six outcomes is 1/6=.167. The results below
summarize the results of rolling a six-sided die 100 times (on the left) and 2000
times (on the right), using a solid horizontal line to indicate the expected value
(.167). Each of the six outcomes should occur approximately the same number
of times. On the left side, based on just 100 rolls of the die, the outcomes
deviate quite a bit from the expected outcomes, ranging from .08 of outcomes
for the number 3, to .21 of outcomes for both numbers 2 and 4. This is not
what we expect from a fair, six-sided die. But if we increase the number of rolls
to 2000, we see much more consistency across each of the six numbers on the
die, and all of the proportions are very close to .167, ranging from .161 for the
number 1, to .175 for the number 4.
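Following the same logic as the coin-toss simulation (described in the footnote), here is a minimal sketch of how you might simulate the die rolls yourself; the seed and object names are illustrative, so your exact proportions will differ slightly:

#Define the six faces of the die
die<-c(1,2,3,4,5,6)
set.seed(123)
#Simulate 100 and 2000 rolls of a fair die
roll100<-sample(die, 100, rep=T)
roll2000<-sample(die, 2000, rep=T)
#Proportion of rolls landing on each face
table(roll100)/100
table(roll2000)/2000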
The coin toss and die rolling simulations are important demonstrations of the
Law of Large Numbers: If you conduct multiple trials or experiments of a
1 By “simulated,” I mean that I created an object named “coin” with two values, “Heads”
and “Tails” (coin <- c(“Heads”, “Tails”)) and told R to choose randomly from these two values
and store the results in a new object, “coin10”, where the “10” indicates the number of tosses
(coin10=sample(coin, 10, rep=T)). The frequencies of the outcomes for object “coin10” can
be used to show the number of “Heads” and “Tails” that resulted from the ten tosses. I then
used the same process to generate results from larger samples of tosses.
Figure 7.3: Simulated Results from Large and Small Samples of Die Rolls
random event, the average outcome approaches the theoretical (expected) outcome
as the number of trials increases and becomes large.
This idea is very important and comes into play again in Chapter 8.
For instance:
• We know from earlier that the probability of Giannis making any given
free throw during the 2020-2021 season was P(success) = (398/581) =
.685, based on observed outcomes from the season.
• From the past several semesters of teaching data analysis, I know that 109
of 352 students earned a grade in the B range, so the 𝑃 (𝐵) = 109/352 =
.31.
• Brittney Griner, the center for the Phoenix Mercury, made 248 of the 431
shots she took from the floor in the 2021 season, so the probability of her
making a shot during that season was 𝑃 (𝑆𝑤𝑖𝑠ℎ) = 248/431 = .575.
• Joe Biden won the presidential election in 2020 with 51.26% of the popular
vote, so the probability that a 2020 voter drawn at random voted for Biden
is .5126.
To arrive at these empirical probabilities, I used data from the real world and
observed the relative occurrence of different outcomes. These probabilities are
the relative frequencies expressed as proportions.
Now, we can get a crosstabulation of both vote choice and education level.
#Get a crosstab of vote by education, using a sample weight, no plot
crosstab(anes20$vote, anes20$educ,weight=anes20$V200010b, plot=F)
Each cell in the table represents the intersection of a given row and column. At the
bottom of each column and the end of each row, we find the row and column
totals, also known as the marginal frequencies. These totals are the frequencies
for the dependent and independent variables (note, however, that the sample
size is now restricted to the 5183 people who gave valid responses to both survey
questions).
Cell Contents
|-------------------------|
| Count |
|-------------------------|
======================================================================
anes20$educ
anes20$vote LT HS HS Some Coll 4yr degr Grad degr Total
----------------------------------------------------------------------
Biden 124 599 741 801 518 2783
----------------------------------------------------------------------
Trump 117 644 770 493 232 2256
----------------------------------------------------------------------
Other 10 20 47 47 20 144
----------------------------------------------------------------------
Total 251 1263 1558 1341 770 5183
======================================================================
From this table we can calculate a number of different probabilities. To calculate
the vote probabilities, we just need to divide the raw vote (row) totals by the
sample size:
P(Biden) = 2783/5183 = .5369
P(Trump) = 2256/5183 = .4353
P(Other) = 144/5183 = .0278
Note that even with the sample weights applied, these estimates are not exactly
equal to the population outcomes (.513 for Biden and .468 for Trump). Some
part of this is due to sampling error, which you will learn more about in the
next chapter.
We can use the column totals to calculate the probability that respondents have
different levels of education.
P(LT HS) = 251/5183 = .0484
P(HS) = 1263/5183 = .2437
P(Some Coll) = 1558/5183 = .3006
P(4yr degr) = 1341/5183 = .2587
P(Grad degr) = 770/5183 = .1486
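If you want R to do this arithmetic for you, here is a minimal sketch that rebuilds the joint frequency table by hand from the cell counts shown above (the object name votes is illustrative; it does not use the survey objects directly):

#Rebuild the joint frequency table from the raw cell counts
votes<-matrix(c(124, 599, 741, 801, 518,
                117, 644, 770, 493, 232,
                 10,  20,  47,  47,  20),
              nrow=3, byrow=TRUE,
              dimnames=list(vote=c("Biden","Trump","Other"),
                            educ=c("LT HS","HS","Some Coll","4yr degr","Grad degr")))
#Marginal probabilities for vote choice and for education
rowSums(votes)/sum(votes)
colSums(votes)/sum(votes)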
To find the probability of voting for Trump or having a high school education (the union of the two outcomes), we add the Trump row total (2256) to the HS column total (1263), subtract the number of respondents in the cell where the Trump row intersects the HS column (644), and divide the resulting number (2875) by the total number of respondents in the table (5183):

$$P(Trump \cup HS) = \frac{2256 + 1263 - 644}{5183} = .5547$$
$$P(Trump \mid HS) = \frac{P(Trump \cap HS)}{P(HS)}$$

$$P(Trump \mid HS) = \frac{.12425}{.24368} = .5099$$
A more intuitive way to get this result, if you are working with a joint frequency
distribution, is to divide the total number of respondents in the Trump/HS cell
by the total number of people in the HS column: 644/1263 = .5099. By limiting
the frequencies to the HS column, we are in effect calculating the probability of
respondents voting for Trump given that high school equivalence is their highest
level of education. Using the raw frequencies has the added benefit of skipping
the step of calculating the probabilities.
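Using the votes matrix from the earlier sketch, you can also have R compute all of the conditional probabilities at once; a minimal sketch:

#Divide each cell count by its column total: P(vote | education)
prop.table(votes, margin=2)
#P(Trump | HS) specifically
prop.table(votes, margin=2)["Trump", "HS"]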
Okay, so the probability of someone voting for Trump in 2020, given that their highest level of educational attainment was a high school degree, is .5099. So what? Well,
first, we note that this is higher than the overall probability of voting for Trump (P(Trump) = .4353), so we know that having a high school level of educational attainment is associated with a somewhat higher-than-average probability of voting for Trump.

7.5 The Normal Curve and Probability
• Since the total area under the curve equals 1.0, we know that the area
above a z-score of 1.5 is equal to 1 − .9332 = .0668
From this, we conclude that the probability of drawing a value greater than
+1.5 standard deviations is .0668.
Of course, you can also use the pnorm function to get the same result:
#Get area to the right of z=1.5
pnorm(1.5, lower.tail = F)
[1] 0.0668072
We can use the normal distribution to solve probability problems with real world
data. For instance, suppose you are a student thinking about enrolling in my
Political Data Analysis class and you want to know how likely it is that a student
chosen at random would get a grade of A- or higher for the course (at least 89%
of total points). Let's further suppose that you know the mean (75.15) and
standard deviation (17.44) from the previous several semesters. What you want
to know is, based on data from previous semesters, what is the probability of
any given student getting an overall score of 89 or better?
In order to solve this problem, we need to make the assumption that course
grades are normally distributed, convert the target raw score (89) into a z-score,
and then calculate the probability of getting that score or higher, based on the
area under the curve to the right of that z-score. Recall that the formula for a
z-score is:
$$Z_i = \frac{x_i - \bar{x}}{S}$$
In this case 𝑥𝑖 is 89, the score you need to earn in order to get at least an A-,
so:
$$Z_{89} = \frac{89 - 75.15}{17.44} = \frac{13.85}{17.44} = .7942$$
We know that 𝑍89 = .7942, so now we need to calculate the area to the right of
.7942 standard deviations on a normally distributed variable:
#Get area to the right of z=.7942
pnorm(.7942, lower.tail = F)
[1] 0.2135395
The probability of a student chosen at random getting a grade of A- or higher
is about .2135. You could also interpret this as meaning that the expectation is
that about 21.35% of students will get a grade of A- or higher.
It is important to recognize that this probability estimate is based on assump-
tions we make about a theoretical distribution, even though we are interested
in estimating probabilities for an empirical variable. As discussed in Chapter 6,
the estimates derived from a normal distribution are usually in the ballpark of
what you find in empirical distributions, provided that the empirical distribu-
tion is not too oddly shaped. Since we have the historical data on grades, we
can get a better sense of how well our theoretical estimate matches the over-
all observed probability of getting at least an A-, based on past experiences
(empirical probability).
#Let's convert the numeric scores to letter grades (with rounding)
grades$A_Minus<-ordered(cut2(grades$grade_pct, c(88.5)))
#assign levels
levels(grades$A_Minus)<-c("LT A-", "A-/A")
#show the frequency distribution for the letter grade variable.
freq(grades$A_Minus, plot=F)
grades$A_Minus
Frequency Percent Cum Percent
LT A- 276 78.41 78.41
A-/A 76 21.59 100.00
Total 352 100.00
Not Bad! Using real-world data, and collapsing the point totals into two bins,
we estimate that the probability of a student earning a grade of A- or higher is
.216. This is very close to the estimate based on assuming a theoretical normal
distribution (.214).
There is one important caveat: this bit of analysis disregards the fact that the
probability of getting an A- is affected by a number of factors, such as effort
and aptitude for this kind of work. If we have no other information about
any given student, our best guess is that they have about a .22 probability of
getting an A- or higher. Ultimately, it would be better to think about this
problem in terms of conditional probabilities, if we had information on other
relevant variables. What sorts of things do you think influence the probability
of getting an A- or higher in this (or any other) course? Do you imagine that
the probability of getting a grade in the A-range might be affected by how much
time students are able to put into the course? Maybe those who do all
of the reading and spend more time on homework have a higher probability of
getting a grade in the A-range. In other words, to go back to the language of
conditional probabilities, we might want to say that the probability of getting
a grade in the A-range is conditioned by student characteristics.
In the upcoming chapters, we turn to sampling, statistical inference, and testing hypotheses about the differences in mean outcomes between two groups.
The upcoming material is a bit more abstract than what you have read so far,
and I think you will find the shift in focus interesting. My sense from teaching
these topics for several years is that this is also going to be new material for
most of you. Although this might make you nervous, I encourage you to look
at it as an opportunity!
7.7 Exercises
7.7.1 Concepts and Calculations
1. Use the table below, showing the joint frequency distribution for attitudes
toward the amount of attention given to sexual harassment (a recoded
version of anes20$V202384) and vote choice for this problem.
Cell Contents
|-------------------------|
| Count |
|-------------------------|
=============================================================
anes20$harass
anes20$vote Not Far Enough About Right Too Far Total
-------------------------------------------------------------
Biden 1385 1108 293 2786
-------------------------------------------------------------
Trump 402 984 869 2255
-------------------------------------------------------------
Other 61 51 33 145
-------------------------------------------------------------
Total 1848 2143 1195 5186
=============================================================
• Estimate the following probabilities:
P(Too Far)
P(About Right)
P(Not Far Enough)
P(Trump ∩ Not Far Enough)
P(Trump ∪ Not Far Enough)
P(Trump | Too Far)
P(Trump | About Right)
P(Trump | Not Far Enough)
• Using your estimates of the conditional probabilities, summarize how the
probability of voting for President Trump was related to how people felt
about the amount of attention given to sexual harassment.
2. I flipped a coin 10 times and it came up Heads only twice. I say to my
friend that the coin seems biased toward Tails. They say that I need to
flip it a lot more times before I can be confident that there is something
wrong with the coin. I flip the coin 1000 times, and it comes up Heads 510
times. Was my friend right? What principle is involved here? In other
words, how do you explain the difference between 2/10 on my first set of
flips and 510/1000 on my second set?
3. In an analysis of joint frequency distribution for vote choice and whether
people support or oppose banning assault-style rifles in the 2020 election, I
find that 𝑃 (𝐵𝑖𝑑𝑒𝑛) = .536 and 𝑃 (𝑂𝑝𝑝𝑜𝑠𝑒) = .305. However, when I apply
the multiplication rule ((𝑃 (𝐵𝑖𝑑𝑒𝑛)∗𝑃 (𝑂𝑝𝑝𝑜𝑠𝑒)) to find 𝑃 (𝐵𝑖𝑑𝑒𝑛∩𝑂𝑝𝑝𝑜𝑠𝑒)
I get .1638, while the correct answer is .087. What did I do wrong? Why
didn’t the multiplication rule work?
4. Identify each of the following as a theoretical or empirical probability.
• The probability of drawing a red card from a deck of 52 playing cards.
• The probability of being a victim of violent crime.
• The probability that 03 39 44 54 62 19 is the winning Powerball
combination.
• The probability of being hospitalized if you test positive for COVID-
19.
• The probability that Sophia Smith, of the Portland Thorns FC, will
score a goal in any given game in which she plays.
7.7.2 R Problems
1. Use the code below to create a new object in R called “coin” and assign
two different outcomes to the object, “Heads” and “Tails”. Double check
to make sure you’ve got this right.
coin <- c("Heads", "Tails")
coin
coin10
Heads Tails
4 6
3. Now, repeat the R commands in Question 2 nine more times, recording
the number of heads and tails you get from each simulation. Sum up the
number of “Heads” outcomes from the ten simulations. If the probability
of getting “Heads” on any given toss is .50, then you should have approx-
imately 50 “Heads” outcomes. Discuss how well your results match the
expected 50/50 outcome. Also, comment on the range of outcomes–some
close to 50/50, some not at all close–you got across the ten simulations.
Chapter 8

Sampling and Inference
Table 8.1. Symbols and Formulas for Sample Statistics and Population Parameters.

Measure | Sample Statistic | Sample Formula | Population Parameter | Population Formula
--- | --- | --- | --- | ---
Mean | $\bar{x}$ | $\frac{\sum_{i=1}^{n} x_i}{n}$ | $\mu$ | $\frac{\sum_{i=1}^{N} x_i}{N}$
Variance | $S^2$ | $\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$ | $\sigma^2$ | $\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}$
Standard Deviation | $S$ | $\sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$ | $\sigma$ | $\sqrt{\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}}$
sions of dividing by 𝑛 − 1, some formal and hard to follow if you have my level of math
expertise, and some a bit more accessible: https://stats.stackexchange.com/questions/
3931/intuitive-explanation-for-dividing-by-n-1-when-calculating-standard-deviation ,
https://willemsleegers.com/content/posts/4-why-divide-by-n-1/why-divide-by-n-1.html
, and https://youtu.be/9Z72nf6N938
every unit in the population has an equal chance of being selected. If this is the
case, then the sample should “look” a lot like the population.
Imagine that you have 6000 colored ping pong balls (2000 yellow, 2000 red, and
2000 green), and you toss them all into a big bucket and give the bucket a good
shake, so the yellow, red, and green balls are randomly dispersed in the bucket.
Now, suppose you reach into the bucket and randomly pull out a sample of 600
ping pong balls. How many yellow, red, and green balls would you expect to
find in your 600-ball sample? Would you expect to get exactly 200 yellow, 200
red, and 200 green balls, perfectly representing the color distribution in the full
6000-ball population? Odds are, it won’t work out that way. It is highly unlikely
that you will end up with the exact same color distribution in the sample as
in the population. However, if the balls are randomly selected, and there is no
inherent bias (e.g., one color doesn’t weigh more than the others, or something
like that) you should end up with close to one-third yellow, one-third red, and
one-third green. This is the idea behind sampling error: samples statistics,
by their very nature, will differ population from parameters; but given certain
characteristics (large, random samples), they should be close approximations of
the population parameters. Because of this, when you see reports of statistical
findings, such as the results of a public opinion poll, they are often reported
along with a caveat something like “a margin of error of plus or minus two
percentage points.” This is exactly where this chapter is eventually headed,
measuring and taking into account the amount of sampling error when making
inferences about the population.
Let’s look at how this works in the context of some real-world political data.
For the rest of this chapter, we will use county-level election returns from the
2020 presidential election to illustrate some principles of sampling and inference.
When you begin working with a new data set, it is best to take a look at some
of its attributes so you have a better sense of what you are working with. So,
let’s take a look at the dimensions and variable names for this data set.
#Size of 'county20' data set
dim(county20)
[1] 3152 10
names(county20)
#Add a legend
legend("topright", legend=c("Mean","Median"), lty=1:2)
summary(county20$d2pty20)

#Get the skewness statistic (DescTools)
Skew(county20$d2pty20)

[1] 0.8091358
As you can see, this variable is somewhat skewed to the right, with the mean
(34.04) a bit larger than the median (30.64), producing a skewness value of .81.
Just to be clear, the histogram and statistics reported above are based on the
population of counties, not a sample of counties. So, 𝜇 = 34.04.3
2 Presidential votes cast in Alaska are not tallied at the county level (officially, Alaska does
not use counties), so this data set includes the forty electoral districts that Alaska uses.
3 You might look at this distribution and wonder how the average Biden share of the two-
party vote could be roughly 34% when he captured 52% of the national two party vote. The
answer, of course, is that Biden won where more people live while Trump tended to win in
counties with many fewer voters. In fact, just over 94 million votes were cast in counties won
[Figure: Histogram of the Democratic percent of the two-party vote across counties (y-axis: Number of Counties), with lines marking the mean and the median.]
To get a better idea of what we mean by sampling error, let’s take a sample of
500 counties, randomly drawn from the population of counties, and calculate
the mean Democratic percent of the two-party vote in those counties. The
R code below shows how you can do this. One thing I want to draw your
attention to here is the set.seed() function. This tells R where to start the
random selection process. It is not absolutely necessary to use this, but doing
so provides reproducible results. This is important in this instance because,
without it, the code would produce slightly different results every time it is run
(and I would have to rewrite the text reporting the results over and over).
#Tell R where to start the process of making random selections
set.seed(250)
#draw a sample of 500 counties, using "d2pty20";
#store in "d2pty_500"
d2pty_500<-sample(county20$d2pty20, 500)
#Get stats on Democratic % of two party vote from sample
summary(d2pty_500)
Let’s see:
#Different "set.seed" here because I want to show that the
#results from another sample are different.
#Draw a different sample of 500 counties
set.seed(200)
d2pty_500b<-sample(county20$d2pty20, 500)
summary(d2pty_500b)
[Figure: Density plot of the sample means (roughly 30 to 38), with the population mean, Mu = 34.04, marked.]
This makes sense, right? If we take multiple large, random samples from a pop-
ulation, some of our sample statistics should be a bit higher than the population
parameter, and some will be a bit lower. And, of course, a few will be a lot
higher and a few a lot lower than the population parameter, but most of them
will be clustered near the mean (this is what gives the distribution its bell
shape). By its very nature, sampling produces statistics that are different from
the population values, but the sample statistics should be relatively close to the
population values.
Here’s an important thing to keep in mind: The shape of the sampling distri-
bution generally does not depend on the distribution of the empirical variable.
In other words, the variable being measured does not have to be normally dis-
tributed in order for the sampling distribution to be normally distributed. This
is good, since very few empirical variables follow a normal distribution! Note,
however, that in cases where there are extreme outliers, this rule may not hold.
Suppose we take repeated samples of 50 counties and look at the distribution of means from those samples. The resulting distribution should start
to look like the distribution presented in Figure 8.2, especially as the number
of samples increases.
Here's how you read this output. Each of the values represents a single mean
drawn from a sample of 50 counties. The first sample drawn had a mean of 31.24,
the second sample 34.73, and so on, with the 50th sample having a mean of 35.4.
Most of the sample means are fairly close to the population value (34.04), and
a few are more distant. Looking at the summary statistics (below), we see that,
on average, these fifty sample means balance out to an overall mean of 34.3,
which is very close to the population value, and the distribution has relatively
little skewness (.21).
summary(sample_means50)
Skew(sample_means50)
[1] 0.2099118
Figure 8.3 uses a density plot (solid line) of this sampling distribution, displayed
alongside a normal distribution (dashed line), to get a sense of how closely the
distribution fits the contours of a normal curve. As you can see, with just 50
relatively small samples, the sampling distribution is beginning to take on the
shape of a normal distribution.
[Figure 8.3: Density plot of the sampling distribution (solid line) compared to a normal distribution (dashed line); x-axis: Sample Means.]
So, let’s see what happens when we create another sampling distribution but
increase the number of samples to 500. In theory, this sampling distribution
should resemble a normal distribution more closely. In the summary statistics,
we see that the mean of this sampling distribution (34.0) is slightly closer to the
population value (34.04) than in the previous example, and there is virtually
no skewness. Further, if you examine the density plot shown in Figure 8.4, you
will see that the sampling distribution of 500 samples of 50 counties follows the
contours of a normal distribution more closely, as should be the case. If we
increased the number of samples to 1000 or 2000, we would expect that the
sampling distributions would grow even more similar to a normal distribution.
#Gather sample means from 500 samples of 50 counties
set.seed(251)
sample_means500 <- rep(NA, 500)
for(i in 1:500){
samp <- sample(county20$d2pty20, 50)
sample_means500[i] <- mean((samp), na.rm=T)
}
summary(sample_means500)

Skew(sample_means500)

[1] 0.02819022
[Figure 8.4: Density plot of the sampling distribution from 500 samples (solid line) compared to a normal distribution (dashed line); x-axis: Sample Means.]
The standard error of the mean tells us how much sample means vary around the population mean:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$

In practice, we rarely know the population standard deviation (σ), so we estimate the standard error using the sample standard deviation:

$$S_{\bar{x}} = \frac{S}{\sqrt{n}}$$
This formula is saying that the standard error of the mean is equal to the sample
standard deviation divided by the square root of the sample size. So let’s go
ahead and calculate the standard error for a sample of just 100 counties (this makes the whole $\sqrt{N}$ business a lot easier). The observations for this sample are stored in a new object, d2pty100. First, just a few descriptive statistics, presented below. Of particular note here is that the mean from this sample (32.71) is again pretty close to μ, 34.04 (Isn't it nice how this works out?).
set.seed(251)
#draw a sample of d2pty20 from 100 counties
d2pty100<-sample(county20$d2pty20, 100)
mean(d2pty100)
[1] 32.7098
sd(d2pty100)
[1] 17.11448
We can use the sample standard deviation (17.11) to calculate the standard
error of the sampling distribution (based on the characteristics of this single
sample of 100 counties):
se100=17.11/10
se100
[1] 1.711
Of course, we can also get this more easily:
#This function is in the "DescTools" package.
MeanSE(d2pty100)
[1] 1.711448
The mean from this sample is 32.71 and the standard error is 1.711. For right
now, treating this as our only sample, 32.71 is our best guess for the population
value. We can refer to this as the point estimate. But we know that this
is probably not the population value because we would get different sample
means if we took additional samples, and they can't all be equal to the popula-
tion value; but we also know that most sample means, including ours, are going
to be fairly close to 𝜇.
Now, suppose we want to use this sample information to create a range of values
that we are pretty confident includes the population parameter. We know that
the sampling distribution is normally distributed and that the standard error
is 1.711, so we can be confident that 68% of all sample means are within one
standard error of the population value. Using our sample mean as the estimate
of the population value (it’s the best guess we have), we can calculate a 68%
confidence interval:
$$c.i._{.68} = \bar{x} \pm z_{.68} \times S_{\bar{x}}$$

Here we are saying that a 68% confidence interval ranges from x̄ minus to x̄ plus the value of z that gives us 68% of the area under the curve around the mean, times the standard error of the mean. In this case, the multiplication is easy because the critical value of z (the z-score for an area above and below the mean of about .68) is 1, so:
#Estimate lower limit of confidence interval
LL.68=32.71-1.711
LL.68

[1] 30.999
#Estimate upper limit of confidence interval
UL.68=32.71+1.711
UL.68
[1] 34.421
30.999 ≤ 𝜇 ≤ 34.421
The lower limit of the confidence interval (LL) is about 31 and the upper limit
(UL) is 34.42, a narrow range of just 3.42 that does happen to include the
population value, 34.04.
You can also use the MeanCI function to get a confidence interval around a
sample mean:
[Figure: Sample estimates from 50 repeated samples; y-axis: Sample Estimates (roughly 28 to 38), x-axis: Sample.]
#Get the critical value of z for a 95% confidence interval
qnorm(.975)

[1] 1.959964
The critical value for 𝑧.95 = 1.96. So, now we can substitute this into the
equation we used earlier for the 68% confidence interval to obtain the 95%
confidence interval:
#Estimate lower limit of confidence interval
LL.95=32.71-3.35
LL.95

[1] 29.36
#Estimate upper limit of confidence interval
UL.95=32.71+3.35
UL.95
[1] 36.06
29.36 ≤ 𝜇 ≤ 36.06
Now we can say that we are 95% confident that the population value for the
mean Democratic share of the two-party vote across counties is between the
lower limit of 29.36 and the upper limit of 36.06. Technically, what we should
say is that 95% of all confidence intervals based on z=1.96 include the value for
𝜇, so the probability that 𝜇 is in this confidence interval is .95.
Note that this interval is wider (almost 6.7 points) than the 68% interval (about
3.4 points), because we are demanding a higher level of confidence. So, suppose
we want to narrow the width of the interval but we do not want to sacrifice the
level of confidence. What can we do about this? The answer lies in the formula
for the standard error:
$$S_{\bar{x}} = \frac{S}{\sqrt{n}}$$
We only have one thing in this formula that we can manipulate, the sample
size. We can’t really change the standard deviation since it is a function of the
population standard deviation. If we took another sample, we would get a very
similar standard deviation, something around 17.11. However, we might be able
to affect the sample size, and as the sample size increases, the standard error of
the mean decreases.
Let’s look at this for a new sample of 500 counties.
set.seed(251)
#draw a sample of d2pty20 from 500 counties
d2pty500<-sample(county20$d2pty20, 500)
mean(d2pty500)
[1] 33.24397
sd(d2pty500)
[1] 15.8867
MeanSE(d2pty500)
[1] 0.7104747
Here, you can see that the mean (33.24) and standard deviation (15.89) are
fairly close in value to those obtained from the smaller sample of 100 counties
(32.71 and 17.11), but the standard error of the mean is much smaller (.71
compared to 1.71). This difference in standard error, produced by the larger
sample size, results in a much narrower confidence interval, even though the
level of confidence (95%) is the same:
#Estimate lower limit of confidence interval (N=500)
LL.95=33.24-1.39
LL.95

[1] 31.85

#Estimate upper limit of confidence interval (N=500)
UL.95=33.24+1.39
UL.95
[1] 34.63
The width of the confidence interval is only 2.78 points, compared to 6.7 points
for the 100-county sample. Let’s take a closer look at how the width of the
confidence interval responds to sample size, using data from the current example:
[Figure 8.6: Width of the 95% confidence interval (y-axis, roughly 1 to 6 points) by sample size (x-axis).]
As you can see, there are diminishing returns in error reduction with increases
in sample size. Moving from small samples of 100 or so to larger samples of 500
or so results in a steep drop in the width of the confidence interval; moving from
500 to 1000 results in a smaller reduction in width; and moving from 1000 to
2000 results in an even smaller reduction in the width of the confidence interval.
This pattern has important implications for real-world research. Depending on
how a researcher is collecting their data, they may be limited by the very real
costs associated with increasing the size of a sample. If conducting a public
opinion poll, or recruiting experimental participants, for instance, each addi-
tional respondent costs money, and spending money on sample sizes inevitably
means taking money away from some other part of the research enterprise. The
takeaway point in Figure 8.6 is that money spent on increasing the sample size
from a very small sample (say, 100) to somewhere around 500 is money well
spent. If resources are not a constraint, increasing the sample size beyond that
point does have some payoff in terms of error reduction, but the returns on
money spent diminish substantially for increases in sample size beyond about
1000.4
8.6 Proportions
Everything we have just seen regarding the distribution of sample means also
applies to the distribution of sample proportions. It should, since a proportion
is just the mean of a dichotomous variable scored 0 and 1. For example, with the
same data used above, we can focus on a dichotomous variable that indicates
whether Biden won (1) or lost (0) in each county. The mean of this variable
across all counties is the proportion of counties won by Biden:
#Create dichotomous indicator for counties won by Biden
demwin<-as.numeric(county20$d2pty20 >50)
table(demwin)
demwin
0 1
2595 557
mean(demwin, na.rm=T)
[1] 0.1767132
Biden won 557 counties and lost 2595, for a winning proportion of .1767. This
is the population value (𝑃 ).
Again, we can take samples from this population and none of the proportions
calculated from them may match the value of 𝑃 exactly, but most of them
should be fairly close in value and the mean of the sample proportions should
equal the population value over infinite sampling.
Let’s check this out for 500 samples of 50 counties each, stored in a new object,
sample_prop500.
set.seed(251)
#Create an object with space to store 500 sample proportions
sample_prop500 <- rep(NA, 500)
#Run through the data 500 times, getting a 50-county sample of 'demwin'
#each time. Store the mean of each sample in 'sample_prop500'
for(i in 1:500){
samp <- sample(demwin, 50)
sample_prop500[i] <- mean(samp)
}
4 Of course there are other reasons to favor large samples, such as providing larger sub-
samples for relatively small groups within the population, or in anticipation of missing infor-
mation on some of the things being measured.
[1] 0.1670031
Here we see that the mean of the sampling distribution (.174) is, as expected,
very close to the population proportion (.1767), and the distribution has very
little skew. The density plots in Figure 8.7 show that the shape of the sampling
distribution mimics the shape for the normal distribution fairly closely. This
is all very similar to what we saw with the earlier analysis using the mean
Democratic share of the two-party vote.
[Figure 8.7: Density of the sampling distribution of sample proportions (solid line) compared to a normal distribution (dashed line); x-axis: Sample Proportions.]
Everything we learned about confidence intervals around the mean also applies
to sample estimates of the proportion. A 95% confidence interval for a sample
proportion is:
$$c.i._{.95} = p \pm z_{.95} \times S_p$$
Where the standard error of the proportion is calculated the same as the stan-
dard error of the mean–the standard deviation divided by the square root of the
sample size–except in this case the standard deviation is calculated differently:
8.6. PROPORTIONS 213
$$S_p = \frac{\sqrt{p(1-p)}}{\sqrt{N}} = \sqrt{\frac{p(1-p)}{N}}$$
Let’s turn our attention to estimating a confidence interval for the proportion of
counties won by Biden from a single sample with 500 observations (demwin500).
set.seed(251)
#Sample 500 counties for demwin
demwin500<-sample(demwin, 500)
mean(demwin500)
[1] 0.154
For the sample of 500 counties taken above, the mean is .154, which is quite
a bit lower than the known population value of .1767. To calculate a 95%
confidence interval, we need to estimate the standard error:
seprop500=sqrt((.154*(1-.154)/500))
seprop500
[1] 0.01614212
We can now plug the standard error of the proportion into the confidence interval:

#Estimate lower limit of confidence interval
LL.95=.154-.0315
LL.95

[1] 0.1225
#Estimate upper limit of confidence interval
UL.95=.154+.0315
UL.95
[1] 0.1855
.123 ≤ 𝑃 ≤ .186
The confidence interval is .063 points wide, meaning that we are 95% confident that the population value for this variable is between .123 and .186. If you want to put this in terms of percentages, we are 95% certain that Biden won between 12.3% and 18.6% of counties. We know that Biden actually won 17.7% of all counties, so, as expected, this confidence interval from our sample of 500 counties includes the population value.
The “±” part of the confidence interval might sound familiar to you from media
reports of polling results. This figure is sometimes referred to as the margin of
error. When you hear the results of a public opinion poll reported on television, the news reader usually adds language like “plus or minus 3.3 percentage points.” They are referring to the confidence interval (usually a 95% confidence interval), except they tend to report percentage points rather than proportions.
So, for instance, in the example shown below, the Fox News poll taken from
September 12-September 15, 2021, with a sample of 1002 respondents, reports
President Biden’s approval rating at 50%, with a margin of error of ±3.0.
Figure 8.8: Margin of Error and Confidence Intervals in Media Reports of Polling
Data
If you do the math (go ahead, give it a shot!), based on 𝑝 = .50 and 𝑛 = 1002,
this ±3.0 corresponds with the upper and lower limits of a 95% confidence
interval ranging from .47 to .53. So we can say with a 95% level of confidence
that at the time of the poll, President Biden’s approval rating was between .47
(47%) and .53 (53%), according to this Fox News Poll.
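If you want to check this in R, here is a minimal sketch of the arithmetic (none of these object names come from the text):

#Margin of error for p = .50 and n = 1002, at the 95% level
p<-.50
n<-1002
se<-sqrt(p*(1-p)/n)
moe<-1.96*se
#Lower and upper limits of the 95% confidence interval
c(p-moe, p+moe)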
captures the spirit of it very well: we use sample data to test ideas about the
values of population parameters. On to hypothesis testing!
8.8 Exercises
8.8.1 Concepts and Calculations
1. A group of students on a college campus are interested in how much stu-
dents spend on books and supplies in a typical semester. They interview
a random sample of 300 students and find that the average semester ex-
penditure is $350 and the standard deviation is $78.
a. Are the results reported above from an empirical distribution or a
sampling distribution? Explain your answer.
b. Calculate the standard error of the mean.
c. Construct and interpret a 95% confidence interval around the mean
amount of money students spend on books and supplies per semester.
2. In the same survey used in question 1, students were asked if they were
satisfied or dissatisfied with the university’s response to the COVID-19
pandemic. Among the 300 students, 55% reported being satisfied. The
administration hailed this finding as evidence that a majority of students
support the course they’ve taken in reaction to the pandemic. What do
you think of this claim? Of course, as a bright college student in the midst
of learning about political data analysis, you know that 55% is just a point
estimate and you really need to construct a 95% confidence interval around
this sample estimate before concluding that more than half the students
approve of the administration’s actions. So, let’s get to it. (Hint: this is
a “proportion” problem)
a. Calculate the standard error of the proportion. What does this rep-
resent?
b. Construct and interpret a 95% confidence interval around the re-
ported proportion of students who are satisfied with the administra-
tion’s actions.
c. Is the administration right in their claim that a majority of students
support their actions related to the pandemic?
3. One of the data examples used in Chapter Four combined five survey
questions on LGBTQ rights into a single index ranging from 0 to 6. The
resulting index has a mean of 3.4, a standard deviation of 1.56, and a
sample size of 7,816.
• Calculate the standard error of the mean and a 95% confidence
interval for this variable.
• What is the standard error of the mean if you assume that the
sample size is 1,000?
• What is the standard error of the mean if you assume that the
sample size is 300?
8.8.2 R Problems
For these problems, you should load the county20large data set to
analyze the distribution of internet access across U.S. counties, using
county20large$internet. This variable measures the percent of households
with broadband access from 2015 to 2019.
1. Describe the distribution of county20large$internet, using a histogram
and the mean, median, and skewness statistics. Note that since these 3142
counties represent the population of counties, the mean of this variable is
the population value (𝜇).
2. Use the code provided below to create a new object named web250
that represents a sample of internet access in 250 counties, drawn from
county20large$internet.
set.seed(251)
web250<-sample(county20large$internet, 250)
Chapter 9
Hypothesis Testing
From this sample of 100 employees, after one year of the new policy in place we
estimate that there is a 95% chance that 𝜇 is between 51.78 and 57.81, and the
probability that 𝜇 is outside this range is less than .05. Based on this alone we
can say there is less than a 5% chance that the number of hours of sick leave
taken is the same that it was in the previous year. In other words, there is a
fairly high probability that fewer sick leave hours were used in the year after
that policy change than in the previous year.
𝐻0 ∶ 𝜇 = 59.2
Note that what this is saying is that there is no real difference between last year's mean number of sick days (𝜇) and the mean of the sample we've drawn from this year (𝑥̄).
Even though the sample mean looks different from 59.2, the true population
mean is 59.2, and the sample statistic is just a result of random sampling error.
After all, even if the population mean is equal to 59.2, almost any sample drawn from that population will produce a mean that differs somewhat from 59.2, due to sampling error. In other words, H0 is saying that the new policy had no effect, even though the sample mean suggests otherwise.
Because the county analyst is interested in whether the new policy reduced the
use of sick leave hours, the alternative hypothesis is:
𝐻1 ∶ 𝜇 < 59.2
Here, we are saying that the sample statistic is different enough from the hy-
pothesized population value (59.2) that it is unlikely to be the result of random
chance, and the population value is less than 59.2.
Note here that we are not testing whether the number of sick days is equal to
54.8 (the sample mean). Instead, we are testing whether the average hours of
sick leave taken this year is lower than the average number of sick days taken
last year. The alternative hypothesis reflects what we really think is happening;
it is what we’re really interested in. However, we cannot test the alternative
hypotheses directly. Instead, we test the null hypothesis as a way of gathering
evidence to support the alternative.
So, the question we need to answer to test the null hypothesis is, how likely is it
that a sample mean of this magnitude (54.8) could be drawn from a population
in which 𝜇= 59.2? We know that we would get lots of different mean outcomes
if we took repeated samples from this population. We also know that most of
them would be clustered near 𝜇 and a few would be relatively far away from 𝜇
at both ends of the distribution. All we have to do is estimate the probability of getting a sample mean of 54.8 from a population in which 𝜇 = 59.2. If the probability of drawing 𝑥̄ from 𝜇 is small enough, then we can reject H0.
How do we assess this probability? By using what we know about sampling dis-
tributions. Check out the figure below, which illustrates the logic of hypothesis
testing using a theoretical distribution:
Critical Values. A common and fairly quick way to use the z-score in hy-
pothesis testing is by comparing it to the critical value (c.v.) for z. The c.v.
is the z-score associated with the probability level required to reject the null
hypothesis. To determine the critical value of z, we need to determine what the
probability threshold is for rejecting the null hypothesis. In the social sciences it is fairly standard to consider any probability level lower than .05 sufficient for rejecting the null hypothesis. This probability level is also known as the significance level.
Typically, the critical value is the z-score that gives us .05 as the area on the
tail (left in this case) of the normal distribution. Looking at the z-score table
from Chapter 6, or using the qnorm function in R, we see that this is z = -1.645.
The area beyond the critical value is referred to as the critical region, and is sometimes also called the area of rejection: if the z-score falls in this region, the null hypothesis is rejected.
#Get the z-score for .05 area at the lower tail of the distribution
qnorm(.05, lower.tail = T)
[1] -1.644854
Once we have the 𝑐.𝑣., we can calculate the z-score for the difference between 𝑥̄
and 𝜇. The z-score will be positive if 𝑥̄ − 𝜇 > 0 and negative if 𝑥̄ − 𝜇 < 0. If
|𝑧| > |𝑧𝑐𝑣 |, then we reject the null hypothesis:
So let’s get back to the sick leave example.
• First, what’s the critical value? -1.65 (make sure you understand why this
is the value)
• What is the obtained value of z? Based on the sample statistics, z = -2.86, and the area to the left of that z-score (shown below) is:
[1] 0.002118205
This alpha area (or p-value) is close to zero, meaning that there is little chance
that the null hypothesis is true. Check out Figure 9.2 as an illustration of how
unlikely it is to get a sample mean of 54.8 (thin solid line) from a population
in which 𝜇 = 59.2, (thick solid line) based on our sample statistics. Remember,
the area to the left of the critical value (dashed line) is the critical region, equal
to .05 of the area under the curve, and the sample mean is far to the left of this
point.
One useful way to think about this p-value is that if we took 1000 samples of
100 workers from a population in which 𝜇 = 59.2 and calculated the mean hours
of sick leave taken for each sample, only two samples would give you a result
equal to or less than 54.8 simply due to sampling error. In other words, there
is a 2/1000 chance that the sample mean was the result of random variation
instead of representing a real difference from the hypothesized value.
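As a quick illustration of how these numbers fit together, here is a sketch in R. The sample standard deviation is not reported in this excerpt, so the value of 15.4 used below is an assumption, chosen because it is consistent with the z-score of -2.86 implied above:

#Illustrative sketch only: 's' is an assumed sample standard deviation
xbar <- 54.8
mu <- 59.2
n <- 100
s <- 15.4
se <- s/sqrt(n)        #estimated standard error of the mean
z <- (xbar - mu)/se    #approximately -2.86
pnorm(z)               #one-tailed p-value, approximately .002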
[Figure 9.2: Sampling distribution under the null hypothesis, showing the population mean (Mu), the critical value (C.V.), and the sample mean]
If we were not sure whether the policy would increase or decrease sick leave (relative to 59.2), we might want to test a two-tailed hypothesis: that the new policy could create a difference in sick day use, maybe positive, maybe negative.
𝐻1 ∶ 𝜇 ≠ 59.2
The process for testing two-tailed hypotheses is exactly the same, except that we
use a larger critical value because even though the 𝛼 area is the same (.05), we
must now split it between two tails of the distribution. Again, this is because
we are not sure if the policy will increase or decrease sick leave. When the
alternative hypothesis does not specify a direction, we use the two-tailed test.
[Figure: one-tailed (z = -1.65) and two-tailed (z = ±1.96) critical values on the standard normal distribution]
The figure below illustrates the difference in critical values for one- and two-
tailed hypothesis tests. Since we are splitting .05 between the two tails, the c.v.
for a two-tailed test is now the z-score that gives us .025 as the area beyond z
at the tails of the distribution. Using the qnorm function in R (below), we see
that this is z= -1.96, which we take as ±1.96 for a two-tailed test critical value
(p=.05).
#Z-score for .025 area at one tail of the distribution
qnorm(.025)
[1] -1.959964
If we obtain a z-score (positive or negative) that is larger in absolute magnitude
than this, we reject H0 . Using a two-tailed test requires a larger z-score, making
it slightly harder to reject the null hypothesis. However, since the z-score in the
sick leave example was -2.86, we would still reject H0 under a two-tailed test.
In truth, the choice between a one- or two-tailed test rarely makes a difference
in rejecting or failing to reject the null hypothesis. The choice matters most
when the p-value from a one-tailed test is greater than .025, in which case it
would be greater than .05 in a two-tailed test. It is worth scrutinizing findings
from one-tailed tests that are just barely statistically significant to see if a two-
tailed test would be more appropriate. Because the two-tailed test provides
a more conservative basis for rejecting the null hypothesis, researchers often
choose to report two-tailed significance levels even when a one-tailed test could
be justified. Many statistical programs, including R, report two-tailed p-values
by default.
9.3 T-Distribution
Thus far, we have focused on using z-scores and the z-distribution for testing
hypotheses and constructing confidence intervals. Another distribution available
to us is the t-distribution. The t-distribution has an important advantage over
the z-distribution: it does not assume that we know the population standard
error. This is very important because we rarely know the population standard
error. In other words, the t-distribution assumes that we are using an estimate
of the standard error. As shown in Chapter 8, the estimate of the standard
error of the mean is:
$$S_{\bar{x}} = \frac{S}{\sqrt{N}}$$
𝑆𝑥̄ is our best guess for 𝜎𝑥̄ , but it is based on a sample statistic, so it does
involve some level of error.
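As a minimal sketch (using simulated data rather than any of the book's data sets), the estimated standard error of the mean is just the sample standard deviation divided by the square root of the sample size:

#Simulated data, for illustration only
set.seed(123)
x <- rnorm(100, mean = 59, sd = 15)
sd(x)/sqrt(length(x))   #estimated standard error of the mean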
In recognition of the fact that we are estimating the standard error with sample
data rather than the population, the t-distribution is somewhat flatter (see Fig-
ure 9.4 below) than the z-distribution. Comparing the two distributions, you
can see that they are both perfectly symmetric but that the t-distribution is a
bit more squat and has slightly fatter tails. This means that the critical value
for a given level of significance will be larger in magnitude for a t-score than for
a z-score. This difference is especially noticeable for small samples and virtu-
ally disappears for samples greater than 100, at which point the t-distribution
becomes almost indistinguishable from the z-distribution (see Figure 9.5).
Now, here’s the fun part—the t-score is calculated the same way as the z-score.
We do nothing different than what we did to calculate the z-score.
$$t = \frac{\bar{x} - \mu}{S_{\bar{x}}}$$
We use the t-score and the t-distribution in the same way and for the same
purposes that we use the z-score.
1. Choose a p-value for the 𝛼 associated with the desired level of statistical significance for rejecting H0 (usually .05).
[Figure 9.4: The t-distribution compared to the normal distribution]
𝑑𝑓 = 𝑛 − 1
𝑑𝑓 = 100 − 1 = 99
You can see the impact of sample size (through degrees of freedom) on the shape of the t-distribution in Figure 9.5: as sample size and degrees of freedom increase, the t-distribution grows more and more similar to the normal distribution.
[Figure 9.5: t-distributions with df = 3, 5, and 7 compared to the normal distribution]
Table 9.1. T-score Critical Values at Different P-values and Degrees of Freedom
#Calculate the t-score for .05 area at the lower tail, with df=99
qt(.05, 99)
[1] -1.660391
By default, qt() provides the critical values for a specified alpha area at the
lower tail of the distribution (hence, -1.66). To find the t-score associated with
an alpha area at the right (upper) tail of the distribution, just add lower.tail=F
to the command:
#Calculate the t-score for .05 area at the upper tail, with df=99
qt(.05, 99, lower.tail=F)
[1] 1.660391
For a two-tailed test, you need to cut the alpha area in half:
#Calculate t-score for .025 at one tail, with df=99
qt(.025, 99)
[1] -1.984217
Here, R reports a critical value of −1.984, which we take as ±1.984 for a two-
tailed test from a sample with df=99. Again, this is slightly larger than the
critical value for a z-score (1.96). If you used the t-score table to do this the
old-school way, you would find the critical value is t=1.99, for df=90. The results
from using the qt function are more accurate than from using the t-table since
you are able to specify the correct degrees of freedom.
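To see how quickly the t-based critical values converge on the z-based value, you can compare qt() across several degrees of freedom to the corresponding qnorm() value (a quick check, not from the book's examples):

#One-tailed .05 critical values for increasing degrees of freedom
sapply(c(10, 30, 99, 1000), function(df) qt(.05, df))
#Compare to the z-based critical value
qnorm(.05)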
Whether using a one- or two-tailed test, the conclusion for the sick leave example is unaffected: the t-score obtained from the sample (-2.86) is in the critical region, so we reject H0.
We can also get a bit more precise estimate of the probability of getting a sample
mean of 54.8 from a population in which 𝜇=59.2 by using the pt() function to
get the area under the curve to the left of t=-2.86:
pt(-2.86, df=99)
[1] 0.002583714
Note that this result is very similar to what we obtained when using the z-
distribution (.002118). To get the area under the curve to the right of a positive
t-score, add lower.tail=F to the command:
#Specifying "lower.tail=F" instructs R to find the area to the right of
#the t-score
pt(2.86, df=99, lower.tail = F)
[1] 0.002583714
For a two-tailed test using the t-distribution, we double this to find a p-value
equal to .005167.
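That doubling is easy to verify directly in R:

#Two-tailed p-value: double the one-tailed area
2*pt(-2.86, df=99)   #approximately .005167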
9.4 Proportions
As discussed in Chapter 8, the logic of hypothesis testing about mean values
also applies to proportions. For example, in the sick leave example, instead of
testing whether 𝜇 = 59.2 we could test a hypothesis regarding the proportion
of employees who take a certain number of sick days. Let’s suppose that in the
year before the new policy went into effect, 50% of employees took at least 7 sick
days. If the new policy has an impact, then the proportion of employees taking
at least 7 days of sick leave during the year after the change in policy should
be lower than .50. In the sample of 100 employees used above, the proportion
of employees taking at least 7 sick days was .41. In this case, the null and
alternative hypotheses are:
H0 : P=.50
H1 : P<.50
To review, in the previous example, to test the null hypothesis we established a desired level of statistical significance (.05), determined the critical value for the t-score (-1.66), calculated the t-statistic, and compared it to the critical value. There are a couple of differences, however, when working with hypotheses about the population value of proportions.
Because we can calculate the population standard deviation based on the hypoth-
esized value of P (.5), we can use the z-distribution rather than the t-distribution
to test the null hypothesis. To calculate the z-score, we use the same formula
as before:
$$z = \frac{p - P}{S_p}$$
Where:
$$S_p = \sqrt{\frac{P(1-P)}{n}}$$
With P = .50 and n = 100, the standard error is .05, so z = (.41 - .50)/.05 = -1.8. We know from before that the critical value for a one-tailed test using the z-distribution is -1.65. Since this z-score is larger (in absolute terms) than the critical value, we can reject the null hypothesis and conclude that the proportion of employees using at least 7 days of sick leave per year is lower than it was in the year before the new sick leave policy went into effect.
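The same calculation in R, using only the numbers from the example above:

#z-score for the sample proportion (.41) under the null hypothesis (P=.50)
P <- .50
p <- .41
n <- 100
se_p <- sqrt(P*(1-P)/n)  #.05
(p - P)/se_p             #-1.8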
Again, we can be a bit more specific about the p-value:
pnorm(-1.8)
[1] 0.03593032
Here are a couple of things to think about with this finding. First, while the
p-value is lower than .05, it is not much lower. In this case, if you took 1000
samples of 100 workers from a population in which 𝑃 = .50 and calculated the
proportion who took 7 or more sick days, approximately 36 of those samples
would produce a proportion equal to .41 or lower, just due to sampling error.
This still means that the probability of getting this sample finding from a pop-
ulation in which the null hypothesis was true is pretty small (.03593), so we
should be comfortable rejecting the null hypothesis. But what if there were
good reasons to use a two-tailed test? Would we still reject the null hypothesis?
No, because the critical value for a two-tailed test (-1.96) would be larger in ab-
solute terms than the z-score, and the p-value would be .07186. These findings
stand in contrast to those from the analysis of the average number of sick days
taken, where the p-values for both one- and two-tailed tests were well below the
.05 cut-off level.
One of the take-home messages from this example is that our confidence in
findings is sometimes fragile, since “significance” can be a function of how you
frame the hypothesis test (one- or two-tailed test?) or how you measure your
outcomes (average hours of sick days taken, or proportion who take a certain
number of sick days). For this reason, it is always a good idea to be mindful of
how the choices you make might influence your findings.
9.5 T-test in R
Let's say you are looking at data on public perceptions of the presidential candidates in 2020 and you have a sense that people had mixed feelings about the Democratic nominee, Joe Biden, going into the election. This leads you to expect that his average rating on the 0 to 100 feeling thermometer scale from the ANES was probably about 50. You decide to test this directly with the anes20 data set.
The null hypothesis is:
H0 : 𝜇 = 50
Because there are good arguments for expecting the mean to be either higher
or lower than 50, the alternative hypothesis is two-tailed:
H1 : 𝜇 ≠ 50
First, you get the sample mean:
#Get the sample mean for Biden's feeling thermometer rating
mean(anes20$V202143, na.rm=T)
[1] 53.41213
Here, you see that the mean feeling thermometer rating for Biden in the fall of
2020 was 53.41. This is higher than what you thought it would be (50), but you
know that it's possible to get a sample outcome of 53.41 from a population in which the mean is actually 50, so you need to do a t-test to rule out sampling error as the reason for the difference.
In R, the command for a one-sample two-tailed t-test is relatively simple: you just have to specify the variable of interest and the value of 𝜇 under the null hypothesis:
#Use 't.test' and specify the variable and mu
t.test(anes20$V202143, mu=50)
data: anes20$V202143
t = 8.1805, df = 7368, p-value = 3.303e-16
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
52.59448 54.22978
sample estimates:
mean of x
53.41213
These results are pretty conclusive: the t-score is 8.2 and the p-value is very close to zero. Also, if it makes more sense for you to think of this in terms of a
confidence interval, the 95% confidence interval ranges from about 52.6 to 54.2,
which does not include 50. We should reject the null hypothesis and conclude
instead that Biden’s feeling thermometer rating in the fall of 2020 was greater
than 50.
Even though Joe Biden’s feeling thermometer rating was greater than 50, from a
substantive perspective it is important to note that a score of 53 does not mean
Biden was wildly popular, just that his rating was greater than 50. This point
is addressed at greater length in the next several chapters, where we explore
measures of substantive importance that can be used to complement measures
of statistical significance.
9.7 Exercises
9.7.1 Concepts and Calculations
1. The survey of 300 college students introduced in the end-of-chapter exer-
cises in Chapter 8 found that the average semester expenditure was $350
with a standard deviation of $78. At the same time, campus administra-
tion has done an audit of required course materials and claims that the
average cost of books and supplies for a single semester should be no more
than $340. In other words, the administration is saying the population
value is $340.
a. State a null and alternative hypothesis to test the administration’s
claim. Did you use a one- or two-tailed alternative hypothesis? Ex-
plain your choice
b. Test the null hypothesis and discuss the findings. Show all calcula-
tions
2. The same survey reports that among the 300 students, 55% reported being
satisfied with the university’s response to the COVID-19 pandemic. The
administration hailed this finding as evidence that a majority of students
support the course they’ve taken in reaction to the pandemic. (Hint: this
is a “proportion” problem)
a. State a null and alternative hypothesis to test the administration’s
claim. Did you use a one- or two-tailed alternative hypothesis? Ex-
plain your choice
b. Test the null hypothesis and discuss the findings. Show all calcula-
tions
3. Determine whether the null hypothesis should be rejected for the following pairs of t-scores and critical values.
a. t=1.99, c.v.= 1.96
b. t=1.64, c.v.= 1.65
9.7.2 R Problems
For this assignment, you should use the feeling thermometers for Donald
Trump (anes20$V202144), liberals (anes20$V202161), and conservatives
(anes20$V202164).
1. Using descriptive statistics and either a histogram, boxplot, or density
plot, describe the central tendency and distribution of each feeling ther-
mometer.
2. Use the t.test function to test the null hypotheses that the mean for
each of these variables in the population is equal to 50. State the null and
alternative hypotheses and interpret the findings from the t-test.
3. Taking these findings into account, along with the analysis of Joe Biden's feeling thermometer at the end of the chapter, do you notice any apparent contradictions in American public opinion? Explain.
4. Use the pt() function to calculate the p-value (area at the tail) for each
of the following t-scores (assume one-tailed tests).
a. t=1.45, df=49
b. t=2.11, df=30
c. t=-.69, df=200
d. t=-1.45, df=100
5. What are the p-values for each of the t-scores listed in Problem 4 if you
assume a two-tailed test?
6. Treat the t-scores from Problem 4 as z-scores and use the pnorm() function to calculate the p-values. List the p-values and comment on the differences
between the p-values associated with the t- and z-scores. Why are some
closer in value than others?
Chapter 10
Hypothesis Testing with Two Groups
Respondents rate feminists on a scale running from 0 (negative, "cold" feelings) to 100 (positive, "warm" feelings). Which group do you think is likely to rate feminists the highest, men or women?1 Although both women and men can
be feminists (or anti-feminists), the connection between feminism and the fight
for the rights of women leads quite reasonably to the expectation that women
support feminists at higher levels than men do.
Before shedding light on this with data, let’s rename the ANES measures
for respondent sex (anes20$V201600) and the feminist feeling thermometer
(anes20$V202160) so they are a bit easier to use in subsequent commands.
#Create new respondent sex variable
anes20$Rsex<-factor(anes20$V201600)
#Assign category labels
levels(anes20$Rsex)<-c("Male", "Female")
#Create new feminist feeling thermometer variable
anes20$femFT<-anes20$V202160
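The subgroup means shown below were presumably produced with an aggregate() call along the following lines; the exact command is not reproduced in this excerpt, so treat this as a sketch:

#Mean feminist feeling thermometer, by respondent sex (sketch)
aggregate(anes20$femFT, by=list(anes20$Rsex), mean, na.rm=TRUE)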
Group.1 x
1 Male 54.54
2 Female 62.55
What this table shows is that the average feeling thermometer rating for feminists was 62.55 for women and 54.54 for men, a difference of about eight points on a scale from 0 to 100. So, it looks like there is a difference in attitudes toward feminists,
with women viewing them more positively than men. At the same time, it is
1 Here, it is important to acknowledge multiple other forms of gender identity and gender
expression, including but not limited to transgender, gender-fluid, non-binary, and intersex.
The survey question used in the 2020 ANES, as well as in most research on “gender gap”
issues, utilizes a narrow sense of biological sex, relying on response categories of “Male” and
“Female.”
important to note that both groups, on average, have positive feelings toward
feminists.
Another useful R function we can use to compare the means of these two
groups is compmeans, which produces subgroup means and standard devia-
tions, as well as a boxplot that shows the distribution of the dependent variable
for each value of the independent variable. The format for this command is:
compmeans(dependent, independent). You should also include axis labels
and other plot commands (see below) since a boxplot is included by default
(you can suppress it with plot=F). When you run this command, you will get
a warning message telling you that there are missing data. Don’t worry about
this for now unless the number of missing cases seems relatively large compared
to the sample size.
#List the dependent variable first, then the independent variable.
#Add graph commands
compmeans(anes20$femFT, anes20$Rsex,
xlab="Sex",
ylab="Feminist Feeling Thermometer",
main="Feminist Feeling Thermometer, by Sex")
First, note that the subgroup means are the same as those produced using the
aggregate command. The key difference in the numeric output is that we
also get information on the standard deviation in both subgroups, as well as
the mean and standard deviation for the full sample. In addition, the boxplot
produced by the compmeans command provides a visualization of the two dis-
tributions side-by-side, giving us a chance to see how similar or dissimilar they
are. Remember, the box plots do not show the differences in means, but they
do show differences in medians and interquartile ranges, both of which can be
indicative of group-based differences in outcomes on the dependent variable.
The side-by-side distributions in the box plot do appear to be different, with
the distribution of outcomes concentrated a bit more at the high end of the
feeling thermometer among women than among men.
So, it looks like there is a gender gap in attitudes toward feminists, with women
rating feminists about eight points (8.01) higher than men rate them. But there
is a potential problem with this conclusion. The main problem is that we are
using sample data and we know from sampling theory that it is possible to
find a difference in the sample data even if there really is no difference in the
population. The question we need to answer is whether the difference we observe
in this sample is large enough that we can reject the possibility that there is no
difference between the two groups in the population. In other words, we need to
expand the logic of hypothesis test developed earlier to incorporate differences
between two sample means.
follow a normal curve and the mean will equal the difference between the the
two subgroup in the population (𝜇1 − 𝜇2 ).
To answer this, we need to transform 𝑥1̄ − 𝑥2̄ from a raw score for the difference
into a standard score difference (t or z-score). Since we are working with a sam-
ple, we focus on t-scores in this application. Recall, though, that the calculation
for a t-score is the same as for a z-score:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{S_{\bar{x}_1 - \bar{x}_2}}$$
So, as we did with a single mean, we divide the raw score difference by the
standard error of the sampling distribution to convert the raw difference into a
t-score. The standard error of the difference is a function of the variance in both
subgroups, along with the sample sizes. Since we do not know the population
variances, we rely on sample variances to estimate them:
$$S_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$
Where 𝑆12 and 𝑆22 are the sample variances of the sub-groups.
The standard error represents the standard deviation of the sampling distribu-
tion that would result from repeated (large, random) samples from which we
would calculate differences between the group means. When we calculate a t-
score based on this we are asking how many standard errors our sample finding
is from the population parameter (𝜇1 − 𝜇2 = 0)
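Here is a small sketch of that calculation from summary statistics. The numbers below are hypothetical placeholders, not the ANES results discussed in this chapter:

#Hypothetical subgroup statistics (not the ANES data)
m1 <- 54.5; s1 <- 27; n1 <- 3300    #group 1 mean, sd, n
m2 <- 62.5; s2 <- 26; n2 <- 4000    #group 2 mean, sd, n
se_diff <- sqrt(s1^2/n1 + s2^2/n2)  #standard error of the difference
(m1 - m2)/se_diff                   #t-score for the difference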
In the theoretical figure above, the t score for the difference is -1.96. What is the
probability of getting a t-score of this magnitude if there is really no difference
between the two groups? That probability is equal to the area to the left of
t=-1.96 (the same for both t- and z-scores with a large sample), which is .025.
So, with a one-tailed hypothesis we would reject H0 because there is less than
a .05 probability that it is true.
[1] -1.645
3. Calculate the t-score from the sample data.
4. Compare t-score to the critical value. If |𝑡| > 𝑐.𝑣., then reject 𝐻0 ; if
|𝑡| < 𝑐.𝑣., then fail to reject.
You may have noticed that I subtracted the mean for women from the mean for
men in the numerator, leading to a negative t-score. The reason for this is that
the R function for conducting t-tests subtracts the second value it encounters
from the first value by default, so the calculation above is set up to reflect what
we should expect to find when using R to do the work for us. The negative
value makes sense in the context of our expectations, since it means that the
value for women is higher than the value for men.
In this case, the t-score far exceeds the critical value (-1.645), so we reject the null
hypothesis and conclude that there is a gender gap in evaluations of feminists on
the feeling thermometer scale, with women providing higher ratings of feminists
than those provided by men.
T-test in R. The R command for conducting a t-test (t.test) is straightfor-
ward and easy to use. The format is t.test(dependent~independent). The ~ symbol is used in this and other functions to signal that you are using a formula that specifies a dependent and independent variable.
#use t.test to get t-score for Feminist FT by Sex
t.test(anes20$femFT~anes20$Rsex)
There are a few important things to pick up on here. First, the reported t-score (-13) is very close to the one we calculated (-12.89), with the difference due to rounding in the R output. Second, the reported p-value is 2e-16. Recall from
earlier that scientific notation like this is used as a shortcut when the actual
numbers have several digits. In this case, the notation means that the p-value
is less than .0000000000000002. Since R uses a two-tailed test by default, this is
the total area under the curve at the two tails combined (to the outside of t=-13
and t=+13). This means that there is virtually no chance the null hypothesis
is true. Of course, for a one-tailed test, we still reject the null hypothesis
since the p-value is even lower. Third, the t.test output also provides a 95%
confidence interval around the sample estimate of the difference between the
two groups. The way to interpret this is that based on these results, you can
be 95% certain that in the population the gender gap in ratings of feminists
is between -9.231 and -6.794 points. Importantly, since the confidence interval
does not include 0 (no difference), we can also use this as a basis for rejecting
the null hypothesis. Finally, you probably noticed that the reported degrees of
freedom (7111) is different than what I calculated above (7271). This is because
one of the assumptions underlying t-tests is that the variances in the two groups
are the same. If they are not, then some corrections need to be made, including
adjustments to the degrees of freedom. The Welch’s two-sample test used by R
does not assume that the two sample variances are the same and, by default,
always makes the correction. In this case, with such a large sample, the findings
are not really affected by the correction other than the degrees of freedom. This
makes the Welch’s two-sample test a slightly more conservative test, which I see
as a virtue.
In the output below, I run the same t-test but specify a one-tailed test
(alternative="less") and assume that the variances are equal (var.equal=T).
As you can see, the results are virtually identical, except that now 𝑑𝑓 = 7271
and the confidence interval is now negative infinity to -6.987.
#t.test with one-tailed test and equal variance
t.test(anes20$femFT~anes20$Rsex, var.equal=T, alternative="less")
$$D = \frac{\bar{x}_1 - \bar{x}_2}{S}$$
All of this information can be obtained from the following compmeans output:
#Get means and standard deviations using 'compmeans'
compmeans(anes20$femFT,anes20$Rsex, plot=F)
[1] -0.2993
#Cohen's D for Big Business FT
(48.41-47.29)/22.67
[1] 0.0494
Let’s check our work with the R command for getting Cohen’s D:
#Get Cohens D for impact of sex on 'femFT' and 'busFT'
cohens_d(anes20$femFT~anes20$Rsex)
Cohen's d | 95% CI
--------------------------
-0.30 | [-0.35, -0.26]
Cohen's d | 95% CI
------------------------
0.05 | [0.00, 0.10]
Note that the guidelines are expressed as positive values, but a finding of 𝐷 = -.3 should be treated the same as 𝐷 = .3. Using the guideline in Table 1, it is fair to describe the impact of respondent sex on the feminist feeling thermometer rating as small, while the impact on big business ratings is tiny at best.
The mean of this variable is the proportion who think abortion should never
be permitted. Based on conventional wisdom, and on the gender gaps reported earlier in this chapter, the expectation is that the proportion of women who
think abortion should never be permitted is lower than the proportion of men
who support this position.
Since these means are actually proportions:
$$H_0: P_W = P_M$$
$$H_1: P_W < P_M$$
Let’s see what the sample statistics tell us about the sex-based difference in the
proportions who support banning all abortions. In this case, we suppress the
boxplot because it is not a useful tool for examining variation in dichotomous
outcomes (Go ahead and generate a boxplot if you want to see what I mean).
#Generate means (proportions), by sex
compmeans(anes20$banAb.n, anes20$Rsex, plot=F)
So, we need to go through the process again of calculating a t-score for the
difference between the two groups and compare it to the critical value (1.65 or
1.96). The formula should look very familiar to you:
$$S_{p_1 - p_2} = \sqrt{p_u(1 - p_u)} \times \sqrt{\frac{N_1 + N_2}{N_1 N_2}}$$
Here, 𝑝𝑢 is the estimate of the population proportion (𝑃), which we can get from the compmeans table. The proportion in the full sample supporting a ban on abortions is .1059, so the standard error of the difference is:
#Calculate the standard error for the difference
sqrt(.1059*(1-.1059))*sqrt((3728+4422)/(3728*4422))
[1] 0.006842
With $S_{p_1 - p_2} = .006842$:
$$t = \frac{.0033}{.006842} = .4823$$
Okay, now let’s see what R tells us. First, we will use a t-test and treat this as
just another difference in means test:
#Test for difference in banAb.n, by sex
t.test(anes20$banAb.n~anes20$Rsex)
It looks like our calculations were just about spot on. There is no significant
relationship between respondent sex and supporting a ban on abortions. None
whatsoever. The t-score is only -.49, far less than a critical value of either 1.96
or 1.65, and the reported p-value is .60, meaning there is a pretty good chance
of drawing a sample difference of this magnitude from a population in which
there is no real difference. This is also reflected in the confidence interval for
the difference (-.0167 to .01), which includes the value of 0 (no difference). So,
we fail to reject the null hypothesis.
What about conventional wisdom? Doesn’t everyone know that there is a huge
gender gap on abortion? Sometimes, conventional wisdom meets data and con-
ventional wisdom loses. Results similar to the one presented above are not
unusual in quantitative studies of public opinion on this issue. Sometimes there
is no gender gap, and sometimes there is a gap, but it tends to be a small one.
For instance, if we focus on the other end of the original abortion variable and
create a dichotomous variable indicating those who think abortion generally
should be available as a matter of choice, we find a significant gender gap:
#Create "choice" variable
anes20$choice<-factor(anes20$V201336)
#Change levels to create two-category variable
levels(anes20$choice)<-c("Other","Other","Other","Choice by Law","Other")
#Create numeric indicator for "Choice by law"
anes20$choice.n<-as.numeric(anes20$choice=="Choice by Law")
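The significance test for this new indicator is not reproduced in this excerpt, but it would presumably follow the same pattern as the earlier tests, something like:

#Sketch: difference in the proportion favoring "Choice by Law", by sex
t.test(anes20$choice.n~anes20$Rsex)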
Cohen's d | 95% CI
--------------------------
-0.10 | [-0.14, -0.06]
Group.1 x
1 Male 54.54
2 Female 62.55
R stored this object as a data.frame with two variables, agg_femFT$x and
agg_femFT$Group.1. We use this information to create a bar chart using the
format barplot(dependent~independent) and add some labels:
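(The barplot() call itself is not reproduced in this excerpt; the following is a sketch consistent with that format, with axis labels of my own choosing.)

#Sketch: bar chart of mean ratings, by sex
barplot(x~Group.1, data=agg_femFT,
        xlab="Sex of Respondent",
        ylab="Mean Feminist Feeling Thermometer",
        ylim=c(0,70))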
[Bar chart: mean Feminist Feeling Thermometer rating, by Sex of Respondent]
I think this bar plot does a nice job of showing the difference between the two
groups while also communicating that difference in the context of the scale of
the dependent variable. It shows that there is a difference, but a somewhat
modest difference.
[Means plot: Feminist Feeling Thermometer, by Respondent Sex, with 95% confidence intervals around each group mean]
What you see here are two small circles representing the mean outcomes on
the dependent variable for each of the two independent variable categories and
error bars (vertical lines within end caps) representing the confidence intervals
around each of the two subgroup means. As you can see, there appears to be
a substantial difference between the two groups. This is represented by the
vertical distance between the group means (the circles), but also by the fact
that neither of the confidence intervals overlaps with the other group mean. If
the confidence interval of one group overlaps with the mean of the other, then
the two groups are not statistically different from each other.
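The means plot shown here appears to have been made with a function such as plotmeans() from the gplots package (that package is listed in this chapter's R problems, but the exact command is not reproduced in this excerpt); a rough sketch:

#Sketch: means plot with confidence intervals, using gplots
library(gplots)
plotmeans(anes20$femFT~anes20$Rsex,
          xlab="Respondent Sex",
          ylab="Feminist Feeling Thermometer",
          n.label=FALSE)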
I like the means plot as a graphing tool, but one drawback is that it can create
the impression that the difference between the two groups is more meaningful
than it is. Note in this graph that the span of the y-axis is only as wide as
the lowest and highest confidence limits. There is nothing technically wrong
with this. However, since the scale of the dependent variable is from 0 to 100,
restricting the view to these narrow limits can make it seem that there is a more
important difference between the two groups than there actually is. This is why
it is important to measure the effect size with something like Cohen’s D and to
pay attention to the scale of the dependent variable when evaluating the size of
the subgroup differences.
The means plot can be altered to give a more realistic sense of the magnitude of the effect. In the figure below, I expand the y-axis so the limits are now 45 and 70 (using ylim=c()); not the full range of the variable, but a much wider
range than before. As a result, you can still see that there is clearly a difference
in outcomes between the two groups, but the magnitude of the difference is, I
think, more realistically displayed, given the scale of the variable.5 The actual
difference between the two groups is the same in both figures, but the figures convey different impressions of its magnitude.
5 The confidence intervals are difficult to see because they are very small relative to the size of the y-axis, making it hard for R to print them. This will create multiple warnings if you run the code. Don't worry about these warnings.
[Means plot: Feminist Feeling Thermometer, by Respondent Sex, with the y-axis expanded to range from 45 to 70]
10.7 Exercises
10.7.1 Concepts and Calculations
1. The means plot below illustrates the mean feminist feeling thermometer rating for two groups of respondents, those aged 18 to 49 and those aged 50 and older. Based just on this graph, does there appear to be a relationship between age and support for feminists? Justify and explain your
answer.
[Means plot: mean Feminist Feeling Thermometer rating, by Age Range (18 to 49 vs. 50+); the y-axis runs from 58.0 to 60.0]
2. In response to the student survey that was used for the exercises in Chap-
ters 8 and 9, a potential donor wants to provide campus bookstore gift
certificates as a way of defraying the cost of books and supplies. In consul-
tation with the student government leaders, the donor decides to prioritize
first and second year students for this program because they think that
upper-class students spend less than others on books and supplies. Before finalizing the decision, the student government wants to test whether
there really is a difference in the spending patterns of the two groups of
students.
A. What are the null and alternative hypotheses for this problem? Ex-
plain.
B. Using the data listed below, test the null hypothesis and summarize
your findings for the student government. Is there a significant relationship
between class standing and expenditures? Is the relationship strong?
C. Based on these findings, should the donor prioritize first and second
year students for assistance? Be sure to go beyond just reciting the statis-
tics when you answer this question.
Expenditures on Books and Supplies, by Class Status
10.7.2 R Problems
For these problems, use the county20large data set to examine how county-level educational attainment is related to COVID-19 cases per 100,000 population. You need to load the following libraries: dplyr, Hmisc, gplots, descr, and effectsize.
1. The first thing you need to do is take a sample of 500 counties from the county20large data set and store that sample in a new data set, covid500, using the command listed below.
set.seed(1234)
#create a sample of 500 rows of data from the county20large data set
covid500<-sample_n(county20large, 500)
The sample_n command samples rows of data from the data set, so we now have
500 randomly selected counties with data on all of the variables in the data set.
The dependent variable in this assignment is covid500$cases100k_sept821
(cumulative COVID-19 cases per 100,000 people, up to September 8, 2021),
and the independent variable is covid500$postgrad, the percent of adults in
the county with a post-graduate degree. The expectation is that case rates are lower in counties with relatively high levels of education than in other counties.
Chapter 11
Hypothesis Testing with Multiple Groups
Before analyzing what explains differences in country-level internet access, we should get acquainted with its distribution, using a histogram and some basic descriptive statistics. Let's start with a histogram:
hist(countries2$internet, xlab="% Internet Access",
ylab="Number of Countries", main="")
Here we see a fairly even distribution from the low to the middle part of the
scale, and then a spike in the number of countries in the 70-90% access range.
The distribution looks a bit left-skewed, though not too severely. We can also
get some descriptive statistics to help round out the picture:
#Get some descriptive statistics for internet access
summary(countries2$internet, na.rm=T)
[1] -0.2114
sd(countries2$internet, na.rm = T)
[1] 28.79
The mean, median, and skew statistics confirm the initial impression of relatively
modest negative skewness. Both the histogram and descriptive statistics also
show that there is a lot of variation in internet access around the world, with
the bottom quarter of countries having access levels ranging from about 0% to
about 28% access, and the top quarter of countries having access levels ranging
from about 80% to almost 100% access.1 As with many such cross-national
comparisons, there are clearly some “haves” and “have nots” when it comes to
internet access.
1 Make sure you can tell where I got these numbers.
To add a bit more context and get us thinking about the factors that might be
related to country-level internet access, we can look at the types of countries
that tend to have relatively high and low levels of access. To do this, we use the
order function that was introduced in Chapter 2. The primary difference with
the use of this function here is that there are missing data for many variables in
the countries2 data set, so we need to omit the missing observations (na.last = NA) so they are not listed at the end of the data set.
#Sort the data set by internet access, omitting missing data.
#Store the sorted data in a new object
sorted_internet<-countries2[order(countries2$internet, na.last = NA),]
#The first six rows of two variables from the sorted data set
head(sorted_internet[, c("wbcountry", "internet")])
wbcountry internet
190 Korea (Democratic People's Rep. of) 0.000
180 Eritrea 1.309
194 Somalia 2.004
185 Burundi 2.661
176 Guinea-Bissau 3.931
188 Central African Republic 4.339
#The last six rows of the same two variables
tail(sorted_internet[, c("wbcountry", "internet")])
wbcountry internet
21 Liechtenstein 98.10
31 United Arab Emirates 98.45
42 Bahrain 98.64
5 Iceland 99.01
64 Kuwait 99.60
45 Qatar 99.65
Do you see anything in these sets of countries that might help us understand
and explain why some countries have greater internet access than others? First,
there are clear regional differences: most of the countries with high levels of internet access (> 98% access) are small Middle Eastern or northern European countries, while countries with the most limited access (< 4.4% access) are almost all African countries. Beyond regional patterns, one thing that stands out
is the differences in wealth in the two groups of countries: those with greater
access tend to be wealthier and more industrialized than those countries with
the least amount of access.
Now, let’s use compmeans to examine the mean differences in internet access
between low and high GDP countries. Note that I added an “X” to each box to
reflect the mean of the dependent variable within the columns.
#Evaluate mean internet access, by level of gdp
compmeans(countries2$internet, countries2$gdp2,
xlab="GDP per Capita",
ylab="Number of Countries")
#Add points to the plot, reflecting mean values of internet access
#"pch" sets the marker type, "cex" sets the marker size
points(c(33.39, 76.41),pch=4, cex=2)
In this graph, we see what appears to be strong evidence that GDP per capita
is related to internet access: on average, 33.4% of households in countries in the
bottom half of GDP have internet access, compared to 76.4% in countries in the
top half of the distribution of per capita GDP. The boxplot provides a dramatic
illustration of this difference, with the vast majority of low GDP countries also
having low levels of internet access, and the vast majority of high GDP countries
having high levels of access. We can do a t-test and get the Cohens-D value just
to confirm that this is indeed a statistically significant and strong relationship.
#Test for the impact of "gdp2" on internet access
t.test(countries2$internet~countries2$gdp2, var.equal=T)
Cohen's d | 95% CI
--------------------------
-2.40 | [-2.78, -2.02]
2 The negative value reflects the subtraction of the mean of the second group (high GDP) from the mean of the first group (low GDP).
Q3 65.91 45 15.097
Q4 86.68 46 9.441
Total 54.78 183 27.994
#Add points to the plot reflecting mean values of internet access
points(c(19.39, 47.38,65.91,86.68),pch=4, cex=2)
[Boxplot: % Households with Internet Access, by per capita GDP quartile (Q1-Q4)]
The mean level of internet access increases from about 19.4% for countries in the
first quartile to 47.4% in the second quartile, to 65.9% for the third quartile, to
86.7% in the top quartile. Although there is a steady increase in internet access
as GDP per capita increases, there still is considerable variation in internet
access within each category of GDP. Because of this variation, there is overlap in
the internet access distributions across levels of GDP. There are some countries
in the lowest quartile of GDP that have greater access to the internet than some
countries in the second and third quartiles; some in the second quartile with
higher levels of access than some in the third and fourth quartiles; and so on.
On its face, this looks like a strong pattern, but the key question is whether the
differences in internet access across levels of GDP are great enough, relative to
the level of variation within categories, that we can determine that there is a
significant relationship between the two variables.
independent variable, the differences in the data may be due to random error.
Just as with the t-test, we use ANOVA to test a null hypothesis. The null
hypothesis in ANOVA is that the mean value of the dependent variable does
not vary across the levels of the independent variable:
$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$$
In other words, in the example of internet access and GDP, the null hypothesis
would be that the average level of internet access is unrelated to the level of GDP.
Of course, we will observe some differences in the level of internet access across
different levels of GDP. It’s only natural that there will be some differences.
The question is whether these differences are substantial enough that we can
conclude they are real differences and not just apparent differences that are
due to random variation. When the null hypothesis is true, we expect to see
relatively small differences in the mean of the dependent variable across values
of the independent variable. So, for instance, when looking at side-by-side box
plots, we would expect to see a lot of overlap in the distributions of internet
access across the four levels of per capita GDP.
Technically, H1 states that at least one of the mean differences is statistically
significant, but we are usually a bit more casual and state it something like this:
𝐻1 ∶ The mean level of internet access in countries varies across levels of GDP
Even though we usually have expectations regarding the direction of the rela-
tionship (high GDP associated with high level of internet access), the alterna-
tive hypothesis does not state a direction, just that there are some differences in
mean levels of the dependent variable across levels of the independent variable.
Now, how do we test this? As with t-tests, it is the null hypothesis that we test
directly. I’ll describe the ideas underlying ANOVA in intuitive terms first, and
then we will move on to some formulas and technical details.
Recall that with the difference in means test we calculated a t-score that was
based on the size of the mean difference between two groups divided by the
standard error of the difference. The mean difference represented the impact
of the independent variable, and the standard error reflected the amount of
variation in the data. If the mean difference was relatively large, or the standard
error relatively small, then the t-score would usually be larger than the critical
value and we could conclude that there was a significant difference between the
means.
We do something very similar to this with ANOVA. We use an estimate of
the overall variation in means between categories of the independent variable
(later, we call this “Mean Square Between”) and divide that by an estimate of
the amount of variation around those means within categories (later, we call
this “Mean Squared Error”). Mean Squared Between represents the size of
the effect, and Mean Squared Error represents the amount of variation. The
resulting statistic is called an F-ratio and can be compared to a critical value
of F to determine if the variation between categories (magnitude of differences)
is large enough relative to the total variation within categories that we can
conclude that the differences we see are not due to sampling error. In other
words, can we reject the null hypothesis?
[1] 142629
𝑆𝑆𝑇 = 142629.5
This number represents the total variation in y. The SST can be decomposed
into two parts, the variation in y between categories of the independent variable
(the variation in means reported in compmeans) and the amount of variation in
y within categories of the independent variable (in spirit, this is reflected in the
separate standard deviations in the compmeans results). The variation between
categories is referred to as Sum of Squares Between (SSB) and the variation
within categories is referred to as sum of squares within (SSW).
3 This part of the first line, [countries2$gdp_pc!="NA"], ensures that the mean of the
dependent variable is taken just from countries that have valid observations on the independent
variable.
Here we subtract the mean of y for each category (k) from each observation in
the respective category, square those differences and sum them across categories.
It should be clear that SSW summarizes the variation around the means of the dependent variable within each category of the independent variable, similar to the variation seen in the side-by-side distributions shown in the boxplots presented above.
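The formulas themselves did not survive in this excerpt; the standard definitions, consistent with the descriptions given here, are:

$$SSW = \sum_{k}\sum_{i \in k}(y_i - \bar{y}_k)^2$$
$$SSB = \sum_{k} N_k(\bar{y}_k - \bar{y})^2$$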
where 𝑦𝑘̄ is the mean of y within a given category of x, and 𝑁𝑘 is the number
of cases in the category k.
Here we subtract the overall mean of the dependent variable from the mean of
the dependent variable in each category (k) of the independent variable, square
that difference, and then multiply it times the number of observations in the
category. We then sum this across all categories. This provides us with a good
sense of how much the mean of y differs across categories of x, weighted by the
number of observations in each category. You can think of SSB as representing
the impact of the independent variable on the dependent variable. If y varies
very little across categories of the independent variable, then SSB will be small;
if it varies a great deal, then SSB will be larger.
Using information from the compmeans command, we can calculate the SSB
(Table 11.1). In this case, the sum of squares between = 112490.26.
Degrees of freedom. In order to use SSB and SSW to calculate the F-ratio
we need to calculate their degrees of freedom.
For SSW: 𝑑𝑓𝑤 = 𝑛 − 𝑘 (183 − 4 = 179)
For SSB: 𝑑𝑓𝑏 = 𝑘 − 1 (4 − 1 = 3)
Where n=total number of observations, k=number of categories in x.
The degrees of freedom can be used to standardize SSB and SSW, so we can
evaluate the variation between groups in the context of the variation within
groups. Here, we calculate the Mean Squared Error4 (MSE), based on SSW,
and Mean Squared Between, based on the SSB:
Mean Squared Error (MSE):
$$MSE = \frac{SSW}{df_w} = \frac{30139.24}{179} = 168.4$$
Mean Squared Between (MSB):
$$MSB = \frac{SSB}{df_b} = \frac{112490.26}{3} = 37496.8$$
As described earlier, you can think of the MSB as summarizing the differences
in mean outcomes in the dependent variable across categories of the indepen-
dent variable, and the MSE as summarizing the amount of variation around
those means. As with t and z tests, we need to compare the magnitude of the
differences across groups to the amount of variation in the dependent variable.
4 The term “error” may be confusing to you. In data analysis, people often use “error” and
“variation” interchangeably, since both terms refer to deviation from some predicted outcome,
such as the mean.
This brings us to the F-ratio, which is the actual statistic used to determine
the significance of the relationship:
$$F = \frac{MSB}{MSE}$$
In this case,
$$F = \frac{37496.8}{168.4} = 222.7$$
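You can verify the F-ratio, and attach a p-value to it, directly in R using the sums of squares and degrees of freedom reported above:

#Check the F-ratio and its p-value
MSB <- 112490.26/3
MSE <- 30139.24/179
F_ratio <- MSB/MSE                     #approximately 222.7
pf(F_ratio, 3, 179, lower.tail=FALSE)  #p-value is essentially zero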
Similar to the t and z-scores, if the value we obtain is greater than the critical value of F (the value that would give you an alpha area of .05 or less), then we can reject H0. We can use R to find the critical value for F. Using the qf function, we need to specify the desired p-value (.05), the degrees of freedom between (3), the degrees of freedom within (179), and that we want the p-value for the upper tail of the distribution.
#Get critical value of F for p=.05, dfb=3, dfw=179
qf(.05, 3, 179, lower.tail=FALSE)
[1] 2.655
In this case, the critical value for dfw=179, dfb=3 is 2.66. The shape of the F-
distribution is different from the t and z distributions: it is one-tailed (hence the
non-directional alternative hypothesis), and the precise shape is a function of the
degrees of freedom. The figure below illustrates the shape of the f-distribution
and the critical value of F for dfw=179, dfb=3, and a p-value of .05.
Figure 11.1: F-Distribution and Critical Value for dfw=179, dfb=3, p=.05
Any obtained F-ratio greater than 2.66 provides a basis for rejecting the null hypothesis, as it would indicate less than a .05 probability of getting differences of this magnitude through sampling error alone.
11.4 ANOVA in R
The ANOVA function in R is very simple: aov(dependent variable~independent variable). When using this function, the results of the analysis should be stored in a new object, which is labeled fit_gdp in this particular case.
#ANOVA--internet access, by level of gdp
#Store in 'fit_gdp'
fit_gdp<-aov(countries2$internet~countries2$gdp4)
We can use the summary command to view the information stored in fit_gdp:
#Show results of ANOVA
summary(fit_gdp)
#Compare all pairs of group means
TukeyHSD(fit_gdp)

$`countries2$gdp4`
diff lwr upr p adj
Q2-Q1 27.99 20.97 35.00 0
Q3-Q1 46.52 39.46 53.57 0
Q4-Q1 67.28 60.26 74.30 0
Q3-Q2 18.53 11.48 25.59 0
Q4-Q2 39.30 32.28 46.31 0
Q4-Q3 20.76 13.71 27.82 0
This output compares the mean level of internet access across all combinations
of the four groups of countries, producing six comparisons. For instance, the first
row compares countries in the second quartile to countries in the first quartile.
The first number in that row is the difference in internet access between the two
groups (27.99), the next two numbers are the confidence interval limits for that
difference (20.97 to 35.00), and the last number is the p-value for the difference
(0). Looking at this information, it is clear the significant f-value is not due to
just one or two significant group differences, but that all groups are statistically
different from each other.
These group-to-group patterns, as well as those from the boxplots help to con-
textualize the overall findings from the ANOVA results. The F-ratio doesn’t
tell us much about the individual group differences, but the size of the mean
differences in the compmeans results, the boxplots, and the TukeyHSD compar-
isons do provide a lot of useful information about what’s going on “under the
hood” of a significant F-test.
11.5 Effect Size
A common measure of effect size for ANOVA is eta squared, the share of the total
variation in the dependent variable (SST) that is accounted for by the between-group
variation (SSB):
$$\eta^2 = \frac{SSB}{SST} = \frac{112490}{142629} = .79$$
# Effect Size for ANOVA
For one-way between subjects designs, partial eta squared is equivalent to eta squared.
Returning eta squared.
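The message above is the kind printed by effect-size functions such as eta_squared() in the effectsize package. As an alternative, here is a minimal base-R sketch that computes eta squared directly from the stored model; it assumes the fit_gdp object created earlier:
#Sketch: eta squared computed from the stored ANOVA fit (assumes fit_gdp)
ss<-summary(fit_gdp)[[1]][["Sum Sq"]]  #sums of squares: between-group, residual
ssb<-ss[1]                             #between-group sum of squares (SSB)
sst<-sum(ss)                           #total sum of squares (SST)
ssb/sst                                #eta squared: share of variation explained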
[Plots of internet access by per capita GDP quartile (Q1-Q4)]
Regardless of which method you use–boxplot (used earlier), bar plot, or means
plot–the data visualizations for this relationship reinforce the ANOVA findings:
there is a strong relationship between per capita GDP and internet access around
the world. This buttresses the argument that internet access is a good indicator
of economic development.
11.6 Population Size and Internet Access
Let’s take a look at another example, using the same dependent variable, but
this time we will focus on how internet access is influenced by population size.
I can see plausible arguments that smaller countries might be expected to have
greater access to the internet, and I can see arguments for why larger countries
should be expected to have greater internet access. For right now, though, let’s
assume that we think the mean level of internet access varies across levels of
population size, though we are not sure how.
[Boxplot of % internet access by population-size quartile (Q1-Q4)]
Hmmm. This is interesting. The mean outcomes across all groups are fairly
similar to each other and not terribly different from the overall mean (54.2%).
In addition, all of the within-group distributions overlap quite a bit with all
others. So, we have what looks like very little between-group variation and a
lot of within-group variation. ANOVA can sort this out and tell us if there is a
statistically significant relationship between country size and internet access.
#Run ANOVA for impact of population size on internet access
fit_pop<-aov(countries2$internet~countries2$pop4)
#View results
summary(fit_pop)
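As before, the obtained F-ratio is compared to the critical value of F. A minimal sketch, assuming the fit_pop object created above, that takes the within-group degrees of freedom directly from the fitted model:
#Critical value of F for p=.05, using the model's residual (within-group) df
qf(.05, 3, df.residual(fit_pop), lower.tail=FALSE)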
[1] 2.652
We can check to see if any of the group differences are significant, using the
TukeyHSD command:
Figure 11.2: F-ratio and Critical Value for Impact of Population Size on Internet
Access
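A sketch of that command, assuming the fit_pop object created above:
#Pairwise (Tukey) comparisons of group means for population size
TukeyHSD(fit_pop)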
$`countries2$pop4`
diff lwr upr p adj
Q2-Q1 -1.057 -16.123 14.008 0.9979
Q3-Q1 -11.947 -27.012 3.119 0.1718
Q4-Q1 -3.914 -19.057 11.229 0.9083
Q3-Q2 -10.889 -25.877 4.098 0.2387
Q4-Q2 -2.857 -17.922 12.209 0.9609
Q4-Q3 8.033 -7.033 23.099 0.5122
Still nothing here. None of the six mean differences are statistically significant.
There is no evidence here to suggest that population size has any impact on
internet access.
11.7 Connecting the t-score and F-ratio
The t-score and the F-ratio are doing very similar things: both assess differences
in mean outcomes of the dependent variable while taking into account the amount
of variation in the dependent variable. One way to appreciate how similar they
are is to compare the F-ratio and t-score when an independent variable has only
two categories.
At the beginning of this chapter, we looked at how GDP is related to internet
access, using a two-category measure of GDP. The t-score for the difference in
this example was -16.276. Although we wouldn’t normally use ANOVA when
the independent variable has only two categories, there is nothing technically
wrong with doing so, so let’s examine this same relationship using ANOVA:
#ANOVA with a two-category independent variable
fit_gdp2<-aov(countries2$internet~countries2$gdp2)
summary(fit_gdp2)
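With only two groups, the F-ratio from the ANOVA is the square of the t-score (exactly so when the t-test assumes equal group variances, and approximately so otherwise). A quick arithmetic check using the t-score reported above:
#The square of the two-group t-score should (roughly) match the F-ratio
(-16.276)^2   #about 264.9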
11.9 Exercises
11.9.1 Concepts and Calculations
1. The table below represents the results of two different experiments. In
both cases, a group of political scientists were trying to determine if voter
turnout could be influenced by different types of voter contacting (phone
calls, knocking on doors, or fliers sent in the mail). One experiment took
place in City A, and the other in City B. The table presents the mean level
of turnout and the standard deviation for participants in each contact
group in each city. After running an ANOVA test to see if there were
significant differences in the mean turnout levels across the three voter
contacting groups, the researchers found a significant F-ratio in one city
but not the other. In which city do you think they found the significant
effect? Explain your answer. (Hint: you don’t need to calculate anything.)
                    City A                      City B
              Mean     Std. Dev.          Mean     Std. Dev.
Phone         55.7        7.5             57.7        4.3
Door Knock    60.3        6.9             62.3        3.7
Mail          53.2        7.3             55.2        4.1
2. When testing to see if poverty rates in U.S. counties are related to internet
access across counties, a researcher divided counties into four roughly equal-sized
groups according to their poverty level and then used ANOVA to see
if poverty had a significant impact on internet access. The critical value
for F in this case is 2.61, the MSB=27417, and the MSE=52. Calculate
the F-ratio and decide if there is a significant relationship between poverty
rates and internet access across counties. Explain your conclusion.
3. A recent study examined the relationship between race and ethnicity and
the feeling thermometer rating (0 to 100) for Big Business, using data from
the 2020 American National Election Study. The results are presented
below. Summarize the findings, paying particular attention to statistical
significance and effect size. Do any findings stand out as contradictory or
surprising?
[Figure: Big Business feeling thermometer ratings by respondent race/ethnicity]
11.9.2 R Problems
For these questions, we will expand upon the R assignment from Chapter 10.
First, create the sample of 500 counties, as you did in Chapter 10, using the
code below.
set.seed(1234)
#create a sample of 500 rows of data from the county20large data set
covid500<-sample_n(county20large, 500)
12.2 Crosstabs
ANOVA is a useful tool for examining relationships between variables, especially
when the dependent variable is numeric. However, suppose we are interested
in using data from the 2020 ANES survey to study the relationship between
variables such as level of educational attainment (the independent variable) and
religiosity (dependent variable), measured as the importance of religion in one’s
life. We can’t do this with ANOVA because it focuses on differences in the
average outcome of the dependent variable across categories of the independent
variable, and since the dependent variable in this case (importance of religion)
is ordinal, we can’t measure its average outcome. Many of the variables that
interest us—especially if we are using public opinion surveys—are measured at
the nominal or ordinal level, so ANOVA is not an appropriate method in these
cases.
Instead, we can use a crosstab (for cross-tabulation), also known as a contingency
table (I will use both “crosstab” and “contingency table” interchangeably to refer
to this technique). A crosstab is nothing more than a joint frequency distribution
that simultaneously displays the outcomes of two variables.
You were introduced to crosstabs in Chapter 7, where one was used to demon-
strate conditional probabilities.
Before looking at a crosstab for education and religiosity, we need to do some
recoding and relabeling of the categories:
#create new education variable
anes20$educ<-ordered(anes20$V201511x)
#shorten the category labels for education
levels(anes20$educ)<-c("LT HS", "HS", "Some Coll", "4yr degr", "Grad degr")
#Create religious importance variable
anes20$relig_imp<-anes20$V201433
#Recode Religious Importance to three categories
levels(anes20$relig_imp)<-c("High","High","Moderate", "Low","Low")
#Change order to Low-Moderate-High
anes20$relig_imp<-ordered(anes20$relig_imp, levels=c("Low", "Moderate", "High"))
Now, let’s look at the table below, which illustrates the relationship between
level of education and the importance respondents assign to religion.
#Get a crosstab, list the dependent variable first
crosstab(anes20$relig_imp, anes20$educ, plot=F)
Cell Contents
|-------------------------|
| Count |
|-------------------------|
===========================================================================
anes20$educ
anes20$relig_imp LT HS HS Some Coll 4yr degr Grad degr Total
---------------------------------------------------------------------------
Low 96 365 860 789 650 2760
---------------------------------------------------------------------------
Moderate 93 284 535 389 275 1576
---------------------------------------------------------------------------
High 187 684 1387 875 660 3793
---------------------------------------------------------------------------
Total 376 1333 2782 2053 1585 8129
===========================================================================
To review some of the basic elements of a crosstab covered earlier, this table contains
outcome information for the independent and dependent variables. Each row
represents a value of the dependent variable (religiosity) and each column a value
of the independent variable (education level). Each of the interior cells in the
table represents the intersection of a given row and column, or the joint occurrence
of those two outcomes (for example, 96 respondents have less than a high school
education and attach low importance to religion).
But this is quite a mouthful, isn’t it? Instead of relying on raw frequencies, we
need to standardize the frequencies by their base (the column totals). This gives
us column proportions/percentages, which can be used to make judgments about
the relative levels of religiosity across different educational groups. Column
percentages adjust each cell to the same metric, making the cell contents
comparable. The table below presents both the raw frequencies and the column
percentages (based on the column totals):
#Add "prop.c=T" to get column percentages
crosstab(anes20$relig_imp, anes20$educ, prop.c=T, plot=F)
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
============================================================================
anes20$educ
anes20$relig_imp LT HS HS Some Coll 4yr degr Grad degr Total
----------------------------------------------------------------------------
Low 96 365 860 789 650 2760
25.5% 27.4% 30.9% 38.4% 41.0%
----------------------------------------------------------------------------
Moderate 93 284 535 389 275 1576
24.7% 21.3% 19.2% 18.9% 17.4%
----------------------------------------------------------------------------
High 187 684 1387 875 660 3793
49.7% 51.3% 49.9% 42.6% 41.6%
----------------------------------------------------------------------------
Total 376 1333 2782 2053 1585 8129
4.6% 16.4% 34.2% 25.3% 19.5%
============================================================================
[Mosaic plot of religious importance (Low, Moderate, High) by level of education]
In the mosaic plot, the width of the columns reflects the relative sample size
for each category of the independent variable, and the vertical height of each
box within the columns reflects the magnitude of the column
percentages. Usually, the best thing to do is focus on how the height of the
differently shaded segments (boxes in each row) changes as you scan from left
to right. For instance, the height of the light gray bars (low religious importance)
increases as you scan from the lowest to highest levels of education, while the
darkest segments (high religious importance) generally grow smaller, albeit less
dramatically. These changes correspond to changes in the column percentages
and reinforce the finding that high levels of education are related to low levels of
religiosity. Does the mosaic plot clarify things for you? If so, make sure to take
advantage of its availability. If not, it’s worth checking in with your instructor
for help making sense of this useful tool.
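If you want to draw the mosaic plot yourself, the same crosstab function used above will produce it when plot=T; a minimal sketch (the axis labels here are my own choices):
#Mosaic plot of religious importance by education
crosstab(anes20$relig_imp, anes20$educ, prop.c=T,
         plot=T,
         xlab="Level of Education",
         ylab="Religious Importance")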
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where:
• O=observed frequency for each cell
• E=expected frequency for each cell; that is, the frequency we expect if
there is no relationship (H0 is true).
Chi-square is a table-level statistic that is based on summing information from
each cell of the table. In essence, 𝜒2 reflects the extent to which the outcomes
in each cell (the observed frequency) deviate from what we would expect to see
in the cell if the null hypothesis were true (expected frequency). If the null
hypothesis is true, we would expect to see very small differences between the
observed and expected frequencies.
We need to use raw frequencies instead of column percentages to calculate 𝜒2
because the percentages treat each column and each cell with equal weight,
even though there are important differences in the relative contribution of the
columns and cells to the overall sample size. For instance, the total number of
cases in the “LT HS” column (376) accounts for only 4.6% of total respondents
in the table, so giving its column percentages the same weight as those for
the “Some Coll” column, which has 34% of all cases, would over-represent the
importance of the first column to the pattern in the table.
Let’s walk through the calculation of the chi-square contribution of upper-left
cell in the table (Low, LT HS) to illustrate how the cell-level contributions are
determined. The observed frequency (the frequency produced by the sample)
in this cell is 96. From the discussion above, we know that for the sample
overall, about 34% (actually 33.95%) of respondents are in the first row of the
dependent variable, so we expect to find about 34% of all respondents with
less than a high school education in that row. Since we need to use raw
frequencies instead of percentages, we need to calculate what 33.95% of the
total number of observations in the column is to get the expected frequency.
There are 376 respondents in this column, so the expected frequency for this
cell is .3395 ∗ 376 = 127.66.
Cell Contents
|-------------------------|
| Count |
|-------------------------|
===========================================================================
anes20$educ
anes20$relig_imp LT HS HS Some Coll 4yr degr Grad degr Total
---------------------------------------------------------------------------
Low 96 365 860 789 650 2760
---------------------------------------------------------------------------
Moderate 93 284 535 389 275 1576
---------------------------------------------------------------------------
High 187 684 1387 875 660 3793
---------------------------------------------------------------------------
Total 376 1333 2782 2053 1585 8129
===========================================================================
A shortcut for calculating the expected frequency for any cell is:
$$E_{c,r} = \frac{f_c \times f_r}{n}$$
where fc is the column total for a given cell, fr is the row total, and n is the total
sample size. For the upper-left cell in the crosstab: (376 ∗ 2760)/8129 = 127.66.
So the observed frequency for upper-left cell of the table is 96, and the
expected frequency is 127.66.
To estimate this cell’s contribution to the 𝜒2 statistic for the entire table:
$$\frac{(O - E)^2}{E} = \frac{(96 - 127.66)^2}{127.66} = \frac{1002.4}{127.66} = 7.85$$
To get the value of 𝜒2 for the whole table we need to calculate the expected
frequency for each cell, estimate the cell contribution to the overall value of
chi-square, and sum up all of the cell-specific values.
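Before turning to the full table, note that the expected frequencies can also be pulled out in one step with base R's chisq.test function; a quick sketch, offered only as a check on the hand calculations:
#Expected frequencies under the null hypothesis of no relationship
chisq.test(anes20$relig_imp, anes20$educ)$expected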
The table below shows the observed and expected frequencies, as well as the
cell contribution to overall 𝜒2 for each cell of the table:
#Get expected frequencies and cell chi-square contributions
crosstab(anes20$relig_imp, anes20$educ,
expected=T, #Add expected frequency to each cell
prop.chisq = T, #Chi-square contribution of each cell
plot=F)
Cell Contents
|-------------------------|
| Count |
| Expected Values |
| Chi-square contribution |
|-------------------------|
=============================================================================
anes20$educ
anes20$relig_imp LT HS HS Some Coll 4yr degr Grad degr Total
-----------------------------------------------------------------------------
Low 96 365 860 789 650 2760
127.7 452.6 944.6 697.0 538.1
7.852 16.950 7.570 12.131 23.248
-----------------------------------------------------------------------------
Moderate 93 284 535 389 275 1576
72.9 258.4 539.4 398.0 307.3
5.544 2.529 0.035 0.205 3.393
-----------------------------------------------------------------------------
High 187 684 1387 875 660 3793
175.4 622.0 1298.1 957.9 739.6
0.761 6.184 6.091 7.180 8.559
-----------------------------------------------------------------------------
Total 376 1333 2782 2053 1585 8129
=============================================================================
First, note that the overall contribution of the upper-left cell to the table 𝜒2 is
7.852, essentially the same as the calculations made above. As you can see, there
are generally substantial differences between expected and observed frequencies,
again indicating that there is not a lot of support for the null hypothesis. If we
were to work through all of the calculations and sum up all of the individual
cell contributions to 𝜒2 , we would come up with 𝜒2 = 108.2.
So, what does this mean? Is this a relatively large value for 𝜒2 ? This is not a
t-score or F-ratio, so we should not judge the number based on those standards.
But still, this seems like a pretty big number. Does this mean the relationship
in the table is statistically significant?
One important thing to understand is that the value of 𝜒2 for any given table
is a function of three things: the strength of the relationship, the sample size,
and the size of the table. Of course, the strength of the relationship and the
sample size both affect the level of significance for other statistical tests, such
as the t-score and F-ratio. But it might not be immediately clear why the size
of the table matters. Suppose we had two tables, a 2x2 table and a 4x4 table,
and that the relationships in the two tables were of the same magnitude and
the sample sizes also were the same. Odds are that the chi-square for the 4x4
table would be larger than the one for the 2x2 table, simply due to the fact
that it has more cells (16 vs. 4). For these two tables, the chi-square statistic
is calculated by summing all the individual cell contributions, four for the 2x2
table and sixteen for the 4x4 table. The more cells, the more individual cell
contributions to the table chi-square value. Given equally strong relationships,
and equal sample sizes, a larger table will tend to generate a larger chi-square.
This means that we need to account for the size of the table before concluding
whether the value of 𝜒2 is statistically significant. Similar to the t-score and
F-ratio, we do this by calculating the degrees of freedom:
$$df_{\chi^2} = (r - 1)(c - 1)$$
Where r = number of rows and c = number of columns in the table. This is a
way of taking into account the size of the table.
The df for the table above is (3 − 1)(5 − 1) = 8.
We can look up the critical value for 𝜒2 with 8 degrees of freedom, using the
chi-square table below.1 The values across the top of the 𝜒2 distribution table
are the desired levels of significance (area under the curve to the right of the
specified critical value), usually .05, and the values down the first column are
the degrees of freedom. To find the c.v. of 𝜒2 for this table, we need to look for
the intersection of the column headed by .05 (the desired level of significance)
and the row headed by 8 (the degrees of freedom). That value is 15.51.
1 Alex Knorre provided sample code that I adapted to produce this table https://
stackoverflow.com/questions/44198167.
We can also get the critical value for 𝜒2 from the qchisq function in R, which
requires you to specify the preferred p-value (.05), the degrees of freedom (8),
and that you want the upper tail of the distribution (lower.tail=F):
#Critical value of chi-square, p=.05, df=8
qchisq(.05, 8, lower.tail=F)
[1] 15.51
This function confirms that the critical value for 𝜒2 is 15.51. Figure 12.2
illustrates the location of this critical value in a chi-square distribution when
df=8. The area under the curve to the right of the critical value equals .05 of
the total area under the curve, so for any obtained value of chi-square greater
than 15.51, the p-value is less than .05 and the null hypothesis can be rejected.
In the example used above, 𝜒2 = 108.2, a value far to the right of the critical
value, so we reject the null hypothesis. Based on the evidence presented here,
there is a relationship between level of education and the importance of religion
in one’s life.
You can get the value of 𝜒2 , along with degrees of freedom and the p-value from
R by adding chisq=T to the crosstab command. You can also get it from R by
using the chisq.test function, as shown below.
#Get chi-square statistic from R
chisq.test(anes20$relig_imp, anes20$educ)
[Figure 12.2: Chi-square distribution with df=8 and critical value of 15.51]
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
==========================================================================
SAMPLE: Census region
anes20$relig_imp 1. Northeast 2. Midwest 3. South 4. West Total
--------------------------------------------------------------------------
Low 549 649 811 782 2791
39.4% 32.6% 26.4% 43.5%
--------------------------------------------------------------------------
Moderate 320 385 548 343 1596
23.0% 19.3% 17.9% 19.1%
--------------------------------------------------------------------------
High 525 956 1710 671 3862
37.7% 48.0% 55.7% 37.4%
--------------------------------------------------------------------------
Total 1394 1990 3069 1796 8249
16.9% 24.1% 37.2% 21.8%
==========================================================================
[Mosaic plot of religious importance by census region]
The chi-square statistic (238.2) and p-value (less than .05) confirm that there is
a relationship between these two variables, so we can reject the null hypothesis.
There is a relationship between region of residence and the importance one
assigns to religion.
It is generally a good idea to recode and relabel ordered variables so the substantive meaning of
the first category can be thought of as “low” or “negative” in value on whatever
the scale is. When assessing directionality, we want to know how moving across
the columns from low to high outcomes of the independent variable changes
the likelihood of being in low or high values of the dependent variable. If low
values of the independent variable are generally associated with low values of
the dependent variable, and high values with high values, then we are talking
about a positive relationship. When low values of one variable are associated
with high values of another, then the relationship is negative. Check out the
table below, which provides generic illustrations of what positive and negative
relationships might look like in a crosstab.
For positive relationships, the column percentages in the “Low” row drop as
you move from the Low to High columns, and increase in the “High” row as
you move from the Low to High column. So, Low outcomes tend to be asso-
ciated with Low values of the independent variable, and High outcomes with
High values of the independent variable. The opposite pattern occurs if there
is a negative relationship: Low outcomes tend to be associated with High val-
ues of the independent variable, and High outcomes with Low values of the
independent variable.
Let’s reflect back on the crosstab between education and religious importance.
Does the pattern in the table suggest a positive or negative relationship? Or
is it hard to tell? Though it is not as clear as the hypothetical pattern shown
in Figure 12.3, there is a negative pattern in the first crosstab: the likelihood of
being in the “low religious importance” row increases across columns as you
move from the lowest to highest levels of education, and there is a somewhat
weaker tendency for the likelihood of being in the “high religious importance”
row to decrease across columns as you move from the lowest to highest levels
of education. High values on one variable tend to be associated with low values
on the other, the hallmark of a negative relationship.
There is a much stronger pattern in the data than in either of the previous examples.
The percent of respondents who assign a low level of importance to religion drops
precipitously from 50.4% among the youngest respondents to just 19.4% among the
oldest respondents. At the same time, the percent assigning a high level of importance to religion
grows steadily across columns, from 30.7% in the youngest group to 63.1% in
the oldest group. As you might expect, given the strength of this pattern, the
chi-square statistic is quite large and the p-value is very close to zero. We can
reject the null hypothesis. There is a relationship between these two variables.
#Collapse age into fewer categories
anes20$age5<-cut2(anes20$V201507x, c(30, 45, 61, 76, 100))
#Assign labels to levels
levels(anes20$age5)<-c("18-29", "30-44", "45-60","61-75"," 76+")
#Crosstab with mosaic plot and chi-square
crosstab(anes20$relig_imp, anes20$age5, prop.c=T,
plot=T,chisq=T,
xlab="Age",
ylab="Religious Importance")
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
=================================================================
PRE: SUMMARY: Respondent age
anes20$relig_imp 18-29 30-44 45-60 61-75 76+ Total
-----------------------------------------------------------------
Low 505 877 666 522 135 2705
50.4% 43.7% 32.3% 24.4% 19.4%
-----------------------------------------------------------------
Moderate 189 357 396 470 122 1534
18.9% 17.8% 19.2% 22.0% 17.5%
-----------------------------------------------------------------
High 308 774 1003 1149 440 3674
30.7% 38.5% 48.6% 53.7% 63.1%
-----------------------------------------------------------------
Total 1002 2008 2065 2141 697 7913
12.7% 25.4% 26.1% 27.1% 8.8%
=================================================================
[Mosaic plot of religious importance by age group]
The pattern is also clear in the mosaic plot, in which the height of the light gray
boxes (representing low importance) shrinks steadily as age increases, while the
height of the dark boxes (representing high importance) grows steadily as age
increases.
We should recognize that both variables are measured on ordinal scales, so it is
perfectly acceptable to evaluate the table contents for directionality. So what do
you think? Is this a positive or negative relationship? Whether focusing on the
percentages in the table or the size of the boxes in the mosaic plot, the pattern
of the relationship stands out pretty clearly: the importance respondents assign
to religion increases steadily as age increases. There is a positive relationship
between age and importance of religion.
12.8 Exercises
12.8.1 Concepts and Calculations
1. The table below shows the relationship between age (18 to 49, 50 plus)
and support for building a wall on the U.S. border with Mexico.
• State the null and alternative hypotheses for this table.
• There are three cells missing the column percentages (Oppose/18-
49, Neither/50+, and Favor/18-49). Calculate the missing column
percentages.
• After filling in the missing percentages, describe the relationship in
the table. Does the relationship appear to be different from what
you would expect if the null hypothesis were true? Explain yourself
with specific references to the data.
2. The same two variables from problem 1 are shown again in the table below.
In this case, the cells show the observed (top) and expected (bottom)
frequencies, with the exception of three cells.
• Estimate the expected frequencies for the three cells in which they
are missing.
• By definition, what are expected frequencies? Not the formula, the
substantive meaning of “expected.”
• Estimate the degrees of freedom and critical value of 𝜒2 for this table.
3. The mosaic plot below shows the relationship between racial and ethnic identity
and support for building a wall on the U.S. border with Mexico. What does the
mosaic plot tell you about both the distribution
of the independent variable and the relationship between racial and ethnic
identity and attitudes toward the border wall? Is this a directional or non-
directional relationship?
[Mosaic plot of attitudes toward a wall at the southern border (Oppose, Neither, Favor) by race/ethnicity (Wht(NH), Blk(NH), Hisp, API(NH), Other)]
4. The table below illustrates the relationship between family income and
support for a wall at the southern U.S. border. Describe the
relationship, paying special attention to statistical significance, strength
of relationship, and direction of relationship. Make sure to use column
percentages to bolster your case.
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
======================================================================
PRE: SUMMARY: Total (family) income
anes20$wall Lt 50K 50K-75K 75k-125K 125K-175K 175K+ Total
----------------------------------------------------------------------
Oppose 1176 674 867 370 482 3569
43.9% 43.6% 47.9% 49.4% 55.7%
----------------------------------------------------------------------
Neither 561 278 262 112 129 1342
20.9% 18.0% 14.5% 15.0% 14.9%
----------------------------------------------------------------------
Favor 941 594 681 267 254 2737
35.1% 38.4% 37.6% 35.6% 29.4%
----------------------------------------------------------------------
Total 2678 1546 1810 749 865 7648
Measures of Association
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
============================================================================
anes20$educ
anes20$relig_imp LT HS HS Some Coll 4yr degr Grad degr Total
----------------------------------------------------------------------------
Low 96 365 860 789 650 2760
25.5% 27.4% 30.9% 38.4% 41.0%
----------------------------------------------------------------------------
Moderate 93 284 535 389 275 1576
24.7% 21.3% 19.2% 18.9% 17.4%
----------------------------------------------------------------------------
High 187 684 1387 875 660 3793
49.7% 51.3% 49.9% 42.6% 41.6%
----------------------------------------------------------------------------
Total 376 1333 2782 2053 1585 8129
4.6% 16.4% 34.2% 25.3% 19.5%
============================================================================
$$\text{Cramer's V} = \sqrt{\frac{\chi^2}{N \times \min(r-1, c-1)}}$$
Here we take the square root of chi-squared divided by the sample size times
either the number of rows minus 1 or the number of columns minus 1, whichever
is smaller.1 In the education and religiosity crosstab, rows minus 1 is smaller
than columns minus 1, so plugging in the values of chi-squared (108.2) and the
sample size:
$$V = \sqrt{\frac{108.2}{8129 \times 2}} = .082$$
Interpreting Cramer’s V is relatively straightforward, as long as you understand
that it is bounded by 0 and 1, where a value of zero means there is absolutely
no relationship between the two variables, and a value of one means there is
a perfect relationship between the two variables. The figure below illustrates
these two extremes:
In the table on the left, there is no change in the column percentages as you
move across columns. We know from the discussion of chi-square in the last
chapter that this is exactly what statistical independence looks like–the row
outcome does not depend upon column outcomes. In the table on the right,
you can perfectly predict the outcome of the dependent variable based on levels
of the independent variable. Given these bounds, the Cramer’s V value for the
relationship between education and religious importance (V=.08) suggests
that this is a fairly weak relationship.
Let’s calculate Cramer’s V for the other tables we used in Chapter 12, regional
and age-based differences in religious importance. First up, regional differences:
1 Another popular measure of association for crosstabs based on chi-square is phi, which is calculated as:
$$\phi = \sqrt{\frac{\chi^2}{N}}$$
I’m not adding phi to the discussion here because it is most appropriate for 2x2 tables, in
which case it is equivalent to Cramer’s V.
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
==========================================================================
SAMPLE: Census region
anes20$relig_imp 1. Northeast 2. Midwest 3. South 4. West Total
--------------------------------------------------------------------------
Low 549 649 811 782 2791
39.4% 32.6% 26.4% 43.5%
--------------------------------------------------------------------------
Moderate 320 385 548 343 1596
23.0% 19.3% 17.9% 19.1%
--------------------------------------------------------------------------
High 525 956 1710 671 3862
37.7% 48.0% 55.7% 37.4%
--------------------------------------------------------------------------
Total 1394 1990 3069 1796 8249
16.9% 24.1% 37.2% 21.8%
==========================================================================
$$V = \sqrt{\frac{238.2}{8249 \times 2}} = .12$$
Although this result shows a somewhat stronger impact on religious importance
from region than from education, Cramer’s V still points to a weak relationship.
Finally, let’s take another look at the relationship between age and religious
importance.
#Crosstab for religious importance by age
crosstab(anes20$relig_imp, anes20$age5,
plot=F,
prop.c=T,
chisq=T)
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
=================================================================
PRE: SUMMARY: Respondent age
anes20$relig_imp 18-29 30-44 45-60 61-75 76+ Total
-----------------------------------------------------------------
Low 505 877 666 522 135 2705
50.4% 43.7% 32.3% 24.4% 19.4%
-----------------------------------------------------------------
Moderate 189 357 396 470 122 1534
18.9% 17.8% 19.2% 22.0% 17.5%
-----------------------------------------------------------------
High 308 774 1003 1149 440 3674
30.7% 38.5% 48.6% 53.7% 63.1%
-----------------------------------------------------------------
Total 1002 2008 2065 2141 697 7913
12.7% 25.4% 26.1% 27.1% 8.8%
=================================================================
Of the three tables, this one shows the greatest differences in column percentages
within rows, around thirty-point differences in both the top and bottom rows,
so we might expect Cramer’s V to show a stronger relationship as well.
$$V = \sqrt{\frac{396.7}{7913 \times 2}} = .16$$
Hmmm. That’s interesting. Cramer’s V does show that age has a stronger
impact on religious importance than either region or education, but not by
much. Are you surprised by the rather meager readings on Cramer’s V for this
table? After all, we saw some pretty dramatic differences in the column percentages,
so we might have expected that Cramer’s V would come in at somewhat higher
values. What this reveals is an inherent limitation of focusing on a few of
the differences in column percentages rather than taking in information from
the entire table. For instance, by focusing on the large differences in the Low
and High row, we ignored the tiny differences in the Moderate row. Another
problem is that we tend to treat the column percentages as if they carry equal
weight. Again, column percentages are important–they tell us what is happening
between the two variables–but focusing on them exclusively invariably provides
an incomplete accounting of the relationship. Good measures of association
such as Cramer’s V take into account information from all cells in a table and
provide a more complete assessment of the relationship.
Getting Cramer’s V in R. The CramerV function is found in the DescTools
package and can be used to get the values of Cramer’s V in R.
#V for education and religious importance:
CramerV(anes20$relig_imp, anes20$educ)
[1] 0.081592
#V for region and religious importance:
CramerV(anes20$relig_imp, anes20$V203003)
[1] 0.12015
#V for age and religious importance:
CramerV(anes20$relig_imp, anes20$age5)
[1] 0.15831
These results confirm the earlier calculations.
13.3.2 Lambda
Another sometimes useful statistic for judging the strength of a relationship is
lambda (𝜆). Lambda is referred to as a proportional reduction in error (PRE)
statistic because it summarizes how much we can reduce the level of error from
guessing, or predicting the outcome of the dependent variable by using infor-
mation from an independent variable. The concept of proportional reduction in
error plays an important role in many of the topics included in the last several
chapters of this book.
The formula for lambda is:
$$\text{Lambda } (\lambda) = \frac{E_1 - E_2}{E_1}$$
Where:
• E1 = error from guessing the outcome of the dependent variable with no information about the independent variable.
• E2 = error from guessing the outcome of the dependent variable using information about the independent variable.
Cell Contents
|-------------------------|
| Count |
|-------------------------|
==========================================================================
SAMPLE: Census region
anes20$relig_imp 1. Northeast 2. Midwest 3. South 4. West Total
--------------------------------------------------------------------------
Low 549 649 811 782 2791
--------------------------------------------------------------------------
Moderate 320 385 548 343 1596
--------------------------------------------------------------------------
High 525 956 1710 671 3862
--------------------------------------------------------------------------
Total 1394 1990 3069 1796 8249
==========================================================================
Now, suppose that you had to “guess,” or “predict” the value of the dependent
variable for each of the 8249 respondents from this table. What would be your
best guess? By “best guess” I mean which category would give the least overall
error in guessing? As a rule, with ordinal or nominal dependent variables, it is
best to guess the modal outcome (you may recall this from Chapter 4) if you
want to minimize error in guessing. In this case, that outcome is the “High”
row, which has 3862 respondents. If you guess this, you will be correct 3862
times and wrong 4387 times. This seems like a lot of error, but no other guess
can give you an error rate this low (go ahead, try).
From this, we get 𝐸1 = 4387.
E2 is the error we get when we are able to guess the value of the dependent
variable based on the value of the independent variable. Here’s how we get E2: we
look within each column of the independent variable and choose the category of
the dependent variable that will give us the least overall error for that column
(the modal outcome within each column). For instance, for “Northeast” we
would guess “Low” and we would be correct 549 times and wrong 845 times
13.3. MEASURES OF ASSOCIATION FOR CROSSTABS 311
(make sure you understand how I got these numbers); for “Midwest” we would
guess “High” and would be wrong 1034 times; for “South” we would guess
“High” and be wrong 1359 times; and for “West” we would guess “Low” and be
wrong 1014 times. Now that we have given our best guess within each category
of the independent variable we can estimate E2 , which is the sum of all of the
errors from guessing when we have information from the independent variable:
$$E_2 = 845 + 1034 + 1359 + 1014 = 4252$$
Note that E2 is less than E1 . This means that the independent variable reduced
the error in guessing the outcome of the dependent variable (𝐸1 − 𝐸2 = 135).
On its own, it is difficult to tell if reducing error by 135 represents a lot or a
little reduction in error. Because of this, we express reduction in error as a
proportion of the original error:
$$\lambda = \frac{E_1 - E_2}{E_1} = \frac{4387 - 4252}{4387} = .03$$
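In R, lambda (along with a confidence interval) is available from the Lambda function in the DescTools package. A sketch for the region table, assuming anes20$V203003 is the census region variable used above:
#Lambda for region and religious importance, with a 95% confidence interval
Lambda(anes20$relig_imp, anes20$V203003, direction="row", conf.level=.95)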
Here, we get confirmation that the calculations for the region/religious impor-
tance table were correct (lambda = .03). Also, since the 95% confidence interval
does not include 0 (barely!), we can reject the null hypothesis and conclude that
there is a statistically significant but quite small reduction in error predicting
religious importance with region.
We can also see the values of lambda for the other two tables: lambda is 0 for
education and religious importance and .07 for age and religious importance.
#For Education and Religious importance
Lambda(anes20$relig_imp, anes20$educ, direction="row", conf.level =.95)
Cell Contents
|-------------------------|
| Count |
|-------------------------|
===========================================================================
anes20$educ
anes20$relig_imp LT HS HS Some Coll 4yr degr Grad degr Total
---------------------------------------------------------------------------
Low 96 365 860 789 650 2760
---------------------------------------------------------------------------
Moderate 93 284 535 389 275 1576
---------------------------------------------------------------------------
High 187 684 1387 875 660 3793
---------------------------------------------------------------------------
Total 376 1333 2782 2053 1585 8129
===========================================================================
Lambda and Cramer’s V use different standards to evaluate the strength of a
relationship. Lambda focuses on whether using the independent variable leads
to less error in guessing outcomes of the dependent variable. Cramer’s V, on
the other hand, focuses on how different the pattern in the table is from what
you would expect if there were no relationship.
"Conservative")
#Crosstab of religious importance by ideology
crosstab(anes20$relig_imp, anes20$ideol3,
prop.c=T, chisq=T,
plot=F)
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
=============================================================
PRE: 7pt scale liberal-conservative self-placement
anes20$relig_imp Liberal Moderate Conservative Total
-------------------------------------------------------------
Low 1410 570 489 2469
56.6% 31.5% 17.9%
-------------------------------------------------------------
Moderate 428 427 498 1353
17.2% 23.6% 18.2%
-------------------------------------------------------------
High 654 815 1750 3219
26.2% 45.0% 63.9%
-------------------------------------------------------------
Total 2492 1812 2737 7041
35.4% 25.7% 38.9%
=============================================================
The observant among you might question whether political ideology is an ordinal
variable. After all, none of the category labels carry the obvious hints, such as
“Low” and “High”, or “Less than” and “Greater than.” Instead, as pointed out
in Chapter 1, you can think of the categories of political ideology as growing
more conservative (and less liberal) as you move from “Liberal” to “Moderate”
to “Conservative”.
Here’s what I would say about this relationship based on what we’ve learned
thus far. There is a positive relationship between political ideology and the
importance people attach to religion. Looking from the Liberal column to the
Conservative column, we see a steady decrease in the percent who attach low
importance to religion and a steady increase in the percent who attach high
importance to religion. Almost 57% of liberals are in the Low row, compared to
only about 18% among conservatives; and only 26% of liberals are in the High
row, compared to about 64% of conservatives. The p-value for chi-square is near
0, so we can reject the null hypothesis, the value of Cramer’s V (.27) suggests
a weak to moderate relationship, and the value of lambda shows that ideology
reduces the error in guessing religious importance by about 20%.
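These summary figures can be reproduced with the same DescTools functions used earlier; a minimal sketch, assuming the anes20$ideol3 variable used in the crosstab above:
#Cramer's V and lambda for ideology and religious importance
CramerV(anes20$relig_imp, anes20$ideol3)
Lambda(anes20$relig_imp, anes20$ideol3, direction="row", conf.level=.95)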
There is nothing wrong with using lambda and Cramer’s V for this table, as
long as you understand that they are not addressing the directionality in the
data. For that, we need to use an ordinal measure of association, one that
assesses the degree to which high values of the independent variable are related
to high or low values of the dependent variable. In essence, these statistics
focus on the ranking of categories and summarize how well we can predict the
ordinal ranking of the dependent variable based on values of the independent
variable. Ordinal-level measures of association range in value from -1 to +1,
with 0 indicating no relationship, -1 indicating a perfect negative relationship,
and + 1 indicating a perfect positive relationship.
13.4.1 Gamma
One example of an ordinal measure of association is gamma (𝛾). Very generally,
what gamma does is calculate how much of the table follows a positive pattern
and how much it follows a negative pattern, but using different terminology.
More formally, Gamma is based on the number of similarly ranked and differ-
ently ranked pairs of observations in the contingency table. Similarly ranked
pairs can be thought of as pairs that follow a positive pattern, and differently
ranked pairs are those that follow a negative pattern.
$$\text{Gamma} = \frac{N_{similar} - N_{different}}{N_{similar} + N_{different}}$$
If you look at the parts of this equation, it is easy to see that gamma is a
measure of the degree to which positive or negative pairs dominate the table.
When positive (similarly ranked) pairs dominate, gamma will be positive; when
negative (differently ranked) pairs dominate, gamma will be negative; and when
there is no clear trend, gamma will be zero or near zero.
But how do we find these pairs? Let’s look at similarly ranked pairs in a generic
3x3 table first.
For each cell in the table in Figure 13.2, we multiply that cell’s number of
observations times the sum of the observations that are found in all cells ranked
higher in value on both the independent and dependent variables (below and to
the right). We refer to these cells as “similarly ranked” because they have the
same ranking on both the independent and dependent variables, relative to the
cell of interest. These pairs of cells are highlighted in figure 13.2.
For differently ranked pairs, we need to match each cell with cells that are
inconsistently ranked on the independent and dependent variables (higher in
value on one and lower in value on the other). Because of this inconsistent
ranking, we call these differently ranked pairs. These cells are below and to the
left of the reference cell, as highlighted in Figure 13.3.
Let’s start with the similarly ranked pairs. In the calculation below, I begin
in the upper-left corner (Liberal/Low) and pair those 1410 respondents with
all respondents in cells below and to the right (1410 x (427+498+815+1750)).
Note that the four cells below and to the right are all higher in value on both
the independent and dependent variables than the reference cell. Then I go to
the Liberal/Moderate cell and match those 428 respondents with the two cells
below and to the right (428 x (815+1750)). Next, I go to the Moderate/Low
cell and match those 570 respondents with the two cells below and to the right
(570 x (498+1750)), and finally, I go to the Moderate/Moderate cell and match
those 427 respondents with respondents in the one cell below and to the right
(427 X 1750).
#Calculate similarly ordered pairs
similar<-1410*(427+498+815+1750)+428*(815+1750)+570*(498+1750)+(427*1750)
similar
[1] 8047330
This gives us 8,047,330 similarly ranked pairs. That seems like a big number,
and it is, but we are really interested in how big it is relative to the number of
differently ranked pairs.
Cell Contents
|-------------------------|
| Count |
|-------------------------|
=============================================================
PRE: 7pt scale liberal-conservative self-placement
anes20$relig_imp Liberal Moderate Conservative Total
-------------------------------------------------------------
Low 1410 570 489 2469
-------------------------------------------------------------
Moderate 428 427 498 1353
-------------------------------------------------------------
High 654 815 1750 3219
-------------------------------------------------------------
Total 2492 1812 2737 7041
=============================================================
To count the differently ranked pairs, I start in the upper-right corner (Conservative/Low)
and pair those 489 respondents with all respondents in cells below
and to the left (489 x (427+428+654+815)). Note that the four cells below
and to the left are all lower in value on the independent variable and higher
in value on the dependent variable than the reference cell. Then, I match the
570 respondents in the Moderate/Low cell with all respondents in cells below
and to the left (570 x (428+654)). Next, I match the 498 respondents in the
Conservative/Moderate cell with all respondents in cells below and to the left
(498 x (654+815)). And, finally, I match the 427 respondents in the Moder-
ate/Moderate cell with all respondents in the one cell below and to the left
(427x654).
#Calculate differently ordered pairs
different=489*(427+428+654+815)+570*(428+654)+498*(654+815)+427*654
different
[1] 2763996
This gives us 2,763,996 differently ranked pairs. Again, this is a large number,
but we now have a basis of comparison, the 8,047,330 similarly ranked pairs.
Since there are more similarly ranked pairs, we know that there is a positive
relationship in this table. Let’s plug these values into the gamma formula to
get a more definitive sense of this relationship.
For this table,
$$\text{Gamma} = \frac{8047330 - 2763996}{8047330 + 2763996} = \frac{5283334}{10811326} = .49$$
This value (.49) outstrips the values of Cramer’s V (.266) and lambda (.198). We don’t expect to
get the same results, but when two measures of association suggest a somewhat
modest relationship and another suggests a much stronger relationship, it is a bit
hard to reconcile the difference. The reason gamma tends to produce stronger
effects is that it focuses only on the diagonal pairs. In other words, gamma is
calculated only on the basis of pairs of observations that follow clearly positive
(similarly ranked) or negative (differently ranked) patterns. But we know that
not all pairs of observations in a table follow these directional patterns. Many
of the pairs of observations in the table are “tied” in value on one variable but
have different values on the other variable, so they don’t follow a positive or
negative pattern. If tied pairs were also taken into account, the denominator
would be much larger and Gamma would be somewhat smaller.
The two tables below illustrate examples of tied pairs. In the first table, the
shaded area represents the cells that are tied in their value of Y (Low) but
have different values of x (low, medium, and high). In the second table, the
highlighted cells are tied in their value of X (low) but differ in their value of
Y (low, medium, high). And there are many more tied pairs throughout the
tables. None of these pairs are included in the calculation of gamma, but they
can constitute a substantial part of the table, especially if the directional pattern
is weak. To the extent there are a lot of tied pairs in a given table, Gamma
is likely to significantly overstate the magnitude of the relationship between X
and Y because the denominator for Gamma is smaller than it should be if it is
intended to represent the entire table.
If you are using gamma, you should always take into account the fact that
it tends to overstate relationships and there are some alternative measures of
association that do account for the presence of tied pairs. Still, as a general
rule, gamma will not report a significant relationship when the other statistics
do not. In other words, gamma does not increase the likelihood of concluding
there is a relationship when there actually isn’t one.
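In R, gamma and tau-b can be obtained from the DescTools package, in the same style as the CramerV and Lambda functions used earlier; a minimal sketch, assuming the ideology and religiosity variables from above:
#Gamma (ignores tied pairs) and tau-b (accounts for tied pairs)
GoodmanKruskalGamma(anes20$relig_imp, anes20$ideol3, conf.level=.95)
KendallTauB(anes20$relig_imp, anes20$ideol3, conf.level=.95)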
2 If you are itching for another formula, this one shows how tau-b integrates tied pairs:
$$\text{tau-b} = \frac{N_{similar} - N_{different}}{\sqrt{(N_s + N_d + N_{ty})(N_s + N_d + N_{tx})}}$$
where $N_{ty}$ and $N_{tx}$ are the number of tied pairs on y and x, respectively.
Always remember that the column percentages tell the story of how the variables
are connected, but you still need a measure of association to summarize the
strength and direction of the relationship. At the same time, the measure of
association is not very useful on its own without referencing the contents of the
table.
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
==========================================
anes20$Rsex
anes20$abortion Male Female Total
------------------------------------------
Illegal 388 475 863
10.4% 10.7%
------------------------------------------
Rape/Incest/Life 929 978 1907
24.9% 22.1%
------------------------------------------
Other Conditions 688 706 1394
18.5% 16.0%
------------------------------------------
Choice 1723 2263 3986
46.2% 51.2%
------------------------------------------
Total 3728 4422 8150
45.7% 54.3%
==========================================
Here, the dependent variable ranges from wanting abortion to be illegal in all
circumstances, to allowing it only in cases of rape, incest, or threat to the life of
the mother, to allowing it in some other circumstances, to allowing it generally
as a matter of choice. These categories can be seen as ranging from most to least
restrictive views on abortion. As expected, based on the analysis in Chapter
10, the differences between men and women on this issue are not very great.
In fact, the greatest difference in column percentages is in the “Choice” row,
where women are about five percentage points more likely than men to favor
this position. This weak effect is also reflected in the measures of association
reported below.
[1] 0.054827
StuartTauC(anes20$abortion, anes20$Rsex, conf.level=0.95)
13.7 Exercises
13.7.1 Concepts and Calculations
The table below provides the raw frequencies for the relationship between re-
spondent sex and political ideology in the 2020 ANES survey.
Chi-square=66.23
1. What percent of female respondents identify as Liberal? What percent of
male respondents identify as Liberal? What about Conservatives? What
percent of male and female respondents identify as Conservative? Do the
differences between male and female respondents seem substantial?
2. Calculate and interpret Cramer’s V, Lambda, and Gamma for this table.
Show your work.
3. If you were using tau-b or tau-c for this table, which would be most ap-
propriate? Why?
13.7.2 R Problems
In the wake of the Dobbs v. Jackson Supreme Court decision, which overturned
the longstanding precedent from Roe v. Wade, it is interesting to consider who
is most likely to be upset by the Court’s decision. You should examine how
three different independent variables influence anes20$upset_court, a slightly
transformed version of anes20$V201340. This variable measures responses to a
question that asked respondents if they would be pleased or upset if the Supreme
Court reduced abortion rights. The three independent variables you should use
Correlation and Scatterplots
[Histograms of life expectancy (left) and fertility rate (right)]
On the left, life expectancy presents with a bit of negative skew (mean=72.6,
median=73.9, skewness=-.5), the modal group is in the 75-80 years-old category,
and there is quite a wide range of outcomes, spanning from around 53 to 85 years.
When you think about the gravity of this variable–how long people tend to live–
this range is consequential. In the graph on the right, the pattern for fertility
rate is a mirror image of life expectancy–very clear right skew (mean=2.74,
median=2.28, skewness=.93), modal outcomes toward the low end of the scale,
and the range is from 1-2 births per woman in many countries to more than 5
for several countries.1
Before looking at the relationship between these two variables, let’s think about
what we expect to find. I anticipate that countries with low fertility rates have
higher life expectancy than those with high fertility rates (negative relationship),
for a couple of reasons. First, in many places around the world, pregnancy and
childbirth are significant causes of death in women of childbearing age. While
maternal mortality has declined in much of the “developed” world, it is still
1 You may have noticed that I did not include the code for these basic descriptive statistics.
What R functions would you use to get these statistics? Go ahead, give it a try!
a serious problem in many of the poorest regions. Second, higher birth rates
are also associated with high levels of infant and child mortality. Together,
these outcomes associated with fertility rates augur for a negative relationship
between fertility rate and life expectancy.
14.3 Scatterplots
Let’s look at this relationship with a scatterplot. Scatterplots are like crosstabs
in that they display joint outcomes on both variables, but they look a lot dif-
ferent due to the nature of the data:
#Scatterplot(Independent variable, Dependent variable)
#Scatterplot of "lifexp" by "fert1520"
plot(countries2$fert1520, countries2$lifexp,
xlab="Fertility Rate",
ylab="Life Expectancy")
[Scatterplot of life expectancy by fertility rate]
In the scatter plot above, the values for the dependent variable span the vertical
axis, values of the independent variable span the horizontal axis, and each circle
represents the value of both variables for a single country. There appears to
be a strong pattern in the data: countries with low fertility rates tend to have
high life expectancy, and countries with high fertility rates tend to have low
life expectancy. In other words, as the fertility rate increases, life expectancy
declines. You might recognize this from the discussion of directional relation-
ships in crosstabs as the description of a negative relationship. But you might
also notice that the pattern looks different (sloping down and to the right) from
a negative pattern in a crosstab. This is because the low values for both the independent and dependent variables are in the lower-left corner of the scatterplot, while they are in the upper-left corner of the crosstab.
When looking at scatterplots, it is sometimes useful to imagine something like
a crosstab overlay, especially since you’ve just learned about crosstabs. If there
are relatively empty corners, with the markers clearly following an upward or
downward diagonal, the relationship is probably strong, like a crosstab in which
observations are clustered in diagonal cells. We can overlay a horizontal line
at the mean of the dependent variable and a vertical line at the mean of the
independent variable to help us think of this in terms similar to a crosstab:
plot(countries2$fert1520, countries2$lifexp, xlab="Fertility Rate",
ylab="Life Expectancy")
#Add horizontal and vertical lines at the variable means
abline(v=mean(countries2$fert1520, na.rm = T),
h=mean(countries2$lifexp, na.rm = T), lty=c(1,2))
#Add legend
legend("topright", legend=c("Mean Y", "Mean X"),lty=c(1,2), cex=.9)
(Scatterplot: Fertility Rate by Life Expectancy, with lines at the means of Y (solid) and X (dashed))
Using this framework helps illustrate the extent to which high or low values of
the independent variable are associated with high or low values of the dependent
variable. The vast majority of cases that are below average on fertility are above
average on life expectancy (upper-left corner), and most of the countries that
are above average on fertility are below average on life expectancy (lower-right
corner). There are only a few countries that don’t fit this pattern, found in
the upper-right and lower-left corners. Even without the horizontal and vertical
lines for the means, it is clear from looking at the pattern in the data that
the typical outcome for the dependent variable declines as the value of the
independent variable increases. This is what a strong negative relationship
looks like.
Just to push the crosstab analogy a bit farther, we can convert these data into high and low categories (relative to the means) on both variables, create a crosstab, and check out some of the measures of association used in Chapter 13, as sketched below.
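The conversion step itself is not shown in the code and output that follow; one way it might be done is sketched here. The names Fertility.2 and Life_Exp.2 match those used in the measures-of-association code farther down, while the use of the crosstab function from the descr package for the table is an assumption.

#Create "Below Average"/"Above Average" versions of both variables,
#split at their means (a sketch; variable names match the Lambda call below)
countries2$Fertility.2<-factor(countries2$fert1520 > mean(countries2$fert1520, na.rm=TRUE),
         levels=c(FALSE,TRUE), labels=c("Below Average","Above Average"))
countries2$Life_Exp.2<-factor(countries2$lifexp > mean(countries2$lifexp, na.rm=TRUE),
         levels=c(FALSE,TRUE), labels=c("Below Average","Above Average"))
#Crosstab with column percentages (assuming the crosstab function from 'descr')
library(descr)
crosstab(countries2$Life_Exp.2, countries2$Fertility.2, prop.c=TRUE, plot=FALSE)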
Cell Contents
|-------------------------|
| Count |
| Column Percent |
|-------------------------|
==============================================================
countries2$Fertility.2
countries2$Life_Exp.2 Below Average Above Average Total
--------------------------------------------------------------
Below Average 20 63 83
17.9% 86.3%
--------------------------------------------------------------
Above Average 92 10 102
82.1% 13.7%
--------------------------------------------------------------
Total 112 73 185
60.5% 39.5%
==============================================================
This crosstab reinforces the impression that this is a strong negative relationship:
86.3% of countries with above average fertility rates have below average life
expectancy, and 82.1% of countries with below average fertility rates have above
average life expectancy. The measures of association listed below confirm this.
According to Lambda, information from the fertility rate variable reduces error in
predicting life expectancy by 64%. In addition, the values of Cramer’s V (.67)
and tau-b (-.67) confirm the strength and direction of the relationship. Note
that tau-b also confirms the negative relationship.
#Get measures of association
Lambda(countries2$Life_Exp.2,countries2$Fertility.2, direction=c("row"),
       conf.level = .95)
14.4 Pearson’s r
Similar to ordinal measures of association, a good interval/ratio measure of
association is positive when values of X that are relatively high tend to be
associated with values of Y that are relatively high, and values of X that are
relatively low are associated with values of Y that are relatively low. And, of
course, a good measure of association should be negative when values of X that
are relatively high tend to be associated with values of Y that are relatively low,
and values of X that are relatively low are associated with values of Y that are
relatively high.
One way to capture these positive and negative patterns is to express the value of all observations as deviations from the means of both the independent variable $(x_i - \bar{x})$ and the dependent variable $(y_i - \bar{y})$. Then, we can multiply the deviation from $\bar{x}$ times the deviation from $\bar{y}$ to see if the observation follows a positive or negative pattern.

$$(x_i - \bar{x})(y_i - \bar{y})$$
If the product is negative, the observation fits a negative pattern (think of it like
a dissimilar pair from crosstabs); if the product is positive, the observation fits
a positive pattern (think similar pair). We can sum these products across all
observations to get a sense of whether the relationship, on balance, is positive
or negative, in much the same way as we did by subtracting dissimilar pairs
from similar pairs for gamma in Chapter 13.
Pearson’s correlation coefficient (r) does this and can be used to summarize
the strength and direction of a relationship between two numeric variables:
$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2\sum(y_i - \bar{y})^2}}$$
Although this formula may look a bit dense, it is quite intuitive. The numerator
is exactly what we just discussed, a summary of the positive or negative pattern
in the data: it is positive when relatively high values of X are associated with
relatively high values of Y, negative when high values of X are associated with
low values of Y, and it will be near zero when values of X are randomly associated
with values of Y. The denominator standardizes the numerator by the overall
levels of variation in X and Y and provides -1 to +1 boundaries for Pearson’s r.
As was the case with the ordinal measures of association, values near 0 indicate
a weak relationship and the strength of the relationship grows stronger moving
from 0 to -1 or +1.
We can use data on the dependent and independent variables for these ten
countries to generate the components necessary for calculating Pearson’s r.
Figure 14.1: Fertility Rate and Life Expectancy in Ten Randomly Selected Countries

#Deviation of y from its mean
X10countries$y_dev=X10countries$lifexp-mean(X10countries$lifexp)
#Deviation of x from its mean
X10countries$x_dev=X10countries$fert1520-mean(X10countries$fert1520)
#Cross-product of x and y deviations from their means
X10countries$y_devx_dev=X10countries$y_dev*X10countries$x_dev
#Squared deviation of y from its mean
X10countries$ydevsq=X10countries$y_dev^2
#Squared deviation of x from its mean
X10countries$xdevsq=X10countries$x_dev^2
The table below includes the country names, the values of the independent and
dependent variables, and all the pieces needed to calculate Pearson’s r for the
sample of ten countries shown in the scatterplot. It is worth taking the time to
work through the calculations to see where the final figure comes from.
Table 14.1 Data for Calculation of Pearson’s r using a Sample of Ten Countries
Here, the first two columns of numbers are the outcomes for the dependent and
independent variables, respectively, and the next two columns express these
values as deviations from their means. The fifth column of data represents the
cross-products of the mean deviations for each observation, and the sum of that
column is the numerator for the equation. Since 8 of the 10 cross-products
are negative, the numerator is negative (-81.885), as shown below. We can think
of this numerator as measuring the covariation between x and y.
#Multiply and sum the y and x deviations from their means
numerator=sum(X10countries$y_devx_dev)
numerator
[1] -81.885
The last two columns measure squared deviations from the mean for both y
and x, and their column totals (summed below) capture the total variation in
y and x, respectively. The square root of the product of the variation in y and
x (100.61) constitutes the denominator in the equation, giving us:
var_y=sum(X10countries$ydevsq)
var_x=sum(X10countries$xdevsq)
denominator=sqrt(var_x*var_y)
denominator
[1] 100.61
$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2\sum(y_i - \bar{y})^2}} = \frac{-81.885}{100.61} = -.81391$$
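The named estimate shown below can be pulled directly from cor.test; this is a sketch of the kind of command that would produce it for the full set of countries.

#Correlation between fertility rate and life expectancy, all countries
cor.test(countries2$fert1520, countries2$lifexp)$estimate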
cor
-0.84326
For the full 185 countries shown in the first scatterplot, the correlation between
fertility rate and life expectancy is -.84. There are a few things to note about
this. First, this is a very strong relationship. Second, the negative sign confirms
that the predicted level of life expectancy decreases as the fertility rate increases.
There is a strong tendency for countries with above average levels of fertility
rate to have below average levels of life expectancy. Finally, it is interesting to
note that the correlation from the small sample of ten countries is similar to that
from the full set of countries in both strength and direction of the relationship.
Three cheers for random sampling!
(Scatterplot: Mean Years of Education (x-axis) by Life Expectancy (y-axis))
As in the scatterplot of fertility and life expectancy, there is a clear pattern in the data that doesn't require much effort to see. This is the first sign of a strong relationship. If you
have to look closely to try to determine if there is a directional pattern to the
data, then the relationship probably isn’t a very strong one. While there is
a clear positive trend in the data, the observations are not quite as tightly
clustered as in the first scatterplot, so the relationship might not be quite as
strong. Eyeballing the data like this is not very definitive, so we should have R
produce the correlation coefficient to get a more precise sense of the strength
and direction of the relationship.
cor.test(countries2$lifexp, countries2$mnschool)
The first two independent variables illustrate what strong positive and strong
negative relationships look like. In reality, many of our ideas don’t pan out
and we end up with scatterplots that have no real pattern and correlations near
zero. If two variables are unrelated there should be no clear pattern in the
data points; that is, the data points appear to be randomly dispersed in the
scatterplot. The figure below shows the relationship between the logged values
of the population size2 and country-level outcomes on life expectancy.
#Take the log of population size (in millions)
countries2$log10pop<-log(countries2$pop19_M)
#Scatterplot of "lifexp" by "log10pop"
plot(countries2$log10pop,countries2$lifexp,
xlab="Logged Value of Country Population",
ylab="Life Expectancy")
2 Sometimes, when variables such as population are severely skewed, we use the logarithm of the variable to minimize the impact of outliers. In the case of country-level population size, the skewness statistic is 8.3.
(Scatterplot: Logged Value of Country Population (x-axis) by Life Expectancy (y-axis))
As you can see, this is a seemingly random pattern (which means there is no clear pattern), and the correlation for this relationship (below) is -.08 and not statistically
significant. This is what a scatterplot looks like when there is no discernible
relationship between two variables.
cor.test(countries2$lifexp, countries2$log10pop)
Figure 14.2: Generic Positive and Negative Correlations and Scatterplot Patterns
Chapter 15.
Each entry in this matrix represents the correlation of the column and row
variables that intersect at that point. For instance, if you read down the first
column, you see how each of the independent variables is correlated with the
dependent variable (the r=1.00 entries on the diagonal are the intersections of
each variable with itself). If you look at the intersection of the “fert1520” col-
umn and the “mnschool” row, you can see that the correlation between these
two independent variables is a rather robust -.77, whereas scanning across the
bottom row, we see that population is not related to fertility rate or to level of
education (r=.07 and -.09, respectively). You might have noticed that the corre-
lations with the dependent variable are very slightly different from the separate
correlations reported above. This is because in this matrix, the complete.obs option deletes all missing cases from the matrix. When considering the
effects of multiple variables at the same time, an observation that is missing on
one of the variables is missing on all of them.
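A matrix like the one described here can be produced by applying the cor function to a small set of variables; this is a sketch, assuming the variable names used elsewhere in the chapter and listwise deletion via use="complete.obs".

#Correlation matrix for life expectancy and the three independent variables
#(listwise deletion drops any country missing on any of the four variables)
cor(countries2[,c("lifexp", "fert1520", "mnschool", "log10pop")],
    use="complete.obs")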
(Scatterplot matrix of lifexp, fert1520, mnschool, and log10pop)
A scatterplot matrix is fairly easy to read. The row variable is treated as the
dependent variable, and the plot at the intersection of each column shows how
that variable responds to the column variable. The plots are condensed a bit, so
they are not as clear as the individual scatterplots we looked at above, but the
matrix does provide a quick and accessible way to look at all the relationships
at the same time.
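The matrix itself can be generated with the base R pairs function; this is a minimal sketch using the same four variables (the exact command behind the figure is an assumption).

#Scatterplot matrix of life expectancy and the three independent variables
pairs(countries2[,c("lifexp", "fert1520", "mnschool", "log10pop")])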
We can't simply add together the shares of variance in life expectancy explained by fertility rate and by level of education (the squared correlations, roughly .71 and .59) to get the combined independent impact of both variables, since this would mean that together they explain 130% of the variance in the dependent variable, which is not possible (you can't explain more than 100%). What's happening here is that the two
independent variables co-vary with each other AND with the dependent vari-
able, so there is some double-counting of variance explained. Consider the Venn
Diagram below, which illustrates how a dependent variable (Y) is related to two
independent variables (X and Z), and how those independent variables can be
related to each other:
The blue circle represents the total variation in Y, the yellow circle variation in
X, and the red circle variation in a third variable, Z. This setup is very much
like the situation we have with the relationships between life expectancy (Y)
and fertility rate (X) and educational attainment (Z). Both X and Z explain
significant portions of variation in Y but also account for significant portions of
each other. The area where all three circles overlap is the area where X and Z
share variance with each other and with Y. If we attribute this to both X and
Z, we are double-counting this portion of the variance in Y and overestimating
the impact of both variables.
The more overlap between the red and yellow circles, the greater the level of
shared explanation of the variation in the yellow circle. If we treat the Blue,
Red, and Yellow circles as representing life expectancy, fertility rate, and level
of education, respectively, then this all raises the question: "What are the independent effects of fertility rate and level of education on life expectancy, after controlling for the overlap between the two independent variables?"
One important technique for addressing this issue is the partial correlation
coefficient. The partial correlation coefficient considers not just how an inde-
pendent variable is related to a dependent variable but also how it is related
to other independent variables and how those other variables are related to the
dependent variable. The partial correlation coefficient is usually identified as $r_{yx \cdot z}$ (the correlation between x and y, controlling for z) and is calculated using the following formula:

$$r_{yx \cdot z} = \frac{r_{yx} - r_{yz}\,r_{xz}}{\sqrt{1 - r_{yz}^2}\,\sqrt{1 - r_{xz}^2}}$$
The key to understanding the partial correlation coefficient lies in the numerator.
Here we see that the original bivariate correlation between x and y is discounted
by the extent to which a third variable (z) is related to both x and y. If there
is a weak relationship between x and z or between y and z (or between z and
both x and y), then the partial correlation coefficient will be close in value to
the zero-order correlation. If there is a strong relationship between x and z or
between y and z (or between z and both x and y), then the partial correlation
coefficient will be significantly lower in value than the zero-order correlation.
Let’s calculate the partial correlations separately for fertility rate and mean
level of education for women.
# partial correlation for fertility, controlling for education levels:
pcorr_fert<-(-.8406-(.77158*-0.76502))/(sqrt(1-.77158^2)*sqrt(1- 0.76502^2))
pcorr_fert
[1] -0.61104
# partial correlation for education levels, controlling for fertility:
pcorr_educ<-(.77158-(-.8406*-0.76502))/(sqrt(1-.8406^2)*sqrt(1-.76502^2))
pcorr_educ
[1] 0.36839
Here, we see that the original correlation between fertility and life expectancy
(-.84) overstated the impact of this variable quite a bit. When we control for
mean years of education, the relationship is reduced to -.61, which is significantly
lower but still suggests a somewhat strong relationship. We see a similar, though
somewhat more dramatic reduction in the influence of level of education. When
controlling for fertility rate, the relationship between level of education and
life expectancy drops from .77 to .37, a fairly steep decline. These findings
clearly illustrate the importance of thinking in terms of multiple, overlapping
explanations of variation in the dependent variable.
You can get the same results using the pcor.test function in the ppcor package. The format for this command is pcor.test(y, x, z). Note that when copying the subset of variables to the new object, I used na.omit to drop any observations that had missing data on any of the three variables.
#Load the package that provides pcor.test
library(ppcor)
#Copy subset of variables, dropping missing data
partial<-na.omit(countries2[,c("lifexp", "fert1520", "mnschool")])
#Partial correlation for fert1520, controlling for mnschool
pcor.test(partial$lifexp,partial$fert1520,partial$mnschool)
14.10 Exercises
14.10.1 Concepts and calculations
1. Use the information in the table below to determine whether there is a
positive or negative relationship between X and Y. As a first step, you
should note whether the values of X and Y for each observation are above
or below their respective means. Then, use this information to explain
why you think there is a positive or negative relationship.
14.10.2 R Problems
For these exercises, you will use the states20 data set to identify po-
tential explanations for state-to-state differences in infant mortality
(states20$infant_mort). Show all graphs and statistical output and
make sure you use appropriate labels on all graphs.
1. Generate a scatterplot showing the relationship between per capita income
(states20$PCincome2020) and infant mortality in the states. Describe the contents of the scatterplot, focusing on the direction and strength of the relationship. Offer more than just "Weak and positive" or something similar.
What about the scatterplot makes it look strong or weak, positive or neg-
ative? Then get the correlation coefficient for the relationship between per
capita income and infant mortality. Interpret the coefficient and comment
on how well it matches your initial impression of the scatterplot.
2. Repeat the same analysis from (1), except now use lowbirthwt (% of
births that are technically low birth weight) as the independent variable.
Again, show all graphs and statistical results. Use words to describe and
interpret the results.
3. Repeat the same analysis from (1), except now use teenbirth (the %
348 CHAPTER 14. CORRELATION AND SCATTERPLOTS
Simple Regression
"Predict" is used loosely here and does not mean we are forecasting future outcomes. Instead, it is used here and throughout the book to refer to guessing or estimating outcomes based on information provided by different types of statistics.
Is life expectancy simply -.84 of the fertility rate, or -.84 times fertility? No,
it doesn’t work that way. Instead, with scatterplots and correlations, we are
limited to saying how strongly x is related to y, and in what direction. It is
important and useful to be able to say this, but it would be nice to be able to
say exactly how much y increases or decreases for a given change in x.
Even when r = 1.0 (when there is a perfect relationship between x and y) we
still can’t predict y based on x using the correlation coefficient itself. Is y equal
to 1*x when the correlation is 1.0? No, not unless x and y are the same variable.
Consider the following two hypothetical variables that are perfectly correlated:
#Create a data set named "perf" with two variables and five cases
x <- c(8,5,11,4,14)
y <- c(31,22,40,19,49)
perf<-data.frame(x,y)
#list the values of the data set
perf
x y
1 8 31
2 5 22
3 11 40
4 4 19
5 14 49
It’s a little hard to tell just by looking at the values of these variables, but they
are perfectly correlated (r=1.0). Check out the correlation coefficient in R:
#Get the correlation between x and y
cor.test(perf$x,perf$y)
(Scatterplot of y by x for the perf data set)
Do you see a pattern in the data points listed above that allows you to predict
y outcomes based on values of x? You might have noticed in the listing of data
values that each value of x divides into y at least three times, but that several
of them do not divide into y four times. So, we might start out with y=3x for
each data point and then see what’s left to explain. To do this, we create a new
variable (predictedy) where we predict y=3x, and then use that to calculate
another variable (leftover) that measures how much is left over:
#Multiply x times three
perf$predictedy<-3*perf$x
#subtract "predictedy" from "y"
perf$leftover<-perf$y-perf$predictedy
#List all data
perf
x y predictedy leftover
1 8 31 24 7
2 5 22 15 7
3 11 40 33 7
4 4 19 12 7
5 14 49 42 7
Now we see that all the predictions based on y=3x under-predicted y by the
same amount, 7. In fact, from this we can see that for each value of x, y is
exactly equal to 3x + 7. Plug one of the values of x into this equation, so you
can see that it perfectly predicts the value of y.
You probably learned from math classes you’ve had before that the equation for
a straight line is
𝑦 = 𝑚𝑥 + 𝑏
Where 𝑚 is the slope of the line (how much y changes for a unit change in x)
and b is a constant. In the example above m=3 and b=7, and x and y are the
independent and dependent variables, respectively.
𝑦 = 3𝑥 + 7
In the language of statistics, the equation describing the linear relationship between a dependent and an independent variable in the population is written as:

𝑌𝑖 = 𝛼 + 𝛽𝑥𝑖 + 𝜖𝑖
Where:
𝑌𝑖 = the dependent variable
𝛼 = the constant (aka the intercept)
𝛽 = the population slope (the change in y for a unit change in x)
𝑥𝑖 = the independent variable
𝜖𝑖 = the error term
But, of course, we usually cannot know the population values of these parame-
ters, so we estimate an equation based on sample data:
𝑦𝑖 = 𝑎 + 𝑏𝑥𝑖 + 𝑒𝑖
Where:
𝑦𝑖 = the dependent variable
𝑎 = the sample constant (aka the intercept)
𝑏 = the sample slope (change in y for a unit change in x)
𝑥𝑖 = the independent variable
𝑒𝑖 = error term; actual values of y minus predicted values of y (𝑦𝑖 − 𝑦𝑖̂ ).
𝑦𝑖̂ = 𝑎 + 𝑏𝑥𝑖
Where the caret ("hat") above the y signals that it is the predicted value, with the error term $e_i$ implied ($y_i = \hat{y}_i + e_i$).
To estimate the predicted value of the dependent variable (𝑦𝑖̂ ) for any given case
we need to know the values of 𝑎, 𝑏, and the outcome of the independent variable
for that case (𝑥𝑖 ). We can calculate the values of 𝑎 and 𝑏 with the following
formulas:
$$b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$$
Note that the numerator is the same as the numerator for Pearson’s r and,
similarly, summarizes the direction of the relationship, while the denominator
reflects the variation in x. This focus on the variation in 𝑥 is necessary because
we are interested in the expected (positive or negative) change in 𝑦 for a unit
change in 𝑥. Unlike Pearson’s r, the magnitude of 𝑏 is not bounded by -1 and
+1. The size of 𝑏 is a function of the scale of the independent and dependent
variables and the strength of relationship. This is an important point, one we
will revisit in Chapter 17.
Once we have obtained the value of the slope, 𝑏, we can plug it into the following
formula to obtain the constant, 𝑎.
$$a = \bar{y} - b\bar{x}$$
Let’s work through calculating these quantities for a small data set using data on
the 2016 and 2020 U.S. presidential election outcomes from seven
states. Presumably there is a close relationship between how states voted in
2016 and how they voted in 2020. We can use regression analysis to analyze
that relationship and to “predict” values in 2020 based on values in 2016 for the
seven states listed below:
#Enter data for three different columns
#State abbreviation
state <- c("AL","FL","ME","NH","RI","UT","WI")
#Democratic % of two-party vote, 2020
vote20 <-c(37.1,48.3,54.7,53.8,60.6,37.6,50.3)
#Democratic % of two-party vote, 2016
vote16<-c(35.6,49.4,51.5,50.2,58.3,39.3,49.6)
#Combine three columns into a single data frame
d<-data.frame(state,vote20,vote16)
#Display data frame
d
This data set includes two very Republican states (AL and UT), one very Demo-
cratic state (RI), two states that lean Democratic (ME and NH), and two very
competitive states (FL and WI). First, let's take a look at the relationship using
a scatterplot and correlation coefficient to get a sense of how closely the 2020
outcomes mirrored those from 2016:
#Plot "vote20" by "vote16"
plot(d$vote16, d$vote20, xlab="Dem two-party % 2016",
ylab="Dem two-party % 2020")
(Scatterplot: Dem two-party % 2016 (x-axis) by Dem two-party % 2020 (y-axis))
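The deviation columns referenced below (YdevXdev and Xdevsq) would be created along these lines; the intermediate column names Ydev and Xdev are assumptions.

#Deviations of y and x from their means
d$Ydev<-d$vote20-mean(d$vote20)
d$Xdev<-d$vote16-mean(d$vote16)
#Cross-product of the deviations (numerator of the slope formula)
d$YdevXdev<-d$Ydev*d$Xdev
#Squared deviations of x (denominator of the slope formula)
d$Xdevsq<-d$Xdev^2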
Now we have all of the components we need to estimate the regression equation.
The sum of the column headed by YdevXdev is the numerator for 𝑏, and the
sum of the column headed by Xdevsq is the denominator:
#Create the numerator for slope formula
numerator=sum(d$YdevXdev)
#Display value of the numerator
numerator
[1] 397.65
#Create the denominator for slope formula
denominator=sum(d$Xdevsq)
#Display value of the denominator
denominator
[1] 356.52
#Calculate slope
b<-numerator/denominator
#Display slope
b
[1] 1.1154
$$b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{397.65}{356.52} = 1.115$$
Now that we have a value of 𝑏, we can plug that into the formula for the constant:
#Get x and y means to calculate constant
mean(d$vote20)
[1] 48.914
mean(d$vote16)
[1] 47.7
#Calculate Constant
a<-48.914 -(47.7*b)
#Display constant
a
[1] -4.2889
$$a = \bar{y} - b\bar{x} = -4.289$$

$$\hat{y} = -4.289 + 1.115x$$
State        Predicted Outcome                     Actual Outcome    Error (y − ŷ)
Alabama      ŷ = −4.289 + 1.115(35.6) = 35.41      37.1              1.70
Wisconsin    ŷ = −4.289 + 1.115(49.6) = 51.02      50.3              −.72
The regression equation under-estimated the outcome for Alabama by 1.7 per-
centage points and over-shot Wisconsin by .72 points. These are near-misses
and suggest a close fit between predicted and actual outcomes. This ability to
predict outcomes of numeric variables based on outcomes of an independent vari-
able is something we get from regression analysis that we do not get from other
methods we have studied. The error reported in the last column (𝑦 − 𝑦)̂ can
be calculated across all observations and used to evaluate how well the model
explains variation in the dependent variable.
(Scatterplot: Dem two-party % 2016 by Dem two-party % 2020, with the regression line y = −4.289 + 1.115x)
We encountered this type of issue before (Chapter 6) when trying to calculate the average deviation from
the mean. As we did then, we can use squared prediction errors to evaluate the
amount of error in the model.
(Scatterplot: Dem two-party % 2016 by Dem two-party % 2020)
The table below provides the actual and predicted values of y, the deviation of
each predicted outcome from the actual value of y, and the squared deviations
(fourth column). The sum of the squared prediction errors is 20.259. We refer to this as the total squared error (or variation) in the residuals (y − ŷ). The "Squares" part of Ordinary Least Squares regression refers to the fact that it produces the lowest squared error possible, given the set of variables.
Table 15.3. Total Error (Variation) Around the Regression Line

State 𝑦 𝑦̂ 𝑦 − 𝑦̂ (𝑦 − 𝑦̂)²
AL 37.1 35.418 1.682 2.829
FL 48.3 50.810 -2.510 6.300
ME 54.7 53.153 1.547 2.393
NH 53.8 51.703 2.097 4.397
RI 60.6 60.737 -0.137 0.019
UT 37.6 39.545 -1.945 3.783
WI 50.3 51.033 -0.733 0.537
Sum 20.259
(Scatterplot: Dem two-party % 2016 by Dem two-party % 2020, with a horizontal line at the mean of y, 48.914)
The horizontal line in this figure represents the mean of y (48.914) and the verti-
cal lines between the mean and the data points represent the error in prediction
if we used the mean of y as the prediction. As you can tell by comparing this to
the previous figure, there is a lot more prediction error when using the mean to
predict outcomes than when using the regression model. Just as we measured
the total squared prediction error from using a regression model (Table 15.3), we
can also measure the level of prediction error without a regression model. Table
15.4 shows the error in prediction from the mean (𝑦 − 𝑦)̄ and its squared value.
The total squared error when predicting with the mean is 463.789, compared to
20.259 with predictions from the regression model.
Table 15.4. Total Error (Variation) Around the Mean
State 𝑦 𝑦 − 𝑦̄ (𝑦 − 𝑦̄)²
AL 37.1 -11.814 139.577
FL 48.3 -0.614 0.377
ME 54.7 5.786 33.474
NH 53.8 4.886 23.870
RI 60.6 11.686 136.556
UT 37.6 -11.314 128.013
WI 50.3 1.386 1.920
Mean 48.914 Sum 463.789
We can now use this information to calculate the proportional reduction in error
(PRE) that we get from the regression equation. In the discussion of measures of
association in Chapter 13 we calculated another PRE statistic, Lambda, based
on the difference between predicting with no independent variable (Error1 ) and
predicting on the basis of an independent variable (Error2 ):
$$\text{Lambda }(\lambda) = \frac{E_1 - E_2}{E_1}$$
Okay, so let’s break this down. The original sum of squared error in the model—
the error from predicting with the mean—is 463.789, while the sum of squared
residuals (prediction error) based on the model with an independent variable
is 20.259. The difference between the two (443.53) is difficult to interpret on
its own, but when we express it as a proportion of the original error, we get
.9563. This means that we see about a 95.6% reduction in error predicting the
outcome of the dependent variable by using information from the independent
variable, compared to using just the mean of the dependent variable. Another
way to interpret this is to say that we have explained 95.6% of the error (or
variation) in y by using x as the predictor.
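Following the same PRE template as the Lambda formula above, the calculation is:

$$r^2 = \frac{E_1 - E_2}{E_1} = \frac{463.789 - 20.259}{463.789} = \frac{443.53}{463.789} = .9563$$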
Earlier in this chapter we found that the correlation between 2016 and 2020
votes in the states was .9779. If you square this, you get .9563, the value of r2 .
This is why it was noted in Chapter 14 that one of the virtues of Pearson’s r
is that it can be used as a measure of proportional reduction in error. In this
case, both r and r2 tell us that this is a very strong relationship. In fact, this is
very close to a perfect positive relationship.
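The regression output below comes from the lm function; this is a sketch of the sort of commands involved (the object name fit7 is an assumption; the formula matches the Call line in the output).

#Estimate the regression of the 2020 vote on the 2016 vote for the seven states
fit7<-lm(d$vote20~d$vote16)
#View the results
summary(fit7)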
Call:
lm(formula = d$vote20 ~ d$vote16)
Residuals:
1 2 3 4 5 6 7
1.682 -2.510 1.547 2.097 -0.137 -1.945 -0.733
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.289 5.142 -0.83 0.44229
d$vote16 1.115 0.107 10.46 0.00014 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see that our calculations for the constant (-4.289), the slope (1.115), and
the value of r2 (.9563) are the same as those produced in the regression output.
You will also note here that the output includes values for t-scores and p-values
that can be used for hypothesis testing. In the case of regression analysis, the
null and alternative hypotheses for the slope are:
H0 : 𝛽 = 0 (There is no relationship)
H1 : 𝛽 ≠ 0 (There is a relationship)
Or, since we expect a positive relationship:
H1 : 𝛽 > 0 (There is a positive relationship)
The t-score is equal to the slope divided by the standard error of the slope. In
this case, the t-score is large (10.46) and the p-value is small (p=.00014), so we are safe rejecting the
null hypothesis. Note also that the t-score and p-values for the slope are identical
to the t-score and p-values from the correlation coefficient shown earlier. That
is because in simple (bivariate) regression, the results are a strict function of
the correlation coefficient.
Let's now look at the relationship between the 2016 and 2020 votes in all fifty states, using the states20 data set. First, the scatterplot:
#Plot 2020 results against 2016 results, all states
plot(states20$d2pty16, states20$d2pty20, xlab="Clinton % 2016",
ylab="Biden % 2020",
main="Democratic % Two-party Vote")
(Scatterplot: Clinton % 2016 (x-axis) by Biden % 2020 (y-axis), Democratic % Two-party Vote)
This pattern is similar to that found in the smaller sample of seven states: a
strong, positive relationship. This impression is confirmed in the regression
results:
#Get linear model and store results in new object 'fit50'
fit50<-lm(states20$d2pty20~states20$d2pty16)
#View regression results stored in 'fit50'
summary(fit50)
Call:
lm(formula = states20$d2pty20 ~ states20$d2pty16)
Residuals:
Min 1Q Median 3Q Max
-3.477 -0.703 0.009 0.982 2.665
Coefficients:
Estimate Std. Error t value Pr(>|t|)
$$\hat{y} = 3.468 + .964x$$
The value for b (.964) means that for every one unit increase in the observed
value of x (one percentage point in this case), the expected increase in y is .964
units. If it makes it easier to understand, we could express this as a linear
equation with substantive labels instead of “y” and “x”:
̂
2020 Vote = 3.468 + .964(2016 Vote)
To predict the 2020 outcome for any state, just plug in the value of the 2016
outcome for x, multiply it by the slope (.964), and add the value of the
constant.
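For example, for a hypothetical state in which the Democratic candidate won 50% of the two-party vote in 2016, the predicted 2020 outcome would be:

$$\hat{y} = 3.468 + .964(50) = 51.67$$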
(Scatterplot: Clinton % 2016 by Biden % 2020, with the regression line y = 3.468 + .964x)
Just to reiterate, this line represents the best fitting line possible for this pair of
variables. There is always some error in predicting outcomes in y, but no other
linear equation will produce less error in prediction than those produced using
OLS regression.
The slope (angle) of the line is equal to the value of b, and the vertical placement
of the line is determined by the constant. Literally, the value of the constant is
the predicted value of y when x equals zero. In many cases, such as this one,
zero is not a plausible outcome for x, which is part of the reason the constant
does not have a really straightforward interpretation. But that doesn’t mean it
is not important. Look at the scatterplot below, in which I keep the value of b
the same but use the abline function to change the constant to 1.468 instead
of 3.468:
#Plot 2020 results against 2016 results, all states
plot(states20$d2pty16, states20$d2pty20, xlab="Clinton % 2016",
ylab="Biden % 2020")
#Add prediction line to the graph, using 1.468 as the constant
abline(a=1.468, b=.964)
legend("right", legend="y=1.468 + .964x", bty = "n", cex=.9)
(Scatterplot: Clinton % 2016 by Biden % 2020, with the line y = 1.468 + .964x)
Changing the constant had no impact on the angle of the slope, but the line
doesn’t fit the data nearly as well. In fact, it looks like the regression predic-
tions underestimate the outcomes in almost every case. The point here is that
although we are rarely interested in the substantive meaning of the constant, it
plays a very important role in providing the best fitting regression line.
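The output below regresses the 2020 Democratic share of the two-party vote on a southern/non-southern indicator; this is a sketch of the command that would produce it (the object name south_fit is an assumption; the formula matches the Call line).

#Regress 2020 Democratic two-party vote share on the southern indicator
south_fit<-lm(states20$d2pty20~states20$southern)
#View the results
summary(south_fit)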
Call:
lm(formula = states20$d2pty20 ~ states20$southern)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.97 1.66 30.78 <2e-16 ***
states20$southernSouthern -8.33 3.25 -2.56 0.014 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is because the regression model does not apply Welch's correction to account for unequal variances across groups, which the t.test function does by default.
(Plot of 2020 Democratic vote share by the southern indicator: 0=Non-South, 1=South)
Here, the “slope” is just connecting the mean outcomes for the two groups.
Again, it does not make sense to think about a predicted outcome for any
values of the independent variable except 0 and 1, because there are no such
values. So, the best way to think about coefficients for dichotomous variables
is that they represent intercept shifts—we simply add (or subtract) the value of
the coefficient to the intercept when the dichotomous variable equals 1.
All of this is just a way of pointing out to you that regression analysis is really
an extension of things we’ve already been doing.
the chapter.
#Regression model of "infant_mort" by "PCincome2020"
inf_mort<-lm(states20$infant_mort ~states20$PCincome2020)
#View results
summary(inf_mort)
Call:
lm(formula = states20$infant_mort ~ states20$PCincome2020)
Residuals:
Min 1Q Median 3Q Max
-1.4453 -0.7376 0.0331 0.6455 1.9051
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.5344672 0.7852998 13.41 < 2e-16 ***
states20$PCincome2020 -0.0000796 0.0000135 -5.92 0.00000033 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
to those shown earlier in the chapter. One useful way to think about the r2 statistic is to convert it to a correlation coefficient ($\sqrt{r^2}$), which in this case gives us r=.65, a fairly strong correlation.
(Scatterplot: Per Capita Income by Infant Mortality)
One thing we can do to make the pattern more meaningful is replace the
data points with state abbreviations. Doing this enables us to see which
states are high and low on the two variables, as well as which states are
relatively close to or far from the regression line. To do this, you need to
reduce the size of the original markers (cex=.01 below) and instruct R to
add the state abbreviations using the text function. The format for this is
text(independent_variable,dependent_variable, ID_variable), where
ID_variable is the variable that contains information that identifies the data
points (state abbreviations, in this case).
#Scatterplot for per capita income and infant mortality
plot(states20$PCincome2020, states20$infant_mort,
xlab="Per Capita Income",
ylab="# Infant Deaths per 1000 live Births", cex=.01)
#Add regression line
abline(lm(inf_mort))
#Add State abbreviations
text(states20$PCincome2020,states20$infant_mort, states20$stateab, cex=.7)
(Scatterplot: Per Capita Income (x-axis) by # Infant Deaths per 1000 Live Births (y-axis), with state abbreviations as data points)
This scatterplot provides much of the same information as the previous one,
but now we also get information about the outcomes for individual states. The
added value is that now we can see which states are well explained by the
regression model and which states are not. States that stand out as having
higher than expected infant mortality (farthest above the regression line), given their income level, are Alabama, Oklahoma, Alaska, and Maryland. States with lower than expected infant mortality (farthest below the regression line) are New Mexico, Utah, Iowa, Vermont, and Rhode Island. It is not clear if there is
a pattern to these outcomes, but there might be a slight tendency for southern
states to have a higher than expected level of infant mortality, once you control
for income.
15.11 Assignments
15.11.1 Concepts and Calculations
1. Identify the parts of this regression equation: 𝑦 ̂ = 𝑎 + 𝑏𝑥
• The independent variable is:
• The slope is:
• The constant is:
• The dependent variable is:
2. The regression output below is from the 2020 ANES and shows the impact
of respondent sex (0=male, 1=female) on the Feminist feeling thermome-
ter rating.
• Interpret the coefficient for anes20$female
• What is the average Feminist feeling thermometer rating among fe-
male respondents?
• What is the average Feminist feeling thermometer rating among male
respondents?
• Is the relationship between respondent sex and feelings toward fem-
inists strong, weak, or something in between? Explain your answer.
Call:
lm(formula = anes20$feministFT ~ anes20$female)
Residuals:
Min 1Q Median 3Q Max
-62.55 -12.55 -2.55 22.45 45.46
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.536 0.459 118.8 <2e-16 ***
anes20$female 8.012 0.623 12.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = county500$internet ~ county500$povtyAug21)
Residuals:
Min 1Q Median 3Q Max
-25.764 -3.745 0.266 4.470 19.562
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 89.0349 0.7600 117.1 <2e-16 ***
county500$povtyAug21 -0.8656 0.0432 -20.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
15.11.2 R Problems
This assignment builds off the work you did in Chapter 14, as well as the
scatterplot discussion from this chapter.
1. Following the example used at the end of this chapter, produce a scatter-
plot showing the relationship between the percent of births to teen moth-
ers (states20$teenbirth) and infant mortality (states20$infant_mort)
in the states (infant mortality is the dependent variable). Make sure you
include the regression line and the state abbreviations. Identify any states
that you see as outliers, i.e., that don’t follow the general trend. Is there
any pattern to these states? Then, run and interpret a regression model
using the same two variables. Discuss the results, focusing on the general
direction and statistical significance of the relationship, as well as the ex-
pected change in infant mortality for a unit change in the teen birth rate,
and interpret the R2 statistic.
2. Repeat the same analysis from (1), except now use states20$lowbirthwt
(% of births that are technically low birth weight) as the independent
variable.
As always, use words. The statistics don’t speak for themselves!
3. Run a regression model using the feminist feeling thermometer
(anes20$V202160) as the dependent variable and the feeling ther-
mometer for Christian fundamentalists (anes20$V202159) as the independent
variable.
• Describe the relationship between these two variables, focusing on
the value and statistical significance of the slope.
Multiple Regression
Let’s turn, first, to the analysis of the impact of fertility rates on life expectancy
across countries. Similar to regression models used in Chapter 15, the results
of the analysis are stored in a new object. However, instead of using summary
to view the results, we use stargazer to report the results in the form of a
well-organized table. In its simplest form, you just need to tell stargazer to
produce a “text” table using the information from the object where you stored
the results of your regression model. To add to the clarity of the output, I also
include code to generate descriptive labels for the dependent and independent
variables. Without doing so, the R variable names would be used, which would
be fine for you, but using the descriptive labels makes it easier for everyone to
read the table.
#Regression model of "lifexp" by "fert1520"
fertility<-(lm(countries2$lifexp~countries2$fert1520))
#Have 'stargazer' use the information in 'fertility' to create a table
stargazer(fertility, type="text",
dep.var.labels=c("Life Expectancy"),#Label dependent variable
covariate.labels = c("Fertility Rate"))#Label independent variable
===============================================
Dependent variable:
---------------------------
Life Expectancy
-----------------------------------------------
Fertility Rate -4.911***
(0.231)
Constant 85.946***
(0.700)
-----------------------------------------------
Observations 185
R2 0.711
Adjusted R2 0.710
Residual Std. Error 4.032 (df = 183)
F Statistic 450.410*** (df = 1; 183)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
As you can see, stargazer produces a neatly organized, easy to read table, one
that stands in stark contrast to the standard regression output from R.1 You
can copy and paste this table into whatever program you are using to produce
your document. Remember, though, that when copying and pasting R output,
you need to use a fixed-width font such as Consolas or Courier, otherwise your columns will wander about and the table will be difficult to read.
1 You might get this warning message when you use stargazer: "length of NULL cannot be changed". This does not affect your table so you can ignore it.
We can use stargazer to present the results of the models for the education and
population models alongside those from the fertility model, all in one table. All
you have to do is run the other two models, save the results in new, differently
named objects, and add the new object names and descriptive labels to the
command line. One other thing I do here is suppress the printing of the F-ratio
and degrees of freedom with omit.stat=c("f"). This helps fit the three models into
one table.
#Run education and population models
education<-(lm(countries2$lifexp~countries2$mnschool))
pop<-(lm(countries2$lifexp~log10(countries2$pop19_M)))
#Have 'stargazer' use the information in ''fertility', 'education',
#and 'pop' to create a table
stargazer(fertility, education, pop, type="text",
dep.var.labels=c("Life Expectancy"),#Label dependent variable
covariate.labels = c("Fertility Rate", #Label 'fert1529'
"Mean Years of Education", #Label 'mnschool'
"Log10 Population"),#Label 'log10pop'
omit.stat=c("f")) #Drop F-stat to make room for three columns
==========================================================================
Dependent variable:
--------------------------------------------------
Life Expectancy
(1) (2) (3)
--------------------------------------------------------------------------
Fertility Rate -4.911***
(0.231)
--------------------------------------------------------------------------
Observations 185 189 191
R2 0.711 0.591 0.007
Adjusted R2 0.710 0.589 0.002
Residual Std. Error 4.032 (df = 183) 4.738 (df = 187) 7.423 (df = 189)
==========================================================================
(Scatterplot: Log10 of Population by Life Expectancy, with the regression line Y = 73.21 − .699X and a horizontal line at the mean of Y)
As you can see, the slope of the regression line is barely different from what we
would expect if b=0. This is a perfect illustration of support for the null hypothesis.

With more than one independent variable, the sample regression model takes the following general form:

$$y_i = a + b_1x_{1i} + b_2x_{2i} + \dots + b_kx_{ki} + e_i$$
Where:
𝑦𝑖 = the predicted value of the dependent variable
𝑎 = the sample constant (aka the intercept)
𝑥1𝑖 = the value of x1
𝑏1 = the partial slope for the impact of x1 , controlling for the impact of other
variables
𝑥2𝑖 = the value of x2
𝑏2 = the partial slope for the impact of x2 , controlling for the impact of other
variables
𝑘= the number of independent variables
𝑒𝑖 = error term; the difference between the predicted and actual values of y
(𝑦𝑖 − 𝑦𝑖̂ ).
The formulas for estimating the constant and slopes for a model with two inde-
pendent variables are presented below:
$$a = \bar{Y} - b_1\bar{X}_1 - b_2\bar{X}_2$$
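One standard way to write the slope for the first of two independent variables, expressed in terms of the bivariate correlations and the standard deviations of the variables (equivalent forms exist), is:

$$b_1 = \left(\frac{r_{y1} - r_{y2}\,r_{12}}{1 - r_{12}^2}\right)\frac{s_y}{s_1}$$

where $r_{y1}$ and $r_{y2}$ are the correlations of the dependent variable with each independent variable, $r_{12}$ is the correlation between the two independent variables, and $s_y$ and $s_1$ are standard deviations; the formula for $b_2$ is symmetric.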
We won’t calculate the constant and slopes, but I want to point out that they
are based on the relationship between each of the independent variables and
the dependent variable and also the interrelationships among the independent
variables. You should recognize the right side of the formula as very similar
to the formula for the partial correlation coefficient. This illustrates that the
partial regression coefficient is doing the same thing that the partial correlation
does: it gives an estimate of the impact of one independent variable while
controlling for how it is related to the other independent variables AND for
how the other independent variables are related to the dependent variable. The
primary difference is that the partial correlation summarizes the strength and
direction of the relationship between x and y, controlling for other specified
variables, while the partial regression slope summarizes the expected change in
y for a unit change in x, controlling for other specified variables and can be used
to predict outcomes of y. As we add independent variables to the model, the
level of complexity for calculating the slopes increases, but the basic principle
of isolating the independent effects of multiple independent variables remains
the same.
To get multiple regression results from R, you just have to add the independent
variables to the linear model function, using the “+” sign to separate them.
Note that I am using stargazer to produce the model results and that I only
specified one model (fit), since all three variables are now in the same model.2
#Note the "+" sign before each additional variable.
#See footnote for information about "na.action"
#Use '+' sign to separate the independent variables.
fit<-lm(countries2$lifexp~countries2$fert15 +
countries2$mnschool+
log10(countries2$pop),na.action=na.exclude)
#Use 'stargazer' to produce a table for the three-variable model
stargazer(fit, type="text",
dep.var.labels=c("Life Expectancy"),
covariate.labels = c("Fertility Rate",
"Mean Years of Education",
"Log10 Population"))
Specifying na.action=na.exclude makes it possible to add hidden information from fit (predicted outcomes) to the countries2 data set after the results are generated. Without adding this to the regression command, R would skip all observations with missing data when generating predictions and the rows of data for the predictions would not match the appropriate rows in the original data set. This is a bit of a technical point. Ignore it if it makes no sense to you.
===================================================
Dependent variable:
---------------------------
Life Expectancy
---------------------------------------------------
Fertility Rate -3.546***
(0.343)
Constant 75.468***
(2.074)
---------------------------------------------------
Observations 183
R2 0.748
Adjusted R2 0.743
Residual Std. Error 3.768 (df = 179)
F Statistic 176.640*** (df = 3; 179)
===================================================
Note: *p<0.1; **p<0.05; ***p<0.01
This output looks a lot like the output from the individual models, except now
we have coefficients for each of the independent variables in a single model.
Here’s how I would interpret the results:
• Two of the three independent variables are statistically significant: fertility
rate has a negative impact on life expectancy, and educational attainment
has a positive impact. The p-values for the slopes of these two variables
are less than .01. Population size has no impact on life expectancy.
• For every additional unit of fertility rate, life expectancy is expected to
decline by 3.546 years, controlling for the influence of other variables.
So, for instance, if Country A and Country B are the same on all other
variables, but Country A’s fertility rate is one unit higher than Country
B's, we would expect Country A's life expectancy to be 3.546 years less than Country B's.
• For every additional year of mean educational attainment, life expectancy
is predicted to increase by .75 years, controlling for the influence of other
variables. So, if Country A and Country B are the same on all other
variables, but Country A's mean level of education is one unit higher than Country B's, we expect that Country A's life expectancy
should be about .75 years higher than Country B’s.
• Together, these three variables explain 74.8% of the variation in country-
level life expectancy.
• You can also write this out as a linear equation, if that helps you with interpretation:

$$\hat{y} = 75.468 - 3.546(\text{fertility rate}) + .75(\text{mean education}) + .296(\text{logged population})$$
One other thing that is important to note is that the slopes for fertility and level
of education are much smaller than they were in the bivariate (one independent
variable) models, reflecting the consequence of controlling for overlapping influ-
ences. For fertility rate, the slope changed from -4.911 to -3.546, a 28% decrease;
while the slope for education went from 1.839 to .75, a 59% decrease in impact.
This is to be expected when two highly correlated independent variables are put
in the same multiple regression model.
Country       Fertility Rate    Mean Education    Logged Population
Country A     1.8               10.3              .75
Country B     3.9               4.9               .75
Country B has a higher fertility rate and lower value on educational attainment
than Country A, so we expect it to have an overall lower predicted outcome on
life expectancy. Let’s plug in the numbers and see what we get.
For Country A:
$$\hat{y} = 75.468 - 3.546(1.8) + .75(10.3) + .296(.75) = 77.03$$

For Country B:

$$\hat{y} = 75.468 - 3.546(3.9) + .75(4.9) + .296(.75) = 65.54$$
$$R^2_{adj} = 1 - (1 - R^2)\left(\frac{N-1}{N-k-1}\right)$$

$$R^2_{adj} = 1 - (.2525)\left(\frac{182}{179}\right) = .743$$
This new estimate is slightly smaller than the unadjusted R2 , but still represents
a real improvement over the strongest of the bivariate models, in which the
adjusted R2 was .710.
Root Mean Squared Error. The R2 (or adjusted R2 ) statistic is a nice
measure of the explanatory power of the models, and a higher R2 for a given
model generally means less error. However, the R2 statistic does not tell us
exactly how much error there is in the model estimates. For instance, the
results above tell us that the model explains about 74% of the variation in life
expectancy, but that piece of information does not speak to the typical error
in prediction $(y_i - \hat{y})^2$. Yes, the model reduces error in prediction by quite a
lot, but how much error is there, on average? For that, we can use a different
statistic, the Root Mean Squared Error (RMSE). The RMSE reflects the typical
error in the model. It simply takes the square root of the mean squared error:
$$RMSE = \sqrt{\frac{\sum(y_i - \hat{y})^2}{N-k-1}}$$
The sum of squared residuals (error) constitutes the numerator, and the model
degrees of freedom (see above) constitute the denominator. Let’s run through
these calculations for the regression model.
#Calculate squared residuals using the saved residuals from "fit"
residsq <- residuals(fit)^2
#Sum the squared residuals
sumresidsq <- sum(residsq)
sumresidsq
[1] 2541.9
#Divide the sum of squared residuals by N-K-1 and take the square root
RMSE=sqrt(sumresidsq/179)
RMSE
[1] 3.7684
The resulting RMSE can be taken as the mean prediction error. For the model
above, RMSE= 3.768. This appears in the regression table as “Residual Std
Error: 3.768 (df = 179)”
One drawback to RMSE is that “3.768” has no standard meaning. Whether it
is a lot or a little error depends on the scale of the dependent variable. For a
dependent variable that ranges from 1 to 15 and has a mean of 8, this could
be a substantial amount of error. However, for a variable like life expectancy,
which ranges from about 52 to 85 and has a mean of 72.6, this is not much
error. The best use of the RMSE lies in comparison of error across models that
use the same dependent variable. For instance, we will be adding a couple of
new variables to the regression model in the next section and the RMSE will
give us a sense of how much more accurate the new model is in comparison to
the current model.
This comparability issue points to a key advantage of the (adjusted) R2 : its
value has a standard meaning, regardless of the scale of the dependent variable.
In the current model, the adjusted R2 (.743) means there is a 74.3% reduction in error, and this interpretation would hold whether the scale of the variable is
1 to 15, 82 to 85, or 1 to 1500.
It is useful to think of the strength of the model in terms of how well its pre-
dictions correlate overall with the dependent variable. In the initial discussion
of the simple bivariate regression model, I pointed out that the square root of
the R2 is the correlation between the independent and dependent variables. We
can think of the multiple regression model in a similar fashion: the square root
of R2 is the correlation (Multiple R) between the dependent variable and the
predictions from the regression model. By this, I mean that Multiple R is lit-
erally the correlation between the observed values of the dependent variable
(y) and the values predicted by the regression model (𝑦).̂
We can generate predicted outcomes based on the three-variable model and look
at a scatterplot of the predicted and actual outcomes to gain an appreciation
for how well the model explains variation in the dependent variable. First,
generate predicted outcomes (yhat) for all observations, based on their values
of the independent variables, using information stored in fit.
#Use information stored in "fit" to predict outcomes
countries2$yhat<-(predict(fit))
The scatterplot illustrating the relationship between the predicted and actual
values of life expectancy is presented below.
#Use predicted values as an independent variable in the scatterplot
plot(countries2$yhat,countries2$lifexp,
xlab="Predicted Life Expectancy",
ylab = "Actual Life Expectancy")
#Plot the regression line
abline(lm(countries2$lifexp~countries2$yhat))
[Scatterplot: Predicted Life Expectancy (horizontal axis) vs. Actual Life Expectancy (vertical axis), with a fitted regression line]
Here we see the predicted values along the horizontal axis, the actual values on
the vertical axis, and a fitted (regression) line. The key takeaway from this plot
is that the model fits the data fairly well. The correlation (below) between the
predicted and actual values is .865, which, when squared, equals .748, the R2 of
the model.
#get the correlation between y and yhat
cor.test(countries2$lifexp, countries2$yhat)
plot(countries2$yhat,countries2$lifexp,
xlab="Predicted Life Expectancy",
ylab = "Actual Life Expectancy",
cex=.01)
#"cex=.01" reduces the marker size to to make room for country codes
text(countries2$yhat,countries2$lifexp, countries2$ccode, cex=.6)
#This tells R to use the country code to label the yhat and y coordinates
abline(lm(countries2$lifexp~countries2$yhat))
[Scatterplot: Predicted Life Expectancy (horizontal axis) vs. Actual Life Expectancy (vertical axis), with points labeled by country code and a fitted regression line]
16.7 Exercises
16.7.1 Concepts and Calculations
1. Answer the following questions regarding the regression model below, which focuses on multiple explanations for county-level differences in internet access, using a sample of 500 counties. The independent variables are the percent of the county population living below the poverty line, the percent of the county population with advanced degrees, and the logged value of population density (logged because density is highly skewed).
=====================================================
Dependent variable:
---------------------------
% With Internet Access
-----------------------------------------------------
Poverty Rate (%) -0.761***
(0.038)
Constant 79.771***
(0.924)
-----------------------------------------------------
Observations 500
R2 0.605
Adjusted R2 0.602
Residual Std. Error 5.684 (df = 496)
F Statistic 252.860*** (df = 3; 496)
=====================================================
Note: *p<0.1; **p<0.05; ***p<0.01
2. Use the information from the model in Question #1, along with the information presented below, to generate predicted outcomes for two hypothetical counties.
• Prediction of County A:
• Prediction of County B:
• What did you learn from predicting these hypothetical outcomes that you
could not learn from the model output?
16.7.2 R Problems
1. Building on the regression models from the R problems in the last chapter,
use the states20 data set and run a multiple regression model with infant
mortality (infant_mort) as the dependent variable, and per capita income
(PCincome2020), the teen birth rate (teenbirth), and percent of low birth weight births (lowbirthwt) as the independent variables. Store the results of the regression model in an object called rhmwrk.
• Produce a readable table with stargazer using the following command:
stargazer(rhmwrk, type="text",
dep.var.labels = "Infant Mortality",
covariate.labels = c("Per capita Income", "%Teen Births",
"Low Birth Weight"))
• Generally, does it look like there is a good fit between the model
predictions and the actual levels of infant mortality? Explain.
• Identify states that stand out as having substantially higher or lower than
expected levels of infant mortality. Do you see any pattern among these
states?
Chapter 17
Advanced Regression Topics
stargazer(fit, type="text",
dep.var.labels=c("Life Expectancy"), #Dependent variable label
covariate.labels = c("Fertility Rate", #Indep Variable Labels
"Mean Years of Education",
"Log10 Population", "% Urban",
"Doctors per 10,000"))
===================================================
Dependent variable:
---------------------------
Life Expectancy
---------------------------------------------------
Fertility Rate -3.237***
(0.341)
% Urban 0.050***
(0.015)
Constant 73.923***
(2.069)
---------------------------------------------------
Observations 179
R2 0.768
Adjusted R2 0.761
Residual Std. Error 3.606 (df = 173)
F Statistic 114.590*** (df = 5; 173)
===================================================
Note: *p<0.1; **p<0.05; ***p<0.01
• The first thing to note is that four of the five variables are statistically
significant (counting docs10k as significant with a one-tailed test), and the
overall model fit is improved in comparison to the fit for the three-variable
model from Chapter 16. The only variable that has no discernible impact
is population size. The model R2 is now .768, indicating that the model
explains 76.8% of the variation in life expectancy, and the adjusted R2 is
.761, an improvement over the previous three-variable model (adjusted R2
was .743). Additionally, the RMSE is now 3.606, also signaling less error
compared to the previous model (3.768). Interpretation of the individual
variables should be something like the following:
• Fertility rate is negatively and significantly related to life expectancy. For
every one unit increase in the value of fertility rate, life expectancy is
expected to decline by about 3.24 years, controlling for the influence of
the other variables in the model.
• Average years of education is positively related to life expectancy. For ev-
ery unit increase in education level, life expectancy is expected to increase
by about .413 years, controlling for the influence of the other variables in
the model.
• Percent of urban population is positively related to life expectancy. For
every unit increase in the percent of the population living in urban areas,
life expectancy is expected to increase by .05 years, controlling for the
influence of other variables in the model.
• Doctors per 10,000 population is positively related to life expectancy. For
every unit increase in doctors per 10,000 population, life expectancy is
expected to increase by about .05 years, controlling for the influence of
the other variables in the model.
I am treating the docs10k coefficient as statistically significant even though the
p-value is greater than .05. This is because the reported p-value (< .10) is a two-
tailed p-value, and if we apply a one-tailed test (which makes sense in this case),
the p-value is cut in half and is less than .05. The two-tailed p-value (not shown
in the table) is .068, so the one-tailed value is .034. Still, in a case like this,
where the level of significance is borderline, it is best to assume that the effect
is pretty small. A weak effect like this is surprising for this variable, especially
since there are strong substantive reasons to expect greater access to health care
to be strongly related to life expectancy. Can you think of an explanation for
this? We explore a couple of potential reasons for this weaker-than-expected
relationship in the next two sections.
Missing Data. One final thing that you need to pay attention to whenever you
work with multiple regression, or any other statistical technique that involves
working with multiple variables at the same time, is the number of missing
cases. Missing cases occur when observations do not have valid values recorded on a given variable. We discussed this a bit earlier in the book in the context of
public opinion surveys, where missing data usually occur because people refuse
to answer questions or do not have an opinion to offer. When working with
aggregate cross-national data, as we are here, missing outcomes usually occur
because the data are not available for some countries on some variables. For
instance, some countries may not report data on some variables to the international organizations (e.g., World Bank, United Nations, etc.) that are collecting
data, or perhaps the data gathering organizations collect certain types of data
from certain types of countries but not for others. This is a more serious prob-
lem for multiple regression than simple regression because multiple regression
uses "listwise" deletion of missing data, meaning that if a case is missing a value on any one of the variables in the model, it is dropped from the analysis entirely. This is why it is important to
pay attention to missing data as you add more variables to the model. It is
possible that one or two variables have a lot of missing data, and you could end
up making generalizations based on a lot fewer data points than you realize. At
this point, using this set of five variables there are sixteen missing cases (there
are 195 countries in the data set and 179 observations used in the model).1 This
is not too extreme but is something that should be monitored.
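A quick way to keep an eye on this in R (a sketch using the model object and data set from above) is to compare the number of observations the model actually used to the number of rows in the data set:
#Number of observations used in the estimation of the model
nobs(fit)
#Number of countries in the data set
nrow(countries2)
#Cases dropped because of missing values on one or more variables
nrow(countries2) - nobs(fit)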
17.3 Multicollinearity
One potential explanation for the tepid role of docs10k in the life expectancy
model is multicollinearity. Recall from the discussions in Chapters 14 and 16
that when independent variables are strongly correlated with each other the sim-
ple bi-variate relationships are likely overstated and the independent variables
lose strength when considered in conjunction with other overlapping explana-
tions. Normally, this is not a major concern. In fact, one of the virtues of regres-
sion analysis is that the model sorts out overlapping explanations (this is why
we use multiple regression). However, when the degree of overlap is very high,
it can make it difficult for substantively important variables to demonstrate
statistical significance. Perfect collinearity violates a regression assumption (see
next chapter), but generally high levels of collinearity can also cause problems,
especially for interpreting significance levels.
Here's how this problem comes about: the t-score for a regression coefficient (b) is calculated as $t = \frac{b}{S_b}$, and the standard error of b ($S_b$) for any given variable
is directly influenced by how that variable is correlated with other independent
variables. The formula below illustrates how collinearity can affect the standard
error of 𝑏1 in a model with two independent variables:
$$S_{b_1} = \frac{RMSE}{\sqrt{\sum(x_i - \bar{x}_1)^2\,(1 - R^2_{1 \cdot 2})}}$$
The key to understanding how collinearity affects the standard error, which
then affects the t-score, lies in part of the denominator of the formula for the standard error: $(1 - R^2_{1 \cdot 2})$. As the correlation (and $R^2$) between x1 and x2
increases, the denominator of the formula decreases in size, leading to larger
standard errors and smaller t-scores. Because of this, high correlations among independent variables can make it difficult for individual coefficients to reach statistical significance.
1 The number of missing cases is reported in the standard R output (not stargazer) as "16 observations deleted due to missingness."
The tolerance statistic is the proportion of variance in one independent variable that is not explained by variation in the other independent variables. You should recognize this quantity from the denominator of the formula for the standard error of the regression slope; it is the part of the formula that inflates the standard error. Note that as the proportion of variation in x1 that is accounted for by the other independent variables increases, the tolerance value decreases.
The VIF statistic tells us how much the standard error of the slope is inflated due
to inter-item correlation. The calculation of the VIF is based on the tolerance
statistic:
$$\text{VIF}_{b_1} = \frac{1}{\text{Tolerance}_{b_1}}$$
We can calculate the tolerance for docs10k by regressing it on the other independent variables to get $R^2_{x_1 \cdot x_2 \ldots x_k}$. In other words, treat docs10k as a dependent variable and see how much of its variation is accounted for by the other
variables in the model. The model below does this:
#Use 'docs10k' as the DV to Calculate its Tolerance
fit_tol<-lm(countries2$docs10k~countries2$fert1520 + countries2$mnschool
+ log10(countries2$pop19_M) +countries2$urban)
#Use information in 'fit_tol' to create a table of results
stargazer(fit_tol, type="text",
title="Treating 'docs10k' as the DV to Calculate its Tolerance",
dep.var.labels=c("Doctors per 10,000"),
covariate.labels = c("Fertility Rate",
"Mean Years of Education",
"Log10 Population", "% Urban"))
% Urban 0.099**
(0.043)
Constant -6.080
(5.840)
---------------------------------------------------
Observations 179
R2 0.623
Adjusted R2 0.615
Residual Std. Error 10.200 (df = 174)
F Statistic 71.900*** (df = 4; 174)
===================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Note that the R2 =.623, so 62.3 percent of the variation in docs10k is accounted
for by the other independent variables. Using this information, we get:
Tolerance: 1-.623 = .377
VIF (1/tolerance): 1/.377 = 2.65
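If you would rather let R do this arithmetic, here is a minimal sketch using the R2 stored in fit_tol (the object names follow the code above):
#Tolerance and VIF for docs10k, based on the R-squared from "fit_tol"
tol_docs10k <- 1 - summary(fit_tol)$r.squared
vif_docs10k <- 1/tol_docs10k
tol_docs10k
vif_docs10k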
So, is a VIF of 2.65 a lot? In my experience, no, but it is best to evaluate
this number in the context of the VIF statistics for all variables in the model.
Instead of calculating all of these ourselves, we can just have R get the VIF
statistics for us:
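One common option, shown here as a sketch (the vif() function from the car package; the text may use a different command), is:
#Report the VIF for every independent variable in the model
library(car)
vif(fit)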
Does collinearity explain the marginally significant slope for docs10k? Yes
and no. Sure, the slope for docs10k would probably have a smaller p-value if
we excluded the variables that are highly correlated with it, but some loss of
impact is almost always going to happen when using multiple regression; that’s
the point of using it! Also, the issue of collinearity is not appreciably worse
for docs10k than it is for fertility rate, and it is less severe than it is for mean
education, so it’s not like this variable faces a higher hurdle than the others.
Based on these results I would conclude that collinearity is a slight problem
for docs10k, but that there is probably a better explanation for its level of
statistical significance. While there is no magic cutoff point for determining
that collinearity is a problem that needs to be addressed, in my experience VIF
values in the 7-10 range signal there could be a problem, and values greater
than 10 should be taken seriously. You always have to consider issues like this
in the context of the data and model you are using.
Another alternative that could work well in this scenario is to combine the highly
correlated variables into an index that uses information from all of the variables
to measure access to health care with a single variable. We did something like
this in Chapter 3, where we created a single index of LGBTQ policy preferences
based on outcomes from several separate LGBTQ policy questions. Indexing like
this minimizes the collinearity problem without dropping any variables. There
are many different ways to combine variables, but expanding on that here is a
bit beyond the scope of this book.
17.4 Checking on Linearity
[Scatterplot: life expectancy (vertical axis) plotted against doctors per 10,000 (horizontal axis), with a linear prediction line]
This is interesting. A rather strange looking pattern, right? Note that the
linear prediction line does not seem to fit the data in the same way we have
seen in most of the other scatterplots. Focusing on the pattern in the markers,
this doesn’t look like a typical linear pattern. At low levels of the independent
variable there is a lot of variation in life expectancy, but, on average, it is fairly
low. Increases in doctors per 10k from the lowest values to about 15 lead to substantial increases in life expectancy, but then there are diminishing returns
from that point on. Looking at this plot, you can imagine that a curved line
would fit the data points better than the existing straight line. One of the
important assumptions of OLS regression is that all relationships are linear,
that the expected change in the dependent variable for a unit change in the
independent variable is constant across values of the independent variable. This
is clearly not the case here. Sure, you can fit a straight line to the pattern in
the data, but if the pattern in the data is not linear, then the line does not fit
the data as well as it could, and we are violating an important assumption of
OLS regression (see next Chapter).
Just as 𝑦 = 𝑎 + 𝑏𝑥 is the equation for a straight line, there are a number of
possibilities for modeling a curved line. Based on the pattern in the scatterplot–
one with diminishing returns–I would opt for the following model:
$$\hat{y} = a + b \cdot \log_{10}(x)$$
Here, we transform the independent variable by taking the logged values of its
outcomes.
Log transformations are very common, especially when data show a curvilinear
pattern or when a variable is heavily skewed. In this case, we are using a log(10)
transformation. This means that all of the original values are transformed into
their logged values, using a base of 10. A logged value is nothing more than
the power to which you have to raise your log base (in this case, 10) in order
to get the original raw score. For instance, log10 of 100 is 2, because 102 =100,
and log10 of 50 is 1.699 because 101.699 =50, and so on. We have used log10 for
population size since Chapter 14, and you may recall that a logged version of
population density was used in one of the end-of-chapter assignments in Chapter
16. Using logged values has two primary benefits: it minimizes the impact of
extreme values, which can be important for highly skewed variables, and it
enables us to model relationships as curvilinear rather than linear.
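To make the transformation concrete, here is a quick check using R's built-in log10() function:
#Log base 10 of a few raw values
log10(100)    #2, because 10^2 = 100
log10(50)     #1.699, because 10^1.699 = 50
#Reversing the transformation recovers the raw value
10^log10(50)  #50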
One way to assess whether the logged version of docs10k fits the data better
than the raw version is to look at the bivariate relationships with the dependent
variable using simple regression:
#Simple regression of "lifexp" by "docs10k"
fit_raw<-lm(countries2$lifexp~countries2$docs10k, na.action=na.exclude)
#Simple regression of "lifexp" by "log10(docs10k)"
fit_log<-lm(countries2$lifexp~log10(countries2$docs10k), na.action=na.exclude)
#Use Stargazer to create a table comparing the results from two models
stargazer(fit_raw, fit_log, type="text",
dep.var.labels = c("Life Expectancy"),
covariate.labels = c("Doctors per 10k", "Log10(Doctors per 10k)"))
===========================================================
Dependent variable:
----------------------------
Life Expectancy
(1) (2)
-----------------------------------------------------------
Doctors per 10k 0.315***
(0.024)
-----------------------------------------------------------
Observations 186 186
R2 0.487 0.673
Adjusted R2 0.484 0.671
Residual Std. Error (df = 184) 5.290 4.230
F Statistic (df = 1; 184) 175.000*** 379.000***
===========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
All measures of fit, error, and relationship point to the logged (base 10) version
of docs10k as the superior operationalization. Based on these outcomes, it is
hard to make a case against using the log-transformed version of docs10k. You
can also see this in the scatterplot in Figure 17.1, which includes the prediction
lines from both the raw and logged models.
Note that the curved, solid line fits the data points much better than the
straight, dashed line, which is what we expect based on the relative perfor-
mance of the two regression models above. The interpretation of the curvilinear
relationship is a bit different than for the linear relationship. Rather than con-
cluding that life expectancy increases as doctors per capita increases, we see
in the scatterplot that life expectancy increases substantially as we move from
nearly 0 to about 15 doctors per 10,000 people, but increases beyond that point
have less and less impact on life expectancy.
This is illustrated further below in Figure 17.2, where the graph on the left
shows the relationship between docs10k among countries below the median
level (14.8) of doctors per 10,000 population, and the graph on the right shows
the relationship among countries with more than the median level of doctors
per 10,000 population.
[Figure 17.1: Alternative Models for the Relationship between Doctors per 10k and Life Expectancy. Life expectancy plotted against doctors per 10k, with a straight prediction line (y = a + b*x) and a curved prediction line (y = a + b*log10(x)).]
[Figure 17.2: Life expectancy plotted against doctors per 10k for countries below the median (left panel) and above the median (right panel) number of doctors per 10,000.]
This is not quite as artful as Figure 17.1, with the plotted curved line, but it does demonstrate that the greatest gains in life expectancy from increased access to health care come at the low end of the scale, among countries with
severely limited access. For the upper half of the distribution, there is almost
no relationship between levels of docs10k and life expectancy.
Finally, we can re-estimate the multiple regression model with the appropriate
version of docs10k to see if this transformation has an impact on its statistical
significance.
#Re-estimate five-variable model using "log10(docs10k)"
fit<-lm(countries2$lifexp~countries2$fert1520 + countries2$mnschool+
log10(countries2$pop19_M) +countries2$urban+
log10(countries2$docs10k),na.action = na.exclude)
#Using information in 'fit' to create a table of results
stargazer(fit, type="text", dep.var.labels=c("Life Expectancy"),
covariate.labels = c("Fertility Rate","Mean Years of Education",
"Log10(Population)",
"% Urban","log10(Doctors per 10k)"))
===================================================
Dependent variable:
---------------------------
Life Expectancy
---------------------------------------------------
Fertility Rate -2.810***
(0.399)
Log10(Population) 0.050
(0.329)
% Urban 0.045***
(0.015)
Constant 72.100***
(2.130)
---------------------------------------------------
Observations 179
R2 0.772
Adjusted R2 0.765
Residual Std. Error 3.580 (df = 173)
are the most important and of similar magnitude, followed by education level
in distant third (b=.36), followed by log of population (b=.05) and percent
urban (b=.045), neither one of which seem to matter very much. But this rank
ordering is likely flawed due to differences in the way the variables are measured.
A tip off that something is wrong with this type of comparison is found in the
fact that the slopes for urban population and the log of population size are
of about the same magnitude, even though the slope for urban population is
statistically significant (p<.01) while the slope for population size is not and has
not been significant in any of the analyses shown thus far. How can a variable
whose effect is not distinguishable from zero have an impact equal to one that
has had a significant relationship in all of the examples shown thus far? Seems
unlikely.
Generally, the “raw” regression coefficients are not a good guide to the relative
impact of the independent variables because the variables are measured on dif-
ferent scales. Recall that the regression slopes tell us how much Y is expected
to change for a unit change in x. The problem we encounter when comparing
these slopes is that the units of measure for the independent variables are mostly
different from each other, making it very difficult to compare the “unit changes”
in a fair manner.
To gain a sense of the impact of measurement scale on the size of the regression
slope, consider the two plots below in Figure 17.3. Both plots show the relation-
ship between the size of the urban population and country-level life expectancy.
The plot on the left uses the percentage of a country’s population living in urban
areas, and the plot on the right uses the proportion of a country’s population
living in urban areas. These are the same variable, just using different scales.
As you can see, other than the metric on the horizontal axis, the plots are iden-
tical. In both cases, there is a moderately strong positive relationship between
the size of the urban population and life expectancy: as the urban population
increases in size, life expectancy also increases. But look at the linear equations
that summarize the results of the bi-variate regression models. The slope for the
urban population variable in the second model is 100 times the size of the slope
in the first model, not because the impact is 100 times greater but because of
the difference in the scale of the two variables. This is the problem we encounter
when comparing regression coefficients across independent variables measured
on different scales.
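To see the scale effect mechanically, here is a minimal sketch; it assumes urban is the percent-urban measure used above, so dividing it by 100 converts it to a proportion:
#Same relationship, two different scales for the independent variable
fit_pct  <- lm(lifexp ~ urban, data=countries2)         #percent urban
fit_prop <- lm(lifexp ~ I(urban/100), data=countries2)  #proportion urban
coef(fit_pct)    #slope per one percentage point
coef(fit_prop)   #slope per one unit of proportion (100 times larger)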
What we need to do is standardize the variables used in the regression model so
they are all measured on the same scale. Variables with a common metric (same
mean and standard deviation) will produce slopes that are directly comparable.
One common way to standardize variables is to transform the raw scores into
z-scores:
$$Z_i = \frac{x_i - \bar{x}}{S}$$
All variables transformed in this manner will have $\bar{x} = 0$ and $S = 1$. If we do this to all variables (including the dependent variable), we can get the standardized regression coefficients (sometimes confusingly referred to as "Beta Weights").
[Figure 17.3: The Impact of Scale of the Independent Variable on the Size of Regression Coefficients. Life expectancy plotted against the urban population measured as a percentage (left panel) and as a proportion (right panel).]
Fortunately, though, we don’t have to make these conversions ourselves since
R provides the standardized coefficients. In order to get this information, we
can use the StdCoef command from the DescTools package. The command is
then simply StdCoef(fit), where “fit” is the object with the stored information
from the linear model.
#Produce standardized regression coefficients from 'fit'
StdCoef(fit)
The standardized coefficients show that fertility rate (-.48) has the strongest impact, more than twice the impact of doctors per 10,000 (.21), followed pretty closely by level of education (.15) and urban population (.14), with population size coming in last (.006). These results give a somewhat different take on relative impact than if we relied on the raw coefficients.
We can also make more literal interpretations of these slopes. Since the variables
are all transformed into z-scores, a “one unit change” in x is always a one stan-
dard deviation change in x. This means that we can interpret the standardized
slopes as telling how many standard deviations y is expected to change for a one
standard deviation change in x. For instance, the standardized slope for fertility
(-.48) means that for every one standard deviation increase in the fertility rate
we can expect to see a .48 standard deviation decline in life expectancy. Just
to be clear, though, the primary utility of these standardized coefficients is to
compare the relative impact of the independent variables.
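As a cross-check, here is a sketch of an alternative route (not the approach used above): convert each variable to a z-score with scale() before estimating the model. The slopes should closely match the StdCoef() results, though small differences can arise from how missing cases enter the standardization.
#Re-estimate the model with all variables converted to z-scores;
#the resulting slopes are the standardized coefficients
fit_std <- lm(scale(lifexp) ~ scale(fert1520) + scale(mnschool) +
              scale(log10(pop19_M)) + scale(urban) +
              scale(log10(docs10k)),
              data=countries2, na.action=na.exclude)
coef(fit_std)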
================================================
Dependent variable:
---------------------------
Life Expectancy
------------------------------------------------
Male Life Expectancy 0.810***
(0.018)
Constant 17.400***
(1.410)
412 CHAPTER 17. ADVANCED REGRESSION TOPICS
------------------------------------------------
Observations 184
R2 0.989
Adjusted R2 0.989
Residual Std. Error 0.787 (df = 181)
F Statistic 8,098.000*** (df = 2; 181)
================================================
Note: *p<0.1; **p<0.05; ***p<0.01
We had been feeling pretty good about the model that explains 77% of the
variance in life expectancy and then along comes this much simpler model that
explains almost 100% of the variation. Not that it’s a contest, but this does
sort of raise the question, “Which model is best?” The new model definitely
has the edge in measures of error: the RMSE is just .787, compared to 3.58 in
the five-variable model, and, as mentioned above, the R2 =.99.
So, which model is best? Of course, this is a trick question. The second model
provides a stronger statistical fit but at significant theoretical and substantive
costs. Put another way, the second model explains more variance without pro-
viding a useful or interesting substantive explanation of the dependent variable.
In effect, the second model is explaining life expectancy with other measures of
life expectancy. Not very interesting. Think about it like this. Suppose you
are working as an intern at the World Health Organization and your supervisor
asks you to give them a presentation on the factors that influence life expectancy
around the world. I don’t think they would be impressed if you came back to
them and said you had two main findings:
1. People live longer in countries where men live longer.
2. People live longer in countries where fewer people die really young.
Even with your impressive R2 and RMSE, your supervisor would likely tell you
to go back to your desk, put on your thinking cap, and come up with a model
that provides a stronger substantive explanation, something more along the lines
of the first model. With the results of the first model you can report back that
the factors that are most strongly related to life expectancy across countries are
fertility rates, access to health care, and education levels–things that policies
and aid programs might be able to affect. The share of the population living
in urban areas is also related to life expectancy, but this is not something that
can be addressed as easily through policies and aid programs.
17.8 Exercises
A B
Y X1 X2 X3 Y X1 X2 X3
Y 1.0 0.55 0.32 -0.75 Y 1.0 0.48 0.39 -0.79
X1 0.55 1.0 0.25 -0.43 X1 0.48 1.0 0.3 -0.85
X2 0.32 0.25 1.0 0.02 X2 0.39 0.3 1.0 -0.58
X3 -0.75 -0.43 0.02 1.0 X3 -0.79 -0.85 -0.58 1.0
• For which independent variable in the data set you chose is collinearity
likely to present a problem? Why?
• What test would you advise using to get a better sense of the overall level
of collinearity?
[Exercise plots: panels showing Y plotted against X (including Plot D and Plot E)]
3. The regression output copied below summarizes the impact and statisti-
cal significance of four independent variables (X1 , X2 , X3 , and X4 ) on a
dependent variable (Y).
• Rank-order the independent variables from strongest to weakest in terms
of their relative impact on the dependent variable. Explain the basis for
your ranking.
• Provide an interpretation of both the b value and the standardized co-
efficient for the variable that you have identified as having the greatest
impact on the dependent variable.
17.8.2 R Problems
Building on the regression models from the exercises in the last chapter,
use the states20 data set and run a multiple regression model with infant
mortality (infant_mort) as the dependent variable and per capita income
(PCincome2020), the teen birth rate (teenbirth), percent of low birth weight
births (lowbirthwt), and southern region (south) as the independent variables.
Chapter 18
Regression Assumptions
18.3 Linearity
The pattern of the relationship between x and y is linear; that is, the rate
of change in y for a given change in x is constant across all values of x, and
the model for a straight line fits the data best. The example of the curvilinear
relationship between doctors per 10,000 and life expectancy shown in Chapter
17 is a perfect illustration of the problem that occurs when a linear model is
applied to a non-linear relationship: the straight line is not the best fitting line
and the model will fall short of producing the least squared error possible for a
given set of variables. The best way to get a sense of whether you should test for a violation of the linearity assumption is to examine scatterplots of the dependent variable against each of the independent variables, as in the panels below.
[Scatterplot panels: Life Expectancy plotted against each of the independent variables in the model]
18.4 Independent Variables Are Not Correlated with the Error Term
Take a look at the other variables in the countries2 data set to see if you
think there is something that might be strongly related to country-level life
expectancy, taking care to avoid variables that are really just other measures
of life expectancy. Do you notice any variables whose exclusion from the might
bias the findings for the other variables? One that comes to mind is food_def,
a measure of the average daily supply of calories as a percent of the amount
of calories needed to eliminate hunger. It makes sense that nutrition plays a
role in life expectancy and also that this variable could be related to other
variables in the model. If so, then the other coefficients may be biased without
including food_def in the model. The scatterplot and correlation presented
below show that there is a positive, moderate relationship between food deficit
and life expectancy across countries.
#Scatterplot of "lifexp" by "food_def"
plot(countries2$food_def, countries2$lifexp)
#Add regression line to scatterplot
abline(lm(countries2$lifexp~countries2$food_def))
[Scatterplot: countries2$lifexp (vertical axis) plotted against countries2$food_def (horizontal axis), with a fitted regression line]
[Correlation between countries2$food_def and countries2$lifexp: 0.587]
The scatterplot and correlation (r=.59) both suggest that taking into account
this nutritional measure might make an important contribution to the model.
Let’s see how the model is affected when we add food_def to it:
#Five-variable model
fit<-lm(countries2$lifexp~countries2$fert1520 +
countries2$mnschool+log10(countries2$pop19_M)+
countries2$urban+log10(countries2$docs10k))
#Adding "food deficit" to the model
fit2<-lm(countries2$lifexp~countries2$fert1520 +
countries2$mnschool+ log10(countries2$pop19_M)+
countries2$urban+log10(countries2$docs10k)
+countries2$food_def)
#Produce a table with both models included
stargazer(fit, fit2, type="text",
title = "Table 18.1. The Impact of Adding Food Deficit to the Model",
dep.var.labels=c("Life Expectancy"),
covariate.labels = c("Fertility Rate",
"Mean Years of Education",
"Log10 Population", "% Urban",
"Log10 Doctors/10k",
"Food Deficit"))
-------------------------------------------------------------------------
Observations 179 167
R2 0.772 0.790
Adjusted R2 0.765 0.782
Residual Std. Error 3.580 (df = 173) 3.430 (df = 160)
F Statistic 117.000*** (df = 5; 173) 100.000*** (df = 6; 160)
=========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Food deficit is significantly related to life expectancy and adding it to the model
had a couple of important consequences: the coefficient for mean years of edu-
cation is now not statistically significant, the slope for percent urban was cut
almost in half, and the model fit (adjusted R2 ) increased from .765 to .782.
Taking this measure of nutritional health into account showed that the original
estimates for mean years of education and percent urban were biased upward.
The tricky part of satisfying this assumption is to avoid “kitchen sink” models,
where you just toss a bunch of variables in to see what works. The best approach
is to think hard about which sorts of variables should be taken into account,
and then see if you have measures of those variables that you can use.
#Check whether the mean of the residuals from the model is zero
mean(residuals(fit2))
[1] 2.81e-17
The mean error in the regression model is .0000000000000000281, pretty darn
close to 0.
If you find that the mean of the error term is different from zero, it could be due
to specification error, so you should look again at whether you need to transform
any of your existing variables or incorporate other variables into the model.
[Histogram of residuals(fit2), with a normal curve overlaid for comparison (Density on the vertical axis)]
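A histogram like the one above can be produced along these lines (a sketch; the titles and labels are my choices):
#Histogram of the residuals, with a normal curve added for comparison
hist(residuals(fit2), freq=FALSE,
     main="Histogram of residuals(fit2)", xlab="residuals(fit2)")
curve(dnorm(x, mean=mean(residuals(fit2)), sd=sd(residuals(fit2))),
      add=TRUE)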
A more formal check is the Shapiro-Wilk test of normality: if the p-value generated from this test is less than .05, then we reject the null hypothesis and conclude that the residuals are not normally distributed. The results of this test (shapiro.test) are presented below.
#Test for normality of fit2 residuals
shapiro.test(residuals(fit2))
Shapiro-Wilk normality test
data: residuals(fit2)
W = 1, p-value = 0.3
The p-value is greater than .05, so we can conclude that the distribution of residuals from this model does not deviate significantly from a normal distribution.
If the distribution of the error term does deviate significantly from normal, it
could be the result of outliers or specification error (e.g., omitted variable or
non-linear relationship).
[Scatterplot: residuals(fit2) (vertical axis) plotted against predict(fit2) (horizontal axis)]
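A plot like the one above can be produced along these lines (a sketch; the dashed reference line at zero is an addition):
#Plot residuals against predicted values to check for constant error variance
plot(predict(fit2), residuals(fit2))
abline(h=0, lty=2)  #dashed reference line at zero error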
Though the pattern is not striking, it looks like there is a tendency for greater
variation in error at the low end of the predicted outcomes than at the high end,
but it is hard to tell if this is a problem based on simply eyeballing the data.
Instead, we can test the null hypothesis that variance in error is constant, using
the Breusch-Pagan test (bptest) from the lmtest package:
#Breusch-Pagan test for constant error variance
bptest(fit2)
studentized Breusch-Pagan test
data: fit2
BP = 20, df = 6, p-value = 0.002
With a p-value of .002, we reject the null hypothesis that there is constant vari-
ance in the error term. So, what to do about this violation of the homoscedas-
ticity assumption? Most of the solutions, including those discussed below, are
a bit beyond the scope of an introductory textbook. Nevertheless, I’ll illustrate
a set of commands that can be used to address this issue.
There are a number of ways to address heteroscedasticity, mostly by weighting
the observations based on the size of their error. This approach, called Weighted
Least Squares, entails determining if a particular variable (x) is the source of
the heteroscedasticity and then weighting the observations by the reciprocal of that variable (1/x), or transforming the residuals and weighting the data by the reciprocal of that transformation. Alternatively, you can use the vcovHC function,
which adjusts the standard error estimates so they account for the non-constant
variance in the error. Then, we can incorporate information created by vcovHC
into the stargazer command to produce a table with new standard errors,
t-scores, and p-values. The code below shows how to do this.1
1 This explanation and solution are a bit more "black box" in nature than I usually prefer, but the topic and potential solutions are complicated enough that I think this is the right way to go about it in an introductory textbook.
#Use the regression model (fit2) and method HC1 in the 'vcovHC' function
#(available in the "sandwich" package).
#Store the new covariance matrix in object "hc1"
hc1<-vcovHC(fit2, type="HC1")
#Save the new "Robust" standard errors in a new object
robust.se <- sqrt(diag(hc1))
#Integrate the new standard error into the regression output
#using stargazer. Note that the model (fit2) is listed twice, so a
#comparison can be made. To only show the corrected model, only list fit2
#once, and change "se=list(NULL, robust.se)" to "se=list(robust.se)"
stargazer(fit2, fit2, type = "text",se=list(NULL, robust.se),
column.labels=c("Original", "Corrected"),
dep.var.labels=c("Life Expectancy"),
covariate.labels = c("Fertility Rate",
"Mean Years of Education",
"Log10 Population", "% Urban",
"Log10 Doctors/10k",
"Food Deficit"))
===========================================================
Dependent variable:
----------------------------
Life Expectancy
Original Corrected
(1) (2)
-----------------------------------------------------------
Fertility Rate -2.680*** -2.680***
(0.409) (0.492)
-----------------------------------------------------------
Observations 167 167
R2 0.790 0.790
Adjusted R2 0.782 0.782
Residual Std. Error (df = 160) 3.430 3.430
F Statistic (df = 6; 160) 100.000*** 100.000***
===========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
There are two important things to note about these new estimates. First, as expected,
the standard errors changed for virtually all variables. Most of the changes
were quite small and had little impact on the p-values. The most substantial
change occurred for docs10k, whose slope is still statistically significant using
a one-tailed test. The other thing to note is that this transformation had no
impact on the slope estimates, reinforcing the fact that heteroscedasticity does
not lead to biased slopes but does render standard errors unreliable.
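For completeness, here is a minimal sketch of the weighted least squares approach described above. The choice of fert1520 as the weighting variable is purely hypothetical and would have to be justified by diagnostic work:
#Hypothetical weighted least squares: weight each observation by the
#reciprocal of the variable suspected of driving the non-constant variance
fit_wls <- lm(lifexp ~ fert1520 + mnschool + log10(pop19_M) +
              urban + log10(docs10k) + food_def,
              data=countries2, weights=1/fert1520)
summary(fit_wls)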
Don't let what you've learned sit on the shelf very long. It's not like there is an expiration date on what
you’ve learned, but it will start to fade and be less useful if you don’t find ways to
use it. The most obvious thing you can do is follow up with another course. One
type of course that would be particularly useful is one that focuses primarily on
regression analysis and its extensions. While the last four chapters in this book
focused on regression analysis, they really just scratched the surface. Taking a
stand-alone course on regression analysis will reinforce what you’ve learned and
provide you with a stronger base and a deeper understanding of this important
method. Be careful, though, to make sure you find a course that is pitched
at the right level for you. For instance, if you felt a bit stretched by some of
the technical aspects of this book, you should not jump right into a regression
course offered in the math or statistics department. Instead, a regression-based
course in the social sciences might be more appropriate for you. But if you
are comfortable with a more math-stats orientation, then by all means find a
regression course in the math or statistics department at your school.
There are other potentially fruitful follow-up courses you might consider. If
you were particularly taken with the parts of this book that focused on survey
analysis using the 2020 ANES data, you should look for a course on survey
research. These courses are usually found in political science, sociology, or mass
communications departments. At the same time, if you found the couple of very
brief discussions of experimental research interesting, you could follow up with
a course on analyzing experimental data. The natural sciences are a “natural”2
place to find these courses, but if your tastes run more to the social sciences, you
should check out the psychology department on your campus for these courses.
Since you’ve already invested some valuable time learning how to use R, you
should also find ways to build on that part of the course. If you take another
data analysis course, try to find one that uses R so you can continue to improve
your programming skills. This brings me to one of the most important things
you should look for in other data analysis courses: make sure the course requires
students to do “hands-on” data analysis problems utilizing real-world data. It
would be great if the course also required the use of R, but one of the most
important things is that students are required to use some type of data analysis
program (e.g., R, SPSS, Stata, Minitab, Excel, etc.) to do independent data
analysis. Absent this, you might learn about data analysis, but you will not
learn to do data analysis.
2 Sorry, I can't resist the chance for an obvious pun.
18.11 Exercises
18.11.1 Concepts and Calculations
1. Do any regression assumptions appear to be violated in this scatterplot? If so, which one? Explain your answer. If you think an assumption was violated, discuss what statistical test you would use to confirm your suspicion.
[Plot for Question 1: Residual (vertical axis) plotted against Predicted values (horizontal axis)]
[Additional exercise plots: histograms of residuals (Density by Residuals) and a plot of residuals against predicted values]
18.11.2 R Problems
These problems focus on the same regression model used for the R exercises
in Chapter 17. With the states20 data set, run a multiple regression model
with infant mortality (infant_mort) as the dependent variable and per capita
income (PCincome2020), the teen birth rate (teenbirth), percent of low birth
weight births (lowbirthwt), southern region (south), and percent of adults
with diabetes (diabetes) as the independent variables.
1. Use a scatterplot matrix to explore the possibility that one of the inde-
pendent variables violates the linearity assumption. If it looks like there
might be a violation, what should you do about it?
2. Test the assumption that the mean of the error term equals zero.
3. Use both a histogram and the shapiro.test function to test the assump-
tion that the error term is normally distributed.
4. Plot the residuals from the infant mortality model against the predicted
values and evaluate the extent to which the assumption of constant vari-
ation in the error term is violated. Use the bptest function as a more
formal test of this assumption. Interpret both the scatterplot and the
results of the bptest.
5. The infant mortality model includes five independent variables. While
parsimony is a virtue, it also creates the potential for omitted variable
bias. Look at the other variables in the states20 data set and identify
one of them that you think might be an important influence on infant
mortality and whose exclusion from the model might bias the estimated
effects of the other variables. Justify your choice. Add the new variable to
the model, use stargazer to display the new model alongside the original
model, and discuss the differences between the models and whether the original estimates appear to have been affected by omitted variable bias.
ANES20
Variable Variable Descriptions
version Version of ANES 2020 Time Series
Release
V200001 2020 Case ID
V200010a Full sample pre-election weight
V200010b Full sample post-election weight
V200010c Full sample variance unit
V200010d Full sample variance stratum
V201006 PRE: How interested in following
campaigns
V201114 PRE: Are things in the country on
right track
V201119 PRE: How happy R feels about how
things are going in the country
V201120 PRE: How worried R feels about how
things are going in the country
V201121 PRE: How proud R feels about how
things are going in the country
V201122 PRE: How irritated R feels about
how things are going in the country
V201123 PRE: How nervous R feels about how
things are going in the country
V201129x PRE: SUMMARY: Approve or
disapprove President handling job
V201151 PRE: Feeling Thermometer: Joe
Biden, Democratic Presidential
candidate
V201152 PRE: Feeling Thermometer: Donald
Trump, Republican Presidential
candidate
County20large
Countries2
Variable Variable Description
wbcountry Country Name
ccode Country Code
hdi_rank HDI Rank
hdi Human Development Index (HDI)
lifexp Life expectancy at birth
mnschool Mean years of schooling
gini1019 Gini coefficient: Income inequality
femexp Female life expectancy at birth
malexp Male life expectancy at birth
fem_mnschool Female mean years of schooling
male_mnschool Male mean years of schooling
gender_inequality Gender Inequality Index: inequality
in reproductive health, empowerment
and the labour market
matmort Maternal mortality ratio: Number of
deaths due to pregnancy-related
causes per 100,000 live births.
States20
Variable Name Variable Description
state State name
stateab State abbreviation
abortion_laws # Abortion restrictions in state law
age1825 % 18 to 24 years old
age65plus % older than 65
Approve Biden approval, January-June 2021