Johnson
political science research projects: the choice of research topics, the formulation
the definition of concepts. In this chapter, we take the next step toward testing
order to test empirically the accuracy and utility of a scientific explanation or a political
phenomenon, we will have to observe and measure the presence
important because it provides the bridge between our proposed explanations and the
empirical world they are supposed to explain. How researchers measure
their concepts can have a significant impact on their findings; differences in measurement
and the bottom of the earnings distribution. Kenworthy and Pontusson argued
individuals. The unemployed are excluded from the calculations of individual earnings
ers disproportionately drop out of the employed labor force. Using working-age
Kenworthy and Pontusson found that when individual income was used as a basis
for measuring inequality, inequality had increased the most in the United States,
New Zealand, and the United Kingdom, all liberal market economies. They further
found that income inequality had increased significantly more in these countries
than in Europe's social market economies and Japan. When household income was
used, the data indicated that inequality had increased in all countries with the
chapter 1). Political scientists have investigated whether turnout rates in the United
States have declined in recent decades.2 The answer may depend on how the number
age, or should this number be adjusted to take into account those who are not
eligible to vote, or should the turnout rate be calculated using just the number of
some of which posed greater challenges than others. Milner, Poe, and Leblang wanted to measure
three different types of human rights: personal integrity or security rights, subsistence
rights, and civil and political rights. Each of these types of rights has
multiple dimensions. For example, civil and political rights consist of both civil
property rights. Jeffrey A. Segal and Albert D. Cover measured both the political
involving civil rights and liberties.3 Valerie J. Hoekstra measured people's opinions
about issues connected to Supreme Court cases and their opinions about the
Court.4 Richard L. Hall and Kristina Miler wanted to measure oversight activity
could be tested. All of these researchers made important choices regarding their
measurements.
Devising Measurement Strategies
As we pointed out in chapter 4, researchers must define the concepts they use
measure the presence, absence, or amount of these concepts in the real world.
numerals or scores.
Let us consider, for example, a researcher trying to explain the existence of democracy
and democracy-would be necessary. The researcher could then develop a strategy, based on
the definitions of the two concepts, for measuring the existence
Suppose literacy was defined as "the completion of six years of formal education"
and democracy was defined as "a system of government in which public officials
would indicate what should be observed empirically to measure both literacy and
democracy, and they would indicate specifically what data should be collected to
test the researcher's hypothesis. In this example, the operational definition of literacy
might be "those nations in which at least 50 percent of the population has had
six years of formal education," and the operational definition of democracy might be "those countries in which the
second-place finisher in elections for the chief executive office has received at least
we now know exactly what the researcher means by literacy and democracy.
Since different people often mean different things by the same concept, operational
definitions are especially important. Someone might argue that defining literacy
in terms of formal education ignores the possibility that people who complete six
years of formal education might still be unable to read or write well. Similarly, it
the number of competing candidates, the size of the margin of victory, or the number
they are evaluated according to how well they correspond to the concepts they are
meant to measure.
It is useful to think of arriving at the operational definition as being the last stage in
the process of defining a concept precisely. We often begin with an abstract concept
decide in specific terms how we are going to measure it. At the end of this process,
we hope to attain a definition that is sensible, close to our meaning of the concept,
and exact in what it tells us about how to go about measuring the concept.
some individuals are more liberal than others. The concept of liberalism might be
defined as "believing that government ought to pursue policies that provide benefits
for the less well-off." The task, then, is to develop an operational definition that can be used to
measure whether particular individuals are liberal or not. The
following question from the General Social Survey might be used to operationalize
the concept:
reduce the income differences between the rich and the poor, perhaps by
poor. Others think that the government should not concern itself with
reducing this income difference between the rich and the poor.
not concern itself with reducing income differences. What score between
An abstract concept, liberalism, has now been given an operational definition that
can be used to measure the concept for individuals. This definition is also related
to the original definition of the concept, and it indicates precisely what observations
Others might suggest that questions regarding affirmative action, same-sex marriage,
school vouchers, the death penalty, welfare benefits, and pornography could
The important thing is to think carefully about the operational definition you
choose and to try to ensure that the definition coincides closely with the meaning
are made and interpreted. For example, general statements about liberals or
yourself with the operational definitions used by researchers so that you are better
Let us take a closer look at some operational definitions used by the political
proposed air quality regulations during five oversight hearings held in Congress
and during the public comment period.8 Agencies are required to maintain a
public docket that contains all the comments received during the comment period.
Transcripts were available for each of the hearings. The researchers ended up
with two variables: one was the number of supporting comments; the other was
addressed by the proposed regulations). Because Hall and Miler were interested in
friendly toward the lobbyists' positions, they needed to measure the pro- or
antienvironmental policy positions for each member of Congress, and this variable
had to measure position before the oversight hearings and regulatory comment
period. Fortunately for the researchers, the leaders of the health and environmental
coalition had classified members in terms of their likely support for the rule prior
to the lobbying period and were willing to share their ratings. These measures were
The research conducted by Segal and Cover on the behavior of US Supreme Court
to test a scientific hypothesis.9 Recall that Segal and Cover were interested,
as many others have been before them, in the extent to which the votes cast by
Supreme Court justices were dependent on the justices' personal political attitudes.
Measuring the justices' votes on the cases decided by the Supreme Court is no
problem; the votes are public information. But measuring the personal political attitudes
chapter 4 on avoiding tautologies, or statements that link two concepts that mean
essentially the same thing). Many of the judges whose behavior is of interest have
died, and it is difficult to get living Supreme Court justices to reveal their political
would like a measure of attitudes that is comparable across many judges and that
Segal and Cover limited their inquiry to votes on civil liberties cases between 1953
and 1987, so they needed a measure of related political attitudes for the judges
serving on the Supreme Court over that same period. They decided to infer the
judges' attitudes from the newspaper editorials written about them in four major daily
newspapers from the time each justice was appointed by the president until
the justice's confirmation vote by the Senate. They selected the editorials appearing
in two liberal papers and in two conservative papers. Trained analysts read the editorials
and coded each paragraph for whether it asserted that a justice-designate was
"support for the rights of defendants in criminal cases, women and racial minorities
in equality cases, and the individual against the government in privacy and First
Amendment cases."10
Because of practical barriers to ideal measurement, then, Segal and Cover had to
rather than on a measure of the attitudes themselves. Although this approach may
have resulted in flawed measures, it also permitted the test of an interesting and
important hypothesis about the behavior of Supreme Court justices that had not
been tested previously. Without such measurements, the hypothesis could not have
been tested.
Next, let us consider research conducted by Bradley and his colleagues on the relationship
database, which provides cross-national income data over time in OECD (Organisation
for retirees, retirees in these countries make little provision on their own for retirement.
Thus, many of these people would be counted as "poor" before any government
well as the extent of income transfer for these countries. Therefore, Bradley and
his colleagues limited their analysis to households with a head aged twenty-five to
fifty-nine (thus excluding the student-age population as well) and calculated their
own measures of income inequality from the LIS data. They argued that their data
of income, such as transfers to students and retired persons. Income was defined as
income from wages and salaries, self-employment income, property income, and
private pension income. The researchers also made adjustments for household size
to an equivalent number of adults. The equivalence scale takes into account the
to a survey question that asked respondents if they recalled a campaign ad and
whether it was negative or positive in tone. Finally, Ansolabehere and his
colleagues measured exposure to negative campaign ads in the 1990 Senate elections
by accessing newspaper and magazine articles about the campaigns and determining
The cases discussed here are good examples of researchers' attempts to measure
legislators, the researchers devised measurement strategies that could detect and
measure the presence and amount of the concept in question. These observations
were then generally used as the basis for an empirical test of the researchers'
hypotheses.
researcher's concepts. They must also provide the researcher with enough information
Because we are going to use our measurements to test whether or not our
since they will interfere with our ability to observe the actual relationship between
There are two major threats to the accuracy of measurements. Measures may be
inaccurate because they are unreliable and/or because they are invalid.
Reliability
that produces the same result each time the measure is used. An unreliable measure
Suppose, for example, you want to measure support for the president among college
students. You select two similar survey questions (Q1 and Q2) and ask the
from this sample were 50 percent support for the president using Q1 and 50 percent
support for the president using Q2. But what might you find if you ask the
same questions of multiple random samples of students? Will the results from each
question remain consistent, assuming that the samples are identical? If a second
sample of students is polled, you may find the same result, 50 percent, for Q1 but
60 percent for Q2. If you were to ask Q1 of multiple random samples of students
and the result was consistently 50 percent, you could assert that your measure, Q1,
is reliable. If Q2 were asked of multiple random samples of students and each sample's
result varied between 50 and 60 percent, you could conclude that Q2 is less reliable than Q1 because Q2
Likewise, you can assess the reliability of procedures as well. Suppose you are given
the responsibility of counting a stack of one thousand paper ballots for some public
office. The first time you count them, you obtain a particular result. But as you were
counting the ballots, you might have been interrupted, two or more ballots might
have stuck together, some might have been blown onto the floor, or you might have
written down the totals incorrectly. As a precaution, then, you count them five more
times and get four other people to count them once each as well. The similarity of
the results of all ten counts would be an indication of the reliability of the counting
process.
Similarly, suppose you wanted to test the hypothesis that the New York Times is
more critical of the federal government than is the Wall Street Journal. This would
require you to measure the level of criticism found in articles in the two papers.
You would need to develop criteria or instructions for identifying or measuring
two people read all the articles, independently rate the level of criticism in them
according to your instructions, and then compare their results. Reliability would be
ways. We describe three methods here that are often associated with written test
items or survey questions, but the ideas may be applied in other research contexts.
The test-retest method involves applying the same "test" to the same observations
after a period of time and then comparing the results of the different measurements.
frequently engage in test-retest behavior in our everyday lives. How often have you
The test-retest method of measuring reliability may be both difficult and problematic,
since one must measure the phenomenon at two different points. It is possible
that two different results may be obtained because what is being measured has
changed, not because the measure is unreliable. For example, if your bathroom
scale gives you two different weights within a few seconds, the scale is unreliable,
as your weight cannot have changed. However, if you weigh yourself once a week
for a month and find that you get different results each time, is the scale unreliable,
or has your weight changed between measurements? A further problem with the
test-retest check for reliability is that the administration of the first measure may
affect the second measure's results. For instance, the difference between SAT Reasoning
Test scores the first and second times that individuals take the test may not
be assumed to be a measure of the reliability of the test, since test takers might alter
their behavior the second time as a result of taking the test the first time (e.g., they
concept rather than the same measure. For example, a researcher could devise
two different sets of questions to measure the concept of liberalism, ask the same
respondents questions at two different times using one set of questions the first
time and the other set of questions the second time, and compare the respondents'
scores. Using two different forms of the measure reduces the chance that the second
scores are influenced by the first measure, but it still requires the phenomenon to
be measured twice. Depending on the length of time between the two measurements, what is
being measured may change.
of the same concept at the same time. The results of the two measures are then
compared. This method avoids the problem that the concept being measured may
change between measures. The split-halves method is often used when a multi-item
measure can be split into two equivalent halves. For example, a researcher
a public opinion survey. Half of these questions could be selected to represent one
measure of liberalism, and the other half selected to represent a second measure of
liberalism. If individual scores on the two measures of liberalism are similar, then
The test-retest, alternative-form, and split-halves methods provide a basis for calculating
measures. The less consistent the results are, the less reliable the measure. Political
scientists take very seriously the reliability of the measures they use. Survey
researchers are often concerned about the reliability of the answers they receive. For
the instruments are given at two different times.16 If respondents are not concentrating
or taking the survey seriously, the answers they provide may as well have
Now, let us return to the example of measuring your weight using a home scale.
If you weigh yourself on your home scale, then go to the gym and weigh yourself
again there, and get the same number (alternative forms test of reliability), you
may conclude that your home scale is reliable. But what if you get two different
numbers? Assuming your weight has not changed, what is the problem? If you go
back home immediately and step back on your home scale and find that it gives
you a measurement that is different from the first one it gave you, you could conclude
that your scale has a faulty mechanism, is inconsistent, and therefore is unreliable.
However, what if your bathroom scale gives you the same weight as the first time?
It would appear to be reliable. Maybe the gym scale is unreliable. You could test
this out by going back to the gym and reweighing yourself. If the gym scale gives a
reading different from the one it gave the first time, then it is unreliable. But what if
the gym scale gives consistent readings? Each scale appears to be reliable (the scales
are not giving you different weights at random), but at least one of them is giving
you a wrong measurement (that is, not giving you your correct weight). This is a
problem of validity.
Validity
equivalent measures yield the same result, validity refers to the degree of correspondence
Let us consider first an example of a measure whose validity has been questioned:
voter turnout. Many studies examine the factors that affect voter turnout and, thus, require an
accurate measurement of voter turnout. One way of measuring voter
However, given the social desirability of voting in the United States (wearing the "I
voted" sticker or posting "I voted" on a social media site can bring social rewards),
some respondents will claim in surveys to have voted, resulting in an invalid measure of voter turnout that
overstates the number of voters. In fact, this is what usually happens. Voter surveys
than intended. For example, assume that a researcher intends to measure ideology,
survey respondents, "To which party do you feel closest, the Democratic Party or
the Republican Party?" This measure would be invalid because it fails to measure
is not the same as ideology. This measure could be a valid measure of party identification,
and the actual presence or amount of the concept itself. Information regarding the
ways of thinking about validity, including face, content, construct, and interitem
validity.
Face validity may be asserted (not empirically demonstrated) when the measurement
assess the face validity of a measure, we need to know the meaning of the concept
being measured and whether the information being collected is "germane to that
concept." 18 For example, let us return to thinking about how we might meas.ure
Such a measure could be as simple as a question used by the Pew Research Center:
this measure appears to capture the intended concept, so it has face validity. It
but one would be assuming that all Democrats are liberal and all Republicans are conservative.
Also, if the party identification variable included a category for
independents, what would be their ideology? Can you assume they are all moderates?
For these reasons, a question measuring party identification would lack face validity.
In general, measures lack face validity when there are good reasons to question the
to be problematic.
Content validity is similar to face validity but involves determining the full domain
or meaning of a particular concept and then making sure that all components of
the meaning are included in the measure. For example, suppose you wanted to
As noted earlier, democracy means many things to many people. Raymond D. Gastil
and civil liberties. His checklists for each dimension consisted of eleven items.20
complex domains, like democracy, and spend quite a bit of time discussing and
justifying the content of their measures. In order for a measure of Gastil's conception
of democracy to achieve content validity, the measure should capture all eleven
concept with which the original concept is thought to be related. In other words,
education with income) or a negative manner (say, democracy and human rights
abuses). The researcher then develops a measure of each of the concepts and examines
correlated, then one measure has convergent validity for the other measure. In the
case that there is no relationship between the measures, then the theoretical relationship
the concept, or the procedure used to test the relationship is faulty. The absence
of a hypothesized relationship does not mean a measure is invalid, but the presence
low or weak. If the measures do not correlate with one another, then discriminant
Let us return to the question of measuring the power of legislative leaders because it
use than the formal-powers approach. Therefore, if the two measures are shown to
approach by itself might be a valid way to measure the concept. If the two measures
do not have construct validity, then it would be clear that the two approaches are
not measuring the same thing. Thus, which measure is used could greatly affect the
findings of research into the factors associated with the presence of strong leadership
power or on the consequences of such power. These were the very questions
power. The results, shown in table 5-1, show that the measure of formal power
correlates only weakly with three measures of perceived power (which, as expected,
correlate well with one another). Therefore, measures of perceived power and the
type of validity test most often used by political scientists. It relies on the similarity
the entire measurement scheme. It is often preferable to use more than one item to
of a case.22
Let us return to the researcher who wants to develop a valid measure of liberalism.
First, the researcher might measure people's attitudes toward (1) welfare, (2) military
spending, (3) abortion, (4) Social Security benefit levels, (5) affirmative action,
(6) a progressive income tax, (7) school vouchers, and (8) protection of the rights
of the accused. Then the researcher could determine how the responses to each
question relate to the responses to each of the other questions. The validity of the
The results of such interitem association tests are often displayed in a correlation
matrix. Such a display shows how strongly related each of the items in the measurement
scheme is to all the other items. In the hypothetical data shown in table
5-2, we can see that people's responses to six of the eight measures were strongly
related to each other, whereas responses to the questions on protection of the rights
of the accused and school vouchers were not part of the general pattern. Thus, the
researcher would probably conclude that the first six items all measure liberalism
The figures in table 5-2 are product-moment correlations: numbers that can vary
in value from -1.0 to +1.0 and that indicate the extent to which one variable is
related to another. The closer the correlation is to ±1, the stronger the relationship;
the closer the correlation is to 0.0, the weaker the relationship (see chapter 13 for
a full explanation). The figures in the last two rows are considerably closer to 0.0
than are the other entries, indicating that people's answers to the questions about
school vouchers and rights of the accused did not follow the same pattern as their
answers to the other questions. Therefore, it looks like school vouchers and rights
of the accused are not connected to the same concept of liberalism as measured by the other six questions.
Content and face validity are difficult to assess when agreement is lacking on the
test requires multiple measures of the same concept. Although these validity
of this basic variable illustrates the numerous threats to the reliability and
validity of political science measures. The following is a question used in the 2004
Please look at the booklet and tell me the letter of the income group
that includes the income of all members of your family living here in
2003 before taxes. This figure should include salaries, wages, pensions,
dividends, interest, and all other income. Please tell me the letter of the
income group that includes the income you had in 2003 before taxes.
B. $3,000-$4,999 N. $35,000-$39,999
D. $7,000-$8,999 P. $45,000-$49,999
E. $11,000-$12,999 R. $60,000-$69,999
J. $20,000-$21,999 V. $105,000-$119,999
L. $25,000-$29,999
Both the reliability and the validity of this method of measuring income are questionable.
• Respondents may not know how much money they make and therefore
• Respondents may not know how much money other family members
• Respondents may know how much they make but carelessly select the
wrong categories.
• Data-entry personnel may touch the wrong numbers when entering the
answers into the computer.
income total; some respondents may include only a few family members,
not know which one to pick. Some may pick the higher category; others,
Because of these measurement problems, if this measure were applied to the same
people at two different times, we could expect the results to vary, resulting in inaccurate
measures that are too high for some respondents and too low for others.
Some amount of random measurement error is likely to occur with any measurement
scheme.
In addition to these threats to reliability, there are numerous threats to the validity
of this measure:
• Respondents may have illegal income they do not want to reveal and
because they think of their take-home pay and underestimate how much
Notice that this second list of problems contains the word systematically. These
some being too high and others too low for unpredictable reasons. Systematic measurement
error introduces error that may bias research results, thus compromising the validity of the measure.
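The difference between random and systematic error can be demonstrated with a short simulation. The incomes, sample size, and error magnitudes below are invented for illustration: random error scatters reports in both directions and largely cancels out in the aggregate, whereas systematic error (here, everyone reporting take-home pay at roughly 80 percent of gross income) shifts every report the same way and biases the result.

```python
import random
from statistics import mean

random.seed(7)

# Hypothetical true family incomes for 1,000 respondents.
true_incomes = [random.gauss(60_000, 15_000) for _ in range(1_000)]

# Random error: careless answers scatter in both directions.
random_error = [x + random.gauss(0, 5_000) for x in true_incomes]

# Systematic error: respondents report take-home pay,
# understating gross income by about 20 percent.
systematic_error = [x * 0.8 for x in true_incomes]

print(round(mean(true_incomes)))      # benchmark
print(round(mean(random_error)))      # close to benchmark: noise cancels
print(round(mean(systematic_error)))  # biased downward: error accumulates
```

This is why random error mainly threatens reliability while systematic error threatens validity: averaging more observations shrinks the first problem but does nothing about the second.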
This long list of problems with both the reliability and the validity of this fairly
straightforward measure of a relatively concrete concept is worrisome. Imagine how
much more difficult it is to develop reliable and valid measures when the concept
The reliability and validity of the measures used by political scientists are seldom
neither completely invalid nor valid, neither thoroughly unreliable nor reliable, but, rather,
are partly accurate. Therefore, researchers generally present the rationale and evidence
that their measures are at least as accurate as alternative measures would be.
Nonetheless, a skeptical stance on the part of the reader toward the reliability and
Note, finally, that reliability and validity are not the same thing. A measure may
be reliable without being valid. One may devise a series of questions to measure
liberalism, for example, that yields the same result for the same people every time
but that misidentifies individuals. A valid measure, however, will also be reliable: if
it accurately measures the concept in question, then it will do so consistently across
occur. It is more important, then, to demonstrate validity than reliability, but reliability
Measurements should be not only accurate but also precise; that is, measurements
being measured. The more precise our measures, the more complete and informative
Suppose, for example, that we wanted to measure the height of political candidates
to see if taller candidates usually win elections. Height could be measured in many
different ways. We could have two categories of the variable "height"-tall and
short-and assign different candidates to the two categories based on whether they
of candidates running for the same office and measure which candidate was the
tallest, which the next tallest, and so on. Or we could take a tape measure and measure
each candidate's height in inches and record that measure. The last method of
measurement captures the most information about each candidate's height and is,
Levels of Measurement
think our measurements contain and the mathematical properties that determine
the type of comparisons that can be made across a number of observations on the
same variable. The level of measurement also refers to the claims we are willing to make.
There are four different levels of measurement: nominal, ordinal, interval, and ratio.
While few concepts used in political science research inherently require a particular
provide more information and better mathematical properties than others. So the
We begin with nominal measurement, the level that has the fewest mathematical
properties of the four levels. A nominal-level measure indicates that the values
variable. In such a case, no category is more or less than another category; they
are simply different. For example, suppose we measure the religion of individuals
by asking them to indicate whether they are Christian, Jewish, Muslim, or other.
Since the four categories or values for the variable religion are simply different,
Green, Libertarian, other, and none. Numbers will be assigned to the categories
when the data are coded for statistical analysis, but these numbers do not represent
assigned any number, as long as those numbers are different from each other. In
this sense, nominal-level measures provide the least amount of information about a
but also assumes observations can be compared in terms of having more or less
information about the measured concept and has more mathematical properties
formal education completed with the following categories: "eighth grade or less,"
"some high school," "high school graduate," "some college," and "college degree or
more." Here we are concerned not with the exact difference between the categories
of education but only with whether one category is more or less than another.
When coding this variable, we would assign higher numbers to higher categories
of education. The intervals between the numbers have no meaning; all that matters
is that the higher numbers represent more of the attribute than do the lower numbers.
or descending order.
measures. For example, we could measure nuclear capability with two categories,
where a country that has nuclear capabilities would be coded as a one and a country
that does not would be coded as a zero. One could interpret this variable as
nuclear capability being present or absent in a country and therefore a one represents
who did not vote in the last election lacks, or has less of, the attribute of having
sure these variables are both exhaustive and exclusive. Exhaustive refers to making
sure that all possible categories--or answer choices-are accounted for. The simplest
Exclusive refers to making sure that a single value or answer can fit into only one
category. Each category should be distinct from the others, with no overlap.
of the nominal level (characteristics are different) and the ordinal level (characteristics
can be put in a meaningful order). But unlike the preceding levels of measurement,
have meaning. The value of a particular observation is important not just in terms of
whether it is larger or smaller than another value (as in ordinal measures) but also in
terms of how much larger or smaller it is. For example, suppose we record the year
1977, we know that the event in 1950 occurred twelve years before the one in 1962
and twenty-seven years before the one in 1977. A one-unit change (the interval) all
along this measurement is identical in meaning: the passage of one year's time.
from the next level of measurement (ratio) is that an interval-level measure has an
arbitrarily assigned zero point that does not represent the absence of the attribute
being measured. For example, many time and temperature scales have arbitrary
zero points. Thus, the year 0 CE does not indicate the beginning of time; if this
were true, there would be no BCE dates. Nor does 0°C indicate the absence of heat;
rather, it indicates the temperature at which water freezes. For this reason, with
interval-level measurements we cannot calculate ratios; that is, we cannot say that
60°F is twice as warm as 30°E So while the interval level of measurement captures
more information and mathematical properties than the nominal and ordinal levels,
involves the full mathematical properties of numbers and contains the most possible
the values of the categories, the order of the categories, and the intervals between
the categories; it also precisely indicates the relative amounts of the variable that
the categories represent because its scale includes a meaningful zero. If, for example,
units of that variable, then a ratio-level measurement exists. The key to making this
assumption is that a value of zero on the variable actually represents the absence
of that variable. Because ratio measures have a true zero point, it makes sense to
say that one measurement is x times another. It makes sense to say a sixty-year-old person is twice the age of a thirty-year-old person (60/30 = 2), whereas it does not make sense to say that 60°F is twice as warm as 30°F.
Political science researchers have measured many concepts at the ratio level. People's ages and crime rates, for example, are measures that contain a zero point and possess the full mathematical properties of numbers. Many political concepts, however, cannot be measured at this level. This restricts the types of hypotheses and data analysis techniques that can be used and the conclusions that can be drawn
about the relationships between variables. Higher-order methods often require higher
levels of measurement, while other methods have been developed for lower levels of
measurement. The decision of which level of measurement to use is not always a straightforward one, and uncertainty and disagreement often exist among researchers. The choice depends on the nature of the concept, the data available, and the claims the researcher is willing to make about the resulting measure.
Researchers usually try to devise as high a level of measurement for their concepts
as possible (nominal being the lowest level of measurement and ratio the highest).
With a higher level of measurement, more advanced data analysis techniques can
be used, and more precise statements can be made about the relationships between
variables. Thus, researchers measuring attitudes or concepts with multiple operational definitions often combine them into a single summary measure. We discuss the construction of indexes and scales in greater detail in the following paragraphs.
It is easy to transform ratio-level information (e.g., age in number of years) into ordinal-level information (e.g., age groups). However, if you start with the ordinal-level measure, age groups, you will not have each person's actual age. If you decide you want to use a person's actual age, you will have to collect that data; it cannot be recovered from the grouped measure. Similarly, suppose we are measuring how much each candidate spent on his or her campaign. This information
could be used to construct a new variable indicating how much more one
candidate spent than the other, or simply whether or not a candidate spent more
than his or her opponent. Candidate spending could also be grouped into ranges.
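To make the point concrete, here is a small sketch in Python (with invented figures) of collapsing ratio-level measures into lower levels of measurement:

```python
# Hypothetical illustration: recoding ratio-level data downward.
def age_group(age):
    """Collapse exact age (ratio level) into an ordinal age group."""
    if age < 30:
        return "18-29"
    elif age < 45:
        return "30-44"
    elif age < 65:
        return "45-64"
    return "65+"

ages = [22, 37, 51, 70]                # ratio-level data
groups = [age_group(a) for a in ages]  # ordinal-level data
# The exact ages cannot be recovered from the groups.

# Candidate spending (invented figures) recoded two ways:
spending = {"candidate_a": 250_000, "candidate_b": 175_000}
difference = spending["candidate_a"] - spending["candidate_b"]  # how much more
outspent = spending["candidate_a"] > spending["candidate_b"]    # whether more
print(groups, difference, outspent)
```

Moving down the ladder of measurement in this way is always possible; moving back up is not.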
Nominal and ordinal variables with many categories or interval- and ratio-level
measures using more decimal places are more precise than measures with fewer
categories or decimal places, but sometimes the result may provide more information
than can be used. Researchers frequently start out with ratio-level measures or
with ordinal and nominal measures with quite a few categories but then collapse or combine the data to create groups or fewer categories. They do this so that they have enough cases in each category for statistical analysis or to make comparisons
easier to follow. For example, one might want to present comparisons simply
between Democrats and Republicans rather than presenting data broken down into many finer categories of partisanship.
It may seem contradictory now to point out that extremely precise measures also may create problems. For example, measures with many response possibilities take
up space if they are questions on a written questionnaire or require more time to explain if
they are included in a telephone survey. Such questions may also confuse
or tire survey respondents. A more serious problem is that they may lead to measurement error. Suppose, for example, that respondents are asked to rate candidates on a scale from 0 to 100, where 0 is least favorable, or "coldest," and 100 most favorable. Some respondents
may not use the whole scale (to them, no candidate ever deserves more than
an 80 or less than a 20), whereas others may use the ends and the very middle of
the scale and ignore the scores in between. We might predict that a person who
gives a candidate a 100 is more likely to vote for that candidate than is a person
who gives the same candidate an 80, but in reality they may like the candidate
pretty much the same way and would be equally likely to vote for the candidate.
Another problem with overly precise measurements is that they may be unreliable.
If asked to rate candidates on more than one occasion, respondents could vary
slightly the number that they choose, even if their opinion has not changed.
Multi-Item Measures
Many measures consist of a single item. For example, the measures of party identification, the vote received by a candidate, how concerned about an issue a person is, the policy area of a judicial case, and age are all based on a single measure of each concept. Sometimes, however, researchers want to measure more complicated phenomena that have more than one facet or dimension. For example, political power and the extent to which a person is politically active are complex concepts with multiple dimensions. In this situation, researchers often develop a measurement strategy that allows them to capture the several dimensions of the phenomenon. These multi-item measures are useful for summarizing complex information, reducing it to a more manageable size, and increasing the level of measurement of a variable.
Indexes
A summation index is a method of accumulating scores on individual items to form a single summary measure. It is constructed by assigning a score for each item for each observation, and then combining the scores for each observation across all the items. The resulting summary score is the representative measurement of the concept. A researcher, for example, might construct an index of political freedom by devising a list of items germane to the concept, determining where individual countries score on each item, and then adding these scores to get a summary measure. In table 5-3, such a hypothetical index is shown.
The index in table 5-3 is a simple, additive one; that is, each item counts equally
toward the calculation of the index score, and the total score is the summation of
the individual item scores. However, indexes may be constructed with more complicated aggregation procedures in which some items count more heavily than others. In the preceding example, a researcher might consider some indicators
of freedom as more important than others and wish to have them contribute more
to the calculation of the final index score. This could be done either by weighting (multiplying) some item scores by a number indicating their importance or by assigning more points to the more important items.
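As a sketch of the difference between equal and weighted aggregation, here is a minimal example; the item names and weights are invented for illustration, in the spirit of the political freedom index:

```python
# Hypothetical freedom-index items; names, scores, and weights are invented.
ITEMS = ["free_press", "free_elections", "independent_judiciary"]

def additive_index(scores):
    """Simple additive index: every item counts equally."""
    return sum(scores[item] for item in ITEMS)

def weighted_index(scores, weights):
    """Weighted index: each item score is multiplied by its importance."""
    return sum(scores[item] * weights[item] for item in ITEMS)

country = {"free_press": 1, "free_elections": 1, "independent_judiciary": 0}
weights = {"free_press": 1, "free_elections": 2, "independent_judiciary": 1}

print(additive_index(country))           # 2: sum of equal-weighted items
print(weighted_index(country, weights))  # 3: free elections count double
```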
Indexes are often used with public opinion surveys to measure political attitudes.
This is because attitudes are complex phenomena and we usually do not know
enough about them to devise single-item measures. So we often ask several questions
of people about a single attitude and aggregate the answers to represent the attitude. Suppose, for example, respondents are asked whether they agree or disagree with the following statements: (1) Abortions should be permitted in the first three months of pregnancy. (2) Abortions should be permitted if the woman's life is in danger. (3) Abortions should be permitted under any circumstances. An index of attitudes toward abortion could be created by assigning numerical values to each response (such as 1 for strongly agree, 2 for agree, 3 for undecided, and so on) and then adding the values of a respondent's answers to these three questions. (The researcher would have to decide what to do when a respondent did not answer one or more of the questions.) The lowest possible score in this case would be a 3, indicating the most extreme pro-abortion attitude, and the highest possible score would be a 15, indicating the most extreme anti-abortion attitude. Scores in between would indicate intermediate attitudes.
Indexes are typically fairly simple ways of producing single representative scores of
complicated phenomena such as political attitudes. They are probably more accurate
than most single-item measures, but they may also be flawed in important
ways. Aggregating scores across several items assumes, for example, that each item
is equally important to the summary measure of the concept and that the items used faithfully encompass the domain of the concept. Although individual item
scores can be weighted to change their contribution to the summary measure, the
researcher often has little information upon which to base a weighting scheme.
Several standard indexes are often used in political science research. The FBI crime index, the Consumer Confidence Index, and the Consumer Price Index, for example,
have been used by many researchers. Before using these or any other readily available
index, you should familiarize yourself with its construction and be aware of any
questions raised about its validity. Although simple summation indexes are generally easy to construct, it is sometimes unclear how valid they are or what level of measurement they represent. For example, should an index be treated as an ordinal-level or even a ratio-level measure? Another possible issue with indexes such as the
Consumer Price Index is that what goes into its calculation can change over time.
Scales
Indexes are somewhat arbitrary multi-item measures: the items making up the index and the way in which the scores on individual items are aggregated are based on the researcher's judgment. Scales are also multi-item
measures, but the selection and combination of items in them is accomplished more
systematically than is usually the case for indexes. Over the years, several different
kinds of multi-item scales have been used frequently in political science research.
We discuss three of them: Likert scales, Guttman scales, and Mokken scales.
A Likert scale score is calculated from the scores obtained on individual items. Respondents are typically asked how strongly they agree or disagree with each item, as with the abortion questions discussed earlier. A Likert scale differs from an index, however, in that once the scores on each of the items
are obtained, only some of the items are selected for inclusion in the calculation of
the final score. Those items that allow a researcher to distinguish most readily those
scoring high on an attribute from those scoring low will be retained, and a new summary score will be calculated. Suppose, for example, a researcher wants to measure liberalism but is not sure how many aspects of liberalism need to be measured. With Likert scaling, the researcher would begin with a large group of questions thought to express various aspects of liberalism. A provisional Likert scale for liberalism, then, might look like the one in table 5-4. The questions would be mixed with unrelated items so that respondents do not see them as related. Some of the questions might also be worded in the opposite way (that is, so an "agree" response is a conservative
answer). Each respondent's answers would then be summed to produce a provisional score. The scores in this case can range from 8 to 40. Then the responses of the most liberal and the most conservative people to each question would be compared; any questions answered similarly by these disparate respondents would be dropped, since they fail to distinguish liberals from conservatives. A new summary scale score for all the respondents would be calculated
from the questions that remained. A statistic called Cronbach's alpha, which
measures internal consistency of the items in the scale and has a maximum value of
1.0, is used to determine which items to drop from the scale. The rule of thumb is that Cronbach's alpha should be 0.8 or above; items are dropped from the scale one at a time until this threshold is reached.
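Cronbach's alpha can be computed directly from its definition. The sketch below uses invented responses; a real analysis would also recompute alpha after each candidate item is dropped:

```python
from statistics import pvariance

def cronbach_alpha(data):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(data[0])                       # number of items in the scale
    columns = list(zip(*data))             # transpose: one tuple per item
    item_variances = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in data])
    return (k / (k - 1)) * (1 - item_variances / total_variance)

responses = [  # five invented respondents answering three items on a 1-5 scale
    [5, 4, 5],
    [4, 4, 4],
    [2, 3, 2],
    [1, 2, 1],
    [3, 3, 3],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))  # 0.96: these invented items track one another closely
```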
Likert scales are improvements over multi-item indexes because the items that make up the multi-item measure are selected in part based on the respondents' behavior rather than on the researcher's judgment. Likert scales suffer two of the
other defects of indexes, however. The researcher cannot be sure that all the dimensions
of a concept have been measured, and the relative importance of each item is still unknown.
The Guttman scale also uses a series of items to produce a scale score for respondents.
Unlike the Likert scale, however, a Guttman scale presents respondents with a
range of attitude choices that are increasingly difficult to agree with; that is, the items
composing the scale range from those easy to agree with to those difficult to agree
with. Respondents who agree with one of the "more difficult" attitude items will also
generally agree with the "less difficult" ones. (Guttman scales have also been used to
measure attributes other than attitudes. Their main application, however, has been in attitude research.) Suppose, for example, a researcher wants to measure attitudes toward abortion. He or she might devise a series of items ranging from "easy to agree with" to "difficult to agree with." This array of items seems likely to result in responses consistent with Guttman
scaling. A respondent agreeing with any one of the items is likely also to agree with
those items numbered lower than that one. This would result in the "stepwise" pattern characteristic of a Guttman scale.
Suppose six respondents answered this series of questions, as shown in table 5-5.
Generally speaking, the pattern of responses is as expected; those who agreed with
the "most difficult" questions were also likely to agree with the "less difficult" ones.
However, the responses of three people (2, 4, and 5) to the question about the father's
preferences do not fit the pattern. Consequently, the question about the father does
not seem to fit the pattern and would be removed from the scale. Once that has been done, the remaining items fit the expected stepwise pattern.
With real data, it is unlikely that every respondent would give answers that fit the
pattern perfectly. For example, in table 5-5, respondent 6 gave an "agree" response
to the question about incest or rape. This response is unexpected and does not fit the pattern; it would be counted as an error in assigning a scale score to respondent 6. When the data fit the scale pattern well (number of errors is small),
researchers assume that the scale is an appropriate measure and that the respondent's
"error" may be "corrected" (in this case, either the "agree" in the case of incest
or rape or the "disagree" in the case of the life of the woman). There are standard
procedures to follow to determine how to correct the data to make it conform to the scale pattern. We emphasize, however, that this is done only if the changes are few.
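One common way of counting such errors, sketched below with invented data, compares each response pattern with the ideal stepwise pattern implied by the respondent's total number of agreements:

```python
# Items are ordered from easiest to hardest to agree with; 1 = agree, 0 = disagree.
def guttman_errors(responses):
    """Return (scale score, number of deviations from the ideal pattern).
    The ideal pattern is all agreements first, then all disagreements."""
    score = sum(responses)  # scale score = total number of agreements
    ideal = [1] * score + [0] * (len(responses) - score)
    errors = sum(r != i for r, i in zip(responses, ideal))
    return score, errors

print(guttman_errors([1, 1, 1, 0, 0]))  # perfect stepwise pattern: 0 errors
print(guttman_errors([1, 1, 0, 1, 0]))  # a "hard" agree out of order: 2 errors
```

Summing the errors across all respondents gives one rough indication of how well the data fit the scale; dedicated scaling software uses more refined fit statistics.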
Guttman scales differ from Likert scales in that, in the former case, generally only
one set of responses will yield a particular scale score. That is, to get a score of 3 on the abortion scale, a particular pattern of responses (or something very close to it) is necessary. In the case of a Likert scale, however, many different patterns of
responses can yield the same scale score. A Guttman scale is also much more difficult
to achieve than a Likert scale, since the items must be ordered and cumulative in difficulty.
Both Likert and Guttman scales have shortcomings in their level of measurement.
The level of measurement produced by Likert scales is, at best, ordinal (since we
do not know the relative importance of each item and so cannot be sure that a 5 answer on one item is the same as a 5 answer on another), and the level of measurement produced by Guttman scales is also ordinal.
Another type of scaling procedure, called Mokken scaling, also analyzes responses
to multiple items by respondents to see if, for each item, respondents can be
ordered and if items can be ordered.26 The Mokken scale was used by Saundra K.
Schneider, William G. Jacoby, and Daniel C. Lewis to see if there was structure in public opinion about the responsibilities of different levels of government. Respondents were asked whether they thought state or local governments "should take the lead" rather than the national government for thirteen different policy areas. The scaling procedure
allowed the researchers to see if a specific sequence of policies emerged while moving
from one end of the scale to the other. One end of the scale would indicate
maximal support for national policy activity, while the other end would indicate maximal support for state and local policy activity.
The results of their analysis are shown in figure 5-1. The scale runs from 0 to 13, with 0 indicating that the national government should take the lead in all thirteen
policy areas and a score of 13 indicating that the respondent believes that state and
local governments should take the lead in all policy areas. A person at any scale
score believes that the state and local governments should take the lead in all policy
areas that fall below that score. Thus, a person with a score of 9 believes that
the national government should take the lead in health care, equality for women,
protecting the environment, and equal opportunity, and that state and local governments
should take the lead responsibility for natural disasters down to urban
development. The bars in the figure correspond to the percentage of respondents who
received a particular score. Thus, just slightly more than 5 percent of the
respondents thought that state and local governments should take the lead in all
policy areas, whereas only 1 percent of the respondents thought the national government
should take the lead in all thirteen policy areas. Most respondents divided
up the responsibilities. The authors concluded from their analysis that the public does have a rationale behind its preferences for the distribution of policy responsibilities.
The procedures described so far for constructing multi-item measures are fairly simple. A more sophisticated technique, factor analysis, is useful when a researcher has a large number of measures and when there is uncertainty about how many dimensions underlie them. For example, one researcher studied landowners' attitudes toward the strategy of planting trees in a wide band (called riparian buffers) along the sides of streams.28 He asked landowners to rate the importance of twelve items thought to bear on the decision. He wanted to know whether the attitudes could be grouped into distinct dimensions that could be used as summary variables instead of using each of the items separately. Using factor analysis, he found that the items factored into three dimensions.
These dimensions and the items included in each dimension are listed in
table 5-6. The first dimension, which he labeled "maintaining property aesthetics,"
included items such as maintaining a view of the stream, neatness, and maintaining
open space. A second dimension contained items related to concern over water
quality. The third dimension related to protecting property against damage or loss.
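The intuition behind this kind of analysis can be sketched with the eigenvalues of a correlation matrix. Everything below (the items, their groupings, the sample size) is invented for illustration, and a real study would use dedicated factor-analysis software:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two invented latent dimensions, each driving two observed items:
aesthetics = rng.normal(size=n)
water = rng.normal(size=n)
items = np.column_stack([
    aesthetics + 0.3 * rng.normal(size=n),  # e.g., "view of the stream"
    aesthetics + 0.3 * rng.normal(size=n),  # e.g., "neatness"
    water + 0.3 * rng.normal(size=n),       # e.g., "water quality concern"
    water + 0.3 * rng.normal(size=n),       # e.g., "pollution concern"
])

corr = np.corrcoef(items, rowvar=False)            # 4x4 correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
# A common rule of thumb retains factors with eigenvalues above 1;
# here two large eigenvalues point to two underlying dimensions.
print(np.round(eigenvalues, 2))
```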
Factor analysis is just one of many techniques developed to explore the dimensionality
of measures and to construct multi-item scales. The readings listed at the end
of this chapter include some resources for students who are especially interested in these techniques.
Through indexes and scales, researchers attempt to enhance both the accuracy and
the precision of their measures. Although these multi-item measures have received
most use in attitude research, they are often useful in other endeavors as well. Both
indexes and scales require researchers to make decisions regarding the selection of
individual items and the way in which the scores on those items will be combined into a summary measure.
Conclusion
To a large extent, a research project is only as good as the measurements that are
developed and used in it. Inaccurate measurements will interfere with the testing of hypotheses and may lead to erroneous conclusions. Imprecise measurements will limit the extent of the comparisons that can be made between observations and the precision of the knowledge that results from empirical research.
Abstract concepts are difficult to measure in a valid way, and the practical constraints
of time and money often jeopardize the reliability and precision of measurements. Nonetheless, the quality of a researcher's measurements is crucial to the results of his or her empirical research and should not be lightly or routinely sacrificed. Sometimes the accuracy of measurements may be enhanced through the use of
multi-item measures. With indexes and scales, researchers select multiple indicators of a concept and combine the individual scores into a summary measure. Although these methods have been used most frequently in attitude research, they can also be used in other situations to improve the assignment of numerals to the phenomena being measured.