
The Building Blocks of Social Scientific Research: Measurement

IN THE PREVIOUS CHAPTERS, WE DISCUSSED the beginning stages of political science research projects: the choice of research topics, the formulation of scientific explanations, the development of testable hypotheses, and the definition of concepts. In this chapter, we take the next step toward testing hypotheses empirically. Before testing hypotheses, we must understand some issues involving the measurement of the concepts we have decided to investigate and how we record systematic observations using numerals or scores to create variables that represent the concepts for analysis.

In chapter 2, we said that scientific knowledge is based on empirical research. In order to test empirically the accuracy and utility of a scientific explanation of a political phenomenon, we will have to observe and measure the presence of the concepts we are using to understand that phenomenon. Furthermore, if this test is to be adequate, our measurements of the political phenomenon must be as accurate and precise as possible. The process of measurement is important because it provides the bridge between our proposed explanations and the empirical world they are supposed to explain. How researchers measure their concepts can have a significant impact on their findings; differences in measurement can lead to totally different conclusions.

Lane Kenworthy and Jonas Pontusson's investigation of income inequality in affluent countries illustrates well the impact on research findings of how a concept is measured.1 One way to measure income distribution is to look at the earnings of full-time-employed individuals and to compare the incomes of those at the top and the bottom of the earnings distribution. Kenworthy and Pontusson argued that it is more appropriate to compare the incomes of households than incomes of individuals. The unemployed are excluded from the calculations of individual earnings inequality, but households include the unemployed. Also, low-income workers disproportionately drop out of the employed labor force. Using working-age household income reflects changes in employment among household members. Kenworthy and Pontusson found that when individual income was used as a basis for measuring inequality, inequality had increased the most in the United States, New Zealand, and the United Kingdom, all liberal market economies. They further found that income inequality had increased significantly more in these countries than in Europe's social market economies and Japan. When household income was used, the data indicated that inequality had increased in all countries with the exception of the Netherlands.

Another example involves the measurement of turnout rates (discussed in chapter 1). Political scientists have investigated whether turnout rates in the United States have declined in recent decades.2 The answer may depend on how the number of eligible voters is measured. Should it be the number of all citizens of voting age, or should this number be adjusted to take into account those who are not eligible to vote, or should the turnout rate be calculated using just the number of registered voters as the potential voting population?

The researchers discussed in chapter 1 measured a variety of political phenomena, some of which posed greater challenges than others. Milner, Poe, and Leblang wanted to measure three different types of human rights: personal integrity or security rights, subsistence rights, and civil and political rights. Each of these types of rights has multiple dimensions. For example, civil and political rights consist of both civil liberties, such as freedom of speech, as well as economic liberties, including private property rights. Jeffrey A. Segal and Albert D. Cover measured both the political ideologies and the written opinions of US Supreme Court justices in cases involving civil rights and liberties.3 Valerie J. Hoekstra measured people's opinions about issues connected to Supreme Court cases and their opinions about the Court.4 Richard L. Hall and Kristina Miler wanted to measure oversight activity by members of Congress, the number of times they were contacted by lobbyists, and whether members of Congress and lobbyists were pro-regulation or antiregulation.5 And Stephen Ansolabehere, Shanto Iyengar, Adam Simon, and Nicholas Valentino measured the intention to vote reported by study participants to see if it was affected by exposure to negative campaign advertising.6 In each case, some political behavior or attribute was measured so that a scientific explanation could be tested. All of these researchers made important choices regarding their measurements.
Devising Measurement Strategies

As we pointed out in chapter 4, researchers must define the concepts they use in their hypotheses through conceptualization. They also must decide how to measure the presence, absence, or amount of these concepts in the real world. Political scientists refer to this process as operationalization, or providing an operational definition of their concepts. Operationalization is deciding how to record empirical observations of the occurrence of an attribute or a behavior using numerals or scores.

Let us consider, for example, a researcher trying to explain the existence of democracy in different nations. If the researcher were to hypothesize that higher rates of literacy make democracy more likely, then a definition of two concepts, literacy and democracy, would be necessary. The researcher could then develop a strategy, based on the definitions of the two concepts, for measuring the existence and amount of both attributes in nations.

Suppose literacy was defined as "the completion of six years of formal education" and democracy was defined as "a system of government in which public officials are selected in competitive elections." These definitions would then be used to develop operational definitions of the two concepts. These operational definitions would indicate what should be observed empirically to measure both literacy and democracy, and they would indicate specifically what data should be collected to test the researcher's hypothesis. In this example, the operational definition of literacy might be "those nations in which at least 50 percent of the population has had six years of formal education, as indicated in a publication of the United Nations," and the operational definition of democracy might be "those countries in which the second-place finisher in elections for the chief executive office has received at least 25 percent of the vote at least once in the past eight years."

When a researcher specifies a concept's operational definition, the concept's precise meaning in a particular research study becomes clear. In the preceding example, we now know exactly what the researcher means by literacy and democracy. Since different people often mean different things by the same concept, operational definitions are especially important. Someone might argue that defining literacy in terms of formal education ignores the possibility that people who complete six years of formal education might still be unable to read or write well. Similarly, it might be argued that defining democracy in terms of competitive elections ignores other important features of democracy, such as freedom of expression and citizen involvement in government activity. In addition, the operational definition of competitive elections is clearly debatable. Is the "competitiveness" of elections based on the number of competing candidates, the size of the margin of victory, or the number of consecutive victories by a single party in a series of elections? Unfortunately, operational definitions are seldom absolutely correct or absolutely incorrect; rather, they are evaluated according to how well they correspond to the concepts they are meant to measure.

It is useful to think of arriving at the operational definition as being the last stage in the process of defining a concept precisely. We often begin with an abstract concept (such as democracy), then attempt to define it in a meaningful way, and finally decide in specific terms how we are going to measure it. At the end of this process, we hope to attain a definition that is sensible, close to our meaning of the concept, and exact in what it tells us about how to go about measuring the concept.

Let us consider another example: imagine that a researcher is interested in why some individuals are more liberal than others. The concept of liberalism might be defined as "believing that government ought to pursue policies that provide benefits for the less well-off." The task, then, is to develop an operational definition that can be used to measure whether particular individuals are liberal or not. The following question from the General Social Survey might be used to operationalize the concept:

73A. Some people think that the government in Washington ought to reduce the income differences between the rich and the poor, perhaps by raising the taxes of wealthy families or by giving income assistance to the poor. Others think that the government should not concern itself with reducing this income difference between the rich and the poor.

Here is a card with a scale from 1 to 7. Think of a score of 1 as meaning that the government ought to reduce the income differences between rich and poor, and a score of 7 as meaning that the government should not concern itself with reducing income differences. What score between 1 and 7 comes closest to the way you feel? (CIRCLE ONE)

An abstract concept, liberalism, has now been given an operational definition that can be used to measure the concept for individuals. This definition is also related to the original definition of the concept, and it indicates precisely what observations need to be made. It is not, however, the only operational definition possible. Others might suggest that questions regarding affirmative action, same-sex marriage, school vouchers, the death penalty, welfare benefits, and pornography could be used to measure liberalism.
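To make the recording step concrete, the sketch below shows one way the numeric responses to the question above might be stored and classified. This is a minimal illustration; the cutoff points and the classify function are hypothetical choices, not part of the General Social Survey, and a researcher would need to justify any such rule as part of the operational definition.

```python
# Hypothetical classification of GSS 1-7 responses into ideological labels.
# The cutoffs below are invented for illustration only.
def classify(score: int) -> str:
    """Map a 1-7 response to a rough ideological label."""
    if score <= 3:
        return "liberal"        # favors reducing income differences
    if score == 4:
        return "moderate"
    return "conservative"       # opposes government involvement

responses = [1, 4, 7, 2, 5]  # invented survey responses
print([classify(r) for r in responses])
# ['liberal', 'moderate', 'conservative', 'liberal', 'conservative']
```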

The important thing is to think carefully about the operational definition you choose and to try to ensure that the definition coincides closely with the meaning of the original concept. How a concept is operationalized affects how generalizations are made and interpreted. For example, general statements about liberals or conservatives apply to liberals or conservatives only as they have been operationally defined, in this case by this one question regarding government involvement in reducing income differences. As a consumer of research, you should familiarize yourself with the operational definitions used by researchers so that you are better able to interpret and generalize research results.

Examples of Political Measurements: Getting to Operationalization

Let us take a closer look at some operational definitions used by the political science researchers referred to in chapter 1, as well as some others. To measure the strength of a legislator's intervention in air pollution regulations proposed by the Environmental Protection Agency, Hall and Miler coded and counted the number of substantive comments made by legislators challenging or defending the agency's proposed air quality regulations during five oversight hearings held in Congress and during the public comment period.8 Agencies are required to maintain a public docket that contains all the comments received during the comment period. Transcripts were available for each of the hearings. The researchers ended up with two variables: one was the number of supporting comments; the other was the number of comments in opposition to the proposed regulation. To measure constituency interests in each of the members' districts, they measured the number of manufacturing jobs in each district and created an index of air pollution based on district levels of PM10 particulate matter and ground-level ozone (the pollutants addressed by the proposed regulations). Because Hall and Miler were interested in investigating whether lobbyists targeted their efforts toward members of Congress friendly toward the lobbyists' positions, they needed to measure the pro- or antienvironmental policy positions for each member of Congress, and this variable had to measure position before the oversight hearings and regulatory comment period. Fortunately for the researchers, the leaders of the health and environmental coalition had classified members in terms of their likely support for the rule prior to the lobbying period and were willing to share their ratings. These measures were based on legislators' previous voting record on health and environmental issues.

The research conducted by Segal and Cover on the behavior of US Supreme Court justices is a good example of an attempt to overcome a serious measurement problem to test a scientific hypothesis.9 Recall that Segal and Cover were interested, as many others have been before them, in the extent to which the votes cast by Supreme Court justices were dependent on the justices' personal political attitudes. Measuring the justices' votes on the cases decided by the Supreme Court is no problem; the votes are public information. But measuring the personal political attitudes of judges, independent of their votes, is a problem (remember the discussion in chapter 4 on avoiding tautologies, or statements that link two concepts that mean essentially the same thing). Many of the judges whose behavior is of interest have died, and it is difficult to get living Supreme Court justices to reveal their political attitudes through personal interviews or questionnaires. Furthermore, one ideally would like a measure of attitudes that is comparable across many judges and that measures attitudes related to the cases decided by the Court.

Segal and Cover limited their inquiry to votes on civil liberties cases between 1953 and 1987, so they needed a measure of related political attitudes for the judges serving on the Supreme Court over that same period. They decided to infer the judges' attitudes from the newspaper editorials written about them in four major daily newspapers from the time each justice was appointed by the president until the justice's confirmation vote by the Senate. They selected the editorials appearing in two liberal papers and in two conservative papers. Trained analysts read the editorials and coded each paragraph for whether it asserted that a justice-designate was liberal, moderate, or conservative (or if the paragraph was inapplicable) regarding "support for the rights of defendants in criminal cases, women and racial minorities in equality cases, and the individual against the government in privacy and First Amendment cases."10

Because of practical barriers to ideal measurement, then, Segal and Cover had to rely on an indirect measure of judicial attitudes as perceived by four newspapers rather than on a measure of the attitudes themselves. Although this approach may have resulted in flawed measures, it also permitted the test of an interesting and important hypothesis about the behavior of Supreme Court justices that had not been tested previously. Without such measurements, the hypothesis could not have been tested.

Next, let us consider research conducted by Bradley and his colleagues on the relationship between party control of government and the distribution and redistribution of wealth.11 The researchers relied on the Luxembourg Income Study (LIS) database, which provides cross-national income data over time in OECD (Organisation for Economic Co-operation and Development) countries.12 They decided, however, to make adjustments to published LIS data on income inequality. That data included pensioners. Because some countries make comprehensive provisions for retirees, retirees in these countries make little provision on their own for retirement. Thus, many of these people would be counted as "poor" before any government transfers. Including pensioners would inflate the pretransfer poverty level as well as the extent of income transfer for these countries. Therefore, Bradley and his colleagues limited their analysis to households with a head aged twenty-five to fifty-nine (thus excluding the student-age population as well) and calculated their own measures of income inequality from the LIS data. They argued that their data would measure redistribution across income groups, not life-cycle redistributions of income, such as transfers to students and retired persons. Income was defined as income from wages and salaries, self-employment income, property income, and private pension income. The researchers also made adjustments for household size using an equivalence scale, which adjusts the number of persons in a household to an equivalent number of adults. The equivalence scale takes into account the economies of scale resulting from sharing household expenses.
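The chapter does not say which equivalence scale Bradley and his colleagues applied, but a minimal sketch of the general idea, assuming the square-root scale that is common in LIS-based research, looks like this:

```python
# A minimal sketch of equivalizing household income by household size.
# The square-root scale used here is an assumption for illustration,
# not necessarily the scale Bradley and his colleagues chose.
import math

def equivalized_income(household_income: float, household_size: int) -> float:
    """Divide household income by the square root of household size to
    reflect economies of scale from shared household expenses."""
    return household_income / math.sqrt(household_size)

# A four-person household does not need four times a single person's income
# to reach the same standard of living:
print(equivalized_income(80_000, 4))  # 40000.0
print(equivalized_income(20_000, 1))  # 20000.0
```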

Martin P. Wattenberg and Craig Leonard Brians measured exposure by responses to a survey question that asked respondents if they recalled a campaign ad and whether it was negative or positive in tone. Finally, Ansolabehere and his colleagues measured exposure to negative campaign ads in the 1990 Senate elections by accessing newspaper and magazine articles about the campaigns and determining how the tone of the campaigns was described in these articles.

The cases discussed here are good examples of researchers' attempts to measure important political phenomena (behaviors or attributes) in the real world. Whether the phenomenon in question was judges' political attitudes, income inequality, the tone of campaign advertising, or the attitudes and behavior of legislators, the researchers devised measurement strategies that could detect and measure the presence and amount of the concept in question. These observations were then generally used as the basis for an empirical test of the researchers' hypotheses.

To be useful in providing scientific explanations for political behavior, measurements of political phenomena must correspond closely to the original meaning of a researcher's concepts. They must also provide the researcher with enough information to make valuable comparisons and contrasts. Hence, the quality of measurements is judged in regard to both their accuracy and their precision.

The Accuracy of Measurements

Because we are going to use our measurements to test whether or not our explanations for political phenomena are valid, those measurements must be as accurate as possible. Inaccurate measurements may lead to erroneous conclusions, since they will interfere with our ability to observe the actual relationship between two or more variables.

There are two major threats to the accuracy of measurements. Measures may be inaccurate because they are unreliable and/or because they are invalid.
Reliability

Reliability describes the consistency of results from a procedure or measure in repeated tests or trials. In the context of measurement, a reliable measure is one that produces the same result each time the measure is used. An unreliable measure is one that produces inconsistent results, sometimes higher, sometimes lower.15 Suppose, for example, you want to measure support for the president among college students. You select two similar survey questions (Q1 and Q2) and ask the participants in a random sample of students to answer each question. The results from this sample were 50 percent support for the president using Q1 and 50 percent support for the president using Q2. But what might you find if you ask the same questions of multiple random samples of students? Will the results from each question remain consistent, assuming that the samples are identical? If a second sample of students is polled, you may find the same result, 50 percent, for Q1 but 60 percent for Q2. If you were to ask Q1 of multiple random samples of students and the result was consistently 50 percent, you could assert that your measure, Q1, is reliable. If Q2 were asked of multiple random samples of students and each sample of students returned different answers ranging somewhere between 40 percent and 60 percent, you could conclude that Q2 is less reliable than Q1 because Q2 generates inconsistent results each time it is used.

Likewise, you can assess the reliability of procedures as well. Suppose you are given the responsibility of counting a stack of one thousand paper ballots for some public office. The first time you count them, you obtain a particular result. But as you were counting the ballots, you might have been interrupted, two or more ballots might have stuck together, some might have been blown onto the floor, or you might have written down the totals incorrectly. As a precaution, then, you count them five more times and get four other people to count them once each as well. The similarity of the results of all ten counts would be an indication of the reliability of the counting process.

Similarly, suppose you wanted to test the hypothesis that the New York Times is more critical of the federal government than is the Wall Street Journal. This would require you to measure the level of criticism found in articles in the two papers. You would need to develop criteria or instructions for identifying or measuring criticism. The reliability of your measuring scheme could be assessed by having two people read all the articles, independently rate the level of criticism in them according to your instructions, and then compare their results. Reliability would be demonstrated if both people reached similar conclusions regarding the content of the articles in question.

The reliability of political science measures can be calculated in many different ways. We describe three methods here that are often associated with written test items or survey questions, but the ideas may be applied in other research contexts. The test-retest method involves applying the same "test" to the same observations after a period of time and then comparing the results of the different measurements. For example, if a series of questions measuring liberalism is asked of a group of respondents on two different days, a comparison of their scores at both times could be used as an indication of the reliability of the measure of liberalism. We frequently engage in test-retest behavior in our everyday lives. How often have you stepped on the bathroom scale twice in a matter of seconds?

The test-retest method of measuring reliability may be both difficult and problematic, since one must measure the phenomenon at two different points. It is possible that two different results may be obtained because what is being measured has changed, not because the measure is unreliable. For example, if your bathroom scale gives you two different weights within a few seconds, the scale is unreliable, as your weight cannot have changed. However, if you weigh yourself once a week for a month and find that you get different results each time, is the scale unreliable, or has your weight changed between measurements? A further problem with the test-retest check for reliability is that the administration of the first measure may affect the second measure's results. For instance, the difference between SAT Reasoning Test scores the first and second times that individuals take the test may not be assumed to be a measure of the reliability of the test, since test takers might alter their behavior the second time as a result of taking the test the first time (e.g., they might learn from their first experience with the test).
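In practice, test-retest reliability is often summarized as the correlation between the two administrations. Here is a minimal sketch with invented scores; values near 1.0 indicate consistent measurement.

```python
# A minimal sketch of quantifying test-retest reliability: correlate the
# same respondents' scores from two administrations of the same measure.
# The scores below are invented for illustration.
from statistics import correlation  # available in Python 3.10+

time1 = [3, 5, 2, 7, 4, 6, 1, 5]  # liberalism scores, first administration
time2 = [3, 5, 3, 7, 4, 5, 1, 6]  # same respondents, second administration

print(f"test-retest reliability: {correlation(time1, time2):.2f}")
```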

The alternative-form method of measuring reliability also involves measuring the same attribute more than once, but it uses two different measures of the same concept rather than the same measure. For example, a researcher could devise two different sets of questions to measure the concept of liberalism, ask the same respondents questions at two different times using one set of questions the first time and the other set of questions the second time, and compare the respondents' scores. Using two different forms of the measure reduces the chance that the second scores are influenced by the first measure, but it still requires the phenomenon to be measured twice. Depending on the length of time between the two measurements, what is being measured may change.

The split-halves method of measuring reliability involves applying two measures of the same concept at the same time. The results of the two measures are then compared. This method avoids the problem that the concept being measured may change between measures. The split-halves method is often used when a multi-item measure can be split into two equivalent halves. For example, a researcher may devise a measure of liberalism consisting of the responses to ten questions on a public opinion survey. Half of these questions could be selected to represent one measure of liberalism, and the other half selected to represent a second measure of liberalism. If individual scores on the two measures of liberalism are similar, then the ten-item measure may be said to be reliable by the split-halves approach.
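A minimal sketch of the split-halves check, assuming invented responses to a ten-item scale: score each half for every respondent and correlate the half-scores.

```python
# A minimal sketch of the split-halves method: split ten items into two
# five-item halves, score each half per respondent, and correlate the
# half-scores. The responses below are invented for illustration.
from statistics import correlation

# Each row is one respondent's answers to ten items (higher = more liberal).
responses = [
    [5, 4, 5, 3, 4, 5, 4, 4, 3, 5],
    [2, 1, 2, 2, 1, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 4, 3, 3, 4, 3],
    [1, 2, 1, 1, 2, 2, 1, 1, 2, 1],
    [4, 5, 4, 5, 5, 4, 5, 4, 5, 4],
]

half_a = [sum(row[0::2]) for row in responses]  # odd-numbered items
half_b = [sum(row[1::2]) for row in responses]  # even-numbered items

print(f"split-half reliability: {correlation(half_a, half_b):.2f}")
```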

The test-retest, alternative-form, and split-halves methods provide a basis for calculating the similarity of results of two or more applications of the same or equivalent measures. The less consistent the results are, the less reliable the measure. Political scientists take very seriously the reliability of the measures they use. Survey researchers are often concerned about the reliability of the answers they receive. For example, respondents' answers to survey questions often vary considerably when the instruments are given at two different times.16 If respondents are not concentrating or taking the survey seriously, the answers they provide may as well have been pulled out of a hat.

Now, let us return to the example of measuring your weight using a home scale. If you weigh yourself on your home scale, then go to the gym and weigh yourself again there, and get the same number (an alternative-form test of reliability), you may conclude that your home scale is reliable. But what if you get two different numbers? Assuming your weight has not changed, what is the problem? If you go back home immediately and step back on your home scale and find that it gives you a measurement that is different from the first it gave you, you could conclude that your scale has a faulty mechanism, is inconsistent, and therefore is unreliable. However, what if your bathroom scale gives you the same weight as the first time? It would appear to be reliable. Maybe the gym scale is unreliable. You could test this out by going back to the gym and reweighing yourself. If the gym scale gives a reading different from the one it gave the first time, then it is unreliable. But what if the gym scale gives consistent readings? Each scale appears to be reliable (the scales are not giving you different weights at random), but at least one of them is giving you a wrong measurement (that is, not giving you your correct weight). This is a problem of validity.

Validity

Essentially, a valid measure is one that measures what it is supposed to measure. Unlike reliability, which depends on whether repeated applications of the same or equivalent measures yield the same result, validity refers to the degree of correspondence between the measure and the concept it is thought to measure.

Let us consider first an example of a measure whose validity has been questioned: voter turnout. Many studies examine the factors that affect voter turnout and, thus, require an accurate measurement of voter turnout. One way of measuring voter turnout is to ask people if they voted in the last election: self-reported voting. However, given the social desirability of voting in the United States (wearing the "I voted" sticker or posting "I voted" on a social media site can bring social rewards), will nonvoters admit their failure to vote to an interviewer? Some nonvoters may claim in surveys to have voted, resulting in an invalid measure of voter turnout that overstates the number of voters. In fact, this is what usually happens. Voter surveys commonly overestimate turnout by several percentage points.17

A measure can also be invalid if it measures a slightly or very different concept than intended. For example, assume that a researcher intends to measure ideology, conceptualized as an individual's political views on a continuum between conservative, moderate, and liberal. The researcher proposes to measure ideology by asking survey respondents, "To which party do you feel closest, the Democratic Party or the Republican Party?" This measure would be invalid because it fails to measure ideology as conceptualized. Partisan affinity, while often consistent with ideology, is not the same as ideology. This measure could be a valid measure of party identification, but not ideology.

A measure's validity is more difficult to demonstrate empirically than is its reliability because validity involves the relationship between the measurement of a concept and the actual presence or amount of the concept itself. Information regarding the correspondence is seldom abundant. Nonetheless, there are ways to evaluate the validity of any particular measure. In the following paragraphs we explain several ways of thinking about validity, including face, content, construct, and interitem validity.

Face validity may be asserted (not empirically demonstrated) when the measurement instrument appears to measure the concept it is supposed to measure. To assess the face validity of a measure, we need to know the meaning of the concept being measured and whether the information being collected is "germane to that concept."18 For example, let us return to thinking about how we might measure political ideology, that is, whether someone is conservative, moderate, or liberal. Such a measure could be as simple as a question used by the Pew Research Center: "Do you think of yourself as conservative, moderate or liberal?"19 On its face, this measure appears to capture the intended concept, so it has face validity. It might be tempting to use individuals' responses to a question on party identification, but one would be assuming that all Democrats are liberal and all Republicans are conservative. Also, if the party identification variable included a category for independents, what would be their ideology? Can you assume they are all moderates? For these reasons, a question measuring party identification would lack face validity as a measure of ideology.

In general, measures lack face validity when there are good reasons to question the correspondence of the measure to the concept in question. In other words, assessing face validity is essentially a matter of judgment. If no consensus exists about the meaning of the concept to be measured, the face validity of the measure is bound to be problematic.

Content validity is similar to face validity but involves determining the full domain or meaning of a particular concept and then making sure that all components of the meaning are included in the measure. For example, suppose you wanted to design a measure of the extent to which a nation's political system is democratic. As noted earlier, democracy means many things to many people. Raymond D. Gastil constructed a measure of democracy that included two dimensions, political rights and civil liberties. His checklists for each dimension consisted of eleven items.20 Political scientists are often interested in concepts with multiple dimensions or complex domains, like democracy, and spend quite a bit of time discussing and justifying the content of their measures. In order for a measure of Gastil's conception of democracy to achieve content validity, the measure should capture all eleven components in the definition.

A third way to evaluate the validity of a measure is by empirically demonstrating construct validity. Construct validity can be understood in two different ways: convergent construct validity and discriminant construct validity. Convergent construct validity is when a measure of a concept is related to a measure of another concept with which the original concept is thought to be related. In other words, a researcher may specify, on theoretical grounds, that two concepts ought to be related in a positive manner (say, political efficacy with political participation or education with income) or a negative manner (say, democracy and human rights abuses). The researcher then develops a measure of each of the concepts and examines the relationship between them. If the measures are positively or negatively correlated, then one measure has convergent validity for the other measure. In the case that there is no relationship between the measures, then the theoretical relationship is in error, at least one of the measures is not an accurate representation of the concept, or the procedure used to test the relationship is faulty. The absence of a hypothesized relationship does not mean a measure is invalid, but the presence of a relationship gives some assurance of the measure's validity.


Discriminant construct validity involves two measures that theoretically are expected not to be related; thus, the correlation between them is expected to be low or weak. If the measures do not correlate with one another, then discriminant construct validity is demonstrated.

Let us return to the question of measuring the power of legislative leaders because it provides a good example of the importance of construct validity. As we pointed out before, the perceived-influence approach to measuring power is more difficult to use than the formal-powers approach. Therefore, if the two measures are shown to have construct validity, operationalizing leadership power using the formal-powers approach by itself might be a valid way to measure the concept. If the two measures do not have construct validity, then it would be clear that the two approaches are not measuring the same thing. Thus, which measure is used could greatly affect the findings of research into the factors associated with the presence of strong leadership power or on the consequences of such power. These were the very questions raised by political scientist James Coleman Battista.21 He constructed several measures of perceived leadership power and correlated them with a measure of formal power. The results, shown in table 5-1, show that the measure of formal power correlates only weakly with three measures of perceived power (which, as expected, correlate well with one another). Therefore, measures of perceived power and the measure of formal power do not demonstrate convergent construct validity.

A fourth way to demonstrate validity is through interitem association. This is the type of validity test most often used by political scientists. It relies on the similarity of outcomes of more than one measure of a concept to demonstrate the validity of the entire measurement scheme. It is often preferable to use more than one item to measure a concept; reliance on just one measure is more prone to error or misclassification of a case.22

Let us return to the researcher who wants to develop a valid measure of liberalism. First, the researcher might measure people's attitudes toward (1) welfare, (2) military spending, (3) abortion, (4) Social Security benefit levels, (5) affirmative action, (6) a progressive income tax, (7) school vouchers, and (8) protection of the rights of the accused. Then the researcher could determine how the responses to each question relate to the responses to each of the other questions. The validity of the measurement scheme would be demonstrated if strong relationships existed among people's responses across the eight questions.

The results of such interitem association tests are often displayed in a correlation matrix. Such a display shows how strongly related each of the items in the measurement scheme is to all the other items. In the hypothetical data shown in table 5-2, we can see that people's responses to six of the eight measures were strongly related to each other, whereas responses to the questions on protection of the rights of the accused and school vouchers were not part of the general pattern. Thus, the researcher would probably conclude that the first six items all measure liberalism and that, taken together, they are a valid measurement of liberalism.
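With statistical software, an interitem correlation matrix like table 5-2 is a short computation. Here is a minimal sketch with invented responses, using the pandas library:

```python
# A minimal sketch of building an interitem correlation matrix. The
# responses are invented; in a real study they would come from a survey.
import pandas as pd

items = pd.DataFrame({
    "welfare":         [5, 2, 4, 1, 5, 3],
    "military":        [1, 4, 2, 5, 1, 3],
    "abortion":        [5, 1, 4, 2, 5, 3],
    "school_vouchers": [3, 3, 1, 4, 2, 5],  # may not fit the general pattern
})

# Pairwise product-moment correlations between every pair of items; items
# that measure the same concept should correlate strongly with one another.
print(items.corr().round(2))
```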

The figures in table 5-2 are product-moment correlations: numbers that can vary in value from -1.0 to +1.0 and that indicate the extent to which one variable is related to another. The closer the correlation is to ±1, the stronger the relationship; the closer the correlation is to 0.0, the weaker the relationship (see chapter 13 for a full explanation). The figures in the last two rows are considerably closer to 0.0 than are the other entries, indicating that people's answers to the questions about school vouchers and rights of the accused did not follow the same pattern as their answers to the other questions. Therefore, it looks like school vouchers and rights of the accused are not connected to the same concept of liberalism as measured by the other questions.
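Chapter 13 gives the full treatment, but for reference, the product-moment correlation between variables $x$ and $y$ over $n$ observations is conventionally computed as

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}, $$

where $\bar{x}$ and $\bar{y}$ are the sample means. The numerator captures how the two items vary together; the denominator scales the result so that it always falls between -1.0 and +1.0.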

Content and face validity are difficult to assess when agreement is lacking on the meaning of a concept, and construct validity, which requires a well-developed theoretical perspective, usually yields a less-than-definitive result. The interitem association test requires multiple measures of the same concept. Although these validity "tests" provide important evidence, none of them is likely to support an unequivocal decision concerning the validity of particular measures.

Problems with Reliability and Validity in Political Science Measurement

Survey researchers often want to measure respondents' household income. Measurement of this basic variable illustrates the numerous threats to the reliability and validity of political science measures. The following is a question used in the 2004 American National Election Study (ANES):

Please look at the booklet and tell me the letter of the income group that includes the income of all members of your family living here in 2003 before taxes. This figure should include salaries, wages, pensions, dividends, interest, and all other income. Please tell me the letter of the income group that includes the income you had in 2003 before taxes.

Respondents were given the following choices:

A. None, or less than $2,999    M. $30,000-$34,999
B. $3,000-$4,999                N. $35,000-$39,999
C. $5,000-$6,999                O. $40,000-$44,999
D. $7,000-$8,999                P. $45,000-$49,999
E. $9,000-$10,999               Q. $50,000-$59,999
F. $11,000-$12,999              R. $60,000-$69,999
G. $13,000-$14,999              S. $70,000-$79,999
H. $15,000-$16,999              T. $80,000-$89,999
I. $17,000-$19,999              U. $90,000-$104,999
J. $20,000-$21,999              V. $105,000-$119,999
K. $22,000-$24,999              W. $120,000 and over
L. $25,000-$29,999

Both the reliability and the validity of this method of measuring income are questionable. Threats to the reliability of the measure include the following:

• Respondents may not know how much money they make and therefore incorrectly guess their income.

• Respondents may not know how much money other family members make and guess incorrectly.

• Respondents may know how much they make but carelessly select the wrong categories.

• Interviewers may circle the wrong categories when listening to the selections of the respondents.

• Data-entry personnel may touch the wrong numbers when entering the answers into the computer.

• Dishonest interviewers may incorrectly guess the income of a respondent who does not complete the interview.

• Respondents may not know which family members to include in the income total; some respondents may include only a few family members, while others may include even distant relations.

• Respondents whose income is on the border between two categories may not know which one to pick. Some may pick the higher category; others, the lower one.

Because of these measurement problems, if this measure were applied to the same people at two different times, we could expect the results to vary, resulting in inaccurate measures that are too high for some respondents and too low for others. Some amount of random measurement error is likely to occur with any measurement scheme.

In addition to these threats to reliability, there are numerous threats to the validity of this measure:

• Respondents may have illegal income they do not want to reveal and therefore may systematically underestimate their income.

• Respondents may try to impress the interviewer, or themselves, by systematically overestimating their income.

• Respondents may systematically underestimate their before-tax income because they think of their take-home pay and underestimate how much money is being withheld from their paychecks.

• Respondents may systematically skip the question due to privacy concerns over providing a precise number even if they know it.

Notice that this second list of problems contains the word systematically. These problems are not simply caused by random inconsistencies in measurements, with some being too high and others too low for unpredictable reasons. Systematic measurement error introduces error that may bias research results, thus compromising the confidence we have in them.

This long list of problems with both the reliability and the validity of this fairly straightforward measure of a relatively concrete concept is worrisome. Imagine how much more difficult it is to develop reliable and valid measures when the concept is abstract (for example, tolerance, environmental conscience, self-esteem, or liberalism) and the measurement scheme is more complicated.

The reliability and validity of the measures used by political scientists are seldom demonstrated to everyone's satisfaction. Most measures of political phenomena are neither completely invalid or valid nor thoroughly unreliable or reliable but, rather, are partly accurate. Therefore, researchers generally present the rationale and evidence available in support of their measures and attempt to persuade their audience that their measures are at least as accurate as alternative measures would be. Nonetheless, a skeptical stance on the part of the reader toward the reliability and validity of political science measures is often warranted.

Note, finally, that reliability and validity are not the same thing. A measure may be reliable without being valid. One may devise a series of questions to measure liberalism, for example, that yields the same result for the same people every time but that misidentifies individuals. A valid measure, however, will also be reliable: if it accurately measures the concept in question, then it will do so consistently across measurements, allowing, of course, for some random measurement error that may occur. It is more important, then, to demonstrate validity than reliability, but reliability is usually more easily and precisely tested.

The Precision of Measurements

Measurements should be not only accurate but also precise; that is, measurements should contain as much information as possible about the attribute or behavior being measured. The more precise our measures, the more complete and informative can be our test of the relationships between two or more variables.

Suppose, for example, that we wanted to measure the height of political candidates to see if taller candidates usually win elections. Height could be measured in many different ways. We could have two categories of the variable "height," tall and short, and assign different candidates to the two categories based on whether they were of above-average or below-average height. Or we could compare the heights of candidates running for the same office and measure which candidate was the tallest, which the next tallest, and so on. Or we could take a tape measure and measure each candidate's height in inches and record that measure. The last method of measurement captures the most information about each candidate's height and is, therefore, the most precise measure of the attribute.

Levels of Measurement

When we consider the precision of our measurements, we refer to the level of measurement. The level of measurement involves the type of information that we think our measurements contain and the mathematical properties that determine the type of comparisons that can be made across a number of observations on the same variable. The level of measurement also refers to the claim we are willing to make when we assign numbers to our measurements.

There are four different levels of measurement: nominal, ordinal, interval, and ratio. While few concepts used in political science research inherently require a particular level of measurement, there are methodological limitations because some measures provide more information and better mathematical properties than others. So the level of measurement used to measure any particular concept is a function of both the researcher's imagination and resources, and methodological needs.

We begin with nominal measurement, the level that has the fewest mathematical properties of the four levels. A nominal-level measure indicates that the values assigned to a variable represent only different categories or classifications for that variable. In such a case, no category is more or less than another category; they are simply different. For example, suppose we measure the religion of individuals by asking them to indicate whether they are Christian, Jewish, Muslim, or other. Since the four categories or values for the variable religion are simply different, the measurement is at a nominal level. Other common examples of nominal-level measures are gender, marital status, and state of residence. A nominal measure of partisan affiliation might have the following categories: Democrat, Republican, Green, Libertarian, other, and none. Numbers will be assigned to the categories when the data are coded for statistical analysis, but these numbers do not represent mathematical differences between the categories; any of the parties could be assigned any number, as long as those numbers are different from each other. In this sense, nominal-level measures provide the least amount of information about a concept.

An ordinal measurement has all of the properties of a nominal measure but also assumes observations can be compared in terms of having more or less of a particular attribute. Hence, the ordinal level of measurement captures more information about the measured concept and has more mathematical properties than a nominal-level measure. For example, we could create an ordinal measure of formal education completed with the following categories: "eighth grade or less," "some high school," "high school graduate," "some college," and "college degree or more." Here we are concerned not with the exact difference between the categories of education but only with whether one category is more or less than another. When coding this variable, we would assign higher numbers to higher categories of education. The intervals between the numbers have no meaning; all that matters is that the higher numbers represent more of the attribute than do the lower numbers. An ordinal variable measuring partisan affiliation with the categories "strong Republican," "weak Republican," "neither leaning Republican nor Democrat," "weak Democrat," and "strong Democrat" could be assigned codes 1, 2, 3, 4, 5 or 1, 2, 5, 8, 9 or any other combination of numbers, as long as they were in ascending or descending order.
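A minimal sketch of such ordinal coding, using the education categories above (the numeric codes are arbitrary; only their order matters):

```python
# A minimal sketch of coding an ordinal education variable. The labels come
# from the example above; the codes are arbitrary, and only their order
# carries meaning.
education_codes = {
    "eighth grade or less": 1,
    "some high school": 2,
    "high school graduate": 3,
    "some college": 4,
    "college degree or more": 5,
}

answers = ["some college", "high school graduate", "college degree or more"]
print([education_codes[a] for a in answers])
# [4, 3, 5] -- 5 > 4 > 3 means "more education," not "how much more"
```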

Dichotomous nominal variables, that is, nominal-level variables with only two categories, are nominal-level measures but are frequently treated as ordinal-level measures. For example, we could measure nuclear capability with two categories, where a country that has nuclear capabilities would be coded as a one and a country that does not would be coded as a zero. One could interpret this variable as nuclear capability being present or absent in a country, and therefore a one represents more of the concept, nuclear capability. To give another example, a person who did not vote in the last election lacks, or has less of, the attribute of having voted than a person who did vote.

Because nominal and ordinal measures rely on categories, it is important to make sure these variables are both exhaustive and exclusive. Exhaustive refers to making sure that all possible categories, or answer choices, are accounted for. The simplest solution to make sure a variable is exhaustive is to include an "other" category that can be used for values that are not represented in the identified categories. Exclusive refers to making sure that a single value or answer can only fit into one category. Each category should be distinct from the others, with no overlap.


The next level of measurement, an interval measurement, includes the properties of the nominal level (characteristics are different) and the ordinal level (characteristics can be put in a meaningful order). But unlike the preceding levels of measurement, the intervals between the categories or values assigned to the observations do have meaning. The value of a particular observation is important not just in terms of whether it is larger or smaller than another value (as in ordinal measures) but also in terms of how much larger or smaller it is. For example, suppose we record the year in which certain events occurred. If we have three observations (1950, 1962, and 1977), we know that the event in 1950 occurred twelve years before the one in 1962 and twenty-seven years before the one in 1977. A one-unit change (the interval) all along this measurement is identical in meaning: the passage of one year's time.

Another characteristic of an interval level of measurement that distinguishes it from the next level of measurement (ratio) is that an interval-level measure has an arbitrarily assigned zero point that does not represent the absence of the attribute being measured. For example, many time and temperature scales have arbitrary zero points. Thus, the year 0 CE does not indicate the beginning of time; if this were true, there would be no BCE dates. Nor does 0°C indicate the absence of heat; rather, it indicates the temperature at which water freezes. For this reason, with interval-level measurements we cannot calculate ratios; that is, we cannot say that 60°F is twice as warm as 30°F. So while the interval level of measurement captures more information and mathematical properties than the nominal and ordinal levels, it does not have the full properties of mathematics.

The final level of measurement is a ratio measurement. This type of measurement involves the full mathematical properties of numbers and contains the most possible information about a measured concept. That is, a ratio-level measure includes the values of the categories, the order of the categories, and the intervals between the categories; it also precisely indicates the relative amounts of the variable that the categories represent because its scale includes a meaningful zero. If, for example, a researcher is willing to claim that an observation with ten units of a variable possesses exactly twice as much of that attribute as an observation with five units of that variable, then a ratio-level measurement exists. The key to making this assumption is that a value of zero on the variable actually represents the absence of that variable. Because ratio measures have a true zero point, it makes sense to say that one measurement is x times another. It makes sense to say a sixty-year-old person is twice the age of a thirty-year-old person (60/30 = 2), whereas it does not make sense to say that 60°C is twice as warm as 30°C.23

Political science researchers have measured many concepts at the ratio level. People's ages, unemployment rates, percentage of the vote for a particular candidate, and crime rates are all measures that contain a zero point and possess the full mathematical properties of the numbers used. However, more political science research has probably relied on nominal- and ordinal-level measures than on interval- or ratio-level measures. This has restricted the types of hypotheses and analysis techniques that political scientists have been willing and able to use.

Identifying the level of measurement of variables is important, since it affects the data analysis techniques that can be used and the conclusions that can be drawn about the relationships between variables. Higher-order methods often require higher levels of measurement, while other methods have been developed for lower levels of measurement. The decision of which level of measurement to use is not always a straightforward one, and uncertainty and disagreement often exist among researchers concerning these decisions. Few phenomena inherently require one particular level of measurement. Often, a phenomenon can be measured with any level of measurement, depending on the particular technique designed by the researcher and the claims the researcher is willing to make about the resulting measure.

Working with Precision: Too Little or Too Much

Researchers usually try to devise as high a level of measurement for their concepts as possible (nominal being the lowest level of measurement and ratio the highest). With a higher level of measurement, more advanced data analysis techniques can be used, and more precise statements can be made about the relationships between variables. Thus, researchers measuring attitudes or concepts with multiple operational definitions often construct a scale or an index from nominal-level measures that permits at least ordinal-level comparisons between observations. We discuss the construction of indexes and scales in greater detail in the following paragraphs.

It is easy to transform ratio-level information (e.g., age in number of years) into ordinal-level information (e.g., age groups). However, if you start with the ordinal-level measure, age groups, you will not have each person's actual age. If you decide you want to use a person's actual age, you will have to collect that data; it cannot be created from an ordinal-level measurement. Similarly, a researcher investigating the effect of campaign spending on election outcomes could use a ratio-level variable measuring how much each candidate spent on his or her campaign. This information could be used to construct a new variable indicating how much more one candidate spent than the other, or simply whether or not a candidate spent more than his or her opponent. Candidate spending could also be grouped into ranges.
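A minimal sketch of collapsing a ratio-level measure into an ordinal one, assuming invented bin edges and labels (here using the pandas library); note that the reverse transformation would be impossible:

```python
# A minimal sketch of collapsing ratio-level ages into ordinal age groups.
# The bin edges and labels are invented for illustration.
import pandas as pd

ages = pd.Series([19, 34, 52, 67, 23, 45])  # ratio level: true zero, full precision

age_groups = pd.cut(
    ages,
    bins=[0, 29, 49, 64, 120],
    labels=["29 or younger", "30-49", "50-64", "65+"],
)
print(age_groups.tolist())
# ['29 or younger', '30-49', '50-64', '65+', '29 or younger', '30-49']
```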

Nominal and ordinal variables with many categories or interval- and ratio-level measures using more decimal places are more precise than measures with fewer categories or decimal places, but sometimes the result may provide more information than can be used. Researchers frequently start out with ratio-level measures or with ordinal and nominal measures with quite a few categories but then collapse or combine the data to create groups or fewer categories. They do this so that they have enough cases in each category for statistical analysis or to make comparisons easier to follow. For example, one might want to present comparisons simply between Democrats and Republicans rather than presenting data broken down into categories of strong, moderate, and weak for each party.

It may seem contradictory now to point out that extremely precise measures also may create problems. For example, measures with many response possibilities take up space if they are questions on a written questionnaire or require more time to explain if they are included in a telephone survey. Such questions may also confuse or tire survey respondents. A more serious problem is that they may lead to measurement error. Think about the possible responses to a question asking respondents to use a 100-point scale (called a thermometer scale) to indicate their support for or opposition to a political candidate, assuming that 50 is considered the neutral position and 0 is least favorable or "coldest" and 100 most favorable. Some respondents may not use the whole scale (to them, no candidate ever deserves more than an 80 or less than a 20), whereas others may use the ends and the very middle of the scale and ignore the scores in between. We might predict that a person who gives a candidate a 100 is more likely to vote for that candidate than is a person who gives the same candidate an 80, but in reality they may like the candidate pretty much the same and would be equally likely to vote for the candidate.

Another problem with overly precise measurements is that they may be unreliable.

If asked to rate candidates on more than one occasion, respondents could vary

slightly the number that they choose, even if their opinion has not changed.

Multi-Item Measures

Many measures consist of a single item. For example, the measures of party

identification, whether or not one party controls Congress, the percentage of

the vote received by a candidate, how concerned about an issue a person is, the

policy area of a judicial case, and age are all based on a single measure of each

phenomenon in question. Often, however, researchers need to devise measures of

more complicated phenomena that have more than one facet or dimension. For

example, internationalism, political ideology, political knowledge, dispersion of

political power, and the extent to which a person is politically active are complex

phenomena or concepts that may be measured in many different ways.

In this situation, researchers often develop a measurement strategy that allows them to capture numerous aspects of a complex phenomenon while representing the existence of that phenomenon in particular cases with a single representative value. Usually this involves the construction of a multi-item index or scale representing the several dimensions of the phenomenon. These multi-item measures are useful because they enhance the accuracy of a measure, simplify a researcher's data by

reducing them to a more manageable size, and increase the level of measurement

of a phenomenon. In the remainder of this section, we describe several common

types of indexes and scales.

Indexes
A summation index is a method of accumulating scores on individual items to

form a composite measure of a complex phenomenon. An index is constructed by assigning a


range of possible scores for a certain number of items, determining the

score for each item for each observation, and then combining the scores for each

observation across all the items. The resulting summary score is the representative

measurement of the phenomenon.

A researcher interested in measuring how much freedom exists in different countries,

for example, might construct an index of political freedom by devising a list of items

germane to the concept, determining where individual countries score on each item,

and then adding these scores to get a summary measure. In table 5-3, such a hypothetical

index is used to measure the amount of freedom in countries A through E.

The index in table 5-3 is a simple, additive one; that is, each item counts equally

toward the calculation of the index score, and the total score is the summation of

the individual item scores. However, indexes may be constructed with more complicated

aggregation procedures and by counting some items as more important

than others. In the preceding example, a researcher might consider some indicators

of freedom as more important than others and wish to have them contribute more

to the calculation of the final index score. This could be done either by weighting (multiplying)
some item scores by a number indicating their importance or by

assigning a higher score than 1 to those attributes considered more important.
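As a sketch of both aggregation procedures, the snippet below computes a simple additive index and a weighted variant. The countries, items, scores, and weights are all invented for illustration; they are not the values in table 5-3.

```python
# Hypothetical freedom-index items scored 0 (absent) or 1 (present)
# for each country; items, scores, and weights are invented.
countries = {
    "A": {"free_press": 1, "free_elections": 1, "independent_courts": 1},
    "B": {"free_press": 1, "free_elections": 0, "independent_courts": 1},
    "C": {"free_press": 0, "free_elections": 0, "independent_courts": 1},
}

# Simple additive index: every item counts equally toward the total.
additive = {name: sum(items.values()) for name, items in countries.items()}

# Weighted index: multiply each item score by a weight reflecting its
# assumed importance before summing.
weights = {"free_press": 2, "free_elections": 3, "independent_courts": 1}
weighted = {
    name: sum(weights[item] * score for item, score in items.items())
    for name, items in countries.items()
}

print(additive)  # {'A': 3, 'B': 2, 'C': 1}
print(weighted)  # {'A': 6, 'B': 3, 'C': 1}
```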

Indexes are often used with public opinion surveys to measure political attitudes.

This is because attitudes are complex phenomena and we usually do not know

enough about them to devise single-item measures. So we often ask several questions

of people about a single attitude and aggregate the answers to represent the

attitude. A researcher might measure attitudes toward abortion, for example, by

asking respondents to choose one of five possible responses-strongly agree, agree,

undecided, disagree, and strongly disagree-to the following three statements:

(1) Abortions should be permitted in the first three months of pregnancy. (2) Abortions

should be permitted if the woman's life is in danger. (3) Abortions should be permitted

whenever a woman wants one.

An index of attitudes toward abortion could be computed by assigning numerical values to each response (such as 1 for strongly agree, 2 for agree, 3 for undecided, and so on) and then adding the values of a respondent's answers to these three questions. (The researcher would have to decide what to do when a respondent did not answer one or more of the questions.) The lowest possible score in this case would

be a 3, indicating the most extreme pro-abortion attitude, and the highest possible

score would be a 15, indicating the most extreme anti-abortion attitude. Scores in

between would indicate varying degrees of approval of abortion.
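A sketch of this scoring rule might look like the following; the 1-to-5 response codes follow the scheme just described, while the decision to drop respondents with any missing answer is only one of several defensible choices.

```python
# Sum three abortion items coded 1 (strongly agree) to 5 (strongly
# disagree). Respondents with a missing answer get no score here --
# one common, but not the only, way to handle item nonresponse.
CODES = {"strongly agree": 1, "agree": 2, "undecided": 3,
         "disagree": 4, "strongly disagree": 5}

def abortion_index(responses):
    """responses: the three answers, with None marking a nonresponse."""
    if any(r is None for r in responses):
        return None  # drop respondents with incomplete answers
    return sum(CODES[r] for r in responses)

print(abortion_index(["strongly agree", "agree", "undecided"]))  # 6
print(abortion_index(["disagree", None, "agree"]))               # None
# Scores run from 3 (most pro-abortion) to 15 (most anti-abortion).
```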

Indexes are typically fairly simple ways of producing single representative scores of

complicated phenomena such as political attitudes. They are probably more accurate

than most single-item measures, but they may also be flawed in important

ways. Aggregating scores across several items assumes, for example, that each item is equally important to the summary measure of the concept and that the items used faithfully encompass the domain of the concept. Although individual item scores can be weighted to change their contribution to the summary measure, the researcher often has little information upon which to base a weighting scheme.

Several standard indexes are often used in political science research. The FBI crime index, the Consumer Confidence Index, and the Consumer Price Index, for example, have been used by many researchers. Before using these or any other readily available index, you should familiarize yourself with its construction and be aware of any questions raised about its validity. Although simple summation indexes are generally more accurate than single-item measures of complicated phenomena, it is often unclear how valid they are or what level of measurement they represent. For example, is the index of freedom an ordinal-level measure, or could it be an interval-level or even a ratio-level measure? Another possible issue with indexes such as the Consumer Price Index is that what goes into its calculation can change over time.

Scales

Although indexes are generally an improvement over single-item measures, their

construction also contains an element of arbitrariness. Both the selection of particular

items making up the index and the way in which the scores on individual items

are aggregated are based on the researcher's judgment. Scales are also multi-item

measures, but the selection and combination of items in them is accomplished more

systematically than is usually the case for indexes. Over the years, several different
kinds of multi-item scales have been used frequently in political science research.

We discuss three of them: Likert scales, Guttman scales, and Mokken scales.

A Likert scale score is calculated from the scores obtained on individual items.

Each item generally asks a respondent to indicate a degree of agreement or disagreement

with the item, as with the abortion questions discussed earlier. A Likert scale differs from an
index, however, in that once the scores on each of the items

are obtained, only some of the items are selected for inclusion in the calculation of

the final score. Those items that allow a researcher to distinguish most readily those

scoring high on an attribute from those scoring low will be retained, and a new

scale score will be calculated based only on those items.

For example, consider the researcher interested in measuring the liberalism of a group of respondents. Since definitions of liberalism vary, the researcher cannot be sure how many aspects of liberalism need to be measured. With Likert scaling, the researcher would begin with a large group of questions thought to express various aspects of liberalism with which respondents would be asked to agree or disagree. A provisional Likert scale for liberalism, then, might look like the one in table 5-4.

In practice, a set of questions like this would be scattered throughout a questionnaire

so that respondents do not see them as related. Some of the questions might

also be worded in the opposite way (that is, so an "agree" response is a conservative

response) to ensure genuine answers.

The respondents' answers to these eight questions would be summed to produce

a provisional score. The scores in this case can range from 8 to 40. Then the responses of the
most liberal and the most conservative people to each question

would be compared; any questions with similar answers from the disparate respondents

would be eliminated-such questions would not distinguish liberals from

conservatives. A new summary scale score for all the respondents would be calculated

from the questions that remained. A statistic called Cronbach's alpha, which

measures internal consistency of the items in the scale and has a maximum value of

1.0, is used to determine which items to drop from the scale. The rule of thumb is

that Cronbach's alpha should be 0.8 or above; items are dropped from the scale one

at a time until this value is reached.25
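For readers curious about the statistic itself, here is a from-scratch sketch of Cronbach's alpha on invented item scores; in practice one would rely on a statistics package, and the item-selection loop (dropping items one at a time) is omitted.

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per item, each the same length
    (one score per respondent)."""
    k = len(items)
    respondents = list(zip(*items))            # rows become respondents
    totals = [sum(r) for r in respondents]     # each respondent's sum
    item_var_sum = sum(pvariance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var_sum / pvariance(totals))

# Invented 1-5 scores for three items and five respondents.
items = [
    [5, 4, 4, 2, 1],
    [5, 5, 4, 2, 2],
    [4, 4, 5, 1, 2],
]
print(round(cronbach_alpha(items), 2))  # 0.95: items hang together well
```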

Likert scales are improvements over multi-item indexes because the items that make up the multi-item measure are selected in part based on the respondents' behavior rather than on the researcher's judgment. Likert scales suffer two of the other defects of indexes, however. The researcher cannot be sure that all the dimensions of a concept have been measured, and the relative importance of each item is still determined arbitrarily.

The Guttman scale also uses a series of items to produce a scale score for respondents.

Unlike the Likert scale, however, a Guttman scale presents respondents with a

range of attitude choices that are increasingly difficult to agree with; that is, the items

composing the scale range from those easy to agree with to those difficult to agree

with. Respondents who agree with one of the "more difficult" attitude items will also

generally agree with the "less difficult" ones. (Guttman scales have also been used to

measure attributes other than attitudes. Their main application has been in the area

of attitude research, however, so an example of that type is used here.)

Let us return to the researcher interested in measuring attitudes toward abortion.

He or she might devise a series of items ranging from "easy to agree with" to "difficult

to agree with." Such an approach might be represented by the following items:

Do you agree or disagree that abortions should be permitted:

1. When the life of the woman is in danger

2. In the case of incest or rape

3. When the fetus appears to be unhealthy

4. When the father does not want to have a baby

5. When the woman cannot afford to have a baby

6. Whenever the woman wants one

This array of items seems likely to result in responses consistent with Guttman

scaling. A respondent agreeing with any one of the items is likely also to agree with

those items numbered lower than that one. This would result in the "stepwise"

pattern of responses characteristic of a Guttman scale.

Suppose six respondents answered this series of questions, as shown in table 5-5.

Generally speaking, the pattern of responses is as expected; those who agreed with

the "most difficult" questions were also likely to agree with the "less difficult" ones.

However, the responses of three people (2, 4, and 5) to the question about the father's preferences do not fit the pattern. Consequently, that question would be removed from the scale. Once that has been done, the stepwise pattern becomes clear.

With real data, it is unlikely that every respondent would give answers that fit the

pattern perfectly. For example, in table 5-5, respondent 6 gave an "agree" response

to the question about incest or rape. This response is unexpected and does not fit

the pattern. Therefore, we would be making an error if we assigned a scale score of 0

to respondent 6. When the data fit the scale pattern well (number of errors is small),

researchers assume that the scale is an appropriate measure and that the respondent's

"error" may be "corrected" (in this case, either the "agree" in the case of incest

or rape or the "disagree" in the case of the life of the woman). There are standard

procedures to follow to determine how to correct the data to make it conform to the scale pattern. We emphasize, however, that this is done only if the changes are few.
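The cumulative logic can be sketched in a few lines. Given a respondent's answers ordered from the easiest item to the most difficult, the scale score is the number of agreements, and one simple way to count "errors" is to compare the answers against the perfect stepwise pattern implied by that score. The response patterns below are invented for illustration.

```python
# Sketch of Guttman scoring: answers run from the "easiest" item to
# agree with to the "most difficult"; True means agree.
def guttman_score_and_errors(responses):
    """Return (scale score, deviations from the perfect pattern)."""
    score = sum(responses)
    # A perfect pattern agrees with exactly the first `score` items.
    ideal = [i < score for i in range(len(responses))]
    errors = sum(a != b for a, b in zip(responses, ideal))
    return score, errors

print(guttman_score_and_errors([True, True, True, False, False]))   # (3, 0)
print(guttman_score_and_errors([True, False, True, False, False]))  # (2, 2)
# Many errors across respondents suggest the items do not form a
# cumulative scale, or that a misbehaving item should be dropped.
```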

Guttman scales differ from Likert scales in that, in the former case, generally only one set of responses will yield a particular scale score. That is, to get a score of 3 on the abortion scale, a particular pattern of responses (or something very close to it) is necessary. In the case of a Likert scale, however, many different patterns of

responses can yield the same scale score. A Guttman scale is also much more difficult

to achieve than a Likert scale, since the items must have been ordered and be

perceived by the respondents as representing increasingly more difficult responses

reflecting the same attitude.

Both Likert and Guttman scales have shortcomings in their level of measurement.

The level of measurement produced by Likert scales is, at best, ordinal (since we

do not know the relative importance of each item and so cannot be sure that a 5

answer on one item is the same as a 5 answer on another), and the level of measurement

produced by Guttman scales is usually assumed to be ordinal.

Another type of scaling procedure, called Mokken scaling, also analyzes responses

to multiple items by respondents to see if, for each item, respondents can be

ordered and if items can be ordered.26 The Mokken scale was used by Saundra K.

Schneider, William G. Jacoby, and Daniel C. Lewis to see if there was structure and

coherence in public opinion regarding the distribution of responsibilities between


the federal government and state and local governments.27 Respondents were asked

whether they thought state or local governments "should take the lead" rather than

the national government for thirteen different policy areas. The scaling procedure

allowed the researchers to see if a specific sequence of policies emerged while moving

from one end of the scale to the other. One end of the scale would indicate

maximal support for national policy activity, while the other end would indicate

maximal support for subnational government policy responsibility.

The results of their analysis are shown in figure 5-1. The scale runs from 0 to 13, with 0 indicating that the national government should take the lead in all thirteen policy areas and a score of 13 indicating that the respondent believes that state and

local governments should take the lead in all policy areas. A person at any scale

score believes that the state and local governments should take the lead in all policy

areas that fall below that score. Thus, a person with a score of 9 believes that

the national government should take the lead in health care, equality for women,

protecting the environment, and equal opportunity, and that state and local governments

should take the lead responsibility for natural disasters down to urban

development. The bars in the figure correspond to the percentage of respondents who
received a particular score. Thus, just slightly more than 5 percent of the

respondents thought that state and local governments should take the lead in all

policy areas, whereas only 1 percent of the respondents thought the national government

should take the lead in all thirteen policy areas. Most respondents divided

up the responsibilities. The authors concluded from their analysis that the public does have a
rationale behind its preferences for the distribution of policy responsibilities

between national versus state and local governments.
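The construction of the scale score itself is straightforward to sketch: it is the count of policy areas for which a respondent picks state and local government. The policy names and answers below are invented (and only six of the thirteen areas are shown); testing whether the items are actually orderable, which is the heart of Mokken scaling, requires specialized software and is not attempted here.

```python
# Sketch of the scale score behind figure 5-1: count the policy areas
# for which a respondent says state and local governments should lead.
# True = state/local should take the lead; answers are hypothetical.
answers = {
    "health care": False,
    "equality for women": False,
    "protecting the environment": False,
    "equal opportunity": False,
    "natural disasters": True,
    "urban development": True,
}

score = sum(answers.values())
print(score)  # 2: state/local lead in two areas, national in the rest
```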

The procedures described so far for constructing multi-item measures are fairly

straightforward. There are other advanced statistical techniques for summarizing

or combining individual items or variables. For example, it is possible that several

variables are related to some underlying concept. Factor analysis is a statistical

technique that may be used to uncover patterns across measures. It is especially

useful when a researcher has a large number of measures and when there is uncertainty

about how the measures are interrelated.
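To show the idea rather than any particular study, here is a minimal factor-analysis sketch using scikit-learn on synthetic data in which six observed items are generated from two hidden dimensions; the loadings recovered by the model should cluster the items accordingly.

```python
# Minimal factor-analysis sketch on synthetic data (not from any study
# discussed in this chapter).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))  # two hidden attitude dimensions
loadings = np.array([
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.0],   # items 1-3 driven by factor 1
    [0.0, 1.0], [0.1, 0.9], [0.0, 0.8],   # items 4-6 driven by factor 2
])
items = latent @ loadings.T + 0.3 * rng.normal(size=(n, 6))

fa = FactorAnalysis(n_components=2).fit(items)
# One row per item; large absolute loadings show which dimension an
# item belongs to, grouping the six items into two summary variables.
print(np.round(fa.components_.T, 2))
```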

An example is the analysis by Daniel D. Dutcher, who conducted research on the


attitudes of owners of streamside property toward the water-quality improvement

strategy of planting trees in a wide band (called riparian buffers) along the sides of streams.28
He asked landowners to rate the importance of twelve items thought

to affect the willingness of landowners to create and maintain riparian buffers. He

wanted to know whether the attitudes could be grouped into distinct dimensions that could be used as summary variables instead of using each of the twelve items separately. Using factor analysis, he found that the items factored into three dimensions.

These dimensions and the items included in each dimension are listed in

table 5-6. The first dimension, which he labeled "maintaining property aesthetics,"

included items such as maintaining a view of the stream, neatness, and maintaining

open space. A second dimension contained items related to concern over water

quality. The third dimension related to protecting property against damage or loss.

Factor analysis is just one of many techniques developed to explore the dimensionality

of measures and to construct multi-item scales. The readings listed at the end

of this chapter include some resources for students who are especially interested in

this aspect of variable measurement.

Through indexes and scales, researchers attempt to enhance both the accuracy and

the precision of their measures. Although these multi-item measures have received

most use in attitude research, they are often useful in other endeavors as well. Both

indexes and scales require researchers to make decisions regarding the selection of

individual items and the way in which the scores on those items will be combined

to produce more useful measures of political phenomena.

Conclusion

To a large extent, a research project is only as good as the measurements that are

developed and used in it. Inaccurate measurements will interfere with the testing

of scientific explanations for political phenomena and may lead to erroneous

conclusions. Imprecise measurements will limit the extent of the comparisons that

can be made between observations and the precision of the knowledge that results

from empirical research.

Despite the importance of good measurement, political science researchers often


find that their measurement schemes are of uncertain accuracy and precision.

Abstract concepts are difficult to measure in a valid way, and the practical constraints

of time and money often jeopardize the reliability and precision of measurements.

The quality of a researcher's measurements makes an important contribution

to the results of his or her empirical research and should not be lightly or routinely

sacrificed. Sometimes the accuracy of measurements may be enhanced through the use of

multi-item measures. With indexes and scales, researchers select multiple indicators

of a phenomenon, assign scores to each of these indicators, and combine those

scores into a summary measure. Although these methods have been used most

frequently in attitude research, they can also be used in other situations to improve

the accuracy and precision of single-item measures.

Measurement. The process by which phenomena are

observed systematically and represented by scores or numerals.

Mokken scale. A type of scaling procedure that assesses

the extent to which there is order in the responses of

respondents to multiple items. Similar to Guttman scaling.

Nominal measurement. A measure for which different

scores represent different, but not ordered, categories.

Operational definition. The rules by which a concept

is measured and scores assigned.

Operationalization. The process of assigning numerals

or scores to a variable to represent the values of a concept.

Ordinal measurement. A measure for which the scores

represent ordered categories that are not necessarily

equidistant from each other.

Random measurement error. Errors in measurement

that have no systematic direction or cause.
