Richard G. Netemeyer, William O. Bearden, and Subhash Sharma. Scaling Procedures: Issues and Applications. SAGE Publications, Inc. (2003)
All rights reserved. No part of this book may be reproduced or utilized in any form or
by any means, electronic or mechanical, including photocopying, recording, or by
any information storage and retrieval system, without permission in writing from the
publisher.
For information:
Sage Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: [email protected]
Sage Publications Ltd.
6 Bonhill Street
London EC2A 4PU
United Kingdom
Sage Publications India Pvt. Ltd.
B-42 Panchsheel Enclave
New Delhi 110 017 India
CONTENTS

Acknowledgments
Foreword
2 Dimensionality
    Introduction
    Dimensionality of a Construct, Items, and a Set of Items
        Unidimensionality
        Multidimensionality
        Does Unidimensionality of a Set of Items Imply Unidimensionality of the Items or the Construct?
    Relevance of Unidimensionality
    How to Assess Dimensionality of Constructs
        Exploratory Factor Analysis (EFA)
        Confirmatory Factor Analysis (CFA)
    Summary
    Appendix 2A: Computing Partial Correlations
3 Reliability
    Introduction
    The True-Score Model
    Relationship Between the True-Score Model and the Factor Model
    Types of Reliability
        Test-Retest Reliability
        Alternative-Form Reliability
        Internal Consistency Methods
        Split-Half Reliability
        Coefficient Alpha (α)
        Coefficient Alpha Example
        Coefficient Alpha and Dimensionality
        Coefficient Alpha, Scale Length, Interitem Correlation, and Item Redundancy
    Generalizability Theory
        G-Study Illustration
        Generalizability Coefficient
        Decision Studies
    Summary
4 Validity
    Overview of Construct Validity
    Translation Validity
        Face Validity
        Content Validity
    Criterion Validity
        Predictive and Post-dictive Validity
        Concurrent Validity
        Convergent and Discriminant Validity
        Known-Group Validity
    Nomological Validity
    Socially Desirable Response Bias
    Summary
References
Author Index
Subject Index
About the Authors
ACKNOWLEDGMENTS
The authors would like to thank the following people for their helpful comments on earlier drafts of this text: David Hardesty, Ajith Kumar, Ken Manning, Susan Netemeyer, Ed Rigdon, Eric Spangenberg, and Kelley Tepper Tian. We would also like to express our sincere appreciation to A. J. Sobczak for his assistance in editing and improving this text.
FOREWORD
Effective measurement is a cornerstone of scientific research. There are
several strategies for developing and refining measures, and the relevance
of a given strategy will depend on what type of scientific phenomenon is being
measured. In this book, we focus on developing and validating measures of
latent social-psychological constructs. Given their latent nature, these con-
structs represent abstractions that can be assessed only indirectly. The indirect
assessment of these constructs is accomplished via self-report/paper-and-pencil
measures on which multiple items or indicators are used to measure the
construct, that is, “scaling” a construct. The purpose of our book is to discuss
issues involved in developing and validating multi-item scales of self-report/
paper-and-pencil measures.
In Chapter 1, we emphasize the importance of theory in scale develop-
ment and briefly overview the concepts of dimensionality, reliability, and
validity. We also offer a four-step approach for scale development that
encompasses these three concepts. The book then proceeds with individual
chapters regarding scale dimensionality (Chapter 2), reliability (Chapter 3),
and validity (Chapter 4). Chapters 5, 6, and 7 elaborate on the four-step
approach and offer empirical examples relevant to each step. The four-step
approach is consistent with much of the extant scale development literature
(Churchill, 1979; Clark & Watson, 1995; DeVellis, 1991; Haynes, Nelson, &
Blaine, 1999; Nunnally & Bernstein, 1994; Spector, 1992). We are certainly
indebted to the many authors who have proposed appropriate scale develop-
ment procedures and/or described quality scale development endeavors. As
the readers of this book undoubtedly will realize, the recommended steps, as
well as the overlapping activities assumed to constitute each step, are offered
as a logical, sequential process; however, the process of scale development
may well be an iterative and ongoing procedure in which restarts are
ONE
INTRODUCTION AND OVERVIEW
PURPOSE OF THE BOOK
PERSPECTIVES ON MEASUREMENT IN THE SOCIAL SCIENCES
What Is Measurement?
Theories can be tested adequately only to the extent that the attributes of the theory (constructs)
are adequately measured. When agreed upon procedures exist for measuring
the attributes of interest, the objectivity of theory tests is enhanced.
Second, standardization produces quantifiable numerical results. Again,
though this text does not address classification per se, such quantification does
allow for the creation of categories (e.g., low, medium, high) for mathematical
and statistical analyses (ANOVA), or for use as factor levels in experimental
designs. Quantification also enhances communication and generalizability of
results. Knowledge accumulates in the social sciences when researchers com-
pare their results with the results of previous studies. When the same, standard-
ized measures are used across scientific applications, results termed as “low” in
self-esteem or “high” in self-esteem have common meaning across researchers.
This enhances both the communication of results and generalizability of
findings.
Third, measure/scale development is a time-consuming endeavor. If a
measure has been well developed, however, the time spent is also time “well
spent.” Once standardization occurs, the measure is available for use with
little or no time invested because of its agreed upon standards. At the very
heart of repeatability and standardization are the measurement properties of
reliability and validity. These two concepts are elaborated upon later in this
chapter and extensively discussed and examined in Chapters 3 through 7.
LATENT CONSTRUCTS
Theory is concerned not only with the latent construct of interest but with the
validity of the measurement of the construct as well. The two, theory and valid-
ity, are intertwined: The relevance of a latent construct largely depends on
its “construct validity.” Simply stated, construct validity is an assessment of the
degree to which a measure actually measures the latent construct it is intended to
measure. Cronbach and Meehl (1955) stated that demonstrating construct valid-
ity involves at least three steps: (a) specifying a set of theoretical constructs and
their relations (a theory), (b) developing methods to measure the constructs of
the theory, and (c) empirically testing how well manifest (observable) indicators
(items) measure the constructs in the theory and testing the hypothesized relations
among the constructs of theory as well (i.e., the nomological net). Furthermore,
assessing construct validity is an ongoing process. One study supporting a con-
struct’s validity is not enough to conclude that the measure has been validated.
Multiple tests and applications over time are required, and some of these may
require a refinement of the construct itself, as well as its measure. As stated by
Clark and Watson (1995), “The most precise and efficient measures are those
with established construct validity; they are manifestations of constructs in an
articulated theory that is well supported by empirical data” (p. 310).
OVERVIEW OF DIMENSIONALITY, RELIABILITY, AND VALIDITY
Dimensionality
Reliability
DeVellis, 1991; Nunnally & Bernstein, 1994; Robinson et al., 1991). The most
widely used internal consistency reliability coefficient is Cronbach’s (1951)
coefficient alpha. (Others are briefly discussed later in this text. For now, we
limit our discussion to coefficient alpha.) Although a number of rules of thumb
also exist concerning what constitutes an acceptable level of coefficient alpha,
scale length must be considered. As the number of items increases, alpha will
tend to increase. Because parsimony is also a concern in measurement
(Clark & Watson, 1995; Cortina, 1993), an important question is “How many
items does it take to measure a construct?” The answer to this question
depends partially on the domain and dimensions of the construct. Naturally, a
construct with a wide domain and multiple dimensions will require more items
to adequately tap the domain/dimensions than a construct with a narrow
domain and few dimensions. Given that most scales are self-administered and
that respondent fatigue and/or noncooperation need to be considered, scale
brevity is often advantageous (Churchill & Peter, 1984; Cortina, 1993;
DeVellis, 1991; Nunnally & Bernstein, 1994).
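Because the arithmetic behind this point is simple, the following SAS DATA step (our own sketch, not from the text) applies the standardized form of coefficient alpha, $k\bar{r}/(1+(k-1)\bar{r})$, discussed in Chapter 3, holding the average interitem correlation fixed at an assumed value of .30:

data alpha_by_k;
   rbar = 0.30;     /* assumed average interitem correlation */
   do k = 2 to 20;
      alpha = (k*rbar)/(1 + (k - 1)*rbar);  /* standardized coefficient alpha */
      output;
   end;
run;
proc print data=alpha_by_k noobs;
   var k alpha;
run;

With an average interitem correlation of .30, alpha rises from roughly .46 with 2 items to roughly .90 with 20 items, even though the items are no more strongly interrelated; this is why alpha should always be judged in light of scale length.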
With the advent of structural equation modeling, other tests of internal
consistency or internal structure/stability became available. Composite relia-
bility (construct reliability), which is similar to coefficient alpha, can be
calculated directly from the LISREL, CALIS, EQS, or AMOS output (cf.
Fornell & Larcker, 1981). A more stringent test of internal structure/stability
involves assessing the amount of variance captured by a construct’s measure
in relation to the amount of variance due to measurement error—the average
variance extracted (AVE). By using a combination of the criteria above
(i.e., corrected item-to-total correlations, examining the average interitem
correlation, coefficient alpha, composite reliability, and AVE), scales can be
developed in an efficient manner without sacrificing internal consistency.
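As a sketch of how these two indices are computed from a CFA solution, the following SAS DATA step uses the formulas of Fornell and Larcker (1981); the standardized loadings below are hypothetical:

data cr_ave;
   array lam{4} l1-l4 (0.9 0.8 0.7 0.9);   /* hypothetical standardized loadings */
   sum_lam = 0;  sum_lam2 = 0;  sum_err = 0;
   do i = 1 to 4;
      sum_lam  = sum_lam  + lam{i};
      sum_lam2 = sum_lam2 + lam{i}**2;
      sum_err  = sum_err  + (1 - lam{i}**2);   /* standardized error variance */
   end;
   cr  = sum_lam**2/(sum_lam**2 + sum_err);    /* composite reliability       */
   ave = sum_lam2/4;                           /* average variance extracted  */
   put cr= ave=;
run;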
Construct Validity
OVERVIEW OF RECOMMENDED PROCEDURES AND STEPS IN SCALE DEVELOPMENT
As is clearly evidenced from the preceding pages, numerous articles and books
advocate “how” to develop a scale (e.g., Churchill, 1979; Clark & Watson,
1995; DeVellis, 1991; Haynes et al., 1999; Nunnally & Bernstein, 1994;
Spector, 1992). Steps and procedures vary from author to author based on the
goals and purposes of the measurement. Still, most writings do share a com-
mon set of guidelines for scale development. Given our focus, the steps and
procedures used to guide this text are based on scaling self-report paper-and-
pencil measures of latent social-psychological constructs. Figure 1.1 offers a
diagram of the steps we recommend in scale development. Each of these steps
is elaborated upon in upcoming chapters. For now, we offer a brief overview
of what each step entails.
Step 3: Designing and Conducting Studies to Develop and Refine the Scale
Issues to Consider:
(a) Pilot testing as an item-trimming procedure
(b) The use of several samples from relevant populations for scale
development
(c) Designing the studies to test psychometric properties
(d) Initial item analyses via exploratory factor analyses (EFAs)
(e) Initial item analyses and internal consistency estimates
(f) Initial estimates of validity
(g) Retaining items for the next set of studies
This second step involves generating and judging a pool of items from
which the scale will be derived. Several issues must be considered, including
the following: (a) theoretical assumptions about items (e.g., domain sam-
pling), (b) generating potential items and determining the response format
(i.e., how many items as an initial pool, dichotomous vs. multichotomous
response formats, and item wording issues), (c) the focus on “content” validity
and its relation to theoretical dimensionality, and (d) item judging (both expert
and layperson)—the focus on “content” and “face” validity.
Once a suitable pool of items has been generated and judged, empirical
testing of the items on relevant samples is the next step. Issues and procedures
to be considered include (a) pilot testing as an item-trimming procedure,
(b) the use of several samples from relevant populations for scale develop-
ment, (c) designing studies to test psychometric properties, (d) initial item
analyses via exploratory factor analyses (EFAs), (e) initial item analyses and
internal consistency estimates, (f) initial estimates of validity, and (g) retaining
items for the next set of studies.
Several studies should be used to help finalize the scale. Many of the
procedures used and issues involved in refining the scale will also be applicable
to deriving the final form of the scale. These include (a) the importance of
several samples from relevant populations, (b) designing the studies to test the
various types of validity, (c) item analyses via EFA with a focus on the
consistency of EFA results across samples from Step 3 to Step 4 in testing an
initial factor structure, (d) item analyses and confirmatory factor analyses
(CFAs), (e) additional item analyses via internal consistency estimates,
(f) additional estimates of validity, (g) establishing norms across studies, and
(h) given that numerous studies have been done across various settings, applying
generalizability theory to the final form of the scale.
In this opening chapter, we have tried to provide the reader with an overview
of the purpose of our text. To reiterate, our purpose is to focus on measuring
latent perceptual social-psychological constructs via paper-and-pencil self-
reports. For a construct to be valuable, it must have theoretical and/or practi-
cal relevance to the social scientist. Thus, a careful consideration must be
made of what the construct of interest predicts and/or what predicts the con-
struct of interest. Here, the notion of theory and “knowing” the literature is all-
important. Furthermore, given the importance of measurement in the social
sciences, any measure must be valid to allow for constructing confident infer-
ences from empirical studies. Such validity rests on how well the latent
construct being measured is based in theory.
Also in this opening chapter, we have overviewed the concepts of dimen-
sionality, reliability, and validity, as well as summarized a series of steps for
deriving measures with adequate psychometric properties. The remainder of
our text elaborates on dimensionality, reliability, and validity, and the four
steps in scale construction. Specifically, Chapter 2 discusses dimensionality,
its relation to reliability and validity, and procedures for establishing dimen-
sionality. Chapter 3 discusses reliability, its relation to validity, and proce-
dures for establishing reliability, including G-Theory. Chapter 4 discusses
validity and procedures for providing evidence of validity. Chapters 5, 6, and
7 provide detailed examples of the four steps in scale development, and
Chapter 8 offers concluding remarks, with a focus on the need to constantly
reevaluate constructs, their measures, and the validity of the measures.
TWO
DIMENSIONALITY
INTRODUCTION
Figure 2.1 [A single-factor model: F1, with standardized loadings of .9, .8, .7, and .9 on the indicators x1, x2, x3, and x4]
DIMENSIONALITY OF A CONSTRUCT, ITEMS, AND A SET OF ITEMS
Unidimensionality
Table 2.1 offers a correlation matrix for the factor (or measurement)
model given in Figure 2.1. Consistent with our focus on “reflective” or
“effect” measures, Figure 2.1 shows that F1 affects x1, x2, x3, and x4. It is also
assumed that the x variables can be measured or observed, whereas F1 cannot
be measured or observed. That is, F1 typically is referred to as an unobservable
or latent construct, and the x variables are referred to as indicators, items, or
manifest variables of the latent construct. The partial correlation, r12•3,
between any pair of variables (i.e., variables 1 and 2) after removing or
partialing the effect of a third variable (i.e., variable 3) is given by
$$r_{12\cdot3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1-r_{13}^2}\sqrt{1-r_{23}^2}} \qquad (2.1)$$
Table 2.1  Correlation Matrix for the Model in Figure 2.1

       F1     x1     x2     x3     x4
F1    1.00
x1    0.90   1.00
x2    0.80   0.72   1.00
x3    0.70   0.63   0.56   1.00
x4    0.90   0.81   0.72   0.63   1.00
Using the values in Table 2.1 and the above formula, the partial correlation between x1 and x2 after partialing the effect of F1 is equal to

$$r_{12\cdot F_1} = \frac{.72 - (.90)(.80)}{\sqrt{1-.90^2}\sqrt{1-.80^2}} = \frac{0}{\sqrt{.19}\sqrt{.36}} = 0.$$
Using Equation 2.1, it can be shown easily that the partial correlations
among all pairs of x variables given in Table 2.1 are equal to zero. That is,
once the effect of F1 has been removed or partialed out, the partial correlations
or the relationships among the x variables disappear. In other words, F1 is
responsible for all the relationships among the x variables; therefore, F1 is
referred to as the common factor. The x1-x4 set of items is unidimensional
because the correlations among them, after they have been partialed out for the
effect of a single common factor (i.e., F1), are equal to zero. Thus, a set of
items is considered to be unidimensional if the correlations among them
can be accounted for by a single common factor. This conceptualization of
unidimensionality is consistent with that proposed by McDonald (1981) and
Hattie (1985).
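A quick numerical check of Equation 2.1 in SAS (a minimal sketch using the Table 2.1 values; the variable names are ours):

data partial;
   r12 = 0.72;  r1f = 0.90;  r2f = 0.80;   /* correlations from Table 2.1 */
   /* Equation 2.1: partial correlation of x1 and x2, controlling for F1  */
   r12_f1 = (r12 - r1f*r2f) / (sqrt(1 - r1f**2) * sqrt(1 - r2f**2));
   put r12_f1=;   /* prints 0: F1 accounts for all of the x1-x2 correlation */
run;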
Notice that each of the four items in Figure 2.1 is a measure of one and
only one construct and therefore each item is unidimensional. That is, an item
is considered to be unidimensional if it is a measure of only a single construct
or latent factor. Now, if multiple sets of n items from the domain of the con-
struct are taken and the partial correlations among each set are equal to zero,
then the construct is said to be unidimensional. It is possible, however, that a
set of items may be unidimensional, yet the construct or the individual indica-
tors may not be unidimensional. This interesting issue will be elaborated on
further later in the chapter.
Figure 2.2 [A two-factor model: F1 and F2 each affect x1-x5; per Table 2.2, the loadings are .6, .5, .4, .3, and .3 on both factors]
Multidimensionality
Table 2.2 shows the correlation matrix for the two-factor model given in
Figure 2.2, in which the x variables are affected by common factors F1 and F2.
The partial correlation between x1 and x2 after the effect of F1 is partialed out is equal to

$$r_{12\cdot F_1} = \frac{.60 - (.60)(.50)}{\sqrt{1-.60^2}\sqrt{1-.50^2}} = .43.$$

Similarly, the partial correlation between x1 and x2 after partialing out the effect of F2 is also equal to .43.
Thus, both partial correlations are not equal to zero. This suggests that the
correlations among the variables cannot be accounted for by a single factor
and, therefore, the set of items is not unidimensional. The partial correlation
Table 2.2  Correlation Matrix for the Model in Figure 2.2

       F1     F2     x1     x2     x3     x4     x5
F1    1.00
F2    0.00   1.00
x1    0.60   0.60   1.00
x2    0.50   0.50   0.60   1.00
x3    0.40   0.40   0.48   0.40   1.00
x4    0.30   0.30   0.36   0.30   0.24   1.00
x5    0.30   0.30   0.36   0.30   0.24   0.18   1.00
between any two variables 1 and 2 after partialing the effect of variables 3 and 4 is given by Equation 2.2:

$$r_{12\cdot3,4} = \frac{r_{12\cdot3} - r_{14\cdot3}\,r_{24\cdot3}}{\sqrt{1-r_{14\cdot3}^2}\sqrt{1-r_{24\cdot3}^2}} \qquad (2.2)$$
Using Equation 2.2, the partial correlation between x1 and x2 after partialing
the effects of F1 and F2 is equal to zero (see Appendix 2A for the computa-
tions). Using the computational steps shown in Appendix 2A, it can be shown
easily that the partial correlations among the x1-x5 set of variables after con-
trolling for F1 and F2 are all equal to zero. That is, two factors or latent con-
structs are needed to account for the correlations among the x variables and,
therefore, two dimensions account for the x1-x5 set of variables. The dimen-
sionality of a given set of variables is equal to the number of latent constructs
needed to account for the correlations among the variables. Notice further that
each of the five items represents two constructs. That is, the dimensionality of
each item or construct is equal to two because each of these items is a measure
of two constructs. Once again, if multiple sets of items are taken from the
domain of the two constructs and the partial correlations among variables of
each set after removing the effect of F1 and F2 are equal to zero, then the
construct is said to be multidimensional (two-dimensional in this case).
The above rationale can be extended to more than two constructs. If two
factors do not reduce the partial correlations to zero, then the construct has
more than two dimensions. In general, the dimensionality of a set of items is
equal to the number of constructs or factors needed to reduce the partial cor-
relations to zero. Furthermore, if multiple sets of items are drawn from the
domain of the construct and p factors are needed to account for the correlations
among the items, then the dimensionality of the construct is said to be p.
Figure 2.3 [A two-factor model: F1 loads on x1 (.9), x2 (.8), x3 (.7), x4 (.9), z1 (.6), and z2 (.6); F2 loads on z1 (.6), z2 (.6), and z3 (.6)]
Table 2.3 shows the correlation matrix for the factor model given in Figure 2.3.
In Figure 2.3, F1 affects x1, x2, x3, x4, z1, and z2. F2 affects z1, z2, and z3. Notice
that x1-x4 are manifest items only for F1, z1 and z2 are manifest items for both
F1 and F2, and z3 is a manifest item only for F2. Table 2.4 gives partial corre-
lations among various sets of variables pertaining to Figure 2.3. Following are
some of the observations that can be drawn from the results presented in
Tables 2.4a-2.4e.
Table 2.3  Correlation Matrix for the Model in Figure 2.3

       F1     F2     x1     x2     x3     x4     z1     z2     z3
F1    1.00
F2    0.00   1.00
x1    0.90   0.00   1.00
x2    0.80   0.00   0.72   1.00
x3    0.70   0.00   0.63   0.56   1.00
x4    0.90   0.00   0.81   0.72   0.63   1.00
z1    0.60   0.60   0.54   0.48   0.42   0.54   1.00
z2    0.60   0.60   0.54   0.48   0.42   0.54   0.72   1.00
z3    0.00   0.60   0.00   0.00   0.00   0.00   0.36   0.36   1.00
Table 2.4a Partial Correlations for the Model in Figure 2.3: Case 1—Partialing the Effect of F1 From x1−x4

       x1     x2     x3     x4
x1    1.00
x2    0.00   1.00
x3    0.00   0.00   1.00
x4    0.00   0.00   0.00   1.00
Table 2.4b Partial Correlations for the Model in Figure 2.3: Case 2—Partialing
the Effect of F1 From x1−x4 and z1
x1 x2 x3 x4 z1
x1 1.00
x2 0.00 1.00
x3 0.00 0.00 1.00
x4 0.00 0.00 0.00 1.00
z1 0.00 0.00 0.00 0.00 1.00
Table 2.4c Partial Correlations for the Model in Figure 2.3: Case 2—Partialing
the Effect of F1 From x1−x4 and z2
x1 x2 x3 x4 z2
x1 1.00
x2 0.00 1.00
x3 0.00 0.00 1.00
x4 0.00 0.00 0.00 1.00
z2 0.00 0.00 0.00 0.00 1.00
Table 2.4d Partial Correlations for the Model in Figure 2.3: Case 3—Partialing
the Effect of F1 From x1−x4, z1, and z2
x1 x2 x3 x4 z1 z2
x1 1.00
x2 0.00 1.00
x3 0.00 0.00 1.00
x4 0.00 0.00 0.00 1.00
z1 0.00 0.00 0.00 0.00 1.00
z2 0.00 0.00 0.00 0.00 0.225 1.00
Table 2.4e Partial Correlations for the Model in Figure 2.3: Case 3—Partialing
the Effect of F1 and F2 From x1−x4, z1, and z2
x1 x2 x3 x4 z1 z2
x1 1.00
x2 0.00 1.00
x3 0.00 0.00 1.00
x4 0.00 0.00 0.00 1.00
z1 0.00 0.00 0.00 0.00 1.00
z2 0.00 0.00 0.00 0.00 0.00 1.00
Case 1: In the case shown in Table 2.4a, the x1-x4 set of items is
unidimensional, as one common factor accounts for the correlations
among the items, and each item is unidimensional, as it measures one and
only one construct.
Case 2: In the case shown in Tables 2.4b and 2.4c, the x1-x4 and z1 set of
items is unidimensional, as the correlations among the items can be
accounted for by a single factor. Item z1 is not unidimensional, however,
as it is a measure of two factors. That is, it is possible for a set of items to
be unidimensional, yet each item in the set may or may not be unidimen-
sional. Similarly, the set of items x1-x4 and z2 is unidimensional; however,
item z2 is not unidimensional.
Case 3: In the case shown in Tables 2.4d and 2.4e, the x1-x4, z1, and z2 set
of items is not unidimensional, as the correlation between z1 and z2 can-
not be accounted for by one factor. The correlations among this set of
items can be accounted for by two factors. Therefore, this set of items is
multidimensional (i.e., two dimensions). Furthermore, items z1 and z2 are
multidimensional and not unidimensional. The reader can easily see that
items x1-x4 and z1-z3 are not unidimensional, as the partial correlations
among them cannot be accounted for by a single factor.
conceptual meaning might change depending upon which set of items is used
to measure the construct.
RELEVANCE OF UNIDIMENSIONALITY
HOW TO ASSESS
DIMENSIONALITY OF CONSTRUCTS
The implicit assumption underlying the use of EFA is that the researcher
generally has a limited idea with respect to the dimensionality of constructs
and which items belong or load on which factor. Furthermore, EFA typi-
cally is conducted during the initial stage of scale development. Still, EFA
can be used to gain insights as to the potential dimensionality of items
and scales. Table 2.5 gives the SPSS syntax commands for reading the correlation matrix given in Table 2.3 for performing EFA, and Exhibit 2.1 depicts a portion of the output. Following is a brief discussion of the resulting output.

1. A detailed discussion of exploratory and confirmatory factor analysis is beyond the scope of this textbook. For further details on these techniques, the interested reader is referred to Sharma (1996).
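Table 2.5 itself is not reproduced here. As a rough SAS analogue of such syntax (our own sketch, not the book's SPSS commands), the Table 2.3 correlations can be read as a TYPE=CORR data set and factored:

data tab23 (type=corr);
   input _type_ $ _name_ $ x1 x2 x3 x4 z1 z2 z3;
   datalines;
corr x1 1.00 0.72 0.63 0.81 0.54 0.54 0.00
corr x2 0.72 1.00 0.56 0.72 0.48 0.48 0.00
corr x3 0.63 0.56 1.00 0.63 0.42 0.42 0.00
corr x4 0.81 0.72 0.63 1.00 0.54 0.54 0.00
corr z1 0.54 0.48 0.42 0.54 1.00 0.72 0.36
corr z2 0.54 0.48 0.42 0.54 0.72 1.00 0.36
corr z3 0.00 0.00 0.00 0.00 0.36 0.36 1.00
N    .  200  200  200  200  200  200  200
;
run;
/* Principal-factor extraction; SCREE prints the scree plot of eigenvalues */
proc factor data=tab23 method=principal scree;
run;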
The number of factors accounting for the correlations among the vari-
ables represents the dimensionality of a set of variables. A number of rules of
thumb or heuristics are used to determine the number of factors. They include
(a) the eigenvalue-greater-than-one rule, (b) the scree plot, and (c) the scree
plot with parallel analysis.2 According to the eigenvalue-greater-than-one rule,
the number of factors is equal to the number of eigenvalues greater than one.
The rationale is that a given factor must account for at least as much variance
as can be accounted for by a single item or variable. In the present case, the
eigenvalue-greater-than-one rule suggests the presence of two factors, and
therefore it might be concluded that the set of items is not unidimensional. The
eigenvalue-greater-than-one rule has come under considerable criticism.
Exhibit 2.1 [Partial SPSS output: table of initial eigenvalues; not reproduced]

Exhibit 2.2 [Scree plot: eigenvalues plotted against factor number, 1 through 7]
Cliff (1988) specifically provides strong arguments against its use in identifying the number of factors.

2. The eigenvalue-greater-than-one rule and the scree plot with parallel analysis can be used only when the correlation matrix is used in factor analysis.
The scree plot proposed by Cattell (1966) is another popular technique.
The scree plot is a plot of the eigenvalues against the number of factors, and
one looks for an “elbow” signifying a sharp drop in variance accounted for by
a given factor. It is assumed that factors at or beyond the elbow are nuisance
factors and merely represent error or unique components. As can be seen from
the scree plot given in Exhibit 2.2, the elbow suggests the presence of two fac-
tors. That is, the set of items is multidimensional. In many instances, however,
it is not possible to completely identify the elbow. In such cases, one can use
the parallel plot procedure suggested by Horn (1965). With this procedure, the
parallel plot represents the eigenvalues that would result if the data set were to
contain no common factors. That is, the correlations among the variables are
completely due to sampling error. Extensive simulations are required to esti-
mate the eigenvalues for the parallel plot; however, based on empirically
derived equations, Allen and Hubbard (1986) have developed the following
equation to estimate the eigenvalues for obtaining the parallel plot:
$$\ln \lambda_k = a_k + b_k \ln(n-1) + c_k \ln\{(p-k-1)(p-k+2)/2\} + d_k \ln \lambda_{k-1} \qquad (2.3)$$

where n is the sample size, p is the number of variables, and $a_k$, $b_k$, $c_k$, and $d_k$ are regression coefficients for the kth latent root. For example, the first two estimated eigenvalues of the parallel plot are

$$\lambda_1 = e^{.257} = 1.293$$

$$\lambda_2 = e^{.148} = 1.160.$$

[Plot: scree plot and parallel plot overlaid, with eigenvalues plotted against the number of factors]
3. If the computed eigenvalues are negative, they are assumed to be zero.
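When the regression coefficients are not at hand, the eigenvalues for the parallel plot can also be estimated directly by simulation, as Horn (1965) proposed. A minimal SAS/IML sketch (the sample size of 200 matches the CFA illustration later in this chapter; the number of replications and the seed are our own assumptions):

proc iml;
   call randseed(12345);             /* arbitrary seed (our assumption)        */
   n = 200;                          /* assumed sample size                    */
   p = 7;                            /* number of variables (x1-x4, z1-z3)     */
   reps = 100;                       /* number of simulated data sets          */
   roots = j(reps, p, 0);
   do r = 1 to reps;
      /* random data with no common factors: correlations are sampling error */
      x = randnormal(n, j(1, p, 0), i(p));
      roots[r, ] = t(eigval(corr(x)));
   end;
   parallel = roots[:, ];            /* mean eigenvalues = parallel plot      */
   print parallel;
quit;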
[Table: regression coefficients a, b, c, and d (with R² and the number of points used in each regression) for each root k in Equation 2.3; values not reproduced]

SOURCE: From Table 1 of "Regression Equations for the Latent Roots of Random Data Correlation Matrices With Unities on the Diagonal," Multivariate Behavioral Research, 21, pp. 393-398, Allen and Hubbard, copyright 1986 by Lawrence Erlbaum Associates. Reprinted with permission.
$$\text{RMSR} = \sqrt{\frac{2\sum_{i=1}^{k}\sum_{j>i}\text{res}_{ij}^2}{k(k-1)}}$$

$$\text{RMSP} = \sqrt{\frac{2\sum_{i=1}^{k}\sum_{j>i}pc_{ij}^2}{k(k-1)}} \qquad (2.4)$$
where resij and pcij are, respectively, the residual correlation and the partial
correlation between variables i and j. Exhibit 2.4 gives the residual correlation
matrix provided by SPSS for one- and two-factor models. Using Equation 2.4,
the RMSR for a two-factor model is .00021, and for the one-factor model it is
.215. Although there are no guidelines as to how low is “low,” there is a sub-
stantial difference between the RMSR for the one- and two-factor models;
therefore, the two-factor model provides the better account of correlations
among the variables.4
4. SPSS does not provide the RMSP. SAS, on the other hand, provides both of the matrices and also computes the two indices.
Exhibit 2.4 Partial SPSS Output for Two-Factor and One-Factor Solutions [reproduced correlation matrices for the two solutions; not fully reproduced]
5. For detailed discussion of confirmatory factor analysis, see Sharma (1996).
6. It should be noted that the model fit is perfect, as hypothetical data were used for the model in Figure 2.3. Normally, for actual data and large sample sizes, the fit will not be perfect and the chi-square test statistic will be large and significant.
data table23 (type=corr);   /* correlation matrix of Table 2.3, N = 200 */
   input _type_ $ _name_ $ x1 x2 x3 x4 z1 z2 z3;
   datalines;
corr x1 1.00 .    .    .    .    .    .
corr x2 0.72 1.00 .    .    .    .    .
corr x3 0.63 0.56 1.00 .    .    .    .
corr x4 0.81 0.72 0.63 1.00 .    .    .
corr z1 0.54 0.48 0.42 0.54 1.00 .    .
corr z2 0.54 0.48 0.42 0.54 0.72 1.00 .
corr z3 0.00 0.00 0.00 0.00 0.36 0.36 1.00
std  .   1    1    1    1    1    1    1
N    .   200  200  200  200  200  200  200
;
run;
Proc Calis data=table23;   /* two-factor model of Figure 2.3: z1 and z2 load on both factors */
   Lineqs
      x1 = l1f1 f1 + e1,
      x2 = l2f1 f1 + e2,
      x3 = l3f1 f1 + e3,
      x4 = l4f1 f1 + e4,
      z1 = l5f1 f1 + lz1f2 f2 + e5,
      z2 = l6f1 f1 + lz2f2 f2 + e6,
      z3 = l7f2 f2 + e7;
   std
      e1 e2 e3 e4 e5 e6 e7 = vare1 vare2 vare3 vare4 vare5 vare6 vare7,
      f1 f2 = 2*1.0;
run;
As can be seen from the output below, the parameter estimates for the two-factor model are, within rounding error, the same as those reported in Figure 2.3.
There are a number of issues pertaining to the use of the chi-square test
statistic for hypothesis testing, the most important being its sensitivity to sam-
ple size (Bearden, Sharma, & Teel, 1982; Hoyle, 1995; McDonald & Marsh,
1990). A number of other alternate goodness-of-fit statistics have been
proposed that supposedly are not sensitive to sample size and other model
parameters. These will be discussed in Chapter 7.
Goodness-of-Fit Results [fit statistics not reproduced]

Parameter Estimates
Manifest Variable Equations with Standardized Estimates

x1 = 0.9000*f1 + 0.4359 e1                 (l1f1)
x2 = 0.8000*f1 + 0.6000 e2                 (l2f1)
x3 = 0.7000*f1 + 0.7141 e3                 (l3f1)
x4 = 0.9000*f1 + 0.4359 e4                 (l4f1)
z1 = 0.6000*f1 + 0.6000*f2 + 0.5292 e5     (l5f1, lz1f2)
z2 = 0.6000*f1 + 0.6000*f2 + 0.5292 e6     (l6f1, lz2f2)
z3 = 0.6000*f2 + 0.8000 e7                 (l7f2)
SUMMARY
Appendix 2A
COMPUTING PARTIAL CORRELATIONS
$$r_{12\cdot F_1} = \frac{.60-(.6)(.50)}{\sqrt{1-.6^2}\sqrt{1-.5^2}} = .43$$

$$r_{12\cdot F_2} = \frac{.60-(.6)(.50)}{\sqrt{1-.6^2}\sqrt{1-.5^2}} = .43$$

$$r_{12\cdot F_1,F_2} = \frac{r_{12\cdot F_2}-r_{1F_1\cdot F_2}\,r_{2F_1\cdot F_2}}{\sqrt{1-r_{1F_1\cdot F_2}^2}\sqrt{1-r_{2F_1\cdot F_2}^2}}$$

$$r_{1F_1\cdot F_2} = \frac{r_{1F_1}-r_{1F_2}\,r_{F_1F_2}}{\sqrt{1-r_{1F_2}^2}\sqrt{1-r_{F_1F_2}^2}} = \frac{.6-(.6)(0)}{\sqrt{1-.6^2}\sqrt{1-0}} = .75$$

$$r_{2F_1\cdot F_2} = \frac{r_{2F_1}-r_{2F_2}\,r_{F_1F_2}}{\sqrt{1-r_{2F_2}^2}\sqrt{1-r_{F_1F_2}^2}} = \frac{.5-(.5)(0)}{\sqrt{1-.5^2}\sqrt{1-0}} = .58$$

$$r_{12\cdot F_1,F_2} = \frac{.43-(.75)(.58)}{\sqrt{1-.75^2}\sqrt{1-.58^2}} \approx 0$$
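A small SAS DATA step (our own sketch) reproduces the final step of these computations:

data partial2;
   r12_f2  = 0.43301;   /* r12.F2, computed above  */
   r1f1_f2 = 0.75;      /* r1F1.F2, computed above */
   r2f1_f2 = 0.57735;   /* r2F1.F2, computed above */
   r12_f1f2 = (r12_f2 - r1f1_f2*r2f1_f2)
            / (sqrt(1 - r1f1_f2**2) * sqrt(1 - r2f1_f2**2));
   put r12_f1f2=;       /* approximately zero */
run;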
THREE
RELIABILITY
INTRODUCTION
In the previous chapter, we discussed issues related to the dimensionality of
latent constructs and their measurement items. We concluded that knowledge
of the dimensionality of a construct is critical for developing items to measure
that construct. After the items have been developed and trimmed, the next
important step is to examine the reliability and validity of the scale. The pur-
pose of this chapter is to discuss the concept of reliability and provide sug-
gested procedures to assess it. In addition, we will discuss generalizability
theory. The validity of a scale and procedures to assess validity will be
addressed in the next chapter.
x=T+e (3.1)
1. See Crocker and Algina (1986) for a detailed discussion of the true-score model and the related theory.
Figure 3.1 [(a) The true-score model: T determines the observed score x, with error e; (b) the factor model: the latent factor F determines x, with error d]
where x is the observed score, T is the true score, and e is random measure-
ment error. When the true score, T, varies, so does the observed score, x,
because the latent construct influences the observed score (i.e., “x” is a reflec-
tive indicator). Assuming that the true score and error are uncorrelated (i.e.,
Cov(T,e) = 0), it can be shown that the variance of x is equal to the variance
of the true score plus the variance of the error: $\text{Var}(x) = \text{Var}(T) + \text{Var}(e) = \sigma_T^2 + \sigma_e^2$. Reliability is then defined as the proportion of observed-score variance that is due to the true score:

$$\text{Reliability} = \rho_{xx} = \frac{\text{Var}(T)}{\text{Var}(x)} = \frac{\sigma_T^2}{\sigma_x^2} \qquad (3.2)$$
x = λF + δ
where x is the observed score, F is the latent factor, δ is the error, and λ (the
factor loading) is the extent to which the latent factor F affects the observed
score x. Once again, as F varies, so does the observed score; however, the
extent to which it varies is determined by the value of λ. It is clear that the term
λF is equivalent to the true score in the true-score model and δ represents the
error.2 The reliability of the observed measure is given by

$$\rho_{xx} = \frac{\lambda^2\,\text{Var}(F)}{\text{Var}(x)} = \frac{\lambda^2\sigma_F^2}{\lambda^2\sigma_F^2 + \sigma_\delta^2} \qquad (3.3)$$
Notice that the numerators in Equations 3.2 and 3.3 are equivalent in the
sense that they measure the variance that is due to the effect of the latent con-
struct on the observed score.
TYPES OF RELIABILITY
Because the true score (i.e., the latent construct) and error are unobservable,
reliability has to be inferred from the observed score. Procedures used to
assess reliability frequently discussed in the literature can be grouped into
three general types: (a) test-retest reliability; (b) alternative-form reliability;
and (c) internal consistency reliability. In discussing the three types of relia-
bility, we will use both hypothetical examples and examples with data
collected from our own research.
2. In a strict sense, one might argue that conceptually the factor and the true-score models are not equivalent. In the factor model, δ in addition to measurement error also includes the unique or specific error. Empirically, however, it is impossible to separate the two. Consequently, at an empirical level it is not possible to differentiate between the true-score and the factor model.
Table 3.1  Hypothetical CP Scores at Two Occasions

Subject   Occasion 1   Occasion 2
1         4            5
2         4            5
3         6            5
4         5            3
5         5            4
6         3            3
7         2            3
8         5            4
9         3            3
10        7            7
Test-Retest Reliability
3. See Lichtenstein, Ridgway, and Netemeyer (1993) for further information regarding the CP construct.
Alternative-Form Reliability
Table 3.2  Hypothetical CP Scores: Alternative Forms at Two Occasions

Subject   Occasion 1 (Statement 1)   Occasion 2 (Statement 2)
1         4                          4
2         4                          4
3         6                          5
4         5                          3
5         5                          4
6         3                          5
7         2                          4
8         5                          4
9         3                          5
10        7                          7
Figure 3.2 [A one-factor model: the latent construct CP with indicators x1-x4 and error terms e1-e4]
take repeated measures or use alternate forms. In such cases, the concept of
internal consistency can be used to estimate reliability. The internal consis-
tency concept to measure reliability requires only a single administration of
the items to respondents; however, it assumes availability of multiple mea-
sures or items for measuring a given construct. Before discussing the most
widely used measure of internal consistency (coefficient alpha), we first pro-
vide a brief discussion of split-half reliability, which is one form of internal
consistency reliability.
Split-Half Reliability
As depicted in Figure 3.2, suppose that the following four items are used
to measure the CP construct.
Table 3.3  Hypothetical Data for Split-Half Reliability

Subject   x1   x2   x3   x4   Half 1 (x1 + x2)   Half 2 (x3 + x4)
1         4    4    5    5    8                  10
2         4    4    5    5    8                  10
3         6    5    6    5    11                 11
4         5    3    5    3    8                  8
5         5    4    3    4    9                  7
6         3    5    5    3    8                  8
7         2    4    5    3    6                  8
8         5    4    3    4    9                  7
9         3    5    5    3    8                  8
10        7    7    7    7    14                 14
4. The number of splits that can be formed is equal to $(2n')!/[2(n'!)^2]$, where n' = n/2 and n is the number of statements.
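A minimal SAS sketch of the split-half computation for Table 3.3 (our own code; the halves follow the table, with x1 + x2 as the first half and x3 + x4 as the second). The resulting half-test correlation is conventionally stepped up to full test length with the Spearman-Brown formula, 2r/(1 + r):

data splithalf;
   input half1 half2 @@;    /* half1 = x1 + x2, half2 = x3 + x4 (Table 3.3) */
   datalines;
8 10  8 10  11 11  8 8  9 7  8 8  6 8  9 7  8 8  14 14
;
run;
/* Correlation between the two halves; apply 2r/(1 + r) to the printed r */
proc corr data=splithalf;
   var half1 half2;
run;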
Coefficient alpha, or Cronbach's alpha (α), can be computed using the following formula:
$$\alpha = \frac{k}{k-1}\left[\frac{\sum_{i=1}^{k}\sum_{j\neq i}\text{Cov}(x_i,x_j)}{\sum_{i=1}^{k}\sum_{j\neq i}\text{Cov}(x_i,x_j)+\sum_{i=1}^{k}\text{Var}(x_i)}\right] = \frac{k}{k-1}\left[\frac{\sum_{i=1}^{k}\sum_{j\neq i}\sigma_{ij}}{\sum_{i=1}^{k}\sum_{j\neq i}\sigma_{ij}+\sum_{i=1}^{k}\sigma_i^2}\right] \qquad (3.4)$$
In Figure 3.2, the variance in each item (i.e., xi) that is due to the latent
variable CP is considered common or shared variance. When CP varies, so
do the scores on the individual items, because the latent construct influences
the scores on the items. Thus, scores on all items vary jointly with the latent
construct CP, theoretically implying that all the items are correlated. The
error terms (ei) depicted in Figure 3.2 are considered unique to each item.
That is, they represent variance that is not attributable to the latent construct
CP, and according to classical test theory, the error terms are not correlated.
Both the individual item scores and, therefore, the overall scale score vary
as functions of two sources: (a) the source of variation common to itself (the
overall score) and other items and (b) unshared or unique variation associ-
ated with that particular item. Thus, total scale variance and variance for
each item are that which is attributable to common and unique (error)
sources. As discussed below, alpha is conceptually equivalent to the ratio of
common source variation to total variation (Cortina, 1993; DeVellis, 1991;
Nunnally & Bernstein, 1994).
The covariance matrix of the four-item CP scale is given by
$$\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \sigma_{13} & \sigma_{14}\\
\sigma_{21} & \sigma_2^2 & \sigma_{23} & \sigma_{24}\\
\sigma_{31} & \sigma_{32} & \sigma_3^2 & \sigma_{34}\\
\sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2
\end{bmatrix}$$
The total scale score is the sum of the four items, TS = x1 + x2 + x3 + x4, and its variance is

$$\text{Var(TS)} = \sigma_{TS}^2 = \sigma_1^2+\sigma_2^2+\sigma_3^2+\sigma_4^2+\sigma_{12}+\sigma_{13}+\sigma_{14}+\sigma_{21}+\sigma_{23}+\sigma_{24}+\sigma_{31}+\sigma_{32}+\sigma_{34}+\sigma_{41}+\sigma_{42}+\sigma_{43} = \sum_{i=1}^{k}\sigma_i^2+\sum_{i=1}^{k}\sum_{\substack{j=1\\ j\neq i}}^{k}\sigma_{ij}.$$
Note that the variance of the summed score, TS, is equal to the sum of all
the variances and covariances in the covariance matrix. To partition the total
variance into common variance (i.e., variance of true score) and unique variance
(i.e., variance due to measurement error), the following must be considered.
The diagonal elements essentially represent the covariance of an item with
itself, that is, the variability in the score of an item from a given sample of
individuals. As such, the diagonal elements are unique sources of variance and
not variance that is common or shared among items. The off-diagonal elements
are covariances that represent the variance that is common or shared by any
pair of items in the scale. Thus, the entries in the covariance matrix consist of
unique (error) and common (shared/joint) variance. That which is unique is
represented along the main diagonal ($\sum\sigma_i^2$), that which is common is represented by the off-diagonal elements, and the total variance ($\sigma_{TS}^2$) is equal to the sum of all the entries in the matrix. As such, the ratio of unique (non-common) variance to total variance is given by

$$\frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_{TS}^2}.$$
Coefficient alpha can then be expressed as the ratio of the average interitem covariance to the average element of the covariance matrix:

$$\alpha = \frac{\left[\sum_{i=1}^{k}\sum_{j\neq i}\sigma_{ij}\right]\big/(k^2-k)}{\left[\sum_{i=1}^{k}\sum_{j\neq i}\sigma_{ij}+\sum_{i=1}^{k}\sigma_i^2\right]\big/k^2} = \frac{k}{k-1}\left[\frac{\sum_{i=1}^{k}\sum_{j\neq i}\sigma_{ij}}{\sum_{i=1}^{k}\sum_{j\neq i}\sigma_{ij}+\sum_{i=1}^{k}\sigma_i^2}\right] \qquad (3.6)$$
For standardized items, alpha can be computed from the average interitem correlation:

$$\alpha = \frac{k\bar{r}}{1+(k-1)\bar{r}} \qquad (3.7)$$

where k is the number of items in the scale and $\bar{r}$ is the average correlation among the items in the scale.
Finally, for items that are scored dichotomously, the Kuder-Richardson 20 (KR-20) formula is used to compute coefficient alpha. The KR-20 formula is identical to the variance-covariance version of alpha (Equation 3.6) with the exception that $(\Sigma pq)$ replaces $(\Sigma\sigma_i^2)$. The $(\Sigma pq)$ term specifies that the variance of each item is computed and then these variances are summed over all items, where "p" represents each item's mean and "q" is (1 − item mean). As such, the variance of an item becomes "pq." (See Crocker and Algina, 1986, pp. 139-140, for an example of the KR-20 formula.)

$$\alpha = \frac{k}{k-1}\left(1-\frac{\sum pq}{\sigma_x^2}\right) \qquad (3.8)$$
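A minimal SAS sketch of the KR-20 arithmetic (every number here is hypothetical, purely to illustrate the formula):

data kr20;
   array p{5} p1-p5 (0.8 0.7 0.6 0.5 0.4);  /* hypothetical item means (p) */
   sum_pq = 0;
   do i = 1 to 5;
      sum_pq = sum_pq + p{i}*(1 - p{i});    /* item variance = pq          */
   end;
   k = 5;
   var_x = 2.5;                    /* hypothetical total-score variance */
   kr20 = (k/(k - 1)) * (1 - sum_pq/var_x);
   put kr20=;
run;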
Coefficient Alpha Example

CP5: Beyond the money I save, redeeming coupons gives me a sense of joy.
The covariance and correlation matrices for these data are reported in
Tables 3.4a and 3.4b.
Using the formula for α given in Equation 3.6,

$$\alpha = \frac{5}{5-1}\left[\frac{42.356}{42.356+18.112}\right] = .876$$

where k = 5,

$$\sum_{i=1}^{k}\sigma_i^2 = 3.6457+4.3877+2.9864+3.7786+3.3138 = 18.112, \text{ and}$$

$$\sum_{i=1}^{k}\sum_{\substack{j=1\\ j\neq i}}^{k}\sigma_{ij} = 2(2.7831+2.3934+2.2280+2.0706+2.3973+1.9109+1.8338+2.0782+1.6011+1.8813) = 42.356.$$
Table 3.4a  Covariance Matrix

        CP1      CP2      CP3      CP4      CP5
CP1    3.6457
CP2    2.7831   4.3877
CP3    2.3934   2.2280   2.9864
CP4    2.0706   2.3973   1.9109   3.7786
CP5    1.8338   2.0782   1.6011   1.8813   3.3138

Table 3.4b  Correlation Matrix

        CP1      CP2      CP3      CP4      CP5
CP1    1.0000
CP2     .6969   1.0000
CP3     .7254    .6155   1.0000
CP4     .5579    .5888    .5688   1.0000
CP5     .5276    .5450    .5089    .5317   1.0000
Note that the sum of the off-diagonal elements was multiplied by 2 to reflect
the covariances both below and above the diagonal.
Repeating the above procedure for the correlation matrix and using
Equation 3.7 gives a value of 0.876 for α. Notice that there is no difference
between the two values of α. This usually will be the case; however, if the
variances of the items are very different, then the two values may not be the
same. The calculation (with r = .5865) is as follows:
$$\frac{k\bar{r}}{1+(k-1)\bar{r}} = \frac{5\times.5865}{1+(5-1)\times.5865} = .876.$$
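Both versions of alpha are easily reproduced from Tables 3.4a and 3.4b. A minimal SAS DATA step (our own sketch):

data alpha;
   k = 5;
   /* Sum of the item variances: the diagonal of Table 3.4a */
   sum_var = 3.6457 + 4.3877 + 2.9864 + 3.7786 + 3.3138;
   /* Twice the sum of the covariances below the diagonal of Table 3.4a */
   sum_cov = 2*(2.7831 + 2.3934 + 2.2280 + 2.0706 + 2.3973
              + 1.9109 + 1.8338 + 2.0782 + 1.6011 + 1.8813);
   alpha_raw = (k/(k - 1)) * (sum_cov/(sum_cov + sum_var));   /* Equation 3.6 */
   rbar = .5865;                      /* average interitem correlation        */
   alpha_std = (k*rbar)/(1 + (k - 1)*rbar);                   /* Equation 3.7 */
   put alpha_raw= alpha_std=;         /* both print as roughly .876           */
run;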
nicely by Cortina (1993): (a) α is the mean of all split-half reliabilities, (b) α
is the lower bound reliability of a measure, (c) α is a measure of first-factor
saturation, (d) α is equal to reliability under a tau equivalence assumption, and
(e) α is a general version of the KR-20 coefficient for dichotomous items.
Although these descriptions have been widely used, the validity of some of
them rests on certain assumptions (e.g., the use of α in its standardized or
unstandardized form as just noted). One conclusion that can be drawn from the
various descriptions of α in its relation to dimensionality is as follows
(Cortina, 1993):
GENERALIZABILITY THEORY
5. Shavelson and Webb (1991) provide an excellent discussion of Generalizability Theory.
following section discusses these procedures and illustrates how they can be
used in scale development.
Suppose an educational toy company has developed a new toy for 3-year-old
children. The objective of the toy is to develop social interaction among
children. To determine whether the new toy is successful with respect to this
objective, the toy company arranges a focus group interview in which ten
3-year-old children are given the toy and the session is videotaped. Three
experts in social interaction are asked to view the tape and rate the level of
interaction on a 7-point item. The procedure is repeated after 2 weeks; the same
children are videotaped, and the same experts are asked to rate the children
again. In generalizability terminology, this study is referred to as a two-facet
study. The two generalization facets are time (i.e., the two time periods during
which data are collected) and the raters. These facets or factors are referred to
as generalization factors because of interest in generalizing the results of the
study across these factors. The level of social interaction will vary across
children, the object of measurement. Research participants (in this case, children)
usually are referred to as the differentiation or universe score factor, as the
children are expected to differ with respect to their level of social interaction.
As depicted in Table 3.5, the total variation in the data can be decomposed
into seven sources of variation. The sources of variation and their implications
are discussed below.
Table 3.5 [Sources of variation, degrees of freedom, mean squares, expected mean squares, and variance-component estimates for the two-facet study; not reproduced]
a. Because there is only one observation per cell, the three-way interaction is confounded with error.
this that multiple measurement occasions are not needed. On the other
hand, large variation due to this source suggests that the measurement
occasion does have an effect. It is possible, although highly unlikely,
that the children might have changed with respect to their social inter-
action level. It also is possible that the time of day (i.e., morning,
afternoon, evening) might be the reason for variation in children’s
interaction level, and therefore measurement at each of these times of
day is necessary to generalize study findings.
4. The interaction between subject and raters (S × R). High variation due
to this source suggests inconsistencies of raters in rating subjects, in
that the rating of a given subject differs across raters. This usually
should not be the case; however, if it is, this might suggest that bias due
to the raters and the rating process might need further investigation.
5. The interaction between subject and time (S × T). High variation due
to this source suggests that a given subject’s behavior (i.e., social inter-
action) changes across the two measurement time periods, suggesting
that the time period used could be too long.
6. The interaction between rating and time (R × T). Variation in a given
rater’s evaluation across the two time periods; differences in this inter-
action among raters suggest that some raters are more consistent than
other raters.
7. The interaction between subject, rating, and time (S × R × T). The
residual variation, which is due to the combination of subjects, rating,
and time, and/or effects of factors not taken into consideration and
random error.
It should be noted that the first four columns of Table 3.5 are similar to
that obtained in regular ANOVA (analysis of variance) results. The fifth col-
umn gives the expected mean square, and the last column is the variation due
to the respective source. The equations in the fifth column can be solved to
obtain an estimate of variation due to the respective sources given in the last
column. The following empirical illustration shows how this can be done.
G-Study Illustration
6. We use the term "subjects" instead of "respondents" to be consistent with its use in the Generalizability Theory literature.
The above estimates and the percentage of the total variation due to the
respective source are also reported in Table 3.6. The results suggest that 44.9%
of the total variation is due to differences in CETSCALE scores of subjects,
which is to be expected. This essentially means that subjects do differ with
respect to their ethnocentric tendencies. The results also suggest that 15.8% of
the variation is due to the items. Some variance due to the items is to be
expected, as the items tap different aspects of the construct. High variance due
to items is not desirable, however, as it suggests that the items may not be
internally consistent—that is, that responses to items vary considerably,
thereby affecting the generalizability of the scale.
The above procedure of computing the variance components by hand
becomes cumbersome for multifacet studies. Procedures in standard statistical
packages such as SAS and SPSS are available for computing variance com-
ponents. Table 3.7 gives the statements for PROC VARCOMP in SAS to com-
pute variance components. The METHOD option specifies the method that
should be used in estimating the variance components. The minimum norm
quadratic unbiased estimator (MIVQUE0) is the default method and is based
on the technique suggested by Hartley, Rao, and LaMotte (1978). Other meth-
ods are the maximum likelihood (ML), ANOVA method using type 1 sum of
squares (TYPE1), and restricted maximum likelihood method (REML). (See
the SAS manual for further details about these methods.) Exhibit 3.1 gives the
partial SAS output. As can be seen from the output, the estimates, within
rounding error, are the same as those reported in Table 3.6.
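The statements in question take roughly the following form (a sketch assuming a data set, here called cetdata, with one row per subject-item combination and the variables person, item, and score; Table 3.7 itself is not reproduced):

proc varcomp method=mivque0 data=cetdata;
   class person item;
   model score = person item person*item;   /* two-facet random model */
run;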
In SPSS, the drop-down menu feature can be used to estimate variance
components. Exhibit 3.2 gives the dialog boxes resulting from the Analyze →
Exhibit 3.1  Partial SAS Output: Variance Components Estimates

Var(ITEM)           0.42220381
Var(PERSON)         1.20032844
Var(ITEM*PERSON)    1.04952361
Var(Error)          0.00000000
Generalizability Coefficient
$$\text{Relative error} = \sigma_\delta^2 = \frac{\sigma_{S\times I}^2}{n_I}. \qquad (3.9)$$
$$\text{Absolute error} = \sigma_\Delta^2 = \frac{\sigma_I^2}{n_I} + \frac{\sigma_{S\times I}^2}{n_I}. \qquad (3.10)$$

$$G = \frac{\sigma_S^2}{\sigma_S^2 + \sigma_\delta^2}. \qquad (3.11)$$
For the 17-item CETSCALE,

$$\sigma_\delta^2 = 1.05/17 = 0.062$$

$$G = \frac{1.20}{1.20 + 0.062} = 0.951$$
Figure 3.3 [The generalizability coefficient (0.5 to 1.0) plotted against the number of items (0 to 25); the curve rises steeply at first and then flattens]
Decision Studies
With two items, for example,

$$\sigma_\delta^2 = 1.05/2 = .53$$

$$G = \frac{1.20}{1.20 + .53} = .694.$$
Table 3.8 gives the relative error and the G coefficient for different
numbers of statements, and Figure 3.3 depicts the relationship between the
G coefficient and the number of statements. As can be seen from the table
and the figure, the increase in the G coefficient becomes negligible after a
Table 3.8  Number of Items, Relative Error Variance, and the G Coefficient

Number of Items   Relative Error Variance   G Coefficient
 1                1.050                     0.533
 2                0.525                     0.694
 3                0.350                     0.774
 4                0.263                     0.820
 5                0.210                     0.851
 6                0.175                     0.873
 7                0.150                     0.889
 8                0.131                     0.902
 9                0.117                     0.911
10                0.105                     0.920
11                0.095                     0.927
12                0.088                     0.932
13                0.081                     0.937
14                0.075                     0.941
15                0.070                     0.945
16                0.066                     0.948
17                0.062                     0.951
18                0.058                     0.954
19                0.055                     0.956
20                0.053                     0.958
21                0.050                     0.960
22                0.048                     0.962
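The decision-study calculations in Table 3.8 are straightforward to reproduce. A minimal SAS DATA step (our own sketch) applies Equations 3.9 and 3.11 with the variance components of Table 3.6:

data dstudy;
   var_s   = 1.20;   /* subject (universe-score) variance        */
   var_sxi = 1.05;   /* subject-by-item interaction variance     */
   do n_items = 1 to 22;
      rel_error = var_sxi/n_items;      /* relative error, Equation 3.9 */
      g = var_s/(var_s + rel_error);    /* G coefficient, Equation 3.11 */
      output;
   end;
run;
proc print data=dstudy noobs;
   var n_items rel_error g;
run;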
SUMMARY
This chapter has discussed various forms of reliability, most notably coeffi-
cient alpha. It also discussed the relationship between coefficient alpha,
unidimensionality, interitem correlation, scale length, and item-wording
redundancy. We highlighted the importance of establishing dimensionality
prior to assessing internal consistency as well as the effects of interitem
FOUR
VALIDITY
OVERVIEW OF CONSTRUCT VALIDITY
1. Translation validity
Content validity
Face validity
2. Criterion-related validity
Predictive and post-dictive validity
Concurrent validity
Convergent validity
Discriminant validity
Known-group validity
3. Nomological validity
TRANSLATION VALIDITY
Content and face validity reflect the extent to which a construct is translated
into the operationalization of the construct (Trochim, 2002). The terms are
often confused and/or used interchangeably. Although the distinction between
face and content validity frequently is unclear, Rossiter (2001) argued that the
two validity concepts differ in important ways and should not be confused.
Face Validity
Evidence of face validity is provided from a post hoc evaluation that the
items in a scale adequately measure the construct (cf. Nunnally & Bernstein,
1994; Rossiter, 2001). Face validity can be judged after a measure has been
Content Validity
(Clark & Watson, 1995). The initial item pool should be comprehensive in
coverage and include a large number of potential items across the a priori
theoretical dimensions. The initial item pool undoubtedly will include some
items that subsequently will be eliminated in follow-up judging procedures
and psychometric analyses. A large number of items for each dimension in the
beginning item pool increases the likelihood that all dimensions will be repre-
sented adequately. That is, the focus on breadth of the item pool applies to
all possible construct areas or dimensions, so that individual areas are not
underrepresented in the final scale.
Content validation is particularly important for constructs that are
ambiguous or complex. Content validation is enhanced via precise construct
definition and conceptualization, including the specification of dimensionality
and the individual definitions of the various dimensions that the construct
comprises. In addition, content validity is ensured to the extent that lay and
expert judges agree that items are reflective of the overall construct and that
these judges agree that the items are representative of the domain and facets
of the construct.
Items to be included in an item pool may be obtained using any number
of sources. The most frequently used sources of items in scale development
articles include previously employed statements from prior research involving
the construct, open-ended elicitation from samples of representative subjects,
and researcher-generated statements based on the researcher’s knowledge and
understanding of the domain of the construct and its theoretical underpinnings.
As an example, Bearden, Hardesty, and Rose (2001) reported the follow-
ing efforts in the development of their 31-item, six-dimension measure of
consumer self-confidence. Briefly, self-confidence was conceptualized initially
as a two-factor higher-order model with seven first-order factors. The a priori
higher-order factors (and the hypothesized subdimensions) were as follows:
1. Decision-making self-confidence
a. Information acquisition
b. Information processing
c. Consideration set formation
d. Personal outcomes
e. Social outcomes
2. Protection self-confidence
a. Persuasion knowledge
b. Marketplace interfaces.
CRITERION VALIDITY
Concurrent Validity
CONVERGENT AND DISCRIMINANT VALIDITY
Multitrait-Multimethod Matrix (JS = job satisfaction, RC = role conflict, RA = role ambiguity)

                                     Method 1 (Likert)          Method 2 (Thermometer)
                                     JS       RC       RA       JS       RC       RA
Method 1        Job Satisfaction    (.896)
(Likert Scale)  Role Conflict       -.236    (.670)
                Role Ambiguity      -.356     .075    (.817)
Method 2        Job Satisfaction     .450    -.082    -.054
(Thermometer    Role Conflict       -.244     .395     .142    -.147
Scale)          Role Ambiguity      -.252     .141     .464    -.170     .289

NOTE: Reliabilities (region 1) appear in parentheses along the diagonal. The heterotrait-monomethod triangles (region 2) are the off-diagonal entries within each method block. The validity diagonal (region 3) consists of the monotrait-heteromethod values (.450, .395, and .464), and the remaining heteromethod entries form the heterotrait-heteromethod triangles (region 4).
1. The entries in the validity diagonal (region 3) should be higher than the entries in the heteromethod block (region 4) that share the same row and column.
2. The correlations in the validity diagonal should be higher than the correlations in the heterotrait-monomethod triangles. This more stringent requirement suggests that the correlations between different measures of a trait should be higher than correlations among traits that have methods in common.
3. The pattern of the correlations should be the same in all the heterotrait triangles (i.e., regions 2 and 4).
For this example, the first two conditions for discriminant validity are
satisfied. An excellent review and series of applications involving lifestyle
measures and very different types of methods, including both qualitative and
quantitative methods within traits, is provided by Lastovicka, Murry, and
Joachimsthaler (1990). As demonstrated effectively by Lastovicka et al.
(1990), the use of divergent methods enables more rigorous examination of
convergent and discriminant validity.
Other procedures for providing evidence of convergent validity have been
employed in the development of measures of unobservable constructs. For
example, using procedures similar to those recommended by Bagozzi (1993),
Bearden, Hardesty, et al. (2001) provided evidence of convergent validity for
their six-dimension measure of consumer self-confidence. They did so via
Known-Group Validity
Known Group                        Study    n    M      SD    Comparison n    M      SD    t      p        Hypothesis
Tattoo and body piercing artists     1      39   3.05   .70   621             2.60   .56   3.22   < .001   Supported
Owners of customized low rider       2      22   2.99   .45   621             2.60   .56   3.22   < .001   Supported
  autos
Members of medievalist               3      21   2.91   .44   621             2.60   .56   2.49   < .01    Supported
  reenactment group
Student art majors                   4      22   3.06   .45   273             2.71   .50   3.15   < .01    Supported
Student purchasers of unique         5      78   2.83   .43   273             2.71   .50   1.89   < .05    Supported
  poster art

SOURCE: Adapted from "Consumers' Need for Uniqueness: Scale Development and Validation," Journal of Consumer Research, 28(June), p. 58, Tian, Bearden, and Hunter, copyright 2001 by the University of Chicago Press.
they found significant differences across samples from sales positions that
differed widely in their professional status.
NOMOLOGICAL VALIDITY
SOCIALLY DESIRABLE RESPONSE BIAS
Socially desirable responding (SDR) is a complex issue that has been dis-
cussed by psychologists for years. We raise the issue in this chapter to reiter-
ate its importance and to remind readers that SDR warrants consideration in
research, particularly when the potential for response bias affects relationships
among constructs. Briefly, SDR can be viewed as a response style or bias that
reflects tendencies to provide favorable responses with respect to norms and
practices (Nederhof, 1985). Mick (1996) defined socially desirable responding
as the tendency of individuals to make themselves look good with respect to
cultural norms when answering researchers’ questions. As discussed below,
this aspect of SDR is consistent with Paulhus’s (1993) “impression manage-
ment” concept, which highlights respondents’ attempts to shape their answers
purposefully to reflect the most positive image.
SDR can affect the measurement of constructs as well as the relationships
among constructs (cf. Mick, 1996, pp. 109-110). Briefly, SDR can increase
relationships such that correlations between constructs are due to shared vari-
ance in SDR. This phenomenon is termed the spuriousness effect. In the sup-
pression effect, the true correlation between two measures is masked by SDR.
A third possible effect of SDR occurs when the form of the relationship
between two measured variables is affected. In these latter situations, SDR
moderates the relationship between constructs. Procedures for investigating
these alternative problems associated with response bias are described by
Mick (1996) and Ganster, Hennessey, and Luthans (1983).
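As a concrete illustration of probing the spuriousness and suppression effects (a sketch of ours in Python, not the specific procedure of Ganster et al., 1983), one can compare the zero-order correlation between two construct scores with their partial correlation after an SDR measure is partialled out of both:

import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y with z (an SDR score) partialled out of both."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

# If the zero-order r shrinks toward zero after partialling, shared SDR
# variance may have inflated it (spuriousness); if it grows, SDR may have
# masked the true relationship (suppression).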
The methods for coping with SDR bias were summarized recently by
Tian, Bearden, and Manning (2002). The procedures fit into two categories:
methods devised to prevent survey participants from responding in a socially
desirable manner and methods aimed at detecting and measuring social
desirability response bias (Nederhof, 1985; Paulhus, 1991).
SUMMARY
FIVE
STEPS 1 AND 2
Construct Definition and Generating and
Judging Measurement Items
INTRODUCTION
and content domain, (b) the importance of a thorough literature review and
definition judging by experts and individuals from relevant populations,
(c) the focus on effect (reflective) items/indicators, and (d) the importance of
an a priori construct dimensionality. The second part of this chapter deals with
the second step in scale development: generating and judging an initial pool of
items to reflect the construct. With this step, we will briefly cover the theoret-
ical assumptions of domain sampling. We will also discuss issues in the
generation of an item pool, question/statement wording options, choice of
response formats, and judging the items for content and face validity by both
experts and potential respondents from relevant populations. The last part of
this chapter uses examples that illustrate and apply many of the procedures
recommended for the first two steps in scale development.
[Figure: example nomological network for an ad skepticism construct, linking personality traits (cynicism, self-esteem), consumer characteristics (age, education), consumption experiences, ad information processing, reliance on ads, ad appeals, brand beliefs, and attitudes toward marketing, advertising, and brands.]
thus, each indicator is important to the validity of the construct (Bollen &
Lennox, 1991; MacCallum & Browne, 1993; Neuberg et al., 1997). This is not
necessarily the case with effect indicators. With reflective indicators, the items
must represent a reasonable sample of items tapping the domain of the
construct (Nunnally & Bernstein, 1994).
Second, the fact that the indicators of a formative construct are combined
to produce an overall index does not necessarily imply that all individual indi-
cator scores are intercorrelated, and whether they are or are not correlated is
irrelevant to the reliability of the measure (Bollen & Lennox, 1991; Smith &
McCarthy, 1995). Formative indicators need not be internally consistent, and
reliability methods based on internal consistency do not apply. With effect
items, the interrelatedness among items, and hence internal consistency, is of
concern for the reliability of the measure.
Not only are there conceptual differences between formative and effect
indicators, but the methods used in developing such measures differ as well. (For
an excellent review of these methods, see Diamantopoulos and Winklhofer,
2001.) Here again, a well-defined and carefully thought-out construct
definition is very useful: theory and a thorough literature review can help
determine whether the construct is best measured with formative or effect indicators.
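In standard measurement-model notation (our summary of the conventions used by Bollen & Lennox, 1991, not equations from this text), the distinction can be written compactly. For effect (reflective) indicators, the latent construct \xi causes the items:

x_i = \lambda_i \xi + \delta_i, \quad i = 1, \ldots, p

For formative indicators, the items jointly compose the construct \eta:

\eta = \sum_{i=1}^{p} \gamma_i x_i + \zeta

In the first case, the items are interchangeable samples from the construct's domain; in the second, each item contributes unique content, so dropping an indicator can change the meaning of the construct itself.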
Domain Sampling
Once the construct has been accurately defined and delineated, the task of
generating items to tap the construct’s domain begins. A model that is consis-
tent with elements of both classical test theory and generalizability theory for
generating items is domain sampling. As a model of measurement error,
construct has been sampled. That is, the items should appear consistent with the
theoretical domain of the construct in all respects, including response formats
and instructions.
Second, in generating items, face validity must also be considered. A
highly face valid scale enhances cooperation of respondents because of its
ease of use, proper reading level, and clarity, as well as its instructions and
response formats. Thus, from a practical perspective, face validity may be
more concerned with what respondents from relevant populations infer with
respect to what is being measured, and content validity is concerned with face
validity as well as what the researcher believes he or she is constructing
(Haynes et al., 1995; Nunnally & Bernstein, 1994).
Third, even with the focus on content and face validity, two other issues
should be considered in constructing a pool of items. Clark and Watson (1995)
advocated that the scale developer go beyond his or her own view of the con-
struct in generating an initial item pool and that the initial pool contain items
that ultimately will be only tangentially related to the content domain of the
construct. Thus, it is better to be overinclusive of the construct’s domain rather
than underinclusive in generating an item pool. Care must also be taken to
ensure that each content area of the construct has an adequate sampling of
items. Although difficult to achieve in practice, broader content areas should
be represented by a larger item pool than narrower content areas.
With these issues in mind, item generation can begin with careful thought
to (a) what should be the source of the items, (b) item wording issues, and
(c) how many items should serve as an initial pool. For these issues, definitive
answers do not exist, but some practical guidelines are evident.
Item Sources
Item Writing
Such items produce little item variance and thus little scale variance. As Clark
and Watson (1995) noted, items that everyone endorses the same way in a
positive manner (e.g., “Sometimes I am happier than other times”) or the same
way in a negative manner (e.g., “I am always furious”) add little to the content
validity of a construct.
Positively and Negatively Worded Items. The choice of using all positively
worded items vs. some positively and some negatively worded items is also of
interest. Several scale developers have written items that reflect low levels
of, the opposite of, or the absence of the target construct. The primary goal of
such a procedure is to "keep the respondent honest" and thus avoid response
bias in the form of acquiescence, affirmation, or yea-saying. Cacioppo and
Petty’s (1982) Need for Cognition scale is an example. They used items that
reflect a higher level of the construct, such as "I prefer complex to simple
problems,” and items that reflect a low level of the construct, such as “I only
think as hard as I have to.” It has been our experience, however, that nega-
tively worded items either do not exhibit as high a reliability as positively
worded items do or can be confusing to respondents. Such items also may
contribute to a methods factor in factor analytic models because positively
worded items tend to load highly on one factor and negatively worded items
tend to load highly on another factor. (See Herche and Engelland [1996] for a
discussion and examples.) With that possibility in mind, the researcher must
weigh the potential advantages and disadvantages of using negatively worded
items in the item pool.
Most multichotomous scales use between 5 and 9 scale points, with some
going as low as 3 and as high as 11 scale points. It has been our experience
that 5- or 7-point formats suffice, and providing more response alternatives
may not enhance scale reliability or validity. If the researcher wants to provide
a “label” for each scale, it is easier and probably more meaningful for both
scale developer and respondent if 5- or 7-point formats are used. More alter-
natives may require more effort on the respondent's part by forcing him or
her to make finer distinctions. This, in turn, can produce random responding
and more scale error variance.
Another consideration is using an odd or even number of scale points. With
an odd number, the respondent is offered a scale midpoint or “neutral” response.
Such a response, in effect, expresses a “no opinion,” “not sure,” or “neither
agree nor disagree” option depending on what is being assessed. The scale
developer must be careful in choosing an appropriate wording for such a mid-
point if he or she chooses to label each point on the scale. As stated by Clark and
Watson (1995, p. 313), a scale midpoint of “cannot say” may actually confound
the midpoint of the scale with an uncertainty of what the item means to the
respondent. On the other hand, an even number of scale points forces the respon-
dent to have an opinion, or at least make a weak commitment to what is being
expressed in an item (i.e., express some level that reflects a preference for
endorsing or not endorsing an item), that he or she may not actually have.
Although some researchers believe that neither an odd nor an even number of
scale points is superior, it would seem that for some items a neutral response is
a valid answer, so that an odd number of scale points is appropriate.
Overall, it has been shown that if reliably and validly constructed,
dichotomous and multichotomous scales will yield similar results. It is
strongly recommended that the format, wording of scale points, and number
of scale points be carefully judged by experts and pilot tested prior to other
scale construction steps.
pool as small as 20-30 items may suffice. In fact, DeVellis (1991) recommends
that for narrow constructs, a pool that is twice the size of the final scale will
suffice. For broader, multifaceted constructs, many more items will be needed
to serve as an initial pool. Some researchers advocate a pool of 250 items as
exemplary for item generation for multifaceted constructs (Robinson et al.,
1991, pp. 12-13). Still, one must consider other issues, such as item redun-
dancy, a desired level of internal consistency, and respondent cooperation. It
is possible for a pool to be so large as to hinder respondent cooperation on a
single occasion.
In sum, there are no hard-and-fast rules for the size of an initial item pool.
Narrowly defined, single-facet constructs will require fewer items than will
complex multifaceted constructs. In general though, a larger number is preferred,
as overinclusiveness is more desirable than underinclusiveness.
Establishing content and face validity varies with how precisely the
construct is defined and the degree to which experts agree about the domain and
facets of the construct. The wider the discrepancy as to what the construct is, the
more difficult content validation will be. Content validity is threatened if
(a) items reflecting any of the domains (facets) are omitted from the measure,
(b) items measuring domains (facets) outside the definition of the construct are
included in the measure, (c) an aggregate score on the construct disproportion-
ately reflects one domain (facet) over other domains (facets), or (d) the
instrument is difficult to administer and respond to for the target populations.
The first part of this chapter offered a conceptual review of the first two steps
in scale development and validation. The remainder of this chapter will offer
examples and an application of these two steps.
Wording of all items used common consumer vernacular, and via the consumer
responses and author judgment, 7-point Likert-type strongly disagree to
strongly agree scales were used to evaluate the items. This large item pool was
trimmed to a more manageable number for the developmental studies that
followed. Statements that were viewed as redundant (i.e., useless redundancy)
were trimmed by the authors, resulting in 180 items to be judged by six experts
(five holding PhDs and one doctoral candidate). Using an a priori decision rule
in which at least five of six judges had to agree that an item tapped the
construct definition/domains, a total of 117 items were retained for the initial
scale developmental study. In sum, Shimp and Sharma aptly followed Steps 1
and 2 in scale development.
Another example of construct definition and item generation and
refinement can be found in Lichtenstein et al. (1993). With their construct "price
consciousness," a narrower definition and domain were a major focus to
differentiate the construct from other pricing-related concepts. Using theory and
extant literature, Lichtenstein et al. (1993) defined price consciousness as the
“degree to which the consumer focuses exclusively on paying low prices”
(p. 235). A primary theoretical premise was that consumers view the price they
pay for a product as having either a negative or positive role, and that each role
is represented by a higher-order factor with first-order subcomponents (dimen-
sions). As a first-order factor of the negative role of price, price consciousness
is distinct from the positive role of price concepts of price-quality schema (i.e.,
the higher the price, the higher the quality of the product) and price prestige sen-
sitivity (i.e., the prominence and status that higher prices signal to other people
about the purchaser). As such, what was excluded from the domain of price con-
sciousness took on a dominant role in defining the construct and determining its
dimensionality within the presence of other constructs. Theoretically then,
Lichtenstein et al. (1993) posited price consciousness as a first-order factor of
the higher-order factor of “negative role of price.”
They further delineated their construct within a nomological network in
which other first-order factors of the negative role of price (e.g., coupon
proneness and sale proneness) would be highly positively related to price con-
sciousness, and price-quality schema and price prestige sensitivity would be
negatively related to price consciousness. Basing their logic in theories of con-
sumer buying and information search/processing, Lichtenstein et al. (1993)
further posited consequences of price consciousness including “searching for
low prices inside the store” and the “actual quantity” and “dollar amounts” of
sale products purchased.
constructs, 110 items were generated for the initial pool, with about one third of
the items being negatively worded. Four expert judges, using both qualitative
and quantitative procedures, judged these items for representativeness. The
judges were asked to place items in one of three categories based on the con-
struct definition and content domain: (a) very representative of the construct
(domain), (b) somewhat representative of the construct (domain), or (c) not
representative of the construct (domain). The judges were also asked to change
the wording of items so that, when modified, they would be at least somewhat
representative of the construct (domain).
Using a variation of Cohen's kappa (Perreault & Leigh, 1989), an inter-
rater reliability coefficient that can range from 0 to 1 was constructed. (Appendix
5B offers the calculations for this interrater reliability index.) For all four
judges simultaneously, the value of this coefficient was .51 across items; for
two judges at a time, this coefficient ranged from .63 to .79. Based on these
results, a decision rule was adopted to retain only those items that all four
judges classified as at least “somewhat representative” of the construct
(domain). Via further author judging for item redundancy (i.e., useless redun-
dancy), the initial pool for the first developmental study contained 7, 8, and 7
items (22 in total) for the WFC general demand, time-based, and strain-based
WFC domain facets, respectively, and 8, 7, and 6 items (21 in total) for the
FWC general demand, time-based, and strain-based domain facets. In sum,
Netemeyer et al. (1996) used several procedures to generate and judge their
item pool. (The final forms of the scales are offered in Appendix 5A.)
SUMMARY
In this chapter, we have offered several illustrations of the important first two
steps in scale development: Step 1 of construct definition and content domain
and Step 2 of generating and judging measurement items. We discussed sev-
eral important issues related to the first step, including the role of theory in
construct definition and content domain, the importance of a thorough litera-
ture review and definition, and the importance of an a priori construct dimen-
sionality. Issues relevant to the second step also were highlighted, including
generating a sufficiently large pool of items to tap the content domain, the
various aspects of item writing, and the use of expert and lay judges for judg-
ing content and face validity. Chapters 6 and 7 will turn to empirical studies
to further develop, refine, and finalize a scale.
Appendix 5A
Consumer Self-Confidence Scales
SOURCE: Reprinted from “Lifestyle of the Tight and Frugal: Theory and Measurement,”
Journal of Consumer Research, 26(June), p. 89 (Table 2), Lastovicka, Bettencourt, Hughner,
and Kuntze, copyright 1999 by the University of Chicago Press. Reprinted with permission of
the University of Chicago Press.
FWC:
1) The demands of my family or spouse/partner interfere with work-related activities. (.73)
2) I have to put off doing things at work because of demands on my time at home. (.89)
3) Things I want to do at work don't get done because of the demands of my family or spouse/partner. (.83)
4) My home life interferes with my responsibilities at work such as getting to work on time, accomplishing daily tasks, and working overtime. (.83)
5) Family-related strain interferes with my ability to perform job-related duties. (.75)
SOURCE: Reprinted from “Development and Validation of Work-Family Conflict and Family-
Work Conflict Scales,” Journal of Applied Psychology, 81(4), p. 410, Netemeyer, Boles, and
McMurrian, copyright 1996 by the American Psychological Association. Reprinted with
permission from the American Psychological Association.
NOTE: The factor loadings are based on the two-factor correlated model from the confirmatory
factor analysis in Study Three (real estate salespeople sample).
SOURCE: Reprinted from “Price Perceptions and Consumer Shopping Behavior: A Field
Study,” Journal of Marketing Research, 30(May), pp. 233-234, Lichtenstein, Ridgway, and
Netemeyer, copyright 1993 by the American Marketing Association. Reprinted with permission
of the American Marketing Association.
CETSCALE Items                                                                          Factor Loading
1) American people should always buy American-made products instead of imports.  .81
2) Only those products that are unavailable in the U.S. should be imported.  .79
3) Buy American-made products. Keep America working.  .71
4) American products, first, last, and foremost.  .81
5) Purchasing foreign-made products is un-American.  .80
6) It is not right to purchase foreign products because it puts Americans out of jobs.  .85
7) A real American should always buy American-made products.  .84
8) We should purchase products manufactured in America instead of letting other countries get rich off us.  .82
9) It is always best to purchase American products.  .77
10) There should be very little trading or purchasing of goods from other countries unless out of necessity.  .73
11) Americans should not buy foreign products, because this hurts American business and causes unemployment.  .82
12) Curbs should be put on all imports.  .72
13) It may cost me in the long-run but I prefer to support American products.  .74
14) Foreigners should not be allowed to put their products on our markets.  .72
15) Foreign products should be taxed heavily to reduce their entry into the U.S.  .76
16) We should buy from foreign countries only those products that we cannot obtain within our own country.  .77
17) American consumers who purchase products made in other countries are responsible for putting their fellow Americans out of work.  .81
Appendix 5B
After compiling the results across all items, Netemeyer et al. (1996)
constructed their interrater reliability index based on Perreault and Leigh’s
(1989, p. 141) variation of Cohen's kappa. The equation is as follows:

I_r = \sqrt{\left[\frac{F}{N} - \frac{1}{k}\right]\left[\frac{k}{k-1}\right]}

where
F is the number of items that all judges placed in the same category (the absolute level of observed agreement),
N is the total number of items judged, and
k is the number of judging categories.

All four judges placed 56 of the 110 items (for both WFC and FWC) into
the same category. Thus, with k = 3 categories, the equation is solved as follows:

I_r = \sqrt{\left[\frac{56}{110} - \frac{1}{3}\right]\left[\frac{3}{2}\right]} = .51

For two judges, where they placed 82 of the 110 items into the same
category, the equation is solved as follows:

I_r = \sqrt{\left[\frac{82}{110} - \frac{1}{3}\right]\left[\frac{3}{2}\right]} = .79
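A small computational sketch of this index (our Python illustration; the function name is ours) reproduces the two values above:

import math

def perreault_leigh(f_agree, n_items, k_categories):
    """I_r = sqrt([F/N - 1/k] * [k/(k-1)]), floored at 0 for chance-level agreement."""
    p = f_agree / n_items - 1.0 / k_categories
    return math.sqrt(max(p, 0.0) * k_categories / (k_categories - 1))

print(round(perreault_leigh(56, 110, 3), 2))  # -> 0.51 (all four judges)
print(round(perreault_leigh(82, 110, 3), 2))  # -> 0.79 (two judges)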
SIX
STEP 3
Designing and Conducting
Studies to Develop a Scale
INTRODUCTION
PILOT TESTING
Sharma collected data from a portion of that study’s sample with the following
open-ended question: “Please describe your views of whether it is right and
appropriate to purchase products that are manufactured in foreign countries.”
Responses were coded by two independent judges as “ethnocentric” or “non-
ethnocentric” with a 93% agreement rate. Then, the correlation between this
dichotomously scored variable and the CETSCALE was calculated (r = .54,
n = 388). Although this result may be considered modest in terms of conver-
gent validity, the measures on which it was based were separated by a 2-year
time difference, and the correlation was between a continuous and a dichotomous
measure.
Shimp and Sharma also tested for discriminant validity. They correlated
the CETSCALE with measures of patriotism, politico-economic conser-
vatism, and dogmatism (i.e., potential antecedents that should be distinct from
consumer ethnocentrism). Correlations ranged from .39 to .66, providing
evidence of discriminant validity. Nomological validity and elements of
predictive validity were also tested in all six studies, using attitudes toward
foreign-made products, intention to buy foreign-made products, intention to
buy American-made products only, and domestic automobile ownership (i.e.,
potential consequences) as correlates. Of the 28 relationships hypothesized, 26
were significant in the predicted direction, demonstrating evidence of the
nomological and predictive validity of the CETSCALE. Finally, in one of their
studies, Shimp and Sharma (1987) showed that the CETSCALE demonstrated
aspects of known-group validity. Theory strongly suggested that those whose
economic livelihood (i.e., job) is most threatened by foreign products would
show higher mean-level scores on the CETSCALE than those whose eco-
nomic livelihood is less threatened. This hypothesis was supported, as a
sample from Detroit (home of the U.S. auto industry) showed significantly
higher mean scores on the CETSCALE than those from three other areas of
the country. Known-group validity thus was supported.
A second example of a validation strategy comes from the consumer
self-confidence scales developed by Bearden et al. (2001). They assessed
aspects of convergent, discriminant, nomological, predictive, and known-
group validity. They also examined an often-overlooked aspect of validity
testing—contamination from social desirability bias (Mick, 1996). In terms
of convergent validity, in one of their samples, Bearden et al. (2001) pre-
sented respondents with definitions of the facets and then asked them to rate
the degree to which their behaviors were characteristic of each definition on
7-point scales. Two weeks later, the same sample responded to all facets of
the consumer self-confidence scales. The summed scores on each facet's scale
were then correlated with the respective definition-based rating of
consumer self-confidence. The correlations, ranging from .39 to .63,
showed evidence of convergence.
Bearden et al. (2001) examined the discriminant validity among the six
facets via confirmatory factor analysis (CFA) in two of their studies. For each
possible pair of facets, two discriminant validity tests were performed. In the
first test, if the average variance extracted (AVE) by each facet’s measure was
greater than the shared variance between the facets (i.e., the square of the
correlation between the facets as separate factors), evidence of discriminant
validity was deemed to exist. In the second test, if a two- factor model (where
the items of the two facets are treated as distinct yet correlated constructs) fit
significantly better than a one-factor model (where the items of two facets are
combined into one factor), evidence of discriminant validity was deemed to
exist (Anderson & Gerbing, 1988). Both tests supported the discriminant
validity among the facets of consumer self-confidence. (Chapter 7 expands on
the use of CFA for testing discriminant validity.)
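The first of these tests is easy to express in code. The sketch below (ours) checks the Fornell and Larcker (1981) condition for a pair of facets; the AVE values and factor correlation in the usage comment are the WFC-FWC estimates reported in Chapter 7, used purely for illustration.

def discriminant_ok(ave_a, ave_b, phi):
    """Fornell-Larcker test: each construct's AVE must exceed the pair's shared variance."""
    shared = phi ** 2  # squared correlation between the two factors
    return ave_a > shared and ave_b > shared

# e.g., with AVEs of .60 and .48 and a factor correlation of .33:
print(discriminant_ok(.60, .48, .33))  # True; shared variance = .11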
In three of their studies, Bearden et al. (2001) included measures to test
aspects of nomological and predictive validity. In two of the studies, they
correlated the facets of consumer self-confidence with susceptibility to norma-
tive influence, subjective product knowledge, price-quality schema, and trait
self-esteem. The pattern of correlations supported the nomological validity of
the consumer self-confidence scales. In another study, Bearden et al. (2001)
assessed the relations between the facets of consumer self-confidence and
state self-esteem, information processing confidence, and product-specific
self-confidence for five product categories via multiple regression analyses.
The pattern of results for these equations supported the predictive validity of
the consumer self-confidence measures. Known-group validity was tested by
comparing the mean scores between the respondents of two of the samples
(i.e., the general population) with a sample of professionals with an average of
23.4 years of experience in sales and marketing (n = 44). For two facets of
consumer self-confidence (persuasion knowledge and protection), the mean
score differences between these two groups were significant in the predicted
direction (t values ranged from 3.52 to 7.16).
Given that individuals tend to overestimate their desirable characteristics
and underestimate their undesirable characteristics, Bearden et al. (2001)
tested for social desirability bias in one of their studies. Paulhus’s (1993)
impression management scale was correlated with the facets of consumer
self-confidence. Only two of the six facets were significantly (though weakly)
correlated with impression management (r = .18 and r = .21), suggesting that
the scales are mostly free of social desirability bias. In sum, the Shimp and
Sharma (1987) and Bearden et al. (2001) examples demonstrate strategies for
designing multiple studies for validity assessment.
Exploratory factor analyses (EFA) can be used for two primary purposes in
scale development: (a) to reduce the number of items in a scale so that the
remaining items maximize the explained variance in the scale and maximize the
scale’s reliability and (b) to identify potential underlying dimensions in a scale.
These two uses are related, and in the case of scale development, they can be
used in a complementary fashion. The following sections will discuss the use of
EFA, focusing on how it can be used to evaluate a potential a priori theoretical
factor structure of a set of measures and as a tool to reduce the number of items. We then apply
common factor analysis to the data collected by Netemeyer et al. (1996) from
their work-family conflict (WFC) and family-work conflict (FWC) scales to
demonstrate the use of common factor analysis in scale development.
EFA Options
explain the correlations among the items. Common factor analysis, however,
uses the communality estimates of the items on the main diagonal.
Furthermore, the variance in a given item is partitioned into that which is
common to a factor or latent variable (based on the shared variance with other
items in the analysis) and variance that is unique to a given item—a combina-
tion of specific variance and random error variance. Common factor analysis
can be used to identify the theoretical construct(s) whose effects are reflected
by responses to items in a scale. Common factor analysis also can offer
information as to what items to delete or retain, similar to that of PCA.
Many authors report that the solutions derived from PCA and common
factor analyses tend to be quite similar. This is particularly the case when the
number of items exceeds 30 or communalities exceed .60 for most items (e.g.,
Hair et al., 1998). Others have questioned the finding that PCA and common
factor analyses yield similar results. Differences may be most pronounced
with a small number of items and low communalities. In fact, in such a case,
PCA and common factor analyses can offer divergent results (Floyd &
Widaman, 1995). Some authors therefore suggest that common factor analy-
ses are preferred over PCA, as most scale development applications look to
understand a construct in terms of the number of latent factors that underlie it,
as well as to reduce the number of items in a scale. Furthermore, given that
confirmatory factor analysis (CFA) is a widely used tool in scale finalization,
EFA-based common factor analyses may generalize better to CFA than does
PCA (Floyd & Widaman, 1995).
There are various criteria for extracting factors. Hair et al. (1998), Floyd
and Widaman (1995), and Sharma (1996) offer eloquent expositions of these
criteria. We will only briefly review them here. Although some statistical tests
are available for factor extraction, they generally are restricted to estimation
techniques more closely aligned with CFA (e.g., maximum likelihood estima-
tion). (This issue will be discussed in Chapter 7.)
For EFA, psychometric criteria and “rules of thumb” are most often
applied. As noted in Chapter 2, the “eigenvalue-greater-than-1” rule (Kaiser-
Guttman criterion or Latent Root criterion) is often used as a psychometric cri-
terion. Each component (factor) has an eigenvalue that represents the amount
of variance accounted for by the component, where the sum of all eigenvalues
is equal to the number of items analyzed. An eigenvalue less than 1 indicates
that the component accounts for less variance than any single item. Thus, with
data reduction as a goal, a component with an eigenvalue less than 1 is not
considered meaningful. It should be noted that the Kaiser-Guttman criterion
can underestimate the number of components in some circumstances and may
not be reliable. Cliff (1988) showed that the eigenvalue-greater-than-one rule
is flawed, and it is probably best used as a guide rather than as an absolute
criterion to extract factors.
Other rules of thumb are often advocated for factor extraction. The first of
these involves the scree test. The scree test plots the eigenvalues and shows the
slope of a line connecting them. Factors are retained up to the point where the slope
of this line approaches zero and a sharp "elbow" occurs. Deleting
a factor well below this elbow will result in little loss of explained variance. This
may be particularly the case with PCA, but with common factor
analyses, eigenvalues less than 1 may be worth examining because the judgment
of the elbow is subjective. In fact, Floyd and Widaman (1995) suggested that for
common factor analysis, when two or more factors are near the elbow cutoff,
alternative factor solutions with differing numbers of factors should be exam-
ined to avoid overlooking a useful factor. Because determining the elbow is very
subjective, Horn (1965) proposed the use of parallel analysis to identify the
elbow. (See Chapter 2 for more details.) The parallel analysis computes the
eigenvalues of variables for a given sample size assuming that the correlations
among the variables are the result solely of sampling error. That is, this analysis
provides an estimate of the eigenvalues for items that have no common factors.
(Equation 2.3 of Chapter 2 can be used to determine these eigenvalues.)
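A minimal sketch of parallel analysis (our Python illustration, assuming numpy) compares the observed eigenvalues against the mean eigenvalues from repeatedly generated random data of the same dimensions; factors are retained where the observed eigenvalue exceeds its random-data counterpart:

import numpy as np

def parallel_analysis(data, n_sims=100, seed=0):
    """Horn's parallel analysis on an (n respondents x p items) data matrix."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand = np.zeros((n_sims, p))
    for s in range(n_sims):
        x = rng.standard_normal((n, p))  # uncorrelated data: sampling error only
        rand[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(x, rowvar=False)))[::-1]
    threshold = rand.mean(axis=0)  # some applications use the 95th percentile instead
    n_factors = int(np.sum(obs > threshold))
    return n_factors, obs, threshold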
A second rule of thumb for retaining factors pertains to the number of
items that substantially load on a factor. What is considered substantial is
somewhat open for debate, but loadings in the .40 range and above have been
classified as substantial (Floyd & Widaman, 1995), and loadings above .50
have been considered as “very significant” (Hair et al., 1998). Factors with
only a single substantial loading will be of little consequence because only the
specific factor variance associated with that item is being accounted for, and
it has been suggested that at least three items that load highly are needed to
identify a factor (Comrey, 1988). Still, many authors suggest that sample size
must be considered in judging not only the size of the factor loading but also
its stability. Rules of thumb for EFA techniques range from a minimum sam-
ple size of 100 to a size of 200 to 300; another recommendation is a sample
size of 5 to 10 respondents per item (Clark & Watson, 1995; Comrey, 1988;
Floyd & Widaman, 1995; Hair et al., 1998).
Rotational Methods
To make factors more interpretable (and item retention and deletion more
meaningful), factors are “rotated” after extraction. A basic goal for scale
developers is to look for simple structure after rotation. Simple structure
occurs when each item loads highly on as few factors as possible, or, more
preferably, has a substantial loading on only one factor. Rotation can be spec-
ified either as orthogonal or as oblique, where orthogonal rotation keeps fac-
tors uncorrelated and oblique rotation allows factors to correlate. VARIMAX
is the most common form of orthogonal rotation for EFA and will show simple
structure in most cases. Given, however, that a goal of EFA for scale
development is to identify latent constructs that often are expected to correlate,
oblique rotation also merits consideration.
Given that EFA can be used as a method to reduce the number of items
in a scale, questions arise regarding how many items should be deleted and
what criteria to use in deleting items. As stated above, obtaining simple
structure is a goal of EFA achieved by looking for loadings that are sub-
stantial (.40 and above). This goal should be used with EFA across multiple
data sets (i.e., at least two). Furthermore, scale developers must also look for
extremely high loadings, as items with such loadings may be indicative of
wording redundancy that does not add substantively to a scale’s internal con-
sistency or validity. Thus, in general, we advocate retaining items via multi-
ple EFAs with loadings no less than .40 but no greater than .90. At this stage,
however, we caution against deleting items that may not meet this criterion
but still are judged to have face and/or content validity. Furthermore, item
deletion and retention in the early studies of scale development should
simultaneously consider reliability and item-based statistics such as cor-
rected item-to-total correlations, average interitem correlations, and item
variances.
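The retention heuristic just described can be expressed as a simple screening pass over a rotated loading matrix from any EFA routine (a sketch of ours; the function and cutoff arguments are illustrative, and flagged items should still be reviewed for face and content validity before deletion):

import numpy as np

def flag_items(loadings, item_names, lo=0.40, hi=0.90):
    """Flag items whose largest absolute loading falls outside [lo, hi]."""
    flags = {}
    for name, row in zip(item_names, np.abs(np.asarray(loadings, dtype=float))):
        peak = row.max()
        if peak < lo:
            flags[name] = f"no loading above {lo:.2f}: candidate for deletion"
        elif peak > hi:
            flags[name] = f"loading above {hi:.2f}: possible wording redundancy"
    return flags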
defined constructs, they advocated a range of .40 to .50. They also advocated
a coefficient alpha level of at least .80 for a new scale and suggested that
retaining items with greater variances relative to other items will help to
increase overall scale variance. Bearden and Netemeyer (1998) advocated cor-
rected item-to-total correlations of .50 and above and alpha levels of .80 or
above (subject to scale length considerations).
These heuristics make sense, but it must also be remembered that coeffi-
cient alpha is related to scale length, average interitem correlation (covariance),
item redundancy, and dimensionality. First, consider scale length. It was noted
in Chapter 3 that as the number of items increases, alpha will tend to increase.
Because parsimony is also a concern in measurement and most scales are
self-administered, scale brevity is often a concern.
Item redundancy and average interitem correlation also affect coefficient
alpha. Although it is advocated that several items are needed to adequately
tap the domain of the construct, when the wording of the items is too similar,
coefficient alpha (as well as content validity and dimensionality) may not be
enhanced. Items that are worded too similarly will increase average interitem
correlation and coefficient alpha; however, such items may just contribute to
the “attenuation paradox” in psychometric theory, whereby increasing coef-
ficient alpha beyond a certain point does not increase internal consistency
(see Chapter 3). With these considerations in mind, researchers must be care-
ful in their interpretation of alpha and consider its relationship to the number
of items in a scale, the level of interitem correlation, and item redundancy.
Last, and as discussed in Chapters 2 and 3, it must also be remembered that
it is possible for a set of items to be interrelated but not unidimensional.
Coefficient alpha is not a measure of unidimensionality and should be used to
assess internal consistency only after unidimensionality is established (Clark
& Watson, 1995; Cortina, 1993).
A Final Caveat
next “round” of studies. If they continue to perform poorly, they can always
be deleted when deriving the final form of the scale.
[Table: total variance explained (initial eigenvalues and extraction sums of squared loadings by factor) omitted.]
[Figure: scree plot of eigenvalues by factor number for the small business owners sample.]
Item                Factor 1        Factor 2
WFC1 .811
WFC2 .735
WFC3 .834
WFC4 .871
WFC5 .694
WFC6 .647
WFC7 .808
WFCT1 .662
WFCT2 .545
WFCT3 .675
WFCT4 .794
WFCT5 .692
WFCT6 .753
WFCT7 .828
WFCT8 .866
WFCS1 .481
WFCS2
WFCS3 .755
WFCS4 .828
WFCS5 .698
WFCS6 .593
WFCS7 .706
FWC1 .584
FWC2 .755
FWC3 .550
FWC4 .465
FWC5 .775
FWC6 .823
FWC7 .750
FWC8 .752
FWCT1 .518
FWCT2 .526
FWCT3 .658
FWCT4 .779
FWCT5 .886
FWCT6 .750
FWCT7 .767
FWCS1 .692
FWCS2 .804
FWCS3
FWCS4 .549
FWCS5 .588
FWCS6 .628
Factor Correlation Matrix
Factor        1        2
1         1.000     .403
2          .403    1.000
Loadings below .40 in magnitude are not printed. (Because the sample sizes were within the 150 to
200 range, loadings of .40 to .45 are statistically significant; Hair et al.
[1998, pp. 111-112].)
It is clear from these tables that most items loaded highly on their
intended factor, and this pattern was consistent across the samples.
Inconsistencies for the samples were noted for the following items. For the
small business owner sample, WFCS2 did not load above .40 on the WFC
factor, FWCT1 loaded higher on the WFC factor than on the FWC factor,
and FWCS3 did not load above .40 on the FWC factor. For the real estate
salespeople sample, WFCT2 and WFCS2 did not load above .40 on the WFC
factor, and FWCS3 did not load above .40 on the FWC factor. These items
were considered as candidates for deletion in later item analyses. Furthermore,
across the two samples, three high loadings were found: FWCT5 (.89 and .94)
and, to a lesser extent, FWCT6 (.75 and .93) and FWC7 (.75 and .93). These
items were also examined as candidates for deletion in later item analyses.
Still, the primary purpose here was to determine if a two-factor solution was
tenable, and common factor analysis revealed a consistent pattern of results
across the samples for the hypothesized two-factor WFC-FWC structure
underlying the data.
[Table: total variance explained (initial eigenvalues and extraction sums of squared loadings by factor) omitted.]
[Figure: scree plot of eigenvalues by factor number for the real estate salespeople sample.]
Item                Factor 1        Factor 2
WFC1 .754
WFC2 .749
WFC3 .788
WFC4 .874
WFC5 .766
WFC6 .774
WFC7 .774
WFCT1 .772
WFCT2
WFCT3 .474
WFCT4 .607
WFCT5 .478
WFCT6 .614
WFCT7 .673
WFCT8 .650
WFCS1 .628
WFCS2
WFCS3 .776
WFCS4 .792
WFCS5 .667
WFCS6 .495
WFCS7 .618
FWC1 .750
FWC2 .768
FWC3 .747
FWC4 .651
FWC5 .896
FWC6 .889
FWC7 .932
FWC8 .824
FWCT1 .709
FWCT2 .755
FWCT3 .813
FWCT4 .891
FWCT5 .939
FWCT6 .929
FWCT7 .777
FWCS1 .816
FWCS2 .670
FWCS3
FWCS4 .603
FWCS5 .518
FWCS6 .698
Factor Correlation Matrix
Factor        1        2
1         1.000     .482
2          .482    1.000
Table 6.3a Sample of Small Business Owners: WFC Scale, Reliability Analysis
Scale (Alpha)
Item Means
Mean Minimum Maximum Range Max/Min Variance
3.3188 2.5192 4.1923 1.6731 1.6641 .2666
Item Variances
Mean Minimum Maximum Range Max/Min Variance
3.6815 2.6641 4.6784 2.0142 1.7561 .2462
Interitem Correlations
Mean Minimum Maximum Range Max/Min Variance
.5157 .1363 .8562 .7199 6.2817 .0201
Table 6.3b Sample of Small Business Owners: WFC Scale, Item-Total Statistics
Table 6.3c Sample of Small Business Owners: FWC Scale, Reliability Analysis
Scale (Alpha)
Item Means
Mean Minimum Maximum Range Max/Min Variance
2.0771 1.6711 3.5921 1.9211 2.1496 .1466
Item Variances
Mean Minimum Maximum Range Max/Min Variance
1.9516 1.1920 4.6934 3.5014 3.9375 .6058
Interitem Correlations
Mean Minimum Maximum Range Max/Min Variance
.4232 − .1055 .8505 .9560 − 8.0599 .0391
Table 6.3d Sample of Small Business Owners: FWC Scale, Item-Total Statistics
same items used for the preceding EFA analyses were used here. These tables
show recommended levels of average interitem correlations for both the WFC
and FWC items (.42-.58 across samples), as well as initial corrected item-to-
total correlations mostly in the .50 to .80 range. (Coefficient alpha estimates
were all above .90.) These two tables also show results consistent with those
from the restricted two-factor EFA; that is, those items that did not load highly
on their intended EFA factors also showed low corrected item-to-total corre-
lations. For the small business owner sample, WFCS2 and FWCS3 had low
Table 6.4a Sample of Real Estate Salespeople: WFC Scale, Reliability Analysis
Scale (Alpha)
Item Means
Mean Minimum Maximum Range Max/Min Variance
3.3890 2.5027 4.3333 1.8306 1.3571 .3571
Item Variances
Mean Minimum Maximum Range Max/Min Variance
2.9427 2.3063 3.6817 1.3754 1.5963 .1413
Interitem Correlations
Mean Minimum Maximum Range Max/Min Variance
.4335 .0552 .8758 .8206 15.8783 .0298
Table 6.4b Sample of Real Estate Salespeople: WFC Scale, Item-Total Statistics
corrected item-to-total correlations, and for the real estate salespeople sample,
WFCT2 and FWCS3 had low item-to-total correlations. Furthermore, those
items from the EFA that had consistently high loadings were also found to
have corrected item-to-total correlations consistently above .80. A close
inspection of these items revealed that they had wording that was redundant
with other items and that they had extremely high correlations with the items
of similar wording (.85-.90). Thus, based on the EFA and reliability and item-
based statistics of two development studies, items WFCT2, WFCS2, FWCT1,
Table 6.4c Sample of Real Estate Salespeople: FWC Scale, Reliability Analysis
Scale (Alpha)
Item Means
Mean Minimum Maximum Range Max/Min Variance
2.3312 1.5414 2.9448 1.4033 1.9104 .0989
Item Variances
Mean Minimum Maximum Range Max/Min Variance
2.4831 1.2163 3.3228 2.1065 2.7319 .2236
Interitem Correlations
Mean Minimum Maximum Range Max/Min Variance
.5824 .1835 .9020 .7185 4.9159 .0221
Table 6.4d Sample of Real Estate Salespeople: FWC Scale, Item-Total Statistics
FWCS3, FWCT5, FWCT6, and FWC7 were primary candidates for deletion
for the next set of studies to finalize the scale.
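The item-based statistics used in these deletion decisions can be computed directly. Below is a sketch (ours, assuming numpy) of corrected item-to-total correlations and alpha-if-item-deleted for an n-by-k item response matrix:

import numpy as np

def item_total_stats(items):
    """Return [(corrected item-total r, alpha if deleted)] per item; requires k >= 3."""
    items = np.asarray(items, dtype=float)
    n, k = items.shape
    total = items.sum(axis=1)
    stats = []
    for j in range(k):
        rest = total - items[:, j]  # total score excluding item j ("corrected")
        r_cit = np.corrcoef(items[:, j], rest)[0, 1]
        others = np.delete(items, j, axis=1)
        cov = np.cov(others, rowvar=False)
        alpha_del = ((k - 1) / (k - 2)) * (1 - cov.trace() / cov.sum())
        stats.append((r_cit, alpha_del))
    return stats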
SUMMARY
This chapter addressed the following issues in Step 3 of the scale development
process: pilot testing a pool of items as an item-trimming and initial validity
screening procedure, the use of EFA to evaluate a hypothesized factor structure
and to identify items for deletion, and the use of reliability and item-based
statistics in judging an item pool.
SEVEN
STEP 4
Finalizing a Scale
INTRODUCTION
This chapter focuses on procedures to finalize the scale and further establish
its psychometric properties—our recommended fourth step in scale develop-
ment. We will specifically address the following: (a) exploratory factor analy-
sis (EFA) and additional item analyses as precursors to confirmatory factor
analysis (CFA), (b) CFA to help finalize and confirm a theoretical factor struc-
ture and test for the invariance of the factor structure over multiple data sets,
(c) additional validity testing, (d) establishing norms, and (e) applying gener-
alizability theory. As part of the chapter, we again offer examples from recent
published scale development articles in the marketing and organizational
behavior literatures. The material in the chapter thus provides additional
elaboration for many of the issues discussed in our previous chapters.
EFA
Item-to-Total Correlations
Interitem Correlations
wide range of multi-item scales and include many more measures than those
developed employing the rigorous scale development procedures described in
this volume.
Bagozzi (personal correspondence, February 1, 2002) suggested that, for
scales measuring personality traits or similar constructs, 8 to 10 items per con-
struct, or per dimension for multifactor constructs, might represent an ideal. A
related issue involves decisions regarding the number of indicators to use per
factor in CFA or structural equation modeling (SEM) with structural paths.
Again, some researchers argue for a minimum of four indicators, and others for
a minimum of three. The problem becomes even more troublesome when sam-
ple size is small relative to the number of parameters to be estimated. In these
cases, researchers have on occasion combined indicators. At least two options
are tenable. First, a composite index or a single construct indicator can be
formed by averaging all the indicators or items in a scale and then setting con-
struct error terms based on the reliability of the construct (i.e., set error at 1
minus the square root of alpha). A second approach involves combining subsets
of indicators into “parcels” (cf. Bagozzi & Edwards, 1998; Kishton & Widaman,
1994). In these instances, a smaller number of indicators per dimension is
employed in SEM analyses, where each parceled indicator is actually a com-
posite of multiple indicators as well. Finally, scale length considerations were
also addressed earlier, at the end of Chapter 3, in our discussion of the general-
izability coefficient and how a scale can be developed with a minimum number
of items without sacrificing the generalizability coefficient.
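As a brief illustration of the parceling option (our sketch; the grouping shown is an arbitrary placeholder, since substantive rules for assigning items to parcels vary), each parcel is simply the mean of a subset of items and then serves as an indicator in the SEM:

import numpy as np

def make_parcels(items, assignments):
    """items: (n x k) array; assignments: list of item-index lists, one per parcel."""
    items = np.asarray(items, dtype=float)
    return np.column_stack([items[:, idx].mean(axis=1) for idx in assignments])

# e.g., nine items split into three 3-item parcels:
# parcels = make_parcels(items, [[0, 1, 2], [3, 4, 5], [6, 7, 8]])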
Overview
With CFA, a theoretical factor structure is specified and tested for its
degree of correspondence (or “fit”) with the observed covariances among the
items in the factor(s). Confirming a factor structure depends on many issues.
Three of these issues are addressed here. First, the level of interitem correla-
tion and the number of items come into play. (We speak of correlations here,
but it is important to note that covariances among items should be used as
input to the CFA, particularly in tests of measurement invariance over multi-
ple samples.) Pairs of items that are highly correlated and that share variance
beyond that variance that is accounted for by their factor(s) can result in cor-
related measurement errors in CFA. These instances violate a basic tenet of
classical test theory (true-score model) that error terms among items are
uncorrelated. When a substantial number of error terms are highly correlated,
scale dimensionality is threatened. Threats to dimensionality often reveal
themselves in a number of CFA diagnostics, including fit indices, standardized
residuals, and modification indices, as well as the potential cross-loading of an
item onto another factor (i.e., the item
loads highly on a factor other than its intended factor). EFA and item statis-
loads highly on a factor other than its intended factor). EFA and item statis-
tics, however, do not always reveal potential correlated errors among items.
Given that correlated errors are a violation of the true-score model, they may
threaten the dimensionality of a scale. CFA can be very useful in detecting
correlated measurement error. Thus, CFA can be used as a means of scale
reduction to finalize a scale and then confirm the scale’s final form (Floyd &
Widaman, 1995). Later in this chapter, we offer examples of articles that used
CFA not only to confirm a scale structure but also to trim items from the scale.
Fit Indices
If the CFA model fits well, the parameter estimates are used to further
evaluate the model. In scale development, individual item loadings should be
assessed for both statistical significance and magnitude. Clearly, items that do
not load significantly on their intended factors should be deleted. The statisti-
cal significance of an item’s loading and its magnitude have been referred to
as the convergent validity of the item to the construct (Anderson & Gerbing,
1988; Fornell & Larcker, 1981). In CFA, a statistically significant loading at
the .01 level will have a t/z value greater than 2.57. A rather rigorous rule of
thumb advocates completely standardized loadings of .70 or greater.
Composite reliability and average variance extracted (AVE) are calculated
from the CFA estimates, where

\lambda_i = the completely standardized loading for the ith indicator,
V(\delta_i) = the variance of the error term for the ith indicator, and
p = the number of indicators:

\text{Composite Reliability} = \frac{\left(\sum_{i=1}^{p}\lambda_i\right)^2}{\left(\sum_{i=1}^{p}\lambda_i\right)^2 + \sum_{i=1}^{p}V(\delta_i)}

\text{AVE} = \frac{\sum_{i=1}^{p}\lambda_i^2}{p}
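Given a vector of completely standardized loadings, both quantities are short computations. The sketch below (ours) assumes the standardized case, in which each error variance is 1 - lambda_i squared:

import numpy as np

def composite_reliability(loadings):
    """Composite reliability from completely standardized loadings."""
    lam = np.asarray(loadings, dtype=float)
    err = 1 - lam**2  # standardized error variances V(delta_i)
    return lam.sum()**2 / (lam.sum()**2 + err.sum())

def ave(loadings):
    """Average variance extracted: sum of squared loadings divided by p."""
    lam = np.asarray(loadings, dtype=float)
    return (lam**2).mean()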
A fifth criterion for evaluating scales via CFA involves measurement invari-
ance testing. Increasingly, scale developers are using CFA to test the invariance
of their measures across samples. When parallel data exist across samples, multi-
group CFA offers a powerful test of the invariance of factor loadings, factor vari-
ances and covariances (correlations), and error terms for individual scale items.
If evidence for invariance exists, the generalizability of the scale is enhanced
(Bollen, 1989; Marsh, 1995; Steenkamp & Baumgartner, 1998).
models in the invariance hierarchy. Second, and as previously noted, the issue
of sample size must be considered. It is important to note that statistical tests
of invariance have the same limitations as statistical tests for other confirma-
tory models. That is, “invariance constraints are a priori false when applied to
real data with a sufficiently large sample size” (Marsh, 1995, p. 12). Given
this, fit indices also should be used to judge the invariance of parameter esti-
mates across samples. If the fit indices are adequate and do not change appre-
ciably from one model to the next in the hierarchy, reasonable evidence of
parameter equivalence exists.
Several examples from the literature highlight the use of CFA for delet-
ing potentially problematic items and confirming a scale’s structure. As pre-
viously noted, Shimp and Sharma (1987) used EFA to trim the number of
CETSCALE items from 100 to 25. Based on several of the criteria noted
above (CFA loadings greater than .70 and problematic correlated measure-
ment errors), they further finalized their scale to 17 items via CFA. After trim-
ming their pool of items from 97 to 39 via EFA (and item-based statistics),
Bearden et al. (2001) trimmed another eight items via CFA. Using MIs that
indicated high cross-loadings and/or correlated measurement errors, 31 items
were retained to represent the 6-factor scale. Then, CFA was again used to fur-
ther confirm their 31-item, 6-factor measure of consumer self-confidence.
Their confirmed structure showed reasonable fit for the CFI (.90) and NNFI
(.89). Lastovicka et al. (1999) first used EFA to trim their consumer frugality
scale (from 60 to 25 items), then used CFA to further refine the scale down to
an 8-item, 1-factor measure. They also used MIs to delete items that threat-
ened dimensionality. Their final scale showed fit indices in the very high range
(CFI = .98, NNFI = .97, and RMSEA < .08).
Netemeyer et al. (1996) used CFA to trim and then confirm their measures
of WFC and FWC. They reported an iterative CFA procedure to retain 5 WFC
items (from a pool of 22) and 5 FWC items (from a pool of 21) via several
CFA criteria involving MIs over three samples. Specifically, in the first
iteration, items were deleted that (a) showed consistent high cross-loadings on
an unintended factor (e.g., a WFC item loading highly on the FWC factor),
(b) showed consistent correlated measurement errors or a large number of
significant standardized residuals.
To demonstrate the use of the above evaluative criteria, we use data from
the small business owners and real estate salespeople samples of Netemeyer
et al. (1996). As stated above, the final versions of their WFC and FWC scales
represented two distinct 5-item factors. Thus, over both samples, the hypoth-
esized 2-factor WFC-FWC structure was estimated separately via LISREL8.
Appendix 7A shows the LISREL input file and portions of the output file
printouts for these models.
As per the first criterion (model convergence and “acceptable range” of
parameter estimates), no convergence warnings were noted. Furthermore, and
as shown on the printouts under COMPLETELY STANDARDIZED
SOLUTION LAMBDA-X, all estimates were in acceptable ranges across the
two samples (i.e., no negative error variances or loadings/correlations among
factors greater than 1). As per the second criterion (i.e., fit indices), these two
samples showed good fit, as RMSEA was ≤ .10 and CFI and NNFI were ≥.90
for both samples, as indicated in the GOODNESS OF FIT STATISTICS
section of the printouts. As per the third criterion (i.e., significance of para-
meter estimates and related diagnostics), acceptable levels were found. First,
and as shown in the LISREL ESTIMATES (MAXIMUM LIKELIHOOD)
section of the printouts, the t values across measurement items ranged from
7.19 to 14.96 (p < .01). As shown in the COMPLETELY STANDARD-
IZED SOLUTION LAMBDA-X section of the printouts, the magnitude of
loadings ranged from .58 to .89 across samples. Second, composite reliability for
the WFC and FWC scales ranged from .83 to .89, and the AVE estimates ranged
from .48 to .64. The estimates needed to calculate composite reliability and
AVE are found in the COMPLETELY STANDARDIZED SOLUTION
LAMBDA-X and the COMPLETELY STANDARDIZED SOLUTION
THETA-DELTA sections of the printouts. For WFC, from the small business
owners sample, composite reliability and AVE are as follows:
\text{Composite Reliability} = \frac{\left(\sum_{i=1}^{p}\lambda_i\right)^2}{\left(\sum_{i=1}^{p}\lambda_i\right)^2 + \sum_{i=1}^{p}V(\delta_i)}
Next, the WFC and FWC scales showed evidence of discriminant valid-
ity for the three recommended rules of thumb. First, the disattenuated correla-
tions (“phi”) between the two scales were .33 and .42 in the small business
owners and real estate salespeople samples, respectively. These phi estimates
can be found in the LISREL ESTIMATES (MAXIMUM LIKELIHOOD)
section of the printouts. The confidence interval around phi (±2 standard
errors) did not contain a value of 1. Second, when phi for the two factors was
constrained to 1 (constrained model) and compared with the hypothesized
model where phi was freely estimated (unconstrained model), the unconstrained
model showed a significantly lower χ2 than the constrained model. Third, the
AVE for WFC (.60 and .59 in the small business owners and real estate sales-
people samples) was greater than the square of the WFC-FWC parameter esti-
mate (phi = .33, phi2 = .10 in the small business owners sample; phi = .42,
phi2 = .17 in the real estate salespeople sample), and the AVE for FWC (.48
and .64 in the small business owners and real estate salespeople samples) was
greater than the square of the WFC-FWC parameter estimate (phi = .33,
phi2 = .10 in the small business owners sample; phi = .42, phi2 = .17 in the real
estate salespeople sample).
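The first and third rules of thumb reduce to simple arithmetic checks. The short Python sketch below applies them to the small business owners estimates quoted above (phi = .33 with a standard error of .09; AVEs of .60 for WFC and .48 for FWC); it is offered only as an illustration of the logic.

# Discriminant validity rules of thumb, small business owners sample.
phi, se_phi = 0.33, 0.09
ave_wfc, ave_fwc = 0.60, 0.48

# Rule 1: the interval phi +/- 2 standard errors should not contain 1.
lower, upper = phi - 2 * se_phi, phi + 2 * se_phi
print("CI excludes 1:", not (lower <= 1.0 <= upper))  # True: (.15, .51)

# Rule 3: each construct's AVE should exceed the squared factor correlation.
print("AVE(WFC) > phi^2:", ave_wfc > phi ** 2)  # .60 > .11 -> True
print("AVE(FWC) > phi^2:", ave_fwc > phi ** 2)  # .48 > .11 -> True

The second rule requires two LISREL runs (phi fixed to 1 versus phi freely estimated); with 1 df, a chi-square difference exceeding 3.84 indicates that the unconstrained model fits significantly better at p < .05.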
Estimates relevant to the fourth criterion pertain to SRs and MIs. The SR
estimates can be found in the LARGEST NEGATIVE STANDARDIZED
RESIDUALS and LARGEST POSITIVE STANDARDIZED RESIDU-
ALS sections of the printouts. The MI estimates associated with correlated
measurement error can be found in the MODIFICATION INDICES FOR
THETA-DELTA section of the printouts. Although several SRs were signif-
icant (greater than 2.57 in absolute value) and several MIs for correlated
measurement errors were greater than 3.84, none was considered to be a con-
sistent substantial threat to unidimensionality. As for MIs for loading highly
on a factor other than the intended factor—the MODIFICATION INDICES
FOR LAMBDA-X section of the printouts—none was significant (greater
than 3.84) for the small business owners sample, and two were significant for
the real estate salespeople sample: WFC5 and WFCT4 loaded on the FWC
factor. Still, the completely standardized loadings of these items on their
intended factors were .79 (WFC5) and .66 (WFCT4), whereas their expected
loadings on the FWC factor were only –.12 (WFC5) and .19 (WFCT4). (See
the COMPLETELY STANDARDIZED EXPECTED CHANGE FOR
LAMBDA-X section of the printouts.)
Overall, then, the finalized WFC and FWC scales met the first four eval-
uative criteria for CFA in scale development. Given this evidence, the mea-
surement invariance hierarchy can be tested: the fifth evaluative criterion.
Appendix 7B shows the LISREL8 input for an invariance hierarchy. (Note that
the model specified toward the bottom of the second sample in Appendix 7B
represents a full measurement invariance model.) To specify each model in the
hierarchy, the procedure is as follows. First, the baseline model is specified by
placing LX=PS on the MO line of the second sample. Second, the factor load-
ings invariant model is specified by placing LX=IN on the MO line of the sec-
ond sample. Third, the factor loadings and factor covariances invariant model
is specified by placing LX=IN on the MO line of the second sample plus a sep-
arate line stating EQ PH (1,2,1) PH (2,1) in the second sample. Fourth, the
factor loadings, factor covariances, and factor variances invariant model is
specified by LX=IN PH=IN on the MO line of the second sample. Finally, the
factor loadings, factor covariances and variances, and individual item error
terms invariant model is specified by LX=IN PH=IN TD=IN on the MO line
of the second sample (i.e., the full measurement invariance model as shown in
Appendix 7B).
Table 7.1 presents the estimates for the models in the invariance hierarchy.
The baseline model was supported, as adequate levels of fit for the CFI, NNFI,
and RMSEA were found; all loadings of items to their respective factors were
significant across groups (p < .01); and discriminant validity between the two
factors was supported. (Note that the χ2 and degrees of freedom (df ) for the
baseline model are the sum of the two chi-square and df values from samples
one and two separately.) Next, the model constraining the factor loadings to
be invariant across groups was estimated. The difference between this model
and the baseline model was not significant (χ2diff = 7.68, dfdiff = 8, p > .05), and
the fit indices (RMSEA, CFI, and NNFI) were at adequate levels. The model
constraining the factor loadings and factor covariances to be invariant across
groups was estimated next. The difference between this model and the base-
line model was not significant (χ2diff = 9.08, dfdiff = 9, p > .05), and the fit
indices were adequate. Next, the model constraining the factor loadings, fac-
tor covariances, and variances to be invariant across groups was estimated.
The difference between this model and the baseline model was significant
(χ2diff = 26.34, dfdiff = 11, p < .05), but the fit indices were still at adequate
levels. Last, the model constraining the factor loadings, factor covariances,
factor variances, and item error terms to be invariant across groups
was estimated. The difference between this model and the baseline model was
significant (χ2diff = 40.72, dfdiff = 21, p < .05), but the fit was adequate. In sum,
evidence for the statistical invariance of factor loadings and factor covariances
was found, and the other two models showed reasonable levels of fit across
the RMSEA, CFI, and NNFI, suggesting a “practical” level of invariance of
factor variances and item error term loadings.
Evidence of convergent validity can be gathered in several ways, including
the overall definitional procedure of Bagozzi (1993), the use of spousal dyad data
(e.g., Bearden et al., 2001), and the open-ended question approach of Shimp
and Sharma (1987). Other interesting and insightful examples taken from the
extant scale development literature include the following. First, Richins
(1983) used assertiveness and aggressiveness scales borrowed from psychol-
ogy as measures for assessing convergent validity of her own consumer inter-
action style variables. Bearden, Netemeyer, and Teel (1989) used student
judges to evaluate close friends in terms of their susceptibility to interpersonal
influences. Correlations between the friends’ responses with those of the
judges’ evaluations of their friends were offered as evidence of convergent
validity.
As evidence of discriminant validity, Lastovicka et al. (1999) demon-
strated that their measure of frugality was not redundant with a simple non-
materialistic tendency using the Richins and Dawson (1992) scale. Tian et al.
(2001, p. 58) provided evidence of the discriminant validity of their measure
of consumers’ need for uniqueness via a moderate correlation between
the CNFU scale and a measure of optimum stimulation level (Steenkamp &
Baumgartner, 1992).
ESTABLISHING NORMS
Known-Group Differences
The multigroup analyses above suggested that the item error
variances, factor variances, and factor covariances were equal across the two
groups. We now illustrate that similar conclusions can be reached using
generalizability theory. In addition, use of generalizability theory gives an
estimate of the degree to which scale items differ across groups. For the
Netemeyer et al. (1996) data, group is the differentiation factor, and scale
items and subjects are the generalization factors. Group is chosen as the
differentiation factor because we would like to determine the extent to which
the groups are different with respect to WFC and FWC. Scale items and
subjects are generalization factors because one would like to generalize the
findings across scale items and subjects. That is, we are interested in deter-
mining the extent to which the subjects and/or the scale items are different
across the groups. In estimating variance that is due to the different sources,
it is desirable to have a balanced design—that is, equal cell sizes. An unbal-
anced design (i.e., unequal cell sizes) complicates the estimation of variance.
In cases in which the cell sizes are not equal, Finn and Kayande (1997)
suggested randomly dropping subjects in the cells. The following procedure
was used to obtain equal cell sizes: (a) values for missing data were replaced
by mean values, and (b) observations from the larger group were randomly
chosen to be equal to the number of observations in the smaller group.
Table 7.2 gives the variance estimates obtained using the PROC VARCOMP
procedure in SAS.
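The two balancing steps are easy to script. The pandas sketch below is purely hypothetical (the data frame and column names are illustrative, and it reads the text's "mean values" as the within-group item mean, which is one reasonable interpretation):

# Hypothetical sketch: (a) mean-impute missing item values, then
# (b) randomly downsample each group to the smallest group's size.
import pandas as pd

def balance_groups(df, group_col="group", seed=1):
    items = [c for c in df.columns if c != group_col]
    df = df.copy()
    # (a) replace missing values with the within-group item mean
    df[items] = df.groupby(group_col)[items].transform(lambda s: s.fillna(s.mean()))
    # (b) randomly select observations so all cells have equal sizes
    n_min = df[group_col].value_counts().min()
    return df.groupby(group_col).sample(n=n_min, random_state=seed)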
From the table, it can be seen that for the WFC scale, almost 57% of the
variance is due to variation in subjects’ responses across groups. This is to be
expected, as one would expect subjects to differ in their responses across
groups. The variance estimate of interest, however, is due to the
Group × Items interaction, which is only 0.06%. High variance due to this
effect would suggest that the pattern of responses to items differs across
groups, and low variance would suggest similarity of the pattern of responses
across groups, which is the desired outcome. That is, the scale items can be
generalized across groups. Similar conclusions can be reached for the FWC
scale. The variance decomposition analysis leads us to conclude that the WFC
and FWC scales can be generalized across groups, reinforcing the conclusion
reached earlier using multigroup analysis. But what if multigroup analysis
suggests only partial or no equivalency? In such cases, generalizability analy-
sis provides further insights into the degree to which the scale cannot be gen-
eralized across groups. The following section presents the results of the
CETSCALE administered in multiple countries. (For a detailed discussion, see
Sharma and Weathers [2002].)
The data for this empirical illustration are taken from Netemeyer et al.
(1991). The 17-item CETSCALE was administered to samples of 71 subjects
in the United States, 70 subjects in France, 76 subjects in Japan, and 73 sub-
jects in Germany. In the present case, because 70 was the minimum number of
subjects from any given country (France), 70 subjects were randomly selected
from the United States, Japan, and Germany. Tables 7.3a and 7.3b summarize
the confirmatory factor analysis results. As can be seen from the χ2 difference
tests in the tables, only partial equivalence is achieved.
To assess the degree to which the scale items vary across countries,
generalizability theory was employed to estimate the variance component
analysis using the PROC VARCOMP procedure in SAS. Table 7.4 sum-
marizes these results. As can be seen from the table, the Country factor
accounts for only 7% of the total variation. Of greater interest is the
Items × Countries interaction. For the data analyzed, the variance due to
the Items × Countries interaction accounts for the lowest percentage of
total variation (5.51%), suggesting that there is consistency in response
patterns to items across countries. That is, the items measuring the con-
struct do not appear to be country-specific, suggesting that violation of
complete equivalency is not severe and that the scale can be generalized
across countries.
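Converting variance component estimates into the percentages discussed here requires only dividing each component by the total. In the Python sketch below, the component names and values are hypothetical placeholders chosen merely to mirror the reported percentages; in practice, the estimates would be taken from the PROC VARCOMP output.

# Hypothetical variance components (placeholders, not the actual estimates).
components = {
    "Country": 0.35,
    "Items": 0.40,
    "Subjects(Country)": 1.05,
    "Items x Country": 0.28,
    "Residual": 3.00,
}
total = sum(components.values())
for source, estimate in components.items():
    print(f"{source}: {estimate / total:.2%} of total variation")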
SUMMARY
This chapter focused on procedures to finalize the scale and further establish
its psychometric properties. We highlighted the following: (a) the role of
CFA in finalizing a scale and the criteria for evaluating CFA measurement
models, (b) tests of the measurement invariance hierarchy across groups, and
(c) the use of generalizability theory to assess the degree to which scale items
generalize across groups.
Appendix 7A
LISREL INPUT AND OUTPUT
FOR SMALL BUSINESS OWNERS
AND REAL ESTATE SALESPEOPLE SAMPLES
NOTE: The following input (MO) could have also been used:
mo nx=10 nk=2
PHI
KSI 1 KSI 2
-------- --------
KSI 1 1.00
KSI 2 0.33 1.00
(0.09)
3.91
THETA-DELTA
wfc4 wfc5 wfc6 wfct4 wfcs4 fwc1
-------- -------- -------- -------- -------- --------
0.92 1.60 2.01 1.32 1.17 1.24
THETA-DELTA
fwc6 fwc8 fwct4 fwcs1
-------- -------- -------- --------
0.65 1.12 0.73 1.28
(0.11) (0.16) (0.11) (0.16)
6.11 6.95 6.35 7.78
MODIFICATION INDICES FOR LAMBDA-X
KSI 1 KSI 2
-------- --------
wfc6 - - 1.29
wfct4 - - 1.56
wfcs4 - - 0.47
fwc1 0.96 - -
fwc6 0.93 - -
fwc8 1.37 - -
fwct4 0.09 - -
fwcs1 1.85 - -
COMPLETELY STANDARDIZED SOLUTION
LAMBDA-X
KSI 1 KSI 2
-------- --------
wfc4 0.87 - -
wfc5 0.77 - -
wfc6 0.70 - -
wfct4 0.79 - -
wfcs4 0.80 - -
fwc1 - - 0.65
fwc6 - - 0.77
fwc8 - - 0.70
fwct4 - - 0.75
fwcs1 - - 0.58
PHI
KSI 1 KSI 2
-------- --------
KSI 1 1.00
KSI 2 0.33 1.00
THETA-DELTA
wfc4 wfc5 wfc6 wfct4 wfcs4 fwc1
------- ------- ------- ------- ------- -------
.24 .40 .51 .37 .36 .57
THETA-DELTA
fwc6 fwc8 fwct4 fwcs1
------- ------- ------- -------
.41 .51 .43 .66
PHI
KSI 1 KSI 2
-------- --------
KSI 1 1.00
KSI 2 0.42 1.00
(0.07)
6.01
THETA-DELTA
wfc4 wfc5 wfc6 wfct4 wfcs4 fwc1
-------- -------- -------- -------- -------- --------
0.91 1.24 1.03 1.40 1.19 1.22
THETA-DELTA
fwc6 fwc8 fwct4 fwcs1
-------- -------- -------- --------
0.50 1.01 0.80 1.00
(0.08) (0.14) (0.11) (0.12)
5.99 7.43 7.51 8.33
PHI
KSI 1 KSI 2
-------- --------
KSI 1 1.00
KSI 2 0.42 1.00
THETA-DELTA
wfc4 wfc5 wfc6 wfct4 wfcs4 fwc1
------- ------- ------- ------- ------- -------
.32 .38 .35 .56 .43 .47
THETA-DELTA
fwc6 fwc8 fwct4 fwcs1
-------- -------- -------- --------
0.20 0.30 0.31 0.43
Appendix 7B
LISREL INPUT FOR MULTIGROUP ANALYSES: FULL
MEASUREMENT INVARIANCE MODEL
title 'WFC-FWC multi-group'
DA NI=43 NO=151 MA=cm ng=2
cm fu file=a:\wfc2book.dat fo=5
(8f10.6/8f10.6/8f10.6/8f10.6/8f10.6/3f10.6)
La
wfc1 wfc2 wfc3 wfc4 wfc5 wfc6 wfc7
wfct1 wfct2 wfct3 wfct4 wfct5 wfct6 wfct7 wfct8
wfcs1 wfcs2 wfcs3 wfcs4 wfcs5 wfcs6 wfcs7
fwc1 fwc2 fwc3 fwc4 fwc5 fwc6 fwc7 fwc8
fwct1 fwct2 fwct3 fwct4 fwct5 fwct6 fwct7
fwcs1 fwcs2 fwcs3 fwcs4 fwcs5 fwcs6 /
se
wfc4 wfc5 wfc6 wfct4 wfcs4
fwc1 fwc6 fwc8 fwct4 fwcs1/
mo nx=10 nk=2
fr lx(2,1) lx(3,1) lx(4,1) lx(5,1)
fr lx(7,2) lx(8,2) lx(9,2) lx(10,2)
va 1.00 lx(1,1) lx(6,2)
st .5 all
path diagram
OU sc rs mi ss tv it=800 tm=10000 AD=OFF
title 'WFC-FWC measurement model Study One'
DA NI=43 NO=181 MA=cm
cm fu file=a:\wfc3book.dat fo=5
(8f10.6/8f10.6/8f10.6/8f10.6/8f10.6/3f10.6)
La
wfc1 wfc2 wfc3 wfc4 wfc5 wfc6 wfc7
wfct1 wfct2 wfct3 wfct4 wfct5 wfct6 wfct7 wfct8
wfcs1 wfcs2 wfcs3 wfcs4 wfcs5 wfcs6 wfcs7
fwc1 fwc2 fwc3 fwc4 fwc5 fwc6 fwc7 fwc8
fwct1 fwct2 fwct3 fwct4 fwct5 fwct6 fwct7
fwcs1 fwcs2 fwcs3 fwcs4 fwcs5 fwcs6 /
se
wfc4 wfc5 wfc6 wfct4 wfcs4
fwc1 fwc6 fwc8 fwct4 fwcs1/
mo nx=10 nk=2 lx=in ph=in td=in
OU sc rs mi ss tv it=800 tm=10000 AD=OFF
EIGHT
CONCLUDING REMARKS
This text has focused on the development and validation of multi-item
self-administered measures of unobservable, latent constructs. Effective
measurement is a cornerstone of scientific research and is required in the
process of testing theoretical relationships among unobservable constructs.
The assessment of these constructs is typically indirect, accomplished via the use
of self-report, paper-and-pencil measures on which multiple reflective items or
indicators are used to operationalize constructs. Throughout the book, we have
tried to emphasize the importance of theoretical aspects of scale development.
As reiterated below, these guiding conceptual issues include construct definition,
domain specification, dimensionality, and the nomological network in which
constructs are embedded. We cannot overemphasize the importance of a
priori theory to the beginning of item development and subsequent empirical
aspects of the scale development process.
Following individual chapters regarding scale dimensionality (Chapter 2),
reliability (Chapter 3), and validity (Chapter 4), our present effort was orga-
nized around the four-step scale development sequence depicted in Figure 1.1
and covered in Chapters 5, 6, and 7. This recommended approach is generally
consistent with much of the extant scale development literature (e.g., Churchill,
1979; Clark & Watson, 1995; DeVellis, 1991; Haynes et al., 1999; Nunnally
& Bernstein, 1994; Spector, 1992). We are certainly indebted to the many
authors who have proposed appropriate scale development procedures and/or
described quality scale development endeavors. As the readers of this book
undoubtedly realize, the recommended stages, as well as the overlapping activ-
ities assumed to constitute each stage, are offered as a logical, sequential
approach to scale development.
REFERENCES
Allen, S. J., & Hubbard, R. (1986). Regression equations of the latent roots of random
data correlation matrices with unities on the diagonal. Multivariate Behavioral
Research, 21, 393-398.
Anastasi, A., & Urbina, S. (1998). Psychological testing. Englewood Cliffs, NJ:
Prentice-Hall.
Anderson, J. C., & Gerbing, D. W. (1988). Structural equation modeling in practice:
A review and recommended two-step approach. Psychological Bulletin, 103,
411-423.
Bagozzi, R. P. (1993). Assessing construct validity in personality research: Application
to measures of self-esteem. Journal of Research in Personality, 27, 49-87.
Bagozzi, R. P., & Edwards, J. R. (1998). A general approach for representing
constructs in organizational research. Organizational Research Methods, 1, 45-87.
Bagozzi, R. P., & Heatherton, T. F. (1994). A general approach to representing multi-
faceted personality constructs: Application to state self-esteem. Structural
Equation Modeling, 1(1), 35-67.
Bagozzi, R. P., & Yi, Y. (1988). On the evaluation of structural equation models.
Journal of the Academy of Marketing Science, 16(1), 74-94.
Bagozzi, R. P., Yi, Y., & Phillips, L. W. (1991). Assessing construct validity in orga-
nizational research. Administrative Science Quarterly, 36(3), 421-458.
Bearden, W. O., Hardesty, D., & Rose, R. (2001). Consumer self-confidence:
Refinements in conceptualization and measurement. Journal of Consumer
Research, 28(June), 121-134.
Bearden, W. O., & Netemeyer, R. G. (1998). Handbook of marketing scales: Multi-
item measures for marketing and consumer behavior research. Thousand Oaks,
CA: Sage.
Bearden, W. O., Netemeyer, R. G., & Teel, J. E. (1989). Measurement of consumer
susceptibility to interpersonal influence. Journal of Consumer Research,
15(March), 473-481.
Bearden, W. O., & Rose, R. L. (1990). Attention to social comparison information: An
individual difference factor affecting conformity. Journal of Consumer Research,
16(March), 461-471.
Bearden, W. O., Sharma, S., & Teel, J. E. (1982). Sample size effects on chi-square and
other statistics used in evaluating causal models. Journal of Marketing Research,
19(November), 425-430.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological
Bulletin, 107, 238-246.
Bentler, P. M., & Chou, C. (1987). Practical issues in structural modeling. Sociological
Methods & Research, 16(1), 78-117.
Bollen, K. A. (1989). Structural equations with latent variables. New York: John
Wiley & Sons.
Bollen, K. A., & Lennox, R. (1991). Conventional wisdom on measurement: A structural
equations perspective. Psychological Bulletin, 110, 305-314.
Boyle, G. J. (1991). Does item homogeneity indicate internal consistency or item
redundancy in psychometric scales? Personality and Individual Differences, 3,
291-294.
Browne, M. W. (1990). MUTMUM PC: A program for fitting the direct product
models for multitrait-multimethod data. Pretoria: University of South Africa.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In
K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136-162).
Newbury Park, CA: Sage.
Bruner, G., & Hensel, P. (1997). Marketing scales handbook: A compilation of multi-
item measures (2nd ed.). Chicago: American Marketing Association.
Byrne, B. M. (2001). Structural equation modeling with AMOS: Basic concepts, appli-
cations, and programming. Mahwah, NJ: Erlbaum Associates.
Byrne, B., Shavelson, R. J., & Muthen, B. (1989). Testing the equivalence of factor
covariance and mean structures: The issue of partial measurement invariance.
Psychological Bulletin, 105, 456-466.
Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality
and Social Psychology, 42, 116-131.
Calder, B. J., Phillips, L. W., & Tybout, A. M. (1982). The concept of external
validity. Journal of Consumer Research, 9(December), 240-244.
Campbell, D. T. (1960). Recommendations for APA test standards regarding construct,
trait, or discriminant validity. American Psychologist, 15, 546-553.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validity by the
multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Carver, C. S. (1989). How should multi-faceted personality constructs be tested? Issues
illustrated by self-monitoring, attributional style, and hardiness. Journal of
Personality and Social Psychology, 56(4), 577-585.
Cattell, R. B. (1966). The meaning and strategic use of factor analysis. In R. B. Cattell
(Ed.), Handbook of multivariate experimental psychology (pp. 174-243). Chicago:
Rand McNally.
Churchill, G. A. (1979). A paradigm for developing better measures of marketing
constructs. Journal of Marketing Research, 16(February), 64-73.
Churchill, G. A., & Iacobucci, D. (2002). Marketing research methodological foundations
(8th ed.). Fort Worth, TX: Harcourt College Publishers.
Churchill, G. A., & Peter, J. P. (1984). Research design effects on the reliability of
rating scales: A meta-analysis. Journal of Marketing Research, 21(November),
360-375.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in scale
development. Psychological Assessment, 7(3), 309-319.
Cliff, N. (1988). The eigenvalues-greater-than-one rule and the reliability of components.
Psychological Bulletin, 103(2), 276-279.
Comrey, A. L. (1988). Factor-analytic methods of scale development in personality and
clinical psychology. Journal of Consulting and Clinical Psychology, 56, 754-761.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis
issues for field settings. Boston: Houghton Mifflin.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and appli-
cation. Journal of Applied Psychology, 78, 98-104.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory.
Orlando, FL: Holt, Rinehart, & Winston.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.
Psychometrika, 16, 297-334.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests.
Psychological Bulletin, 52, 281-302.
Crowne, D. P., & Marlowe, D. (1960). A new scale for social desirability independent
of psychopathology. Journal of Consulting Psychology, 24(4), 349-354.
DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park,
CA: Sage.
Diamantopoulos, A., & Winklhofer, H. M. (2001). Index construction with formative
indicators: An alternative to scale development. Journal of Marketing Research,
38(May), 269-277.
Fan, X., Thompson, B., & Wang, L. (1999). Effects of sample size, estimation methods,
and model specification on structural equation modeling fit indices. Structural
Equation Modeling, 6(1), 56-83.
Finn, A., & Kayande, U. (1997). Reliability assessment and optimization of marketing
measurement. Journal of Marketing Research, 34(May), 262-275.
Fisher, R. J. (1993). Social desirability bias and the validity of indirect questioning.
Journal of Consumer Research, 20(September), 303-315.
Floyd, F. J., & Widaman, K. (1995). Factor analysis in the development and refinement
of clinical assessment instruments. Psychological Assessment, 7(3), 286-299.
Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unob-
servable variables and measurement error. Journal of Marketing Research,
18(February), 39-50.
Friestad, M., & Wright, P. (2001). Pre-adult education on marketplace persuasion
tactics: Integrating marketplace literacy and media literacy. Unpublished faculty
working paper, University of Oregon, Eugene.
Ganster, D. C., Hennessey, H. W., & Luthans, F. (1983). Social desirability response
effects: Three alternative models. Academy of Management Journal, 26(June),
321-331.
Gardner, H. (1993). Frames of mind: The theory of multiple intelligences. New York:
Basic Books.
Gerbing, D. W., & Anderson, J. C. (1984). On the meaning of within-factor correlated
measurement errors. Journal of Consumer Research, 11(June), 572-580.
Gerbing, D. W., & Anderson, J. C. (1988). An updated paradigm for scale development
incorporating unidimensionality and its assessment. Journal of Marketing
Research, 25(May), 186-192.
Green, D. P., Goldman, S. L., & Salovey, P. (1993). Measurement error masks bipolarity
in affect ratings. Journal of Personality and Social Psychology, 64, 1029-1041.
Hair, J. F., Anderson, R. E., Tatham, R. L., & Black, W. C. (1998). Multivariate data
analysis (5th ed.). Englewood Cliffs, NJ: Prentice Hall.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item
response theory. Newbury Park, CA: Sage.
Hartley, H. O., Rao, J. N. K., & LaMotte, L. (1978). A simple synthesis-based method
of variance component estimation. Biometrics, 34, 233-242.
Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items.
Applied Psychological Measurement, 9(June), 139-164.
Hayduk, L. A. (1996). LISREL: Issues, debates, and strategies. Baltimore, MD: Johns
Hopkins University Press.
Haynes, S., Nelson, N. K., & Blaine, D. (1999). Psychometric issues in assessment
research. In P. C. Kendall, J. N. Butcher, & G. Holmbeck (Eds.), Handbook of
research methods in clinical psychology (pp. 125-154). New York: John Wiley &
Sons.
Haynes, S., Richard, D. C., & Kubany, E. S. (1995). Content validity in psychological
assessment: A functional approach to concepts and methods. Psychological
Assessment, 7, 238-247.
Herche, J., & Engelland, B. (1996). Reversed polarity items and scale dimensionality.
Journal of the Academy of Marketing Science, 24(4), 366-374.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis.
Psychometrika, 30, 179-186.
Hoyle, R. (1995). Structural equation modeling: Issues and applications. Newbury
Park, CA: Sage.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indices in covariance structure
analysis: Conventional criteria versus new alternatives. Structural Equation
Modeling, 6(1), 1-55.
Hull, J. G., Lehn, D. A., & Tedlie, J. (1991). A general approach to testing multifaceted
personality constructs. Journal of Personality and Social Psychology, 61(6),
932-945.
Iacobucci, D., Ostrom, A., & Grayson, K. (1995). Distinguishing service quality and
customer satisfaction: The voice of the consumer. Journal of Consumer
Psychology, 4(3), 277-303.
Jarvis, W. B. G., & Petty, R. E. (1996). The need to evaluate. Journal of Personality
and Social Psychology, 70(1), 172-194.
Jöreskog, K., & Sörbom, D. (1989). LISREL7: A guide to the program and applications
(2nd ed.). Chicago: SPSS.
Kaplan, R. M., & Saccuzzo, D. P. (1997). Psychological testing: Principles, applica-
tions, and issues (4th ed.). Pacific Grove, CA: Brooks/Cole.
Kelley, J. R., & McGrath, J. E. (1988). On time and method. Beverly Hills, CA: Sage.
Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix
by confirmatory factor analysis. Psychological Bulletin, 112, 165-172.
Kishton, J. M., & Widaman, K. F. (1994). Unidimensional versus domain representa-
tive parceling of questionnaire items: An empirical example. Educational and
Psychological Measurement, 54, 565-575.
Kumar, A., & Dillon, W. R. (1987). Some further remarks on measurement-structure
interaction and the unidimensionality of constructs. Journal of Marketing
Research, 24(November), 438-444.
Lastovicka, J. L., Bettencourt, L. A., Hughner, R. S., & Kuntze, R. J. (1999). Lifestyle
of the tight and frugal: Theory and measurement. Journal of Consumer Research,
26(June), 85-98.
Lastovicka, J. L., Murry, J. P., Jr., & Joachimsthaler, E. (1990). Evaluating measure-
ment validity of lifestyle typologies with qualitative measures and multiplicative
factoring. Journal of Marketing Research, 27(February), 11-23.
Lennox, R. D., & Wolfe, R. N. (1984). Revision of the Self-Monitoring Scale. Journal
of Personality and Social Psychology, 46(6), 1349-1364.
Levine, R. A., & Campbell, D. T. (1972). Ethnocentrism: Theories of conflict, ethnic
attitudes, and group behavior. New York: John Wiley & Sons.
Lichtenstein, D. R., Ridgway, N. M., & Netemeyer, R. G. (1993). Price perceptions and
consumer shopping behavior: A field study. Journal of Marketing Research,
30(May), 234-245.
Loevinger, J. (1957). Objective tests as instruments of psychological theory.
Psychological Reports, 3, 635-694.
MacCallum, R. C., & Browne, M. W. (1993). The use of causal indicators in covari-
ance structure models: Some practical issues. Psychological Bulletin, 114,
533-541.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in
covariance structure analysis: The problem of capitalization on chance.
Psychological Bulletin, 111, 490-504.
Marcoulides, G. A. (1998). Applied generalizability theory models. In G. A. Marcoulides
(Ed.), Modern methods for business research (pp. 1-28). Mahwah, NJ: Lawrence
Erlbaum.
Marsh, H. (1995). Confirmatory factor analysis models of factorial invariance: A multi-
faceted approach. Structural Equation Modeling, 1, 5-34.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of
Mathematical and Statistical Psychology, 34, 100-117.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence
Erlbaum Associates.
AUTHOR INDEX
Algina, J., 1, 41, 45, 53 Cattell, R. B., 30
Allen, S. J., 30, 33 Chou, C., 13
Anastasi, A., 1, 12 Churchill, G. A., Jr., xiii, 10, 11, 13, 14, 77,
Anderson, J. C., 9, 10, 13, 26, 55, 94, 120, 78, 79, 89, 165, 184
151, 152, 154, 155 Clark, L. A., xiii, 5, 6, 7, 8, 10, 11, 14, 27,
Anderson, R. E., 121, 122, 123, 124, 55, 57, 59, 74, 91, 96, 98, 99, 100,
147, 148, 149, 151, 153, 155 101, 102, 103, 106, 116, 118, 123,
125, 126, 145, 146, 153, 184
Bagozzi, R. P., 13, 79, 94, 147, 149, Cliff, N., 28, 123
153, 163 Comrey, A. L., 99, 100, 123, 147
Baumgartner, H., 155, 156, 163 Cook, T. D., 77
Bearden, W. O., 6, 7, 37, 73, 74, 75, 76, 79, Cortina, J. M., 9, 10, 11, 27, 50, 52, 55, 57,
80, 81, 82, 83, 88, 93, 96, 97, 109, 58, 94, 126
118, 120, 121, 126, 144, 145, 146, Crocker, L., 1, 41, 45, 53
149, 153, 157, 161, 163, 164, 165, 166 Cronbach, L. J., 7, 8, 11, 48,
Bentler, P. M., 13, 151, 152 82, 90, 104
Bernstein, I. H., xiii, 2, 3, 5, 6, 11, 12, 14, Crowne, D. P., 85
45, 50, 57, 72, 73, 78, 82, 89, 92, 93, Cudeck, R., 152
95, 96, 97, 100, 102, 184
Bettencourt, L. A., 80, 82, 88, 92, 93, 97, Dawson, S., 56, 163, 164
110, 118, 144, 146, 157, 163, 164, 165 DeVellis, R. F., xiii, 2, 5, 6, 11, 12,
Black, W. C., 121, 122, 123, 124, 147, 148, 14, 45, 50, 57, 80, 98, 99, 100,
149, 151, 153, 155 102, 116, 184
Blaine, D., xiii, 1, 2, 6, 10, 11, 13, 14, 45, Diamantopoulos, A., 6, 93
71, 72, 73, 89, 91, 97, 102, 103, 184 Dillon, W. R., 10
Boles, J. S., 88, 96, 106, 107, 111, 113, 121, Durvasula, S., 59, 63, 168, 169, 170
128, 144, 145, 157, 158, 166, 167,
181, 183 Edwards, J. R., 147
Bollen, K. A., 6, 13, 82, 93, 147, 155, 156 Engelland, B., 99
Boyle, G. J., 57
Browne, M. W., 6, 80, 93, 152 Fan, X., 149
Bruner, G., 6 Feick, L., 77
Byrne, B. M., 13, 147 Finn, A., 167
Fisher, R. J., 84
Cacioppo, J. T., 99 Fiske, D. W., 13, 77, 162
Calder, B. J., 71 Floyd, F. J., 10, 26, 98, 121, 122, 123, 124,
Campbell, D. T., 13, 77, 104, 162 147, 148, 149, 150
Carver, C. S., 56 Fornell, C., 13, 152, 153, 154
SUBJECT INDEX
Adjusted goodness-of-fit (AGFI) generalizability theory, application of,
indices, 151 166-168, 167 (table), 169-170
Akaike information criteria (AIC), 152 (tables)
Alternative-form reliability, 45-46, inter-item correlation and, 148-149
46 (table) literature on, 157-158
Attention-to-social-comparison scale measurement invariance testing
(ATSCI), 163-164 and, 155-157
Attenuation paradox, 98 model evaluation criteria, 150-157
Attributes, 2-3 model evaluation example, 158-161, 162
Average variance extracted estimate (table), 171-183
(AVE), 153-154 norms, establishment of, 164-166
parameter estimates/related diagnostics,
Balanced Inventory of Desirable 152-154
Responding (BIDR), 85 predictive validity and, 163-164
sample size and, 149
Chi-square index, 37, 151, 152 scale development and, 147-148
Coefficient alpha, 49-53 standardized residuals and, 154-155
dimensionality and, 54-56 See also Exploratory factor
example of, 53-54, 54 (table) analysis (EFA)
scale length, inter-item correlation/item Construct definition, 88-89
redundancy and, 57-59 clarity of, 89-90
Comparative fit index (CFI), 152 dimensionality, theory and, 93-94
Composite reliability, 153 effect versus formative indicators, 92-93
Computer technology, 6, 36 expert/individual judging and, 91-92
Concurrent validity, 76-77, 164 literature review and, 91
Confirmatory factor analysis (CFA), 10, nomological networks and, 90, 91
26, 36-37, 37 (table), 38-39 (exhibit), (figure)
98, 120 theoretical foundation of, 90
concurrent validity and, 164 validity judgment guidelines, 102-103
convergence, parameter estimates See also Consumer Ethnocentrism Scale
and, 150-151 (CETSCALE); Consumer price
convergent validity and, 162-163 consciousness scale; Measurement
correlated measurement errors items; Work-family conflict
and, 149-150 (WFC) scale
discriminant validity and, 163 Construct-irrelevant variance, 89-90
factor indicators, number of, 147 Constructs, 4-6
fit indices and, 151-152 literature reviews and, 8-9
Richard G. Netemeyer is a Professor of Commerce at the McIntire School of
Commerce at the University of Virginia. He received his PhD in business
administration with a specialization in marketing from the University of South
Carolina in 1986. He was a member of the marketing faculty at Louisiana
State University for 15 years before joining the McIntire faculty in the fall of
2001. He currently teaches quantitative analysis and marketing research at
McIntire and conducts research on consumer and organizational behavior
topics with a focus on measurement and survey-based techniques. His research
has appeared in the Journal of Consumer Research, Journal of Marketing
Research, Journal of Marketing, Journal of Applied Psychology, Organiza-
tional Behavior & Human Decision Processes, and other publications. He is a
coauthor of two books pertaining to measurement and is a member of the edi-
torial review boards of the Journal of Consumer Research, Journal of the
Academy of Marketing Science, and Journal of Public Policy and Marketing.
Subhash Sharma received his PhD from the University of Texas at Austin in
1978. He is a Professor of Marketing and Charles W. Coker Sr. Distinguished
Foundation Fellow in the Moore School of Business at the University of South
Carolina. His research interests include research methods, structural equation
modeling, pricing, customer relationship management, and e-commerce. He
has published numerous articles in leading academic journals such as the
Journal of Marketing, Journal of Marketing Research, Marketing Science,
Management Science, and Journal of Retailing. He has also authored Applied
Multivariate Techniques (1996). He reviews for a number of journals and is on
the editorial review boards of the Journal of Marketing, Journal of Marketing
Research, and Journal of Retailing.